Avoiding ‘fake news’ during AI-driven drug discovery


The initial phases of drug discovery and development and preclinical testing involve disease-associated target identification, screening of small-molecule libraries for hit selection, de novo drug design and drug–target interaction (DTI) prediction. In these phases, a deep learning AI strategy, Natural Language Processing (NLP), is key to deriving useful semantic associations and relationships between protein targets, drug molecules, genes and signalling pathways. NLP is an automated text-mining technology in which large volumes of text data are transformed into actionable structured data that can be rapidly analysed for new, meaningful findings. In simple terms, the computer “reads” text, simulating the human ability to understand concepts, context and relationships. It aims at consistent, unbiased processing to establish patterns and findings that can be reproduced to guide decision-making in drug R&D.
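As a rough illustration of this text-to-structure step (not Keystonemab's actual pipeline), the sketch below uses spaCy's PhraseMatcher with a small, hypothetical dictionary of drug and target names to pull sentence-level co-occurrences out of free text and emit them as structured triples.

```python
# Minimal sketch: dictionary-based entity matching and sentence-level
# co-occurrence extraction with spaCy. Term lists and example text are
# hypothetical placeholders, not real curated vocabularies.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")          # only sentence splitting is needed here

drug_terms = ["imatinib", "gefitinib"]          # assumed drug dictionary
target_terms = ["BCR-ABL", "EGFR"]              # assumed target dictionary

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("DRUG", [nlp.make_doc(t) for t in drug_terms])
matcher.add("TARGET", [nlp.make_doc(t) for t in target_terms])

text = ("Imatinib inhibits BCR-ABL in chronic myeloid leukaemia. "
        "Gefitinib was reported to act on EGFR in lung cancer.")
doc = nlp(text)

# Group matches by sentence and emit (drug, target, sentence) triples.
for sent in doc.sents:
    sent_doc = sent.as_doc()
    hits = [(nlp.vocab.strings[mid], sent_doc[s:e].text)
            for mid, s, e in matcher(sent_doc)]
    drugs = [t for lab, t in hits if lab == "DRUG"]
    targets = [t for lab, t in hits if lab == "TARGET"]
    for d in drugs:
        for t in targets:
            print((d, t, sent.text.strip()))
```

Triples like these are what downstream steps would then verify, score and assemble into a network.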

AI-driven drug discovery comes with a few considerations: fake news and false predictions, transparency, justification, bias and discrimination. The semantic relationships and hidden patterns between drugs, targets, genes, signalling and expression pathways, together with newly predicted targets, emerge as a network of interconnected biological entities and relationships. These can be regarded as ‘true/actual associations’ when they are correctly justified by the words connecting them. Such findings and predictions can also be false, and therefore need to be verified.

‘False positives’ occur when an association is determined to be true while it is actually false, e.g. non-interacting pairs are wrongly predicted as positive. This is a ‘false alarm’ and should be avoided. When such associations are incorrect, cannot be reproduced and result from misinterpretation, they are termed ‘co-occurrence’. ‘False negatives’ occur when something is determined to be false while it is actually present or true: an association or prediction was missed and remained undetected by the system, e.g. actual DTI pairs remain unpredicted. These kinds of fake news strongly influence how confident scientists are in their decisions. Another question that follows AI-assisted predictions is ‘Can the findings be reproduced in different datasets?’. All of these shortcomings can be circumvented by optimal training of AI algorithms, but determining the “optimal” level of training is crucial, and well-validated AI infrastructure such as IBM Watson helps to pinpoint that limit.
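One common way to make these error types concrete is a confusion matrix over predicted versus known interaction pairs. The hedged sketch below, using made-up labels rather than real DTI data, shows how false positives and false negatives translate into precision and recall.

```python
# Sketch: counting false alarms and missed DTIs on toy labels.
# 1 = interacting drug-target pair, 0 = non-interacting pair.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # "ground truth" from validated assays (made up)
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]   # hypothetical model output

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false positives (false alarms): {fp}")
print(f"false negatives (missed DTIs):  {fn}")

# Precision: of the predicted interactions, how many are real?
# Recall:    of the real interactions, how many did the model find?
print(f"precision = {precision_score(y_true, y_pred):.2f}")
print(f"recall    = {recall_score(y_true, y_pred):.2f}")
```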

Confidence in, and a guarantee of, the best scientific learnings from AI are achieved through the use of proven technology and optimal supervised training of the system; for this reason we use IBM infrastructure for our platform. Features such as the ‘Knowledge Graph’ and ‘Labelled Property Graph’ provide a holistic summary of all the entities and semantic relationships picked up by the AI across all documents, which allows for easy querying and analysis. The ‘confidence score’, or connection strength, predicted by the system is a good indicator of how reliable an identified entity or relationship is. This connection strength can be leveraged to filter out fake connections and determine the most confident relationships, which can then be used to test our hypotheses.
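To illustrate the idea of pruning a graph by connection strength (a generic sketch, not IBM's actual API), the snippet below builds a small labelled property graph with networkx and keeps only the edges whose hypothetical confidence score clears a threshold.

```python
# Sketch: filtering a small property graph of drug/target relationships
# by a per-edge confidence score. Entities, scores and the 0.8 threshold
# are illustrative assumptions, not output from any real system.
import networkx as nx

g = nx.Graph()
g.add_edge("drug:imatinib", "target:BCR-ABL", relation="inhibits", confidence=0.95)
g.add_edge("drug:imatinib", "gene:TP53", relation="co-occurs", confidence=0.35)
g.add_edge("drug:gefitinib", "target:EGFR", relation="inhibits", confidence=0.88)

THRESHOLD = 0.8   # assumed cut-off for "high-confidence" relationships

strong = [(u, v, d) for u, v, d in g.edges(data=True) if d["confidence"] >= THRESHOLD]
for u, v, d in strong:
    print(f"{u} -[{d['relation']}, {d['confidence']:.2f}]-> {v}")
```

Low-confidence edges, such as incidental co-occurrences, drop out of the view used for hypothesis testing, while the strongest relationships remain queryable.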

Optimal supervised training of the AI is a key strategy followed at Keystonemab to achieve the best results. During supervised training, the data are first split into training and test sets. The model is initially trained on carefully selected training data, and its performance is then evaluated on the held-out test data. The larger the training set, the more the model learns from the data. Keystonemab ensures the use of large, high-quality, unbiased multi-omics and clinical data to train our AI: if the data are biased or otherwise flawed, this will be reflected in the model's performance, and false predictions of targets and relationships are likely to be counterproductive. At Keystonemab, data from repositories and scientific publications are used alongside artificially constructed sentences, from which unnecessary background noise and sources of doubt have been eliminated, to ensure that the training data are neither biased nor flawed.
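A minimal sketch of the train/test split and evaluation step described above, using scikit-learn on synthetic placeholder data (the features, labels and classifier choice are assumptions for illustration, not Keystonemab's actual pipeline).

```python
# Minimal train/test split and evaluation sketch on synthetic data.
# Features, labels and the classifier are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 20% of the data so performance is measured on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f"test accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```

Evaluating on the held-out test set is what reveals whether the model generalises or has merely memorised a biased or flawed training set.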