Journal of Cheminformatics

Novel molecule design with POWGAN, a policy-optimized Wasserstein generative adversarial network
Macedo B, Vaz IR and Gomes TT
Generative artificial intelligence has the potential to open vast new chemical search spaces, yet existing reinforcement-guided generative adversarial networks (GANs) struggle to produce non-fragmented, property-oriented molecules at scale without compromising other properties. To overcome these limitations, we present the Policy-Optimised Wasserstein GAN (POWGAN), a graph-based generator that incorporates a dynamically scaled reward into adversarial training. The scaling factor increases when progress stalls, keeping gradients informative while steadily steering the generator towards user-defined objectives. When POWGAN replaces the loss function in the previous MedGAN architecture, using graph connectivity (non-fragmentation) as the target property, the model attains a fully connected fraction of 1.00 for quinoline-like molecules, compared to the previous 0.62, while maintaining novelty (0.93) and uniqueness (0.95). The resulting model, R-MedGAN, produces > 12,000 novel quinoline-like molecules, a significant increase over its predecessor under identical experimental conditions. Chemical space visualizations demonstrate that these molecules populate regions not present in the training dataset or in MedGAN's output, confirming genuine scaffold innovation. Beyond delivering an architecture capable of orienting the generative process towards a reward, our study also showed that this strategy can make progress on drug-likeness properties: the proportion of molecules with a Synthetic Accessibility Score (SAS, computed with the Ertl algorithm) between 1 and 6 increased from 8% to 65%, and the proportion with lipophilicity (LogP) between 1.35 and 1.80 increased from 17% to 45%, compared to baseline. Our study also shows that the R-MedGAN architecture, incorporating the POWGAN loss, generalizes to models trained on molecular scaffolds other than the quinoline originally tested in MedGAN (R-MedGAN-QNL). For indole (R-MedGAN-IND) and imidazole (R-MedGAN-IMZ) datasets, connectivity increased from 0.38 and 0.50, respectively, up to 1.00 during training.
This study provides evidence that an adaptive reward-scaling policy in a Wasserstein GAN can simultaneously guide generative training towards a reward by enhancing molecular connectivity, expand generative throughput, preserve diversity, and improve drug-likeness properties. By alleviating the trade-off between property optimisation and sample diversity, POWGAN and its R-MedGAN implementation advance the state of the art in molecule-generating GANs and provide a robust, scalable platform for high-throughput, goal-directed chemical exploration in early-stage drug discovery. These findings underscore the effectiveness of adaptive, reinforcement-driven strategies in reward-oriented generative adversarial networks for molecular discovery. Scientific contribution: In this work we introduce POWGAN, a policy-optimized Wasserstein GAN that uses adaptive reward scaling to improve goal-directed molecule generation. Integrated into MedGAN (R-MedGAN), it increases the number of valid, connected, and novel molecules under identical settings while maintaining diversity and drug-likeness. This demonstrates that adaptive reward strategies can jointly enhance molecular topology and property optimization at scale.
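The adaptive reward-scaling idea in the abstract above can be pictured with a short sketch. Everything here, including the function names, the stall test, and the growth factor, is an illustrative assumption, not the authors' implementation:

```python
# Toy sketch of an adaptive reward-scaling policy: the generator loss mixes an
# adversarial (Wasserstein-style) term with a property reward, and the reward
# weight grows whenever the monitored property metric stops improving.
# All names and the update rule are assumptions for illustration only.

def update_reward_scale(scale, history, patience=3, growth=1.5, max_scale=10.0):
    """Increase the reward weight when the property metric has stalled.

    history: per-epoch values of the target property (e.g. connectivity).
    """
    if len(history) > patience and max(history[-patience:]) <= max(history[:-patience]):
        scale = min(scale * growth, max_scale)  # progress stalled: push harder
    return scale

def generator_loss(adv_loss, reward, scale):
    """Adversarial term plus a scaled property reward (minimising this
    maximises the reward)."""
    return adv_loss - scale * reward
```

In this toy form, a stalled connectivity metric triggers a larger reward weight, so the property term regains influence over the gradients, which mirrors the abstract's description of keeping gradients informative.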
Ionization efficiency prediction of electrospray ionization mass spectrometry analytes based on molecular fingerprints and cumulative neutral losses
Nikolopoulos A, van Herwerden D, Turkina V, Kruve A, Baerenfaenger M and Samanipour S
Quantification is a challenge for non-targeted analysis (NTA) with liquid chromatography-high resolution mass spectrometry (LC-HRMS), due to the lack of analytical standards. Quantification via structure-based predicted ionization efficiency (IE) has been shown to provide the highest accuracy in estimating concentration. However, achieving confident analyte identification is a challenging task, as multiple candidate structures may be likely. This uncertainty in identification limits the reliability of structure-based IE prediction models, since quantification can be severely compromised in cases of wrongly (tentatively) identified chemicals or a lack of candidate structures. Here we investigate the possibility of using cumulative neutral losses from fragmentation spectra (i.e. MS2) to predict the logIE. The first model was based on molecular fingerprints and was applied to structurally identified analytes. PubChem fingerprints performed best, with a root-mean-square error (RMSE) of 0.72 logIE for the test set. The second model was based on the MS2 spectrum, expressed as cumulative neutral losses. This approach is applicable to analytes with unknown structures and showed promising results, with an RMSE of 0.79 logIE for the test set and 0.62 logIE for chromatographic features extracted from LC-HRMS data of tea extracts spiked with pesticides. The prediction models were compiled in a Julia package, which is publicly available on GitHub, and may be used as part of a quantification workflow to estimate concentrations of identified and unidentified compounds in NTA. Scientific contribution: This study expands the possibilities of standard-free quantification for HRMS. It aims to provide reliable IE prediction for known substances via robust fingerprint calculation and, more importantly, IE prediction for unknown substances using their MS2 fragmentation patterns. These workflows employ minimal method-specific variables, highlighting the tool's generalizability.
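The neutral-loss representation used above can be sketched in a few lines: a fragment's neutral loss is the precursor m/z minus the fragment m/z, and intensities are accumulated into fixed-width loss bins. The bin width, loss range, and function names are illustrative assumptions, not the paper's implementation (which is in Julia):

```python
# Hedged sketch of turning an MS2 spectrum into a binned neutral-loss vector.
# Assumptions: unit-width bins, losses capped at max_loss, intensity summed
# per bin. This is an illustration of the concept, not the published code.

def neutral_loss_vector(precursor_mz, fragments, bin_width=1.0, max_loss=200.0):
    """fragments: list of (mz, intensity) pairs. Returns a fixed-length vector."""
    n_bins = int(max_loss / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in fragments:
        loss = precursor_mz - mz  # mass lost relative to the precursor
        if 0.0 <= loss < max_loss:
            vec[int(loss / bin_width)] += intensity  # accumulate per bin
    return vec
```

A fixed-length vector like this can feed a regression model even when the analyte's structure, and hence its fingerprint, is unknown.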
All-atom protein sequence design using discrete diffusion models
Villegas-Morcillo A, Admiraal GJ, Reinders MJT and Weber JM
Advancing protein design is crucial for breakthroughs in medicine and biotechnology. Traditional approaches for protein sequence representation often rely solely on the 20 canonical amino acids, limiting the representation of non-canonical amino acids and residues that undergo post-translational modifications. This work explores discrete diffusion models for generating novel protein sequences using the all-atom chemical representation SELFIES. By encoding the atomic composition of each amino acid in the protein, this approach expands the design possibilities beyond standard sequence representations. Using a modified ByteNet architecture within the discrete diffusion D3PM framework, we evaluate the impact of this all-atom representation on protein quality, diversity, and novelty, compared to conventional amino acid-based models. To this end, we develop a comprehensive assessment pipeline to determine whether generated SELFIES sequences translate into valid proteins containing both canonical and non-canonical amino acids. Additionally, we examine the influence of two noise schedules within the diffusion process, uniform (random replacement of tokens) and absorbing (progressive masking), on generation performance. While models trained on the all-atom representation struggle to consistently generate fully valid proteins, the successfully generated proteins show improved novelty and diversity compared to their amino acid-based model counterparts. Furthermore, the all-atom representation achieves structural foldability results comparable to those of amino acid-based models. Lastly, our results highlight the absorbing noise schedule as the most effective for both representations. Data and code are available at https://github.com/Intelligent-molecular-systems/All-Atom-Protein-Sequence-Generation.
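The two corruption schemes compared in that abstract can be illustrated with a toy forward step. This is a simplification of D3PM's transition matrices, and the function names, mask token, and per-token corruption probability are assumptions for illustration:

```python
# Toy forward-corruption step for discrete diffusion. "uniform" replaces a
# corrupted token with a random vocabulary entry; "absorbing" replaces it
# with a single [MASK] token. Not the authors' code; an illustration only.
import random

def corrupt(tokens, t, vocab, mode="absorbing", mask="[MASK]", rng=None):
    """Corrupt each token independently with probability t (0 <= t <= 1)."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        if rng.random() < t:
            out.append(mask if mode == "absorbing" else rng.choice(vocab))
        else:
            out.append(tok)
    return out
```

Under the absorbing schedule the fully corrupted sequence collapses to all-mask, which is one intuition for why it can be easier to denoise than uniformly randomized tokens.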
DeepRNA-DTI: a deep learning approach for RNA-compound interaction prediction with binding site interpretability
Bae H and Nam H
RNA-targeted therapeutics represent a promising frontier for expanding the druggable genome beyond conventional protein targets. However, computational prediction of RNA-compound interactions remains challenging due to limited experimental data and the inherent complexity of RNA structures. Here, we present DeepRNA-DTI, a novel sequence-based deep learning approach for RNA-compound interaction prediction with binding site interpretability. Our model leverages transfer learning from pretrained embeddings (RNA-FM for RNA sequences and Mole-BERT for compounds) and employs a multitask learning framework that simultaneously predicts both the presence of interactions and the nucleotide-level binding sites. This dual prediction strategy provides mechanistic insights into RNA-compound recognition patterns. Trained on a comprehensive dataset integrating resources from the Protein Data Bank and literature sources, DeepRNA-DTI demonstrates superior performance compared to existing methods. The model shows consistent effectiveness across diverse RNA subtypes, highlighting its robust generalization capabilities. Application to high-throughput virtual screening of over 48 million compounds against oncogenic pre-miR-21 successfully identified known binders and novel chemical scaffolds with RNA-specific physicochemical properties. By combining sequence-based predictions with binding site interpretability, DeepRNA-DTI advances our ability to identify promising RNA-targeting compounds and offers new opportunities for RNA-directed drug discovery. The codes and data are publicly available at https://github.com/GIST-CSBL/DeepRNA-DTI/.
The consolidation of open-source computer-assisted chemical synthesis data into a comprehensive database
Hasic H and Ishida T
Over the past decade, computer-assisted chemical synthesis has resurfaced as a prominent research subject. Even though the idea of utilizing computers to assist chemical synthesis has existed for nearly as long as computers themselves, the inherent complexity repeatedly exceeded the available resources. However, recent machine learning approaches have exhibited the potential to break this tendency. The performance of such approaches is heavily dependent on data that suffers from limited quantity, quality, visibility, and accessibility, posing significant challenges to potential scientific breakthroughs. This research addresses these issues by consolidating all relevant open-source computer-assisted chemical synthesis data into a comprehensive database, providing a practical overview of the state of the data in the process. The computer-assisted chemical synthesis, or CaCS, database is designed to be a central repository for storing and analyzing data, with the primary objective being easy integration and utilization within existing research projects. It provides users with a programmatic interface to retrieve the data required for various tasks, such as predicting the outcomes of chemical synthesis, retrosynthetic analysis (retrosynthesis), estimating the synthesizability of chemical compounds, and planning and optimizing chemical synthesis routes. The database archives the original data to ensure reusability and traceability in downstream tasks and stores the processed data in a more efficient manner. The advantages and disadvantages are highlighted through a realistic case study of how such a database would be utilized within a computer-assisted chemical synthesis research project today.
The code and documentation relevant to the CaCS database are available on GitHub under the MIT license at https://github.com/neo-chem-synth-wave/ncsw-data. Scientific contribution: The primary scientific contribution of this research is the consolidation of all relevant open-source computer-assisted chemical synthesis data into a comprehensive database. The database archives the original data to ensure reusability and traceability in downstream tasks, efficiently stores the processed data, and provides users with a programmatic interface to manage and query the stored data. Rather than improving existing data or introducing new data, such a database provides a systematic overview of the existing open data sources and an easily reproducible environment for transparent processing and benchmarking purposes.
NOCTIS: open-source toolkit that turns reaction data into actionable graph networks
Lopanitsyna N, Pasquini M and Stenta M
Chemical reactions form densely connected networks, and exploring these networks is essential for designing efficient and sustainable synthetic routes. As reaction data from literature, patents, and high-throughput experimentation continue to grow, so does the need for tools that can navigate and mine these large-scale datasets. Graph-based representations capture the topology of reaction space, yet few open-source tools exist for building and querying such networks. To address this, we developed NOCTIS, an open-source toolkit for constructing and analyzing reaction data as graphs.
Multi-MoleScale: a multi-scale approach for molecular property prediction with graph contrastive and sequence learning
Lou X, Cai J and Siu SWI
In recent years, machine learning models have shown substantial progress in predicting molecular properties. However, integrating molecular graph structures with sequence information continues to present a significant challenge. In this paper, we introduce Multi-MoleScale, a novel multi-scale framework designed to address this challenge. By combining Graph Contrastive Learning (GCL) with sequence-based models like BERT, Multi-MoleScale enhances the prediction of molecular properties by capturing both structural and contextual representations of molecules. Specifically, the model leverages GCL to effectively capture the intrinsic graph-based features of molecules while utilizing BERT's pretraining capabilities to learn the contextual relationships within molecular sequences. The contrastive learning component enables Multi-MoleScale to distinguish between relevant and irrelevant molecular features, thereby enhancing its predictive accuracy across diverse molecular types. To assess the performance of our method, we conducted experiments on several widely used public datasets, including 12 molecular property datasets, the ADMET dataset, and 14 breast cancer cell line datasets. The results show that Multi-MoleScale consistently outperforms existing deep learning and self-supervised learning approaches. Notably, the model does not require handcrafted features, making it highly adaptable and versatile for a variety of molecular discovery tasks. This makes Multi-MoleScale a promising tool for applications in drug discovery, materials science, and other molecular research fields. Our data and code are available at https://github.com/pdssunny/Multi-MoleScale.
HighFold-MeD: a Rosetta distillation model to accelerate structure prediction of cyclic peptides with backbone N-methylation and D-amino acids
Cao Z, Cao S, Wang L, Wang Z, Mao Q, Guo J and Duan H
Cyclic peptides with backbone N-methylated amino acids (BNMeAAs) and D-amino acids (D-AAs) have gained increasing attention for their stability, membrane permeability, and other therapeutic potential. Currently, Rosetta simple_cycpep_predict (SCP) can predict their structures through energy-based calculations, but this approach is computationally intensive and time-consuming. Moreover, the available crystal structures of such cyclic peptides remain highly limited, hindering the development of data-driven structure prediction models. To address these challenges, we propose HighFold-MeD, a deep learning-based framework that distills knowledge from Rosetta SCP by fine-tuning the AlphaFold model. First, a cyclic peptide structure dataset is constructed using Rosetta SCP by sampling massive numbers of conformations for cyclic peptides with BNMeAAs and D-AAs and evaluating their energy scores. The AlphaFold model is then fine-tuned to incorporate the extended set of 56 BNMeAAs and D-AAs. In addition, a relative-position cyclic matrix is introduced to explicitly model head-to-tail cyclization. Finally, a force field is employed to minimize steric clashes in the predicted structures. Empirical experiments demonstrate that HighFold-MeD achieves accuracy comparable to that of Rosetta on datasets sampled by its SCP module (with key parameters nstruct = 500 and cyclic_peptide:genkic_closure_attempts = 1000), while accelerating structure prediction by 50-fold, thereby significantly expediting the development of cyclic peptide-based therapeutics. Scientific contribution: We propose HighFold-MeD, which provides a rapid and relatively accurate approach for predicting the structures of cyclic peptides containing backbone N-methylated amino acids and D-amino acids, key building blocks in peptide drug design.
By distilling the knowledge of Rosetta SCP under specific parameters into a fine-tuned AlphaFold framework, our method achieves a 50-fold acceleration while maintaining relatively high accuracy, thereby enabling large-scale cyclic peptide drug design.
C2PO: an ML-powered optimizer of the membrane permeability of cyclic peptides through chemical modification
Aerts R, Tavernier J, Kerstjens A, Ahmad M, Gómez-Tamayo JC, Tresadern G and De Winter H
Peptide drug development is currently receiving due attention as a modality between small and large molecules. Therapeutic peptides represent an opportunity to achieve high potency and selectivity and to reach intracellular targets. A new era in the development of therapeutic peptides emerged with the arrival of cyclic peptides, which avoid the limitations of parenteral administration by achieving sufficient oral bioavailability. However, improving the membrane permeability of cyclic peptides remains one of the principal bottlenecks. Here, we introduce a deep learning regression model of cyclic peptide membrane permeability based on publicly available data. The model starts from a chemical structure and goes beyond limited-vocabulary language models to generalize to monomers beyond those in the training dataset. Moreover, we introduce an efficient estimator2generative wrapper to enable using the model in direct molecular optimization of membrane permeability via chemical modification. We name our application C2PO (Cyclic Peptide Permeability Optimizer). Lastly, we demonstrate how a molecule correction tool can be used to limit the presence of unfamiliar chemistry in the generated molecules. Scientific contribution: We provide an ML-driven optimizer application, named C2PO, that returns structurally modified cyclic peptides with improved membrane permeability, one of the pivotal tasks in drug discovery and development. C2PO is a first-in-class application for cyclic peptide permeability amelioration, in that it converts an ML model into a generative optimizer of chemical structures. Additionally, through demonstration we encourage the use of an automated post-correction tool with a chemistry reference library to correct unfamiliar chemistry outputs from C2PO, a known issue for ML-generated chemical structures.
Predicting the critical micelle concentration of binary surfactant mixtures using machine learning
Choudhary A, Desai S, Kamruzzaman M, Landera A, Ghosh K and Poorey K
Surfactant mixtures play a critical role in industries such as drug delivery, cosmetics, firefighting foams, and lubrication, serving as foundational components of the global economy. Their performance hinges on micelle formation, a self-assembly process governed by the critical micelle concentration (CMC), which enables key functions like solubilization, emulsification, and targeted molecular delivery. However, rapidly and accurately predicting the CMC of mixtures remains a significant challenge due to the chemical diversity and nonlinear interactions between surfactants. Here, we introduce an artificial neural network (ANN)-based machine learning framework to predict the CMC of binary surfactant mixtures. Our workflow leverages cheminformatics-derived molecular descriptors for each surfactant component, which are then aggregated using strategies such as concatenation, arithmetic mean, and harmonic mean. We find that pairing the arithmetic mean strategy with ANN yields the best performance, effectively capturing complex molecular interactions and enabling dual predictive capabilities: (1) precise interpolation of CMC values at untested mole fractions within known mixtures, and (2) accurate prediction of complete CMC-composition profiles for entirely novel surfactant combinations. SHAP-based interpretability analysis highlights that features such as hydrophobic surface area, electronic topological descriptors, and headgroup basicity drive model predictions, aligning with core principles of surfactant chemistry and reinforcing the mechanistic validity of our model. 
Overall, this framework accelerates data-driven surfactant design by reducing experimental burden and enabling rapid, rational optimization of formulations across pharmaceuticals, personal care, environmental remediation, and enhanced oil recovery. Scientific contribution: This study presents a novel machine learning framework that, for the first time, predicts full critical micelle concentration (CMC)-composition profiles for binary surfactant mixtures, including untrained systems. By strategically combining the features of the individual components of mixtures using the arithmetic mean, our artificial neural network model deciphers nonlinear interactions between chemically distinct surfactants, enabling accurate and generalizable CMC predictions. Beyond performance gains, this framework facilitates rapid and systematic exploration of formulation space via inverse design and high-throughput screening, establishing a powerful foundation for the rational development of next-generation surfactants with applications in energy, environmental remediation, pharmaceuticals, and biomedical science.
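The three descriptor-aggregation strategies named in the abstract (concatenation, arithmetic mean, harmonic mean) are simple vector operations on the two components' descriptors. A minimal sketch, with function names chosen for illustration rather than taken from the paper:

```python
# Aggregating the descriptor vectors of two surfactants into one mixture
# feature vector, as compared in the abstract. Illustrative sketch only.

def concatenate(d1, d2):
    """Keep both components' descriptors side by side."""
    return d1 + d2

def arithmetic_mean(d1, d2):
    """Element-wise average; the abstract's best-performing strategy."""
    return [(a + b) / 2.0 for a, b in zip(d1, d2)]

def harmonic_mean(d1, d2):
    """Element-wise harmonic mean, guarding against zero sums."""
    return [2.0 * a * b / (a + b) if (a + b) else 0.0 for a, b in zip(d1, d2)]
```

Concatenation doubles the input dimension, while both means keep it fixed; the abstract reports that the arithmetic mean paired with the ANN performed best.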
Enhancing multi-task in vivo toxicity prediction via integrated knowledge transfer of chemical knowledge and in vitro toxicity information
Park M, Shin Y, Kim H and Nam H
The evaluation of potential drug toxicity is a crucial step in early drug development. In vivo toxicity assessment represents a key challenge that must be addressed before advancing to clinical trials. However, traditional in vivo experiments primarily rely on animal models, raising concerns regarding cost, time efficiency, and ethical considerations. To address these challenges, various computational approaches have been developed to support in vivo toxicity evaluations, though these methods often demonstrate limited generalizability due to data scarcity. In this study, we propose MT-Tox, a knowledge transfer-based multi-task learning model specifically designed for in vivo toxicity prediction that overcomes data scarcity. Our model implements a sequential knowledge transfer strategy across three stages: general chemical knowledge pretraining, in vitro toxicological auxiliary training, and in vivo toxicity fine-tuning. This hierarchical approach significantly improves model performance by systematically leveraging information from both chemical structure and toxicity data sources. MT-Tox outperforms baseline models across three in vivo toxicity endpoints: carcinogenicity, drug-induced liver injury (DILI), and genotoxicity. Through ablation studies and attention analyses, we demonstrate that each knowledge transfer technique makes meaningful contributions to the prediction process. Finally, we demonstrate the real-world application of our model as a prediction tool for early-stage drug discovery through comprehensive DrugBank database screening. Scientific contribution: We propose a knowledge transfer framework that integrates chemical and in vitro toxicological information to enhance in vivo toxicity prediction in low-data regimes. Our model provides dual-level interpretability across chemical and biological domains through attention mechanisms.
Moreover, we demonstrate our model's applicability by screening the DrugBank database, simulating practical toxicity screening scenarios in drug development.
How evaluation choices distort the outcome of generative drug discovery
Özçelik R and Grisoni F
"How to evaluate the de novo designs proposed by a generative model?" Despite the transformative potential of generative deep learning in drug discovery, this seemingly simple question has no clear answer. The absence of standardized guidelines challenges both the benchmarking of generative approaches and the selection of molecules for prospective studies. In this work, we take a fresh (critical and constructive) perspective on de novo design evaluation. By training chemical language models, we analyze approximately 1 billion molecule designs and discover principles consistent across different neural networks and datasets. We uncover a key confounder: the size of the generated molecular library significantly impacts evaluation outcomes, often leading to misleading model comparisons. We find that increasing the number of designs remedies this, and we propose new, compute-efficient metrics suited to large-scale evaluation. We also identify critical pitfalls in commonly used metrics, such as uniqueness and distributional similarity, that can distort assessments of generative performance. To address these issues, we propose new and refined strategies for reliable model comparison and design evaluation. Furthermore, when examining molecule selection and sampling strategies, our findings reveal constraints on diversifying the generated libraries and draw new parallels and distinctions between deep learning and drug discovery. We anticipate that our findings will help reshape evaluation pipelines in generative drug discovery, paving the way for more reliable and reproducible generative modeling approaches. Scientific contribution: Our work takes a step toward enhancing the robustness and reliability of evaluation practices in generative drug discovery. We systematically analyze current evaluation practices using approximately one billion designs from deep learning models.
We find that the number of designs, often an overlooked parameter, can distort scientific outcomes related to distributional similarity and diversity. Moreover, we show that using larger design libraries than are typically adopted helps to avoid this pitfall, and we develop efficient algorithms to enable large-scale studies. We also propose guidelines for prospective molecule selection and uncover inherent constraints in diversifying molecular designs.
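The library-size confounder identified above is easy to reproduce with a toy generator. The "generator" below is just sampling with replacement from a finite pool, an assumption for illustration, but it shows how uniqueness (unique designs / total designs) necessarily shrinks as more designs are sampled:

```python
# Toy demonstration that uniqueness depends on library size, so comparing
# models evaluated at different sample counts can be misleading.
# The stand-in "generator" and pool size are illustrative assumptions.
import random

def uniqueness(designs):
    """Fraction of designs that are distinct."""
    return len(set(designs)) / len(designs)

def sample_designs(n, vocab_size=1000, rng=None):
    """Stand-in generator: draws n designs from a finite pool with replacement."""
    rng = rng or random.Random(0)
    return [rng.randrange(vocab_size) for _ in range(n)]
```

With a pool of 1,000 possible designs, a 100-design library looks almost perfectly unique, while a 100,000-design library from the same generator cannot exceed 1% uniqueness, even though the underlying model is identical.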
NPBS Atlas: a comprehensive data resource for exploring the biological sources of natural products
Xu T, Dai J, Li Y, Zhou J, Zhao Y, Chen W and Xue XS
Natural products continue to play a pioneering role in drug discovery due to their extraordinary chemical and biological diversity. However, their full therapeutic potential remains largely underutilized, hindered by the fragmented documentation of biological origins in existing data resources. Here, we present the Natural Product and Biological Source Atlas (NPBS Atlas), a data resource covering over 218,000 natural products fully annotated with comprehensive biological sources, bioactivities, and references. The database, established through systematic text mining and expert manual curation, places special emphasis on curating source-organism data, including scientific nomenclature, taxonomic classification, source parts, and the sources of Traditional Chinese Medicines. NPBS Atlas represents a significant advancement in natural product data resources through its unique content, specialized annotations, and featured data, thereby enabling unprecedented exploration of nature-derived chemical diversity through biological context. The web interface of NPBS Atlas is freely available at https://biochemai.cstspace.cn/npbs/.
Nipah Virus Inhibitor Knowledgebase (NVIK): a combined evidence approach to prioritise small molecule inhibitors
Singh B, Kumari N, Upadhyay A, Pahuja B, Covernton E, Kalia K, Tuteja K, Paul PR, Kumar R, Zarkar MS and Bhardwaj A
Nipah Virus (NiV) came into the limelight due to an outbreak in Kerala, India. NiV infection can cause severe respiratory and neurological problems, with a fatality rate of 40-70%. It is a public health concern and has the potential to become a global pandemic. The lack of treatment has restricted containment methods to isolation and surveillance. WHO's 'R&D Blueprint list of priority diseases' (2018) indicates that there is an urgent need for accelerated research and development addressing NiV. In the quest for druglike NiV inhibitors (NVIs), a thorough literature search followed by systematic data curation was conducted, and rigorous analysis of the curated NVIs was performed to prioritise compounds. Our efforts led to the creation of the Nipah Virus Inhibitor Knowledgebase (NVIK), a well-curated, structured knowledgebase of 220 NVIs covering 142 unique small-molecule inhibitors. The reported IC50/EC50 values for some of these inhibitors are in the nanomolar range, as low as 0.47 nM. Of the 142 unique small-molecule inhibitors, 124 (87.32%) cleared the PAINS filter. Clustering analysis identified more than 90% of the NVIs as singletons, signifying their diverse structural features. This diverse chemical space can be utilized in numerous ways to develop druglike anti-Nipah molecules. Further, we prioritised the top 10 NVIs based on the robustness of assays, physicochemical properties, and toxicity profiles. All NVI-related information, including structures, physicochemical properties, similarity analysis with FDA-approved drugs and other chemical libraries, and predicted ADMET profiles, is freely accessible at https://datascience.imtech.res.in/anshu/nipah/. NVIK also allows the community to submit newly reported inhibitors to further enhance the NVI landscape. Scientific contribution: NVIK is a dedicated resource for NiV drug discovery containing manually curated NVIs.
The NVIs are structurally mapped against known chemical space to identify their structural diversity and to recommend strategies for chemical library expansion. NVIK also uses a combined evidence-based strategy to prioritise these inhibitors.
How to build machine learning models able to extrapolate from standard to modified peptides
Fernández-Díaz R, Ochoa R, Hoang TL, Lopez V and Shields DC
Bioactive peptides are an important class of natural products with great functional versatility. Chemical modifications can improve their pharmacology, yet their structural diversity presents unique challenges for computational modeling. Furthermore, data for standard peptides (composed of the 20 canonical amino acids) is more abundant than for modified ones. Thus, we set out to identify whether predictive models fitted to standard data are reliable when applied to modified peptides. To do this, we first considered two critical aspects of the modeling problem, namely, choice of similarity function for guiding dataset partitioning and choice of molecular representation. Similarity-based dataset partitioning is an evaluation technique that divides the dataset into train and test subsets, such that the molecules in the test set are different from those used to fit the model.
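The similarity-based partitioning described above can be sketched with a greedy split: a candidate joins the test set only if its similarity to every training item stays below a cutoff. The set-based Tanimoto features, the greedy rule, and the cutoff value are all illustrative assumptions, not the authors' method:

```python
# Hedged sketch of similarity-based dataset partitioning: test-set molecules
# must be dissimilar to everything in the training set, so evaluation probes
# generalization rather than memorization. Illustration only.

def tanimoto(a, b):
    """Tanimoto similarity between two feature sets (e.g. fingerprint bits)."""
    return len(a & b) / len(a | b) if a | b else 1.0

def similarity_split(items, features, cutoff=0.4):
    """Greedy split: items too similar to the train set also go to train."""
    train, test = [items[0]], []
    for item in items[1:]:
        if any(tanimoto(features[item], features[t]) >= cutoff for t in train):
            train.append(item)
        else:
            test.append(item)
    return train, test
```

Under such a split, good test performance is evidence the model can extrapolate, which is the question the abstract poses for standard-to-modified peptide transfer.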
Accurate structure-activity relationship prediction of antioxidant peptides using a multimodal deep learning framework
Duy HA and Srisongkram T
Antioxidant peptides (AOPs) have emerged as promising peptide agents due to their efficacy in counteracting oxidative stress-related diseases and their applicability in the functional food and cosmetic industries. In this study, we developed a comprehensive quantitative structure-activity relationship (QSAR) model utilizing a multimodal deep learning framework that integrates six sequence-based structure representations with stacking ensemble neural architectures (convolutional neural networks, bidirectional long short-term memory, and Transformers) to enhance predictive accuracy. Additionally, we employed a generative model to design novel AOP candidates, which were subsequently evaluated using the best-performing QSAR model. Remarkably, the stacking models using one-hot encoding achieved outstanding predictive metrics, with accuracy, AUROC, and AUPRC values surpassing 0.90 and MCC above 0.80, demonstrating a highly accurate and robust QSAR model. SHAP analysis highlighted that proline, leucine, alanine, tyrosine, and glycine are the top five residues that positively influence antioxidant activity, whereas methionine, cysteine, tryptophan, asparagine, and threonine negatively impact it. Finally, 604 high-confidence AOPs were computationally identified. This study demonstrates that the multimodal framework improves the prediction accuracy, robustness, and interpretability of AOP prediction. It also enables the efficient discovery of high-potential AOPs, thereby offering a powerful pipeline for accelerating peptide discovery in pharmaceutical and functional applications.
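One-hot encoding, the representation behind the best-performing stacking models above, maps each residue to a binary indicator vector. A minimal sketch; the alphabet ordering, padding length, and function name are assumptions for illustration:

```python
# One-hot encoding of a peptide sequence: a max_len x 20 binary matrix with
# one 1 per occupied position. Illustrative sketch, not the authors' code.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues

def one_hot(sequence, max_len=10):
    """Encode a peptide as a zero-padded max_len x 20 binary matrix."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(sequence[:max_len]):
        matrix[pos][index[aa]] = 1  # mark this residue at this position
    return matrix
```

Matrices of this shape feed naturally into the convolutional, recurrent, and Transformer branches of a stacking ensemble, since each model sees the same position-by-residue grid.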
Beyond performance: how design choices shape chemical language models
Fender I, Gut JA and Lemmin T
Chemical language models (CLMs) have shown strong performance in molecular property prediction and generation tasks. However, the impact of design choices, such as molecular representation format, tokenization strategy, and model architecture, on both performance and chemical interpretability remains underexplored. In this study, we systematically evaluate how these factors influence CLM performance and chemical understanding. We evaluated models through fine-tuning on downstream tasks and probing the structure of their latent spaces using probing predictors, vector operations, and dimensionality reduction techniques. Although downstream task performance was similar across model configurations, substantial differences were observed in the structure and interpretability of internal representations, highlighting that design choices meaningfully shape how chemical information is encoded. In practice, atomwise tokenization generally improved interpretability, and a RoBERTa-based model with SMILES input remains a reliable starting point for standard prediction tasks, as no alternative consistently outperformed it. These results provide guidance for the development of more chemically grounded and interpretable CLMs.
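Atomwise tokenization splits a SMILES string into chemically meaningful units (bracket atoms, two-letter halogens, ring-closure digits, bond and branch symbols) rather than single characters. A common regex-based sketch follows; the exact pattern any given CLM uses may differ:

```python
import re

# Widely used atomwise SMILES pattern: bracket atoms first, then
# two-letter halogens (Br, Cl), organic-subset atoms, and punctuation.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d"
)

def tokenize(smiles):
    """Split a SMILES string into atomwise tokens; the round-trip check
    guards against characters the pattern does not cover."""
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens
```

Ordering matters in the alternation: `Br?` and `Cl?` must precede the single-letter atoms so that bromine and chlorine are not split into boron/carbon plus a stray letter.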
HTA - An open-source software for assigning head and tail positions to monomer SMILES in polymerization reactions
de Souza Ferrari B, Giro R and Steiner MB
Artificial Intelligence (AI) techniques are transforming the computational discovery and design of polymers. The key enablers for polymer informatics are machine-readable molecular string representations of the building blocks of a polymer, i.e., the monomers. In monomer strings, such as SMILES, symbols at the head and tail atoms indicate the locations of bond formation during polymerization. Since the linking of monomers determines a polymer's properties, the performance of AI prediction models will, ultimately, be limited by the accuracy of the head and tail assignments in the monomer SMILES. Considering the large number of polymer precursors available in chemical databases, reliable methods for the automated assignment of head and tail atoms are needed. Here, we report a method for assigning head and tail atoms in monomer SMILES by analyzing the reactivity of their functional groups based on the atomic index of nucleophilicity. In a reference data set containing 206 polymer precursors, the HeadTailAssign (HTA) algorithm correctly predicted the polymer class of 204 monomer SMILES, achieving an accuracy of 99%. The head and tail atoms were correctly assigned to 187 monomer SMILES, representing an accuracy of 91%. The HTA code is available for validation and reuse at https://github.com/IBM/HeadTailAssign . SCIENTIFIC CONTRIBUTION: The algorithm was successfully applied to data pre-processing by tagging the linkage bonds in monomers for defining the repeat units in polymerization reactions.
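Computing the nucleophilicity index itself is beyond a short sketch, but the final assignment step can be illustrated generically: given per-atom reactivity scores, take the two most reactive atoms as the head and tail. The scores below are hypothetical inputs, and this is not HTA's actual implementation:

```python
def assign_head_tail(reactivity):
    """Given a list of per-atom reactivity scores (e.g., a nucleophilicity
    index), return the indices of the two most reactive atoms as
    (head, tail), most reactive first."""
    ranked = sorted(range(len(reactivity)), key=lambda i: reactivity[i], reverse=True)
    return ranked[0], ranked[1]
```

In a real pipeline the two selected atom indices would then be tagged in the monomer SMILES to mark where bonds form during polymerization.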
Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework
Ganeeva V, Khrabrov K, Kadurin A and Tutubalina E
The recent integration of natural language processing into chemistry has advanced drug discovery. Molecule representations in language models (LMs) are crucial for enhancing chemical understanding. We explored the ability of models to match the same chemical structures despite their different representations; recognizing the same substance across representations is an important component of emulating an understanding of how chemistry works. We propose Augmented Molecular Retrieval (AMORE), a flexible zero-shot framework for assessing chemistry LMs of different natures. The framework is based on SMILES augmentations that preserve the underlying chemical structure. The proposed method compares the similarity between the embedding representations of a molecule, its SMILES variations, and those of other molecules. Experiments indicate that the tested chemical LLMs are still not robust to different SMILES representations. We evaluated the models on various tasks, including molecular captioning on the ChEBI-20 benchmark and classification and regression tasks from the MoleculeNet benchmark. We show that the change in results after SMILES string variation aligns with the proposed AMORE framework.
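The retrieval idea behind AMORE can be illustrated with a top-1 check: the embedding of an augmented SMILES should be closer to its own molecule's canonical embedding than to any other molecule's. A pure-Python cosine-similarity sketch follows; the paper's exact metric and aggregation may differ:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieval_accuracy(canonical_emb, augmented_emb):
    """Fraction of molecules whose augmented-SMILES embedding is nearest
    (top-1) to its own canonical embedding."""
    hits = 0
    for i, aug in enumerate(augmented_emb):
        best = max(range(len(canonical_emb)),
                   key=lambda j: cosine(aug, canonical_emb[j]))
        hits += best == i
    return hits / len(augmented_emb)
```

A perfectly representation-robust model would score 1.0 on this check; the lower the score, the more the model's embeddings depend on the surface form of the SMILES string rather than the molecule itself.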
Enhancing molecular property prediction through data integration and consistency assessment
Parrondo-Pizarro R, Menestrina L, Garcia-Serna R, Fernández-Torras A and Mestres J
Data heterogeneity and distributional misalignments pose critical challenges for machine learning models, often compromising predictive accuracy. These challenges are exemplified in preclinical safety modeling, a crucial step in early-stage drug discovery where limited data and experimental constraints exacerbate integration issues. Analyzing public ADME datasets, we uncovered significant misalignments as well as inconsistent property annotations between gold-standard and popular benchmark sources, such as the Therapeutic Data Commons. These discrepancies, which can arise from factors including differing experimental conditions during data collection and differing chemical space coverage, can introduce noise and ultimately degrade model performance. Data standardization, despite harmonizing discrepancies and increasing the training set size, may not always improve predictive performance. This highlights the importance of rigorous data consistency assessment (DCA) prior to modeling. To facilitate systematic DCA across diverse datasets, we developed AssayInspector, a model-agnostic package that leverages statistics, visualizations, and diagnostic summaries to identify outliers, batch effects, and discrepancies. Beyond preclinical safety, DCA can play a crucial role in federated learning scenarios, enabling effective transfer learning across heterogeneous data sources and supporting reliable integration across diverse scientific domains.
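One simple diagnostic of the kind a consistency-assessment tool might report is a two-sample Kolmogorov-Smirnov statistic between the property distributions of two sources: a large value flags a distributional misalignment worth inspecting before merging the datasets. This generic sketch is illustrative and is not AssayInspector's actual implementation:

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the empirical CDFs of two measurement sets."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = sum(x <= v for x in a) / len(a)
        cdf_b = sum(x <= v for x in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

The statistic ranges from 0.0 (identical empirical distributions) to 1.0 (completely non-overlapping ranges), giving a quick, unitless screen for batch effects across assay sources.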
Efficient decoy selection to improve virtual screening using machine learning models
Victoria-Muñoz F, Menke J, Sanchez-Cruz N and Koch O
Machine learning models using protein-ligand interaction fingerprints show promise as target-specific scoring functions in drug discovery, but their performance critically depends on the underlying decoy selection strategies. Recognizing this critical role, we analyzed various decoy selection strategies to enhance machine learning models based on the Protein per Atom Score Contributions Derived Interaction Fingerprint (PADIF). We explored three distinct workflows for decoy selection: (1) random selection from extensive databases like ZINC15, (2) leveraging recurrent non-binders from high-throughput screening (HTS) assays stored as dark chemical matter, and (3) data augmentation by utilizing diverse conformations from docking results. Active molecules from ChEMBL, combined with these decoy approaches, were used to train and test different machine learning models based on PADIF. Final validation used experimentally determined inactive compounds from the LIT-PCBA dataset. Our findings reveal that models trained with random selections from ZINC15 and compounds from dark chemical matter closely mimic the performance of those trained with actual non-binders, presenting viable alternatives for creating accurate models in the absence of specific inactivity data. Furthermore, all models showed an enhanced ability to explore new chemical spaces for their specific target and improved top active compound selection over classical scoring functions, thereby boosting the screening power in molecular docking. These findings demonstrate that appropriate decoy selection strategies can maintain model accuracy while expanding applicability to targets even when extensive experimental data are lacking.
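Random decoy selection from a large pool is often combined with property matching so that decoys are not trivially separable from actives by bulk properties. The molecular-weight-matching sketch below is an illustrative assumption about how such a selection can work, not the exact workflow used in the study:

```python
import random

def select_decoys(actives_mw, pool_mw, per_active=3, tolerance=25.0, seed=0):
    """For each active's molecular weight, randomly sample decoys from a
    candidate pool whose weights lie within `tolerance` daltons; matching
    on a bulk property prevents the model from learning it as a shortcut."""
    rng = random.Random(seed)  # fixed seed for reproducible selection
    decoys = []
    for mw in actives_mw:
        candidates = [d for d in pool_mw if abs(d - mw) <= tolerance]
        decoys.extend(rng.sample(candidates, min(per_active, len(candidates))))
    return decoys
```

In practice the pool would come from a database such as ZINC15 or from dark chemical matter, and matching could extend beyond molecular weight to other physicochemical properties.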