Journal of Bioinformatics and Computational Biology

Early lifespan prediction in C. elegans via contrastive learning and channel attention
Jin M, Chen W and Pan Y
Early lifespan prediction in C. elegans faces several challenges: indistinct discriminative signals, subtle and localized key features, difficult data annotation, and poor generalization. We propose Contrastive Learning-guided Channel Attention Modulation (CLCAM), in which supervised contrastive learning clusters individuals of the same lifespan class and separates those of different classes. The resulting embedding drives channel-wise gains that are additively coupled to the backbone, thereby amplifying subtle morphological cues. At inference, the contrastive branch is removed, keeping FLOPs essentially unchanged with a modest runtime cost on our hardware. On a public dataset, CLCAM achieves an AUC-ROC of 0.84, a consistent improvement over the EfficientNet-B3 baseline (0.82) and a substantial gain over the prior WormNet model (0.61). Grad-CAM indicates attention focused on the pharynx and body-wall musculature, supporting the biological plausibility of the model's decisions. CLCAM offers a clear, low-overhead paradigm for early lifespan phenotyping. CLCAM code is available at https://github.com/JMM502/CLCAM/tree/master/clcam.
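The supervised contrastive objective mentioned in this abstract can be illustrated with a minimal sketch. It assumes the widely used SupCon formulation (same-label samples are positives, all other samples form the denominator); the abstract does not state the exact loss, temperature, or embedding size used in CLCAM, so the function name `supcon_loss` and the default `temperature=0.1` are illustrative assumptions only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss (assumed standard SupCon form, not taken from the paper).

    For each anchor i, every other sample with the same label is a positive;
    the loss is the mean negative log-probability of picking a positive
    among all other samples under a softmax over scaled similarities.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature for j in others}
        # log-sum-exp with max subtraction for numerical stability
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

With embeddings that already form two tight groups, the loss is lower when the labels agree with the groups than when they cut across them, which is the clustering pressure the abstract describes.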
Predicting ncRNA-Protein Interactions with a Graph Attention Model Exploiting Personalized Subgraphs
Khoushehgir F and Noshad Z
Predicting interactions between ncRNAs and proteins is crucial for advancing our understanding of gene regulation, disease mechanisms, targeted drug design, and biomarker discovery, thereby driving innovation in research and therapeutic development. Numerous computational methods, particularly those employing machine learning and deep learning, have been proposed to address this challenge. Recent studies show that graph neural networks (GNNs) enhance ncRNA-protein interaction prediction accuracy by capturing intricate relationships and structural details in molecular data. However, current GNN approaches frequently rely on fixed-hop subgraphs for structural analysis, limiting their capacity to capture diverse interaction patterns fully. This fixed-hop approach may omit crucial nodes and edges outside the predefined neighborhood, potentially reducing prediction accuracy. To overcome this constraint, we introduce a novel method for ncRNA-protein interaction prediction by extracting the most informative subgraphs around each interaction using the personalized subgraph selection framework. These subgraphs are then utilized in a graph attention network (GAT) to learn node representations. k-mer frequencies are used to capture sequence-level features, while node2vec embeddings capture structural information, providing the GNN with a robust set of features. Experimental results on relevant datasets indicate a significant improvement in predicting ncRNA-protein interactions, with the algorithm maintaining an acceptable level of computational complexity even on large datasets. By integrating both sequence and structural insights through personalized subgraphs, this approach delivers a more accurate and scalable solution for predicting ncRNA-protein interactions.
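As an illustration of the sequence-level features described in this abstract, the sketch below computes a normalized k-mer frequency vector for an RNA sequence. The choice of `k=3`, the RNA alphabet, and the function name `kmer_frequencies` are assumptions for the example; the abstract does not specify the value of k or the normalization used.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGU"):
    """Return the relative frequency of every possible k-mer, in lexicographic alphabet order.

    k-mers containing characters outside the alphabet (for example N) are ignored.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]
```

The resulting vector has a fixed length of `len(alphabet) ** k` regardless of sequence length, which is what makes it convenient as a node feature.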
COLDLNA: Enhancing long-range node features extraction to improve robust generalization ability of drug-target binding affinity prediction in cold-start scenarios
Xu T, Jiang S, Ding W and Wang P
Recent advances in deep learning have driven significant progress in drug-target affinity (DTA) prediction. However, many models do not effectively utilize drug molecular graphs or capture long-range protein features, limiting their predictive accuracy. To address these limitations, a novel COLDLNA model is designed for robust DTA prediction. The model employs the Long-range Node Attention Module to refine drug structure representations, while leveraging the Convolutional Attention Module to elucidate critical binding sites by extracting pivotal long-range information from protein amino acid sequences. Compared with the baseline model GraphDTA, COLDLNA reduced the MSE by 12.2% and 11.5% on the Davis and KIBA datasets, respectively. Additionally, its strong generalization ability was further validated on the Human dataset, C. elegans dataset, and in cold-start scenarios.
Editorial: Guidelines for Credible Machine Learning in Computational Biology
Wong L
Cancer classification and functional pathway discovery using TCGA transcriptomic profiles: A matched case-control framework
Wang JH, Guo TY, Pai YY, Hou PL, Kumari H and Chan MWY
Leveraging high-dimensional transcriptomic data from The Cancer Genome Atlas (TCGA) for cancer classification holds critical significance for advancing precision oncology. Matched Case-control Design (MCCD), by pairing similar cases with controls, can enhance statistical power and reduce confounding bias. However, high-dimensional data present challenges such as overfitting, instability, and difficulty in interpretation, collectively referred to as the "curse of dimensionality." Feature selection can help mitigate these problems by identifying representative variables and reducing redundancy. This study's innovation lies in integrating a set of existing techniques into a unified analytical workflow tailored specifically for MCCD, validated through both simulated and real TCGA datasets. We compared the performance of paired versus unpaired feature selection approaches under simulated 1:1 MCCD scenarios, and developed a modular, pluggable pipeline. This includes mean-centering, gene filtering, and a Corrected Feature Matrix (CFM) transformation step that explicitly preserves the matched structure. This transformation is then combined with machine learning classifiers to predict cancer status. We also incorporated Incremental Feature Selection (IFS) to refine gene subsets and employed gene set enrichment analysis to enhance biological interpretability. While the individual components we used, such as paired testing, CFM, IFS, and model-based gene set analysis, are not novel in themselves, we demonstrate an integrated workflow optimized for MCCD tasks. This workflow outperforms uncorrected approaches in terms of classification accuracy, feature stability, and interpretability. Our results indicate that this method can enhance cancer classification accuracy, facilitate biomarker discovery, and aid in building interpretable diagnostic models, providing a practical and scalable tool for precision medicine.
The novel design of a multi-epitope vaccine candidate against the dengue virus using advanced immunoinformatics and structural analysis
Fath MK
Dengue virus (DENV) remains a major public health challenge with limited vaccine options, and current licensed vaccines exhibit restricted efficacy and safety concerns in certain populations. Advanced immunoinformatics approaches offer opportunities for designing multi-epitope vaccines targeting conserved and immunogenic regions of viral proteins. This study aimed to design and computationally evaluate a novel multi-epitope vaccine targeting the Envelope (E) and Non-Structural protein 1 (NSP1) of DENV-1 and DENV-2 using integrated immunoinformatics and structural bioinformatics. CTL, HTL, and B-cell epitopes were predicted from the E and NSP1 proteins and screened for antigenicity, non-allergenicity, and non-toxicity. High-affinity epitopes were linked with appropriate spacers and adjuvants (human β-defensin-3 or 50S ribosomal protein L7/L12) to construct two vaccine candidates. Molecular docking with TLR2/TLR4, molecular dynamics (MD) simulations, MM/GBSA binding free energy analysis, population coverage assessment, codon optimization, and immune simulations were conducted. Control docking using scrambled peptides was included to evaluate binding specificity. Both vaccine constructs were predicted to be stable, soluble, non-allergenic, and non-toxic. Vaccine 2 showed higher antigenicity (VaxiJen: 0.6127) and stronger TLR2 binding (-110.37 kcal/mol), whereas vaccine 1 demonstrated better solubility and TLR4 interaction stability. Control docking with scrambled peptides produced less favorable binding energies, supporting specificity. MD simulations confirmed structural stability, and immune simulations predicted robust humoral and cellular responses with high IFN-γ production. Population coverage exceeded 98% in most regions. The designed multi-epitope vaccines demonstrate promising immunogenic potential in silico. Experimental validation is required to confirm safety, efficacy, and protective capability against multiple DENV serotypes.
Modelling and optimizing combination therapeutic strategies for KRAS- and EGFR-mutant lung cancer
Wu L, Yu R, Yao M, Rahaman MM and Fang Z
Non-small cell lung carcinoma (NSCLC) is known for its high incidence (about 80% of lung cancers) and genetic heterogeneity. Driver mutations such as those in EGFR and KRAS have led to established targeted therapies with kinase inhibitors, whereas immune checkpoint inhibitors (ICIs) have revolutionized immunotherapy. However, challenges such as frequent drug resistance and low response rates highlight the need for novel therapeutic strategies. Boolean network modeling is a powerful mathematical tool to simulate complex biological processes and optimize potential treatment strategies. This study developed a Boolean network model for NSCLC patients with different mutational backgrounds and evaluated therapeutic effects by incorporating inhibitors of key mutant kinases and immunological interventions. Simulations in both the Boolean network model and a second, quantitative model consistently suggested that the optimal therapeutic strategy for KRAS-mutant patients is a combination of a KRAS inhibitor and an ICI, which is in line with mouse model studies and the KRYSTAL-7 phase-2 clinical trial data. Further validation can reasonably be expected from the recently announced KRYSTAL-7 phase-3 clinical trial, which compares the combination therapy with pembrolizumab monotherapy. Our approach highlights the value of computational modeling to evaluate and refine therapeutic strategies for precision oncology.
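The Boolean network approach named in this abstract can be sketched as synchronous updates that are iterated until the state repeats (an attractor). The node names and update rules below are invented purely for illustration and are not the authors' network; the outputs of this toy carry no biological meaning.

```python
def step(state, rules):
    """Synchronous update: every node is recomputed from the previous state."""
    return {node: rule(state) for node, rule in rules.items()}


def find_attractor(state, rules, max_steps=100):
    """Iterate until a state recurs and return the cycle (length 1 means a fixed point)."""
    seen = []
    for _ in range(max_steps):
        if state in seen:
            return seen[seen.index(state):]
        seen.append(state)
        state = step(state, rules)
    raise RuntimeError("no attractor found within max_steps")


# Hypothetical toy wiring for a KRAS-mutant background (assumption, not from the paper).
TOY_RULES = {
    "KRAS_inhibitor": lambda s: s["KRAS_inhibitor"],  # treatment input, held fixed
    "ICI": lambda s: s["ICI"],                        # treatment input, held fixed
    "ERK": lambda s: not s["KRAS_inhibitor"],         # mutant KRAS signals unless inhibited
    "PDL1": lambda s: s["ERK"],
    "T_cell": lambda s: s["ICI"] or not s["PDL1"],
    "Tumor_growth": lambda s: s["ERK"] and not s["T_cell"],
}


def simulate(kras_inhibitor, ici):
    """Start with all internal nodes off and return the attractor for a treatment setting."""
    start = {node: False for node in TOY_RULES}
    start["KRAS_inhibitor"], start["ICI"] = kras_inhibitor, ici
    return find_attractor(start, TOY_RULES)
```

Treatment strategies are compared by fixing the input nodes and inspecting the readout node (here `Tumor_growth`) in the resulting attractor; a realistic model would also scan initial states and mutational backgrounds.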
Metagenomic sequence classification based on local sensitive hashing and Bi-LSTM
Qian Y, Xiao L, Zhou Y and Deng L
Current metagenomic classification methods are limited by short k-mer lengths and database dependency, resulting in insufficient taxonomic resolution at the species and genus levels. This study proposes the first method integrating Locality-Sensitive Hashing (LSH) and Bidirectional Long Short-Term Memory (Bi-LSTM) networks for metagenomic sequence classification. The approach reduces runtime reliance on reference databases by learning discriminative features directly from sequences, while supporting long k-mers. The method consists of three key steps: (1) k-mer representation via locality-sensitive hashing, (2) k-mer embedding using the skip-gram model, and (3) label assignment to the embedded vectors, followed by training of a Bi-LSTM network. Experimental results demonstrate superior classification performance at the genus level compared to existing models. Future work will explore the application of this method in the rapid detection of clinical pathogens.
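Step (1) of this abstract can be sketched with bit-sampling LSH, one simple locality-sensitive family for Hamming distance: a long k-mer is reduced to the characters at a fixed random subset of positions, so k-mers that differ at only a few positions often share a token. The abstract does not say which LSH family the authors use, so this choice, the function names, and the parameters are assumptions.

```python
import random


def make_bit_sampling_lsh(k, n_positions, seed=0):
    """Build a hash function that keeps n_positions randomly chosen positions of a k-mer.

    Returns the hash function and the sampled positions (exposed for inspection).
    """
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(k), n_positions))

    def lsh(kmer):
        return "".join(kmer[p] for p in positions)

    return lsh, positions


def hashed_kmer_tokens(seq, k, lsh):
    """Slide a window of length k over the read and hash each k-mer to a shorter token."""
    return [lsh(seq[i:i + k]) for i in range(len(seq) - k + 1)]
```

The token sequence has a vocabulary of at most `4 ** n_positions` entries instead of `4 ** k`, which is what keeps a skip-gram embedding table tractable when k is long.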
NanoporeInspect: An interactive tool for evaluating nanopore sequencing quality and ligation efficiency
Grigoryeva MA, Khrenova MG and Zvereva MI
In nanopore sequencing, especially in SELEX-based aptamer discovery, the correct ligation of artificial sequences (primers, adapters, barcodes) is crucial for library quality. Errors at this stage can lead to misidentification of sequences and loss of valuable information. Existing quality control tools lack focused capabilities to assess the positioning and prevalence of these artificial sequences. NanoporeInspect is a web-based tool designed to fill this gap by providing targeted insights into ligation efficacy and systematic biases within sequencing data. NanoporeInspect operates as a user-friendly, web-based platform that leverages a modern software stack with Flask, Celery and Redis to handle scalable and asynchronous task processing, and Plotly to deliver interactive visualizations. Evaluation of NanoporeInspect on various nanopore datasets demonstrated its effectiveness in discerning differences in ligation quality. Libraries with inefficient ligation showed irregular adapter and barcode distributions, indicating preparation issues, while high-quality libraries displayed uniform patterns, reflecting effective ligation.
Parameter estimation analysis of the glioblastoma immune model
Liu B, Shen M and Zhao M
In exploring optimal strategies for immunotherapy in glioblastoma (GBM), one of the main challenges is enhancing treatment response. To better understand the dynamics of tumor-immune interactions, we applied Bayesian methods to estimate the parameters of a glioblastoma immune model from experimental data. We compared parameter estimation under uniform prior distributions with estimation under improved prior distributions that were adjusted based on posterior information. In addition, we compared the results obtained with four Markov Chain Monte Carlo (MCMC) sampling algorithms: Metropolis, DEMetropolis, DEMetropolisZ, and NUTS. The improved prior distributions significantly enhanced the accuracy of the model parameter estimates and reduced the variance of the posterior distribution, but increased computational time and resource demands. Furthermore, DEMetropolisZ outperformed the other samplers, providing efficient sampling and narrower confidence intervals within a shorter time frame. In contrast, the efficiency and stability of the Metropolis method were relatively poor. These results underline the importance of selecting appropriate prior distributions and sampling algorithms to improve both the accuracy and efficiency of model inference. The study provides valuable insights for optimizing GBM immunotherapy strategies and serves as a reference for modeling and parameter estimation of complex biological systems.
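The simplest of the four samplers compared in this abstract, random-walk Metropolis, can be sketched on a toy problem. The example estimates the mean of Gaussian data under a uniform prior; the toy model, the prior bounds, the step size, and all names are assumptions for illustration and are unrelated to the glioblastoma model or to the authors' software.

```python
import math
import random


def log_posterior(mu, data, sigma=0.5, lo=0.0, hi=10.0):
    """Unnormalized log posterior: Uniform(lo, hi) prior times a Gaussian likelihood."""
    if not lo <= mu <= hi:
        return -math.inf  # zero prior mass outside the bounds
    return -sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2)


def metropolis(data, n_samples=5000, burn_in=1000, step=0.3, start=1.0, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    mu, lp = start, log_posterior(start, data)
    samples = []
    for i in range(n_samples + burn_in):
        proposal = mu + rng.gauss(0.0, step)
        lp_new = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            mu, lp = proposal, lp_new
        if i >= burn_in:
            samples.append(mu)
    return samples
```

Replacing the uniform prior with a narrower one centred on an earlier posterior changes only `log_posterior`, which mirrors the prior comparison described in the abstract.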
Visual-SELEX: A technology ensemble for evaluating aptamer structural similarity via 3D visual spatial conformational analysis
Wang N
To date, the study of single-stranded DNA (ssDNA) similarity has focused mainly on the similarity of bases in the same position in the nucleic acid sequence. However, focusing only on the similarity of base sequences has limitations. This similarity evaluation considers only the one-dimensional similarity of ssDNA and cannot fully capture the three-dimensional (3D) structural consistency of aptamers for nucleic acids with 3D structures. Therefore, it is necessary to develop a program that can quickly and accurately evaluate the 3D spatial consistency of ssDNA. To this end, we designed a Visual-SELEX rapid response program, which uses a screening ssDNA sequence set enriched in the DKK1 protein for analysis. The program directly generates a stable 3D structure of ssDNA through coarse-grained simulation and molecular dynamics (MD) simulation, converts the structure into a point cloud model, and then analyzes the similarity of the spatial structure of ssDNA through point cloud model alignment and superposition. The analysis results show that Visual-SELEX can accurately match ssDNAs with dissimilar base fragments but similar 3D spatial structures, providing richer 3D spatial similarity information than sequence similarity comparison alone.
Fractal dimensionality of a coiled helical coil
Kak S
The helical coil is ubiquitous in biological and natural systems, and it is often the basic form in complex structures. This paper considers the question of its dimensionality, , in biological information as the helical coil goes through recursive coiling as in DNA and RNA molecules in chromatin, in which the -value is a function of the lengthening of the curve. It is shown that the dimensionality of coiled coils is virtually equal to . Of the three forms of DNA, the dimensionality of the B-form is nearest to the optimal value, and this might be the reason why it is most common.
An algorithm for peptide de novo sequencing from a group of SILAC labeled MS/MS spectra
Han F and Zhang K
Shotgun proteomics coupled with high-performance liquid chromatography and mass spectrometry has been instrumental in identifying proteins in complex mixtures. Effective computational approaches are required to automate the spectra interpretation process to handle the vast amount of data collected in a single Liquid Chromatography-Tandem Mass Spectrometry (LC-MS/MS) run. De novo sequencing from MS/MS has emerged as a vital technology for peptide sequencing in proteomics. To enhance the accuracy and practicality of de novo sequencing, previous algorithms have utilized multiple spectra to identify peptide sequences. Here, our study focuses on de novo sequencing of multiple tandem mass spectra of peptides with stable isotope labeling with amino acids in cell culture (SILAC) by incorporating different isotope-labeled amino acids into newly synthesized proteins. Multiple MS/MS spectra for the same peptide sequence are produced by the spectrometer after the SILAC samples undergo processing by LC-MS/MS shotgun proteomics. Taking into consideration the factors such as retention time and precursor ion mass, we aim to identify the peptide sequence with specific SILAC modifications and their locations. To do so, we propose de novo sequencing algorithms to compute the potential candidate peptide sequence by using similarity scores, followed by refinement algorithms to evaluate them. We also use real experimental data to test the algorithms.
A brief review and comparative analysis of RNA secondary structure prediction tools
Ballaney P, Saha G, Kulshrestha V, Thaker PH, Hasija P, Talukdar I and Aduri R
Ribonucleic acid (RNA) lies at the heart of the central dogma. It spans the breadth of biological functions, from information storage to gene regulation and catalysis. RNA molecules must attain specific structures to perform these functions, and their structures depend on their sequences. Predicting the structure of RNA has been a central problem in computational biology. Various methods have been developed for this purpose - while some consider the thermodynamics of folding, others abstract away the details behind neural networks (NN). This paper presents a brief overview of the existing tools for predicting RNA secondary structures from a given single RNA sequence. Furthermore, a comparative analysis of the different prediction software packages is also presented. Performance is analyzed by running each of the available software packages on a novel dataset developed using 3D crystal structures of RNA. Software packages considered include those that can predict pseudoknots along with those that cannot. Variation in software performance based on the length and type of RNA is described.
Deep learning inference of miRNA expression from bulk and single-cell mRNA expression
Ripan RC, Athaya T, Li X and Hu H
Studying miRNA activity at the single-cell level presents a significant challenge due to the limitations of existing single-cell technologies in capturing miRNAs. To address this, we introduce two deep learning models: Cross-modality (CM) and single-modality (SM), both based on encoder-decoder architectures. These models predict miRNA expression at both bulk and single-cell levels using mRNA data. We evaluated the performance of CM and SM against the state-of-the-art miRSCAPE approach, using both bulk and single-cell datasets. Our results demonstrate that both CM and SM outperform miRSCAPE in accuracy. Furthermore, incorporating miRNA target information substantially enhanced performance compared to models that utilized all genes. These models provide powerful tools for predicting miRNA expression from single-cell mRNA data.
Analysis of clonal evolution in cancer: A computational perspective
Ribeiro PH and Simao A
Cancer is a complex disease that progresses through Darwinian evolution in cells with genetic mutations, leading to the development of multiple distinct cell populations within tumors, a process known as clonal evolution. While computational methods aid in the analysis of clonal evolution in cancer samples using genetic sequencing data, accurately identifying the clonal structure of tumor samples remains one of the biggest challenges in Cancer Genomics. Several computational methods for analyzing clonal evolution in cancer have been developed in recent years. However, the algorithms of these computational methods are complex and often described at a high level of abstraction. This paper provides a detailed explanation of some computational methods for clonal evolution analysis from a computational perspective, aiding in understanding their mechanisms. Additionally, some methods have been implemented on an online platform, enabling researchers to easily run and analyze the algorithms, as well as adapt these methods to their specific needs.
Computational modeling and dynamical analysis for B. subtilis competence genic regulation circuit with multiple time delays and external noisy regulation
Zhao N, Liu H and Yan F
Bacillus subtilis (B. subtilis), a bacterium known to enter a competent state spontaneously, has garnered significant attention due to its intricate internal regulatory mechanisms. This study proposes a six-dimensional continuous delay differential equation (DDE) model incorporating two time delays, together with a stochastic model that accounts for noise, to examine the dynamic behaviors of the B. subtilis competence gene regulation circuit. Our investigation reveals that time delays play a crucial role in inducing oscillatory behavior within the continuous DDE model. Analyzing the dynamics of multiple time delays proves to be more intricate than studying a single delay. Furthermore, certain parameter adjustments significantly influence the system's dynamic characteristics. The introduction of noise also triggers oscillations, with the irregular oscillation patterns closely aligning with real-world observations. Intriguingly, the effects of parameters and noise regulation undergo significant changes when time delays are jointly considered. This analysis offers a fresh perspective on understanding B. subtilis competence and provides essential theoretical support for subsequent experimental endeavors in this domain of biomathematics.
Prediction and annotation of alternative transcription starts and promoter shift in the chicken genome
Grushina VA, Yevshin IS, Gusev OA, Kolpakov FA, Stanishevskaya OI, Fedorova ES, Zinovieva NA and Pintus SS
Promoter shifting, characterized by alterations in Transcription Start Site (TSS) coordinates, is a well-documented phenomenon. The impact and statistical significance of promoter shifting can be assessed through analysis of Cap Analysis of Gene Expression (CAGE) data. This phenomenon is associated with developmental stage transitions, tissue differentiation, and cellular responses to environmental stimuli. Differential promoter usage suggests nonconstitutive expression of the regulated gene, indicative of focused promoter utilization. Conversely, housekeeping genes are typically characterized by stable expression levels driven by multiple dispersed promoters and are commonly expressed across a wide range of tissues. However, our findings demonstrate that many ubiquitously expressed genes utilize single, focused promoters and undergo significant promoter shifting, adding a layer of complexity to the definition of a housekeeping gene. Differential gene expression is commonly used to study gene responses to external stimuli in cells and tissues. Here, we employ an alternative approach based on differential promoter usage, identifying genes exhibiting significant promoter shifting as signatures of tissue response and phenotypic effects. Our results suggest that variations in chicken growth rate are regulated by nutrient metabolism rates, mediated through differential promoter usage of relevant genes.
M-20M: A large-scale multi-modal molecule dataset for AI-driven drug design and discovery
Guo S, Wang L, Jin C, Wang J, Peng H, Shi H, Li W, Guan J and Zhou S
This paper introduces M-20M, a large-scale dataset that contains over 20 million molecules, with the data mainly integrated from existing databases and partially generated by large language models. Designed to support AI-driven drug design and discovery, M-20M has 71 times the number of molecules of the largest existing dataset, providing an unprecedented scale that can greatly benefit the training or fine-tuning of models, including large language models, for drug design and discovery tasks. This dataset integrates one-dimensional SMILES, two-dimensional molecular graphs, three-dimensional molecular structures, physicochemical properties, and textual descriptions collected through web crawling and generated using GPT-3.5, offering a comprehensive view of each molecule. To demonstrate the power of M-20M in drug design and discovery, we conduct extensive experiments on two key tasks, molecule generation and molecular property prediction, using large language models including GLM4, GPT-3.5, GPT-4, and Llama3-8b. Our experimental results show that M-20M can significantly boost model performance in both tasks. Specifically, it enables the models to generate more diverse and valid molecular structures and achieve higher property prediction accuracy than existing single-modal datasets, which validates the value and potential of M-20M in supporting AI-driven drug design and discovery. The dataset is available at https://github.com/bz99bz/M-3.
DDINet: Drug-drug interaction prediction network based on multi-molecular fingerprint features and multi-head attention centered weighted autoencoder
Soni Sharmila K, Revathi S T and Sree PK
Drug-drug interactions (DDIs) pose a major concern in polypharmacy due to their potential to cause unexpected side effects that can adversely affect a patient's health. Therefore, it is crucial to identify DDIs effectively during the early stages of drug discovery and development. In this paper, a novel DDI prediction network (DDINet) is proposed to enhance predictive performance over conventional DDI methods. Using the DrugBank dataset, drugs are represented in the Simplified Molecular Input Line-Entry System (SMILES), with the RDKit software pre-processing the SMILES strings into their canonical forms. Multiple molecular fingerprinting techniques, namely Extended Connectivity Fingerprints (ECFPs), Molecular ACCess System keys (MACCSkeys), PubChem Fingerprints, 3D molecular fingerprints (3D-FP), and molecular dynamics fingerprints (MDFPs), are employed to encode drug chemical structures into feature vectors. Drug similarities are computed using the Tanimoto coefficient (TC), and the final Structural Similarity Profile (SSP) is obtained by averaging over the five molecular fingerprint types. The novelty of the approach lies in the integration of a Multi-head Attention centered Weighted Autoencoder (Mul_WAE) as the interaction prediction module, which leverages the Multi-head Attention (MHA) layer to focus on the most significant input features. Furthermore, we introduce the Upgraded Bald Eagle Search Optimization (UBesO) algorithm, which optimally selects the learnable parameters of the Mul_WAE based on cross-entropy loss, improving the model's convergence and performance. The proposed DDINet model achieves an accuracy of 99.77%, an AUC of 99.66%, an average precision of 99.5%, a precision of 99.4%, and a recall of 99.49%, providing a comprehensive evaluation of the model's robustness. Beyond high accuracy, DDINet offers advantages in scalability, making it well suited for handling large datasets due to its efficient feature extraction and optimization processes. The unique combination of multiple molecular fingerprinting methods with the MHA layer and UBesO algorithm highlights the innovative aspects of our model and significantly improves prediction performance compared to existing approaches.
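The similarity step in this abstract (Tanimoto coefficient per fingerprint type, then averaging into a Structural Similarity Profile) can be sketched as follows. Fingerprints are represented as plain sets of on-bit indices so the example stays self-contained; the data layout, function names, and the unweighted mean are assumptions, and the fingerprints themselves would in practice come from a cheminformatics toolkit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def structural_similarity_profile(fingerprints):
    """Average the pairwise Tanimoto similarity over all fingerprint types.

    fingerprints maps a fingerprint-type name to {drug_name: set_of_on_bits};
    every type is assumed to cover the same drugs.
    Returns {(drug_1, drug_2): mean similarity}.
    """
    drugs = sorted(next(iter(fingerprints.values())))
    profile = {}
    for d1 in drugs:
        for d2 in drugs:
            scores = [tanimoto(fps[d1], fps[d2]) for fps in fingerprints.values()]
            profile[(d1, d2)] = sum(scores) / len(scores)
    return profile
```

Each drug's row of the profile (its similarity to every other drug) then serves as a fixed-length feature vector for a downstream predictor.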
Cross-cellular analysis of chromatin accessibility markers H3K4me3 and DNase in the context of detecting cell-identity genes: An "all-or-nothing" approach
Low BH, Kaliskar KK, Perna S and Lee B
Cell identity is often associated with a subset of highly expressed genes that define the cell's processes, as opposed to essential genes that are always active. Cell-specific genes may be defined in opposition to essential genes, or via experimental means. Detection of these cell-specific genes is often a primary goal in the study of novel biosamples. Chromatin accessibility markers (such as DNase and H3K4me3) help identify actively transcribed genes, but data can be difficult to come by for entirely novel biosamples. In this study, we investigate the possibility of associating the cell-specificity status of genes with chromatin accessibility markers from different cell lines, and we suggest that the number of cell lines in which a gene is found to be marked by DNase/H3K4me3 is predictive of the essentiality status itself. We define a measure called the Cross-cellular Chromatin Openness (CCO) level, and show that it is associated with the essentiality status using two differentiation experiments. We then compare the predictive power of the CCO level to existing scRNA-Seq and bulk RNA-Seq methods, showing it has good concordance when applicable.
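Read literally, the abstract's description of the CCO level is a count of the cell lines in which a gene carries the accessibility mark. A minimal sketch under that reading follows; the input layout, the function name `cco_levels`, and treating a gene as simply marked or unmarked per cell line are assumptions, since the abstract does not give the exact definition or any thresholds.

```python
from collections import Counter


def cco_levels(marked_genes_by_cell_line):
    """Count, for each gene, the number of cell lines in which it is marked.

    marked_genes_by_cell_line maps a cell-line name to the collection of genes
    marked by DNase/H3K4me3 in that line. Genes marked nowhere are absent
    from the result.
    """
    levels = Counter()
    for genes in marked_genes_by_cell_line.values():
        levels.update(set(genes))  # set() so a gene counts once per cell line
    return dict(levels)
```

Under the "all-or-nothing" idea in the title, genes marked in every cell line would be candidates for essential genes, while genes marked in few lines would be candidates for cell-identity genes.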