Journal of Biomedical Semantics

BabelFSH: a toolkit for an effective HL7 FHIR-based terminology provision
Wiedekopf J, Ohlsen T, Kock-Schoppenhauer AK and Ingenerf J
HL7 FHIR terminological services (TS) are a valuable tool towards better healthcare interoperability, but require representations of terminologies using FHIR resources to provide their services. As most terminologies are not natively distributed using FHIR resources, converters are needed. Large-scale FHIR projects, especially those with a national or even an international scope, define enormous numbers of value sets and reference many large and complex code systems, which must be regularly updated in TS and other systems. This necessitates a flexible, scalable and efficient provision of these artifacts. This work aims to develop a comprehensive, extensible and accessible toolkit for FHIR terminology conversion, making it possible for terminology authors, FHIR profilers and other actors to provide standardized TS for large-scale terminological artifacts.
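The abstract gives no implementation details; as a rough sketch of the kind of conversion involved (all names and URLs are illustrative, not BabelFSH's API), the following builds a minimal FHIR R4 CodeSystem resource from plain (code, display) rows:

```python
import json

def to_fhir_codesystem(url, version, name, rows):
    """Convert (code, display) rows into a minimal FHIR R4 CodeSystem resource."""
    return {
        "resourceType": "CodeSystem",
        "url": url,                # canonical URL of the code system
        "version": version,
        "name": name,
        "status": "active",
        "content": "complete",     # all concepts of the code system are included
        "count": len(rows),
        "concept": [{"code": c, "display": d} for c, d in rows],
    }

# Hypothetical source terminology rows
rows = [("A01", "Example concept 1"), ("A02", "Example concept 2")]
cs = to_fhir_codesystem("http://example.org/fhir/CodeSystem/demo", "1.0.0", "DemoCS", rows)
print(json.dumps(cs, indent=2))
```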
The CLEAR Principle: organizing data and metadata into semantically meaningful types of FAIR Digital Objects to increase their human explorability and cognitive interoperability
Vogt L
Ensuring the FAIRness (Findable, Accessible, Interoperable, Reusable) of data and metadata is an important goal in both research and industry. Knowledge graphs and ontologies have been central in achieving this goal, with interoperability of data and metadata receiving much attention. This paper argues that the emphasis on machine-actionability has overshadowed the essential need for human-actionability of data and metadata, and provides three examples that describe the lack of human-actionability within knowledge graphs.
A prototype ETL pipeline that uses HL7 FHIR RDF resources when deploying pure functions to enrich knowledge graph patient data
Ansari A, Conte M, Flynn A and Paturkar A
For clinical care and research, knowledge graphs with patient data can be enriched by extracting parameters from a knowledge graph and then using them as inputs to compute new patient features with pure functions. Systematic and transparent methods for enriching knowledge graphs with newly computed patient features are of interest. When enriching the patient data in knowledge graphs this way, existing ontologies and well-known data resource standards can help promote semantic interoperability.
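As a minimal sketch of the enrichment pattern described here (not the authors' pipeline; the namespace and property names are hypothetical), the following uses rdflib to extract parameters from a small patient graph, computes a new feature with a pure function, and writes the result back:

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")   # hypothetical namespace

def bmi(weight_kg: float, height_m: float) -> float:
    """Pure function: the output depends only on its inputs, with no side effects."""
    return round(weight_kg / (height_m ** 2), 1)

g = Graph()
g.add((EX.patient1, EX.weightKg, Literal(82.0, datatype=XSD.decimal)))
g.add((EX.patient1, EX.heightM, Literal(1.75, datatype=XSD.decimal)))

# Extract parameters from the graph, compute the new feature, write it back.
for patient, w, h in g.query(
    "SELECT ?p ?w ?h WHERE { ?p ex:weightKg ?w ; ex:heightM ?h }",
    initNs={"ex": EX},
):
    g.add((patient, EX.bodyMassIndex, Literal(bmi(float(w), float(h)), datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```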
Three-layered semantic framework for public health intelligence
Guru Rao S, Rokkam P, Zhang B, Sargsyan A, Kaladharan A, Sethumadhavan P, Jacobs M, Hofmann-Apitius M and Tom Kodamullil A
Disease surveillance systems play a crucial role in monitoring and preventing infectious diseases. However, the current landscape, primarily focused on fragmented health data, poses challenges to contextual understanding and decision-making. This paper addresses this issue by proposing a semantic framework using ontologies to provide a unified data representation for seamless integration. The paper demonstrates the effectiveness of this approach using a case study of a COVID-19 incident at a football game in Italy.
BASIL DB: bioactive semantic integration and linking database
Jackson D, Groth P and Harmouch H
Bioactive compounds found in foods and plants can provide health benefits, including antioxidant and anti-inflammatory effects. Research into their role in disease prevention and personalized nutrition is expanding, but challenges such as data complexity, inconsistent methods, and the rapid growth of scientific literature can hinder progress. To address these issues, we developed BASIL DB (BioActive Semantic Integration and Linking Database), a knowledge graph (KG) database that leverages natural language processing (NLP) techniques to streamline data organization and analysis. This automated approach offers greater scalability and comprehensiveness than traditional methods such as manual data curation and entry.
Mapping between clinical and preclinical terminologies: eTRANSAFE's Rosetta stone approach
van Mulligen EM, Parry R, van der Lei J and Kors JA
The eTRANSAFE project developed tools that support translational research. One of the challenges in this project was to combine preclinical and clinical data, which are coded with different terminologies at different granularities: clinical observations are expressed as single pre-coordinated concepts, whereas preclinical observations are expressed as combinations of concepts from different terminologies. This study develops and evaluates the Rosetta Stone approach, which maps combinations of preclinical concepts to pre-coordinated clinical concepts, allowing for different levels of exactness in the mappings.
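To make the mapping idea concrete, here is a toy sketch (not the eTRANSAFE implementation) in which combinations of preclinical findings are looked up against a hypothetical mapping table, returning exact or partial matches:

```python
# Hypothetical mapping table: a set of preclinical findings -> one clinical concept.
ROSETTA = {
    frozenset({"hepatocellular necrosis", "ALT increased"}): "drug-induced liver injury",
    frozenset({"ALT increased"}): "transaminases increased",
}

def map_preclinical(findings: set[str]) -> tuple[str, str] | None:
    """Return (clinical concept, match level), preferring exact over partial matches."""
    key = frozenset(findings)
    if key in ROSETTA:
        return ROSETTA[key], "exact"
    # Partial match: the largest mapped proper subset of the observed findings.
    partial = [k for k in ROSETTA if k < key]
    if partial:
        best = max(partial, key=len)
        return ROSETTA[best], "partial"
    return None

print(map_preclinical({"hepatocellular necrosis", "ALT increased"}))  # exact
print(map_preclinical({"ALT increased", "weight loss"}))              # partial
```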
medicX-KG: a knowledge graph for pharmacists' drug information needs
Farrugia L, Azzopardi LM, Debattista J and Abela C
The role of pharmacists is evolving from medicine dispensing to delivering comprehensive pharmaceutical services within multidisciplinary healthcare teams. Central to this shift is access to accurate, up-to-date medicinal product information supported by robust data integration. Leveraging artificial intelligence and semantic technologies, Knowledge Graphs (KGs) uncover hidden relationships and enable data-driven decision-making. This paper presents medicX-KG, a pharmacist-oriented knowledge graph supporting clinical and regulatory decisions. It forms the semantic layer of the broader medicX platform, powering predictive and explainable pharmacy services. medicX-KG integrates data from three sources: the British National Formulary (BNF), DrugBank, and the Malta Medicines Authority (MMA), the last of which addresses Malta's regulatory landscape, which combines European Medicines Agency alignment with partial dependence on the UK supply chain. The KG tackles the absence of a unified national drug repository, reducing pharmacists' reliance on fragmented sources. Its design was informed by interviews with practising pharmacists to ensure real-world applicability. We detail the KG's construction, including data extraction, ontology design, and semantic mapping. Evaluation demonstrates that medicX-KG effectively supports queries about drug availability, interactions, adverse reactions, and therapeutic classes. Limitations, including the absence of detailed dosage encoding and of real-time updates, are discussed alongside directions for future enhancements.
A fourfold pathogen reference ontology suite
Beverley J, Babcock S, Benson C, De Colle G, Cohen S, Diehl AD, Challa RANR, Mavrovich RA, Billig J, Huffman A and He Y
Infectious diseases remain a critical global health challenge, and the integration of standardized ontologies plays a vital role in managing related data. The Infectious Disease Ontology (IDO) and its extensions, such as the Coronavirus Infectious Disease Ontology (CIDO), are essential for organizing and disseminating information related to infectious diseases. The COVID-19 pandemic highlighted the need for updating IDO and its virus-specific extensions. There is an additional need to update IDO extensions specific to bacterial, fungal, and parasitic infectious diseases.
Semantic classification of Indonesian consumer health questions
Hanami RN, Mahendra R and Wicaksono AF
Online consumer health forums serve as a way for the public to connect with medical professionals. While these medical forums offer a valuable service, online Question Answering (QA) forums can struggle to deliver timely answers due to the limited number of available healthcare professionals. One way to solve this problem is by developing an automatic QA system that can provide patients with quicker answers. One key component of such a system could be a module for classifying the semantic type of a question. This would allow the system to understand the patient's intent and route them towards the relevant information.
Unveiling differential adverse event profiles in vaccines via LLM text embeddings and ontology semantic analysis
Wang Z, Li X, Zheng J and He Y
Vaccines are crucial for preventing infectious diseases; however, they may also be associated with adverse events (AEs). Conventional analysis of vaccine AEs relies on manual review and the assignment of AEs to terms in a terminology or ontology, a process that is time-consuming and constrained in scope. This study explores the potential of using Large Language Models (LLMs) and LLM text embeddings for efficient and comprehensive vaccine AE analysis.
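As an illustration of the embedding-based analysis described (not the authors' code), the sketch below embeds AE terms, computes cosine similarities, and flags near-duplicate terms; the embed function is a random stand-in for any LLM embedding model:

```python
import numpy as np

def embed(texts):
    """Placeholder for any LLM embedding endpoint; random vectors for illustration."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 8))

aes = ["injection site erythema", "injection site redness", "myocarditis"]
vecs = embed(aes)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # unit-normalize rows
sim = vecs @ vecs.T                                   # cosine similarity matrix

# Group AE terms whose embeddings exceed a similarity threshold (upper triangle only).
threshold = 0.8
for i, j in zip(*np.where(np.triu(sim, k=1) > threshold)):
    print(f"candidate match: {aes[i]!r} ~ {aes[j]!r} ({sim[i, j]:.2f})")
```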
Sentences, entities, and keyphrases extraction from consumer health forums using multi-task learning
Naufal T, Mahendra R and Wicaksono AF
Online consumer health forums offer an alternative source of health-related information for internet users seeking specific details that may not be readily available through articles or other one-way communication channels. However, the effectiveness of these forums can be constrained by the limited number of healthcare professionals actively participating, which can impact response times to user inquiries. One potential solution to this issue is the integration of a semi-automatic system. A critical component of such a system is question processing, which often involves sentence recognition (SR), medical entity recognition (MER), and keyphrase extraction (KE) modules. We posit that developing these three modules would enable the system to identify the critical components of a question, facilitating a deeper understanding of it and allowing more effective questions to be re-formulated from the extracted key information.
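A minimal sketch of such a multi-task setup (an assumed architecture, not the authors' model): one shared encoder feeding three task-specific heads for SR, MER, and KE, with the training loss typically summed across tasks:

```python
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    """Shared encoder with three heads: sentence recognition (SR),
    medical entity recognition (MER), and keyphrase extraction (KE)."""

    def __init__(self, vocab_size, dim=128, n_sr=2, n_mer=5, n_ke=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.sr_head = nn.Linear(2 * dim, n_sr)    # sentence-type labels
        self.mer_head = nn.Linear(2 * dim, n_mer)  # BIO entity labels
        self.ke_head = nn.Linear(2 * dim, n_ke)    # BIO keyphrase labels

    def forward(self, token_ids):
        h, _ = self.encoder(self.embed(token_ids))
        return self.sr_head(h), self.mer_head(h), self.ke_head(h)

model = MultiTaskTagger(vocab_size=10_000)
logits_sr, logits_mer, logits_ke = model(torch.randint(0, 10_000, (2, 16)))
print(logits_sr.shape, logits_mer.shape, logits_ke.shape)
```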
The SPHN Schema Forge: transform healthcare semantics from human-readable to machine-readable by leveraging semantic web technologies
Touré V, Unni D, Krauss P, Abdelwahed A, Buchhorn J, Hinderling L, Geiger TR and Österle S
The Swiss Personalized Health Network (SPHN) adopted the Resource Description Framework (RDF), a core component of the Semantic Web technology stack, for the formal encoding and exchange of healthcare data in a medical knowledge graph. The SPHN RDF Schema defines the semantics of how data elements should be represented. While RDF is machine-readable and machine-interpretable, it can be challenging for individuals without a specialized background to read and understand knowledge represented in RDF. For this reason, the semantics described in the SPHN RDF Schema are primarily defined in a user-accessible tabular format, the SPHN Dataset, before being translated into their RDF representation. However, this translation process was previously manual, time-consuming and labor-intensive.
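The abstract motivates automating this tabular-to-RDF translation; a toy sketch of that step using rdflib follows (with a stand-in namespace and row layout, not the actual SPHN Schema Forge):

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS
from rdflib.namespace import XSD

EX = Namespace("http://example.org/schema#")  # stand-in for the SPHN namespace

# One row of a dataset-style table: concept name, description, property, datatype.
rows = [
    ("BodyWeight", "Weight of a patient", "hasValue", XSD.decimal),
]

g = Graph()
g.bind("ex", EX)
for concept, description, prop, datatype in rows:
    cls = EX[concept]
    g.add((cls, RDF.type, RDFS.Class))
    g.add((cls, RDFS.comment, Literal(description)))
    p = EX[prop]
    g.add((p, RDF.type, RDF.Property))
    g.add((p, RDFS.domain, cls))
    g.add((p, RDFS.range, datatype))

print(g.serialize(format="turtle"))
```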
Gene expression knowledge graph for patient representation and diabetes prediction
Sousa RT and Paulheim H
Diabetes is a worldwide health issue affecting millions of people. Machine learning methods have shown promising results in improving diabetes prediction, particularly through the analysis of gene expression data. While gene expression data can provide valuable insights, challenges arise because the number of patients in expression datasets is usually limited and data from datasets that measure different genes cannot be easily combined. This work proposes a novel approach to address these challenges by integrating multiple gene expression datasets and domain-specific knowledge using knowledge graphs (KGs), a unique tool for biomedical data integration, and by learning uniform patient representations for subjects contained in different, incompatible datasets. Different strategies and KG embedding methods are explored to generate vector representations, serving as inputs for a classifier. Extensive experiments demonstrate the efficacy of our approach, revealing weighted F1-score improvements in diabetes prediction of up to 13% when integrating multiple gene expression datasets and domain-specific knowledge about protein functions and interactions.
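A simplified sketch of the idea (not the authors' method): given pre-trained KG embeddings for genes, each patient is represented as an expression-weighted average of gene vectors, so patients from datasets measuring different genes share one feature space; all data below are toy values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-trained KG embeddings (e.g., from a method such as RDF2Vec)
# for gene nodes shared across otherwise incompatible expression datasets.
gene_vec = {"GENE_A": np.array([0.2, 0.9]), "GENE_B": np.array([0.7, 0.1]),
            "GENE_C": np.array([0.4, 0.5])}

def patient_vector(expression: dict[str, float]) -> np.ndarray:
    """Expression-weighted average of gene embeddings: a uniform patient
    representation independent of which genes each dataset measured."""
    total = sum(expression.values())
    return sum(level * gene_vec[g] for g, level in expression.items()) / total

X = np.stack([patient_vector({"GENE_A": 2.0, "GENE_B": 0.5}),
              patient_vector({"GENE_B": 1.5, "GENE_C": 1.0}),
              patient_vector({"GENE_A": 0.3, "GENE_C": 2.2}),
              patient_vector({"GENE_A": 1.8, "GENE_C": 0.4})])
y = np.array([1, 0, 0, 1])  # toy diabetes labels

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))
```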
New and revised gene ontology biological process terms describe multiorganism interactions critical for understanding microbial pathogenesis and sequences of concern
Godbold G, Proescher J and Gaudet P
There is a new framework from the United States government for screening synthetic nucleic acids. Beginning in October 2026, it calls for the screening of sequences 50 nucleotides or greater in length that are known to contribute to pathogenicity or toxicity in humans, regardless of the taxa from which they originate. Distinguishing sequences that encode pathogenic and toxic functions from those that lack them is not simple.
Enriched knowledge representation in biological fields: a case study of literature-based discovery in Alzheimer's disease
Pu Y, Beck D and Verspoor K
In Literature-based Discovery (LBD), Swanson's original ABC model brought together isolated public knowledge statements and assembled them to infer putative hypotheses via logical connections. Modern LBD studies that scale up this approach through automation typically rely on a simple entity-based knowledge graph with co-occurrences and/or semantic triples as basic building blocks. However, our analysis of a knowledge graph constructed for a recent LBD system reveals limitations arising from such pairwise representations, which in turn negatively impact knowledge inference. Using LBD as the context and motivation for this work, we explore the limitations of using only pairwise relationships as the knowledge representation in knowledge graphs, and we identify the impacts of these limitations on knowledge inference. We argue that enhanced knowledge representation is beneficial for biological knowledge representation in general, as well as for both the quality and the specificity of hypotheses proposed with LBD.
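For readers unfamiliar with the ABC model, a toy version using Swanson's classic fish-oil/Raynaud's example shows the pairwise building blocks whose limitations the paper analyzes:

```python
# Toy ABC inference: A and C never co-occur, but both co-occur with B.
ab = {("fish oil", "blood viscosity"), ("fish oil", "platelet aggregation")}
bc = {("blood viscosity", "Raynaud's disease"), ("platelet aggregation", "Raynaud's disease")}

# Hypothesize A-C links via a shared B, excluding already-known pairs.
hypotheses = {
    (a, c)
    for a, b1 in ab
    for b2, c in bc
    if b1 == b2 and (a, c) not in ab | bc
}
print(hypotheses)  # {('fish oil', "Raynaud's disease")}
```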
Digital evolution: Novo Nordisk's shift to ontology-based data management
Tan SZK, Baksi S, Bjerregaard TG, Elangovan P, Gopalakrishnan TK, Hric D, Joumaa J, Li B, Rabbani K, Venkatesan SK, Valdez JD and Kuriakose SV
The amount of biomedical data is growing, and managing it is increasingly challenging. While the Findable, Accessible, Interoperable and Reusable (FAIR) data principles provide guidance, their adoption has proven difficult, especially in larger enterprises such as pharmaceutical companies. In this manuscript, we describe how we leverage an Ontology-Based Data Management (OBDM) strategy for digital transformation in Novo Nordisk Research & Early Development. We include both our technical blueprint and our approach to organizational change management. We further discuss how such an OBDM ecosystem plays a pivotal role in the organization's digital aspirations for data federation and discovery fuelled by artificial intelligence. Our aim in this paper is to share lessons learned in order to foster dialogue with parties navigating similar waters while collectively advancing efforts in the fields of data management, semantics and data-driven drug discovery.
Standardizing free-text data exemplified by two fields from the Immune Epitope Database
Duesing S, Bennett J, Overton JA, Vita R and Peters B
While unstructured data, such as free text, constitutes a large amount of publicly available biomedical data, it is underutilized in automated analyses due to the difficulty of extracting meaning from it. Normalizing free-text data, i.e., removing inessential variance, enables the use of structured vocabularies like ontologies to represent the data and allow for harmonized queries over it. This paper presents an adaptable tool for free-text normalization and an evaluation of the application of this tool to two different fields curated from the literature in the Immune Epitope Database (IEDB): "age" and "data-location" (the part of a paper in which data was found).
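As a rough sketch of what such normalization might look like for the "age" field (the patterns and output structure are illustrative, not the IEDB tool's actual rules):

```python
import re

# Hypothetical patterns for free-text "age" values as curated from the literature.
PATTERNS = [
    (re.compile(r"^(\d+)\s*-\s*(\d+)\s*(years?|yrs?)\s*(old)?$", re.I),
     lambda m: {"min": int(m[1]), "max": int(m[2]), "unit": "year"}),
    (re.compile(r"^(\d+)\s*(years?|yrs?)\s*(old)?$", re.I),
     lambda m: {"min": int(m[1]), "max": int(m[1]), "unit": "year"}),
    (re.compile(r"^(\d+)\s*(months?|mos?)\s*(old)?$", re.I),
     lambda m: {"min": int(m[1]), "max": int(m[1]), "unit": "month"}),
]

def normalize_age(text: str):
    """Map a free-text age to a structured range, or None if unrecognized."""
    text = text.strip()
    for pattern, build in PATTERNS:
        m = pattern.match(text)
        if m:
            return build(m)
    return None  # unrecognized values are left for manual curation

for raw in ["18-65 yrs", "3 months", "6-8 weeks old"]:
    print(raw, "->", normalize_age(raw))
```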
Semantics in action: a guide for representing clinical data elements with SNOMED CT
Ehrsam J, Gaudet-Blavignac C, Mattei M, Baumann M and Lovis C
Clinical data is abundant, but meaningful reuse remains lacking. Semantic representation using SNOMED CT can improve research, public health, and quality of care. However, the lack of applied guidelines to industrialise the process hinders sustainability and reproducibility. This work describes a guide for the semantic representation of data elements with SNOMED CT, addressing challenges encountered during its application. The representation of the institutional data warehouse started with the guidelines proposed by SNOMED International and other groups. However, applying manual, expert-driven representation at large scale led to the development of additional rules.
Expanding the concept of ID conversion in TogoID by introducing multi-semantic and label features
Ikeda S, Aoki-Kinoshita KF, Chiba H, Goto S, Hosoda M, Kawashima S, Kim JD, Moriya Y, Ohta T, Ono H, Takatsuki T, Yamamoto Y and Katayama T
TogoID ( https://togoid.dbcls.jp/ ) is an identifier (ID) conversion service designed to link IDs across diverse categories of life science databases. With its ability to retrieve IDs related through different semantic relationships, a user-friendly web interface, and a regular automatic data update system, TogoID has been a valuable tool for bioinformatics.
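Conceptually, ID conversion amounts to following typed links between database identifiers; the toy sketch below illustrates the idea with a hand-made edge list (it does not use TogoID's actual API):

```python
# Toy edge list of semantically typed links between database IDs; TogoID serves
# such relationships at scale (relation names here are illustrative).
LINKS = [
    ("hgnc:1097", "encodes", "uniprot:P15056"),
    ("uniprot:P15056", "interacts_with", "uniprot:P04049"),
]

def convert(source_id: str, relation: str) -> list[str]:
    """Follow one semantic relationship from a source ID to its target IDs."""
    return [t for s, r, t in LINKS if s == source_id and r == relation]

print(convert("hgnc:1097", "encodes"))  # ['uniprot:P15056']
```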
FAIR Data Cube, a FAIR data infrastructure for integrated multi-omics data analysis
Liao X, Ederveen THA, Niehues A, de Visser C, Huang J, Badmus F, Doornbos C, Orlova Y, Kulkarni P, van der Velde KJ, Swertz MA, Brandt M, van Gool AJ and 't Hoen PAC
We are witnessing an enormous growth in the amount of molecular profiling (-omics) data. The integration of multi-omics data is challenging. Moreover, human multi-omics data may be privacy-sensitive and can be misused to de-anonymize and (re-)identify individuals. Hence, most biomedical data is kept in secure and protected silos. Therefore, it remains a challenge to re-use these data without infringing the privacy of the individuals from which the data were derived. Federated analysis of Findable, Accessible, Interoperable, and Reusable (FAIR) data is a privacy-preserving solution to make optimal use of these multi-omics data and transform them into actionable knowledge.
Dynamic Retrieval Augmented Generation of Ontologies using Artificial Intelligence (DRAGON-AI)
Toro S, Anagnostopoulos AV, Bello SM, Blumberg K, Cameron R, Carmody L, Diehl AD, Dooley DM, Duncan WD, Fey P, Gaudet P, Harris NL, Joachimiak MP, Kiani L, Lubiana T, Munoz-Torres MC, O'Neil S, Osumi-Sutherland D, Puig-Barbe A, Reese JT, Reiser L, Robb SM, Ruemping T, Seager J, Sid E, Stefancsik R, Weber M, Wood V, Haendel MA and Mungall CJ
Ontologies are fundamental components of the informatics infrastructure in domains such as the biomedical, environmental, and food sciences, representing consensus knowledge in an accurate and computable form. However, their construction and maintenance demand substantial resources and necessitate close collaboration between domain experts, curators, and ontology experts. We present Dynamic Retrieval Augmented Generation of Ontologies using AI (DRAGON-AI), an ontology generation method employing Large Language Models (LLMs) and Retrieval Augmented Generation (RAG). DRAGON-AI can generate textual and logical ontology components, drawing from existing knowledge in multiple ontologies and from unstructured text sources.
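A compressed sketch of the RAG loop described (the embedding and LLM calls are stubs, not DRAGON-AI's implementation): retrieve the most similar existing terms, then prompt an LLM with them as context:

```python
import numpy as np

def embed(text):
    """Placeholder for any embedding model; deterministic random vectors."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=16)

def llm(prompt):
    """Placeholder for an LLM completion call."""
    return "A lymphocyte that ... [generated definition]"

ontology = {  # existing terms and definitions used as the retrieval corpus
    "T cell": "A lymphocyte that matures in the thymus ...",
    "B cell": "A lymphocyte that produces antibodies ...",
}

def generate_definition(term, k=2):
    """RAG: retrieve the k most similar existing terms, then prompt the LLM."""
    q = embed(term)
    ranked = sorted(ontology, key=lambda t: -float(np.dot(embed(t), q)))
    context = "\n".join(f"{t}: {ontology[t]}" for t in ranked[:k])
    prompt = (f"Using these existing ontology definitions as style examples:\n"
              f"{context}\n\nWrite a definition for: {term}")
    return llm(prompt)

print(generate_definition("natural killer cell"))
```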