JOURNAL OF MEDICAL SYSTEMS

The Power of Terminology in Wound Care: a Critical Look at "Hard-to-Heal"
Marques R and Alves PJP
From Predictive Accuracy to Public Health Impact: Navigating the Challenges of Implementing a Hypertension Risk Model in Indonesia
Sheng T, Liang Z and Luo G
From Research to Practice in Days, not Decades: Why Leaders Must Act now
Peltonen LM, Topaz M and Zhang Z
Radiological Image and Text-Based Medical Concept Detection in Social Networks Using Hybrid Deep Learning
Bayrakdar S and Yucedag I
Nowadays, the presence of health-related content on social networks is rapidly increasing. With the effect of these networks, a large number of medical images, diagnosed and interpreted by various experts, are shared online. Therefore, concept detection and image classification from medical images remain challenging tasks. In recent years, deep learning-based models have become increasingly popular for addressing these challenges. The primary objective of this study is to perform multi-label classification of radiological images shared on a social network by automatically assigning relevant medical concepts. These concepts are derived from the Unified Medical Language System (UMLS). In this study, a Convolutional Neural Network (CNN) was combined with feed-forward neural networks and various image encoders, including VGG-19, DenseNet-121, ResNet-101, Xception, and EfficientNet-B7, to predict the appropriate concepts. The proposed hybrid deep learning models were trained and evaluated using the ImageCLEF 2019 dataset. Further evaluation was performed using a custom dataset (Rdpd_Test_Ds) composed of radiological images and their associated comments collected from a social network. The performance of the models was assessed using precision, recall, and F1-score metrics. The evaluation results are promising, demonstrating high performance. To the best of our knowledge, this research is the first to apply deep learning-based models to radiological data collected from a social network, representing a novel and impactful contribution to the field.
Requirements for Manual Knowledge Acquisition Tools: Systematic Literature Review and Expert Panel Consensus
van Brummelen N, Leopold JH and Medlock S
Knowledge acquisition tools facilitate the creation and maintenance of decision support content, but thus far there has been little formal investigation of the requirements and desiderata for such tools. This leads to researchers re-inventing the wheel when building knowledge acquisition tools. This study aims to draw up a requirements list for manual knowledge acquisition tools. A systematic review of the literature combined with the opinion of experts on the research team was used to write the initial requirements list. This requirements list was presented to a group of experts and revised according to their feedback. Embase, MEDLINE, IEEE, and ACM were searched, leading to a total of 1628 records. After screening, 199 records were sought for retrieval and 3 studies were included in the review. The initial requirements list consisted of 24 requirements and 15 sub-requirements, covering multiple aspects of knowledge acquisition tools. The expert consensus sessions resulted in 13 new requirements and 4 new sub-requirements, but no changes to the major categories. The final requirements list contained 37 requirements and 21 sub-requirements, divided over 4 categories. This study has shown that much is still unknown about requirements for KA tools, making it complex for developers to design effective and efficient tools. While our results are preliminary, they might prove to be a valuable starting point for developers and researchers.
Cost-Effectiveness of a Mobile Health Program for Pre-elderly Adults
Bae E, Moon A, Baek S, Kim JH and Jang S
In South Korea, over half of adults are insufficiently active. Mobile health (mHealth) interventions can increase physical activity. This study evaluated the cost-effectiveness of a smartwatch and smartphone application to promote moderate-intensity physical activity among South Korean adults aged 45–64 years. A Markov cohort model with seven health states was developed to compare intervention and non-intervention scenarios. Transition probabilities, utility values, and costs in the model were calculated using data from the National Health Insurance Service and the Korea National Health and Nutrition Examination Survey. The intervention effect was measured through changes in moderate-intensity physical activity (≥ 150 min/week) over a 12-week prospective study (n = 304). The analyses used 10-year and lifetime horizons with comprehensive sensitivity analyses. The intervention yielded an incremental gain of 0.0077 quality-adjusted life years (QALYs) at an additional cost of 6.3 USD/person, resulting in an incremental cost-effectiveness ratio (ICER) of 829 USD/QALY gained over 10 years. Benefits were greater in the 55–64 age group (455 USD/QALY) than in the 45–54 age group (1,595 USD/QALY). Probabilistic sensitivity analysis confirmed robust cost-effectiveness at a willingness-to-pay threshold of USD 20,000/QALY. The program using smartwatches and smartphones was cost-effective for middle-aged South Korean adults, particularly those aged 55–64 years. These findings support integrating mHealth solutions into national physical activity promotion strategies as an economically feasible approach to addressing physical inactivity and preventing chronic diseases in aging populations.
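The ICER quoted in the abstract above is a simple ratio of incremental cost to incremental QALYs. The sketch below reproduces that arithmetic from the rounded figures in the abstract; the function and variable names are illustrative assumptions, not taken from the study. Because the inputs are rounded, the result (about 818 USD/QALY) differs slightly from the published 829 USD/QALY, which was presumably computed from unrounded model outputs.

```python
def icer(delta_cost: float, delta_qaly: float) -> float:
    """Incremental cost-effectiveness ratio: extra cost per extra QALY gained."""
    if delta_qaly <= 0:
        raise ValueError("this sketch only handles a positive QALY gain")
    return delta_cost / delta_qaly


# Rounded 10-year figures quoted in the abstract (per person).
delta_cost_usd = 6.3
delta_qaly = 0.0077
threshold = 20_000  # willingness-to-pay threshold, USD per QALY

ratio = icer(delta_cost_usd, delta_qaly)
print(f"ICER ~ {ratio:.0f} USD/QALY; below threshold: {ratio < threshold}")
```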
Predictive Performance of Raman Spectroscopy in Osteoarthritis: A Systematic Review
Yesmean M, Shakya BR, Mannerkorpi M, Saarakkala S and Jansson M
Early diagnosis of osteoarthritis (OA) remains a critical unmet need due to the lack of reliable detection methods. Detecting OA at an early stage provides a valuable clinical window for implementing effective intervention strategies. Raman spectroscopy (RS) holds promise for improving predictive accuracy in detecting osteoarthritic changes at the molecular level, monitoring disease progression, and assessing severity. This study aimed to systematically evaluate the predictive performance of RS in OA assessment in human samples, thereby highlighting current advancements in the field. The search included PubMed/Medline, Scopus, Web of Science, and IEEE for studies published up to July 31, 2024. Two authors individually screened the studies using Covidence software, and data extraction was based on predefined criteria. The Prediction Model Risk of Bias Assessment Tool was employed to evaluate the bias and applicability of the included studies. Ten studies met the inclusion criteria. Near-infrared excited RS was the most used RS technique. All included studies reported predictive accuracy ranging from 73% to 100% in preclinical settings for OA assessment. Although all studies performed internal validation, most had a high risk of bias and none reported external validation, which limits the generalizability of their findings. These findings underscore both the potential and current limitations of RS in OA assessment. Future research should prioritize larger sample sizes, external validation, and standardized RS protocols to improve reproducibility across diverse clinical settings.
Logic-based Approach and Visualization for the Nuclear Medicine Rescheduling Problem
Marte C, Mochi M, Dodaro C, Galatà G and Maratea M
The Nuclear Medicine Scheduling problem consists of assigning patients to a day, on which the patient will undergo the medical check, the preparation, and the actual image detection process. The schedule of the patients should consider their different requirements and the available resources, e.g., varying time required for different diseases and radiopharmaceuticals used, number of injection chairs, and tomographs available. Recently, this problem has been solved with a logic-based approach using the Answer Set Programming (ASP) methodology. However, it may be the case that a computed schedule cannot be implemented due to a sudden emergency and/or unavailability of resources, so rescheduling is needed. In this paper, we present an ASP-based approach to solve such a situation, which we call the Nuclear Medicine Rescheduling problem. Experiments on three scenarios in which rescheduling may be needed, employing real data from a medium-sized hospital in Italy, show that our rescheduling solution provides satisfying results even when the concurrent number of emergencies and unavailability is significant. We finally present the design and implementation of a web application for the easy usage of our solutions.
RAG-Enhanced Open SLMs for Hypertension Management Chatbots
Aguzzi G, Magnini M, Farahmand A, Ferretti S, Pengo MF and Montagna S
Chronic disease management requires continuous monitoring, lifestyle modification and therapy adherence, thus requiring constant support from healthcare professionals. Chatbots have proven to be a promising approach for engaging patients in managing their health condition at home and for offering continuous assistance by being readily available to answer questions. While large language models offer an impressive solution for chatbot implementation, third-party systems raise privacy concerns, and computational requirements limit small-scale deployment. We address these challenges by developing a chatbot for hypertensive patients based on open-source small language models (SLMs), specifically designed for running on personal resource-constrained devices and for providing assistance in QA tasks. In order to guarantee conversational performance comparable to that of larger language models, we exploited retrieval-augmented generation (RAG) with a local knowledge base. This ensures data privacy by deploying models locally while achieving competitive accuracy and maintaining low computational costs suitable for end-user devices. We experimented with eight SLMs, two prompt configurations, and different RAG strategies (in both the embedding and retrieval components) to identify the most effective solution. The evaluation of our solution is grounded in both reference metrics and expert evaluation. Our findings suggest that RAG-enhanced SLMs can improve response clarity and content accuracy. However, our results also indicate that newer SLMs like Qwen3 demonstrate strong performance even without RAG, suggesting a potential shift in the necessity for complex retrieval mechanisms with rapidly evolving model architectures.
Diffusion Models for Neuroimaging Data Augmentation: Assessing Realism and Clinical Relevance
Mallardi G, Calefato F, Lanubile F, Logroscino G and Tafuri B
Data scarcity remains a major obstacle to the application of deep learning techniques in medical imaging, particularly for rare neurodegenerative diseases. This study investigates the use of denoising diffusion probabilistic models (DDPMs) to generate synthetic 3D T1-weighted brain MRI images in this context. Addressing the dual challenges of limited training data and structural fidelity, we propose a generative pipeline trained on a multicenter dataset of healthy subjects. The results suggest that the model can produce anatomically coherent synthetic scans with realistic variability. Quantitative evaluation based on Maximum Mean Discrepancy confirms the similarity between real and generated data distributions, while visual assessments highlight the preservation of global and local brain structures. Despite limitations in high-frequency detail reconstruction, the results suggest that DDPMs hold promise as a tool for augmenting neuroimaging datasets and supporting downstream tasks such as classification and segmentation. This work lays the foundation for future research aimed at improving resolution and adapting generative models to the specific challenges of rare disease imaging.
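For readers unfamiliar with Maximum Mean Discrepancy (MMD), the sketch below shows one common way to estimate it: the biased estimator of squared MMD with a Gaussian (RBF) kernel. The abstract does not state which kernel, bandwidth, or feature space the authors used, so the kernel choice, the `sigma` parameter, and the toy two-dimensional feature vectors are all assumptions made purely for illustration.

```python
import math


def rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    kxx = sum(rbf(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy


# Hypothetical feature vectors standing in for real and generated scans.
real = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)]
close = [(0.1, 0.1), (0.0, 0.2), (0.2, 0.1)]
far = [(3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]

print("MMD^2 real vs close:", mmd2(real, close))
print("MMD^2 real vs far:  ", mmd2(real, far))
```

A value near zero indicates that the two samples are hard to tell apart under the chosen kernel; larger values indicate a larger gap between the distributions.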
Advancing the K-Operator Framework: Reflections on Methodological Limitations and Future
Li S, Yu Z and Cheng W
Clinical Accuracy and Safety Concerns Following GPT-5 Public Demonstration in Cancer Care
Capobianco I, Della Penna A, Mihaljevic AL, Bitzer M, Eickhoff C and Stifini D
OpenAI's GPT-5 demonstration showed a patient uploading pathology reports to guide treatment decisions, though privacy implications were not addressed. We evaluated GPT-5 against 100 gastrointestinal oncology cases with tumor-board validation and found identical 85% concordance to GPT-4o, contradicting superiority claims. We recommend mandatory accuracy disclosures and regulatory oversight for AI health demonstrations to protect patient safety and privacy.
Towards A Fair Duel: Reflections on the Evaluation of DeepSeek-R1 and ChatGPT-4o in Chinese Medical Education
Li S
The recent study by Wu et al. (2025) comparing DeepSeek-R1 and ChatGPT-4o on the Chinese National Medical Licensing Examination (CNMLE) provides an important contribution to understanding large language model (LLM) performance in non-English medical contexts. While their findings highlight the potential of LLMs in medical knowledge assessment, several methodological issues merit further discussion. First, the exclusive use of Chinese-language items without bilingual comparison may favor DeepSeek-R1, which demonstrates strong performance in Chinese, over ChatGPT-4o, whose training corpus is predominantly English-based. Second, the evaluation was conducted before the release of GPT-5, leading to potential disparities in reasoning capabilities between models. Third, the restriction to multiple-choice questions limits the assessment to factual recall rather than higher-order reasoning or clinical judgment. We commend the authors for initiating this valuable cross-linguistic analysis and suggest that future studies incorporate bilingual testing, ensure model functional parity, and include open-ended clinical items to more comprehensively evaluate LLMs' reasoning and interpretive competence in real-world medical education contexts.
Lightweight Hybrid Deep Learning Models for Accurate Classification of Respiratory Conditions from Raw Lung Sounds
Lweesy K, Abuqran S and Fraiwan L
In recent years, progress in artificial intelligence, particularly in the realm of deep learning, has resulted in substantial enhancements in the diagnosis of various medical conditions. This study introduces a framework that leverages multiple lightweight deep learning models to assess their effectiveness in analyzing raw lung auscultation sounds - no feature engineering or preprocessing - to detect eleven different respiratory pathologies. The objective was to enhance the accuracy of respiratory disease diagnoses and conduct a comparative analysis of these models to pinpoint the most efficient model. The models were assessed based on their performance across two distinct datasets, one in its original form and the other after augmentation. The outcomes underscore the successful utilization of the deep learning framework, because it achieves remarkable accuracy in the detection of respiratory pathologies through the analysis of raw lung sounds alone. Furthermore, all the deep learning models proposed in the framework exhibited accuracy rates exceeding 99%, with the hybrid convolutional neural network (CNN)-long short-term memory (LSTM) model, which combines CNN for feature extraction and LSTM for temporal modeling, emerging as the top performer across all datasets. The augmentation process was also proven to be effective, leading to performance enhancements in deep-learning models. Finally, the lightweight hybrid CNN-LSTM model, which is less complex with only 15 layers, outperformed the standalone CNN and LSTM architectures, achieving up to 100% accuracy on the augmented dataset. These results suggest that raw auscultation sounds can be used to reliably detect multiple respiratory pathologies using lightweight and deployable deep learning models. The reported performance metrics reflect in-dataset evaluation only, and external validation on data from additional clinical datasets will be required to assess generalization.
From Prediction To Action in a Two-Year Diabetes Mortality Model: A Commentary
Liu S
A Dual-stage Deep Learning Framework for Breast Ultrasound Image Segmentation and Classification
Bruno P, Macrì M and Dodaro C
Deep Learning methods have become a powerful tool in medical imaging, with great potential to improve diagnostic accuracy and support early disease detection. This is especially critical for breast cancer, one of the most common cancers among women, where early detection of abnormal tissue is crucial to improving survival rates. In this paper, we explore the application of Deep Learning techniques to segment and classify breast masses as malignant or benign using ultrasound images, aiming to support breast cancer diagnosis. We propose a modular dual-stage pipeline that first segments suspicious regions and then classifies them into benign or malignant categories. The framework is designed to flexibly integrate different backbone architectures, allowing adaptation to task- or dataset-specific requirements. Experimental results show that, within this pipeline, DeepLabV3+ with a ResNet34 encoder provided the most accurate segmentation, while lightweight classifiers such as MobileNetV3-Small and EfficientNet-B0 yielded the best classification performance. Moreover, an ablation study was conducted to tune parameters and determine their optimal configuration. Finally, our approach was tested on two breast ultrasound datasets, and the results show promising improvements in diagnostic accuracy, demonstrating the potential of our method to enhance early breast cancer detection.
Automated Bone Age Assessment and Adult Height Prediction from Pediatric Hand Radiographs via a Cascaded Deep Learning Framework
Pei N, Zhuang Y, Su Z, Wang F, Liu Y, Li X, Su H and Zeng H
Bone age assessment and adult height prediction are essential for evaluating pediatric growth. Traditional methods rely on manual radiographic interpretation, which is subjective, time-consuming, and prone to inter-observer variability. This study presents an automated approach using a cascaded deep learning model to assess bone age and predict adult height from pediatric hand radiographs, aiming to improve diagnostic objectivity and efficiency. A total of 8,242 left-hand radiographs from Chinese children were retrospectively collected. Bone age was annotated by experienced pediatric endocrinologists using the China-05 standard. The model employed Yolact for instance segmentation to detect and classify bone structures, followed by parallel ResNet-18 subnetworks to grade ossification centers in the radius, ulna, and metacarpal/phalangeal bones. Predicted grades were integrated using a standardized scoring system to estimate bone age. A regression model then predicted adult height based on these features. The model achieved a Pearson correlation of 0.98 ([Formula: see text]) for bone age and 0.94 ([Formula: see text]) for adult height predictions. Bland-Altman analysis showed minimal bias and narrow limits of agreement. Mean absolute errors were 0.25 years for bone age and 1.75 cm for adult height. Average inference time was 7.8 seconds, significantly enhancing clinical efficiency. The proposed cascaded deep learning model delivers accurate, efficient, and reliable bone age assessment and adult height prediction, offering strong potential for clinical integration in pediatric growth evaluation.
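The agreement statistics named in the abstract above (Bland-Altman bias, limits of agreement, and mean absolute error) can be computed as in the following sketch. It uses the conventional bias ± 1.96 SD limits; the abstract does not specify the exact variant used, and the bone-age values below are invented for illustration only, not study data.

```python
from statistics import mean, stdev


def bland_altman(predicted, reference):
    """Return (bias, lower limit, upper limit) using bias +/- 1.96 * SD of differences."""
    diffs = [p - r for p, r in zip(predicted, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def mean_absolute_error(predicted, reference):
    """Average absolute difference between predictions and reference values."""
    return mean(abs(p - r) for p, r in zip(predicted, reference))


# Hypothetical bone ages in years (model output vs. expert reading).
model_age = [7.2, 9.9, 12.4, 5.1, 14.0]
expert_age = [7.0, 10.0, 12.0, 5.5, 14.2]

bias, lower, upper = bland_altman(model_age, expert_age)
mae = mean_absolute_error(model_age, expert_age)
print(f"bias={bias:.2f}, limits=({lower:.2f}, {upper:.2f}), MAE={mae:.2f}")
```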
Efficient Vision Transformers for Ophthalmic Images Classification: A Comparative Study of Supervised, Semi-Supervised, and Unsupervised Learning Approaches
Al-Wassiti AS, Mutar MT, Al Sakini AS, Rasheed LS, Yosif W, Abbas MA, Raouf NR and Al-Shammari AS
This study explored the integration of supervised, semi-supervised, and unsupervised learning strategies to classify ophthalmic images under label-scarce conditions. Given the high cost of annotations in medical imaging, the goal was to improve diagnostic performance using minimal labeled data and robust feature representations. A dataset of 18,767 multimodal ophthalmic images was collected (1,877 labeled and 16,890 unlabeled). Three transformer-based architectures (ViT-Base, DeiT-Base, and MaxViT-L) were used for supervised learning. Semi-supervised learning employed pseudo-labeling with a confidence threshold ≥ 0.98. For unsupervised learning, SimCLR-based contrastive learning and K-means clustering were implemented on extracted features. Performance was evaluated using classification accuracy, AUC, F1-score, clustering indices (Silhouette Score, DBI, CH Index), and computational metrics. In supervised learning, ViT-Base achieved the highest accuracy (92.47%), followed by DeiT-Base (89.38%) and MaxViT-L (85.27%). After pseudo-labeling, MaxViT-L achieved the best accuracy (97.49%) and AUC (0.9982). Contrastive learning significantly improved feature clustering, with MaxViT-L reaching a Silhouette Score of 0.556 and a reduced DBI of 0.541. However, computational analysis revealed that MaxViT-L exhibited the highest computational complexity (81,713 MFLOPs) and longest inference (~ 102 ms), while ViT-Base and DeiT-Base showed considerably lower FLOPs (39,120.6 MFLOPs) and faster inference (~ 52 ms). On the external validation set, MaxViT demonstrated the best overall performance. Although ViT-Base achieved the highest accuracy in supervised training, MaxViT-L demonstrated the most favorable trade-off between performance and model generalization in semi- and unsupervised settings. Despite its higher computational complexity and longer inference time, MaxViT-L consistently achieved strong accuracy and clustering performance. This approach minimizes dependence on expert annotations, supporting scalable and automated ophthalmic diagnosis.
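The pseudo-labeling step described in the abstract above (accepting an unlabeled image only when the model's confidence is at least 0.98) can be sketched as follows. This is a minimal illustration under assumptions: the function name, the data layout, and the example probabilities are hypothetical, and the abstract does not describe how the authors implemented the selection or the retraining loop.

```python
def pseudo_label(probabilities, threshold=0.98):
    """Select unlabeled samples whose top class probability meets the threshold.

    probabilities: one list of per-class probabilities per unlabeled image.
    Returns a list of (sample_index, predicted_class) pairs.
    """
    selected = []
    for idx, probs in enumerate(probabilities):
        best_class = max(range(len(probs)), key=probs.__getitem__)
        if probs[best_class] >= threshold:
            selected.append((idx, best_class))
    return selected


# Hypothetical softmax outputs for four unlabeled images over three classes.
outputs = [
    [0.99, 0.005, 0.005],
    [0.60, 0.30, 0.10],
    [0.01, 0.985, 0.005],
    [0.02, 0.03, 0.95],
]

accepted = pseudo_label(outputs)
print(accepted)
```

In a typical semi-supervised loop, the accepted pairs would be added to the labeled pool and the classifier retrained; a high threshold trades coverage for label quality.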
Computational Framework for Structuring and Analyzing Clinical Trial Criteria for AI-Guided Fine-grained Matching
Habib DRS, Mahajan I, Evancha B, Micheel C and Fabbri D
While artificial intelligence (AI) has demonstrated potential in automating clinical trial matching, most existing solutions rely on high-level structured data or oversimplified criteria. This study introduces a framework to structure and analyze eligibility criteria across three real-world trial protocols, aiming to inform more granular AI-driven trial matching strategies. Trial criteria from three protocols were decomposed into individual variables and evaluated based on data type, scope, and dependency. Complexity was assessed using a novel formula incorporating the number of independent and dependent variables, alongside the Flesch-Kincaid reading grade level. Quantitative analysis explored variation across trials. Protocols contained between 22-160 eligibility variables, with 4-22% showing interdependence. Reading grade levels ranged from sixth grade to first-year college. Complexity scores varied significantly, with some trials exhibiting particularly high cognitive and logical burdens. Recursive and hierarchical structures were prevalent in high-complexity protocols. This study reveals the substantial variability and structural complexity of clinical trial criteria, highlighting challenges for AI matching systems. A standardized approach to measuring trial complexity can enhance algorithm transparency, scalability, and interpretability. These findings underscore the need for structured, computable frameworks to improve equity and efficiency in clinical trial recruitment.
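The Flesch-Kincaid reading grade level mentioned in the abstract above follows a standard formula: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below implements it with a deliberately naive tokenizer and syllable counter (vowel groups), which are assumptions for illustration; published tools use more careful syllable rules. The study's own complexity formula is not given in the abstract and is not reproduced here.

```python
import re


def fk_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts."""
    if words == 0 or sentences == 0:
        raise ValueError("need at least one word and one sentence")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word: str) -> int:
    """Naive heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade_from_text(text: str) -> float:
    """Apply the formula to raw text using simple regex tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return fk_grade(len(words), len(sentences), syllables)


print(round(fk_grade(words=100, sentences=5, syllables=150), 2))
print(round(fk_grade_from_text("The cat sat. The dog ran."), 2))
```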
Evaluating the Performance of DeepSeek-R1 as a Patient Education Tool
Hu J, Wang J, He L, Qiu Z, Sun S and Peng F
DeepSeek-R1, a cost-effective open-source artificial intelligence (AI) model developed in China, holds significant potential for healthcare applications. As a health education tool, it could help patients acquire health science knowledge and improve health literacy. Patients with low back pain (LBP), the most common musculoskeletal problem globally, increasingly use large language model (LLM)-based AI chatbots to access health information, making it critical to further examine the quality of such information. This study aimed to evaluate the response quality and readability of answers generated by DeepSeek-R1 to common patient questions about LBP. Ten questions were formulated using inductive methods based on literature analysis and Baidu Index data and were presented to DeepSeek-R1 on March 10, 2025. The evaluation spanned readability, understandability, actionability, clinician assessment, and reference assessment. Readability was measured using the Flesch-Kincaid Grade Level, Flesch Reading Ease Scale, Gunning Fog Index, Coleman-Liau Index, and Simple Measure of Gobbledygook (SMOG Index). Understandability and actionability were assessed via the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P). Clinicians evaluated accuracy, completeness, and relevance. A reference evaluation tool was used to assess reference quality and the presence of hallucinations. Readability analysis indicated that DeepSeek's responses were overall "difficult to read", with Flesch-Kincaid Grade Level (mean 12.39, SD 1.91), Flesch Reading Ease Scale (mean 19.55, Q1 12.94, Q3 29.78), Gunning Fog Index (mean 13.95, SD 2.61), Coleman-Liau Index (mean 17.46, SD 2.30), and SMOG Index (mean 11.04, SD 1.37). PEMAT-P revealed good understandability but weak actionability. Consensus among five clinicians confirmed satisfactory accuracy, completeness, and relevance. The reference assessment identified 9 instances (14.8%) of hallucinated references, while the "Supporting" dimension was rated as moderate, with most references sourced from authoritative platforms. Our study demonstrates the potential of DeepSeek-R1 for generating educational content for patients with LBP. It can be employed as a supplement to patient education tools rather than as a substitute for clinical judgment.
Infectious, Allergic, and Immune-Mediated Disease Data Resources: a Landscape Overview and Subset Assessment
Pokutnaya D, Mayer LM, Foote S, Hartwick M, Mazrouee S, Van Panhuis WG and Shabman R
The Data Management and Sharing (DMS) Policy issued by the National Institutes of Health (NIH) requires most grant applications to include a DMS Plan, detailing data type(s), resources (e.g., data repositories, knowledgebases, portals) for data sharing, and a dissemination timeline. Researchers face challenges navigating the complex data landscape to identify data resources to fulfill the DMS Policy requirements. The National Institute of Allergy and Infectious Diseases (NIAID) aims to support researchers in preparing DMS Plans for applications that align with its mission areas. To support depositing and accessing infectious, allergic, and immune-mediated disease (IID) data, we compiled a list of IID data resources. The list was developed by reviewing online resources and collecting recommendations from subject matter experts. Additionally, we developed a questionnaire based on NIH recommendations and community best practices to characterize a subset of IID data resources that support data submissions. We identified 303 data resources, 58 of which focused on IID data. Most were categorized as General Infectious Diseases and Pathogens (n = 29, 50%), followed by Respiratory Pathogens (n = 10, 17%). Scientific content included "omics" (n = 37, 64%), clinical (n = 21, 36%), and biological assay data (n = 20, 34%). Open access data was common (n = 39, 67%), with fewer offering controlled access (n = 20, 34%) or requiring registration (n = 4, 7%). Among 19 resources accepting data submissions, eight (42%) required registration, seven (37%) needed additional approvals, and four (21%) required network membership. Fifteen (79%) resources provided metadata access, with 11 (58%) assigning persistent identifiers. Twelve (63%) offered APIs, 13 (68%) provided analytical tools, and 10 (53%) featured workspaces. Risk management documentation was available for 10 (53%), and five (26%) provided data retention policies. We assessed 58 data resources in the IID domain, identifying 19 that support data submission and are therefore suitable for NIH DMS Plans. Our findings reveal both the breadth of available resources and the challenges related to inconsistent data submission requirements and data management practices. Enhancing transparency and standardization across data resources will support more effective data sharing, enhance findability, and aid researchers in selecting appropriate resources for DMS Plans and secondary data analysis.