IEEE Transactions on Neural Networks and Learning Systems

Model-Based Offline Reinforcement Learning With Adversarial Data Augmentation
Cao H, Feng F, Huo J, Yang S, Fang M, Yang T and Gao Y
Model-based offline reinforcement learning (RL) constructs environment models from offline datasets to perform conservative policy optimization. Existing approaches focus on learning state transitions through ensemble models, rolling out conservative estimation to mitigate extrapolation errors. However, the static data makes it challenging to develop a robust policy, and offline agents cannot access the environment to gather new data. To address these challenges, we introduce Model-based Offline Reinforcement learning with AdversariaL data augmentation (MORAL). In MORAL, we replace the fixed horizon rollout by employing adversarial data augmentation to execute alternating sampling with ensemble models to enrich training data. Specifically, this adversarial process dynamically selects ensemble models against policy for biased sampling, mitigating the optimistic estimation of fixed models, thus robustly expanding the training data for policy optimization. Moreover, a differential factor (DF) is integrated into the adversarial process for regularization, ensuring error minimization in extrapolations. This data-augmented optimization adapts to diverse offline tasks without rollout horizon tuning, showing remarkable applicability. Extensive experiments on the D4RL benchmark demonstrate that MORAL outperforms other model-based offline RL methods in terms of policy learning and sample efficiency.
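The adversarial selection step described above can be pictured with a minimal sketch. It assumes a simplified setting that is not taken from the paper: each ensemble member is a callable returning a predicted next state and reward, and the "adversary" picks the member with the lowest one-step return estimate for the current policy. The names `select_adversarial_model`, `Model`, and `ValueFn` are hypothetical, and the differential factor (DF) regularization is omitted.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interfaces, assumed for illustration only.
Model = Callable[[float, float], Tuple[float, float]]  # (state, action) -> (next_state, reward)
ValueFn = Callable[[float], float]                     # state -> estimated value


def select_adversarial_model(
    models: Sequence[Model],
    value_fn: ValueFn,
    state: float,
    action: float,
    gamma: float = 0.99,
) -> Tuple[int, float]:
    """Return (index, target) of the ensemble member with the lowest one-step return."""
    if not models:
        raise ValueError("the ensemble must contain at least one model")
    best_index, best_target = 0, float("inf")
    for index, model in enumerate(models):
        next_state, reward = model(state, action)
        # One-step return estimate under this member's predicted dynamics.
        target = reward + gamma * value_fn(next_state)
        if target < best_target:  # strict '<' keeps the first member on ties
            best_index, best_target = index, target
    return best_index, best_target
```

Sampling from the selected member biases the augmented data toward pessimistic transitions, which is one way to read "mitigating the optimistic estimation of fixed models"; the actual selection rule in MORAL may differ.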
Spatiotemporal Topology-Informed Multiagent Reinforcement Learning Framework for Structured Multiprocess Collaborative Optimization
Liu D, Wang Y, Liu C, Luo B and Huang B
Industrial multiprocess collaborative optimization presents significant challenges due to the intricate spatiotemporal dependencies inherent in modern process industries. Traditional optimization and reinforcement learning often treat subprocesses as independent entities, neglecting the fine-grained interdependencies among operational variables across different subprocesses. To address this limitation, we introduce a novel spatiotemporal topology-informed multiprocess collaborative optimization (STI-MCO) framework, which pioneers action-level interdependency modeling through a spatiotemporal graph architecture. Rather than treating subprocesses as monolithic entities, STI-MCO operates at the operational variable level, representing both interprocess relationships and intraprocess dependencies through a hierarchical two-stage decision framework. This approach enables more precise coordination through fine-grained variable interactions, better temporal consistency via dynamic graph structures, and enhanced scalability compared with conventional agent-level methods. Extensive simulations and experiments across three benchmark environments with progressively complex topologies demonstrate that STI-MCO consistently outperforms baseline methods, achieving up to 38.9% improvement over centralized methods and 171.9% improvement over existing multiagent strategies. In addition, STI-MCO exhibits superior convergence efficiency, requiring significantly fewer training steps to achieve high performance. Its practical applicability is further validated through deployment in a real-world Salt Lake chemical process. By shifting the optimization paradigm from holistic subprocess control to fine-grained variable-level collaboration, this work establishes a new framework for more effective optimization in complex industrial processes, particularly those with strong interunit coupling.
Accurate Protein-Protein Interaction Prediction: Based on Multiview Heterogeneous Graph Autoencoders and Random Masking
Chen S, Tang Z, You L and Yu-Chian Chen C
Protein-protein interactions (PPIs) and their interaction sites [PPI sites (PPISs)] hold immense potential for elucidating cellular mechanisms and advancing targeted drug development. While deep learning has driven progress in PPI research by capturing protein features, it remains limited by its overreliance on sequence information and its inability to effectively integrate internal structural features of proteins. To address these challenges, we propose MEGAE, a novel model capable of high-precision prediction of PPI and PPIS. MEGAE reconstructs amino acid microenvironments through a vector quantization autoencoder, integrating physicochemical properties, structural details, and sequence data to provide a comprehensive representation of proteins. We introduce a multiview random masking training strategy that adds controlled randomness to the reconstruction process to enhance the robustness of microenvironment embeddings. The model combines these fused embeddings with protein graphs and protein interaction networks, leveraging graph neural networks (GNNs) to capture multilevel relationships, from local amino acid interactions to global signal network connections, thereby achieving precise predictions. Experimental results demonstrate that MEGAE outperforms state-of-the-art sequence- and structure-based methods across multiple datasets, exhibiting higher accuracy in predicting interaction types and interaction sites. This advancement underscores the potential of microenvironment-aware modeling in uncovering complex protein interactions.
Population Historical Information-Driven Evolutionary Multitask Neural Architecture Search
Yu K, Tang H, Liang J, Li C and Yu M
Neural architecture search (NAS) has achieved significant success in automating neural network design, particularly through evolutionary NAS. To address the critical need for efficient architecture discovery across diverse scenarios, such as computer vision and natural language processing, multitask NAS (MT-NAS) methods have emerged. Nevertheless, existing MT-NAS approaches still face critical challenges, including redundant search arising from insufficient exploitation of population historical information across generations and negative transfer caused by unguided interactions between tasks. To address these limitations, a population historical information-driven evolutionary multitask neural architecture search (HIMT-NAS) algorithm is proposed. For each generation, the population historical information, which includes the operation information and the topology information, is recorded. During the search, this information is used systematically to guide the evolutionary search direction, preventing redundant search. Furthermore, the proposed method adjusts the cross-task knowledge transfer probability by measuring task similarity through patterns in the population historical information, and then updates the transfer probabilities when the information proves useful across multiple tasks. Extensive experiments on MedMNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate consistent advantages of the proposed method over both single-task NAS methods and recent MT-NAS methods.
Universal Set of Observables for Forecasting Physical Systems Through Causal Embedding
Manjunath G, de Clercq A and Steynberg MJ
We show how a pair of points can uniquely represent a left-infinite sequence obtained from observations of an underlying dynamical system through a phenomenon called causal embedding. A driven dynamical system creates such pairs, and a function can be learned on them that reconstructs the underlying dynamics, as in Takens delay embedding. Unlike Takens delay embedding, the approach assures embedding stability; it also assures learnability, which can be absent in the current reservoir computing framework even when stability is present. The approach accurately models underlying systems on which recent methods, such as next-generation reservoir computing, fail. We demonstrate results and compare with other methods, including SINDy-PI.
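For readers unfamiliar with the baseline that the abstract compares against, the following is a minimal sketch of classical Takens-style delay embedding only; it does not implement the causal embedding proposed in the paper. The function name `delay_embed` and its parameters are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def delay_embed(series: Sequence[float], dim: int, lag: int = 1) -> List[Tuple[float, ...]]:
    """Build delay vectors (x[t], x[t-lag], ..., x[t-(dim-1)*lag]) for every valid t."""
    if dim < 1 or lag < 1:
        raise ValueError("dim and lag must be positive integers")
    span = (dim - 1) * lag  # how far back the oldest coordinate reaches
    return [
        tuple(series[t - k * lag] for k in range(dim))
        for t in range(span, len(series))
    ]
```

Each delay vector uses a fixed, finite window of past observations, whereas the abstract describes representing an entire left-infinite observation history by a pair of points.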
Interlayer Sparse Compression-Based Deep Echo State Network Model and Its Application in Time-Series Forecasting
Wang Y, Zheng M, Shang Y, Yuan M and Zhao H
To address redundant information accumulation, low computational efficiency, and fuzzy feature allocation in the multiscale time-series prediction of the traditional deep echo state network (DeepESN), this article proposes an interlayer sparse compression-based DeepESN model (ICS-DESN). The model deeply integrates the sparse sampling technique of compressive sensing with the hierarchical dynamic feature extraction mechanism of DeepESN: an adaptive compressed sampling module is introduced between the layers, and a Gaussian observation matrix reduces the dimension of the high-dimensional state, which effectively inhibits the stacking of redundant information in the deep network and explicitly allocates the multiscale temporal features. Theoretical analysis proves that ICS-DESN satisfies the stability condition of the echo state property (ESP) by constraining the weighted spectral radius of the reservoir. In the experiments, multiscenario time-series datasets, including logistic chaotic systems, Lorenz attractors, sunspot data, the NASDAQ stock index, the ETTh1 dataset, and a weather dataset, are used to validate the effectiveness of the model. The results show that, compared with the baseline models, ICS-DESN significantly reduces prediction errors [mean squared error (MSE) and mean absolute error (MAE)] while demonstrating higher computational efficiency and robustness. This research provides an efficient theoretical framework for complex time-series modeling and has potential application value in resource-constrained scenarios, such as edge computing.
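The interlayer compression step can be illustrated with a minimal sketch of a Gaussian observation matrix applied to a reservoir state. This is an assumption-laden illustration, not the ICS-DESN implementation: the function names, the N(0, 1/m) entry scaling (a common compressive-sensing convention), and the fixed seed are all choices made here, and the adaptive part of the sampling module is not modeled.

```python
import random
from typing import List, Sequence


def gaussian_observation_matrix(m: int, n: int, seed: int = 0) -> List[List[float]]:
    """Draw an m x n matrix with i.i.d. N(0, 1/m) entries (assumed scaling)."""
    rng = random.Random(seed)
    scale = (1.0 / m) ** 0.5  # standard deviation of each entry
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(m)]


def compress(phi: Sequence[Sequence[float]], state: Sequence[float]) -> List[float]:
    """Project an n-dimensional state to m dimensions: y = phi @ state."""
    return [sum(w * x for w, x in zip(row, state)) for row in phi]
```

With `m` smaller than `n`, the next layer receives a lower-dimensional input than the reservoir state that produced it, which is the dimension reduction the abstract refers to.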
LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking
Ma Y, Wei T, Zhong N, Mei J, Hu T, Wen L, Yang X, Shi B and Liu Y
While autonomous driving technology has made remarkable strides, data-driven approaches still struggle with complex scenarios due to their limited reasoning capabilities. Meanwhile, knowledge-driven autonomous driving systems have evolved considerably with the popularization of visual language models. In this article, we propose LeapVAD, a novel method based on cognitive perception and dual-process thinking. Our approach implements a human-attentional mechanism to identify and focus on critical traffic elements that influence driving decisions. By characterizing these objects through comprehensive attributes (including appearance, motion patterns, and associated risks), LeapVAD achieves more effective environmental representation and streamlines the decision-making process. Furthermore, LeapVAD incorporates an innovative dual-process decision-making module mimicking the human-driving learning process. The system consists of an analytic process (System-II) that accumulates driving experience through logical reasoning and a heuristic process (System-I) that refines this knowledge via fine-tuning and few-shot learning. LeapVAD also includes reflective mechanisms and a growing memory bank, enabling it to learn from past mistakes and continuously improve its performance in a closed-loop environment. To enhance efficiency, we develop a scene encoder network that generates compact scene representations for rapid retrieval of relevant driving experiences. Extensive evaluations conducted on two leading autonomous driving simulators, CARLA and DriveArena, demonstrate that LeapVAD achieves superior performance compared with camera-only approaches despite limited training data. Comprehensive ablation studies further emphasize its effectiveness in continuous learning and domain adaptation. Project page: https://pjlab-adg.github.io/LeapVAD/.
Predicting What Matters: Training AI Models for Better Decisions
Anand AS, Sawant S, Peter Reinhardt D and Gros S
Artificial intelligence (AI) models that predict the future behavior of real-world systems and processes (also known as predictive AI models) are central to intelligent decision-making. They are often employed in model-based decision-making frameworks to optimize decisions for real-world tasks based on their predictions. However, decisions optimized using such predictive AI models often result in suboptimal performance when applied in the real world. This is primarily because these models are typically constructed to best fit the behavior of the real-world system, and hence to predict the most likely future rather than to optimize the best possible decisions for a given task. Due to this objective mismatch, their predictions cannot be guaranteed to support optimal decision-making in theory or in practice. In fact, there is increasing empirical evidence and consensus that predictive models must be tailored to decision-making objectives to achieve optimal real-world performance. Supporting this observation, we establish formal (necessary and sufficient) conditions that a predictive model (AI-based or not) must satisfy for a decision-making policy derived using that model to achieve optimal performance in the real world. We then discuss their implications for building predictive AI models for optimal sequential decision-making.
PL: Patent Prediction With Prompt Learning
Lu YH, Lai PY, Chen MS, Cai HT, Wang ZH, Liu SY, Dai QY and Wang CD
Patents are crucial for protecting technological innovations and fostering competitive advancements in industry. Patent prediction, a novel task in the field of patent mining, aims to forecast future technological trends, providing valuable insights for strategic planning and innovation in the industry. However, the complexity of patent data and the diversity of technological fields make effective patent prediction a significant challenge. Existing methods for predicting scientific research trends struggle to effectively model patent structures and capture dependencies between patents, resulting in suboptimal patent trend predictions. In this article, we propose a novel method, patent prediction with prompt learning (PL), to achieve effective and accurate prediction of future patent developments based on a pretrained language model (PLM). PL includes a patent similarity path extraction module to extract multiple patent development paths from extensive datasets. Following this, we design a patent prompt learning approach that integrates patent development paths, keywords, and patent similarities into the prompts. To mitigate potential noise introduced by this integration, we introduce an attention mask matrix for prompt denoising. Finally, we introduce three patent datasets with rich structures, and conduct extensive experiments on these datasets as well as a public dataset, demonstrating the superiority of the proposed method. The dataset and code have been made publicly available at https://github.com/AllminerLab/P3L.
Spectral-Guided Multiscale Feature-Aware Transformer for Hyperspectral Image Classification
Shu Z, Zeng K, Wang Y, Tang S, Yu Z and Xiao L
Transformer-based methods have recently shown remarkable success in hyperspectral image classification (HSIC). However, their applications, in practice, still face two significant challenges. First, although the multihead mechanism in self-attention improves model robustness during training, it may overlook the continuity of spectral bands. Second, existing methods often struggle to effectively balance global and local information during multiscale feature extraction, limiting further improvements in classification performance. To address these issues, we propose a novel spectral-guided multiscale feature-aware Transformer (SMFAT) framework for HSIC. Specifically, a global low-rank spectral learning (GLSL) module is introduced to project hyperspectral image patches into a low-rank subspace, reducing spectral redundancy and capturing global spectral correlations. Furthermore, we introduce the multiscale feature-aware self-attention (MFASA) mechanism, which dynamically integrates fine- and coarse-grained features to enhance multiscale feature modeling. Finally, a spectral-guided fusion (SGF) module leverages the global spectral information extracted by the GLSL module to guide MFASA in more effectively capturing interspectral correlations and spectral continuity. This approach facilitates a more effective integration of spectral and spatial features in HSIs. Experiments on three well-known HSI datasets verify that the proposed SMFAT method significantly outperforms several state-of-the-art approaches in real-world HSIC tasks. The source code for this work is available at https://github.com/stellaZ77/SMFAT.
Efficient and Scalable Point Cloud Generation With Sparse Point-Voxel Diffusion Models
Romanelis I, Fotis V, Kalogeras A, Alexakos C, Munteanu A and Moustakas K
We propose a novel point cloud U-Net diffusion architecture for 3-D generative modeling capable of generating high-quality and diverse 3-D shapes while maintaining fast generation times. Our network employs a dual-branch architecture, combining the high-resolution representations of points with the computational efficiency of sparse voxels. Our fastest variant outperforms all nondiffusion generative approaches on unconditional shape generation, the most popular benchmark for evaluating point cloud generative models, while our largest model achieves state-of-the-art results among diffusion methods, with a runtime approximately 70% of the previously state-of-the-art point-voxel diffusion (PVD), measured on the same hardware setting. Beyond unconditional generation, we perform extensive evaluations, including conditional generation on all categories of ShapeNet, demonstrating the scalability of our model to larger datasets, and implicit generation, which allows our network to produce high-quality point clouds on fewer timesteps, further decreasing the generation time. Finally, we evaluate the architecture's performance in point cloud completion and super-resolution. Our model excels in all tasks, establishing it as a state-of-the-art diffusion U-Net for point cloud generative modeling. The code is publicly available at https://github.com/JohnRomanelis/SPVD.
A Refreshed Similarity-Based Upsampler for Direct High-Ratio Feature Upsampling
Zhou M, Wang H, Zheng Y and Meng D
Feature upsampling is a fundamental and indispensable ingredient of almost all current network structures for dense prediction tasks. Very recently, a popular similarity-based feature upsampling pipeline has been proposed, which utilizes a high-resolution (HR) feature as guidance to help upsample the low-resolution (LR) deep feature based on their local similarity. Despite its promising performance, this pipeline has specific limitations in its methodological design: 1) HR query and LR key features are not well aligned in a controllable manner; 2) the similarity between query-key features is computed with a fixed inner product form, lacking flexibility; and 3) neighbor selection is coarsely operated on LR features, resulting in mosaic artifacts. These shortcomings make the existing methods along this pipeline primarily applicable to hierarchical network architectures with iterative features as guidance, and they are not readily extended to a broader range of structures, especially for direct high-ratio upsampling. To address these issues, we thoroughly refresh this pipeline and optimize every methodological design. Specifically, we first propose an explicitly controllable query-key feature alignment from both semantic-aware and detail-aware perspectives and then construct a parameterized paired central difference convolution block for flexibly calculating the similarity between the well-aligned query-key features. In addition, we develop a fine-grained neighbor selection strategy on HR features, which is simple yet effective for alleviating mosaic artifacts. Based on these designs, we construct a refreshed similarity-based feature upsampling framework named ReSFU. Comprehensive experiments on 13 types of network backbones substantiate that, using only simple and direct high-ratio upsampling, ReSFU consistently achieves satisfactory performance on six tasks, including semantic segmentation, medical image segmentation, instance segmentation, panoptic segmentation, object detection, and monocular depth estimation, showing superior generality and ease of deployment compared with existing upsamplers. Codes are available at https://github.com/zmhhmz/ReSFU.
Chain-of-Detection: Enhancing Cross-Granularity Robotic Perception for Object Manipulation
Xu T, Gao H, Chen C, Li Y, Xu S, Guo S and Chen F
In robotic perception, cross-granularity object detection is essential for identifying and localizing targets at varying levels of detail. Traditional detection methods often struggle to bridge the gap between coarse object detection and fine-grained component localization, limiting their ability to associate parts, such as a cup and its handle. Vision-language models (VLMs), while effective in spatial reasoning, face challenges in fine-grained detection due to the scarcity of annotated datasets. To address these issues, we first propose the chain-of-detection (CoD) framework, which focuses on guiding detection in a step-by-step manner from coarse recognition to fine-grained localization. During this process, we observe that existing detectors still lack sufficient capability in recognizing fine-grained components. To overcome this limitation, we further combine the CoD framework with Monte Carlo tree search (MCTS) to automatically generate fine-grained datasets, eliminating the need for manual labeling and significantly improving detector performance. Experiments show that our approach achieves an average improvement of 17.31% in robotic manipulation success rates for common objects, 51.39% for larger object operations, and about 50% in simulated environments. These results demonstrate the effectiveness of CoD in advancing cross-granularity detection and enhancing precise robotic manipulation. The implementation is publicly available at https://github.com/tinnel123666888/CoD and the CoD dataset is released at https://huggingface.co/datasets/tinnel123/CoD_dataset.
Unlocking Pseudolabel Potential and Alignment for Unpaired Cross-Modality Adaptation in Remote Sensing Image Segmentation
Xu Z, Geng J, Jiang W and Song S
With the growth of multisource sensor technology, multimodal learning has become pivotal in remote sensing (RS) image segmentation. Despite its potential, current methods face challenges in acquiring large-scale paired samples. When annotated optical images are available, but synthetic aperture radar (SAR) images lack annotations, learning discriminative features for SAR images from optical images becomes difficult. Unsupervised domain adaptation (UDA) offers a potential solution to this challenge, which we refer to as unpaired cross-modality UDA. In this article, we propose unlocking pseudolabel potential and alignment (ULPA) for unpaired cross-modality adaptation in RS image segmentation, a novel one-stage adaptation framework designed to enhance cross-modality knowledge transfer. Our approach employs a prototypical multidomain alignment (PMDA) strategy, which reduces the modality gap through contrastive learning between features and prototypes of identical classes across different modalities. In addition, we introduce the unreliable-sample-guided feature contrast (UFC) loss to address the underutilization of unreliable pixels during training. This strategy separates reliable and unreliable pixels based on prediction confidence, assigning unreliable pixels to a category-wise queue of negative samples, thus ensuring all candidate pixels contribute to the training process. Extensive experiments show that the integration of PMDA and UFC loss can lead to more effective cross-modality domain alignment and substantially boost the model's generalization capability.
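The confidence-based separation described above can be sketched as follows. This is a simplified illustration, not the UFC loss itself: the thresholds, the function names, and in particular the rule that an unreliable pixel becomes a negative sample for every class whose predicted probability falls below `low` are assumptions made here, since the abstract does not specify the assignment rule.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple


def split_by_confidence(
    probs: Sequence[Sequence[float]], threshold: float = 0.9
) -> Tuple[List[int], List[int]]:
    """Split pixel indices into (reliable, unreliable) by maximum class probability."""
    reliable: List[int] = []
    unreliable: List[int] = []
    for index, pixel_probs in enumerate(probs):
        target = reliable if max(pixel_probs) >= threshold else unreliable
        target.append(index)
    return reliable, unreliable


def update_negative_queues(
    queues: Sequence[Deque[int]],
    probs: Sequence[Sequence[float]],
    unreliable: Sequence[int],
    low: float = 0.1,
) -> None:
    """Push each unreliable pixel into the queue of every class it is unlikely to belong to.

    Assumed rule: a pixel is a negative for class c when its probability for c is below `low`.
    """
    for index in unreliable:
        for cls, p in enumerate(probs[index]):
            if p < low:
                queues[cls].append(index)
```

Bounded queues (for example `deque(maxlen=...)`) keep memory fixed while still letting low-confidence pixels contribute to a contrastive term instead of being discarded.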
User Isolation Poisoning on Decentralized Federated Learning: An Adversarial Message-Passing Graph Neural Network Approach
Li K, Liang Y, Lio P, Ni W, Dressler F, Crowcroft J and Akan OB
This article proposes a new cyberattack on decentralized federated learning (DFL), named user isolation poisoning (UIP). While following the standard DFL protocol of receiving and aggregating benign local models, a malicious user strategically generates and distributes compromised updates to undermine the learning process. The objective of the new UIP attack is to diminish the impact of benign users by isolating their model updates, thereby manipulating the shared model to reduce the learning accuracy. To realize this attack, we design a novel threat model that leverages an adversarial message-passing graph (MPG) neural network. Through iterative message passing, the adversarial MPG progressively refines the representations (also known as embeddings or hidden states) of each benign local model update. By orchestrating feature exchanges among connected nodes in a targeted manner, the malicious users effectively curtail the genuine data features of benign local models, thereby diminishing their overall influence within the DFL process. The MPG-based UIP attack is implemented in PyTorch, demonstrating that it effectively reduces the test accuracy of DFL by 49.5% and successfully evades existing cosine similarity- and Euclidean distance-based defense strategies.
D2Vformer: A Flexible Time-Series Prediction Model Based on Time-Position Embedding
Song X, Wang H, Deng L, Wang D, Qiu H, He Y, Cao W and Leung CS
Existing time-series forecasting methods often struggle to adapt to dynamic scenarios and lack flexibility in prediction. They typically require retraining the model when the prediction length or position changes. Moreover, these methods still face challenges in effectively capturing and utilizing time-position embeddings (PEs). To address these limitations, this article proposes a novel model called D2Vformer. Unlike conventional prediction methods that rely on fixed-length predictors, D2Vformer can directly handle scenarios with arbitrary prediction lengths. In addition, it significantly reduces training resource consumption and proves highly effective in real-world dynamic environments. In D2Vformer, the Date2Vec (D2V) module is devised to leverage timestamp information and feature sequences to generate time PEs. Subsequently, D2Vformer introduces an innovative fusion module that leverages an attention mechanism to capture the mapping between input and target time PEs, thereby enabling flexible prediction. Extensive experiments on six datasets demonstrate that D2V outperforms other time-PE methods, while D2Vformer surpasses state-of-the-art approaches in both fixed-length and arbitrary-length prediction tasks. The code for D2Vformer is available at: https://github.com/TeamofHaoWang/D2Vformer.
Spatiotemporal Dynamics Modeling of Brain Activity for Human-Robot Cognitive Interaction: A Distributed-Lumped Parameter System Framework
Zhang J, Zhang L, Mu F, Huang Z, Zou C, Huang R, Wang C and Cheng H
This article investigates the system modeling problem for the dynamical process of human brain activity in human-robot cognitive interaction (HRCI). An important novelty of the proposed approach is a computational model of a human-distributed robot-lumped parameter system (HDRLPS) that describes the inherent dynamical principle of human brain activity (with spatiotemporally varying characteristics) under the interaction between intrinsic cognitive dynamics and extrinsic robot stimuli. A deterministic learning (DL)-based spatiotemporal dynamics identification scheme is proposed to accurately identify the spatiotemporal dynamics of the HDRLPS and store the associated knowledge as a constant radial basis function neural network (RBF NN) model. A spatiotemporal dynamics estimator is designed with this model, which can accurately evaluate and monitor the dynamical process of human brain activity in real-time HRCI through the generated dynamics-synchronized state. The effectiveness and practicability of the approach in the dynamics identification and evaluation of human brain activity in HRCI are validated by a thorough analysis, including a mathematical proof, a simulation study, and a brain-computer interface (BCI) experiment using publicly available datasets. Our method is compared with state-of-the-art (SOTA) methods, such as LGGNet, EEGNet, Tsception, EEG-Deformer, EEG-Transformer, and EEGViT. The results show that our method outperforms these methods in recognition accuracy and macro-$F_1$ score. The source code can be found at: https://github.com/alonexing/source_code/tree/master.
SKIP: A Prototype-Based Scalable Knowledge Graph Representation Learning Method
Liu Y, Liang K, Xia J, Liu M, Yang X, Liu X, Zhou S and Li SZ
The field of knowledge graph representation learning (KGRL) has been rapidly expanding. To effectively apply KGRL models to large real-world knowledge graphs (KGs), anchor-based methods have been proposed. These methods aim to reduce computational costs and parameter requirements by encoding entities using a small set of entity anchors. However, existing anchor selection approaches are often rudimentary and sometimes yield suboptimal results. In this article, we propose a scalable anchor-based KGRL method called SKIP. By leveraging prototype information, our method selects representative entities as anchors. The SKIP method consists of two main steps. First, pretraining models are employed to encode entities by utilizing the topological structure and textual information in KGs. Second, the prototype learning module (PLM) extracts entity prototypes, which are then used to sample entity anchors that contain valuable prototype information. These settings enable SKIP to identify representative and reasonable entity anchors, leading to improved performance while requiring fewer computational resources. Extensive experiments conducted on various downstream tasks using KGs of different scales demonstrate the superiority and effectiveness of SKIP. Particularly, on the large OGB WikiKG 2 dataset, our method achieves comparable performance while reducing running time by approximately 21.28% and requiring 21.43% fewer model parameters compared to the baseline. This indicates the superior scalability of SKIP.
DCTC-Net: Dual-Branch Cross-Fusion Transformer-CNN Architecture for Medical Image Segmentation
Sun R
Hybrid architectures that combine convolutional neural networks (CNNs) with Transformers have emerged as a promising approach for medical image segmentation. However, existing networks based on this hybrid architecture often encounter two challenges. First, while the CNN branch effectively captures local image features through convolution operations, vanilla convolution lacks the ability to achieve adaptive feature extraction. Second, although the Transformer branch can model global image information, conventional self-attention (SA) primarily focuses on spatial relationships, neglecting channel and cross-dimensional attention, leading to suboptimal segmentation results, particularly for medical images with complex backgrounds. To address these limitations, we propose a dual-branch cross-fusion Transformer-CNN architecture for medical image segmentation (DCTC-Net). Our network provides two key advantages. First, a dynamic deformable convolution (DDConv) is integrated into the CNN branch to overcome the limitations of adaptive feature extraction with fixed-size convolution kernels and also eliminate the issue of shared convolution kernel parameters across different inputs, significantly enhancing the feature expression capabilities of the CNN branch. Second, a (shifted)-window adaptive complementary attention module ((S)W-ACAM) and compact convolutional projection are incorporated into the Transformer branch, enabling the network to comprehensively learn cross-dimensional long-range dependencies in medical images. Experimental results demonstrate that the proposed DCTC-Net achieves superior medical image segmentation performance compared to state-of-the-art (SOTA) methods, including CNN and Transformer networks. In addition, our DCTC-Net requires fewer parameters and lower computational costs and does not rely on pretraining.
Reinforcement Active Modeling for Flexible Needle Shape Prediction in Multilayer Tissues
Ren F, Wang X, Fang Y, Yu N and Han J
The complex interactions between flexible needles and tissues present significant challenges in predicting the needle shape during the puncture procedure. In particular, the accurate prediction of flexible needle shape during insertion into complex multilayer tissues, especially when measurement feedback involves non-Gaussian noise, remains an open problem. In this article, we develop a novel reinforcement learning-based active modeling scheme to predict the deflection of the robotic flexible needle. First, the active modeling scheme is constructed by deriving an extended Kalman filter under the maximum correntropy criterion to enhance insensitivity to non-Gaussian noise. Subsequently, based on this scheme, the reinforcement active modeling (RAM) framework is built by incorporating reinforcement learning to compensate for the modeling residuals. Specifically, the theoretical convergence of the proposed scheme is proved by using the Banach fixed-point theorem, thereby ensuring the reliability of needle shape prediction. Finally, a series of comparative experiments is carried out on a self-built robotic flexible needle. The experimental results demonstrate the superior performance of the proposed deflection predictor. Under non-Gaussian noise conditions, the proposed RAM scheme achieves a generalization prediction error reduction of 46.4% in RMSE and over 76.1% in Var during insertion into unknown multilayer tissue.
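To make the role of the maximum correntropy criterion (MCC) concrete, the sketch below applies it to the simplest possible problem, estimating a constant from noisy measurements, using the standard fixed-point reweighting with a Gaussian kernel. It is not the extended Kalman filter derived in the paper; the function names, the kernel bandwidth `sigma`, and the median initialization are assumptions made for illustration.

```python
import math
import statistics
from typing import Sequence


def gaussian_kernel(error: float, sigma: float) -> float:
    """Gaussian kernel used by MCC; large errors receive weights near zero."""
    return math.exp(-(error ** 2) / (2.0 * sigma ** 2))


def mcc_location(
    measurements: Sequence[float],
    sigma: float = 1.0,
    tol: float = 1e-9,
    max_iter: int = 100,
) -> float:
    """Estimate a constant under MCC via fixed-point iteration on kernel weights."""
    estimate = statistics.median(measurements)  # assumed robust starting point
    for _ in range(max_iter):
        weights = [gaussian_kernel(z - estimate, sigma) for z in measurements]
        total = sum(weights)
        if total == 0.0:  # every weight underflowed; keep the current estimate
            break
        updated = sum(w * z for w, z in zip(weights, measurements)) / total
        if abs(updated - estimate) < tol:
            return updated
        estimate = updated
    return estimate
```

Because the kernel weight decays exponentially with the squared error, a single gross outlier barely moves the estimate, whereas it shifts a least-squares mean substantially. This is the insensitivity to non-Gaussian noise that motivates using MCC inside the filter.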
Leader-Based Multiexpert Neural Network for High-Level Visual Tasks
Zuo F, Liu J, Chen Z, Shen X, Wang L and Wen Z
Remarkable progress has been achieved in baseline detection and segmentation; however, for high-level visual tasks in complex scenes (e.g., dense objects, occlusion, scale diversity, and high background noise), existing frameworks often fail to provide satisfactory performance. To further improve object recognition ability, this article introduces a leader-based multiexpert mechanism into detection and segmentation tasks. We first design a leader-based attention learning layer to fully integrate multilevel features from the backbone network, which can effectively obtain global semantics and assign instructions to detection experts. Then, we propose multiple feature pyramids with dual fusion paths, built with semantic and spatial allocators, to replace the traditional single pipeline. With this strategy, we can establish deep supervision for multiple experts during training and fully utilize the multiexpert detection results from the leaders' assignments during inference, thereby comprehensively improving the performance of the model in complex scenarios. In the experiments, we conduct ablation studies and performance comparisons on the COCO 2017 detection and segmentation tasks. Finally, we evaluate the model in three complex application scenarios (remote sensing, autonomous driving, and industrial fields), and the results show its advantages.