On Finite Difference Jacobian Computation in Deformable Image Registration
Producing spatial transformations that are diffeomorphic is a key goal in deformable image registration. As a diffeomorphic transformation should have a positive Jacobian determinant |J| everywhere, the number of voxels with negative |J| has been used to test for diffeomorphism and also to measure the irregularity of the transformation. For digital transformations, |J| is commonly approximated using a central difference, but this strategy can yield positive |J| values for transformations that are clearly not diffeomorphic, even at the voxel resolution level. To show this, we first investigate the geometric meaning of different finite difference approximations of |J|. We show that to determine whether a deformation is diffeomorphic for digital images, the use of any individual finite difference approximation of |J| is insufficient. We further demonstrate that for a 2D transformation, four unique finite difference approximations of |J| must all be positive to ensure that the entire domain is invertible and free of folding at the pixel level. For a 3D transformation, ten unique finite difference approximations of |J| are required to be positive. Our proposed criterion resolves several errors inherent in the central difference approximation of |J| and accurately detects non-diffeomorphic digital transformations. The source code of this work is available at https://github.com/yihao6/digital_diffeomorphism.
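As an illustration of the quantity discussed above, the sketch below computes the standard central-difference approximation of |J| for a 2D digital transformation in NumPy. The function name and the (2, H, W) array convention are ours, not the paper's; the paper's full criterion additionally requires checking combinations of one-sided (forward/backward) differences, which are not reproduced here.

```python
import numpy as np

def central_difference_jacobian_det_2d(phi):
    """Central-difference approximation of |J| for a 2D transformation.

    phi: array of shape (2, H, W); phi[c, i, j] is coordinate c of the
    transformed position of pixel (i, j). Boundary pixels are excluded,
    so the result has shape (H - 2, W - 2).
    """
    # partial derivatives along the row (i) and column (j) axes
    d_i = 0.5 * (phi[:, 2:, 1:-1] - phi[:, :-2, 1:-1])
    d_j = 0.5 * (phi[:, 1:-1, 2:] - phi[:, 1:-1, :-2])
    # 2x2 determinant at every interior pixel
    return d_i[0] * d_j[1] - d_i[1] * d_j[0]
```

A positive value at a pixel does not by itself certify local invertibility; as the abstract states, four such finite difference approximations must be positive in 2D (ten in 3D) to rule out folding at the pixel level.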
RNAS-CL: Robust Neural Architecture Search by Cross-Layer Knowledge Distillation
Deep Neural Networks are often vulnerable to adversarial attacks. Neural Architecture Search (NAS), one of the tools for developing novel deep neural architectures, has demonstrated superior prediction accuracy in various machine learning applications. However, the performance of neural architectures discovered by NAS under adversarial attacks has not been sufficiently studied, especially under the regime of knowledge distillation. Given the presence of a robust teacher, we investigate whether NAS can produce a robust neural architecture by inheriting robustness from the teacher. In this paper, we propose Robust Neural Architecture Search by Cross-Layer knowledge distillation (RNAS-CL), a novel NAS algorithm that improves the robustness of NAS by learning from a robust teacher through cross-layer knowledge distillation. Unlike previous knowledge distillation methods that encourage close student-teacher outputs only at the last layer, RNAS-CL automatically searches for the best teacher layer to supervise each student layer. Experimental results demonstrate the effectiveness of RNAS-CL and show that RNAS-CL produces compact and adversarially robust neural architectures. Our results point to new approaches for finding compact and robust neural architectures for many applications. The code of RNAS-CL is available at https://github.com/Statistical-Deep-Learning/RNAS-CL.
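The abstract does not specify how the best teacher layer is selected; the sketch below is one simplified reading of cross-layer feature distillation in PyTorch. Relational (batch-similarity) matching, global average pooling, and the hard minimum over teacher layers are our illustrative simplifications, not RNAS-CL's actual search procedure.

```python
import torch
import torch.nn.functional as F

def batch_similarity(feat):
    """(B, C, H, W) feature map -> (B, B) cosine-similarity matrix."""
    v = F.normalize(feat.mean(dim=(2, 3)), dim=1)   # pooled, unit-norm (B, C)
    return v @ v.t()

def cross_layer_kd_loss(student_feats, teacher_feats):
    """For each student layer, align with its closest teacher layer.

    Batch-similarity matching makes layers with different channel widths
    comparable; this, the hard argmin selection, and the MSE distance are
    illustrative simplifications of the cross-layer search in the abstract.
    """
    t_sims = [batch_similarity(t) for t in teacher_feats]
    loss = 0.0
    for s in student_feats:
        s_sim = batch_similarity(s)
        dists = torch.stack([F.mse_loss(s_sim, t_sim) for t_sim in t_sims])
        loss = loss + dists.min()                    # best-matching teacher layer
    return loss / len(student_feats)
```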
Improved 3D Markerless Mouse Pose Estimation Using Temporal Semi-Supervision
Three-dimensional markerless pose estimation from multi-view video is emerging as an exciting method for quantifying the behavior of freely moving animals. Nevertheless, scientifically precise 3D animal pose estimation remains challenging, primarily due to a lack of large training and benchmark datasets and the immaturity of algorithms tailored to the demands of animal experiments and body plans. Existing techniques employ fully supervised convolutional neural networks (CNNs) trained to predict body keypoints in individual video frames, but this demands a large collection of labeled training samples to achieve desirable 3D tracking performance. Here, we introduce a semi-supervised learning strategy that incorporates unlabeled video frames via a simple temporal constraint applied during training. In freely moving mice, our new approach improves the current state-of-the-art performance of multi-view volumetric 3D pose estimation and further enhances the temporal stability and skeletal consistency of 3D tracking.
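The abstract calls the key ingredient a "simple temporal constraint" on unlabeled frames without giving its form; below is one plausible instantiation as a temporal-smoothness penalty on predicted keypoints, written in PyTorch. The function name, the squared-difference form, and the weighting are assumptions for illustration only.

```python
import torch

def temporal_smoothness_loss(pred_keypoints):
    """Penalize frame-to-frame jumps in predicted 3D keypoints.

    pred_keypoints: tensor of shape (T, K, 3) holding predictions for T
    consecutive (possibly unlabeled) frames with K keypoints each.
    """
    diffs = pred_keypoints[1:] - pred_keypoints[:-1]      # (T - 1, K, 3)
    return diffs.pow(2).sum(dim=-1).mean()

# Example of mixing it into a training objective on unlabeled clips
# (the supervised keypoint loss and the weight lambda_t are assumed):
# total_loss = supervised_loss + lambda_t * temporal_smoothness_loss(pred_unlabeled)
```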
Making the Invisible Visible: Toward High-Quality Terahertz Tomographic Imaging via Physics-Guided Restoration
Terahertz (THz) tomographic imaging has recently attracted significant attention thanks to its non-invasive, non-destructive, non-ionizing, material-classifying, and ultra-fast nature for object exploration and inspection. However, strong water absorption and low noise tolerance lead to undesired blurs and distortions in reconstructed THz images. The diffraction-limited THz signals severely constrain the performance of existing restoration methods. To address this problem, we propose a novel multi-view Subspace-Attention-guided Restoration Network (SARNet) that fuses multi-view and multi-spectral features of THz images for effective image restoration and 3D tomographic reconstruction. To this end, SARNet uses multi-scale branches to extract intra-view spatio-spectral amplitude and phase features and fuses them via shared subspace projection and self-attention guidance. We then perform inter-view fusion to further improve the restoration of individual views by leveraging the redundancies between neighboring views. We experimentally construct a THz time-domain spectroscopy (THz-TDS) system covering a broad frequency range from 0.1 to 4 THz to build a temporal/spectral/spatial/material THz database of hidden 3D objects. Complementary to a quantitative evaluation, we demonstrate the effectiveness of our SARNet model on 3D THz tomographic reconstruction applications.
Rethinking Portrait Matting with Privacy Preserving
Recently, there has been increasing concern about the privacy issues raised by identifiable information in machine learning. However, previous portrait matting methods were all based on identifiable images. To fill this gap, we present P3M-10k, the first large-scale anonymized benchmark for Privacy-Preserving Portrait Matting (P3M). P3M-10k consists of 10,421 high-resolution face-blurred portrait images along with high-quality alpha mattes, which enables us to systematically evaluate both trimap-free and trimap-based matting methods and obtain useful findings about model generalization ability under the privacy-preserving training (PPT) setting. We also present a unified matting model dubbed P3M-Net that is compatible with both CNN and transformer backbones. To further mitigate the cross-domain performance gap under the PPT setting, we devise a simple yet effective Copy and Paste strategy (P3M-CP), which borrows facial information from public celebrity images and directs the network to reacquire face context at both the data and feature levels. Extensive experiments on P3M-10k and public benchmarks demonstrate the superiority of P3M-Net over state-of-the-art methods and the effectiveness of P3M-CP in improving cross-domain generalization, underscoring the significance of P3M for future research and real-world applications. The dataset, code and models are available here (https://github.com/ViTAE-Transformer/P3M-Net).
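At the data level, the copy-and-paste idea can be pictured as filling the (blurred) face region of a training portrait with pixels from an external face image; the NumPy sketch below shows that step. The function name and mask-based compositing are illustrative, and the alignment/resizing of the donor face as well as the feature-level variant of P3M-CP are omitted.

```python
import numpy as np

def copy_paste_face(blurred_portrait, donor_image, face_mask):
    """Data-level copy-and-paste: fill the blurred face region of a
    portrait with pixels from a donor face image of the same size.

    blurred_portrait, donor_image: (H, W, 3) uint8 arrays
    face_mask: (H, W) boolean array marking the face region
    """
    out = blurred_portrait.copy()
    out[face_mask] = donor_image[face_mask]
    return out
```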
OpenMonkeyChallenge: Dataset and Benchmark Challenges for Pose Estimation of Non-human Primates
The ability to automatically estimate the pose of non-human primates as they move through the world is important for several subfields in biology and biomedicine. Inspired by the recent success of computer vision models enabled by benchmark challenges (e.g., object detection), we propose a new benchmark challenge called OpenMonkeyChallenge that facilitates collective community efforts through an annual competition to build generalizable non-human primate pose estimation models. To host the benchmark challenge, we provide a new public dataset consisting of 111,529 annotated (17 body landmarks) photographs of non-human primates in naturalistic contexts obtained from various sources including the Internet, three National Primate Research Centers, and the Minnesota Zoo. The annotated images serve as training and testing data for developing generalizable models with standardized evaluation metrics. We demonstrate the effectiveness of our dataset quantitatively by comparing it with existing datasets based on seven state-of-the-art pose estimation models.
Building 3D Generative Models from Minimal Data
We propose a method for constructing generative models of 3D objects from a single 3D mesh and improving them through unsupervised low-shot learning from 2D images. Our method produces a 3D morphable model that represents shape and albedo in terms of Gaussian processes. Whereas previous approaches have typically built 3D morphable models from multiple high-quality 3D scans through principal component analysis, we build 3D morphable models from a single scan or template. As we demonstrate in the face domain, these models can be used to infer 3D reconstructions from 2D data (inverse graphics) or 3D data (registration). Specifically, we show that our approach can be used to perform face recognition using only a single 3D template (one scan total, not one per person). We extend our model to a preliminary unsupervised learning framework that enables the learning of the distribution of 3D faces using one 3D template and a small number of 2D images. Our approach is motivated as a potential model for the origins of face perception in human infants, who appear to start with an innate face template and subsequently develop a flexible system for perceiving the 3D structure of any novel face from experience with only 2D images of a relatively small number of familiar faces.
The Curious Layperson: Fine-Grained Image Recognition Without Expert Labels
Most of us are not experts in specific fields, such as ornithology. Nonetheless, we do have general image and language understanding capabilities that we use to match what we see to expert resources. This allows us to expand our knowledge and perform novel tasks without ad-hoc external supervision. In contrast, machines have a much harder time consulting expert-curated knowledge bases unless trained specifically with that knowledge in mind. Thus, in this paper we consider a new problem: fine-grained image recognition without expert annotations, which we address by leveraging the vast knowledge available in web encyclopedias. First, we learn a model to describe the visual appearance of objects using non-expert image descriptions. We then train a fine-grained textual similarity model that matches image descriptions with documents on a sentence-level basis. We evaluate the method on two datasets (CUB-200 and Oxford-102 Flowers) and compare with several strong baselines and the state of the art in cross-modal retrieval. Code is available at: https://github.com/subhc/clever.
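As a concrete picture of the sentence-level matching step, the sketch below scores a candidate expert document against a generated image description using pre-computed sentence embeddings. The embeddings are assumed to be given and unit-normalized, and the max-over-document / mean-over-description aggregation is an illustrative choice rather than the paper's exact scoring rule.

```python
import numpy as np

def document_score(description_embs, doc_sentence_embs):
    """Score one expert document against a generated image description.

    description_embs:  (S, D) unit-normalized embeddings of the S sentences
                       describing the image (from the captioning model)
    doc_sentence_embs: (T, D) unit-normalized embeddings of the T sentences
                       in the candidate document
    """
    sims = description_embs @ doc_sentence_embs.T    # (S, T) cosine similarities
    return sims.max(axis=1).mean()
```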
In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation and Beyond
Predicting human gaze from egocentric videos plays a critical role in understanding human intention in daily activities. In this paper, we present the first transformer-based model to address the challenging problem of egocentric gaze estimation. We observe that the connection between the global scene context and local visual information is vital for localizing the gaze fixation from egocentric video frames. To this end, we design the transformer encoder to embed the global context as one additional visual token and further propose a novel global-local correlation module to explicitly model the correlation between the global token and each local token. We validate our model on two egocentric video datasets - EGTEA Gaze+ and Ego4D. Our detailed ablation studies demonstrate the benefits of our method. In addition, our approach exceeds the previous state-of-the-art model by a large margin. We also apply our model to a novel gaze saccade/fixation prediction task and the traditional action recognition problem. The consistent gains suggest the strong generalization capability of our model. We also provide additional visualizations to support our claim that global-local correlation serves as a key representation for predicting gaze fixation from egocentric videos. More details can be found on our website (https://bolinlai.github.io/GLC-EgoGazeEst).
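The abstract describes correlating a single global context token with every local token; the PyTorch sketch below shows one minimal way such a correlation could be computed. The dot-product similarity, softmax re-weighting, and function name are assumptions for illustration and not the authors' exact module.

```python
import torch
import torch.nn.functional as F

def global_local_correlation(tokens):
    """Correlate a global token with every local token.

    tokens: (B, 1 + N, C) - the first token is the global context token,
    the remaining N tokens are local patch tokens.
    Returns re-weighted local tokens of shape (B, N, C).
    """
    global_tok = tokens[:, :1]                       # (B, 1, C)
    local_tok = tokens[:, 1:]                        # (B, N, C)
    # dot-product similarity between the global token and each local token
    corr = (global_tok * local_tok).sum(-1) / local_tok.shape[-1] ** 0.5  # (B, N)
    weights = F.softmax(corr, dim=-1).unsqueeze(-1)  # (B, N, 1)
    return local_tok * weights
```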
A Deeper Analysis of Volumetric Relightable Faces
Portrait viewpoint and illumination editing is an important problem with several applications in VR/AR, movies, and photography. Comprehensive knowledge of geometry and illumination is critical for obtaining photorealistic results. Current methods are unable to explicitly model in 3D while handling both viewpoint and illumination editing from a single image. In this paper, we propose VoRF, a novel approach that can take even a single portrait image as input and relight human heads under novel illuminations that can be viewed from arbitrary viewpoints. VoRF represents a human head as a continuous volumetric field and learns a prior model of human heads using a coordinate-based MLP with individual latent spaces for identity and illumination. The prior model is learned in an auto-decoder manner over a diverse class of head shapes and appearances, allowing VoRF to generalize to novel test identities from a single input image. Additionally, VoRF has a reflectance MLP that uses the intermediate features of the prior model to render One-Light-at-A-Time (OLAT) images under novel views. We synthesize novel illuminations by combining these OLAT images with target environment maps. Qualitative and quantitative evaluations demonstrate the effectiveness of VoRF for relighting and novel view synthesis, even when applied to unseen subjects under uncontrolled illumination. This work is an extension of Rao et al. (VoRF: Volumetric Relightable Faces 2022). We provide extensive evaluation and ablative studies of our model and also present an application in which any face can be relit using textual input.
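The final relighting step described above follows the standard image-based (OLAT) relighting recipe: the relit image is a weighted sum of the per-light renders, with weights read off the target environment map. The NumPy sketch below shows that combination; the array shapes and the omission of the environment-map sampling step are our simplifications.

```python
import numpy as np

def relight_from_olat(olat_images, light_weights):
    """Image-based relighting: weighted sum of One-Light-at-A-Time renders.

    olat_images:   (L, H, W, 3) array, one rendered image per light direction
    light_weights: (L, 3) RGB intensities obtained by sampling the target
                   environment map at the corresponding light directions
                   (the sampling step itself is omitted here).
    """
    return np.einsum("lhwc,lc->hwc", olat_images, light_weights)
```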
Deep Learning Based Prediction of Pulmonary Hypertension in Newborns Using Echocardiograms
Pulmonary hypertension (PH) in newborns and infants is a complex condition associated with several pulmonary, cardiac, and systemic diseases contributing to morbidity and mortality. Thus, accurate and early detection of PH and the classification of its severity are crucial for appropriate and successful management. Echocardiography is the primary diagnostic tool in pediatrics, but human assessment of echocardiograms is both time-consuming and expertise-demanding, raising the need for an automated approach. Little effort has been directed towards automatic assessment of PH using echocardiography, and the few proposed methods only focus on binary PH classification in the adult population. In this work, we present an explainable multi-view video-based deep learning approach to predict and classify the severity of PH for a cohort of 270 newborns using echocardiograms. We use spatio-temporal convolutional architectures for the prediction of PH from each view, and aggregate the predictions of the different views using majority voting. Our results show a mean F1-score of 0.84 for severity prediction and 0.92 for binary detection using 10-fold cross-validation, and 0.63 for severity prediction and 0.78 for binary detection on the held-out test set. We complement our predictions with saliency maps and show that the learned model focuses on clinically relevant cardiac structures, motivating its usage in clinical practice. To the best of our knowledge, this is the first work on automated assessment of PH in newborns using echocardiograms.
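The abstract states that per-view predictions are aggregated by majority voting; the small sketch below illustrates such an aggregation. The class labels and the tie-breaking rule are illustrative only.

```python
from collections import Counter

def aggregate_views(view_predictions):
    """Majority vote over per-view severity predictions.

    view_predictions: list of class labels, one per echocardiographic view,
    e.g. ["none", "mild", "severe", "severe"]. Ties are broken by order of
    first appearance, which is an arbitrary (illustrative) choice.
    """
    counts = Counter(view_predictions)
    return counts.most_common(1)[0][0]

print(aggregate_views(["mild", "severe", "severe"]))  # -> "severe"
```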
From Forest to Zoo: Great Ape Behavior Recognition with ChimpBehave
This paper addresses the significant challenge of recognizing behaviors in non-human primates, specifically focusing on chimpanzees. Automated behavior recognition is crucial for both conservation efforts and the advancement of behavioral research. However, it is often hindered by the labor-intensive process of manual video annotation. Despite the availability of large-scale animal behavior datasets, effectively applying machine learning models across varied environmental settings remains a critical challenge due to the variability in data collection contexts and the specificity of annotations. In this paper, we introduce ChimpBehave, a novel dataset comprising over 2 h and 20 min of video (approximately 215,000 frames) of zoo-housed chimpanzees, annotated with bounding boxes and fine-grained locomotive behavior labels. Uniquely, ChimpBehave aligns its behavior classes with those in PanAf, an existing dataset collected in distinct visual environments, enabling the study of cross-dataset generalization - where models are trained on one dataset and tested on another with differing data distributions. We benchmark ChimpBehave using state-of-the-art video-based and skeleton-based action recognition models, establishing performance baselines for both within-dataset and cross-dataset evaluations. Our results highlight the strengths and limitations of different model architectures, providing insights into the application of automated behavior recognition across diverse visual settings. The dataset, models, and code can be accessed at: https://github.com/MitchFuchs/ChimpBehave.
Interweaving Insights: High-Order Feature Interaction for Fine-Grained Visual Recognition
This paper presents a novel approach for Fine-Grained Visual Classification (FGVC) by exploring Graph Neural Networks (GNNs) to facilitate high-order feature interactions, with a specific focus on constructing both inter- and intra-region graphs. Unlike previous FGVC techniques that often isolate global and local features, our method combines both features seamlessly during learning via graphs. Inter-region graphs capture long-range dependencies to recognize global patterns, while intra-region graphs delve into finer details within specific regions of an object by exploring high-dimensional convolutional features. A key innovation is the use of shared GNNs with an attention mechanism coupled with the Approximate Personalized Propagation of Neural Predictions (APPNP) message-passing algorithm, enhancing information propagation efficiency for better discriminability and simplifying the model architecture for computational efficiency. Additionally, the introduction of residual connections improves performance and training stability. Comprehensive experiments showcase state-of-the-art results on benchmark FGVC datasets, affirming the efficacy of our approach. This work underscores the potential of GNNs in modeling high-level feature interactions, distinguishing it from previous FGVC methods that typically focus on singular aspects of feature representation. Our source code is available at https://github.com/Arindam-1991/I2-HOFI.
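APPNP message passing, referenced above, follows a well-known propagation rule, Z <- (1 - alpha) * A_hat * Z + alpha * H, where A_hat is the symmetrically normalized adjacency with self-loops and H holds the initial node features. The NumPy sketch below shows that propagation for a small dense graph; the graph construction, attention mechanism, and shared-GNN components of the paper are not reproduced.

```python
import numpy as np

def appnp_propagate(adj, features, alpha=0.1, num_iters=10):
    """APPNP propagation: Z <- (1 - alpha) * A_hat @ Z + alpha * H.

    adj:      (N, N) dense adjacency matrix of the region graph
    features: (N, C) initial node predictions/features H
    alpha:    teleport probability back to the initial features
    """
    a = adj + np.eye(adj.shape[0])                          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_hat = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # sym. normalization
    z = features.copy()
    for _ in range(num_iters):
        z = (1.0 - alpha) * a_hat @ z + alpha * features
    return z
```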
Ricci Curvature Tensor-Based Volumetric Segmentation
Existing level set models employ regularization based only on gradient information, 1D curvature or 2D curvature. For 3D image segmentation, however, an appropriate curvature-based regularization should involve a well-defined 3D curvature energy. This is the first paper to introduce a regularization energy that incorporates 3D scalar curvature for 3D image segmentation, inspired by the Einstein-Hilbert functional. To derive its Euler-Lagrange equation, we employ a two-step gradient descent strategy, alternately updating the level set function and its gradient. The paper also establishes the existence and uniqueness of the viscosity solution for the proposed model. Experimental results demonstrate that our proposed model outperforms other state-of-the-art models in 3D image segmentation.
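For reference, the Einstein-Hilbert functional that motivates the proposed scalar-curvature regularizer has the standard form below (up to constants); how the paper couples this functional to the level set function and its gradient is not reproduced here.

```latex
% Einstein--Hilbert functional: R is the scalar curvature of the metric g
% and |det g| its determinant; the integral is over the 3D image domain.
S_{\mathrm{EH}}[g] \;=\; \int_{M} R \,\sqrt{|\det g|}\;\mathrm{d}^{3}x
```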
A Closer Look at Benchmarking Self-supervised Pre-training with Image Classification
Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels. The model is forced to learn about the data's inherent structure or context by solving a pretext task. With SSL, models can learn from abundant and cheap unlabeled data, significantly reducing the cost of training models where labels are expensive or inaccessible. In Computer Vision, SSL is widely used as pre-training followed by a downstream task, such as supervised transfer, few-shot learning on smaller labeled datasets, and/or unsupervised clustering. Unfortunately, it is infeasible to evaluate SSL methods on all possible downstream tasks and objectively measure the quality of the learned representation. Instead, SSL methods are evaluated using in-domain evaluation protocols, such as fine-tuning, linear probing, and k-nearest neighbors (kNN). However, it is not well understood how well these evaluation protocols estimate the representation quality of a pre-trained model for different downstream tasks under different conditions, such as dataset, metric, and model architecture. In this work, we study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types. Our study includes 11 common image datasets and 26 models that were pre-trained with different SSL methods or have different model backbones. We find that in-domain linear/kNN probing protocols are, on average, the best general predictors for out-of-domain performance. We further investigate the importance of batch normalization for the various protocols and evaluate how robust correlations are for different kinds of dataset domain shifts. In addition, we challenge assumptions about the relationship between discriminative and generative self-supervised methods, finding that most of their performance differences can be explained by changes to model backbones.
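As a concrete example of one of the protocols studied, the sketch below implements a kNN probe on frozen features with scikit-learn. The feature-extraction step is assumed to have been done beforehand, and the choice of cosine distance and k = 20 are common defaults rather than values prescribed by the paper.

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_probe(train_feats, train_labels, test_feats, test_labels, k=20):
    """k-nearest-neighbor evaluation of frozen SSL features.

    *_feats are (N, D) arrays produced by the frozen pre-trained encoder;
    no encoder parameters are updated during this evaluation.
    """
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)   # top-1 accuracy
```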
EventEgo3D++: 3D Human Motion Capture from A Head-Mounted Event Camera
Monocular egocentric 3D human motion capture remains a significant challenge, particularly under conditions of low lighting and fast movements, which are common in head-mounted device applications. Existing methods that rely on RGB cameras often fail under these conditions. To address these limitations, we introduce EventEgo3D++, the first approach that leverages a monocular event camera with a fisheye lens for 3D human motion capture. Event cameras excel in high-speed scenarios and varying illumination due to their high temporal resolution, providing reliable cues for accurate 3D human motion capture. EventEgo3D++ leverages the LNES representation of event streams to enable precise 3D reconstructions. We have also developed a mobile head-mounted device (HMD) prototype equipped with an event camera, capturing a comprehensive dataset that includes real event observations from both controlled studio environments and in-the-wild settings, in addition to a synthetic dataset. Additionally, to provide a more holistic dataset, we include allocentric RGB streams that offer different perspectives of the HMD wearer, along with their corresponding SMPL body model. Our experiments demonstrate that EventEgo3D++ achieves superior 3D accuracy and robustness compared to existing solutions, even in challenging conditions. Moreover, our method supports real-time 3D pose updates at a rate of 140 Hz. This work is an extension of the EventEgo3D approach (CVPR 2024) and further advances the state of the art in egocentric 3D human motion capture. For more details, visit the project page at https://eventego3d.mpi-inf.mpg.de.
A Physics-Informed Deep Learning Deformable Medical Image Registration Method Based on Neural ODEs
An unsupervised machine learning method is introduced to align medical images in the context of large-deformation elasticity coupled with growth and remodeling biophysics. The technique, which stems from the principle of minimum potential energy in solid mechanics, consists of two steps. First, in the predictor step, the geometric registration is achieved by minimizing a loss function composed of a dissimilarity measure and a regularizing term. Second, the physics of the problem, including the equilibrium equations along with growth mechanics, is enforced in a corrector step by minimizing the potential energy corresponding to a Dirichlet problem, where the predictor solution defines the boundary condition and is imposed via distance functions. The features of the new solution procedure, as well as the nature of the registration problem, are highlighted by considering several examples. In particular, registration problems containing large non-uniform deformations caused by extension, shearing, and bending of multiply-connected regions are used as benchmarks. In addition, we analyze a benchmark biological example (registration of brain data) to show that the new deep learning method competes with available methods in the literature. We then apply the method to various datasets. First, we analyze the regrowth of the zebrafish embryonic fin from confocal imaging data. Next, we evaluate the quality of the solution procedure for two examples related to the brain. For one, we apply the new method to 3D image registration of longitudinal magnetic resonance images of the brain to assess cerebral atrophy, where a first-order ODE describes the volume loss mechanism. For the other, we explore cortical expansion during early fetal brain development by coupling the elastic deformation with morphogenetic growth dynamics. The method and examples show the ability of our framework to attain high-quality registration and, concurrently, solve large deformation elasticity balance equations and growth and remodeling dynamics.
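In generic notation (not the paper's own symbols), the two-step procedure can be summarized by the pair of minimization problems below: a predictor that registers the moving image I_m to the fixed image I_f, followed by a corrector that minimizes a potential energy with the predictor solution imposed as Dirichlet data. Here W denotes a stored-energy density and F the deformation gradient; both are stand-ins for whatever constitutive model is used.

```latex
% Predictor step: dissimilarity D plus regularizer R over the displacement u
\hat{\mathbf{u}} = \arg\min_{\mathbf{u}} \;
  \mathcal{D}\!\left(I_m \circ (\mathrm{id} + \mathbf{u}),\, I_f\right)
  + \lambda\, \mathcal{R}(\mathbf{u})

% Corrector step: minimum potential energy with Dirichlet data from the predictor
\mathbf{u}^{\star} = \arg\min_{\mathbf{u} = \hat{\mathbf{u}}\ \text{on}\ \partial\Omega} \;
  \Pi(\mathbf{u}),
\qquad
\Pi(\mathbf{u}) = \int_{\Omega} W\!\left(\mathbf{F}(\mathbf{u})\right)\, \mathrm{d}V
```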
Of Mice and Mates: Automated Classification and Modelling of Mouse Behaviour in Groups Using a Single Model Across Cages
Behavioural experiments often happen in specialised arenas, but this may confound the analysis. To address this issue, we provide tools to study mice in the home-cage environment, equipping biologists with the possibility to capture the temporal aspect of the individual's behaviour and model the interaction and interdependence between cage-mates with minimal human intervention. Our main contribution is the novel Global Behaviour Model (GBM) which summarises the joint behaviour of groups of mice across cages, using a permutation matrix to match the mouse identities in each cage to the model. In support of the above, we also (a) developed the Activity Labelling Module (ALM) to automatically classify mouse behaviour from video, and (b) released two datasets, ABODe for training behaviour classifiers and IMADGE for modelling behaviour.
Multimodal Machine Learning in Image-Based and Clinical Biomedicine: Survey and Prospects
Machine learning (ML) applications in medical artificial intelligence (AI) systems have shifted from traditional and statistical methods to increasing application of deep learning models. This survey navigates the current landscape of multimodal ML, focusing on its profound impact on medical image analysis and clinical decision support systems. Emphasizing challenges and innovations in addressing multimodal representation, fusion, translation, alignment, and co-learning, the paper explores the transformative potential of multimodal models for clinical predictions. It also highlights the need for principled assessments and practical implementation of such models, bringing attention to the dynamics between decision support systems and healthcare providers and personnel. Despite advancements, challenges such as data biases and the scarcity of "big data" in many biomedical domains persist. We conclude with a discussion on principled innovation and collaborative efforts to further the mission of seamless integration of multimodal ML models into biomedical practice.
Through Hawks' Eyes: Synthetically Reconstructing the Visual Field of a Bird in Flight
Birds of prey rely on vision to execute flight manoeuvres that are key to their survival, such as intercepting fast-moving targets or navigating through clutter. A better understanding of the role played by vision during these manoeuvres is not only relevant within the field of animal behaviour, but could also have applications for autonomous drones. In this paper, we present a novel method that uses computer vision tools to analyse the role of active vision in bird flight, and demonstrate its use to answer behavioural questions. Combining motion capture data from Harris' hawks with a hybrid 3D model of the environment, we render RGB images, semantic maps, depth information and optic flow outputs that characterise the visual experience of the bird in flight. In contrast with previous approaches, our method allows us to consider different camera models and alternative gaze strategies for the purposes of hypothesis testing, allows us to consider visual input over the complete visual field of the bird, and is not limited by the technical specifications and performance of a head-mounted camera light enough to attach to a bird's head in flight. We present pilot data from three sample flights: a pursuit flight, in which a hawk intercepts a moving target, and two obstacle avoidance flights. With this approach, we provide a reproducible method that facilitates the collection of large volumes of data across many individuals, opening up new avenues for data-driven models of animal behaviour.
A Likelihood Ratio-Based Approach to Segmenting Unknown Objects
Addressing the Out-of-Distribution (OoD) segmentation task is a prerequisite for perception systems operating in an open-world environment. Large foundational models are frequently used in downstream tasks; however, their potential for OoD segmentation remains mostly unexplored. We seek to leverage a large foundational model to achieve a robust representation. Outlier supervision is a widely used strategy for improving the OoD detection of existing segmentation networks. However, current approaches for outlier supervision involve retraining parts of the original network, which is typically disruptive to the model's learned feature representation. Furthermore, retraining becomes infeasible in the case of large foundational models. Our goal is to train for outlier segmentation without compromising the strong representation space of the foundational model. To this end, we propose an adaptive, lightweight unknown estimation module (UEM) for outlier supervision that significantly enhances OoD segmentation performance without affecting the learned feature representation of the original network. UEM learns a distribution for outliers and a generic distribution for the known classes. Using the learned distributions, we propose a likelihood-ratio-based outlier scoring function that fuses the confidence of UEM with that of the pixel-wise segmentation inlier network to detect unknown objects. We also propose an objective to optimize this score directly. Our approach achieves a new state-of-the-art across multiple datasets, outperforming the previous best method by 5.74 average precision points while having a lower false-positive rate. Importantly, strong inlier performance remains unaffected. The code and pre-trained models are available at: https://github.com/NazirNayal8/UEM-likelihood-ratio.
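The fusion described above can be pictured as a pixel-wise log-likelihood ratio between outlier evidence and inlier evidence; the PyTorch sketch below shows one common way to write such a score. The logsumexp aggregation of inlier logits and the function name are our assumptions, and the paper's exact scoring function may differ.

```python
import torch

def likelihood_ratio_outlier_score(outlier_logit, inlier_logits):
    """Pixel-wise outlier score as a log-likelihood ratio.

    outlier_logit: (B, 1, H, W) unknown-class evidence from the UEM head
    inlier_logits: (B, K, H, W) logits of the frozen inlier segmentation net
    Higher values indicate pixels more likely to belong to unknown objects.
    """
    inlier_evidence = torch.logsumexp(inlier_logits, dim=1, keepdim=True)  # (B, 1, H, W)
    return outlier_logit - inlier_evidence
```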
