IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS

Floorplanning with I/O Assignment via Feasibility-Seeking and Superiorization Methods
Yu S, Censor Y and Luo G
The feasibility-seeking approach offers a systematic framework for managing and resolving intricate constraints in continuous problems, making it a promising avenue to explore for floorplanning problems with increasingly heterogeneous constraints. The classic legality constraints can be expressed as a union of convex sets. However, conventional projection-based algorithms for feasibility-seeking do not guarantee convergence in such situations, and their behavior is heavily influenced by the initialization. We present a quantitative property of the choice of the initial point that supports good initialization, and we analyze the occurrence of oscillation phenomena under bad initialization. In implementation, we introduce a resetting strategy aimed at effectively reducing algorithmic divergence in the projection-based method used for the feasibility-seeking formulation. Furthermore, we introduce the novel application of the superiorization method (SM) to floorplanning, which bridges the gap between feasibility-seeking and constrained optimization. The SM employs perturbations to steer the iterations of the feasibility-seeking algorithm towards feasible solutions with reduced (not necessarily minimal) total wirelength. Notably, the proposed algorithmic flow is adaptable to various constraints and variations of floorplanning problems, such as those involving I/O assignment. To evaluate the performance of Per-RMAP, we conduct comprehensive experiments on the MCNC and GSRC benchmarks. The results demonstrate that we can obtain legal floorplanning results 166× faster than the branch-and-bound (B&B) method while incurring only a 5% wirelength increase compared to the optimal results. Furthermore, we evaluate the effectiveness of the algorithmic flow that considers the I/O assignment constraints, which achieves a 6% improvement in wirelength. In addition, for soft modules, which have a larger feasible solution space, we obtain a 15% improvement in runtime compared with PeF, the state-of-the-art analytical method. Moreover, we compare our method with Parquet-4 and Fast-SA on the GSRC benchmarks, which include larger-scale instances. The results highlight the ability of our approach to maintain a balance between floorplanning quality and efficiency.
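As an illustration of the projection-based feasibility-seeking idea described in this abstract, the following minimal Python sketch alternates between projecting two rectangular modules onto a non-overlap constraint (a union of four half-spaces: A left of, right of, below, or above B) and onto the fixed-outline constraint (a convex box). It is not the authors' Per-RMAP algorithm; the function names, the two-module setup, and the stopping rule are assumptions made purely for illustration, and the resetting strategy and superiorization perturbations are omitted.

```python
# Hypothetical sketch: sequential projections for a two-module floorplan.
# Modules are described by centre coordinates (x, y) and sizes (w, h).

def project_outline(c, size, outline):
    """Project a module centre onto the convex set 'module lies inside the outline'."""
    return tuple(
        min(max(c[i], size[i] / 2.0), outline[i] - size[i] / 2.0) for i in range(2)
    )

def project_non_overlap(ca, cb, sa, sb):
    """Project onto the nearest of four half-spaces whose union means 'no overlap'."""
    need = [(sa[i] + sb[i]) / 2.0 for i in range(2)]  # required centre separation
    d = [cb[i] - ca[i] for i in range(2)]
    # gaps[k] > 0 is the amount by which half-space k is violated:
    # k=0: B right of A, k=1: B left of A, k=2: B above A, k=3: B below A.
    gaps = [need[0] - d[0], need[0] + d[0], need[1] - d[1], need[1] + d[1]]
    if min(gaps) <= 0:
        return tuple(ca), tuple(cb)  # at least one half-space already holds
    k = gaps.index(min(gaps))        # smallest violation = nearest half-space
    axis = k // 2
    sign = 1.0 if k % 2 == 0 else -1.0
    shift = [0.0, 0.0]
    shift[axis] = sign * gaps[k] / 2.0  # split the move evenly between A and B
    new_a = tuple(ca[i] - shift[i] for i in range(2))
    new_b = tuple(cb[i] + shift[i] for i in range(2))
    return new_a, new_b

def seek_feasible(ca, cb, sa, sb, outline, max_iter=100, tol=1e-9):
    """Alternate the two projections until the centres stop moving."""
    ca, cb = tuple(ca), tuple(cb)
    for _ in range(max_iter):
        na, nb = project_non_overlap(ca, cb, sa, sb)
        na = project_outline(na, sa, outline)
        nb = project_outline(nb, sb, outline)
        moved = max(abs(na[i] - ca[i]) for i in range(2))
        moved = max(moved, max(abs(nb[i] - cb[i]) for i in range(2)))
        ca, cb = na, nb
        if moved < tol:
            break
    return ca, cb

# Two 2x2 modules that initially overlap inside a 4x2 outline.
a, b = seek_feasible((1.8, 1.0), (2.2, 1.0), (2.0, 2.0), (2.0, 2.0), (4.0, 2.0))
```

Because the non-overlap set is a union rather than a single convex set, the half-space chosen at each step depends on the current point, which is one way to see why the initialization matters for this class of methods.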
CHEF: A Framework for Deploying Heterogeneous Models on Clusters with Heterogeneous FPGAs
Tang Y, Song Y, Elango N, Priya SR, Jones AK, Xiong J, Zhou P and Hu J
DNNs are rapidly evolving from streamlined single-modality single-task (SMST) models to multi-modality multi-task (MMMT) models with large variations across layers and complex data dependencies among layers. To support such models, hardware systems have also evolved to be heterogeneous, following the prevailing trend of integrating diverse accelerators into the system for lower latency. FPGAs have high computation density and communication bandwidth, can be configured with different accelerator designs, and are widely used for various machine-learning applications. However, scaling from SMST to MMMT on heterogeneous FPGAs is challenging since MMMT models have much larger layer variations, a massive number of layers, and complex data dependencies among different backbones. Previous mapping algorithms are either inefficient or oversimplified, which makes them impractical in general scenarios. In this work, we propose CHEF to enable efficient implementation of MMMT models in realistic heterogeneous FPGA clusters, i.e., deploying heterogeneous accelerators on heterogeneous FPGAs (A2F) and mapping the heterogeneous DNNs on the deployed heterogeneous accelerators (M2A). We propose CHEF-A2F, a two-stage accelerators-to-FPGAs deployment approach to co-optimize hardware deployment and accelerator mapping. In addition, we propose CHEF-M2A, which can support general and practical cases compared to previous mapping algorithms. To the best of our knowledge, this is the first attempt to implement MMMT models in real heterogeneous FPGA clusters. Experimental results show that the latency obtained with CHEF is near-optimal while the search time is 10000X less than that of exhaustively searching for the optimal solution.
EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture
Dong P, Zhuang J, Yang Z, Ji S, Li Y, Xu D, Huang H, Hu J, Jones AK, Shi Y, Wang Y and Zhou P
While Vision Transformers (ViTs) have shown consistent progress in computer vision, deploying them for real-time decision-making scenarios (< 1 ms) is challenging. Current computing platforms like CPUs, GPUs, or FPGA-based solutions struggle to meet this deterministic low-latency real-time requirement, even with quantized ViT models. Some approaches use pruning or sparsity to reduce model size and latency, but this often results in accuracy loss. To address the aforementioned constraints, in this work, we propose EQ-ViT, an end-to-end acceleration framework with novel algorithm and architecture co-design features to enable real-time ViT acceleration on the AMD Versal Adaptive Compute Acceleration Platform (ACAP). The contributions are four-fold. First, we perform in-depth kernel-level performance profiling and analysis and explain the bottlenecks of existing acceleration solutions on GPU, FPGA, and ACAP. Second, on the hardware level, we introduce a new spatial and heterogeneous accelerator architecture, the EQ-ViT architecture. This architecture leverages the heterogeneous features of ACAP, where both FPGA and artificial intelligence engines (AIEs) coexist on the same system-on-chip (SoC). Third, on the algorithm level, we create a comprehensive quantization-aware training strategy, the EQ-ViT algorithm. This strategy concurrently quantizes both weights and activations into 8-bit integers, aiming to improve accuracy rather than compromise it during quantization. Notably, the method also quantizes nonlinear functions for efficient hardware implementation. Fourth, we design the EQ-ViT automation framework to implement the EQ-ViT architecture for four different ViT applications on the AMD Versal ACAP VCK190 board, achieving an accuracy improvement of 2.4% and average speedups of 315.0x, 3.39x, 3.38x, 14.92x, 59.5x, and 13.1x over computing solutions based on the Intel Xeon 8375C vCPU, Nvidia A10G, A100, and Jetson AGX Orin GPUs, and AMD ZCU102 and U250 FPGAs, respectively. The corresponding energy efficiency gains are 62.2x, 15.33x, 12.82x, 13.31x, 13.5x, and 21.9x.
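To make the 8-bit quantization of weights and activations mentioned in this abstract concrete, the sketch below shows a quantize/dequantize round trip of the kind commonly used as the forward step in quantization-aware training. The abstract does not specify the scheme used by EQ-ViT, so symmetric per-tensor scaling, the clamp range of [-127, 127], and the function names are all assumptions for illustration only.

```python
# Hypothetical sketch: symmetric per-tensor int8 quantization (not EQ-ViT's scheme).

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid division by zero
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [x * scale for x in q]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale = quantize_int8(weights)
approx = dequantize(codes, scale)
# The rounding error of each value is at most about half of one quantization step.
max_error = max(abs(w - a) for w, a in zip(weights, approx))
```

In a training loop, the network would typically compute with `approx` in the forward pass so that the learned weights adapt to the rounding error, which is the general motivation behind quantization-aware training.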
Personalized Meta-Federated Learning for IoT-Enabled Health Monitoring
Jia Z, Zhou T, Yan Z, Hu J and Shi Y
Federated learning (FL) has been widely adopted in IoT-enabled health monitoring on biosignals thanks to its advantages in data privacy preservation. However, the global model trained with FL generally performs unevenly across subjects since biosignal data inherently exhibits complex temporal dynamics. The morphological characteristics of biosignals with the same label can vary significantly among different subjects (i.e., inter-subject variability), while biosignals with varied temporal patterns can be collected from the same subject (i.e., intra-subject variability). To address these challenges, we present the Personalized Meta-Federated learning (PMFed) framework for personalized IoT-enabled health monitoring. Specifically, in the federated learning stage, a novel momentum-based model aggregation strategy is introduced to aggregate clients' models based on domain similarity in the meta-federated learning paradigm, obtaining a well-generalized global model while speeding up convergence. In the model personalization stage, an adaptive model personalization mechanism is devised to tailor the global model to subject-specific biosignal features while preserving the learned cross-subject representations. We develop an IoT-enabled computing framework to evaluate the effectiveness of PMFed on three real-world health monitoring tasks. Experimental results show that PMFed improves detection performance in terms of F1 and accuracy by up to 9.4% and 8.7%, respectively, and reduces training overhead and throughput by up to 56.3% and 63.4%, respectively, when compared with the SOTA federated learning algorithms.
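The abstract does not give the details of PMFed's aggregation rule, so the following Python sketch only illustrates the general idea of combining similarity-weighted averaging of client models with a server-side momentum term. The function name `aggregate`, the normalization of similarity scores into weights, and the momentum coefficient `beta` are hypothetical choices, not the authors' method.

```python
# Hypothetical sketch: similarity-weighted aggregation with server momentum.
# Models are flat lists of floats; similarities are non-negative scores.

def aggregate(global_w, client_ws, similarities, velocity, beta=0.9):
    """Return the updated global model and the updated momentum buffer."""
    total = sum(similarities)
    weights = [s / total for s in similarities]  # normalize scores to sum to 1
    # Weighted average of the client models, parameter by parameter.
    avg = [
        sum(w * cw[i] for w, cw in zip(weights, client_ws))
        for i in range(len(global_w))
    ]
    # Treat the move toward the weighted average as a pseudo-gradient step.
    delta = [a - g for a, g in zip(avg, global_w)]
    new_velocity = [beta * v + d for v, d in zip(velocity, delta)]
    new_global = [g + v for g, v in zip(global_w, new_velocity)]
    return new_global, new_velocity

global_model = [0.0, 0.0]
clients = [[1.0, 1.0], [3.0, 3.0]]
velocity = [0.0, 0.0]
# The first client is assumed to be more similar to the target domain.
global_model, velocity = aggregate(global_model, clients, [3.0, 1.0], velocity)
```

With a nonzero momentum buffer, updates from earlier rounds keep contributing to later ones, which is the usual reason momentum is associated with faster convergence.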
Dynamic Radial Placement and Routing in Paper Microfluidics
Potter J, Grover WH and Brisk P
The low cost, simplicity, and ease of use of paper microfluidic devices have made them valuable medical diagnostics for applications from pregnancy testing to COVID-19 screening. Meanwhile, the increasing complexity of paper-based microfluidic devices is driving the need for new tools and methodologies that enable more robust biological diagnostics and potential therapeutic applications. A new design framework is being used to facilitate both research and fabrication of paper-based microfluidic biological devices, accelerating the investigative process and reducing material utilization and manpower. In this work, we present a methodology for this framework that dynamically places and routes microfluidic components in a nondiscrete design space. The methodology accounts for and optimizes fluid volume usage, surface area utilization, and the timing required to perform specified biological assays, while also accelerating the development of potentially lifesaving new devices.