ACM International Conference on Supercomputing 2026

6-9 July 2026 Belfast, Northern Ireland, United Kingdom

ICS 2026 Accepted Poster Submissions

Generated: 2026-06-11

Total accepted submissions: 20

1. A New Lightweight Cryptosystem for IoT in Smart City Environments

Author(s): Firas Hazzaa; Md Mahmudul Hasan, Akram Qashou, Sufian Yousef

Abstract: Internet of Things (IoT) devices, user interfaces (UI), software, and communication networks are all deployed within the Smart City topology. A security approach designed for the IoT should be able to prevent and detect both internal and external attacks. The problem in IoT networks is that not every linked node or device has an adequate amount of processing power. This means that data encryption and other related activities may be infeasible on such devices, so security of any kind must be lightweight. A trustworthy security solution that stops illegal access to private data on the network is necessary for maintaining the privacy of information on the Internet of Things. Cryptographic processes need to be quicker and more compact without sacrificing security. The aim of this study is to reduce the execution time and power consumption of encryption processes without compromising the complexity of the encryption algorithm. This research presents a new lightweight cryptographic technique to protect various multimedia and real-time traffic across IoT networks by using two S-boxes in the SubByte step of the encryption process, without affecting its performance. In this study, different audio samples are used to test the efficiency of the new algorithm. Compared with the most advanced standard algorithm, the suggested method reduces the execution time and energy consumption of the cryptographic process while maintaining the required security level. The outcomes demonstrate good performance in terms of power usage and delay. The new technique consumed roughly 0.2 µJ for the encryption process, while the typical AES algorithm consumed 0.29 µJ; this means the new algorithm achieved 33% power savings while maintaining a good complexity level (security) within the encryption process, according to the results in Tables I and II and the comparison in Table III.
The novelty of this work lies in the dual XOR S-box technique, which increases the complexity of the SubByte process and makes it more secure without overloading the processing performance, together with the reduction in encryption rounds, which enhances performance without compromising security. This makes the technique better suited to the Internet of Things (IoT) used in smart city environments.


2. Decentralized Learning with Communication-Efficient Learned Gradient Sketches

Author(s): Zehua Cheng; Zehua Cheng, Wei Dai (FLock.io), Jiahao Sun (FLock.io)

Abstract: The proliferation of Internet of Things (IoT) devices and edge computing nodes has catalyzed a shift from centralized to decentralized machine learning, where agents collaborate to train a global model without a central parameter server. This paradigm is critical for applications requiring privacy preservation and massive scalability, such as autonomous vehicle fleets and collaborative healthcare diagnostics [1]. However, as these networks scale, the communication bottleneck, exacerbated by the high dimensionality of modern deep neural networks, becomes the primary impediment to performance, potentially leading to compromised model quality or deployment failures. While decentralized algorithms remove the central point of failure, they introduce a tension between achieving consensus across a bandwidth-constrained network and maintaining model quality.


3. LLM-Based Scheduling for Energy-Efficient Heterogeneous Computing in Industrial IoT

Author(s): Yuehua Liu; Wenjin Yu (Shanghai United Imaging Healthcare Advanced Technology Research Institute Co., Ltd.; Institute for Medical Imaging Technology (IMIT), Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, China.)

Abstract: Energy consumption and cost constraints have become critical concerns in Industrial Internet of Things (IIoT) networks and smart manufacturing environments. Heterogeneous edge-fog-cloud IIoT architectures promise low-latency intelligent services, but their increasing deployment presents significant challenges for task orchestration, driven by the inherent heterogeneity of computing devices and the dynamic nature of workloads and resource availability. Traditional scheduling approaches often fail to adapt to these complexities, resulting in suboptimal performance, uneven energy consumption, and limited scalability. In this work, we first recharacterize the task orchestration problem by identifying two key dimensions: vertical heterogeneity of devices and horizontal dynamics of workloads. To address these challenges, we present a two-fold solution. First, we propose a three-tier edge-fog-cloud heterogeneous computing architecture to efficiently manage CPU and GPU resources for IIoT. Second, we introduce an LLM-based energy-efficient scheduler agent that leverages a prompt-to-Bash generation paradigm for dynamic, energy-aware task scheduling across these heterogeneous computing resources. Extensive evaluations demonstrate that our approach achieves superior energy efficiency and improved system throughput compared to state-of-the-art methods. This study highlights the potential of integrating LLMs into resource orchestration frameworks to enable adaptive, high-performance, and energy-aware computing in heterogeneous environments.


4. Taskflow: A General-purpose Task-parallel Programming System

Author(s): Tsung-Wei Huang

Abstract: The Taskflow project addresses a long-standing challenge: “How can we make it easier for C++ developers to write complex parallel algorithms leveraging manycore CPUs and GPUs?” Taskflow introduces a simple yet powerful task graph programming model that enables efficient implementations of parallel decomposition strategies. Our model supports both static and dynamic task graph construction, allowing developers to express a wide range of computational patterns with ease. Taskflow also features an efficient work-stealing scheduler that dynamically balances workloads to ensure optimal utilization of available workers across both CPU and GPU tasks. We have applied Taskflow to accelerate many EDA applications. For instance, we have successfully used Taskflow to accelerate various static timing analysis tasks on both CPUs and GPUs, achieving more than 10x speedup over existing timing tools. Other users include AMD, Nvidia, Tesseract Robotics, ModuleWorks, Xanadu, etc. We believe Taskflow is highly relevant to the HPC community as it tackles one of the most persistent challenges in modern systems programming: expressing and managing parallelism in a way that is both scalable and easy to reason about. At the time of this abstract submission, Taskflow has recorded over 20K weekly downloads, with the source available at https://taskflow.github.io/.


5. The XOR State Channel: A Two-Word Single-Writer Notification Primitive for Multi-Lane Host–Worker Coordination

Author(s): Antony Vladimir Montemayor Terrazas

Abstract: We formalize and characterize the XOR state channel (XSC), a two-word single-writer notification primitive used in the knitting runtime for coordinating work dispatch and result collection across 32 independent lanes between a host thread and a worker thread. This report is a formalization of a specific, existing signaling protocol, not a claim to a broadly novel synchronization primitive, and its contributions are definitional precision, formal invariant proofs, and design-space characterization. The XSC is not a lock: it provides no mutual exclusion, carries no payload, and does not serialize execution. Its sole function is to answer in O(1) (for the single-word, single-CLZ design) which of N independent lanes carry an unacknowledged state change, a property analogous to edge-triggered I/O readiness notification (epoll) applied to shared-memory channels. The primitive maintains two independent UInt32 control words, one owned by each endpoint, and derives joint lane state as their bitwise XOR: S = A ⊕ B. This representation eliminates all atomic read-modify-write operations on the control plane and achieves zero write-write coherence transfers on the control plane between cores. We establish three formal invariants (write-domain disjointness, state transition exclusivity, and conservative sender-side staleness) and characterize the protocol’s resource footprint within the class of N-lane, single-writer-each-side, fixed-width bitmask notification protocols. The conservative staleness invariant proves that stale observation of the receiver-owned word can delay lane reuse but cannot cause premature reuse: deferred acknowledgment batching is safe by construction at any batch size. Conditional correctness (P1). All invariants proved here assume Precondition P1: each lane carries at most one unacknowledged transition at a time. P1 is not enforced by the XOR encoding itself and is a load-bearing assumption for every guarantee in this report.
The protocol cannot distinguish one outstanding transition from three on the same lane; readers should not import guarantees beyond what P1 permits.
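The XOR derivation S = A ⊕ B described in this abstract can be illustrated with a minimal single-threaded sketch. It assumes a toggle convention that the abstract does not spell out: the sender flips a lane's bit in its own word to signal a change, and the receiver flips the same bit in its own word to acknowledge. The names `pending_lanes` and `highest_pending_lane` are hypothetical, and the sketch does not model memory ordering or cross-core visibility.

```python
MASK32 = 0xFFFFFFFF  # the abstract specifies two UInt32 control words


def pending_lanes(a, b):
    """Return the lane indices whose bits differ between the two words."""
    s = (a ^ b) & MASK32          # joint lane state S = A xor B
    lanes = []
    while s:
        low = s & -s              # isolate the lowest set bit
        lanes.append(low.bit_length() - 1)
        s ^= low                  # clear that bit and continue
    return lanes


def highest_pending_lane(a, b):
    """Single-word query; bit_length() - 1 plays the role of 31 - CLZ(S)."""
    s = (a ^ b) & MASK32
    return s.bit_length() - 1 if s else None


# Assumed convention: each side only ever writes its own word.
host_word = 0
worker_word = 0

host_word ^= (1 << 3) | (1 << 17)   # host signals lanes 3 and 17
signalled = pending_lanes(host_word, worker_word)

worker_word ^= 1 << 3               # worker acknowledges lane 3
remaining = pending_lanes(host_word, worker_word)
```

Because each endpoint writes only its own word, neither side needs a read-modify-write on shared state. Note that toggling the same lane twice before an acknowledgment returns the bit to its original value, which is one way to see why Precondition P1 matters.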


6. Beyond BFS: A Comparative Study of Rooted Spanning Tree Algorithms on GPUs

Author(s): Srikar Donur; Abhijeet Sahu, Indian Institute of Technology Tirupati; Dr. Subhajit Sahu, SRM University-AP Andhra Pradesh

Abstract: Rooted spanning trees (RSTs) are a core primitive in parallel graph analytics, underpinning algorithms such as biconnected components and planarity testing. On GPUs, RST construction has traditionally relied on breadth-first search (BFS) due to its simplicity and work efficiency. However, BFS incurs an O(D) step complexity, which severely limits parallelism on high-diameter and power-law graphs. We present a comparative study of alternative RST construction strategies on modern GPUs. We introduce a GPU adaptation of the Path-Reversal RST (PR-RST) algorithm, optimizing its pointer-jumping and broadcast operations for modern GPU architectures. In addition, we evaluate an integrated approach that combines a connectivity framework (GConn) with Eulerian tour–based rooting.

Across more than 10 real-world graphs, our results show that the GConn-based approach achieves up to 300x speedup over optimized BFS on high-diameter graphs. These findings indicate that the O(log n) step complexity of connectivity-based methods can outweigh their structural overhead on modern hardware, motivating a rethinking of RST construction in GPU graph analytics.

A separate challenge emerges from scale: all prior approaches assume that the input graph fits entirely within accelerator memory, an assumption that fails for real-world graphs that routinely exceed typical GPU memory capacities of 16–24 GB. We address this gap by extending the GConn-based RST approach to the out-of-memory setting, constructing the rooted spanning tree using only O(n) auxiliary space, thereby enabling RST computation on graphs that far exceed GPU memory capacity.
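The contrast between O(D) and O(log n) step complexity can be seen in generic pointer jumping, the building block named in this abstract. The sketch below is a sequential Python illustration under stated assumptions, not the PR-RST or GConn implementation: `find_roots` is a hypothetical name, the input is a parent array in which a root points to itself, and each list comprehension stands in for one synchronous parallel step.

```python
def find_roots(parent):
    """Resolve every node's root by repeatedly jumping to the grandparent.

    parent[v] is the parent of v, and a root satisfies parent[v] == v.
    Returns the root array and the number of rounds that changed a pointer.
    """
    ptr = list(parent)
    rounds = 0
    while True:
        # One synchronous step: every node reads its pointer's pointer.
        nxt = [ptr[ptr[v]] for v in range(len(ptr))]
        if nxt == ptr:
            return ptr, rounds
        ptr = nxt
        rounds += 1


# A path 0 <- 1 <- 2 <- ... <- 7, rooted at node 0 (depth 7).
path_parent = [0, 0, 1, 2, 3, 4, 5, 6]
roots, rounds = find_roots(path_parent)
```

On this 8-node path, a level-by-level traversal from the root needs 7 steps, while pointer jumping settles after 3 rounds that change a pointer, since the distance covered doubles each round.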


7. TAAgent: Bridging the Intent-Execution Gap in Heterogeneous Medical AI Clusters with LLM-Driven Orchestration

Author(s): Yan Wang

Abstract: Medical imaging and AI workloads in healthcare are often executed on on-premises heterogeneous clusters because data sensitivity limits cloud usage. In such environments, researchers frequently overestimate resource needs, creating an intent-execution gap in which over-provisioning increases contention and reduces overall cluster efficiency. We present TAAgent, a hierarchical orchestration framework that uses large language models (LLMs) to recommend feasible resource requests for submitted workloads. TAAgent combines historical task profiles with real-time cluster telemetry to adapt resource recommendations to changing system conditions. The framework includes a centralized routing module for rapid workload characterization and specialized modules for CPU-bound and accelerator-bound workloads. To ground LLM reasoning in physical resource limits, TAAgent incorporates structured telemetry directly into the decision process and applies an iterative verification step before issuing resource request recommendations. Preliminary results on representative medical imaging and AI workloads running on a production heterogeneous NVIDIA GPU cluster comprising NVIDIA H20, H800, and A100 GPUs show that TAAgent improves per-task resource efficiency while reducing over-provisioning without sacrificing request feasibility.


8. GPU Algorithms for Biconnected Components on Large Graphs

Author(s): Abhijeet Sahu; Srikar Donur, Indian Institute of Technology Tirupati; Dr. G Ramakrishna, Indian Institute of Technology Tirupati

Abstract: Efficient algorithms for graph biconnectivity are critical across diverse domains, including medical diagnostics and chip design. Recent work has explored accelerator-based approaches to handle large-scale graphs by leveraging parallelism. However, these methods are fundamentally limited by device memory capacity, rendering them ineffective in the Out-of-Memory (OOM) setting. This challenge necessitates a rethinking of both algorithm design and implementation.

In this work, we introduce a novel parallel edge-pruning technique that significantly reduces the memory footprint of biconnected component (BCC) decomposition. Unlike prior approaches that depend on constructing a computationally expensive BFS tree, we prove that certain non-tree edges with respect to an arbitrary spanning tree can be safely pruned without affecting biconnectivity. This insight enables the construction of an efficient O(n)-size skeleton while avoiding BFS overhead.

Building on this foundation, we propose two GPU-based parallel algorithms: FAST-BCC-FILTER for in-memory computation and PEA-BCC for the out-of-core setting. We evaluate our methods on 30 real-world datasets. For graphs that exceed device memory, our approach achieves an average speedup of 2.7x over the best CPU baseline. For graphs that fit in memory, we achieve a 71x average speedup over the state-of-the-art GPU baseline.


9. Heterogeneous Computing Workflow Accelerates End-to-end Medical Image AI Training

Author(s): Meilin Quan; Wenjin Yu, Shanghai United Imaging Healthcare Advanced Technology Research Institute Co., Ltd., Shanghai, China

Abstract: Medical image preprocessing is often a prerequisite for downstream model training and can become a major determinant of overall development efficiency. However, as dataset size increases, preprocessing frequently becomes the dominant bottleneck of the end-to-end pipeline, and its execution time often exceeds that of model training itself. Existing medical image frameworks, such as nnUNet and MONAI, provide preprocessing modules and support multi-process execution, but they do not support distributed preprocessing across different nodes. Moreover, even if such preprocessing can be achieved, overall efficiency cannot be fully improved without an integrated end-to-end workflow that seamlessly links distributed preprocessing and model training. To address this gap, we propose an integrated end-to-end workflow for medical image AI development, in which distributed CPU nodes perform parallel medical image preprocessing and continuously supply processed data to a GPU server for model training. On a real pathologically confirmed pancreatic ductal adenocarcinoma CTA dataset comprising 334,039 CT slices, the proposed workflow reduced preprocessing time from 923.5 s to 18 s (51.3× speedup) and shortened end-to-end training time from 20.8 h to 8.9 h (2.33× speedup). Moreover, these efficiency gains did not come at the cost of downstream task effectiveness, as test Dice/IoU improved from 0.64/0.50 to 0.67/0.54. These results show that our framework effectively improves the efficiency of preprocessing, a major pipeline bottleneck, and provides a practical integrated paradigm for efficient large-scale medical image AI development.


10. TAIN: A Tensor-Aware High-Performance Framework for Quantum Chemistry on Heterogeneous Architectures

Author(s): Wenhao Liang; Yingjin Ma (Computer Network Information Center, Chinese Academy of Sciences), Zhong Jin (Computer Network Information Center, Chinese Academy of Sciences)

Abstract: Accurately computing dynamic correlation energy (DCE) is essential for describing the electronic behavior and related properties of strongly correlated systems in quantum chemistry. However, DCE calculations require extensive tensor contractions that dominate the overall computational cost and form the primary performance bottleneck. These tensor-intensive workloads would, in principle, benefit from the high throughput provided by the low-precision tensor units in modern heterogeneous architectures. Yet DCE workloads do not align well with these hardware capabilities. This mismatch arises from three key challenges: 1) strict precision requirements, 2) irregular workloads, and 3) limited scalability. To address these challenges, we propose TAIN, a tensor-based optimization framework for DCE computing. TAIN employs an orbital-aware tensor compression algorithm to identify that over 80% of tensor contractions have negligible impact on the final energy and reduces their cost through low-rank approximation, while low-precision quantization techniques effectively exploit GPU tensor units without compromising accuracy. For irregular workloads, TAIN introduces a heterogeneity-aware scheduling strategy that retains communication-bound tensors on CPUs and batches other tensor contractions on GPUs. TAIN further reduces redundant work through pruning and pipelined execution, and achieves load balancing through a decoupled task–load partitioning strategy. Experimental results show that TAIN achieves up to 36.7× speedup over the state-of-the-art library on a single node. TAIN reaches 97% strong and 93% weak scaling on 4,000 nodes of the ORISE supercomputer, reducing the time-to-solution for the bioluminescence system from days to a single hour. These results suggest that TAIN provides a practical reference for accelerating tensor-intensive scientific workloads on modern heterogeneous architectures.


11. Harnessing MPI mutations for AI error detection

Author(s): Asia AUVILLE; Tim Jammer (Technische Universität Darmstadt), Eric Petit (Intel), Pablo de Oliveira Castro (University of Versailles), Emmanuelle Saillard (Inria), Mihail Popov (Inria)

Abstract: MPI errors are challenging to identify despite the significant number of expert verification tools. Dynamic tools (i.e., requiring profiling) are computationally expensive and accurate in error detection, whereas static analysis (i.e., operating at source code or compilation) is computationally cheap but less accurate. Interestingly, the recent success of AI and LLMs offers an alternative to increase static analysis accuracy while preserving its low overhead. Yet current methods remain difficult to benchmark, too general, and poorly adapted to the specific challenges of high-performance computing.

In this paper, we investigate how AI-powered tools can efficiently and accurately detect errors in real-world MPI applications. We propose a novel MPI Mutated Dataset (MMD), constructed from MPI programs extracted from thousands of open-source GitHub projects. After sorting and filtering these files, we inject errors that realistically emulate developers’ mistakes using synthetic code mutations. We leverage the dataset to train different AI models and assess their generalization capabilities against standard verification tools. We train compact multilayer perceptrons (MLP) with AST-T5 embeddings (i.e., an encoder-only transformer network) to detect diverse MPI errors (from the MPI-BugBench (MBB) classification), including CallOrdering or InvalidParameter, achieving detection accuracies up to 92.0%. In some cases, our method outperforms the expert-prompted Qwen-3 Coder LLM and provides competitive results compared to established dynamic verification tools. We further evaluate Claude Sonnet 4.5 across different prompting strategies, showing that engineered prompts improve InvalidParam detection accuracy from 84.8% to 93.6%, surpassing the dynamic tool ITAC (87.0%). The full agentic pipeline achieves 86.4% overall multi-class accuracy and 88.5% bug detection recall on MBB. This paves the way for large-scale verification efforts that can use our dataset as a foundation to thoroughly investigate each MPI error in real-world applications.


12. Quantum-Powered Computational Multiphysics and Multiscale Modeling and Simulation

Author(s): Mohamed Labadi; Samir Abdelmalek (University of Chlef and National Higher School for Nanosciences and Nanotechnologies, Algeria), Abdelkader Krimi (Ecole Polytechnique de Montreal - University of Montreal, Canada)

Abstract: Quantum computing is emerging as a transformative computational paradigm with the potential to fundamentally reshape how complex multiphysics and multiscale systems are modeled and simulated. Classical high-performance computing has enabled remarkable progress in engineering simulation, yet the cost of increasing fidelity grows rapidly with spatial resolution, temporal resolution, and the number of interacting physical processes, a challenge that is particularly acute in turbulent flow, combustion, fluid-structure interaction, coupled thermo-mechanical systems, and computational aero-acoustics. Recent advances in quantum algorithms for partial differential equations, linear systems, and hybrid quantum-classical solvers suggest that some of these computational bottlenecks may be reduced through quantum acceleration. In this paper, we examine the potential of quantum computing to address these limitations from a computational science and engineering perspective, spanning finite element analysis, computational fluid dynamics, combustion modeling, and multiscale simulation. We discuss how quantum methods, combined with appropriate algorithms, may enable unprecedented gains in accuracy, speed, and problem complexity, eventually accelerating the design, testing, and development of complex engineered systems such as aircraft, spacecraft, and propulsion systems. Critically, we argue that practical progress depends not on asymptotic speedups alone, but on how quantum methods are integrated into realistic scientific workflows, including discretization, preconditioning, model reduction, and error control. Our work proposes a paradigm shift toward application-aware quantum scientific computing, where algorithms are designed with the structure of engineering models in mind, and outlines the opportunities and challenges that lie on this path.


13. DesertHPC: Toward Net-Zero Supercomputing Through Saharan Renewable Energy and Advanced Cooling Architectures

Author(s): Mohamed Labadi; Samir Abdelmalek, University of Chlef and National Higher School for Nanosciences and Nanotechnologies (Algeria)

Abstract: The accelerating computational demands of high-performance computing (HPC) and artificial intelligence (AI) have intensified global concerns over energy consumption, carbon emissions, and thermal management. Leadership-class supercomputers and large-scale AI workloads, including those underpinning large language models, place severe pressure on existing energy and cooling infrastructures, challenging the long-term sustainability of these technologies. This poster introduces the DesertHPC initiative, a visionary framework that proposes leveraging the unique environmental, geological, and climatic characteristics of Algeria’s vast Saharan desert regions to establish globally accessible, energy-efficient, and net-zero supercomputing and AI ecosystems. The poster examines four interconnected pillars of the DesertHPC concept. First, it discusses the energy landscape of modern HPC and AI systems, framing the urgency for decarbonization strategies. Second, it explores the untapped potential of the Algerian Sahara as a strategic HPC hosting environment, highlighting its exceptional solar irradiation levels, significant wind energy resources, pronounced diurnal temperature variations, the world’s largest fossil groundwater reserves, and thermally stable subterranean geology. Third, the poster presents a portfolio of advanced cooling architectures envisioned for desert-based HPC facilities (immersion cooling, geothermal cooling, and hybrid solar-assisted thermal management systems), each evaluated for suitability, efficiency, and environmental impact under arid conditions. Fourth, it outlines a decarbonization roadmap aligned with global net-zero sustainability goals, demonstrating how integrated renewable energy and AI-driven efficiency optimization can substantially reduce the carbon footprint of computational infrastructures.
By repositioning underutilized desert regions as strategic assets for sustainable computing, DesertHPC offers a replicable model for green HPC deployment in arid zones worldwide. This poster aims to stimulate discussion on the intersection of geographic opportunity, Saharan renewable energy, advanced cooling engineering, and the future of net-zero supercomputing.


14. HPC-Enabled Evaluation of SARS-CoV-2 Vaccine Responses Across Platforms, Variants, and HIV Status in African Cohorts

Author(s): Shannon Ramkistan; Afrah Khairallah1,2; Shannon Ramkistan1; Kaelo Seatla3; Jumari Snyman1,2,9; Victoria Kasprowicz1; Alex Sigal1,4; Modisa Motswaledi5; Simani Gaseitsiwe3,6; Mosepele Mosepele3,7; Sikhulile Moyo3,6,8,9; Thumbi Ndung’u1,2,9,10,11;

1 Africa Health Research Institute, Durban, South Africa.
2 School of Medicine, University of KwaZulu-Natal, Durban, South Africa.
3 Botswana Harvard Health Partnership, Gaborone, Botswana.
4 The Lautenberg Center for Immunology and Cancer Research, Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel.
5 Department of Medical Laboratory Sciences, School of Allied Health Sciences, Faculty of Medicine and Health Sciences, University of Botswana, Gaborone, Botswana.
6 Department of Immunology and Infectious Diseases, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, USA.
7 Department of Internal Medicine, School of Medicine, University of Botswana, Gaborone, Botswana.
8 Department of Pathology, Division of Medical Virology, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town, South Africa; School of Health Systems and Public Health, University of Pretoria, Pretoria, South Africa.
9 HIV Pathogenesis Programme, The Doris Duke Medical Research Institute, University of KwaZulu-Natal, Durban, South Africa.
10 Ragon Institute of Massachusetts General Brigham, Massachusetts Institute of Technology, and Harvard University, Cambridge, MA, USA.
11 Institute of Infection, Immunity and Transplantation, University College London, London, UK.

Abstract: The SARS-CoV-2 pandemic prompted the rapid development of several vaccine platforms. In Africa, where the burden of coinfections such as HIV is high and vaccine rollout lagged, comparative studies of SARS-CoV-2 vaccine immunogenicity and durability remain limited. The evaluation of immunogenicity, durability, and cross-variant neutralisation requires the integration of laboratory immunology with high-performance computing (HPC) enabled bioinformatics workflows to analyse high-dimensional, longitudinal data generated across diverse cohorts, time points, and viral variants. Here, we assessed neutralizing antibody magnitude and durability in 255 participants from Botswana vaccinated with Pfizer BNT162b2 (mRNA), Janssen Ad26.COV2.S (adenoviral vector), AstraZeneca ChAdOx1-S (chimpanzee adenoviral vector), or Sinovac (inactivated virus). Neutralization was quantified using live virus neutralisation assays against representative SARS-CoV-2 variants, including the ancestral D614G strain and a recently emergent JN.1 sub-variant. Participants included people living with HIV (PLWH) on antiretroviral therapy and HIV-negative individuals. HPC-enabled analytical pipelines were used to manage and analyse large, multi-dimensional immunological datasets, including live-virus neutralization titre data together with participant-level clinical and immunological metadata. Analyses were stratified by vaccine platform, time post-vaccination, variant, and HIV status. Within three months of vaccination, BNT162b2 elicited the highest neutralizing antibody titres, significantly exceeding responses induced by ChAdOx1-S and Sinovac (P=0.03). However, beyond three months, BNT162b2 showed a more pronounced decline in neutralization compared with other platforms, indicating faster waning. Across all vaccine types, neutralization against the JN.1 variant was markedly reduced relative to D614G, often falling below detection limits.
No significant differences were observed by HIV status or sex. These findings underscore platform-specific trade-offs between peak immunogenicity and durability and demonstrate how HPC-enabled analysis of immunological data is essential for resolving complex interactions between vaccine technology, viral evolution, host factors, and time. This work highlights the importance of scalable computational infrastructure for vaccine evaluation, enabling rapid, data-driven insights across intersecting epidemics.


15. Wave-Based Dispatch for Circuit Cutting in Hybrid HPC–Quantum Systems

Author(s): Ricard Santiago Raigada García; Sergio Iserte Agut, Barcelona Supercomputing Center (BSC)

Abstract: Hybrid High-performance Computing (HPC)–quantum workloads based on circuit cutting decompose large quantum circuits into independent fragments, but existing frameworks tightly couple cutting logic to execution orchestration, preventing HPC centers from applying mature resource management policies to Noisy Intermediate-Scale Quantum (NISQ) workloads. We present DQR (Dynamic Queue Router), a runtime framework that bridges this gap by treating circuit fragments as first-class schedulable units. The framework introduces a backend-agnostic fragment descriptor to expose structural properties without requiring execution layers to parse quantum code, a wave-based coordinator that achieves pipeline concurrency via non-blocking polling, and a production-ready implementation on the CESGA Qmio supercomputer integrating both local on-premises (Qmio) and remote cloud (IBM Torino) QPU backends. Experiments on a 32-qubit Hardware-Efficient Ansatz (HEA) circuit demonstrate not only makespan improvements over a monolithic CPU baseline but also transparent per-fragment failover recovery without pipeline restart, specifically by rerouting tasks from the local QPU to classical simulators upon encountering hardware-level incompatibilities. For deeper circuits, the coordination residual accounts for only 5% of the total execution time, highlighting the framework’s scalability. These results show that DQR enables HPC centers to integrate NISQ workloads into existing production infrastructure while preserving the flexibility to adopt improved cutting algorithms or heterogeneous backend technologies.


16. Context-sensitive Floating-point Round-off Error Analysis

Author(s): Charlie Joseph Keaney; Hans Vandierendonck, Queen’s University Belfast

Abstract: Knowing the worst-case amount of round-off error present in the outputs of a floating-point program is safety critical in fields such as vehicle engineering and defense, and is necessary to ensure the accuracy of floating-point computed data for evidence-based decision-making. However, finding an upper bound for the amount of round-off error in a floating-point function is a highly complex problem due to the properties of floating-point arithmetic. Floating-point arithmetic is not associative or distributive, and floating-point round-off error can propagate and accumulate pathologically. One widely overlooked property of floating-point arithmetic is that round-off error is context-dependent, i.e., entirely dependent on the frame of reference. The same floating-point function or operation, depending on context, can result in wildly different quantities of round-off error from the frame of reference of the output. Present floating-point round-off error models are not designed to detect this, as they only consider floating-point functions and operations in isolation. Worse, present models typically consider each operation to occur in a worst-case context repeatedly, resulting in excessively pessimistic estimates of round-off error. We offer a distilled explanation that context-sensitive error bounds can be understood as identifying an ‘alignment’ between real functions that leads to a ‘subsumption’ phenomenon that partially or wholly negates error. We put forward the idea of a context-sensitive floating-point round-off error analysis that leverages alignment to obtain tighter error bounds and discuss our groundwork research into a mathematical framework that facilitates this analysis by generating alignment-proving inequalities. This framework can be used both for manual numerical analysis of floating-point functions and for developing and improving automated static floating-point error analysis tools.
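Two of the properties mentioned in this abstract can be observed directly in IEEE 754 double precision. The sketch below is only a small illustration using Python floats; it is not the authors' framework, and the second half shows just one simple way an earlier rounding error can be absorbed by a later operation, which may or may not match the 'subsumption' notion the abstract refers to. All variable names are hypothetical.

```python
# 1. Non-associativity: regrouping the same three operands changes the result.
left_grouped = (0.1 + 0.2) + 0.3
right_grouped = 0.1 + (0.2 + 0.3)
is_associative_here = left_grouped == right_grouped

# 2. Context dependence: the sum 0.1 + 0.2 differs from the double nearest 0.3,
#    so it carries a visible rounding error when it is itself the output.
x = 0.1 + 0.2
error_visible_alone = x != 0.3

#    When the same value feeds an addition with a much larger operand, the
#    final rounding absorbs that small discrepancy and both paths agree.
error_absorbed_in_context = (x + 1e6) == (0.3 + 1e6)
```

An analysis that charges the worst-case error of `0.1 + 0.2` to every use would count an error in the second context that never reaches the output, which is the kind of pessimism the abstract describes.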


17. Faster Biobank-Scale Genomics through Sparse Linear Algebra on Genotype Representation Graph

Author(s): Yifan Li; Qingyao Sun, Drew DeHaas, April Wei, Giulia Guidi (Cornell University)

Abstract: Recent advances in whole-genome sequencing have generated massive genetic datasets. Biobanks such as the UK Biobank and All of Us (US) have collected genotype information from hundreds of thousands of individuals. The recently released UK Biobank 500k dataset comprises 490,000 individuals and 700 million genetic variants.

Common matrix-based formats such as VCF store each variant independently, ignoring shared ancestry, and core analyses such as genome-wide association studies (GWAS) and principal component analysis (PCA) remain slow and memory-intensive at this scale, even when exploiting genotype matrix sparsity. Recently, the Genotype Representation Graph (GRG) addressed this by encoding shared ancestral structure in a multi-tree, replacing explicit matrix-vector multiplication with graph traversal, and achieving up to 50× runtime reduction. However, a key limitation persists: the traversal is inherently sequential, and the current implementation is single-threaded, leaving the parallelism of modern hardware unexploited.

In this work, we address this challenge by reformulating multi-tree traversal as a Sparse Triangular Solve (SpTRSV), which can be further decomposed into a height-ordered pipeline of Sparse Matrix-Vector multiplications (SpMVs). This reformulation converts an irregular graph computation into a structured linear algebra workload, enabling the use of optimized sparse solver libraries and GPU acceleration. On biobank-scale data, our approach achieves a 1.8× speedup on CPU and over 30× on GPU compared to the existing GRG implementation, substantially reducing the time required for large-scale genetic analyses and opening new possibilities for computational genomics methods.
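
The reformulation can be sketched on a toy graph. The sketch assumes an upward pass in which each node's value is its own input plus the sum of its children's values, i.e. a solve of (I - A) x = b with A strictly lower triangular when children are numbered before parents. The graph, the function names, and the list-based storage are hypothetical; a real implementation would use sparse-matrix libraries rather than Python lists.

```python
def node_heights(children):
    """Height of each node: 0 for leaves, else 1 + max child height.

    Assumes nodes are numbered so that every child index is smaller
    than its parent's index (a topological order).
    """
    heights = [0] * len(children)
    for v, ch in enumerate(children):
        if ch:
            heights[v] = 1 + max(heights[c] for c in ch)
    return heights


def solve_sequential(children, b):
    """Forward substitution for (I - A) x = b, one row at a time."""
    x = list(b)
    for v, ch in enumerate(children):
        x[v] = b[v] + sum(x[c] for c in ch)
    return x


def solve_by_levels(children, b):
    """Same solve, grouped by height: one independent batch per level."""
    heights = node_heights(children)
    levels = [[] for _ in range(max(heights) + 1)]
    for v, h in enumerate(heights):
        levels[h].append(v)

    x = list(b)
    for rows in levels[1:]:
        # Rows in one level depend only on strictly lower levels, so this
        # batch is a sparse matrix-vector product restricted to `rows`
        # and could be evaluated in parallel.
        updates = [b[v] + sum(x[c] for c in children[v]) for v in rows]
        for v, value in zip(rows, updates):
            x[v] = value
    return x


# Toy multi-tree: nodes 0-2 are leaves, node 1 is shared by parents 3 and 4.
children = [[], [], [], [0, 1], [1, 2], [3, 4]]
b = [1, 2, 3, 0, 0, 0]
print(solve_by_levels(children, b))
```

Grouping rows by height trades the strict row-by-row dependency of the sequential solve for a short chain of batches, which is what makes the workload suitable for parallel hardware.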


18. HPC-enabled identification of host proteins associated with strong SARS-CoV-2 neutralising antibody responses

Author(s): Afrah Khairallah1, Zesuliwe Jule1, Alice Piller1, Mallory Bernstein1, Kajal Reedoy1, Yashica Ganga1, Sashkia R. Balla2, Prudence Kgagudi2, Bernadett I. Gosnell3, Farina Karim1, Yunus Moosa3, Thumbi Ndung’u1,4,5,6, Thandeka Moyo-Gwete2,7, Penny L. Moore2,7, Khadija Khan1, Alex Sigal1,8

1Africa Health Research Institute and University of KwaZulu-Natal, Durban, South Africa.
2National Institute for Communicable Diseases of the National Health Laboratory Service, Johannesburg, South Africa.
3Department of Infectious Diseases, Nelson R. Mandela School of Clinical Medicine, University of KwaZulu-Natal, Durban, South Africa.
4HIV Pathogenesis Programme, University of KwaZulu-Natal, Durban, South Africa.
5Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology and Harvard University, Cambridge, MA, USA.
6Division of Infection and Immunity, University College London, London, UK.
7SAMRC Antibody Immunity Research Unit, School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa.
8The Lautenberg Center for Immunology and Cancer Research, Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel.

Abstract: High-performance computing (HPC) is an essential resource for health research, enabling the analysis and integration of large-scale, high-dimensional biological data. Such data are important for identifying biomarkers, immune correlates of protection, and mechanisms underlying disease. In African research settings focused on highly diverse cohorts with a substantial burden of endemic infectious diseases, scalable HPC-enabled bioinformatics workflows provide a critical pathway for overcoming analytical constraints and accelerating discovery. Here, we investigated host proteins associated with SARS-CoV-2 neutralising antibody responses in a South African cohort. Seventy-one participants were enrolled, including individuals with moderate disease severity, 17 of whom required supplemental oxygen; none were critically ill. Participants were stratified into high and low neutralisers based on convalescent plasma neutralisation capacity, with anti-spike antibody levels measured in parallel. Using large-scale SomaScan® plasma proteomics generated early after diagnosis, we applied HPC-enabled bioinformatics workflows for data processing, normalisation, and statistical analysis of high-dimensional proteomic data. The analyses identified host proteins associated with neutralisation capacity, spike-binding antibodies, and disease severity. Proteins linked to neutralisation showed strong concordance with spike-binding antibody responses (87%), but only partial overlap with disease severity (36%), highlighting the importance of separating protective immunity from clinical severity signals. Predictive modelling demonstrated that neutralisation status could be inferred from individual protein markers, with HSPA8 emerging as the strongest signal. This study revealed a distinct host proteomic program underlying neutralising antibody capacity that is strongly aligned with spike-binding responses and largely independent of disease severity. It also demonstrated how HPC-driven bioinformatics facilitates extraction of meaningful immune signatures from high-dimensional omics data. These findings provide new insight into host pathways influencing protective antibody responses and highlight the importance of scalable computational infrastructure for advancing health research.


19. A Robust Examination of the Impacts of Asynchronous Estimation on the Outcomes of Molecular Dynamics Simulations

Author(s): David Seaman; Prof. Hans Vandierendonck (QUB), Prof. Bronis de Supinski (QUB), Dr. Brian Dandurand (QUB)

Abstract: Molecular Dynamics (MD) simulations at practical scales are computationally demanding, as they require the calculation of interactions between huge numbers of particles over increasingly long timescales, necessitating the use of HPC resources. Interactions between particles can be split into short-range and long-range components: short-range calculations are carried out between pairs of atoms within a neighbouring region, while long-range interactions are calculated globally across the simulation domain. State-of-the-art applications, such as LAMMPS, heterogeneously map these computations to different processes, resulting in differing parallel scaling between the groups, with periodic exchanges of information between them. This heterogeneity and the communication it requires provide a rich space for potential performance optimisations. Dandurand et al. developed a novel approach that enhances parallel scalability by introducing asynchrony between the two groups of processes: it relaxes the constraint that long-range forces from the current timestep are required to progress the simulation, estimating them instead from historical values. Experimental performance measurements indicate that a speedup of up to 80% is possible in simulations where waiting for these values is a major bottleneck. We have greatly improved estimation accuracy by increasing the number of historical values used in prediction, which has in turn made simulations with more stringent accuracy constraints feasible. We build upon these results by presenting a detailed analysis of the impact of the asynchronous approach on simulation accuracy, assessing both static and dynamic properties across a diverse range of simulation setups. The results of our experiments identify areas where the use of asynchrony has minimal impact on simulation results, in some cases even over millions of timesteps, and highlight limitations of this approach caused by the current estimation model.
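
The abstract does not specify the estimator, so the following Python sketch assumes one plausible choice: polynomial extrapolation from the last k values on equally spaced timesteps. The function names and the test signal are hypothetical and only illustrate why a longer history can tighten the prediction for a smoothly varying quantity.

```python
import math


def extrapolate_next(history):
    """Predict the next value from k equally spaced past values.

    Fits the unique polynomial of degree k - 1 through the history and
    evaluates it one step ahead (assumed estimator, not the authors').
    """
    k = len(history)
    return sum(
        (-1) ** (k - 1 - j) * math.comb(k, j) * history[j] for j in range(k)
    )


def prediction_error(signal, step, k):
    """Absolute error when predicting signal(step) from the k prior steps."""
    history = [signal(t) for t in range(step - k, step)]
    return abs(extrapolate_next(history) - signal(step))


def smooth_force(t):
    """Stand-in for a slowly varying long-range force component."""
    return math.sin(0.05 * t)


# Error at timestep 20 when using 1, 2 and 3 historical values.
errors = [prediction_error(smooth_force, 20, k) for k in (1, 2, 3)]
print(errors)
```

For this smooth signal the error shrinks as more history is used. Signals that change abruptly would not behave this way, which is consistent with the limitations the abstract attributes to the estimation model.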


20. The RISC-V VecR extension: Adding Internal Registers to RVV

Author(s): Mariusz Szczepaniak; Syed Waqar Nabi, University of Glasgow, Nikela Papadopoulou, University of Glasgow

Abstract: Accumulation-heavy vector kernels, such as matrix multiplication and sparse matrix-vector multiplication, repeatedly update partial sums and can incur significant overhead from architecturally visible accumulation state. In standard RISC-V Vector (RVV) implementations, partial sums must be written back to the vector register file (VRF) after every operation, accruing significant register file traffic. In this paper, we propose VecR, an extension to the RVV ISA that retains accumulation state in an internal register and materializes it only at explicit accumulation boundaries. VecR extends the concept of scalar internal registers, as introduced in the Rented Pipeline (R-) RISC-V extension, to vector processors. Specifically, VecR introduces four custom instructions supporting vector-vector and vector-float accumulation with scalar or vector internal register outputs. We implement VecR in the gem5 O3 CPU model and evaluate it on kernels from the RiVec, PolyBench, and SDV-CAB benchmark suites. The results show that VecR consistently improves unoptimized code by reducing explicit accumulation overhead, while for compiler-optimized code, results are more nuanced, exposing the tradeoff between reduced visible accumulation work and the maintained instruction throughput. Our work demonstrates that internal-register-based accumulation is a useful mechanism for RVV-based vector processors, although our prototype single-internal-register design, despite its low hardware complexity, restricts the achievable gains under compiler optimization on out-of-order CPUs.
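
The register-traffic argument can be illustrated with a toy counting model in Python. It is not a model of the VecR instructions or of gem5: the function names, the vector length, and the rule that only partial-sum write-backs are counted are all assumptions made for the illustration.

```python
def dot_baseline(a, b, vlen):
    """Dot product where every chunk writes its partial sums back to the VRF."""
    assert len(a) == len(b) and len(a) % vlen == 0
    acc = [0] * vlen
    vrf_writes = 0
    for i in range(0, len(a), vlen):
        acc = [s + x * y for s, x, y in zip(acc, a[i:i + vlen], b[i:i + vlen])]
        vrf_writes += 1  # partial sums become architecturally visible
    return sum(acc), vrf_writes


def dot_internal(a, b, vlen):
    """Same dot product, with partial sums held in an internal register."""
    assert len(a) == len(b) and len(a) % vlen == 0
    internal = [0] * vlen
    for i in range(0, len(a), vlen):
        internal = [
            s + x * y for s, x, y in zip(internal, a[i:i + vlen], b[i:i + vlen])
        ]
    vrf_writes = 1  # materialized once, at the accumulation boundary
    return sum(internal), vrf_writes


a = list(range(8))
b = [1] * 8
print(dot_baseline(a, b, vlen=4), dot_internal(a, b, vlen=4))
```

Both variants compute the same result. In this model the baseline's write count grows with the number of chunks, while the internal-register variant writes once per accumulation.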