ICS 2026 Accepted Poster Submissions

1. Taskflow: A General-purpose Task-parallel Programming System

Author(s): Tsung-Wei Huang (University of Wisconsin–Madison)

Abstract: The Taskflow project addresses a long-standing challenge: “How can we make it easier for C++ developers to write complex parallel algorithms leveraging manycore CPUs and GPUs?” Taskflow introduces a simple yet powerful task graph programming model that enables efficient implementations of parallel decomposition strategies. Our model supports both static and dynamic task graph construction, allowing developers to express a wide range of computational patterns with ease. Taskflow also features an efficient work-stealing scheduler that dynamically balances workloads to ensure optimal utilization of available workers across both CPU and GPU tasks. We have applied Taskflow to accelerate many EDA applications. For instance, we have successfully used Taskflow to accelerate various static timing analysis tasks on both CPUs and GPUs, achieving more than 10x speedup over existing timing tools. Other users include AMD, Nvidia, Tesseract Robotics, ModuleWorks, Xanadu, etc. We believe Taskflow is highly relevant to the HPC community as it tackles one of the most persistent challenges in modern systems programming: expressing and managing parallelism in a way that is both scalable and easy to reason about. At the time of this abstract submission, Taskflow records over 20K weekly downloads, with the source available at https://taskflow.github.io/.

2. The XOR State Channel: A Two-Word Single-Writer Notification Primitive for Multi-Lane Host–Worker Coordination

Author(s): Antony Vladimir Montemayor Terazas (Griffith College Cork)

Abstract: We formalize and characterize the XOR state channel (XSC), a two-word single-writer notification primitive used in the knitting runtime for coordinating work dispatch and result collection across 32 independent lanes between a host thread and a worker thread. This report is a formalization of a specific, existing signaling protocol, not a claim to a broadly novel synchronization primitive, and its contributions are definitional precision, formal invariant proofs, and design-space characterization. The XSC is not a lock: it provides no mutual exclusion, carries no payload, and does not serialize execution. Its sole function is to answer in O(1) (for the single-word, single-CLZ design) which of N independent lanes carry an unacknowledged state change, a property analogous to edge-triggered I/O readiness notification (epoll) applied to shared-memory channels. The primitive maintains two independent UInt32 control words, one owned by each endpoint, and derives joint lane state as their bitwise XOR: S = A ⊕ B. This representation eliminates all atomic read-modify-write operations on the control plane and achieves zero write-write coherence transfers on the control plane between cores. We establish three formal invariants (write-domain disjointness, state transition exclusivity, and conservative sender-side staleness) and characterize the protocol’s resource footprint within the class of N-lane, single-writer-each-side, fixed-width bitmask notification protocols. The conservative staleness invariant proves that stale observation of the receiver-owned word can delay lane reuse but cannot cause premature reuse: deferred acknowledgment batching is safe by construction at any batch size. Conditional correctness (P1). All invariants proved here assume Precondition P1: each lane carries at most one unacknowledged transition at a time. P1 is not enforced by the XOR encoding itself and is a load-bearing assumption for every guarantee in this report. The protocol cannot distinguish one outstanding transition from three on the same lane; readers should not import guarantees beyond what P1 permits.
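The XOR encoding described in this abstract can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the host owns word A and toggles a lane's bit to signal, and that the worker owns word B and toggles the same bit to acknowledge. The class and function names are hypothetical and are not taken from the knitting runtime. Python integers stand in for the UInt32 words, and `int.bit_length` stands in for a hardware CLZ instruction.

```python
MASK32 = 0xFFFFFFFF  # the abstract specifies 32 lanes in UInt32 words


class XorStateChannel:
    """Two control words; each word is written by exactly one endpoint."""

    def __init__(self):
        self.a = 0  # assumed host-owned word
        self.b = 0  # assumed worker-owned word

    def pending(self):
        # Joint lane state S = A xor B: a set bit marks an unacknowledged change.
        return (self.a ^ self.b) & MASK32

    def host_signal(self, lane):
        bit = 1 << lane
        # P1 is not enforced by the encoding, so this sketch checks it explicitly.
        if self.pending() & bit:
            raise RuntimeError("P1 violated: lane already has an outstanding transition")
        # Only the host writes A, so a plain store of the owner's word suffices.
        self.a ^= bit

    def worker_ack(self, lane):
        bit = 1 << lane
        if not self.pending() & bit:
            raise RuntimeError("nothing to acknowledge on this lane")
        # Only the worker writes B.
        self.b ^= bit


def highest_pending_lane(s):
    """Index of the highest set bit of s, i.e. 31 - clz(s) for a 32-bit word."""
    return s.bit_length() - 1 if s else None
```

Because XOR toggles, three unacknowledged signals on one lane would leave the same bit pattern as one, which is why the sketch rejects a second signal explicitly instead of relying on the encoding.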

3. TAAgent: Bridging the Intent-Execution Gap in Heterogeneous Medical AI Clusters with LLM-Driven Orchestration

Author(s): Yan Wang (United Imaging Healthcare)

Abstract: Medical imaging and AI workloads in healthcare are often executed on on-premises heterogeneous clusters because data sensitivity limits cloud usage. In such environments, researchers frequently overestimate resource needs, creating an intent-execution gap in which over-provisioning increases contention and reduces overall cluster efficiency. We present TAAgent, a hierarchical orchestration framework that uses large language models (LLMs) to recommend feasible resource requests for submitted workloads. TAAgent combines historical task profiles with real-time cluster telemetry to adapt resource recommendations to changing system conditions. The framework includes a centralized routing module for rapid workload characterization and specialized modules for CPU-bound and accelerator-bound workloads. To ground LLM reasoning in physical resource limits, TAAgent incorporates structured telemetry directly into the decision process and applies an iterative verification step before issuing resource request recommendations. Preliminary results on representative medical imaging and AI workloads running on a production heterogeneous GPU cluster comprising NVIDIA H20, H800, and A100 GPUs show that TAAgent improves per-task resource efficiency while reducing over-provisioning without sacrificing request feasibility.

4. Harnessing MPI mutations for AI error detection

Author(s): Asia Auville (Inria); Tim Jammer (Technische Universität Darmstadt), Eric Petit (Intel), Pablo de Oliveira Castro (University of Versailles), Emmanuelle Saillard (Inria), Mihail Popov (Inria)

Abstract: MPI errors are challenging to identify despite the significant number of expert verification tools. Dynamic tools (i.e., those requiring profiling) are accurate in error detection but computationally expensive, whereas static analysis (i.e., operating on source code or at compile time) is computationally cheap but less accurate. Interestingly, the recent success of AI and LLMs offers an alternative to increase static analysis accuracy while preserving its low overhead. Yet current methods remain difficult to benchmark, too general, and poorly adapted to the specific challenges of high-performance computing.

In this paper, we investigate how AI-powered tools can efficiently and accurately detect errors in real-world MPI applications. We propose a novel MPI Mutated Dataset (MMD), constructed from MPI programs extracted from thousands of open-source GitHub projects. After sorting and filtering these files, we inject errors that realistically emulate developers’ mistakes using synthetic code mutations. We leverage the dataset to train different AI models and assess their generalization capabilities against standard verification tools. We train compact multilayer perceptrons (MLPs) with AST-T5 embeddings (i.e., an encoder-only transformer network) to detect diverse MPI errors from the MPI-BugBench (MBB) classification, such as CallOrdering and InvalidParameter, achieving detection accuracies up to 92.0%. In some cases, our method outperforms the expert-prompted Qwen-3 Coder LLM and provides competitive results compared to established dynamic verification tools. We further evaluate Claude Sonnet 4.5 across different prompting strategies, showing that engineered prompts improve InvalidParam detection accuracy from 84.8% to 93.6%, surpassing the dynamic tool ITAC (87.0%). The full agentic pipeline achieves 86.4% overall multi-class accuracy and 88.5% bug detection recall on MBB. This paves the way for large-scale verification efforts that can use our dataset as a foundation to thoroughly investigate each MPI error in real-world applications.
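To make the idea of a synthetic code mutation concrete, the following Python sketch injects one error in the spirit of the InvalidParameter class by replacing the count argument of an `MPI_Send` call with a negative value. It is an illustration only: the function name, the regular expression, and the choice of mutation are assumptions, and the abstract does not describe how MMD mutations are actually implemented.

```python
import re

# Captures "MPI_Send(<buffer>, " then the count argument, then the next comma.
# Deliberately naive: a buffer expression that itself contains a comma
# (for example a function call) would not be handled correctly.
_SEND_COUNT = re.compile(r"(MPI_Send\s*\(\s*[^,]+,\s*)([^,]+)(,)")


def mutate_invalid_count(source: str) -> str:
    """Return `source` with the count of the first MPI_Send replaced by -1."""
    return _SEND_COUNT.sub(r"\g<1>-1\g<3>", source, count=1)


original = "MPI_Send(buf, n, MPI_INT, 1, 0, MPI_COMM_WORLD);"
mutated = mutate_invalid_count(original)
print(mutated)
```

A real mutation engine would more plausibly operate on a parsed syntax tree than on raw text, so that arguments are identified reliably and each mutated file can be labeled with its error class.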

5. Wave-Based Dispatch for Circuit Cutting in Hybrid HPC–Quantum Systems

Author(s): Ricard Santiago Raigada García (Universitat Oberta de Catalunya (UOC)); Sergio Iserte Agut (Barcelona Supercomputing Center (BSC))

Abstract: Hybrid High-Performance Computing (HPC)–quantum workloads based on circuit cutting decompose large quantum circuits into independent fragments, but existing frameworks tightly couple cutting logic to execution orchestration, preventing HPC centers from applying mature resource management policies to Noisy Intermediate-Scale Quantum (NISQ) workloads. We present DQR (Dynamic Queue Router), a runtime framework that bridges this gap by treating circuit fragments as first-class schedulable units. The framework introduces a backend-agnostic fragment descriptor to expose structural properties without requiring execution layers to parse quantum code, a wave-based coordinator that achieves pipeline concurrency via non-blocking polling, and a production-ready implementation on the CESGA Qmio supercomputer integrating both local on-premises (Qmio) and remote cloud (IBM Torino) QPU backends. Experiments on a 32-qubit Hardware-Efficient Ansatz (HEA) circuit demonstrate not only makespan improvements over a monolithic CPU baseline but also transparent per-fragment failover recovery without pipeline restart, specifically by rerouting tasks from the local QPU to classical simulators upon encountering hardware-level incompatibilities. For deeper circuits, the coordination residual accounts for only 5% of the total execution time, highlighting the framework’s scalability. These results show that DQR enables HPC centers to integrate NISQ workloads into existing production infrastructure while preserving the flexibility to adopt improved cutting algorithms or heterogeneous backend technologies.
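As a rough illustration of dispatching fragments in waves with non-blocking polling and per-fragment failover, consider the Python sketch below. It is not DQR's coordinator: the function name, the fixed wave size, the thread pool, and the `primary`/`fallback` callables are all assumptions, and the sketch runs waves one after another without attempting to reproduce pipelining across waves.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_in_waves(fragments, primary, fallback, wave_size=2, poll_s=0.001):
    """Run fragments wave by wave; reroute only a failed fragment to `fallback`."""
    results = {}
    with ThreadPoolExecutor(max_workers=wave_size) as pool:
        for start in range(0, len(fragments), wave_size):
            stop = min(start + wave_size, len(fragments))
            # future -> (fragment index, whether it already runs on the fallback)
            pending = {
                pool.submit(primary, fragments[i]): (i, False)
                for i in range(start, stop)
            }
            while pending:
                # Non-blocking check: done() never waits for a running task.
                for fut in [f for f in pending if f.done()]:
                    idx, on_fallback = pending.pop(fut)
                    err = fut.exception()
                    if err is None:
                        results[idx] = fut.result()
                    elif not on_fallback:
                        # Failover for this fragment only; the wave keeps going.
                        retry = pool.submit(fallback, fragments[idx])
                        pending[retry] = (idx, True)
                    else:
                        raise err
                time.sleep(poll_s)
    return [results[i] for i in range(len(fragments))]
```

In this sketch a fragment that fails on the primary backend is resubmitted alone, so fragments that already completed are not re-executed.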

6. Context-sensitive Floating-point Round-off Error Analysis

Author(s): Charlie Joseph Keaney; Hans Vandierendonck (Queen’s University Belfast)

Abstract: Knowing the worst-case amount of round-off error present in the outputs of a floating-point program is safety critical in fields such as vehicle engineering and defense, and is necessary to ensure the accuracy of floating-point computed data for evidence-based decision-making. However, finding an upper bound for the amount of round-off error in a floating-point function is a highly complex problem due to the properties of floating-point arithmetic. Floating-point arithmetic is not associative or distributive, and floating-point round-off error can propagate and accumulate pathologically. One widely overlooked property of floating-point arithmetic is that round-off error is context-dependent, i.e., entirely dependent on the frame of reference. The same floating-point function or operation, depending on context, can result in wildly different quantities of round-off error from the frame of reference of the output. Present floating-point round-off error models are not designed to detect this, as they only consider floating-point functions and operations in isolation. Worse, present models typically consider each operation to occur in a worst-case context repeatedly, resulting in excessively pessimistic estimates of round-off error. We offer a distilled explanation that context-sensitive error bounds can be understood as identifying an ‘alignment’ between real functions that leads to a ‘subsumption’ phenomenon that partially or wholly negates error. We put forward the idea of a context-sensitive floating-point round-off error analysis that leverages alignment to obtain tighter error bounds and discuss our groundwork research into a mathematical framework that facilitates this analysis by generating alignment-proving inequalities. This framework can be used both for manual numerical analysis of floating-point functions and for developing and improving automated static floating-point error analysis tools.
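The claim that one and the same operation contributes very different error depending on its context can be checked with a small Python experiment. The sketch below is only a generic illustration using exact rational arithmetic as the reference; it does not implement the authors' alignment framework, and the helper name `rel_error` is hypothetical.

```python
from fractions import Fraction


def rel_error(computed: float, exact: Fraction) -> float:
    """Relative error of a double against an exact rational reference."""
    return float(abs(Fraction(computed) - exact) / abs(exact))


x = 1e-16                 # treated as an exact input
t = x + 1.0               # the same rounded addition in both contexts
exact_t = Fraction(x) + 1

# Context A: t is the output itself, so the rounding is below one ulp of 1.0.
err_a = rel_error(t, exact_t)

# Context B: t feeds a subtraction, and the same rounding wipes out the output.
err_b = rel_error(t - 1.0, exact_t - 1)

print(err_a, err_b)
```

Seen from the output of context A the addition is nearly exact, while seen from the output of context B the identical rounding accounts for the entire result, which is the kind of dependence on the frame of reference that an analysis of operations in isolation cannot express.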

7. Faster Biobank-Scale Genomics through Sparse Linear Algebra on Genotype Representation Graph

Author(s): Yifan Li; Qingyao Sun, Drew DeHaas, April Wei, Giulia Guidi (Cornell University)

Abstract: Recent advances in whole-genome sequencing have generated massive genetic datasets. Biobanks such as the UK Biobank and All of Us (US) have collected genotype information from hundreds of thousands of individuals. The recently released UK Biobank 500k dataset comprises 490,000 individuals and 700 million genetic variants.

Common matrix-based formats such as VCF store each variant independently, ignoring shared ancestry, and core analyses such as genome-wide association studies (GWAS) and principal component analysis (PCA) remain slow and memory-intensive at this scale, even when exploiting genotype matrix sparsity. Recently, the Genotype Representation Graph (GRG) addressed this by encoding shared ancestral structure in a multi-tree, replacing explicit matrix-vector multiplication with graph traversal, and achieving up to 50× runtime reduction. However, a key limitation persists: the traversal is inherently sequential, and the current implementation is single-threaded, leaving the parallelism of modern hardware unexploited.

In this work, we address this challenge by reformulating multi-tree traversal as a Sparse Triangular Solve (SpTRSV), which can be further decomposed into a height-ordered pipeline of Sparse Matrix-Vector multiplications (SpMVs). This reformulation converts an irregular graph computation into a structured linear algebra workload, enabling the use of optimized sparse solver libraries and GPU acceleration. On biobank-scale data, our approach achieves a 1.8× speedup on CPU and over 30× on GPU compared to the existing GRG implementation, substantially reducing the time required for large-scale genetic analyses and opening new possibilities for computational genomics methods.
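The reformulation can be seen on a toy example. In the Python sketch below, each node accumulates its own value plus the values of its children, which is forward substitution for a unit lower-triangular system (I - A) y = x when nodes are numbered children before parents. Grouping nodes by height turns the same solve into one sparse row-block product per level. The tiny DAG, the adjacency-list representation, and the function names are assumptions for illustration and do not reflect the actual GRG data structure or the authors' implementation.

```python
def solve_by_traversal(children, x):
    """Sequential traversal: forward substitution for (I - A) y = x."""
    y = list(x)
    for v in range(len(x)):          # nodes are numbered children-before-parents
        for c in children[v]:
            y[v] += y[c]
    return y


def heights(children):
    """Height of each node: 0 for leaves, 1 + max child height otherwise."""
    h = [0] * len(children)
    for v in range(len(children)):
        if children[v]:
            h[v] = 1 + max(h[c] for c in children[v])
    return h


def solve_by_levels(children, x):
    """Same solve as a height-ordered sequence of SpMV-like steps."""
    h = heights(children)
    y = list(x)
    for level in range(1, max(h, default=0) + 1):
        rows = [v for v in range(len(x)) if h[v] == level]
        # Rows of one level read only lower levels, so they are independent
        # and could be computed in parallel as a single sparse product.
        updates = {v: sum(y[c] for c in children[v]) for v in rows}
        for v, s in updates.items():
            y[v] += s
    return y


# Toy multi-tree: node 1 has two parents (3 and 4); node 5 is the root.
children = [[], [], [], [0, 1], [1, 2], [3, 4]]
x = [1, 2, 3, 0, 0, 10]
print(solve_by_traversal(children, x), solve_by_levels(children, x))
```

The number of sequential steps in the level-ordered version equals the height of the graph instead of the number of nodes, which is what exposes parallelism within each level.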

8. A Robust Examination of the Impacts of Asynchronous Estimation on the Outcomes of Molecular Dynamics Simulations

Author(s): David Seaman; Prof. Hans Vandierendonck, Prof. Bronis de Supinski, Dr. Brian Dandurand (Queen’s University Belfast)

Abstract: Molecular Dynamics (MD) simulations at practical scales are computationally demanding, as they require the calculation of interactions between huge numbers of particles over increasingly long timescales, necessitating the use of HPC resources. Interactions between particles can be split into short-range and long-range components: short-range calculations are carried out between pairs of atoms within some neighbouring region, while long-range interactions are calculated globally across the simulation domain. State-of-the-art applications, such as LAMMPS, heterogeneously map these computations to different processes, resulting in differing parallel scaling between the groups, with periodic exchanges of information between them. This heterogeneity and the communication it requires provide a rich space for potential performance optimisations. Dandurand et al. developed a novel approach that enhances parallel scalability by introducing asynchrony between the two groups of processes: it relaxes the constraint that long-range forces from the current timestep are required to progress the simulation, and instead estimates them from historical values. Experimental performance measurements indicate that a speedup of up to 80% is possible in simulations where waiting for these values is a major bottleneck. We have achieved greatly improved estimation accuracy by increasing the number of historical values used in prediction, which in turn has made simulations with more stringent accuracy constraints feasible. We build upon these results by presenting a detailed analysis of the impact the asynchronous approach has on simulation accuracy, assessing both static and dynamic properties across a diverse range of simulation setups. The results of our experiments identify areas where the use of asynchrony has minimal impact on simulation results, in some cases even over millions of timesteps, and highlight limitations of this approach caused by the current estimation model.
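The abstract does not specify the estimation model, so the following Python sketch only illustrates the general idea of predicting the next value from a growing number of historical values. It assumes uniformly spaced timesteps and uses plain polynomial extrapolation, which is a stand-in technique and not necessarily what the authors use; the function name `extrapolate` is hypothetical.

```python
from math import comb


def extrapolate(history, k):
    """Predict the next sample from the last k samples.

    Fits the unique polynomial of degree k - 1 through the last k uniformly
    spaced samples and evaluates it one step ahead, which reduces to an
    alternating binomial sum over those samples.
    """
    if not 1 <= k <= len(history):
        raise ValueError("k must be between 1 and len(history)")
    recent = history[-k:]
    return sum((-1) ** (j + 1) * comb(k, j) * recent[-j] for j in range(1, k + 1))


# A quadratic signal sampled at t = 0..4; the true next value is 25.
history = [t * t for t in range(5)]
for k in (1, 2, 3):
    print(k, extrapolate(history, k))
```

On this smooth signal the prediction improves as more history is used and becomes exact at k = 3. Higher-order extrapolation also amplifies noise in the history, so more history is not automatically better on real force data.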

9. The RISC-V VecR extension: Adding Internal Registers to RVV

Author(s): Mariusz Szczepaniak; Syed Waqar Nabi (University of Glasgow), Nikela Papadopoulou (University of Glasgow)

Abstract: Accumulation-heavy vector kernels, such as matrix multiplication and sparse matrix-vector multiplication, repeatedly update partial sums and can incur significant overhead from architecturally visible accumulation state. In standard RISC-V Vector (RVV) implementations, partial sums must be written back to the vector register file (VRF) after every operation, accruing significant register file traffic. In this paper, we propose VecR, an extension to the RVV ISA that retains accumulation state in an internal register and materializes it only at explicit accumulation boundaries. VecR extends the concept of scalar internal registers, as introduced in the Rented Pipeline (R-) RISC-V extension, to vector processors. Specifically, VecR introduces four custom instructions supporting vector-vector and vector-float accumulation with scalar or vector internal register outputs. We implement VecR in the gem5 O3 CPU model and evaluate it on kernels from the RiVec, PolyBench, and SDV-CAB benchmark suites. The results show that VecR consistently improves unoptimized code by reducing explicit accumulation overhead, while for compiler-optimized code, results are more nuanced, exposing the tradeoff between reduced visible accumulation work and the maintained instruction throughput. Our work demonstrates that internal-register-based accumulation is a useful mechanism for RVV-based vector processors, although our prototype single-internal-register design, despite its low hardware complexity, restricts the achievable gains under compiler optimization on out-of-order CPUs.