ACM International Conference on Supercomputing 2026

6-9 July 2026 Belfast, Northern Ireland, United Kingdom

List of accepted papers

Note: Updates to titles and author order at camera-ready submission may not have been applied yet. Last updated 28 April 2026. More updates will follow as additional paper acceptances are processed.

Cycle 1:

Title Authors

Harmonia: Enhancing Data Placement and Migration in Hybrid Storage Systems via Multi-Agent Reinforcement Learning R. Nadig, V. Arulchelvan, R. Bera, T. Shahroodi, G. Singh, A. Kakolyris, I. Yuksel, M. Sadrosadati, J. Park, O. Mutlu

Parametric Mappings for Distributed Tensor Computations B. Wu, M. Kong

X-HD: Fast Hausdorff Distance Computation with Ray Tracing L. Geng, Z. Yuan, R. Lee, X. Zhang, F. Wang

DANMP: Accelerating Multi-Scale Deformable Attention Using Near-Memory-Processing Architecture H. Li, Q. Wang, B. Gao, D. Chen, Y. Huang, X. Xin

G-PathGen: An Efficient GPU-Parallel k-Critical Path Generation Algorithm C. Chang, Y. Chung, C. Chiu, W. Lee, B. Zhang, U. Schlichtmann, I. Lin, X. Yu, T. Huang

DCSM: Enabling Inter-Batch Parallelism for Continuous Subgraph Matching on GPU Y. Wei, P. Jiang

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference Y. Huang, S. Di, G. Li

Wattchmen: Watching the Wattchers–High Fidelity, Flexible GPU Energy Modeling B. Tran, M. Sinclair, S. Venkataraman, M. Maiterth, W. Shin

TenProf: A Tensor-Centric Profiler for Deep Learning Workload Analysis and Optimization X. Ding, K. Zhou, Y. Hao, P. Su

StencilMD: Optimizing Communication in Molecular Dynamics Simulations R. Deng, T. Schardl

GRASP: Fine-grained and Adaptive Sampled Simulation for GPU Performance Modeling L. Chao, Z. Huang, P. Cai, J. Xue, T. Xiong, R. Xue

Agile QoS-aware Dynamic Power Management with eBPF Governors M. Rezvani, D. Wong

Cross-Architecture Autotuning for Single-Source Heterogeneous Programming Models H. Abram, N. Papadopoulou, J. Domke, M. Pericàs

Taming Dynamic Diffusion LLM Inference through Virtual Static Execution J. Zhu, H. Wu, Y. Li, H. Wang, R. Li, J. Zhai

IBEX: Internal Bandwidth-Efficient Compression Architecture for Scalable CXL Memory Expansion Y. Ko, H. Park, H. Lee, H. Lee

Clone: A Collaborative Multi-device System for Retrieval-Augmented Generation over CXL S. Ko, W. Doh, E. Na, H. Shim, S. Yun, J. So, Y. Kwon, S. Park, S. Roh, M. Yoon, T. Song, E. Lee, J. Ahn

Anchoring Whole-System Persistence and Resilience in CXL Y. Zhou, J. Zeng, C. Jung

GPZ: GPU-Accelerated Lossy Compressor for Particle Data R. Li, Y. Huang, L. Zhang, Z. Yang, S. Di, B. Zhang, J. Huang, J. Liu, J. Tian, G. Li, F. Song, H. Guo, F. Cappello, K. Zhao

FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving S. Gao, J. Yin, F. Wang, W. Dong

SmartCap: Coordinated CPU–GPU Power Capping for Performance-Assurance Energy Efficiency Z. Zheng, Z. Lan, X. Wu, V. Taylor, M. Papka

Mantis: Decoding HPC Telemetry Data for Robust System Prediction Y. Lu, J. Ren, S. Evgenia

SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving Q. Zhou, P. Yin, P. Zuo, C. Wang, J. Cheng

cuMIS: A Unified Scalable Framework for Computing Maximal Independent Sets on Trillion-Edge Graphs J. Nke, S. Kang, B. Rees, C. Lee

InferFast: Bridging the Gap Between Unstructured LLM Sparsity and Practical GPU Throughput Z. Shen, W. Bu, X. He, K. Sheng, H. Chen

BLEST: Blazingly Efficient BFS using Tensor Cores D. Elbek, K. Kaya

dLLM-Serve: Bridging the Memory Gap in Diffusion Language Model Serving J. Fan, Y. Zhang, X. Li, D. Nikolopoulos

HPMD: Enabling Hybrid Parallelism with Multi-Dimensional Adaptive DNN Training G. Yun, Y. Choi

MyT: Efficient Manycore based on Many Threading and Scalable Memory Parallelism A. Rajasukumar, R. Xu, T. Zhang, Y. Wang, T. Su, M. Nourian, J. Ding, J. Su, R. Khandelwal, A. Fell, D. Gleich, Y. Li, H. Hoffmann, A. Chien

PolyKAN: A High-Performance and Universal GPU Operator Library for Polynomial Kolmogorov-Arnold Networks M. Yu, H. Zhong, J. Jiang, D. Huang, Y. Lu

Aurora: A Disaggregated GPU-PNM-PIM System for High-Throughput Mixed-Length LLM Inference H. Kim, S. Yu, M. Kim, J. Lee, H. Sung, E. Lee

SPPO: Making Million-Token LLM Training Practical on Modest GPU Clusters Q. Chen, S. Li, W. Gao, P. Sun, Y. Wen, T. Zhang

HOPO: Accelerating Multimodal Neural Networks Inference via Holistic Parallelism Optimization Y. Zheng, J. Sun, H. Li, G. Sun, J. Li

Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems Z. Gong, R. Ran, F. Yao, W. Wen

EPLoN: Exploiting Efficient Parallelism with Selective Rematerialization for Lightning Attention on Ascend NPU H. Bao, Z. Su, A. Setyaev, S. Kamenev, A. Gneushev, K. Zhao, J. Xiao, H. Lin, A. Bistrigova, S. Buzykanov, E. Tetin, G. Tan, B. Liu, X. Zou, Z. Dong, C. Korikov, X. Yu, Z. Hu

Scalable All-to-allv Algorithms for Dynamic and Irregular Communication Patterns C. Wei, A. Bhatele

Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents A. Sarkar, S. Ghosh, N. Tallent, A. Chadha, T. Roosta, A. Jannesari

Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnect J. Bellavita, L. Pichetti, T. Pasquali, F. Vella, G. Guidi

Lock Shielding: A General Technique for Misuse-Resilient Locks V. Shahare, M. Chabbi, N. Hegde

FaaSlim: Partial Caching of Snapshot-based VMs for Serverless Computing S. Eom, C. Park, G. Lee, H. Moon, Y. Choi

CATS: Correlation-aware Task Scheduling for GPU Power Optimization in AI Data Centers S. Subramaniyan, X. Wang

MPMOS: Massively Parallel Multi-Objective Shortest Paths L. Gold, D. Sidoti, K. Pattipati, O. Khan

DynSpAttn: Efficient Attention via Dual-Side Dynamic Sparsity on Sparse Tensor Cores R. Fan, X. Yu, Z. Li, W. Luo, G. Gong, X. Chu

CORE-BFS: Communication Optimized Rectangular Partitioned BFS Achieving 160.845 TeraTEPS on Frontier Supercomputer H. Yang, H. Lu, M. Matheson, F. Wang, H. Liu

Clutch: High Performance Vector-Scalar Comparison using DRAM via Chunked Temporal Coding D. Tokuda, T. Kubo, I. Yuksel, A. Olgun, H. Luo, T. Nagatani, G. De Oliveira Junior, A. Yağlıkçı, M. Sadrosadati, O. Mutlu, S. Takamaeda-Yamazaki

CKTI: A Domain-Specific Compiler for Lowering CUDA Kernels to Triton-IR C. Shi, R. Chen, Y. Sun, Y. Sui, J. Zhang, Y. Xie, M. Wang, S. Ming, S. Zhang, Y. Zhang

Exploiting Hybrid Energy Storage to Minimize the Carbon Footprint of AI Data Centers S. Wu, X. Wang

AdaPolySI: Adaptive Polynomial Filtered Subspace Iteration for Hermitian Interior Eigenvalue Problems Y. Ni, X. Xu, S. Li, J. Zhang, J. Chen, J. Wang, J. Roman

Non-Delayed Cholesky Factorization Y. Luo, S. Zhang, W. Liu

Distributed Disjoint Weighted Matchings in Demand-Aware Reconfigurable Optical Datacenters S. Heck, K. Hanauer, S. Schmid

Cycle 2:

Title Authors

SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication C. Zhuang, L. Zhang, B. Brock, D. Wu, P. Chen, T. Endo, S. Matsuoka, M. Wahib

LayerScope: Predictive Cross-Layer Scheduling for Efficient Multi-Batch MoE Inference on Legacy Servers E. Yu, D. Dong, Z. Zhang, Z. Bai, W. Yang, H. Wang, D. Li, Y. Wu, L. Xiangke

TADS: Trend-Aware Dynamic Load Balancing for Large-Scale SNN Simulations with Delay-Sharded Graph Infrastructure H. Huang, S. Pang, Y. Zeng, G. Feng, Z. Chen, Y. Lu

Rethinking Collision Detection on GPU Ray Tracing Architecture D. Mandarapu, I. Fuksman, A. Pelenitysn, G. Bernstein, M. Kulkarni

Barrier-Aware Task Scheduling for Bulk-Synchronous Parallel Architectures T. Noack, A. Koch

EZCache: A Hierarchical Memory System for Zoned Neutral Atom Quantum Computers J. Zhong, Y. Deng, H. Jiang, J. Feng

Closing the Efficiency Gap: AI Datacenter Co-design Roadmap for Scalable Training of LLMs J. Tithi, H. Wu, J. Park, A. Abuhatzera, F. Petrini, T. Krishna

GRASP: Optimizing VLIW Instruction Scheduling via Graph Reinforcement Learning Z. Wang, W. Tong, J. Fang, Y. Zhang, W. Wang, J. Ren, Z. Tang

Parallel Quadratic Selected Inversion in Quantum Transport Simulation V. Maillou, M. Bollhofer, O. Schenk, A. Ziogas, M. Luisier

MegaZK: A Memory Efficient GPU System Accelerating End-to-end Zero-Knowledge Proof M. Li, Y. Yu, B. Wang, X. Fan, M. Gao, S. Deng

SumcheckPIM: An Efficient HBM-Based PIM Architecture for Linear Complexity Zero Knowledge Proofs C. Kim, T. Kang, S. Shin, T. Suh, Y. Yang, G. Koo

GPIR: Enabling Practical Private Information Retrieval with GPUs H. Ji, H. Yu, J. Kim, W. Choi, G. Suh, J. Ahn

COMETS: Cost-effective Multi-node Efficient Training System with Memory Pooling and Sharing H. Chen, S. Yang, M. Soltaniyeh, S. Pei, A. Chang, B. Kim, C. Hao

CXL-CCL: Inter-Node Collective GPU-Communication Using a CXL Shared Memory Pool D. Xu, H. Meng, X. Chen, D. Zhu, W. Tang, F. Liu, L. Xie, W. Xiang, R. Shi, Y. Li, H. Hu, H. Zhang, D. Li, J. Jiang

DEFT: Joint Task Placement and DVFS for Energy-Efficient Multi-GPU Runtimes J. Chen, M. Pericàs

THAC: Unlocking Performance in Parallel HPC Applications via UQ-Aware Automated Approximation Z. Zhao, B. Wang, B. Yang, X. Chen, J. Liu, Q. Wang

WindStencil: Unleashing GPU Potential for High-Order Stencil Computation in High-Performance Inviscid CFD Simulations X. Zhang, H. Zhang, X. Liu, J. Li, R. Jin, J. Zhang, W. Yuan, S. Liang, Z. Lu

SVSIG: Incremental Streaming Graph Processing with Source Vertex Suppression J. Huang, X. Yan, D. Fu, H. Bian, T. Cao, Z. Li

ColdMap: Compaction-Aware Cost-Benefit Zone Cleaning for ZNS-Based Key-Value Stores S. Byeon, K. Min, J. Park, S. Lee, H. Kim, J. Han, J. Hwang, Z. Cao, Y. Kim

Coordinating GPU Data Centers and Power Grid Regulation Service for Exogenous Carbon Benefits D. Wong, A. Jahanshahi, S. Golrouye, O. Anderson, N. Yu

PACER: A Userspace Network Rate Controller in MPI with Adaptive Compression for Parallel Applications Y. Li, D. Ng, A. Kashyap, S. Di, G. Li, X. Lu

RPFC: A Router Partitioning and Forward Channel Routing Framework for 2.5D MCM System S. Tao, Z. Guo, T. Liu, J. Wang

GAAF: Fast and Scalable Graph-based Vector Similarity Search with Any-Match Label Filtering M. Ma, X. Yin, J. Qiu

Block-Aware Adaptive State Management for Optimistic Parallel Discrete Event Simulation X. Peng, Q. Wang, G. Liu, C. Hong, R. Xia, Z. Sun, X. Chen, Q. Zhang, J. Liu

Cheetah: Optimizing Execution Pipelines for Matrix-Free Finite Element Operators on GPUs J. Ren, H. Ltaief, S. Zampini, D. Keyes

Continuation-Preserving Tiling for Pointer-Chasing Optimization in Structured Mutual Recursion A. Kumar, V. Singh, S. Biswas

The Performance-Power Frontier: A Model-Driven Approach to Energy-Aware Application Optimisation S. Pasupuleti, S. Wright

HoloGraph: Bridging the Throughput Gap in Heterogeneous Graph Pattern Matching via Workload-Aware Steering M. Haotian, W. Hsu, Y. Chung

SpinTune: Improving the Reliability of Quantum Sensor Networks for Practical Quantum-Classical Utility J. Ludmir, N. DiBrita, J. Han, T. Patel

TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency M. Hasanat, J. Ludmir, T. Patel, R. Roy

Harnessing MPI mutations for AI error detection A. Auville, T. Jammer, E. Petit, P. Castro, E. Saillard, M. Popov

C-3PQ: A Closeness Centrality-based Circuit Partitioner for Quantum Simulations D. Popovici, H. Lee, N. Yoshioka, M. Ben, N. Ito, K. Klymko, D. Camps, A. Butko

Optimizing Streaming Tensor Decomposition on GPU W. Lin, J. Sheng, S. Feng, M. Dun, H. Cao, Q. Sun

S2VEC: Compiler-Driven Stream Specialization for Linearized Vectorization L. Crespo, A. Fernandes, G. Falcao, P. Tomás, N. Roma, N. Neves

TOTO: Transparent I/O Tuning for HPC Applications F. Boito, L. Teylo, M. Popov, L. Aimi, A. Bandet, L. Pilla, G. Pallez

Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU Y. Li, G. Guidi

Diagonal-Budgeted Trotterization for Efficient Quantum Hamiltonian Simulation S. Chundury, B. Burgstahler, J. Li, I. Suh, F. Mueller

GFAz: State-of-the-Art Graphical Fragment Assembly Compression T. Yang, Y. Liu, B. Jiang, X. Shi, S. Jin

ViSim: A Lightweight SpMV Performance Simulator via Statistical and Visual Residual Learning S. Zhu, W. Huangfu, G. Chu

StreamGuard: Low-Overhead Resilience for Real-time HPC Data Streams H. Nguyen, B. Nicolae, T. Bicer, A. Gueroudji, M. Dorier, K. Chard, I. Foster

DA-MLAD: Drift-Decomposed Meta-Learning for Continual Log Anomaly Detection in Supercomputing Systems K. Tan, Y. Du, D. Zhan, Y. Xie, H. Yu, B. Zhao, H. Liu

Latency-SLO-Aware Memory Offloading for Large Language Model Inference C. Ma, H. Zhao, Z. Ye, Z. Yang, T. Fu, J. Han, J. Zhang, Y. Luo, X. Wang, Z. Wang, Y. Li, D. Zhou

DynLP: Parallel Dynamic Batch Update for Label Propagation in Semi-Supervised Learning S. Shovan, A. Khanda, S. Ferdous, S. Das, M. Halappanavar

CipherSkip: Efficient Sparse Matrix Multiplication with FHE W. Xiong, H. Zhou, Y. Ye, R. Jin, L. Xu

Three Birds, One Stone: Fast, Accurate-aware and Cost-Efficient Accelerator for Ternary LLM W. Jung, J. Kang, S. Shin, H. Um, J. Lim, G. Koo, Y. Park, S. Park, T. Suh

OCTANE: Breaking the Neighbor-List Bottleneck in GPU Molecular Dynamics H. Toutouni, S. Chakraborty, Y. Tu, J. Huang