List of accepted papers

Note: Updates to titles and author order made at camera-ready submission may not have been applied yet. Last updated 28 April 2026. More updates will follow as additional paper acceptances are processed.

Cycle 1:

ID Title Authors

12 Harmonia: Enhancing Data Placement and Migration in Hybrid Storage Systems via Multi-Agent Reinforcement Learning R. Nadig, V. Arulchelvan, R. Bera, T. Shahroodi, G. Singh, A. Kakolyris, I. Yuksel, M. Sadrosadati, J. Park, O. Mutlu

25 Parametric Mappings for Distributed Tensor Computations B. Wu, M. Kong

27 X-HD: Fast Hausdorff Distance Computation with Ray Tracing L. Geng, Z. Yuan, R. Lee, X. Zhang, F. Wang

38 DANMP: Accelerating Multi-Scale Deformable Attention Using Near-Memory-Processing Architecture H. Li, Q. Wang, B. Gao, D. Chen, Y. Huang, X. Xin

44 G-PathGen: An Efficient GPU-Parallel k-Critical Path Generation Algorithm C. Chang, Y. Chung, C. Chiu, W. Lee, B. Zhang, U. Schlichtmann, I. Lin, X. Yu, T. Huang

47 DCSM: Enabling Inter-Batch Parallelism for Continuous Subgraph Matching on GPU Y. Wei, P. Jiang

56 Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference Y. Huang, S. Di, G. Li

63 Wattchmen: Watching the Wattchers–High Fidelity, Flexible GPU Energy Modeling B. Tran, M. Sinclair, S. Venkataraman, M. Maiterth, W. Shin

74 TenProf: A Tensor-Centric Profiler for Deep Learning Workload Analysis and Optimization X. Ding, K. Zhou, Y. Hao, P. Su

78 StencilMD: Optimizing Communication in Molecular Dynamics Simulations R. Deng, T. Schardl

82 GRASP: Fine-grained and Adaptive Sampled Simulation for GPU Performance Modeling L. Chao, Z. Huang, P. Cai, J. Xue, T. Xiong, R. Xue

89 Agile QoS-aware Dynamic Power Management with eBPF Governors M. Rezvani, D. Wong

147 Cross-Architecture Autotuning for Single-Source Heterogeneous Programming Models H. Abram, N. Papadopoulou, J. Domke, M. Pericàs

155 Taming Dynamic Diffusion LLM Inference through Virtual Static Execution J. Zhu, H. Wu, Y. Li, H. Wang, R. Li, J. Zhai

159 IBEX: Internal Bandwidth-Efficient Compression Architecture for Scalable CXL Memory Expansion Y. Ko, H. Park, H. Lee, H. Lee

163 Clone: A Collaborative Multi-device System for Retrieval-Augmented Generation over CXL S. Ko, W. Doh, E. Na, H. Shim, S. Yun, J. So, Y. Kwon, S. Park, S. Roh, M. Yoon, T. Song, E. Lee, J. Ahn

171 Anchoring Whole-System Persistence and Resilience in CXL Y. Zhou, J. Zeng, C. Jung

186 GPZ: GPU-Accelerated Lossy Compressor for Particle Data R. Li, Y. Huang, L. Zhang, Z. Yang, S. Di, B. Zhang, J. Huang, J. Liu, J. Tian, G. Li, F. Song, H. Guo, F. Cappello, K. Zhao

211 FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving S. Gao, J. Yin, F. Wang, W. Dong

212 SmartCap: Coordinated CPU–GPU Power Capping for Performance-Assurance Energy Efficiency Z. Zheng, Z. Lan, X. Wu, V. Taylor, M. Papka

217 Mantis: Decoding HPC Telemetry Data for Robust System Prediction Y. Lu, J. Ren, S. Evgenia

222 SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving Q. Zhou, P. Yin, P. Zuo, C. Wang, J. Cheng

225 cuMIS: A Unified Scalable Framework for Computing Maximal Independent Sets on Trillion-Edge Graphs J. Nke, S. Kang, B. Rees, C. Lee

231 InferFast: Bridging the Gap Between Unstructured LLM Sparsity and Practical GPU Throughput Z. Shen, W. Bu, X. He, K. Sheng, H. Chen

243 BLEST: Blazingly Efficient BFS using Tensor Cores D. Elbek, K. Kaya

268 dLLM-Serve: Bridging the Memory Gap in Diffusion Language Model Serving J. Fan, Y. Zhang, X. Li, D. Nikolopoulos

269 HPMD: Enabling Hybrid Parallelism with Multi-Dimensional Adaptive DNN Training G. Yun, Y. Choi

272 MyT: Efficient Manycore based on Many Threading and Scalable Memory Parallelism A. Rajasukumar, R. Xu, T. Zhang, Y. Wang, T. Su, M. Nourian, J. Ding, J. Su, R. Khandelwal, A. Fell, D. Gleich, Y. Li, H. Hoffmann, A. Chien

277 PolyKAN: A High-Performance and Universal GPU Operator Library for Polynomial Kolmogorov-Arnold Networks M. Yu, H. Zhong, J. Jiang, D. Huang, Y. Lu

281 Aurora: A Disaggregated GPU-PNM-PIM System for High-Throughput Mixed-Length LLM Inference H. Kim, S. Yu, M. Kim, J. Lee, H. Sung, E. Lee

282 SPPO: Making Million-Token LLM Training Practical on Modest GPU Clusters Q. Chen, S. Li, W. Gao, P. Sun, Y. Wen, T. Zhang

292 HOPO: Accelerating Multimodal Neural Networks Inference via Holistic Parallelism Optimization Y. Zheng, J. Sun, H. Li, G. Sun, J. Li

302 Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems Z. Gong, R. Ran, F. Yao, W. Wen

381 EPLoN: Exploiting Efficient Parallelism with Selective Rematerialization for Lightning Attention on Ascend NPU H. Bao, Z. Su, A. Setyaev, S. Kamenev, A. Gneushev, K. Zhao, J. Xiao, H. Lin, A. Bistrigova, S. Buzykanov, E. Tetin, G. Tan, B. Liu, X. Zou, Z. Dong, C. Korikov, X. Yu, Z. Hu

387 Scalable All-to-allv Algorithms for Dynamic and Irregular Communication Patterns C. Wei, A. Bhatele

392 Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents A. Sarkar, S. Ghosh, N. Tallent, A. Chadha, T. Roosta, A. Jannesari

415 Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnect J. Bellavita, L. Pichetti, T. Pasquali, F. Vella, G. Guidi

422 Lock Shielding: A General Technique for Misuse-Resilient Locks V. Shahare, M. Chabbi, N. Hegde

435 FaaSlim: Partial Caching of Snapshot-based VMs for Serverless Computing S. Eom, C. Park, G. Lee, H. Moon, Y. Choi

464 CATS: Correlation-aware Task Scheduling for GPU Power Optimization in AI Data Centers S. Subramaniyan, X. Wang

469 MPMOS: Massively Parallel Multi-Objective Shortest Paths L. Gold, D. Sidoti, K. Pattipati, O. Khan

475 DynSpAttn: Efficient Attention via Dual-Side Dynamic Sparsity on Sparse Tensor Cores R. Fan, X. Yu, Z. Li, W. Luo, G. Gong, X. Chu

485 CORE-BFS: Communication Optimized Rectangular Partitioned BFS Achieving 160.845 TeraTEPS on Frontier Supercomputer H. Yang, H. Lu, M. Matheson, F. Wang, H. Liu

521 Clutch: High Performance Vector-Scalar Comparison using DRAM via Chunked Temporal Coding D. Tokuda, T. Kubo, I. Yuksel, A. Olgun, H. Luo, T. Nagatani, G. De Oliveira Junior, A. Yağlıkçı, M. Sadrosadati, O. Mutlu, S. Takamaeda-Yamazaki

557 CKTI: A Domain-Specific Compiler for Lowering CUDA Kernels to Triton-IR C. Shi, R. Chen, Y. Sun, Y. Sui, J. Zhang, Y. Xie, M. Wang, S. Ming, S. Zhang, Y. Zhang

584 Exploiting Hybrid Energy Storage to Minimize the Carbon Footprint of AI Data Centers S. Wu, X. Wang

612 AdaPolySI: Adaptive Polynomial Filtered Subspace Iteration for Hermitian Interior Eigenvalue Problems Y. Ni, X. Xu, S. Li, J. Zhang, J. Chen, J. Wang, J. Roman

679 Non-Delayed Cholesky Factorization Y. Luo, S. Zhang, W. Liu

844 Distributed Disjoint Weighted Matchings in Demand-Aware Reconfigurable Optical Datacenters S. Heck, K. Hanauer, S. Schmid

Cycle 2:

ID Title Authors

19 SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication C. Zhuang, L. Zhang, B. Brock, D. Wu, P. Chen, T. Endo, S. Matsuoka, M. Wahib

23 LayerScope: Predictive Cross-Layer Scheduling for Efficient Multi-Batch MoE Inference on Legacy Servers E. Yu, D. Dong, Z. Zhang, Z. Bai, W. Yang, H. Wang, D. Li, Y. Wu, L. Xiangke

32 TADS: Trend-Aware Dynamic Load Balancing for Large-Scale SNN Simulations with Delay-Sharded Graph Infrastructure H. Huang, S. Pang, Y. Zeng, G. Feng, Z. Chen, Y. Lu

53 Rethinking Collision Detection on GPU Ray Tracing Architecture D. Mandarapu, I. Fuksman, A. Pelenitysn, G. Bernstein, M. Kulkarni

66 Barrier-Aware Task Scheduling for Bulk-Synchronous Parallel Architectures T. Noack, A. Koch

67 EZCache: A Hierarchical Memory System for Zoned Neutral Atom Quantum Computers J. Zhong, Y. Deng, H. Jiang, J. Feng

68 Closing the Efficiency Gap: AI Datacenter Co-design Roadmap for Scalable Training of LLMs J. Tithi, H. Wu, J. Park, A. Abuhatzera, F. Petrini, T. Krishna

74 GRASP: Optimizing VLIW Instruction Scheduling via Graph Reinforcement Learning Z. Wang, W. Tong, J. Fang, Y. Zhang, W. Wang, J. Ren, Z. Tang

87 Parallel Quadratic Selected Inversion in Quantum Transport Simulation V. Maillou, M. Bollhofer, O. Schenk, A. Ziogas, M. Luisier

95 MegaZK: A Memory Efficient GPU System Accelerating End-to-end Zero-Knowledge Proof M. Li, Y. Yu, B. Wang, X. Fan, M. Gao, S. Deng

107 SumcheckPIM: An Efficient HBM-Based PIM Architecture for Linear Complexity Zero Knowledge Proofs C. Kim, T. Kang, S. Shin, T. Suh, Y. Yang, G. Koo

117 GPIR: Enabling Practical Private Information Retrieval with GPUs H. Ji, H. Yu, J. Kim, W. Choi, G. Suh, J. Ahn

131 COMETS: Cost-effective Multi-node Efficient Training System with Memory Pooling and Sharing H. Chen, S. Yang, M. Soltaniyeh, S. Pei, A. Chang, B. Kim, C. Hao

134 CXL-CCL: Inter-Node Collective GPU-Communication Using a CXL Shared Memory Pool D. Xu, H. Meng, X. Chen, D. Zhu, W. Tang, F. Liu, L. Xie, W. Xiang, R. Shi, Y. Li, H. Hu, H. Zhang, D. Li, J. Jiang

144 DEFT: Joint Task Placement and DVFS for Energy-Efficient Multi-GPU Runtimes J. Chen, M. Pericàs

151 THAC: Unlocking Performance in Parallel HPC Applications via UQ-Aware Automated Approximation Z. Zhao, B. Wang, B. Yang, X. Chen, J. Liu, Q. Wang

159 WindStencil: Unleashing GPU Potential for High-Order Stencil Computation in High-Performance Inviscid CFD Simulations X. Zhang, H. Zhang, X. Liu, J. Li, R. Jin, J. Zhang, W. Yuan, S. Liang, Z. Lu

160 SVSIG: Incremental Streaming Graph Processing with Source Vertex Suppression J. Huang, X. Yan, D. Fu, H. Bian, T. Cao, Z. Li

177 ColdMap: Compaction-Aware Cost-Benefit Zone Cleaning for ZNS-Based Key-Value Stores S. Byeon, K. Min, J. Park, S. Lee, H. Kim, J. Han, J. Hwang, Z. Cao, Y. Kim

184 Coordinating GPU Data Centers and Power Grid Regulation Service for Exogenous Carbon Benefits D. Wong, A. Jahanshahi, S. Golrouye, O. Anderson, N. Yu

187 PACER: A Userspace Network Rate Controller in MPI with Adaptive Compression for Parallel Applications Y. Li, D. Ng, A. Kashyap, S. Di, G. Li, X. Lu

190 RPFC: A Router Partitioning and Forward Channel Routing Framework for 2.5D MCM System S. Tao, Z. Guo, T. Liu, J. Wang

193 GAAF: Fast and Scalable Graph-based Vector Similarity Search with Any-Match Label Filtering M. Ma, X. Yin, J. Qiu

195 Block-Aware Adaptive State Management for Optimistic Parallel Discrete Event Simulation X. Peng, Q. Wang, G. Liu, C. Hong, R. Xia, Z. Sun, X. Chen, Q. Zhang, J. Liu

252 Cheetah: Optimizing Execution Pipelines for Matrix-Free Finite Element Operators on GPUs J. Ren, H. Ltaief, S. Zampini, D. Keyes

283 Continuation-Preserving Tiling for Pointer-Chasing Optimization in Structured Mutual Recursion A. Kumar, V. Singh, S. Biswas

307 The Performance-Power Frontier: A Model-Driven Approach to Energy-Aware Application Optimisation S. Pasupuleti, S. Wright

344 HoloGraph: Bridging the Throughput Gap in Heterogeneous Graph Pattern Matching via Workload-Aware Steering M. Haotian, W. Hsu, Y. Chung

366 SpinTune: Improving the Reliability of Quantum Sensor Networks for Practical Quantum-Classical Utility J. Ludmir, N. DiBrita, J. Han, P. Tirthak

375 TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency M. Hasanat, J. Ludmir, T. Patel, R. Roy

388 Harnessing MPI mutations for AI error detection A. Auville, T. Jammer, E. Petit, P. Castro, E. Saillard, M. Popov

424 C-3PQ: A Closeness Centrality-based Circuit Partitioner for Quantum Simulations D. Popovici, H. Lee, N. Yoshioka, M. Ben, N. Ito, K. Klymko, D. Camps, A. Butko

425 Optimizing Streaming Tensor Decomposition on GPU W. Lin, J. Sheng, S. Feng, M. Dun, H. Cao, Q. Sun

427 S2VEC: Compiler-Driven Stream Specialization for Linearized Vectorization L. Crespo, A. Fernandes, G. Falcao, P. Tomás, N. Roma, N. Neves

446 TOTO: Transparent I/O Tuning for HPC Applications F. Boito, L. Teylo, M. Popov, L. Aimi, A. Bandet, L. Pilla, G. Pallez

501 Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU Y. Li, G. Guidi

524 Diagonal-Budgeted Trotterization for Efficient Quantum Hamiltonian Simulation S. Chundury, B. Burgstahler, J. Li, I. Suh, F. Mueller

593 GFAz: State-of-the-Art Graphical Fragment Assembly Compression T. Yang, Y. Liu, B. Jiang, X. Shi, S. Jin

594 ViSim: A Lightweight SpMV Performance Simulator via Statistical and Visual Residual Learning S. Zhu, W. Huangfu, G. Chu

610 StreamGuard: Low-Overhead Resilience for Real-time HPC Data Streams H. Nguyen, B. Nicolae, T. Bicer, A. Gueroudji, M. Dorier, K. Chard, I. Foster

639 DA-MLAD: Drift-Decomposed Meta-Learning for Continual Log Anomaly Detection in Supercomputing Systems K. Tan, Y. Du, D. Zhan, Y. Xie, H. Yu, B. Zhao, H. Liu

650 Latency-SLO-Aware Memory Offloading for Large Language Model Inference C. Ma, H. Zhao, Z. Ye, Z. Yang, T. Fu, J. Han, J. Zhang, Y. Luo, X. Wang, Z. Wang, Y. Li, D. Zhou

668 DynLP: Parallel Dynamic Batch Update for Label Propagation in Semi-Supervised Learning S. Shovan, A. Khanda, S. Ferdous, S. Das, M. Halappanavar

685 CipherSkip: Efficient Sparse Matrix Multiplication with FHE W. Xiong, H. Zhou, Y. Ye, R. Jin, L. Xu

766 Three Birds, One Stone: Fast, Accurate-aware and Cost-Efficient Accelerator for Ternary LLM W. Jung, J. Kang, S. Shin, H. Um, J. Lim, G. Koo, Y. Park, S. Park, T. Suh

785 OCTANE: Breaking the Neighbor-List Bottleneck in GPU Molecular Dynamics H. Toutouni, S. Chakraborty, Y. Tu, J. Huang