Note: Updates to titles and author order made at camera-ready submission may not have been applied yet. Last updated 28 April 2026. More updates will follow as additional paper acceptances are processed.
ID Title Authors
12 Harmonia: Enhancing Data Placement and Migration in Hybrid Storage Systems via Multi-Agent Reinforcement Learning R. Nadig, V. Arulchelvan, R. Bera, T. Shahroodi, G. Singh, A. Kakolyris, I. Yuksel, M. Sadrosadati, J. Park, O. Mutlu
25 Parametric Mappings for Distributed Tensor Computations B. Wu, M. Kong
27 X-HD: Fast Hausdorff Distance Computation with Ray Tracing L. Geng, Z. Yuan, R. Lee, X. Zhang, F. Wang
38 DANMP: Accelerating Multi-Scale Deformable Attention Using Near-Memory-Processing Architecture H. Li, Q. Wang, B. Gao, D. Chen, Y. Huang, X. Xin
44 G-PathGen: An Efficient GPU-Parallel k-Critical Path Generation Algorithm C. Chang, Y. Chung, C. Chiu, W. Lee, B. Zhang, U. Schlichtmann, I. Lin, X. Yu, T. Huang
47 DCSM: Enabling Inter-Batch Parallelism for Continuous Subgraph Matching on GPU Y. Wei, P. Jiang
56 Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference Y. Huang, S. Di, G. Li
63 Wattchmen: Watching the Wattchers–High Fidelity, Flexible GPU Energy Modeling B. Tran, M. Sinclair, S. Venkataraman, M. Maiterth, W. Shin
74 TenProf: A Tensor-Centric Profiler for Deep Learning Workload Analysis and Optimization X. Ding, K. Zhou, Y. Hao, P. Su
78 StencilMD: Optimizing Communication in Molecular Dynamics Simulations R. Deng, T. Schardl
82 GRASP: Fine-grained and Adaptive Sampled Simulation for GPU Performance Modeling L. Chao, Z. Huang, P. Cai, J. Xue, T. Xiong, R. Xue
89 Agile QoS-aware Dynamic Power Management with eBPF Governors M. Rezvani, D. Wong
147 Cross-Architecture Autotuning for Single-Source Heterogeneous Programming Models H. Abram, N. Papadopoulou, J. Domke, M. Pericàs
155 Taming Dynamic Diffusion LLM Inference through Virtual Static Execution J. Zhu, H. Wu, Y. Li, H. Wang, R. Li, J. Zhai
159 IBEX: Internal Bandwidth-Efficient Compression Architecture for Scalable CXL Memory Expansion Y. Ko, H. Park, H. Lee, H. Lee
163 Clone: A Collaborative Multi-device System for Retrieval-Augmented Generation over CXL S. Ko, W. Doh, E. Na, H. Shim, S. Yun, J. So, Y. Kwon, S. Park, S. Roh, M. Yoon, T. Song, E. Lee, J. Ahn
171 Anchoring Whole-System Persistence and Resilience in CXL Y. Zhou, J. Zeng, C. Jung
186 GPZ: GPU-Accelerated Lossy Compressor for Particle Data R. Li, Y. Huang, L. Zhang, Z. Yang, S. Di, B. Zhang, J. Huang, J. Liu, J. Tian, G. Li, F. Song, H. Guo, F. Cappello, K. Zhao
211 FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving S. Gao, J. Yin, F. Wang, W. Dong
212 SmartCap: Coordinated CPU–GPU Power Capping for Performance-Assurance Energy Efficiency Z. Zheng, Z. Lan, X. Wu, V. Taylor, M. Papka
217 Mantis: Decoding HPC Telemetry Data for Robust System Prediction Y. Lu, J. Ren, S. Evgenia
222 SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving Q. Zhou, P. Yin, P. Zuo, C. Wang, J. Cheng
225 cuMIS: A Unified Scalable Framework for Computing Maximal Independent Sets on Trillion-Edge Graphs J. Nke, S. Kang, B. Rees, C. Lee
231 InferFast: Bridging the Gap Between Unstructured LLM Sparsity and Practical GPU Throughput Z. Shen, W. Bu, X. He, K. Sheng, H. Chen
243 BLEST: Blazingly Efficient BFS using Tensor Cores D. Elbek, K. Kaya
268 dLLM-Serve: Bridging the Memory Gap in Diffusion Language Model Serving J. Fan, Y. Zhang, X. Li, D. Nikolopoulos
269 HPMD: Enabling Hybrid Parallelism with Multi-Dimensional Adaptive DNN Training G. Yun, Y. Choi
272 MyT: Efficient Manycore based on Many Threading and Scalable Memory Parallelism A. Rajasukumar, R. Xu, T. Zhang, Y. Wang, T. Su, M. Nourian, J. Ding, J. Su, R. Khandelwal, A. Fell, D. Gleich, Y. Li, H. Hoffmann, A. Chien
277 PolyKAN: A High-Performance and Universal GPU Operator Library for Polynomial Kolmogorov-Arnold Networks M. Yu, H. Zhong, J. Jiang, D. Huang, Y. Lu
281 Aurora: A Disaggregated GPU-PNM-PIM System for High-Throughput Mixed-Length LLM Inference H. Kim, S. Yu, M. Kim, J. Lee, H. Sung, E. Lee
282 SPPO: Making Million-Token LLM Training Practical on Modest GPU Clusters Q. Chen, S. Li, W. Gao, P. Sun, Y. Wen, T. Zhang
292 HOPO: Accelerating Multimodal Neural Networks Inference via Holistic Parallelism Optimization Y. Zheng, J. Sun, H. Li, G. Sun, J. Li
302 Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems Z. Gong, R. Ran, F. Yao, W. Wen
381 EPLoN: Exploiting Efficient Parallelism with Selective Rematerialization for Lightning Attention on Ascend NPU H. Bao, Z. Su, A. Setyaev, S. Kamenev, A. Gneushev, K. Zhao, J. Xiao, H. Lin, A. Bistrigova, S. Buzykanov, E. Tetin, G. Tan, B. Liu, X. Zou, Z. Dong, C. Korikov, X. Yu, Z. Hu
387 Scalable All-to-allv Algorithms for Dynamic and Irregular Communication Patterns C. Wei, A. Bhatele
392 Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents A. Sarkar, S. Ghosh, N. Tallent, A. Chadha, T. Roosta, A. Jannesari
415 Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnect J. Bellavita, L. Pichetti, T. Pasquali, F. Vella, G. Guidi
422 Lock Shielding: A General Technique for Misuse-Resilient Locks V. Shahare, M. Chabbi, N. Hegde
435 FaaSlim: Partial Caching of Snapshot-based VMs for Serverless Computing S. Eom, C. Park, G. Lee, H. Moon, Y. Choi
464 CATS: Correlation-aware Task Scheduling for GPU Power Optimization in AI Data Centers S. Subramaniyan, X. Wang
469 MPMOS: Massively Parallel Multi-Objective Shortest Paths L. Gold, D. Sidoti, K. Pattipati, O. Khan
475 DynSpAttn: Efficient Attention via Dual-Side Dynamic Sparsity on Sparse Tensor Cores R. Fan, X. Yu, Z. Li, W. Luo, G. Gong, X. Chu
485 CORE-BFS: Communication Optimized Rectangular Partitioned BFS Achieving 160.845 TeraTEPS on Frontier Supercomputer H. Yang, H. Lu, M. Matheson, F. Wang, H. Liu
521 Clutch: High Performance Vector-Scalar Comparison using DRAM via Chunked Temporal Coding D. Tokuda, T. Kubo, I. Yuksel, A. Olgun, H. Luo, T. Nagatani, G. De Oliveira Junior, A. Yağlıkçı, M. Sadrosadati, O. Mutlu, S. Takamaeda-Yamazaki
557 CKTI: A Domain-Specific Compiler for Lowering CUDA Kernels to Triton-IR C. Shi, R. Chen, Y. Sun, Y. Sui, J. Zhang, Y. Xie, M. Wang, S. Ming, S. Zhang, Y. Zhang
584 Exploiting Hybrid Energy Storage to Minimize the Carbon Footprint of AI Data Centers S. Wu, X. Wang
612 AdaPolySI: Adaptive Polynomial Filtered Subspace Iteration for Hermitian Interior Eigenvalue Problems Y. Ni, X. Xu, S. Li, J. Zhang, J. Chen, J. Wang, J. Roman
679 Non-Delayed Cholesky Factorization Y. Luo, S. Zhang, W. Liu
844 Distributed Disjoint Weighted Matchings in Demand-Aware Reconfigurable Optical Datacenters S. Heck, K. Hanauer, S. Schmid
ID Title Authors
19 SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication C. Zhuang, L. Zhang, B. Brock, D. Wu, P. Chen, T. Endo, S. Matsuoka, M. Wahib
23 LayerScope: Predictive Cross-Layer Scheduling for Efficient Multi-Batch MoE Inference on Legacy Servers E. Yu, D. Dong, Z. Zhang, Z. Bai, W. Yang, H. Wang, D. Li, Y. Wu, L. Xiangke
32 TADS: Trend-Aware Dynamic Load Balancing for Large-Scale SNN Simulations with Delay-Sharded Graph Infrastructure H. Huang, S. Pang, Y. Zeng, G. Feng, Z. Chen, Y. Lu
53 Rethinking Collision Detection on GPU Ray Tracing Architecture D. Mandarapu, I. Fuksman, A. Pelenitsyn, G. Bernstein, M. Kulkarni
66 Barrier-Aware Task Scheduling for Bulk-Synchronous Parallel Architectures T. Noack, A. Koch
67 EZCache: A Hierarchical Memory System for Zoned Neutral Atom Quantum Computers J. Zhong, Y. Deng, H. Jiang, J. Feng
68 Closing the Efficiency Gap: AI Datacenter Co-design Roadmap for Scalable Training of LLMs J. Tithi, H. Wu, J. Park, A. Abuhatzera, F. Petrini, T. Krishna
74 GRASP: Optimizing VLIW Instruction Scheduling via Graph Reinforcement Learning Z. Wang, W. Tong, J. Fang, Y. Zhang, W. Wang, J. Ren, Z. Tang
87 Parallel Quadratic Selected Inversion in Quantum Transport Simulation V. Maillou, M. Bollhofer, O. Schenk, A. Ziogas, M. Luisier
95 MegaZK: A Memory Efficient GPU System Accelerating End-to-end Zero-Knowledge Proof M. Li, Y. Yu, B. Wang, X. Fan, M. Gao, S. Deng
107 SumcheckPIM: An Efficient HBM-Based PIM Architecture for Linear Complexity Zero Knowledge Proofs C. Kim, T. Kang, S. Shin, T. Suh, Y. Yang, G. Koo
117 GPIR: Enabling Practical Private Information Retrieval with GPUs H. Ji, H. Yu, J. Kim, W. Choi, G. Suh, J. Ahn
131 COMETS: Cost-effective Multi-node Efficient Training System with Memory Pooling and Sharing H. Chen, S. Yang, M. Soltaniyeh, S. Pei, A. Chang, B. Kim, C. Hao
134 CXL-CCL: Inter-Node Collective GPU-Communication Using a CXL Shared Memory Pool D. Xu, H. Meng, X. Chen, D. Zhu, W. Tang, F. Liu, L. Xie, W. Xiang, R. Shi, Y. Li, H. Hu, H. Zhang, D. Li, J. Jiang
144 DEFT: Joint Task Placement and DVFS for Energy-Efficient Multi-GPU Runtimes J. Chen, M. Pericàs
151 THAC: Unlocking Performance in Parallel HPC Applications via UQ-Aware Automated Approximation Z. Zhao, B. Wang, B. Yang, X. Chen, J. Liu, Q. Wang
159 WindStencil: Unleashing GPU Potential for High-Order Stencil Computation in High-Performance Inviscid CFD Simulations X. Zhang, H. Zhang, X. Liu, J. Li, R. Jin, J. Zhang, W. Yuan, S. Liang, Z. Lu
160 SVSIG: Incremental Streaming Graph Processing with Source Vertex Suppression J. Huang, X. Yan, D. Fu, H. Bian, T. Cao, Z. Li
177 ColdMap: Compaction-Aware Cost-Benefit Zone Cleaning for ZNS-Based Key-Value Stores S. Byeon, K. Min, J. Park, S. Lee, H. Kim, J. Han, J. Hwang, Z. Cao, Y. Kim
184 Coordinating GPU Data Centers and Power Grid Regulation Service for Exogenous Carbon Benefits D. Wong, A. Jahanshahi, S. Golrouye, O. Anderson, N. Yu
187 PACER: A Userspace Network Rate Controller in MPI with Adaptive Compression for Parallel Applications Y. Li, D. Ng, A. Kashyap, S. Di, G. Li, X. Lu
190 RPFC: A Router Partitioning and Forward Channel Routing Framework for 2.5D MCM System S. Tao, Z. Guo, T. Liu, J. Wang
193 GAAF: Fast and Scalable Graph-based Vector Similarity Search with Any-Match Label Filtering M. Ma, X. Yin, J. Qiu
195 Block-Aware Adaptive State Management for Optimistic Parallel Discrete Event Simulation X. Peng, Q. Wang, G. Liu, C. Hong, R. Xia, Z. Sun, X. Chen, Q. Zhang, J. Liu
252 Cheetah: Optimizing Execution Pipelines for Matrix-Free Finite Element Operators on GPUs J. Ren, H. Ltaief, S. Zampini, D. Keyes
283 Continuation-Preserving Tiling for Pointer-Chasing Optimization in Structured Mutual Recursion A. Kumar, V. Singh, S. Biswas
307 The Performance-Power Frontier: A Model-Driven Approach to Energy-Aware Application Optimisation S. Pasupuleti, S. Wright
344 HoloGraph: Bridging the Throughput Gap in Heterogeneous Graph Pattern Matching via Workload-Aware Steering M. Haotian, W. Hsu, Y. Chung
366 SpinTune: Improving the Reliability of Quantum Sensor Networks for Practical Quantum-Classical Utility J. Ludmir, N. DiBrita, J. Han, P. Tirthak
375 TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency M. Hasanat, J. Ludmir, T. Patel, R. Roy
388 Harnessing MPI mutations for AI error detection A. Auville, T. Jammer, E. Petit, P. Castro, E. Saillard, M. Popov
424 C-3PQ: A Closeness Centrality-based Circuit Partitioner for Quantum Simulations D. Popovici, H. Lee, N. Yoshioka, M. Ben, N. Ito, K. Klymko, D. Camps, A. Butko
425 Optimizing Streaming Tensor Decomposition on GPU W. Lin, J. Sheng, S. Feng, M. Dun, H. Cao, Q. Sun
427 S2VEC: Compiler-Driven Stream Specialization for Linearized Vectorization L. Crespo, A. Fernandes, G. Falcao, P. Tomás, N. Roma, N. Neves
446 TOTO: Transparent I/O Tuning for HPC Applications F. Boito, L. Teylo, M. Popov, L. Aimi, A. Bandet, L. Pilla, G. Pallez
501 Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU Y. Li, G. Guidi
524 Diagonal-Budgeted Trotterization for Efficient Quantum Hamiltonian Simulation S. Chundury, B. Burgstahler, J. Li, I. Suh, F. Mueller
593 GFAz: State-of-the-Art Graphical Fragment Assembly Compression T. Yang, Y. Liu, B. Jiang, X. Shi, S. Jin
594 ViSim: A Lightweight SpMV Performance Simulator via Statistical and Visual Residual Learning S. Zhu, W. Huangfu, G. Chu
610 StreamGuard: Low-Overhead Resilience for Real-time HPC Data Streams H. Nguyen, B. Nicolae, T. Bicer, A. Gueroudji, M. Dorier, K. Chard, I. Foster
639 DA-MLAD: Drift-Decomposed Meta-Learning for Continual Log Anomaly Detection in Supercomputing Systems K. Tan, Y. Du, D. Zhan, Y. Xie, H. Yu, B. Zhao, H. Liu
650 Latency-SLO-Aware Memory Offloading for Large Language Model Inference C. Ma, H. Zhao, Z. Ye, Z. Yang, T. Fu, J. Han, J. Zhang, Y. Luo, X. Wang, Z. Wang, Y. Li, D. Zhou
668 DynLP: Parallel Dynamic Batch Update for Label Propagation in Semi-Supervised Learning S. Shovan, A. Khanda, S. Ferdous, S. Das, M. Halappanavar
685 CipherSkip: Efficient Sparse Matrix Multiplication with FHE W. Xiong, H. Zhou, Y. Ye, R. Jin, L. Xu
766 Three Birds, One Stone: Fast, Accurate-aware and Cost-Efficient Accelerator for Ternary LLM W. Jung, J. Kang, S. Shin, H. Um, J. Lim, G. Koo, Y. Park, S. Park, T. Suh
785 OCTANE: Breaking the Neighbor-List Bottleneck in GPU Molecular Dynamics H. Toutouni, S. Chakraborty, Y. Tu, J. Huang