Note: Updates to titles and author order made at camera-ready submission may not have been applied yet. Last updated 28 April 2026. More updates will follow as additional paper acceptances are processed.
Title Authors
Harmonia: Enhancing Data Placement and Migration in Hybrid Storage Systems via Multi-Agent Reinforcement Learning R. Nadig, V. Arulchelvan, R. Bera, T. Shahroodi, G. Singh, A. Kakolyris, I. Yuksel, M. Sadrosadati, J. Park, O. Mutlu
Parametric Mappings for Distributed Tensor Computations B. Wu, M. Kong
X-HD: Fast Hausdorff Distance Computation with Ray Tracing L. Geng, Z. Yuan, R. Lee, X. Zhang, F. Wang
DANMP: Accelerating Multi-Scale Deformable Attention Using Near-Memory-Processing Architecture H. Li, Q. Wang, B. Gao, D. Chen, Y. Huang, X. Xin
G-PathGen: An Efficient GPU-Parallel k-Critical Path Generation Algorithm C. Chang, Y. Chung, C. Chiu, W. Lee, B. Zhang, U. Schlichtmann, I. Lin, X. Yu, T. Huang
DCSM: Enabling Inter-Batch Parallelism for Continuous Subgraph Matching on GPU Y. Wei, P. Jiang
Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference Y. Huang, S. Di, G. Li
Wattchmen: Watching the Wattchers–High Fidelity, Flexible GPU Energy Modeling B. Tran, M. Sinclair, S. Venkataraman, M. Maiterth, W. Shin
TenProf: A Tensor-Centric Profiler for Deep Learning Workload Analysis and Optimization X. Ding, K. Zhou, Y. Hao, P. Su
StencilMD: Optimizing Communication in Molecular Dynamics Simulations R. Deng, T. Schardl
GRASP: Fine-grained and Adaptive Sampled Simulation for GPU Performance Modeling L. Chao, Z. Huang, P. Cai, J. Xue, T. Xiong, R. Xue
Agile QoS-aware Dynamic Power Management with eBPF Governors M. Rezvani, D. Wong
Cross-Architecture Autotuning for Single-Source Heterogeneous Programming Models H. Abram, N. Papadopoulou, J. Domke, M. Pericàs
Taming Dynamic Diffusion LLM Inference through Virtual Static Execution J. Zhu, H. Wu, Y. Li, H. Wang, R. Li, J. Zhai
IBEX: Internal Bandwidth-Efficient Compression Architecture for Scalable CXL Memory Expansion Y. Ko, H. Park, H. Lee, H. Lee
Clone: A Collaborative Multi-device System for Retrieval-Augmented Generation over CXL S. Ko, W. Doh, E. Na, H. Shim, S. Yun, J. So, Y. Kwon, S. Park, S. Roh, M. Yoon, T. Song, E. Lee, J. Ahn
Anchoring Whole-System Persistence and Resilience in CXL Y. Zhou, J. Zeng, C. Jung
GPZ: GPU-Accelerated Lossy Compressor for Particle Data R. Li, Y. Huang, L. Zhang, Z. Yang, S. Di, B. Zhang, J. Huang, J. Liu, J. Tian, G. Li, F. Song, H. Guo, F. Cappello, K. Zhao
FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving S. Gao, J. Yin, F. Wang, W. Dong
SmartCap: Coordinated CPU–GPU Power Capping for Performance-Assurance Energy Efficiency Z. Zheng, Z. Lan, X. Wu, V. Taylor, M. Papka
Mantis: Decoding HPC Telemetry Data for Robust System Prediction Y. Lu, J. Ren, S. Evgenia
SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving Q. Zhou, P. Yin, P. Zuo, C. Wang, J. Cheng
cuMIS: A Unified Scalable Framework for Computing Maximal Independent Sets on Trillion-Edge Graphs J. Nke, S. Kang, B. Rees, C. Lee
InferFast: Bridging the Gap Between Unstructured LLM Sparsity and Practical GPU Throughput Z. Shen, W. Bu, X. He, K. Sheng, H. Chen
BLEST: Blazingly Efficient BFS using Tensor Cores D. Elbek, K. Kaya
dLLM-Serve: Bridging the Memory Gap in Diffusion Language Model Serving J. Fan, Y. Zhang, X. Li, D. Nikolopoulos
HPMD: Enabling Hybrid Parallelism with Multi-Dimensional Adaptive DNN Training G. Yun, Y. Choi
MyT: Efficient Manycore based on Many Threading and Scalable Memory Parallelism A. Rajasukumar, R. Xu, T. Zhang, Y. Wang, T. Su, M. Nourian, J. Ding, J. Su, R. Khandelwal, A. Fell, D. Gleich, Y. Li, H. Hoffmann, A. Chien
PolyKAN: A High-Performance and Universal GPU Operator Library for Polynomial Kolmogorov-Arnold Networks M. Yu, H. Zhong, J. Jiang, D. Huang, Y. Lu
Aurora: A Disaggregated GPU-PNM-PIM System for High-Throughput Mixed-Length LLM Inference H. Kim, S. Yu, M. Kim, J. Lee, H. Sung, E. Lee
SPPO: Making Million-Token LLM Training Practical on Modest GPU Clusters Q. Chen, S. Li, W. Gao, P. Sun, Y. Wen, T. Zhang
HOPO: Accelerating Multimodal Neural Networks Inference via Holistic Parallelism Optimization Y. Zheng, J. Sun, H. Li, G. Sun, J. Li
Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems Z. Gong, R. Ran, F. Yao, W. Wen
EPLoN: Exploiting Efficient Parallelism with Selective Rematerialization for Lightning Attention on Ascend NPU H. Bao, Z. Su, A. Setyaev, S. Kamenev, A. Gneushev, K. Zhao, J. Xiao, H. Lin, A. Bistrigova, S. Buzykanov, E. Tetin, G. Tan, B. Liu, X. Zou, Z. Dong, C. Korikov, X. Yu, Z. Hu
Scalable All-to-allv Algorithms for Dynamic and Irregular Communication Patterns C. Wei, A. Bhatele
Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents A. Sarkar, S. Ghosh, N. Tallent, A. Chadha, T. Roosta, A. Jannesari
Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnect J. Bellavita, L. Pichetti, T. Pasquali, F. Vella, G. Guidi
Lock Shielding: A General Technique for Misuse-Resilient Locks V. Shahare, M. Chabbi, N. Hegde
FaaSlim: Partial Caching of Snapshot-based VMs for Serverless Computing S. Eom, C. Park, G. Lee, H. Moon, Y. Choi
CATS: Correlation-aware Task Scheduling for GPU Power Optimization in AI Data Centers S. Subramaniyan, X. Wang
MPMOS: Massively Parallel Multi-Objective Shortest Paths L. Gold, D. Sidoti, K. Pattipati, O. Khan
DynSpAttn: Efficient Attention via Dual-Side Dynamic Sparsity on Sparse Tensor Cores R. Fan, X. Yu, Z. Li, W. Luo, G. Gong, X. Chu
CORE-BFS: Communication Optimized Rectangular Partitioned BFS Achieving 160.845 TeraTEPS on Frontier Supercomputer H. Yang, H. Lu, M. Matheson, F. Wang, H. Liu
Clutch: High Performance Vector-Scalar Comparison using DRAM via Chunked Temporal Coding D. Tokuda, T. Kubo, I. Yuksel, A. Olgun, H. Luo, T. Nagatani, G. De Oliveira Junior, A. Yağlıkçı, M. Sadrosadati, O. Mutlu, S. Takamaeda-Yamazaki
CKTI: A Domain-Specific Compiler for Lowering CUDA Kernels to Triton-IR C. Shi, R. Chen, Y. Sun, Y. Sui, J. Zhang, Y. Xie, M. Wang, S. Ming, S. Zhang, Y. Zhang
Exploiting Hybrid Energy Storage to Minimize the Carbon Footprint of AI Data Centers S. Wu, X. Wang
AdaPolySI: Adaptive Polynomial Filtered Subspace Iteration for Hermitian Interior Eigenvalue Problems Y. Ni, X. Xu, S. Li, J. Zhang, J. Chen, J. Wang, J. Roman
Non-Delayed Cholesky Factorization Y. Luo, S. Zhang, W. Liu
Distributed Disjoint Weighted Matchings in Demand-Aware Reconfigurable Optical Datacenters S. Heck, K. Hanauer, S. Schmid
SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication C. Zhuang, L. Zhang, B. Brock, D. Wu, P. Chen, T. Endo, S. Matsuoka, M. Wahib
LayerScope: Predictive Cross-Layer Scheduling for Efficient Multi-Batch MoE Inference on Legacy Servers E. Yu, D. Dong, Z. Zhang, Z. Bai, W. Yang, H. Wang, D. Li, Y. Wu, L. Xiangke
TADS: Trend-Aware Dynamic Load Balancing for Large-Scale SNN Simulations with Delay-Sharded Graph Infrastructure H. Huang, S. Pang, Y. Zeng, G. Feng, Z. Chen, Y. Lu
Rethinking Collision Detection on GPU Ray Tracing Architecture D. Mandarapu, I. Fuksman, A. Pelenitsyn, G. Bernstein, M. Kulkarni
Barrier-Aware Task Scheduling for Bulk-Synchronous Parallel Architectures T. Noack, A. Koch
EZCache: A Hierarchical Memory System for Zoned Neutral Atom Quantum Computers J. Zhong, Y. Deng, H. Jiang, J. Feng
Closing the Efficiency Gap: AI Datacenter Co-design Roadmap for Scalable Training of LLMs J. Tithi, H. Wu, J. Park, A. Abuhatzera, F. Petrini, T. Krishna
GRASP: Optimizing VLIW Instruction Scheduling via Graph Reinforcement Learning Z. Wang, W. Tong, J. Fang, Y. Zhang, W. Wang, J. Ren, Z. Tang
Parallel Quadratic Selected Inversion in Quantum Transport Simulation V. Maillou, M. Bollhofer, O. Schenk, A. Ziogas, M. Luisier
MegaZK: A Memory Efficient GPU System Accelerating End-to-end Zero-Knowledge Proof M. Li, Y. Yu, B. Wang, X. Fan, M. Gao, S. Deng
SumcheckPIM: An Efficient HBM-Based PIM Architecture for Linear Complexity Zero Knowledge Proofs C. Kim, T. Kang, S. Shin, T. Suh, Y. Yang, G. Koo
GPIR: Enabling Practical Private Information Retrieval with GPUs H. Ji, H. Yu, J. Kim, W. Choi, G. Suh, J. Ahn
COMETS: Cost-effective Multi-node Efficient Training System with Memory Pooling and Sharing H. Chen, S. Yang, M. Soltaniyeh, S. Pei, A. Chang, B. Kim, C. Hao
CXL-CCL: Inter-Node Collective GPU-Communication Using a CXL Shared Memory Pool D. Xu, H. Meng, X. Chen, D. Zhu, W. Tang, F. Liu, L. Xie, W. Xiang, R. Shi, Y. Li, H. Hu, H. Zhang, D. Li, J. Jiang
DEFT: Joint Task Placement and DVFS for Energy-Efficient Multi-GPU Runtimes J. Chen, M. Pericàs
THAC: Unlocking Performance in Parallel HPC Applications via UQ-Aware Automated Approximation Z. Zhao, B. Wang, B. Yang, X. Chen, J. Liu, Q. Wang
WindStencil: Unleashing GPU Potential for High-Order Stencil Computation in High-Performance Inviscid CFD Simulations X. Zhang, H. Zhang, X. Liu, J. Li, R. Jin, J. Zhang, W. Yuan, S. Liang, Z. Lu
SVSIG: Incremental Streaming Graph Processing with Source Vertex Suppression J. Huang, X. Yan, D. Fu, H. Bian, T. Cao, Z. Li
ColdMap: Compaction-Aware Cost-Benefit Zone Cleaning for ZNS-Based Key-Value Stores S. Byeon, K. Min, J. Park, S. Lee, H. Kim, J. Han, J. Hwang, Z. Cao, Y. Kim
Coordinating GPU Data Centers and Power Grid Regulation Service for Exogenous Carbon Benefits D. Wong, A. Jahanshahi, S. Golrouye, O. Anderson, N. Yu
PACER: A Userspace Network Rate Controller in MPI with Adaptive Compression for Parallel Applications Y. Li, D. Ng, A. Kashyap, S. Di, G. Li, X. Lu
RPFC: A Router Partitioning and Forward Channel Routing Framework for 2.5D MCM System S. Tao, Z. Guo, T. Liu, J. Wang
GAAF: Fast and Scalable Graph-based Vector Similarity Search with Any-Match Label Filtering M. Ma, X. Yin, J. Qiu
Block-Aware Adaptive State Management for Optimistic Parallel Discrete Event Simulation X. Peng, Q. Wang, G. Liu, C. Hong, R. Xia, Z. Sun, X. Chen, Q. Zhang, J. Liu
Cheetah: Optimizing Execution Pipelines for Matrix-Free Finite Element Operators on GPUs J. Ren, H. Ltaief, S. Zampini, D. Keyes
Continuation-Preserving Tiling for Pointer-Chasing Optimization in Structured Mutual Recursion A. Kumar, V. Singh, S. Biswas
The Performance-Power Frontier: A Model-Driven Approach to Energy-Aware Application Optimisation S. Pasupuleti, S. Wright
HoloGraph: Bridging the Throughput Gap in Heterogeneous Graph Pattern Matching via Workload-Aware Steering M. Haotian, W. Hsu, Y. Chung
SpinTune: Improving the Reliability of Quantum Sensor Networks for Practical Quantum-Classical Utility J. Ludmir, N. DiBrita, J. Han, T. Patel
TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency M. Hasanat, J. Ludmir, T. Patel, R. Roy
Harnessing MPI mutations for AI error detection A. Auville, T. Jammer, E. Petit, P. Castro, E. Saillard, M. Popov
C-3PQ: A Closeness Centrality-based Circuit Partitioner for Quantum Simulations D. Popovici, H. Lee, N. Yoshioka, M. Ben, N. Ito, K. Klymko, D. Camps, A. Butko
Optimizing Streaming Tensor Decomposition on GPU W. Lin, J. Sheng, S. Feng, M. Dun, H. Cao, Q. Sun
S2VEC: Compiler-Driven Stream Specialization for Linearized Vectorization L. Crespo, A. Fernandes, G. Falcao, P. Tomás, N. Roma, N. Neves
TOTO: Transparent I/O Tuning for HPC Applications F. Boito, L. Teylo, M. Popov, L. Aimi, A. Bandet, L. Pilla, G. Pallez
Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU Y. Li, G. Guidi
Diagonal-Budgeted Trotterization for Efficient Quantum Hamiltonian Simulation S. Chundury, B. Burgstahler, J. Li, I. Suh, F. Mueller
GFAz: State-of-the-Art Graphical Fragment Assembly Compression T. Yang, Y. Liu, B. Jiang, X. Shi, S. Jin
ViSim: A Lightweight SpMV Performance Simulator via Statistical and Visual Residual Learning S. Zhu, W. Huangfu, G. Chu
StreamGuard: Low-Overhead Resilience for Real-time HPC Data Streams H. Nguyen, B. Nicolae, T. Bicer, A. Gueroudji, M. Dorier, K. Chard, I. Foster
DA-MLAD: Drift-Decomposed Meta-Learning for Continual Log Anomaly Detection in Supercomputing Systems K. Tan, Y. Du, D. Zhan, Y. Xie, H. Yu, B. Zhao, H. Liu
Latency-SLO-Aware Memory Offloading for Large Language Model Inference C. Ma, H. Zhao, Z. Ye, Z. Yang, T. Fu, J. Han, J. Zhang, Y. Luo, X. Wang, Z. Wang, Y. Li, D. Zhou
DynLP: Parallel Dynamic Batch Update for Label Propagation in Semi-Supervised Learning S. Shovan, A. Khanda, S. Ferdous, S. Das, M. Halappanavar
CipherSkip: Efficient Sparse Matrix Multiplication with FHE W. Xiong, H. Zhou, Y. Ye, R. Jin, L. Xu
Three Birds, One Stone: Fast, Accurate-aware and Cost-Efficient Accelerator for Ternary LLM W. Jung, J. Kang, S. Shin, H. Um, J. Lim, G. Koo, Y. Park, S. Park, T. Suh
OCTANE: Breaking the Neighbor-List Bottleneck in GPU Molecular Dynamics H. Toutouni, S. Chakraborty, Y. Tu, J. Huang