All program times are in British Summer Time (BST, UTC+1). Workshop session locations are currently denoted Room [1-6] and will be specified later. Paper session locations are currently denoted Room [ABC] and will be specified later.

| Day | Time | Room 1 | Room 2 | Room 3 | Room 4 | Room 5 | Room 6 |
|---|---|---|---|---|---|---|---|
| Monday | 08:00-09:00 | Registration | | | | | |
| | 09:00-10:30 | Workshop: MCCSys | Workshop: Benchmark | Workshop: AI4HPCC | Workshop: Arch4Health | Workshop: WOCC'26 | Workshop: Ramulator and DRAM Bender |
| | 10:30-11:00 | Coffee | | | | | |
| | 11:00-12:30 | Workshop: MCCSys | Workshop: Benchmark | Workshop: AI4HPCC | Workshop: Arch4Health | Workshop: WOCC'26 | Workshop: Ramulator and DRAM Bender |
| | 12:30-13:30 | Lunch | | | | | |
| | 13:30-15:00 | Workshop: MCCSys | Workshop: Benchmark | Workshop: AI4HPCC | Workshop: Arch4Health | Workshop: PhysQ | Workshop: NextAccel |
| | 15:00-15:30 | Coffee | | | | | |
| | 15:30-17:00 | Workshop: MCCSys | Workshop: Benchmark | Workshop: AI4HPCC | Workshop: Arch4Health | Workshop: PhysQ | Workshop: NextAccel |


| Day | Time | Room A | Room B | Room C |
|---|---|---|---|---|
| Tuesday | 07:00-08:00 | Registration | | |
| | 08:10-09:20 | Opening + Keynote (Room A) | | |
| | 09:20-10:20 | Best Paper Candidates (Plenary, Room A) | | |
| | 10:20-10:50 | Coffee | | |
| | 10:50-12:10 | Runtime Scheduling and Adaptive Execution | Compiler, Code Generation and Autotuning | Energy & Sustainability |
| | 12:10-13:40 | Lunch | | |
| | 13:40-15:00 | Performance Modeling & Insight | Fortran Mini-Workshop | I/O & Storage |
| | 15:00-15:30 | Coffee | | |
| | 16:00-18:00 | Tour of Belfast | | |
| | 18:00-20:00 | Reception at Dark Horse | | |


| Day | Time | Room A | Room B | Room C |
|---|---|---|---|---|
| Wednesday | 07:30-08:10 | Registration | | |
| | 08:10-09:20 | Keynote (Room A) | | |
| | 09:20-10:20 | Lightning Talks/Posters (Plenary, Room A) | | |
| | 10:20-10:50 | Coffee | | |
| | 10:50-12:10 | AI Training Systems | Graph Search & Paths | Communication & Collectives |
| | 12:10-13:40 | Lunch + Poster Session | | |
| | 13:40-15:00 | LLM Serving | Graph Analytics | Near-Memory Architectures |
| | 15:00-15:30 | Coffee | | |
| | 15:30-17:10 | Data Analytics & Compression | Graph Traversal & Connectivity | CXL & Memory Systems |
| | 18:00-20:00 | Conference Toast and Tour of Queen's University Belfast | | |
| | 20:00-22:00 | Banquet at The Lanyon Building, Queen's University Belfast | | |


| Day | Time | Room A | Room B | Room C |
|---|---|---|---|---|
| Thursday | 07:30-08:10 | Registration | | |
| | 08:10-09:20 | Keynote (Room A) | | |
| | 09:20-10:20 | Cross-Layer Performance Optimization | (No sessions) | |
| | 10:20-10:50 | Coffee | | |
| | 10:50-12:10 | AI Inference | GPU-Accelerated Query and Geometry | Resilience and Error Detection |
| | 12:10-13:40 | Lunch | | |
| | 13:40-14:40 | AI Kernels & Parallelism | Sparse & Tensor Kernels | Energy-Aware Systems |
| | 15:00-15:30 | Coffee | | |
| | 15:30-17:10 | Efficient Privacy Computing | Numerical & Scientific Kernels | Quantum Computing |


| Time | Room A | Room B | Room C |
|---|---|---|---|
| 09:20-10:20 | Best Paper Candidates: Coordinating GPU Data Centers and Power Grid Regulation Service for Exogenous Carbon Benefits (A. Jahanshahi, S. Golrouye, O. Anderson, N. Yu, D. Wong); FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving (S. Gao, J. Yin, F. Wang, W. Dong); OCTANE: Breaking the Neighbor-List Bottleneck in GPU Molecular Dynamics (H. Toutouni, S. Chakraborty, Y. Tu, J. Huang) | (No sessions) | |
| 10:50-12:10 | Runtime Scheduling and Adaptive Execution: Barrier-Aware Task Scheduling for Bulk-Synchronous Parallel Architectures (T. Noack, A. Koch); FaaSlim: Partial Caching of Snapshot-based VMs for Serverless Computing (S. Eom, C. Park, G. Lee, H. Moon, Y. Choi); Block-Aware Adaptive State Management for Optimistic Parallel Discrete Event Simulation (X. Peng, Q. Wang, G. Liu, C. Hong, R. Xia, Z. Sun, X. Chen, Q. Zhang, J. Liu); Lock Shielding: A General Technique for Misuse-Resilient Locks (V. Shahare, M. Chabbi, N. Hegde) | Compiler, Code Generation and Autotuning: GRASP: Optimizing VLIW Instruction Scheduling via Graph Reinforcement Learning (Z. Wang, W. Tong, J. Fang, Y. Zhang, W. Wang, J. Ren, Z. Tang); Continuation-Preserving Tiling for Pointer-Chasing Optimization in Structured Mutual Recursion (A. Kumar, V. Singh, S. Biswas); S2VEC: Compiler-Driven Stream Specialization for Linearized Vectorization (L. Crespo, A. Fernandes, G. Falcao, P. Tomás, N. Roma, N. Neves); CKTI: A Domain-Specific Compiler for Lowering CUDA Kernels to Triton-IR (C. Shi, R. Chen, Y. Sun, Y. Sui, J. Zhang, Y. Xie, M. Wang, S. Ming, S. Zhang, Y. Zhang) | Energy & Sustainability: Wattchmen: Watching the Wattchers – High Fidelity, Flexible GPU Energy Modeling (B. Tran, M. Sinclair, S. Venkataraman, M. Maiterth, W. Shin); Agile QoS-aware Dynamic Power Management with eBPF Governors (M. Rezvani, D. Wong); SmartCap: Coordinated CPU–GPU Power Capping for Performance-Assurance Energy Efficiency (Z. Zheng, Z. Lan, X. Wu, V. Taylor, M. Papka); CATS: Correlation-aware Task Scheduling for GPU Power Optimization in AI Data Centers (S. Subramaniyan, X. Wang) |
| 13:40-15:00 | Performance Modeling & Insight: TenProf: A Tensor-Centric Profiler for Deep Learning Workload Analysis and Optimization (X. Ding, K. Zhou, Y. Hao, P. Su); GRASP: Fine-grained and Adaptive Sampled Simulation for GPU Performance Modeling (L. Chao, Z. Huang, P. Cai, J. Xue, T. Xiong, R. Xue); Mantis: Decoding HPC Telemetry Data for Robust System Prediction (Y. Lu, J. Ren, E. Smirni); ViSim: A Lightweight SpMV Performance Simulator via Statistical and Visual Residual Learning (S. Zhu, W. Huangfu, G. Chu) | Fortran Mini-Workshop: An interactive discussion of a recent survey of the international Fortran ecosystem led by Austen Rainer and Andrew Brown | I/O & Storage: Harmonia: Enhancing Data Placement and Migration in Hybrid Storage Systems via Multi-Agent Reinforcement Learning (R. Nadig, V. Arulchelvan, R. Bera, T. Shahroodi, G. Singh, A. Kakolyris, I. Yuksel, M. Sadrosadati, J. Park, O. Mutlu); ColdMap: Compaction-Aware Cost-Benefit Zone Cleaning for ZNS-Based Key-Value Stores (S. Byeon, K. Min, J. Park, S. Lee, H. Kim, J. Han, J. Hwang, Z. Cao, Y. Kim); CoCache: Accelerating Reads in KV Stores via Cooperative Metadata and Data Cache Management (H. Tang, W. Zhu, Q. Zhang, J. Zhang, J. Jiang, Z. Zhang, H. Zhang, Y. Li, Y. Xu); TOTO: Transparent I/O Tuning for HPC Applications (F. Boito, L. Teylo, M. Popov, L. Aimi, A. Bandet, L. Pilla, G. Pallez) |


| Time | Room A | Room B | Room C |
|---|---|---|---|
| 09:20-10:20 | Lightning Talks/Posters: These lightning talks are based on the accepted posters. | (No sessions) | |
| 10:50-12:10 | AI Training Systems: Closing the Efficiency Gap: AI Datacenter Co-design Roadmap for Scalable Training of LLMs (J. Tithi, H. Wu, J. Park, A. Abuhatzera, F. Petrini, T. Krishna); COMETS: Cost-effective Multi-node Efficient Training System with Memory Pooling and Sharing (H. Chen, S. Yang, M. Soltaniyeh, S. Pei, A. Chang, B. Kim, C. Hao); SPPO: Making Million-Token LLM Training Practical on Modest GPU Clusters (Q. Chen, S. Li, W. Gao, P. Sun, Y. Wen, T. Zhang); Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents (A. Sarkar, S. Ghosh, N. Tallent, A. Chadha, T. Roosta, A. Jannesari) | Graph Search & Paths: G-PathGen: An Efficient GPU-Parallel k-Critical Path Generation Algorithm (C. Chang, Y. Chung, C. Chiu, W. Lee, B. Zhang, U. Schlichtmann, I. Lin, X. Yu, T. Huang); Parallel Bidirectional A* Search for GPU-Accelerated Pathfinding (H. Al Khansa, J. Luna, A. Mouawad, I. Hajj); MPMOS: Massively Parallel Multi-Objective Shortest Paths (L. Gold, D. Sidoti, K. Pattipati, O. Khan); DistroMatch: Distributed Disjoint Weighted Matchings in Demand-Aware Reconfigurable Optical Datacenters (S. Heck, K. Hanauer, S. Schmid) | Communication & Collectives: SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication (C. Zhuang, L. Zhang, B. Brock, D. Wu, P. Chen, T. Endo, S. Matsuoka, M. Wahib); PACER: A Userspace Network Rate Controller in MPI with Adaptive Compression for Parallel Applications (Y. Li, D. Ng, A. Kashyap, S. Di, G. Li, X. Lu); Skew-aware Adaptive All-to-allv Algorithms for Dynamic Deep Learning Workloads (C. Wei, A. Bhatele); StencilMD: Optimizing Communication in Molecular Dynamics Simulations (R. Deng, T. Schardl) |
| 13:40-15:00 | LLM Serving: Taming Dynamic Diffusion LLM Inference through Virtual Static Execution (J. Zhu, H. Wu, Y. Li, H. Wang, R. Li, J. Zhai); SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving (Q. Zhou, P. Yin, P. Zuo, C. Wang, J. Cheng); InferFast: Bridging the Gap Between Unstructured LLM Sparsity and Practical GPU Throughput (Z. Shen, W. Bu, X. He, K. Sheng, H. Chen); dLLM-Serve: Bridging the Memory Gap in Diffusion Language Model Serving (J. Fan, Y. Zhang, X. Li, D. Nikolopoulos) | Graph Analytics: DCSM: Enabling Inter-Batch Parallelism for Continuous Subgraph Matching on GPU (Y. Wei, P. Jiang); SVSIG: Incremental Streaming Graph Processing with Source Vertex Suppression (J. Huang, X. Yan, D. Fu, H. Bian, T. Cao, Z. Li); GAAF: Fast and Scalable Graph-based Vector Similarity Search with Any-Match Label Filtering (M. Ma, X. Yin, J. Qiu); HoloGraph: Bridging the Throughput Gap in Heterogeneous Graph Pattern Matching via Workload-Aware Steering (M. Haotian, W. Hsu, Y. Chung) | Near-Memory Architectures: DANMP: Accelerating Multi-Scale Deformable Attention Using Near-Memory-Processing Architecture (H. Li, Q. Wang, B. Gao, D. Chen, Y. Huang, X. Xin); RPFC: A Router Partitioning and Forward Channel Routing Framework for 2.5D MCM System (S. Tao, Z. Guo, T. Liu, J. Wang); UpDown: Efficient Manycore based on Many Threading and Scalable Memory Parallelism (A. Rajasukumar, R. Xu, T. Zhang, Y. Wang, T. Su, M. Nourian, J. Ding, J. Su, R. Khandelwal, A. Fell, D. Gleich, Y. Li, H. Hoffmann, A. Chien); Clutch: High Performance Vector-Scalar Comparison using DRAM via Chunked Temporal Coding (D. Tokuda, T. Kubo, I. Yuksel, A. Olgun, H. Luo, T. Nagatani, G. De Oliveira Junior, A. Yağlıkçı, M. Sadrosadati, O. Mutlu, S. Takamaeda-Yamazaki) |
| 15:30-17:10 | Data Analytics & Compression: GPZ: GPU-Accelerated Lossy Compressor for Particle Data (R. Li, Y. Huang, L. Zhang, Z. Yang, S. Di, B. Zhang, J. Huang, J. Liu, J. Tian, G. Li, F. Song, H. Guo, F. Cappello, K. Zhao); GFAz: State-of-the-Art Graphical Fragment Assembly Compression (T. Yang, Y. Liu, B. Jiang, X. Shi, S. Jin); Optimizing Streaming Tensor Decomposition on GPU (W. Lin, J. Sheng, S. Feng, M. Dun, H. Cao, Q. Sun); DA-MLAD: Drift-Decomposed Meta-Learning for Continual Log Anomaly Detection in Supercomputing Systems (K. Tan, Y. Du, D. Zhan, Y. Xie, H. Yu, B. Zhao, H. Liu); TADS: Trend-Aware Dynamic Load Balancing for Large-Scale SNN Simulations with Delay-Sharded Graph Infrastructure (H. Huang, S. Pang, Y. Zeng, G. Feng, Z. Chen, Y. Lu) | Graph Traversal & Connectivity: cuMIS: A Unified Scalable Framework for Computing Maximal Independent Sets on Trillion-Edge Graphs (J. Nke, S. Kang, B. Rees, C. Lee); BLEST: Blazingly Efficient BFS using Tensor Cores (D. Elbek, K. Kaya); CORE-BFS: Communication-Optimized REctangular-partitioned BFS Achieving 160.845 TeraTEPS on Frontier Supercomputer (H. Yang, H. Lu, M. Matheson, F. Wang, H. Liu); DynLP: Parallel Dynamic Batch Update for Label Propagation in Graph-based Semi-Supervised Learning (S. Shovan, A. Khanda, S. Ferdous, S. Das, M. Halappanavar); Parametric Mappings for Distributed-Memory Tensor Computations (B. Wu, M. Kong) | CXL & Memory Systems: CXL-CCL: Inter-Node Collective GPU-Communication Using a CXL Shared Memory Pool (D. Xu, H. Meng, X. Chen, D. Zhu, W. Tang, F. Liu, L. Xie, W. Xiang, R. Shi, Y. Li, H. Hu, H. Zhang, D. Li, J. Jiang); IBEX: Internal Bandwidth-Efficient Compression Architecture for Scalable CXL Memory Expansion (Y. Ko, H. Park, H. Lee, H. Lee); Clone: A Collaborative Multi-device System for Retrieval-Augmented Generation over CXL (S. Ko, W. Doh, E. Na, H. Shim, S. Yun, J. So, Y. Kwon, S. Park, S. Roh, M. Yoon, T. Song, E. Lee, J. Ahn); Anchoring Whole-System Persistence and Resilience in CXL (Y. Zhou, J. Zeng, C. Jung); Griffin: Coherency-Aware Task Scheduling and Memory Allocation for CXL Interconnects (S. Lee, K. Diab, D. Tootaghaj, L. Cao, P. Sharma, A. Gavrilovska) |


| Time | Room A | Room B | Room C |
|---|---|---|---|
| 09:20-10:20 | Cross-Layer Performance Optimization: THAC: Unlocking Performance in Parallel HPC Applications via UQ-Aware Automated Approximation (Z. Zhao, B. Wang, B. Yang, X. Chen, J. Liu, Q. Wang); Cross-Architecture Autotuning for Single-Source Heterogeneous Programming Models (H. Abram, N. Papadopoulou, J. Domke, M. Pericàs); Look Before You Leap: Precision Instruction Supply via SmartScout (X. Zhang, P. Qu, T. Zhang, F. Su, Z. Pan, Y. Zhang) | (No sessions) | |
| 10:50-12:10 | AI Inference: LayerScope: Predictive Cross-Layer Scheduling for Efficient Multi-Batch MoE Inference on Legacy Servers (E. Yu, D. Dong, Z. Zhang, Z. Bai, W. Yang, H. Wang, D. Li, Y. Wu, L. Xiangke); Aurora: A Disaggregated GPU-PNM-PIM System for High-Throughput Mixed-Length LLM Inference (H. Kim, S. Yu, M. Kim, J. Lee, H. Sung, E. Lee); HOPO: Accelerating Multimodal Neural Networks Inference via Holistic Parallelism Optimization (Y. Zheng, J. Sun, H. Li, G. Sun, J. Li); Latency-SLO-Aware Memory Offloading for Large Language Model Inference (C. Ma, H. Zhao, Z. Ye, Z. Yang, T. Fu, J. Han, J. Zhang, Y. Luo, X. Wang, Z. Wang, Y. Li, D. Zhou) | GPU-Accelerated Query and Geometry: Parallel Query Processing through Optimal Key Grouping on GPU-Based B+-Trees (Z. Chen, J. Li, J. Meng, N. Pitaksirianan, Y. Tu, B. Zeng, C. Dong); X-HD: Fast Hausdorff Distance Computation with Ray Tracing (L. Geng, Z. Yuan, R. Lee, X. Zhang, F. Wang); Rethinking Collision Detection on GPU Ray Tracing Architecture (D. Mandarapu, I. Fuksman, A. Pelenitsyn, G. Bernstein, M. Kulkarni) | Resilience and Error Detection: Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference (Y. Huang, S. Di, G. Li); SpinTune: Improving the Reliability of Quantum Sensor Networks for Practical Quantum-Classical Utility (J. Ludmir, N. DiBrita, J. Han, P. Tirthak); Harnessing MPI mutations for AI error detection (A. Auville, T. Jammer, E. Petit, P. Castro, E. Saillard, M. Popov); StreamGuard: Low-Overhead Resilience for Real-time HPC Data Streams (H. Nguyen, B. Nicolae, T. Bicer, A. Gueroudji, M. Dorier, K. Chard, I. Foster) |
| 13:40-15:00 | AI Kernels & Parallelism: HPMD: Enabling Hybrid Parallelism with Multi-Dimensional Adaptive DNN Training (G. Yun, Y. Choi); EPLoN: Exploiting Efficient Parallelism with Selective Rematerialization for Lightning Attention on Ascend NPU (H. Bao, Z. Su, A. Setyaev, S. Kamenev, A. Gneushev, K. Zhao, J. Xiao, H. Lin, A. Bistrigova, S. Buzykanov, E. Tetin, G. Tan, B. Liu, X. Zou, Z. Dong, C. Korikov, X. Yu, Z. Hu); DynSpAttn: Efficient Attention via Dual-Side Dynamic Sparsity on Sparse Tensor Cores (R. Fan, X. Yu, Z. Li, W. Luo, G. Gong, X. Chu); Three Birds, One Stone: Fast, Accurate-aware and Cost-Efficient Accelerator for Ternary LLM (W. Jung, J. Kang, S. Shin, H. Um, J. Lim, G. Koo, Y. Park, S. Park, T. Suh) | Sparse & Tensor Kernels: Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnect (J. Bellavita, L. Pichetti, T. Pasquali, F. Vella, G. Guidi); Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU (Y. Li, G. Guidi); PolyKAN: A High-Performance and Universal GPU Operator Library for Polynomial Kolmogorov-Arnold Networks (M. Yu, H. Zhong, J. Jiang, D. Huang, Y. Lu) | Energy-Aware Systems: DEFT: Joint Task Placement and DVFS for Energy-Efficient Multi-GPU Runtimes (J. Chen, M. Pericàs); Phase-aware Peak Power Reduction for Minimizing the Capital Expense of LLM Inference (S. Wu, Y. Ma, X. Wang); Exploiting Hybrid Energy Storage to Minimize the Carbon Footprint of AI Data Centers (S. Wu, X. Wang); The Performance-Power Frontier: A Model-Driven Approach to Energy-Aware Application Optimisation (S. Pasupuleti, S. Wright) |
| 15:30-17:10 | Efficient Privacy Computing: MegaZK: A Memory Efficient GPU System Accelerating End-to-end Zero-Knowledge Proof (M. Li, Y. Yu, B. Wang, X. Fan, M. Gao, S. Deng); SumcheckPIM: An Efficient HBM-Based PIM Architecture for Linear Complexity Zero Knowledge Proofs (C. Kim, T. Kang, S. Shin, T. Suh, Y. Yang, G. Koo); GPIR: Enabling Practical Private Information Retrieval with GPUs (H. Ji, H. Yu, J. Kim, W. Choi, G. Suh, J. Ahn); Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems (Z. Gong, R. Ran, F. Yao, W. Wen); CipherSkip: Efficient Sparse Matrix Multiplication with FHE (W. Xiong, H. Zhou, Y. Ye, R. Jin, L. Xu) | Numerical & Scientific Kernels: WindStencil: Unleashing GPU Potential for High-Order Stencil Computation in High-Performance Inviscid CFD Simulations (X. Zhang, H. Zhang, X. Liu, J. Li, R. Jin, J. Zhang, W. Yuan, S. Liang, Z. Lu); Cheetah: Optimizing Execution Pipelines for Matrix-Free Finite Element Operators on GPUs (J. Ren, H. Ltaief, S. Zampini, D. Keyes); AdaPolySI: Adaptive Polynomial Filtered Subspace Iteration for Hermitian Interior Eigenvalue Problems (Y. Ni, X. Xu, S. Li, J. Zhang, J. Chen, J. Wang, J. Roman); Non-Delayed Cholesky Factorization (Y. Luo, S. Zhang, W. Liu); Parallel Quadratic Selected Inversion in Quantum Transport Simulation (V. Maillou, M. Bollhofer, O. Schenk, A. Ziogas, M. Luisier) | Quantum Computing: quEStab: Towards Scalable Quantum Circuit Simulation on Multi-GPU using an Extended Stabilizer Formalism (H. Shin, S. Lee, Y. Kim); EZCache: A Hierarchical Memory System for Zoned Neutral Atom Quantum Computers (J. Zhong, Y. Deng, H. Jiang, J. Feng); TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency (M. Hasanat, J. Ludmir, T. Patel, R. Roy); C-3PQ: A Closeness Centrality-based Circuit Partitioner for Quantum Simulations (D. Popovici, H. Lee, N. Yoshioka, M. Ben, N. Ito, K. Klymko, D. Camps, A. Butko); Diagonal-Budgeted Trotterization for Efficient Quantum Hamiltonian Simulation (S. Chundury, B. Burgstahler, J. Li, I. Suh, F. Mueller) |