This is where I log the papers I've read this year. My goal for 2026 is to read 300 papers. As of the last update to this document (May 25, 2026), I have read 61 papers this year. At a steady pace I should have read 118 papers by now, so I am 57 papers behind schedule.
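Here is a quick check of the schedule arithmetic. It is a minimal sketch that assumes a uniform pace of 300/365 papers per day and counts only the days fully completed before the update date (144 days as of May 25). It also rounds the target down to whole papers. The function name `reading_pace` is my own, used only for illustration:

```python
from datetime import date


def reading_pace(goal: int, read: int, as_of: date) -> tuple[int, int]:
    """Return (expected, behind) for a uniform pace of `goal` papers per year.

    Assumption: only days fully completed before `as_of` count toward the
    target, and the target is rounded down to a whole number of papers.
    """
    start = date(as_of.year, 1, 1)
    days_in_year = (date(as_of.year + 1, 1, 1) - start).days  # 365, or 366 in leap years
    days_elapsed = (as_of - start).days  # completed days before `as_of`
    expected = goal * days_elapsed // days_in_year  # floor to whole papers
    return expected, expected - read


expected, behind = reading_pace(goal=300, read=61, as_of=date(2026, 5, 25))
print(expected, behind)  # 118 57
```

Counting completed days rather than including the update day itself is what yields 118. Including May 25 would put the target at 119.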

| Date Read | Title | Link | Affiliation, Venue, Year |
|---|---|---|---|
| 5/24/26 | MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints | Link | University of Michigan, arXiv, 2026 |
| 5/22/26 | MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache | Link | University of Edinburgh, arXiv, 2025 |
| 5/21/26 | HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading | Link | Caltech, arXiv, 2025 |
| 5/20/26 | Helios: Adaptive Model and Early-Exit Selection for Efficient LLM Inference Serving | Link | UT Austin, arXiv, 2025 |
| 5/14/26 | CUCo: An Agentic Framework for Compute and Communication Co-design | Link | UT Austin, arXiv, 2026 |
| 5/14/26 | Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs | Link | Carnegie Mellon, OSDI, 2025 |
| 5/9/26 | Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | Link | Georgia Tech/Microsoft, OSDI, 2024 |
| 5/8/26 | MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU | Link | University of Notre Dame, arXiv, 2026 |
| 5/7/26 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | Link | Stanford, ICML, 2023 |
| 5/6/26 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | Link | UT Austin/Carnegie Mellon, NeurIPS, 2023 |
| 5/5/26 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Link | UC Berkeley, SOSP, 2023 |
| 5/4/26 | Orca: A Distributed Serving System for Transformer-Based Generative Models | Link | Seoul National University, OSDI, 2022 |
| 5/4/26 | Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios | Link | AI Security Institute, arXiv, 2026 |
| 4/30/26 | Pie: Pooling CPU Memory for LLM Inference | Link | UC Berkeley, arXiv, 2024 |
| 4/29/26 | Oneiros: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving | Link | UT Austin, SoCC, 2026 |
| 3/9/26 | HiCCL: A Hierarchical Collective Communication Library | Link | Stanford University, IPDPS, 2025 |
| 3/8/26 | MPM-LLM4DSE: Reaching the Pareto Frontier in HLS with Multimodal Learning and LLM-Driven Exploration | Link | Shantou University, DATE, 2026 |
| 3/5/26 | Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects | Link | University of Rome, SC, 2024 |
| 3/2/26 | RDMA over Ethernet for Distributed Training at Meta Scale | Link | Meta, ACM SIGCOMM, 2024 |
| 3/1/26 | big.VLITTLE: On-Demand Data-Parallel Acceleration for Mobile Systems on Chip | Link | Cornell University, MICRO, 2022 |
| 2/27/26 | Synthesizing Optimal Collective Algorithms | Link | Microsoft Research, PPoPP, 2021 |
| 2/20/26 | Computing the Full Earth System at 1km Resolution | Link | Max Planck Institute for Meteorology, SC, 2025 |
| 2/18/26 | The Memory Processing Unit: A Generalized Interface for End-to-End In-Memory Execution | Link | University of Illinois, HPCA, 2026 |
| 2/5/26 | ATLAHS: An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage | Link | ETH Zurich, SC, 2025 |
| 2/5/26 | Real-Time Object Detection and Recognition in FPGA-Based Autonomous Driving Systems | Link | Samsung, IJCTT, 2024 |
| 2/4/26 | Harmonic CUDA: Asynchronous Programming on GPUs | Link | University of California/NVIDIA, PMAM, 2023 |
| 2/3/26 | Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References | Link | NVIDIA, CGO, 2026 |
| 2/1/26 | BetterTogether: An Interference-Aware Framework for Fine-grained Software Pipelining on Heterogeneous SoCs | Link | University of California, IISWC, 2025 |
| 1/29/26 | Optimizing Green Energy Consumption of Fog Computing Architectures | Link | University of Rennes (France), IEEE SBAC-PAD, 2020 |
| 1/28/26 | An Online Fragmentation-Aware Scheduler for Managing GPU-Sharing Workloads on Multi-Instance GPUs | Link | National Tsing Hua University (Taiwan)/IBM, arXiv, 2025 |
| 1/28/26 | Collective Communication for 100k+ GPUs | Link | Meta, arXiv, 2026 |
| 1/26/26 | Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms | Link | ETH Zurich/NVIDIA, arXiv, 2025 |
| 1/19/26 | The Landscape of GPU-Centric Communication | Link | Koç University, arXiv, 2024 |
| 1/18/26 | Hot Regions in SPEC CPU2017 | Link | UT Austin, IISWC, 2018 |
| 1/12/26 | A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems | Link | Nanjing University (China), arXiv, 2026 |
| 1/10/26 | PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel | Link | Meta AI, Proceedings of the VLDB Endowment, 2023 |
| 1/8/26 | MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications | Link | Microsoft, arXiv, 2025 |
| 1/8/26 | Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap | Link | UT Austin/AMD, arXiv, 2025 |
| 1/7/26 | GPGPU Power Modeling for Multi-Domain Voltage-Frequency Scaling | Link | UT Austin, IEEE Transactions on Computers, 2012 |
| 1/3/26 | Phase-Based Frequency Scaling for Energy-Efficient Heterogeneous Computing | Link | University of Salerno (Italy), IPDPS, 2025 |
| 1/2/26 | Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data | Link | Princeton, MICRO, 2003, 12 pages |
| 1/1/26 | GPGPU Power Modeling for Multi-Domain Voltage-Frequency Scaling | Link | 2018, 12 pages |
| 12/31/25 | Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling | Link | 2025, 15 pages |
| 12/31/25 | Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs | Link | 2025, 15 pages |
| 12/27/25 | Optimizing Distributed ML Communication with Fused Computation-Collective Operations | Link | 2024, 17 pages |
| 12/26/25 | Defect graph neural networks for materials discovery in high-temperature clean-energy applications | Link | 2023, 12 pages |
| 12/24/25 | Power Stabilization for AI Training Datacenters | Link | 2025, 10 pages |
| 12/23/25 | Advancing Cloud Computing Capabilities on gem5 by Implementing the RISC-V Hypervisor Extension | Link | 2024, 8 pages |