This is where I log the papers I've read this year. My goal for 2026 is to read 300 papers. As of the last update to this document (May 25, 2026), I have read 61 papers this year. At a steady pace I should have read 118 papers by now, meaning I am 57 papers behind schedule.
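The pacing numbers above can be reproduced with a short calculation. This is a minimal sketch, assuming a linear pace over the calendar year, counting only the full days elapsed before the update date, and rounding down; the function names `expected_papers` and `papers_behind` are hypothetical and the page itself may compute the figure differently.

```python
from datetime import date


def expected_papers(goal: int, today: date) -> int:
    """Papers that should be read by `today` under an assumed linear pace."""
    start = date(today.year, 1, 1)
    end = date(today.year + 1, 1, 1)
    days_elapsed = (today - start).days   # full days before `today`
    days_in_year = (end - start).days     # 365, or 366 in a leap year
    return goal * days_elapsed // days_in_year  # round down


def papers_behind(goal: int, read: int, today: date) -> int:
    """Positive when behind schedule, negative when ahead."""
    return expected_papers(goal, today) - read


updated = date(2026, 5, 25)
print(expected_papers(300, updated))    # 118
print(papers_behind(300, 61, updated))  # 57
```

Counting the update day itself as elapsed would give 119 instead of 118, which is why the sketch uses full days only.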

[Figure: a monthly calendar depicting the following tabular data graphically.]

Date | Title | Source
5/24/26 | MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints | University of Michigan, arXiv, 2026
5/22/26 | MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache | University of Edinburgh, arXiv, 2025
5/21/26 | HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading | Caltech, arXiv, 2025
5/20/26 | Helios: Adaptive Model and Early-Exit Selection for Efficient LLM Inference Serving | UT Austin, arXiv, 2025
5/14/26 | CUCo: An Agentic Framework for Compute and Communication Co-design | UT Austin, arXiv, 2026
5/14/26 | Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs | Carnegie Mellon, OSDI, 2025
5/9/26 | Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | Georgia Tech/Microsoft, OSDI, 2024
5/8/26 | MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU | U. of Notre Dame, arXiv, 2026
5/7/26 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | Stanford, ICML, 2023
5/6/26 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | UT Austin/Carnegie Mellon, NeurIPS, 2023
5/5/26 | Efficient Memory Management for Large Language Model Serving with PagedAttention | UC Berkeley, SOSP, 2023
5/4/26 | Orca: A Distributed Serving System for Transformer-Based Generative Models | Seoul National University, OSDI, 2022
5/4/26 | Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios | AI Security Institute, arXiv, 2026
4/30/26 | Pie: Pooling CPU Memory for LLM Inference | UC Berkeley, arXiv, 2024
4/29/26 | Oneiros: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving | UT Austin, SoCC, 2026
3/9/26 | HiCCL: A Hierarchical Collective Communication Library | Stanford University, IPDPS, 2025
3/8/26 | MPM-LLM4DSE: Reaching the Pareto Frontier in HLS with Multimodal Learning and LLM-Driven Exploration | Shantou University, DATE, 2026
3/5/26 | Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects | University of Rome, SC, 2024
3/2/26 | RDMA over Ethernet for Distributed Training at Meta Scale | Meta, ACM SIGCOMM, 2024
3/1/26 | big.VLITTLE: On-Demand Data-Parallel Acceleration for Mobile Systems on Chip | Cornell University, MICRO, 2022
2/27/26 | Synthesizing optimal collective algorithms | Microsoft Research, PPoPP, 2021
2/20/26 | Computing the Full Earth System at 1km Resolution | Max Planck Institute for Meteorology, SC, 2025
2/18/26 | The Memory Processing Unit: A Generalized Interface for End-to-End In-Memory Execution | University of Illinois, HPCA, 2026
2/5/26 | ATLAHS: An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage | ETH Zurich, SC, 2025
2/5/26 | Real-Time Object Detection and Recognition in FPGA-Based Autonomous Driving Systems | Samsung, IJCTT, 2024
2/4/26 | Harmonic CUDA: Asynchronous Programming on GPUs | University of California/NVIDIA, PMAM, 2023
2/3/26 | Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References | NVIDIA, CGO, 2026
2/1/26 | BetterTogether: An Interference-Aware Framework for Fine-grained Software Pipelining on Heterogeneous SoCs | University of California, IISWC, 2025
1/29/26 | Optimizing Green Energy Consumption of Fog Computing Architectures | University of Rennes (France), IEEE SBAC-PAD, 2020
1/28/26 | An Online Fragmentation-Aware Scheduler for Managing GPU-Sharing Workloads on Multi-Instance GPUs | National Tsing Hua University (Taiwan)/IBM, arXiv, 2025
1/28/26 | Collective Communication for 100k+ GPUs | Meta, arXiv, 2026
1/26/26 | Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms | ETH Zurich/NVIDIA, arXiv, 2025
1/19/26 | The Landscape of GPU-Centric Communication | Koç University, arXiv, 2024
1/18/26 | Hot Regions in SPEC CPU2017 | UT Austin, IISWC, 2018
1/12/26 | A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems | Nanjing University (China), arXiv, 2026
1/10/26 | PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel | Meta AI, Proceedings of the VLDB Endowment, 2023
1/8/26 | MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications | Microsoft, arXiv, 2025
1/8/26 | Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap | UT Austin/AMD, arXiv, 2025
1/7/26 | GPGPU Power Modeling for Multi-Domain Voltage-Frequency Scaling | UT Austin, IEEE Transactions on Computers, 2012
1/3/26 | Phase-Based Frequency Scaling for Energy-Efficient Heterogeneous Computing | University of Salerno (Italy), IPDPS, 2025
1/2/26 | Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data | Princeton, MICRO, 2003, 12 pages
1/1/26 | GPGPU Power Modeling for Multi-Domain Voltage-Frequency Scaling | 2018, 12 pages
12/31/25 | Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling | 2025, 15 pages
12/31/25 | Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs | 2025, 15 pages
12/27/25 | Optimizing Distributed ML Communication with Fused Computation-Collective Operations | 2024, 17 pages
12/26/25 | Defect graph neural networks for materials discovery in high-temperature clean-energy applications | 2023, 12 pages
12/24/25 | Power Stabilization for AI Training Datacenters | 2025, 10 pages
12/23/25 | Advancing Cloud Computing Capabilities on gem5 by Implementing the RISC-V Hypervisor Extension | 2024, 8 pages