Susav Shrestha
Researching high-throughput, low-latency LLM inference systems at scale. My focus spans model parallelism, efficient attention, and sparsity-driven optimizations for scalable distributed inference across GPU clusters.
Featured Publications
NeurIPS 2025
Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
Experience
2026–
Senior AI and HPC Engineer · NVIDIA
Feb 2026 – Present
2024–25
Research Intern · NVIDIA
Austin, TX · May – Aug 2024
Santa Clara, CA · May – Aug 2025
2022
Research Intern · Samsung
San Jose, CA · May – Aug 2022
Education
PhD, Computer Engineering
Texas A&M University · Aug 2021 – Feb 2026
Advised by Dr. Narasimha Reddy
Recent Updates
Feb 2026
Successfully defended my PhD dissertation
2025
Polar Sparsity accepted at NeurIPS 2025
