Susav Shrestha

I specialize in building efficient and scalable machine learning systems for real-world deployment. My work bridges the gap between algorithm design and system-level efficiency, with a focus on accelerating large-scale inference through hardware-aware design, sparsity, and parallelism.

I am a PhD candidate in Computer Engineering at Texas A&M University, advised by Dr. Narasimha Reddy.


Research Interests

  • Efficient and Sparse LLM Inference
  • Hardware-Efficient and High-Throughput Distributed Inference at Scale

Selected Publications

  • Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
    Susav Shrestha, Brad Settlemyer, Nikoli Dryden, Narasimha Reddy. Paper, Code

  • ESPN: Memory-Efficient Multi-vector Information Retrieval
    Susav Shrestha, Narasimha Reddy, Zongwang Li. In Proceedings of ISMM 2024. Paper, Code

Selected Experience

  • Research Intern, NVIDIA, Santa Clara, CA, May - Aug 2025
  • Research Intern, NVIDIA, Austin, TX, May - Aug 2024
  • Research Intern, Samsung Semiconductor, San Jose, CA, May - Aug 2022

Updates

  • 🔬 Started a Research Internship at NVIDIA for Summer 2025 in Santa Clara
  • 📄 Released a new paper, Polar Sparsity, achieving up to 2.2× decoding speedup in LLM inference