Susav Shrestha
Researching high-throughput, low-latency LLM inference systems at scale. My focus spans model parallelism, efficient attention, and sparsity-driven optimizations for scalable distributed inference across GPU clusters.
Featured Publications
NeurIPS 2025
Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
Experience
2026–
Senior AI and HPC Engineer · NVIDIA
Feb 2026 – Present
2024–25
Research Intern · NVIDIA
Austin, TX · May – Aug 2024
Santa Clara, CA · May – Aug 2025
2022
Research Intern · Samsung
San Jose, CA · May – Aug 2022
Education
PhD, Computer Engineering
Texas A&M University · Aug 2021 – Feb 2026
Advised by Dr. Narasimha Reddy
Recent Updates
Feb 2026
Successfully defended my PhD dissertation
2025
Polar Sparsity accepted at NeurIPS 2025
