Publications
My research focuses on building efficient and scalable machine learning systems, with an emphasis on inference optimization through sparsity, hardware-aware design, and distributed architectures.
Featured Work
NeurIPS 2025 Published
Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
Introduces contextual sparsity techniques that deliver up to 2.2× speedup in batched LLM inference across diverse batch sizes and sequence lengths.
Journal Publications
The Journal of Supercomputing 2025 Journal
Storage Access Optimization for Efficient GPU-Centric Information Retrieval
Optimizes storage access patterns for GPU-centric retrieval systems, delivering substantial speedups in embedding processing.
Preprints
arXiv 2025 Preprint
Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding
*Equal contribution
Develops adaptive calibration techniques for speculative decoding, improving efficiency in language model inference.
