I work at the intersection of GPU systems and AI for science: writing CUDA kernels, running high-performance computing workloads, training neural operators at scale, and building infrastructure that makes hard computing problems tractable. My experience spans both developing new methods and accelerating them on modern silicon.
I'm a GPU performance and ML systems engineer finishing a PhD at Johns Hopkins. My research background is in large-scale scientific computing — problems that stress-test both numerical methods and hardware simultaneously. That combination pushed me deep into GPU architecture, kernel optimization, and distributed systems.
At NVIDIA, I built ML infrastructure for physics simulation inside AI factories — hybrid neural operators trained on multi-GPU clusters for real-time physical state reconstruction from live sensor streams. The system replaced classical solvers running on dedicated HPC racks with inference at 5000× the speed, contributing to 5 US patents.
My PhD work took me even deeper into the hardware. I took a 30-year-old, 75,000-line CFD solver and rebuilt it end-to-end for modern GPU clusters — writing CUDA kernels for sparse linear solvers, ray tracing, and geometric computations; scaling inter-node communication with CUDA-aware MPI, NCCL, and NVSHMEM; and profiling exhaustively with Nsight to reach 90%+ weak and strong scaling efficiency across 1 billion grid points on L40S, A100, H100, and GH200 nodes. That work produced a 500× end-to-end speedup over the CPU baseline and a formal collaboration with NVIDIA.
That foundation in low-level GPU systems also led to kernel research: writing Blackwell GEMM kernels from scratch with CuTe and CUTLASS, exploring warp specialization and Tensor Memory Accelerator patterns. I think in memory hierarchies and compute rooflines.
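"Thinking in rooflines" can be made concrete in a few lines. This is a generic sketch of the roofline model, not a measurement; the peak compute and bandwidth numbers below are illustrative placeholders, not specs for any particular GPU:

```python
# Minimal roofline model: attainable throughput is the lesser of peak
# compute and what memory bandwidth can deliver at a kernel's arithmetic
# intensity (FLOPs performed per byte moved).

def attainable_tflops(intensity_flop_per_byte, peak_tflops, bw_tb_per_s):
    """Return the roofline bound in TFLOP/s for a given arithmetic intensity."""
    return min(peak_tflops, bw_tb_per_s * intensity_flop_per_byte)

# Illustrative peaks (placeholders, not vendor specs): 100 TFLOP/s, 2 TB/s HBM.
PEAK, BW = 100.0, 2.0

# A streaming kernel like SAXPY (~0.125 FLOP/byte) sits under the memory
# slope; a large GEMM (high reuse) hits the compute ceiling instead.
saxpy_bound = attainable_tflops(0.125, PEAK, BW)
gemm_bound = attainable_tflops(200.0, PEAK, BW)

print(f"SAXPY bound: {saxpy_bound} TFLOP/s")   # bandwidth-limited: 0.25
print(f"GEMM bound:  {gemm_bound} TFLOP/s")    # compute-limited: 100.0
```

The crossover between the two regimes is the "ridge point" — on real hardware it is what tells you whether a kernel is worth more compute tuning or more data-movement tuning.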
I've published 13 peer-reviewed papers, presented at GTC 2026, and hold an NVIDIA Academic Research Grant (4 RTX Blackwell GPUs) to keep pushing the work forward.
GEMM kernels for NVIDIA Blackwell written from scratch with CuTe and CUTLASS. Explores warp specialization, Tensor Memory Accelerator (TMA), and persistent kernel patterns. Built to understand what the hardware can actually do — not just call cuBLAS.
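The core idea those kernels organize — staging tiles of the operands through fast memory and accumulating one output tile at a time — can be sketched on the CPU. This is a conceptual NumPy illustration of blocked GEMM, not the CUDA/CuTe implementation itself:

```python
import numpy as np

def tiled_gemm(A, B, tile=4):
    """Blocked matrix multiply. On a GPU, each K-chunk below would be a
    cooperative load into shared memory (via TMA on recent hardware);
    here the tiling is sketched serially for clarity."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            # Accumulate one output tile, streaming K in tile-sized chunks.
            for k0 in range(0, K, tile):
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
assert np.allclose(tiled_gemm(A, B), A @ B)
```

The reuse ratio of each tile is what pushes GEMM's arithmetic intensity high enough to be compute-bound; the real kernels spend most of their complexity on overlapping those tile loads with Tensor Core math.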
Heterogeneous GNN-based neural simulator with sparse cross-attention and autoregressive time-conditioning on unstructured meshes. Learns to predict unsteady spatio-temporal fields from data — replacing hours of numerical simulation with milliseconds of inference.
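Autoregressive time-conditioning reduces to a simple pattern: the model predicts the next state from the current one, and its own outputs are fed back in. A minimal sketch with a toy stand-in "model" (a linear decay, not the GNN):

```python
# Autoregressive rollout: each step consumes the previous prediction.
# The toy `decay` function stands in for a learned next-step model.

def rollout(model, state0, steps):
    """Return the trajectory [state0, model(state0), model(model(state0)), ...]."""
    states = [state0]
    for _ in range(steps):
        states.append(model(states[-1]))
    return states

decay = lambda x: 0.9 * x   # hypothetical stand-in for the neural simulator
traj = rollout(decay, 1.0, 3)
print(traj)   # four states: the initial condition plus three predicted steps
```

The practical difficulty this pattern creates — and what the time-conditioning addresses — is error accumulation: small per-step errors compound over the rollout, so training must expose the model to its own imperfect predictions.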
Multi-GPU solver scaled to ~1 billion grid points with 90%+ efficiency. Custom CUDA kernels for sparse linear algebra and geometry, CUDA-aware MPI + NVSHMEM for inter-node communication, Nsight-profiled throughout. Tested on A100, H100, GH200.
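For reference, the two efficiency figures quoted above are computed in the standard way. A generic sketch with hypothetical timings (not the measured values from the solver):

```python
def strong_scaling_eff(t1, tn, n):
    """Strong scaling: problem size fixed, ideal time on n GPUs is t1 / n."""
    return t1 / (n * tn)

def weak_scaling_eff(t1, tn):
    """Weak scaling: problem size grows with n, ideal time stays at t1."""
    return t1 / tn

# Hypothetical seconds-per-step timings, purely for illustration:
print(strong_scaling_eff(100.0, 13.5, 8))   # ~0.926 -> 92.6% efficient
print(weak_scaling_eff(10.0, 10.8))         # ~0.926 -> 92.6% efficient
```

Weak scaling is usually the kinder metric at billion-point scale, since per-GPU work stays constant; sustaining 90%+ on strong scaling as well means communication is genuinely overlapped with compute rather than just amortized.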
Open to conversations about GPU performance, ML systems, and research at the boundary of hardware and learning.