GPU Performance · ML Systems · AI for Science

Sushrut Kumar

I work at the intersection of GPU systems and AI for science — writing CUDA kernels, scaling high-performance computing workloads, training neural operators at scale, and building infrastructure that makes hard computing problems tractable. My experience spans both advancing methods and accelerating them on modern silicon.

PhD Candidate, Johns Hopkins · Defending April 2026
GitHub · Publications · Email
About

I'm a GPU performance and ML systems engineer finishing a PhD at Johns Hopkins. My research background is in large-scale scientific computing — problems that stress-test both numerical methods and hardware simultaneously. That combination pushed me deep into GPU architecture, kernel optimization, and distributed systems.

At NVIDIA, I built ML infrastructure for physics simulation inside AI factories — hybrid neural operators trained on multi-GPU clusters for real-time physical state reconstruction from live sensor streams. The system replaced classical solvers running on dedicated HPC racks with inference at 5000× the speed, contributing to five US patent filings.

My PhD work took me even deeper into the hardware. I took a 30-year-old, 75,000-line CFD solver and rebuilt it end-to-end for modern GPU clusters — writing CUDA kernels for sparse linear solvers, ray tracing, and geometric computations; scaling inter-node communication with CUDA-aware MPI, NCCL, and NVSHMEM; and profiling exhaustively with Nsight to reach 90%+ weak and strong scaling efficiency at 1 billion grid points on L40S, A100, H100, and GH200 nodes. That work produced a 500× end-to-end speedup over the CPU baseline and a formal collaboration with NVIDIA.

That foundation in low-level GPU systems also led to kernel research: writing Blackwell GEMM kernels from scratch with CuTe and CUTLASS, exploring warp specialization and Tensor Memory Accelerator patterns. I think in memory hierarchies and compute rooflines.

I've published 13 peer-reviewed papers, presented at GTC 2026, and hold an NVIDIA Academic Research Grant (4 RTX Blackwell GPUs) to keep pushing the work forward.

5000×: Inference speedup · neural operator (NVIDIA)
500×: End-to-end speedup · GPU solver vs. CPU baseline
1B: Grid points · multi-GPU solver at 90%+ scaling efficiency
13: Peer-reviewed publications
5: US patents filed · NVIDIA internship
Experience
Jan–Sep 2024 · May–Sep 2025
NVIDIA · Santa Clara
ML Systems & Physics Simulation Intern
Built ML training and inference infrastructure for physics simulation inside AI factories. Designed a hybrid neural operator combining Graph Neural Operators for sparse sensor encoding with 3D Fourier Neural Operators for full-field physical state reconstruction. Scaled training to 8×A100s using DDP, mixed precision, and activation checkpointing (pattern sketched below). Built GPU-accelerated data pipelines in NVIDIA Warp to process 1 TB of training data.
→ 5000× inference speedup vs. classical solvers · 5 US patents filed
PyTorch · PyG · PhysicsNeMo · NVIDIA Warp · DDP · mixed precision · GNO · FNO · A100 · neural operators
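
In sketch form, the training-loop pattern behind that scaling: DDP wrapping, autocast/GradScaler mixed precision, and a checkpointed forward. The model, sizes, and data here are hypothetical stand-ins, not the production code.

```python
# Minimal sketch: DDP + mixed precision + activation checkpointing.
# Model, sizes, and data are hypothetical stand-ins, not the NVIDIA code.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self, width=1024, depth=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(width, width), nn.GELU()) for _ in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            # Recompute this block's activations in backward: trades FLOPs for memory.
            x = checkpoint(blk, x, use_reentrant=False)
        return x

def train(rank: int, world_size: int):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(Net().cuda(rank), device_ids=[rank])   # gradient all-reduce over NCCL
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()               # loss scaling for FP16

    for _ in range(100):
        x = torch.randn(64, 1024, device=rank)   # stand-in batch
        y = torch.randn(64, 1024, device=rank)   # stand-in target
        opt.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = nn.functional.mse_loss(model(x), y)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
```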
Jan 2021 – present
Johns Hopkins · Flow Physics Lab
Doctoral Researcher — GPU Systems & ML
Two parallel workstreams over five years.

GPU kernel & systems engineering: Rewrote a 75,000-line legacy solver end-to-end for modern multi-GPU clusters. CUDA kernels for sparse linear solvers (preconditioned BiCGStab; iteration sketched below), ray tracing, and geometric computations. Inter-node communication via CUDA-aware MPI, NCCL, and NVSHMEM. Nsight-profiled throughout — occupancy, memory bandwidth, warp divergence — to reach 90%+ scaling efficiency at 1B grid points. Tested on L40S, A100, H100, GH200 Superchip.
→ 500× total speedup: 10× new numerical methods + 25× GPU acceleration + 2× algorithmic improvements · Presented at NVIDIA GTC 2026, San Jose

ML research: Built AeTHERON — a heterogeneous GNN neural simulator with sparse cross-attention and autoregressive time-conditioning for unsteady spatio-temporal prediction on unstructured meshes. Analyzed >40 TB of 4D simulation data across 3 journal papers.
CUDA C++ · OpenACC · MPI · NCCL · NVSHMEM · Nsight · PyTorch · PyG · GNN · H100 · GH200 · C++ · Fortran
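
For the curious, the iteration at the heart of the sparse solver. This is an unpreconditioned NumPy sketch of BiCGStab for illustration; the production version is a preconditioned CUDA implementation and is not reproduced here.

```python
# Unpreconditioned BiCGStab (van der Vorst) in NumPy: the iteration the CUDA
# solver implements, minus the preconditioner and all GPU specifics.
import numpy as np

def bicgstab(A, b, x0=None, tol=1e-8, max_iter=1000):
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    r_hat = r.copy()                    # fixed shadow residual
    rho = alpha = omega = 1.0
    v = p = np.zeros_like(b)
    for _ in range(max_iter):
        rho_new = r_hat @ r
        beta = (rho_new / rho) * (alpha / omega)
        rho = rho_new
        p = r + beta * (p - omega * v)
        v = A @ p
        alpha = rho / (r_hat @ v)
        s = r - alpha * v               # intermediate residual
        t = A @ s
        omega = (t @ s) / (t @ t)       # stabilization step
        x += alpha * p + omega * s
        r = s - omega * t
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
    return x

# Quick sanity check on a small SPD system.
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
assert np.allclose(A @ bicgstab(A, b), b, atol=1e-6)
```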
Sep–Nov 2020
University of Melbourne
Research Intern
Built a CNN autoencoder to predict PDE solutions from initial conditions — 400× faster than the numerical solver on the same problem class.
TensorFlow · CNN · PDE surrogate
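
The encoder-decoder shape, in miniature (layer counts and sizes are illustrative, not the original network):

```python
# Sketch of a CNN autoencoder mapping an initial-condition field to a PDE
# solution field. Layer counts and sizes are illustrative, not the original.
import tensorflow as tf
from tensorflow.keras import layers

def build_surrogate(grid=64, channels=1):
    ic = tf.keras.Input(shape=(grid, grid, channels))   # initial condition
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(ic)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)  # latent
    x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2DTranspose(channels, 3, strides=2, padding="same")(x)  # solution
    return tf.keras.Model(ic, out)

model = build_surrogate()
model.compile(optimizer="adam", loss="mse")
```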
May–Jul 2019
IIT Kharagpur · JNCASR Fellow
Research Fellow
Modeled particle transport via stochastic rotation dynamics. Restructured data layout to cut time complexity from O(NDK) to O(K), then applied SIMD vectorization and LLVM-backed JIT compilation via Numba — 40× net speedup.
Python · Numba · SIMD · algorithmic optimization
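
The flavor of that optimization, as a toy kernel: one O(N) binning pass instead of repeated rescans, JIT-compiled with Numba. Illustrative only; this is not the actual SRD code.

```python
# Toy version of the pattern: precompute per-cell quantities in a single O(N)
# pass instead of rescanning all particles per interaction, then let Numba's
# LLVM backend JIT (and, with fastmath, vectorize) the hot loop.
import numpy as np
from numba import njit

@njit(fastmath=True)
def cell_mean_velocity(vel, cell_ids, n_cells):
    dims = vel.shape[1]
    sums = np.zeros((n_cells, dims))
    counts = np.zeros(n_cells)
    for i in range(vel.shape[0]):       # single O(N) pass over particles
        c = cell_ids[i]
        counts[c] += 1.0
        for d in range(dims):
            sums[c, d] += vel[i, d]
    for c in range(n_cells):            # normalize occupied cells
        if counts[c] > 0.0:
            for d in range(dims):
                sums[c, d] /= counts[c]
    return sums

vel = np.random.standard_normal((10_000, 3))
cell_ids = np.random.randint(0, 64, size=10_000)
means = cell_mean_velocity(vel, cell_ids, 64)
```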
Technical Stack
GPU / CUDA
CUDA C++ · CuTe / CUTLASS · GEMM kernels · Warp specialization · TMA / persistent kernels · OpenACC · NCCL / NVSHMEM · Nsight Systems / Compute · CUDA graphs
ML Systems
PyTorch · Distributed training (DDP) · Mixed precision · Neural operators (FNO/GNO) · GNNs · PyTorch Geometric · Activation checkpointing · NVIDIA Warp / PhysicsNeMo
HPC / Systems
C++ (modern) · MPI · SLURM / HPC clusters · Sparse linear solvers · Fortran · Python / NumPy / SciPy · CMake · PySpark / Pandas
Hardware
Blackwell (B100/B200) · Hopper (H100) · GH200 Superchip · A100 · L40S · Roofline analysis · Memory hierarchy opt. · Occupancy tuning
Open Source
CUDA · Blackwell · Kernel Engineering
blackwell_gemm_bench

GEMM kernels for NVIDIA Blackwell written from scratch with CuTe and CUTLASS. Explores warp specialization, Tensor Memory Accelerator (TMA), and persistent kernel patterns. Built to understand what the hardware can actually do — not just call cuBLAS (see the roofline sketch below).

CuTe · CUTLASS · Blackwell · TMA · CUDA C++
View on GitHub ↗
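
A back-of-envelope roofline check of the kind that motivates these kernels: given a GEMM shape, is it compute-bound or memory-bound? The peak numbers below are placeholders, not official Blackwell specs; substitute the target part's datasheet values.

```python
# Roofline check for an FP16 GEMM: compare arithmetic intensity against the
# machine balance (ridge point). Peak numbers are illustrative placeholders,
# not official Blackwell specs.
PEAK_TFLOPS = 1000.0   # assumed tensor-core peak, FP16 (placeholder)
PEAK_BW_TBS = 8.0      # assumed HBM bandwidth, TB/s (placeholder)

def gemm_roofline(M, N, K, bytes_per_elem=2):
    flops = 2 * M * N * K                               # multiply-adds
    traffic = bytes_per_elem * (M * K + K * N + M * N)  # one pass over A, B, C
    intensity = flops / traffic                         # FLOP per byte
    ridge = (PEAK_TFLOPS * 1e12) / (PEAK_BW_TBS * 1e12) # machine balance
    return intensity, ridge, "compute" if intensity > ridge else "memory"

for shape in [(4096, 4096, 4096), (4096, 4096, 64)]:
    ai, ridge, bound = gemm_roofline(*shape)
    print(f"{shape}: intensity {ai:.0f} vs ridge {ridge:.0f} -> {bound}-bound")
```

The thin-K case lands below the ridge point: no amount of kernel cleverness makes it compute-bound, which is why the memory hierarchy (and TMA) dominates the design there.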
GNN · Neural Operator · ML Research
AeTHERON

Heterogeneous GNN-based neural simulator with sparse cross-attention and autoregressive time-conditioning on unstructured meshes. Learns to predict unsteady spatio-temporal fields from data — replacing hours of numerical simulation with milliseconds of inference.

PyTorch Geometric · GNN · sparse attention · neural operator
View on GitHub ↗
CUDA · Multi-GPU · HPC Solver
ImmerseFlow++

Multi-GPU solver scaled to ~1 billion grid points with 90%+ efficiency. Custom CUDA kernels for sparse linear algebra and geometry, CUDA-aware MPI + NVSHMEM for inter-node communication, Nsight-profiled throughout. Tested on A100, H100, GH200.

CUDA · MPI · NVSHMEM · C++ · OpenACC
View on GitHub ↗
Research Highlights
[Placeholder: neural operator inference demo — image or video]
Neural Operators · NVIDIA · 2024–2025
Hybrid Graph-Fourier Neural Operator for Real-Time Physical State Reconstruction
Classical physics solvers discretize space and time, integrating equations forward step by step — computationally expensive by design. We asked: can a neural network learn the mapping from sparse sensor observations to full physical fields, end-to-end, fast enough for real-time deployment inside an AI factory?
By fusing Graph Neural Operators (for unstructured sensor graphs) with 3D Fourier Neural Operators (for global field reconstruction), we achieved 5000× inference speedup over the classical baseline — with accuracy sufficient for production digital twin deployment.
GNO · FNO · PyTorch · DDP · 8×A100 · 5 patents
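
The fusion pattern in miniature: encode sparse sensors, scatter them onto a regular grid, then apply a 3D spectral (Fourier) layer for global reconstruction. A deliberately simplified sketch, not the patented NVIDIA architecture.

```python
# Miniature version of the hybrid pattern: sensor encoder -> scatter to voxel
# grid -> 3D spectral layer -> pointwise decode. Simplified for illustration.
import torch
import torch.nn as nn

class SpectralConv3d(nn.Module):
    """Keeps only the lowest `modes` Fourier modes (one corner, for brevity)."""
    def __init__(self, c_in, c_out, modes):
        super().__init__()
        self.modes = modes
        scale = 1.0 / (c_in * c_out)
        self.weight = nn.Parameter(
            scale * torch.randn(c_in, c_out, modes, modes, modes, dtype=torch.cfloat))

    def forward(self, x):                              # x: (B, C, X, Y, Z)
        B, C, X, Y, Z = x.shape
        xf = torch.fft.rfftn(x, dim=(-3, -2, -1))
        out = torch.zeros(B, self.weight.shape[1], X, Y, Z // 2 + 1,
                          dtype=torch.cfloat, device=x.device)
        m = self.modes
        out[..., :m, :m, :m] = torch.einsum(
            "bixyz,ioxyz->boxyz", xf[..., :m, :m, :m], self.weight)
        return torch.fft.irfftn(out, s=(X, Y, Z), dim=(-3, -2, -1))

class HybridReconstructor(nn.Module):
    def __init__(self, grid=32, width=16, modes=8):
        super().__init__()
        self.grid = grid
        self.encode = nn.Sequential(nn.Linear(4, width), nn.GELU(),
                                    nn.Linear(width, width))
        self.fourier = SpectralConv3d(width, width, modes)
        self.decode = nn.Linear(width, 1)              # scalar output field

    def forward(self, sensor_xyz, sensor_val):
        # sensor_xyz: (S, 3) in [0, 1); sensor_val: (S, 1)
        feats = self.encode(torch.cat([sensor_xyz, sensor_val], dim=-1))
        # Scatter sensor features into their voxels (stand-in for a GNO encoder).
        idx = (sensor_xyz * self.grid).long().clamp(max=self.grid - 1)
        flat = idx[:, 0] * self.grid**2 + idx[:, 1] * self.grid + idx[:, 2]
        vox = torch.zeros(self.grid**3, feats.shape[1]).index_add_(0, flat, feats)
        vox = vox.T.reshape(1, -1, self.grid, self.grid, self.grid)
        h = nn.functional.gelu(self.fourier(vox))
        return self.decode(h.permute(0, 2, 3, 4, 1))   # (1, X, Y, Z, 1)

model = HybridReconstructor()
field = model(torch.rand(50, 3), torch.randn(50, 1))
```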
[Placeholder: Nsight profiling screenshot or roofline plot]
GPU Kernel Engineering · Johns Hopkins · 2021–2026
Extreme-Scale Multi-GPU Solver at 90%+ Efficiency
Starting from a 75,000-line legacy codebase, we rebuilt a high-performance immersed boundary solver for modern GPU clusters — custom CUDA kernels, NVSHMEM communication, and Nsight-guided optimization at the warp level.
500× end-to-end speedup. 90%+ weak and strong scaling to 1 billion grid points across H100 and GH200 nodes.
CUDA · NVSHMEM · H100 · GH200 · Nsight
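
The communication skeleton, shown here as a 1D halo exchange with mpi4py on host arrays; with a CUDA-aware MPI build, the same calls accept GPU buffers (e.g. CuPy arrays) directly. Layout and names are illustrative; the real solver decomposes in 3D and overlaps exchange with interior compute.

```python
# Skeleton of a 1D halo exchange between domain slabs. With CUDA-aware MPI the
# identical Sendrecv calls work on GPU arrays; this sketch uses NumPy on host.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

nx = 1024                        # interior cells per rank (illustrative)
u = np.zeros(nx + 2)             # one halo cell on each side
u[1:-1] = rank                   # stand-in field data

# Send my boundary cells; receive the neighbors' into my halo cells.
comm.Sendrecv(sendbuf=u[1:2],   dest=left,  recvbuf=u[-1:], source=right)
comm.Sendrecv(sendbuf=u[-2:-1], dest=right, recvbuf=u[0:1], source=left)
```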
[Placeholder: AeTHERON architecture diagram or prediction results]
GNN · Neural Simulator · ML Research
AeTHERON: Autoregressive GNN Simulator on Unstructured Meshes
Unstructured mesh problems — where geometry is complex and irregular — are where classical neural PDE solvers struggle most. AeTHERON uses heterogeneous graph attention with autoregressive time-conditioning to roll out stable long-horizon predictions.
Stable multi-step rollout on unstructured meshes where standard GNNs diverge within a few steps.
PyTorch Geometric · sparse attention · GNN · autoregressive
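
The rollout pattern in schematic form: predict a per-step increment, feed the result back in, repeat — which is exactly where unstabilized models drift. The one-step model and time encoding below are placeholders, not AeTHERON itself.

```python
# Schematic of autoregressive rollout: predict an increment, feed the output
# back in, repeat. The one-step model is a placeholder, not AeTHERON.
import torch
import torch.nn as nn

class OneStep(nn.Module):
    """Stand-in for the GNN: maps node states (+ time code) to increments."""
    def __init__(self, n_feat=3, t_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_feat + t_dim, 64), nn.GELU(),
                                 nn.Linear(64, n_feat))

    def forward(self, x, t_code):
        return self.net(torch.cat([x, t_code.expand(x.shape[0], -1)], dim=-1))

@torch.no_grad()
def rollout(model, x0, steps, t_dim=8):
    x, traj = x0, [x0]
    for k in range(steps):
        # Sinusoidal time conditioning (an assumed encoding for this sketch).
        freqs = torch.arange(t_dim // 2)
        t = torch.cat([torch.sin(k / 10.0 ** (2 * freqs / t_dim)),
                       torch.cos(k / 10.0 ** (2 * freqs / t_dim))])
        x = x + model(x, t)          # residual update: predict the increment
        traj.append(x)
    return torch.stack(traj)         # (steps + 1, nodes, features)

traj = rollout(OneStep(), torch.randn(100, 3), steps=20)
```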
Selected Publications
2026
High-Performance Implementation of Sharp-Interface Immersed Boundary Method on Multi-GPU Clusters for Extreme-Scale Simulation with Moving Boundaries
In preparation · Journal of Computational Physics
2026
GPU-Accelerated Simulations of Moving Boundary Problems and Fluid-Structure Interaction at Extreme Scales
AIAA Aviation 2026 (Accepted)
2026
A GPU-Accelerated Sharp Interface Immersed Boundary Solver for Large Scale Flow Simulations
AIAA SciTech 2026
2025
Freeman Scholar Lecture — Sharp-Interface Immersed Boundary Methods in Fluid Dynamics
ASME Journal of Fluids Engineering, 147(3), 030801
2025
Computational Modelling and Analysis of the Coupled Aero-Structural Dynamics in Bat-Inspired Wings
Journal of Fluid Mechanics, 1010, A53
2023
Force Moment Partitioning and Scaling Analysis of Vortices Shed by a 2D Pitching Wing in Quiescent Fluid
Experiments in Fluids, 64, 158
2022
Contribution of Spanwise and Cross-Span Vortices to Lift Generation of Low-Aspect-Ratio Wings
Physical Review Fluids, 7(11), 114102
2020
A Quantitative Analysis of Machine Learning Based Regressors for Pressure Reconstruction in Particle Image Velocimetry
ASME Fluid Engineering Division Summer Meeting
→ All 13 publications on Google Scholar
Contact

Let's build something unreasonably fast.

Open to conversations about GPU performance, ML systems, and research at the boundary of hardware and learning.

Email · GitHub · LinkedIn · Google Scholar