Making Large-Scale AI Self-Optimising: NUS Computing’s He Bingsheng Receives 2026 Google Research Award
Training a large language model is not just about writing good code and pressing run. These models are spread across hundreds or thousands of processors – such as Google’s Tensor Processing Units (TPUs), specialised accelerators designed to train and serve large-scale machine learning models – that must work in lockstep: exchanging data, splitting tasks, staying synchronised. When one part of that pipeline stalls, the whole system slows down, and expensive hardware sits idle.
The problem? Finding and fixing these bottlenecks is still a craft. Engineers pore over profiling traces, adjust configurations by trial and error, and lean on deep systems expertise that most research teams simply do not have.
Professor He Bingsheng wants to change that. His project, selected for the 2026 Google Awards for Machine Learning Research and Education with TPUs, is building tools that automatically diagnose and resolve performance bottlenecks in distributed AI workloads, turning what has been a manual, expert-driven process into an automated, reproducible one.
The project, “Lightweight and Automated Performance Optimization of Training and Inference on TPUs”, rethinks profiling itself as the foundation of a self-optimising system.
Rather than simply flagging problems for a human to interpret, the tools capture bottlenecks with minimal disruption and generate actionable optimisation strategies on their own. Working alongside Professor He on the project are Visiting Research Fellow Weihao Cui and PhD researchers Feng Yu and Junyi Hou, with a shared long-term vision: making efficient large-scale ML accessible to any researcher, not just a small pool of systems specialists.
The most immediate impact will be on large language model workloads – both dense transformers and mixture-of-experts (MoE) variants – across pre-training, fine-tuning, and inference. MoE models, which route different inputs to specialised sub-networks, are particularly tricky: their sparse, irregular routing patterns make manual performance tuning especially painful. Recommendation systems and embedding-heavy applications face similar challenges, relying on high-dimensional sparse lookups that stress distributed infrastructure in much the same way.
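To see why MoE routing is hard to tune by hand, consider a toy illustration in Python. It is a minimal sketch and not the project's tooling: the function names, the top-1 routing rule, the synthetic gate scores and the bias towards one expert are all assumptions made for the example. Each token goes to its highest-scoring expert, and the sketch measures how unevenly the work ends up being spread.

```python
import random
from collections import Counter


def route_top1(scores):
    """Return the index of the highest-scoring expert for one token."""
    return max(range(len(scores)), key=lambda e: scores[e])


def expert_loads(token_scores, num_experts):
    """Count how many tokens each expert receives under top-1 routing."""
    counts = Counter(route_top1(s) for s in token_scores)
    return [counts.get(e, 0) for e in range(num_experts)]


def imbalance(loads):
    """Busiest expert's load divided by the mean load (1.0 means perfectly balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


# Synthetic gate scores (assumption): expert 0 is slightly favoured,
# standing in for a router that has learned to prefer one expert.
rng = random.Random(0)
num_experts, num_tokens = 4, 1000
bias = [1.0, 0.0, 0.0, 0.0]
token_scores = [
    [rng.gauss(0, 1) + bias[e] for e in range(num_experts)]
    for _ in range(num_tokens)
]

loads = expert_loads(token_scores, num_experts)
print("tokens per expert:", loads)
print("imbalance factor:", round(imbalance(loads), 2))
```

In a synchronised training step, every device waits for the busiest expert to finish, so an imbalance factor well above 1.0 translates into idle hardware. Because the routing pattern shifts with the data and as the model trains, a configuration tuned by hand for one batch may not hold for the next.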
The project is being developed as an open-source initiative under Medusa Compute, with publications planned at top-tier ML and systems venues.
