Small Collab Leads to Big Win at 2026 OpenEnv Hackathon San Francisco
May 12, 2026
William Chen (Founder, Touchdown Labs), Yiying Xie (Northeastern University), Warren Low (National University of Singapore), and Farhan Navas (National University of Singapore) take home $10,000 Mercor prize.
With great power comes great computability, a fact not lost on machine learning engineers who have harnessed the power of graphics processing units (GPUs) and their supporting parallel programming platform, Compute Unified Device Architecture (CUDA). CUDA kernels let thousands of GPU cores work on the same problem in parallel, performing trillions of operations per second and opening the door to endless possibilities in artificial intelligence (AI) and reinforcement learning (RL) applications.
The 2026 OpenEnv Hackathon San Francisco, hosted by Cerebral Valley and PyTorch, challenged teams to build RL environments and post-train a base model to improve performance across benchmarks aligned with specific themes. Mercor, a Fremont-based AI startup specializing in training large language models (LLMs), awarded its theme’s $10,000 prize to the team led by William Chen (Founder, Touchdown Labs) and supported by Yiying Xie (Northeastern University), Warren Low (National University of Singapore), and Farhan Navas (National University of Singapore) for their RL-powered LLM training architecture for GPUs.
Xie holds a bachelor’s degree in Business Administration from Shenzhen University and a Master of Science in Finance from the University of San Diego. She is currently pursuing a Master of Science in Computer Science at Northeastern’s Khoury College of Computer Sciences through Align, a program that allows students from non-CS backgrounds to transition into CS.
“Align has been incredibly impactful for my transition from finance to computer science,” says Xie. “Separating noise from uncertainty in finance models turns out to be exactly the same core skill needed for designing RL reward functions. This hackathon project was an opportunity to put that into practice, and it’s a direct reflection of what Align’s structured, research-grounded learning makes possible.”
Too many cores, not enough coordination
While CUDA-driven GPUs have a major processing advantage, their sheer capacity is also their biggest hurdle. There are three primary hardware-related performance bottlenecks: uncoalesced memory access, warp divergence, and load imbalance. Finding the culprit means digging into what work each core is doing and how data is being accessed across cores.
Data is typically stored sequentially by default, with elements placed next to each other in memory as it is allocated. That layout is quick to set up, but it only pays off at access time if threads running side by side read neighboring addresses, which the hardware can combine into a single memory transaction. When the values those threads need were not stored adjacently, each read requires its own trip to memory. That is uncoalesced memory access, one of the biggest killers of GPU performance.
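To make the cost concrete, here is a minimal Python sketch that counts how many memory segments a group of 32 threads touches under different access patterns. The 4-byte element size, the 128-byte segment size, and the function name `segments_touched` are illustrative assumptions, not details from the team's project, and real hardware behavior varies by GPU generation.

```python
WARP_SIZE = 32        # threads that issue a memory request together
ELEM_BYTES = 4        # assumption: 32-bit floats
SEGMENT_BYTES = 128   # assumption: size of one memory transaction in this toy model


def segments_touched(stride):
    """Count distinct segments touched when thread t reads element t * stride."""
    addresses = [t * stride * ELEM_BYTES for t in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})


# Neighboring threads reading neighboring elements share one segment.
coalesced = segments_touched(1)
# Threads reading elements 32 apart each land in their own segment.
strided = segments_touched(32)
print(coalesced, strided)
```

In this simplified model, the strided pattern needs 32 times as many memory transactions to deliver the same amount of useful data.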
Another significant problem is warp divergence. GPU threads are grouped into clusters of 32 called warps. Within each warp, threads execute the same instruction simultaneously, but divergent paths run sequentially, with threads not on the active path stuck idling. Consider a simple if-else statement: if a condition is met, follow Path A; else, follow Path B. When threads in the same warp disagree on the condition, the two paths execute one at a time, preventing the GPU from harnessing its parallel processing advantage. In the worst-case scenario, where each thread in a warp takes its own path, only 1 of 32 threads is active at a time and efficiency drops to about 3% from warp divergence alone.
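The 3% figure follows from simple arithmetic, which the following Python sketch reproduces. It assumes a toy model in which every path costs the same and distinct paths within a warp run strictly one after another; the function name `warp_efficiency` is hypothetical.

```python
WARP_SIZE = 32


def warp_efficiency(path_of_thread):
    """Fraction of thread slots doing useful work in one warp.

    path_of_thread[t] is a label for the branch thread t takes.
    Toy model: each distinct path is one serialized pass of equal cost.
    """
    if len(path_of_thread) != WARP_SIZE:
        raise ValueError("expected one path label per thread in the warp")
    passes = len(set(path_of_thread))   # serialized passes needed
    useful_slots = WARP_SIZE            # each thread is active in exactly one pass
    total_slots = WARP_SIZE * passes    # every thread occupies a slot in every pass
    return useful_slots / total_slots


uniform = warp_efficiency([0] * WARP_SIZE)                       # no divergence
if_else = warp_efficiency([t % 2 for t in range(WARP_SIZE)])     # two paths
worst_case = warp_efficiency(list(range(WARP_SIZE)))             # 32 paths
print(uniform, if_else, worst_case)
```

Under these assumptions, an even if-else split halves efficiency, and 32 distinct paths leave it at 1/32, or roughly 3%.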
Similarly, threads within the same cluster must wait for the slowest thread to finish. When there is a load imbalance, threads with relatively lower workloads are stuck inactive until threads with heavier workloads complete, leading to underutilization that diminishes overall throughput.
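A rough way to quantify that underutilization is to compare the work actually done with the work the warp could have done while waiting for its slowest thread. The Python sketch below does this for hypothetical per-thread workloads; the function name `utilization` and the numbers are illustrative assumptions.

```python
def utilization(work_per_thread):
    """Useful work divided by the capacity reserved until the slowest thread finishes."""
    longest = max(work_per_thread)
    if longest <= 0:
        raise ValueError("at least one thread must have positive work")
    return sum(work_per_thread) / (longest * len(work_per_thread))


balanced = utilization([10] * 32)            # every thread does 10 units of work
skewed = utilization([100] + [10] * 31)      # one thread does 10x the work of the rest
print(balanced, skewed)
```

In the skewed case, a single heavy thread keeps the other 31 waiting, so most of the warp's capacity goes unused.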
If it’s slow, make it faster;
Uncoalesced memory access, warp divergence, and load imbalance can all be fixed by hand: by assigning similar tasks to threads in the same warp, rewriting code to avoid branching statements, and other methods. However, the sheer number of cores in a GPU makes this incredibly time-consuming and resource-intensive.
But what if AI could rewrite CUDA kernels more efficiently, using direct performance feedback from the GPU to recursively optimize performance? That’s exactly what Xie and her team laid the groundwork for in their winning hackathon project.
While LLMs are known for generating natural language, they’re fundamentally sequence-to-sequence models, meaning they can learn to predict a sequence given a prompt in any language. This implies that they can learn to generate, and even optimize, CUDA kernels. DeepSeek’s 2024 Group Relative Policy Optimization (GRPO) RL algorithm was a critical pivot away from human feedback in the loop, but it was developed specifically for language reasoning tasks, not hardware optimization.
“Designating the GPU as the LLM’s teacher, where the LLM iteratively rewrites its code based on the GPU’s feedback, is a fundamentally different loop from asking an LLM to evaluate text,” explains Xie. “It’s inherently slower and more expensive because kernel optimization is a complex, multi-phase process. But if you can make it work efficiently, it offers huge performance gains.”
From foundations to frontiers
Xie and her team developed an end-to-end architecture and framework to close the RL training loop specifically for hardware optimization. Starting with GRPO, they leveraged a Mixture of Experts (MoE) architecture, where different parts of the model specialize in handling certain types of input, becoming “experts” in tasks that can run in parallel. The model then uses RL to improve itself for a series of five iterations with reward signals based on real-time measurements of hardware speed.
“Building RL environments to optimize CUDA Kernels is one of the most fundamental and high-leverage applications of the recursive self improvement loop we get from AI,” says Anirudh Ravichandran, Tech Lead Manager at Mercor. “This project’s positioning in that space, its execution, and compelling design make it a winner.”
A notorious issue in this type of analysis is self-evaluation bias, where the model reinforces errors as it assesses its own output. To address that, the team applied a Turn-level Reinforce-Leave-One-Out (TRLOO) methodology. In each iteration, or turn, the model's outputs are not all scored against a shared average that includes each output's own result. Instead, each output is judged against the others only, not itself. This reduces bias and allows the model to improve more effectively over successive iterations.
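The leave-one-out idea can be sketched in a few lines of Python. This is a generic illustration of a leave-one-out baseline for a single turn, not the team's TRLOO implementation, whose turn-level details are not described here; the function name and the reward values are hypothetical.

```python
def leave_one_out_advantages(rewards):
    """Score each sample against the mean reward of the other samples only."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two samples to leave one out")
    total = sum(rewards)
    # The baseline for sample i excludes its own reward r.
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical rewards for four candidate kernels generated in one turn.
rewards = [1.0, 0.0, -1.0, 2.0]
advantages = leave_one_out_advantages(rewards)
print(advantages)
```

Because a sample's own reward never enters its baseline, an unusually good or bad output cannot pull its own reference point toward itself.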
“It’s a genuinely hard problem to solve, and most of the work is invisible,” says Xie. “What we proved in the infrastructure and validation stage at the hackathon is that the training signal is real: reward variance exists, gradients are non-degenerate, and the model’s behavior measurably changed in just five iterations. That is the foundation. Scaling to 50+ iterations and getting to a positive reward is next.”