Important
We'll refactor the code and share a much simpler API very soon. Sorry for the transition period.
Mert Yuksekgonul*, Daniel Koceja*, Xinhao Li*, Federico Bianchi*
Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou†, Carlos Guestrin†, Yu Sun*
Stanford · NVIDIA · Astera Institute · UC San Diego · Together AI
TTT-Discover performs reinforcement learning at test time, allowing the LLM to continue training with experience specific to the problem at hand. We achieve new state-of-the-art results across mathematics, GPU kernels, algorithms, and biology.
| | Mathematics Erdős Overlap ↓ | Kernel A100 TriMul ↓ | Kernel H100 TriMul ↓ | Algorithms AtCoder ↑ | Biology Denoising ↑ |
|---|---|---|---|---|---|
| Best Human | 0.380927 | 4531 μs | 1371 μs | 566,997 | 0.64 |
| Prev. Best AI | 0.380924 | — | — | 558,026 | — |
| TTT-Discover | 0.380876 | 2198 μs | 1161 μs | 567,062 | 0.71 |
```bash
pip install -r requirements/requirements-math.txt
```

Set environment variables:

```bash
export TINKER_API_KEY="..."
export WANDB_API_KEY="..."
export WANDB_ENTITY="..."
```

Task-specific requirements:

- GPU kernels: `requirements/requirements-gpumode.txt`
- AtCoder: `requirements/requirements-ale.txt`
- Denoising: `requirements/denoising/requirements-denoising.txt` (see README)
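Before launching a run, it can help to confirm that the three environment variables listed above are actually set. The following is a minimal sketch, not part of the repository; the helper name `missing_env_vars` is hypothetical, and only the variable names come from the setup instructions.

```python
import os

# Variable names taken from the setup instructions above.
REQUIRED_VARS = ("TINKER_API_KEY", "WANDB_API_KEY", "WANDB_ENTITY")


def missing_env_vars(environ=None):
    """Return the required variables that are unset or empty.

    `environ` defaults to the process environment; passing a plain dict
    makes the check easy to test. (Hypothetical helper, not repo code.)
    """
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Calling `missing_env_vars()` in a shell where the exports above have been run should return an empty list; any names it returns still need to be exported.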
Requires SLURM. Launch AC1 (autocorrelation inequality) on 4 nodes:
```bash
python main_tinker_submitit.py \
    --nodes 4 \
    --partition default \
    --cpus-per-task 100 \
    env=ac1 \
    model_name="openai/gpt-oss-120b" \
    sampler_type=puct_backprop \
    initial_exp_type=random \
    num_epochs=50 \
    wandb_project="my-project" \
    wandb_name="ac1-run-1"
```

Or use a preconfigured script:

```bash
bash scripts/tinker/ac1.sh
```

See docs/launching.md for all parameters and docs/intro.md for adding new tasks.
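If you want several runs that differ only in `wandb_name`, the command above can be assembled programmatically. This is a rough sketch under stated assumptions: `build_launch_command` is a hypothetical helper, it uses only the flags and overrides shown in the AC1 example, and it prints the commands rather than submitting them. Whether your cluster can host several such jobs at once is not something this README states.

```python
import shlex


def build_launch_command(overrides, nodes=4, partition="default", cpus_per_task=100):
    """Assemble the argument list for main_tinker_submitit.py.

    Hypothetical helper: the flags mirror the AC1 example above and
    nothing else is assumed about the launcher.
    """
    cmd = [
        "python", "main_tinker_submitit.py",
        "--nodes", str(nodes),
        "--partition", partition,
        "--cpus-per-task", str(cpus_per_task),
    ]
    # key=value overrides follow the SLURM flags, as in the example above.
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


# Overrides copied from the AC1 launch example.
base = {
    "env": "ac1",
    "model_name": "openai/gpt-oss-120b",
    "sampler_type": "puct_backprop",
    "initial_exp_type": "random",
    "num_epochs": 50,
    "wandb_project": "my-project",
}

# Three runs that differ only in their Weights & Biases run name.
commands = [
    build_launch_command({**base, "wandb_name": f"ac1-run-{i}"})
    for i in (1, 2, 3)
]

for cmd in commands:
    # Print a shell-safe command line; nothing is submitted here.
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the list form can be passed to `subprocess.run` once you are happy with the arguments.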
Mathematics — Classic open problems in combinatorics and analysis
| Task | Erdős Min. Overlap ↓ | Autocorr. (AC1) ↓ | Autocorr. (AC2) ↑ |
|---|---|---|---|
| Best Human | 0.380927 | 1.50973 | 0.9015 |
| Prev. Best AI | 0.380924 | 1.50314 | 0.9610 |
| TTT-Discover | 0.380876 | 1.50287 | 0.9591 |
Kernel Engineering — GPUMode TriMul competition for triangular matrix multiplication
| Task | A100 ↓ | H100 ↓ | B200 ↓ | MI300x ↓ |
|---|---|---|---|---|
| Best Human | 4531 μs | 1371 μs | 1005 μs | 2462 μs |
| TTT-Discover | 2198 μs | 1161 μs | 905 μs | 1596 μs |
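To read the kernel table as speedups rather than raw runtimes, divide the best human runtime by the TTT-Discover runtime for each GPU. The snippet below is a small worked calculation using only the numbers in the table above.

```python
# Runtimes in microseconds, copied from the TriMul table above.
best_human = {"A100": 4531, "H100": 1371, "B200": 1005, "MI300x": 2462}
ttt_discover = {"A100": 2198, "H100": 1161, "B200": 905, "MI300x": 1596}

# Lower runtime is better, so speedup = human runtime / TTT-Discover runtime.
speedups = {gpu: best_human[gpu] / ttt_discover[gpu] for gpu in best_human}

for gpu, ratio in speedups.items():
    print(f"{gpu}: {ratio:.2f}x faster than the best human kernel")
```

This works out to roughly 2.06x on A100, 1.18x on H100, 1.11x on B200, and 1.54x on MI300x.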
Algorithm Engineering — AtCoder Heuristic Contests on real-world optimization [AHC39] [AHC58]
| Task | AHC39 (Geometry) ↑ | AHC58 (Scheduling) ↑ |
|---|---|---|
| Best Human | 566,997 | 847,674,723 |
| Prev. Best AI | 558,026 | 848,373,282 |
| TTT-Discover | 567,062 | 848,414,228 |
Biology — Single-cell RNA-seq denoising on OpenProblems benchmark
| Task | PBMC ↑ | Tabula ↑ |
|---|---|---|
| Best Human | 0.64 | 0.64 |
| TTT-Discover | 0.71 | 0.73 |
This work builds on several outstanding projects and communities:
- GPU Mode — Community for GPU kernel optimization and the TriMul competition
- ALE-Bench — AtCoder-based benchmark for LLM evaluation
- AlphaEvolve — DeepMind's evolutionary coding agent
- OpenEvolve — Open-source implementation of AlphaEvolve
- Tinker — LLM training recipes and RL framework
```bibtex
@article{ttt-discover2026,
  title   = {Learning to Discover at Test Time},
  author  = {Yuksekgonul, Mert and Koceja, Daniel and Li, Xinhao
             and Bianchi, Federico and McCaleb, Jed and Wang, Xiaolong
             and Kautz, Jan and Choi, Yejin and Zou, James
             and Guestrin, Carlos and Sun, Yu},
  journal = {arXiv preprint arXiv:2601.16175},
  year    = {2026}
}
```

This project is licensed under the MIT License; see the LICENSE file for details.
