AI can already run experiments and hill-climb a local benchmark. The next step in AI-driven research is harder: producing work that survives human review, reproduces, and earns enough trust from other researchers that they'll build on it.
Our autonomous research agent, Aiden, spent 22 days inside OpenAI's Parameter Golf, a 44-day ML optimization competition also used as a hiring challenge. The task was simple to state and hard to execute: train the best language model that fits inside a 16 MB artifact and runs under tight compute constraints.
By the end, Aiden had become the most influential contributor in the competition: it produced more leaderboard records than any individual human researcher, and its PRs were cited more than anyone else's.
Some specifics from the runs:
- 7 merged leaderboard records (next-best individual human: 3)
- 435 citations of Aiden's PRs from other contributors; 10 of Aiden's PRs each cited by 10+ others (the most in the competition)
- ~28% of submissions accepted, about 6x the community average
- ~$1,411 in GPU compute per accepted record, about 3.7x more efficient than the community average
The challenge: Parameter Golf
OpenAI ran Parameter Golf from March 18 to April 30, 2026. The task was to train the best language model that fit inside a 16-megabyte artifact, in under ten minutes on eight H100 GPUs. Submissions were scored on bits-per-byte against the FineWeb validation set, a tokenizer-agnostic measure of how well a model predicts held-out text.
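To make the metric concrete, here is a minimal sketch of how bits-per-byte is commonly computed from a model's summed cross-entropy loss. The competition's actual scoring script is not reproduced here; the function name `bits_per_byte` and the per-token losses below are made up for illustration. Dividing by the UTF-8 byte count rather than the token count is what makes the measure comparable across tokenizers.

```python
import math


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per byte.

    Hypothetical helper for illustration, not the competition's scorer.
    """
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    # Divide by ln(2) to go from nats to bits, then normalize by raw bytes.
    return total_nll_nats / (math.log(2) * n_bytes)


# Made-up example: one short text and invented per-token losses in nats.
text = "The cat sat on the mat."
n_bytes = len(text.encode("utf-8"))
per_token_nll = [2.1, 3.4, 2.9, 1.2, 0.8, 2.5, 0.6]

bpb = bits_per_byte(sum(per_token_nll), n_bytes)
print(f"{bpb:.3f} bits per byte over {n_bytes} bytes")
```

A tokenizer that splits the same text into more or fewer tokens changes the per-token losses but not the byte count in the denominator, so two models with different vocabularies can still be ranked on the same scale.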
OpenAI put up $1 million in compute credits and reviewed every submission. Some statistics:
- 1,016 participants entered
- 2,048 pull requests filed
- 47 merged into the leaderboard, from 31 contributors who set at least one record
- $249,550 in RunPod compute credits burned across the competition
47 out of 2,048 is a 2.3% merge rate. Most PRs were rejected because the improvement didn't hold across random seeds, the code violated competition rules, or the result fell within noise of the existing record. Getting accepted to the leaderboard required a real, reproducible improvement that was verified by OpenAI's reviewers.
Aiden: an Agent that Publishes its own Research
Aiden is a self-improving multi-agent system, built on the Claude and GPT model families, that runs indefinitely to hill-climb on a single metric. It continuously improves itself by updating its own prompts and workflows and building new tools to work more efficiently.
Compared to our last-generation system, AIDE, we saw a significant jump in capability. But we think performance is not the most exciting part. Aiden is the first AI research system that can actively collaborate with a human research community by continuously publishing its own work.
Most AI research systems focus on the private loop: propose ideas, run experiments, inspect results, repeat. Aiden went further. It turned experiments into public PRs that other researchers could inspect, review, cite, fork, fix, and extend.
This matters because research is not finished when an experiment works locally. It only becomes useful to a community when the result is packaged clearly, checked against the rules, and published in a form others can build on.
In Parameter Golf, Aiden automated that full path: it found ideas, tested them, selected the useful ones, audited them for legality, wrote the implementation, and submitted high-quality PRs into the public competition.
The part that surprised us was what happened after publishing. The human competitors trusted Aiden's PRs enough to keep building on its work. Researchers cited Aiden's work in their own submissions, forked its code as a starting point for new ideas, and treated its contributions the same way they treated work from any other strong contributor.
The Most-Cited Contributor
Aiden continuously drew on the community's work. The loop also ran in the other direction. Other researchers cited, forked, fixed, and extended Aiden's PRs.
This influence was measurable. Parameter Golf PRs often named the prior PRs they built on, much like papers cite prior work. We can therefore compute a PR-level h-index: an author has h-index h if h of their PRs were each cited by at least h later PRs.
By this measure, Aiden had the highest all-PR h-index in the competition: 10. The next highest was 7. This meant Aiden did not just produce accepted records; it produced reusable work that other participants repeatedly built on.
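The PR-level h-index defined above can be sketched in a few lines. This is an illustration on toy data, not the script used for the numbers in this post: the function names, the input shapes, and the use of PR numbers to decide what counts as a "later" PR are all assumptions. Self-citations are not filtered out here.

```python
from collections import defaultdict


def h_index(citation_counts):
    """Largest h such that h items each have at least h citations."""
    h = 0
    for rank, count in enumerate(sorted(citation_counts, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h


def h_index_by_author(pr_author, citations):
    """pr_author maps PR number -> author; citations is (citing_pr, cited_pr) pairs.

    Assumption: a higher PR number means a later PR.
    """
    cited_by = defaultdict(set)
    for citing, cited in citations:
        if cited in pr_author and citing > cited:
            cited_by[cited].add(citing)  # count each citing PR once

    per_author = defaultdict(list)
    for pr, author in pr_author.items():
        per_author[author].append(len(cited_by[pr]))
    return {author: h_index(counts) for author, counts in per_author.items()}


# Toy data: three authors, six PRs.
pr_author = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "c"}
citations = [(4, 1), (5, 1), (6, 1), (4, 2), (5, 2), (6, 3), (6, 4)]
print(h_index_by_author(pr_author, citations))  # {'a': 2, 'b': 1, 'c': 0}
```

Author "a" has PRs cited 3, 2, and 1 times, so two of them are cited at least twice and the h-index is 2.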
Higher Acceptance Rate, Less Compute
Aiden's contributions came from a combination of three things: high experimental throughput, efficient use of compute, and strong quality control before anything became public.
The obvious advantage of an agent is throughput. Across 22 days, Aiden generated roughly 1,243 distinct experiment configurations and executed more than 1,300 local runs. Most of those experiments never left the machine, but they helped the system learn which directions were promising and which constraints were actually binding.
4% of the compute, 15% of the records
The obvious objection is that an agent with enough compute can brute-force results. We didn't have that to fall back on. Aiden ran on a single GPU node.
Throughput was the first multiplier. Aiden kept the experiment loop running continuously across 22 days.
But the result wasn't just a function of running more experiments. Aiden made better use of compute too. Per unit of visible compute, it produced substantially more accepted leaderboard records than the community average.
Aiden used 3,304 GPU-hours over 22 days. At RunPod's public H100 SXM rate of $2.99 per GPU-hour, that's roughly $9,879 in compute proxy. RunPod reported $249,550 in credits burned across the challenge, a conservative denominator, since it excludes off-platform GPU use by other participants (and us). Even against that, Aiden used under 4% of visible compute while producing 7 of 47 official leaderboard records, or about 15%.
| Source | Compute proxy | Official leaderboard records | Cost proxy per record |
|---|---|---|---|
| Aiden GPU-hours | ~$9,879 | 7 | ~$1,411 |
| RunPod-reported community credits | ~$249,550 | 47 | ~$5,310 |
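The arithmetic behind these figures is simple enough to reproduce. The sketch below uses only the numbers quoted in this section; the variable names are ours, and the dollar amounts are proxies based on RunPod's public rate, not invoices.

```python
# Inputs quoted in the text above.
aiden_gpu_hours = 3_304
h100_rate_usd = 2.99            # RunPod public H100 SXM rate per GPU-hour
community_credits_usd = 249_550  # RunPod-reported credits burned
aiden_records = 7
total_records = 47

# Derived quantities.
aiden_cost_usd = aiden_gpu_hours * h100_rate_usd
aiden_cost_per_record = aiden_cost_usd / aiden_records
community_cost_per_record = community_credits_usd / total_records
compute_share = aiden_cost_usd / community_credits_usd
record_share = aiden_records / total_records
efficiency_ratio = community_cost_per_record / aiden_cost_per_record

print(f"Aiden compute proxy:        ${aiden_cost_usd:,.0f}")
print(f"Aiden cost per record:      ${aiden_cost_per_record:,.0f}")
print(f"Community cost per record:  ${community_cost_per_record:,.0f}")
print(f"Share of visible compute:   {compute_share:.1%}")
print(f"Share of records:           {record_share:.1%}")
print(f"Cost-per-record ratio:      {efficiency_ratio:.2f}x")
```

Because the community denominator excludes off-platform GPU use, the compute share computed this way is an upper bound on Aiden's true share.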
Signal, Not Noise
The other obvious objection was that Aiden spammed the queue. The concern is fair: an autonomous research system can generate a lot of experiments, but if it pushes too many weak ones into the public stream, it only lowers the average signal quality and creates more review burden.
In Parameter Golf, the public PR stream was the community's shared information channel. Every PR told other participants something: which ideas were worth trying, which constraints were binding, and which directions had already failed.
What we saw was the opposite. Aiden's public submissions reached the leaderboard at roughly 6x the community average. It increased the density of useful signal in the public stream.
| Cohort | Leaderboard records | Filed | Adjudicated | Leaderboard / Filed | Leaderboard / Adjudicated |
|---|---|---|---|---|---|
| dexhunter (Aiden) | 7 | 25 | 14 | 28.0% | 50.0% |
| Whole community | 47 | 1002 | 385 | 4.7% | 12.2% |
This is the result we cared about: not just more AI-generated work, but higher-signal public work. Most failed ideas stayed local. The PRs that did become public were more likely to survive review, enter the leaderboard, and become useful to other participants.
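The acceptance rates in the table can be recomputed from the raw counts. The sketch below is illustrative; the `Cohort` class and its property names are ours, and the counts are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cohort:
    name: str
    records: int      # PRs that reached the leaderboard
    filed: int        # PRs filed
    adjudicated: int  # PRs that received a decision

    @property
    def per_filed(self) -> float:
        return self.records / self.filed

    @property
    def per_adjudicated(self) -> float:
        return self.records / self.adjudicated


aiden = Cohort("dexhunter (Aiden)", records=7, filed=25, adjudicated=14)
community = Cohort("Whole community", records=47, filed=1002, adjudicated=385)

for cohort in (aiden, community):
    print(f"{cohort.name}: {cohort.per_filed:.1%} of filed, "
          f"{cohort.per_adjudicated:.1%} of adjudicated")

# How many times more likely an Aiden PR was to reach the leaderboard.
lift_filed = aiden.per_filed / community.per_filed
lift_adjudicated = aiden.per_adjudicated / community.per_adjudicated
print(f"Lift per filed PR: {lift_filed:.1f}x, per adjudicated PR: {lift_adjudicated:.1f}x")
```

The per-filed lift is the "roughly 6x" figure; measured against adjudicated PRs only, the gap is smaller but still several-fold.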
Limitation & Conclusion
With the right scaffolding around them, autonomous agents have crossed a real threshold in empirical research. As Aiden demonstrated, AI can now operate within the public frontier with strong compute leverage (about 15% of the records from under 4% of the visible compute) and a high signal-to-noise ratio.
However, we also found that in this highly competitive environment, human researchers still produced the creative primitives. The agent's role is complementary: it takes those community primitives, polishes or recombines them, and feeds work back into the public stream.
The specific loop, where the community generates entropy and the AI executes constraint-driven engineering, is how real-world R&D actually accelerates today.
This is the exact thesis behind Weco AI (We Collaborate with AI): to accelerate technology breakthroughs by making this human-AI collaborative loop the production-grade default for engineering and research.
Aiden is still in the research prototype stage. We're opening early access to Aiden for a small group of design partners working on model quality scores. If you're interested, please fill out the form.
Early Access Form
Appendix
Case study: Asynchronous Human-AI Collaboration
What we found really interesting is how ad hoc the collaboration between AI and human researchers can be. On the human side, contributors had no onboarding and no idea what was on the other end.
For example, in the PR #1626 -> PR #1736 lineage, Aiden plateaued for 5 days. Then a human contributor, @romeerp, shipped a clever lossless tokenizer on top of Aiden's base (its last record PR). Aiden fused it with components it had built during the plateau and shipped the biggest jump in weeks.
After shipping PR #1626 on April 14, Aiden hit a saturation period. It explored in several directions: tokenization, attention, architecture. The most consequential was that Aiden read Qwen's 2025 paper on gated attention (arXiv:2505.06708) and ported the Qwen-style mechanism into its own stack. The new attention layer introduced additional parameters, pushing the model artifact over the competition's 16 MB hard cap. To address that, Aiden built a small quantization mechanism specifically for the gate weights, called QuantGate, that compressed them enough to fit the stack back under the cap. The combination finally fit all of the constraints. But the resulting BPB was indistinguishable from PR #1626 within seed noise. After five days of effort, Aiden had a richer stack that scored the same.
Then, at 02:12 UTC on April 19, a community contributor (@romeerp) filed PR #1729: a new tokenizer called CaseOps, built on top of Aiden's PR #1626 base.
The easiest way to see the idea is with a toy example. A lossy casefold transform might turn The CAT into the cat. That helps the model, because it no longer has to separately learn The, the, CAT, and cat, but it destroys information: from the cat, there is no way to recover the original capitalization. The competition's rules disallow this kind of irreversible transformation.
CaseOps keeps much of the same modeling benefit without dropping the information. It rewrites the text into a mostly lowercased stream while carrying the original casing through explicit operator tokens:
Original: The CAT
CaseOps: <TITLE>the <ALLCAPS>cat
Because the operator tokens preserve the casing exactly, the original text can be reconstructed from the CaseOps version, which makes the transformation legal.
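The reversibility argument can be checked with a toy implementation. This is a minimal sketch of the idea, not the CaseOps code from PR #1729: the word-level splitting, the function names, and the rule of leaving mixed-case words untouched are our assumptions. It also assumes the literal strings `<TITLE>` and `<ALLCAPS>` never occur in the input; a real tokenizer would use reserved token IDs instead.

```python
import re

TITLE, ALLCAPS = "<TITLE>", "<ALLCAPS>"


def encode(text: str) -> str:
    """Lowercase words where possible, recording the casing as an operator prefix."""
    out = []
    for piece in re.split(r"(\s+)", text):  # capture group keeps the whitespace
        lower = piece.lower()
        if piece == lower:
            out.append(piece)              # already lowercase, or whitespace
        elif lower.capitalize() == piece:
            out.append(TITLE + lower)      # "The" -> "<TITLE>the"
        elif lower.upper() == piece:
            out.append(ALLCAPS + lower)    # "CAT" -> "<ALLCAPS>cat"
        else:
            out.append(piece)              # mixed case ("iPhone") kept verbatim
    return "".join(out)


def decode(encoded: str) -> str:
    """Invert encode() by replaying each operator prefix."""
    out = []
    for piece in re.split(r"(\s+)", encoded):
        if piece.startswith(TITLE):
            out.append(piece[len(TITLE):].capitalize())
        elif piece.startswith(ALLCAPS):
            out.append(piece[len(ALLCAPS):].upper())
        else:
            out.append(piece)
    return "".join(out)


print(encode("The CAT"))          # <TITLE>the <ALLCAPS>cat
print(decode(encode("The CAT")))  # The CAT
```

An operator is only emitted when replaying it reproduces the original word exactly, which is what keeps the transform lossless in this sketch. The lossy casefold, by contrast, has no `decode` at all.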
Aiden's local exploration during saturation had included a lossy casefold of its own. It had spent a lot of time understanding exactly why the trick performed so well and why it was ruled illegal under the competition's byte-accounting rules. So when CaseOps arrived with a clever fix for the obstacle, Aiden recognized what it was within minutes.
Within eight hours of PR #1729's filing, Aiden had audited CaseOps for legality and ported it into its own stack alongside the locally built gated attention and QuantGate. The unexpected part was that the gated-attention and QuantGate components, which had been noise-level on the pre-CaseOps base, now produced a measurable improvement when combined with CaseOps. After recognizing that, Aiden shipped PR #1736 at val_bpb 1.06549, a 0.0023 BPB improvement over PR #1729 itself.
The point of this case study isn't that the breakthrough was lucky. It's that Aiden was positioned to catch it when it arrived. Part of the story is unpredictable and cannot be engineered: there is no way to program when a community catalyst like CaseOps will surface, and no guarantee that combining it with locally built components will yield a measurable gain. An autonomous agent cannot force a breakthrough or schedule the moment it pays off.
That said, Aiden's experimentation during the plateau (and the prior shipping of PR #1626) built the conditions for the breakthrough. Because Aiden was both a contributor to the community and a consumer of it, CaseOps was built on Aiden's own base, and Aiden had already built the components that made CaseOps land. That shared technical lineage between the community catalyst and Aiden's local stack is what eventually led to the breakthrough.