AICronis
Latent Space · 1h 17m

Extreme Harness Engineering for the 1B token/day Dark Factory — Ryan Lopopolo, OpenAI Frontier

TL;DR

  • Ryan Lopopolo built a 1M+ line internal app with zero hand-written code — over roughly five months, his OpenAI Frontier team used Codex end-to-end, saying the setup became about 10x faster than manual engineering once they paid the upfront cost of reshaping the repo for agents.

  • The bottleneck wasn’t model capability — it was human attention — Ryan says the scarce resource is “the synchronous human attention of my team,” which pushed them to automate review, observability, CI handling, and even postmortem feedback loops instead of babysitting 1,500 PRs.

  • Harness engineering means encoding taste as text the agent can actually see — docs, lints, review agents, markdown trackers, and error messages became the place to store non-functional requirements like reliability, modularity, and timeout rules, because any requirement the agent can't see effectively doesn't exist.

  • They optimized the codebase for agent legibility, not human preference — when Codex got background shells in GPT-5.3 and stopped patiently blocking on long builds, the team reworked their build system from Make to Bazel to Turbo to NX until builds landed under one minute, treating slow builds as a signal to decompose further.

  • Symphony emerged when 5–10 PRs per engineer per day became too much context switching — Ryan’s team built an Elixir-based orchestration layer that trashes bad worktrees, reruns tasks from scratch, manages rework states, and turns agents into something closer to asynchronous teammates than IDE copilots.

  • Ryan’s bigger thesis is that coding agents expand upward into general knowledge work — if a user journey can be collapsed into code, he argues the Codex harness can solve it, which is why OpenAI Frontier is framing enterprise AI around deployable, observable, governed agents rather than isolated chatbots.

The Breakdown

The no-handwritten-code bet

Ryan opens with the constraint that defined everything: he wasn’t allowed to write the code himself. Working inside OpenAI Frontier product exploration, he wanted to prove enterprise agents should be able to do what he does, so he forced himself to make Codex do the whole job. The result was an internal greenfield repo that grew past 1 million lines, with Ryan calling the models and harnesses “isomorphic to me in capability.”

The painful month that paid for the factory

He’s blunt that the early phase was awful: the first month and a half was “10 times slower” than just coding by hand. But every time the model failed on a big task, the team would double-click into it, build smaller primitives, and create better assembly stations for the agent. That investment compounded across model generations from early Codex mini through GPT-5.1, 5.2, 5.3, and 5.4.

Why a one-minute build became law

One of the sharpest moments is Ryan explaining how Codex behavior changed when background shells arrived in 5.3: suddenly the agent was less willing to sit around waiting on long builds. So the team rebuilt their build system repeatedly — Make to Bazel to Turbo to NX — until the inner loop was under one minute. Slow builds weren’t tolerated; they were treated as a ratchet forcing better decomposition and cleaner build graphs.
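
The "slow builds are a bug" stance can be made mechanical. Here is a minimal sketch (our illustration, not anything from the episode) of enforcing the one-minute inner loop as a hard budget, so an over-budget build fails the same way a broken one does:

```python
import subprocess
import sys
import time

# Hypothetical budget check: the 60-second figure follows the episode's
# one-minute rule; the script itself is our assumption, not their tooling.
BUDGET_SECONDS = 60.0

def build_within_budget(cmd: list[str], budget: float = BUDGET_SECONDS) -> bool:
    """Run the build command; fail if it errors OR exceeds the time budget."""
    start = time.monotonic()
    result = subprocess.run(cmd)
    elapsed = time.monotonic() - start
    if result.returncode != 0:
        print(f"build failed after {elapsed:.1f}s")
        return False
    if elapsed > budget:
        # Over budget is treated as a decomposition bug, not background noise.
        print(f"build took {elapsed:.1f}s, over the {budget:.0f}s budget: split the build graph")
        return False
    print(f"build ok in {elapsed:.1f}s")
    return True
```

Wiring this into CI turns "the build feels slow" into a red check the agent can see and react to.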

Turning invisible engineering taste into text

A huge chunk of the conversation is about making “good engineering” legible to the model. Ryan says docs, lints, markdown trackers, review agents, and test outputs are all just ways to inject non-functional requirements into prompt space — reliability, observability, modularity, timeout rules, all of it. His example is perfect: when a missing timeout triggers a page, he can ask Codex in Slack to both fix the bug and update the reliability docs so the lesson becomes durable team knowledge.
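
The timeout example translates naturally into a custom lint. A minimal sketch of the idea, assuming Python code and the `requests` library as the target (the episode does not specify either): the rule "every network call needs a timeout" becomes a check whose error message points the agent back at the docs, putting the requirement directly into its prompt space.

```python
import ast

# Hypothetical "taste as text" lint: flag requests.get/post/etc. calls that
# omit timeout=. REQUEST_FUNCS and the docs path are our stand-ins.
REQUEST_FUNCS = {"get", "post", "put", "delete"}

def missing_timeouts(source: str) -> list[int]:
    """Return line numbers of requests.* calls without a timeout= keyword."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and func.value.id == "requests"
                and func.attr in REQUEST_FUNCS
                and not any(kw.arg == "timeout" for kw in node.keywords)):
            hits.append(node.lineno)
    return hits

def report(source: str) -> list[str]:
    # The message itself is documentation the agent reads and acts on.
    return [f"line {n}: requests call missing timeout= (see docs/reliability.md)"
            for n in missing_timeouts(source)]
```

The lint output, not a human reviewer, is what carries the non-functional requirement into every future run.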

Human review fades, systems thinking takes over

The wild part: they’ve mostly moved beyond humans reviewing code before merge. Ryan says review is increasingly post-merge, because the real job is not checking every diff but asking where the agent keeps making mistakes and how to eliminate that category forever. He compares himself less to an engineer in the weeds and more to someone tech-leading a 500-person org, inferring from samples where the system needs better architecture or sharper primitives.

Symphony: the orchestrator born from context-switch exhaustion

By late December the team was at 3.5 PRs per engineer per day; after GPT-5.2, that jumped to 5–10, and Ryan says the constant tmux hopping was exhausting. Symphony was their answer: an Elixir orchestration system that supervises task daemons, moves tickets through states, and if a PR is garbage, nukes the worktree and starts over from scratch. The energy here is very “remove myself from the loop,” with Ryan joking that the dream state is opening Linear twice a day and saying yes or no.

Specs, ghost libraries, and the end of fixed software packages

Later, the conversation turns to how they distribute these ideas. Ryan describes generating executable specs from their proprietary repo, then using Codex-in-tmux loops to implement, review, compare, and iteratively tighten the spec until it can reproduce the system with high fidelity — what people on Twitter started calling “ghost libraries.” He also agrees with Bret Taylor’s take that many dependencies are becoming optional: if a library is only a couple thousand lines and tokens are cheap, in-house it, strip it to what you need, and let security agents patch it directly.
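
The implement–review–compare loop can be sketched abstractly. Every function name below (`implement`, `compare`, the `MUST:` convention) is a stand-in we invented, not a real Codex API; the point is only the shape of the iteration: observed gaps get folded back into the spec until a fresh implementation converges on the reference system.

```python
from typing import Callable

def tighten_spec(
    spec: str,
    implement: Callable[[str], str],        # agent builds from the spec
    compare: Callable[[str], list[str]],    # returns gaps vs. the reference repo
    max_rounds: int = 5,
) -> str:
    """Iteratively fold observed gaps back into the spec as explicit requirements."""
    for _ in range(max_rounds):
        candidate = implement(spec)
        gaps = compare(candidate)
        if not gaps:
            return spec  # spec now reproduces the system with high fidelity
        spec += "\n" + "\n".join(f"- MUST: {gap}" for gap in gaps)
    return spec
```

The spec, not the implementation, is the durable artifact — which is what makes the resulting “ghost library” distributable without shipping the proprietary code.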

Frontier as the enterprise layer for governed agents

The ending zooms out from one team’s workflow to OpenAI’s broader enterprise platform. Ryan frames Frontier as the infrastructure for deploying observable, safe, controllable agents into real companies — integrated with IM stacks, security tooling, governance controls, and company-specific safety specs. His closing thesis lands hard: the future isn’t just better coding copilots, it’s agents that inherit a team’s workflows, context, even meme culture, and become actual coworkers that can “just do things.”