February 11, 2026

[Engineering](https://openai.com/news/engineering/)

# Harness engineering: leveraging Codex in an agent-first world

By Ryan Lopopolo, Member of the Technical Staff


Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with **0 lines of manually-written code**.

The product has internal daily users and external alpha testers. It ships, deploys, breaks, and gets fixed. What’s different is that every line of code—application logic, tests, CI configuration, documentation, observability, and internal tooling—has been written by Codex. We estimate that we built this in about 1/10th the time it would have taken to write the code by hand.

**Humans steer. Agents execute.**

We intentionally chose this constraint so we would build what was necessary to increase engineering velocity by orders of magnitude. We had weeks to ship what ended up being a million lines of code. To do that, we needed to understand what changes when a software engineering team’s primary job is no longer to write code, but to design environments, specify intent, and build feedback loops that allow Codex agents to do reliable work.

This post is about what we learned by building a brand new product with a team of agents—what broke, what compounded, and how to maximize our one truly scarce resource: human time and attention.

## We started with an empty git repository

The first commit to an empty repository landed in late August 2025.

The initial scaffold—repository structure, CI configuration, formatting rules, package manager setup, and application framework—was generated by Codex CLI using GPT‑5, guided by a small set of existing templates. Even the initial AGENTS.md file that directs agents how to work in the repository was itself written by Codex.

There was no pre-existing human-written code to anchor the system. From the beginning, the repository was shaped by the agent.

Five months later, the repository contains on the order of a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, roughly 1,500 pull requests have been opened and merged with a small team of just three engineers driving Codex. This translates to an average throughput of 3.5 PRs per engineer per day, and surprisingly the throughput has _increased_ as the team has grown to now seven engineers. Importantly, this wasn’t output for output’s sake: the product has been used by hundreds of users internally, including daily internal power users.

Throughout the development process, humans never directly contributed any code. This became a core philosophy for the team: **no manually-written code**.

## Redefining the role of the engineer

The lack of hands-on human coding **introduced a different kind of engineering work, focused on systems, scaffolding, and leverage**.

Early progress was slower than we expected, not because Codex was incapable, but because the environment was underspecified. The agent lacked the tools, abstractions, and internal structure required to make progress toward high-level goals. The primary job of our engineering team became enabling the agents to do useful work.

In practice, this meant working depth-first: breaking down larger goals into smaller building blocks (design, code, review, test, etc.), prompting the agent to construct those blocks, and using them to unlock more complex tasks. When something failed, the fix was almost never “try harder.” Because the only way to make progress was to get Codex to do the work, human engineers always stepped into the task and asked: “what capability is missing, and how do we make it both legible and enforceable for the agent?”

Humans interact with the system almost entirely through prompts: an engineer describes a task, runs the agent, and allows it to open a pull request. To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any human- or agent-given feedback, and iterate in a loop until all agent reviewers are satisfied (effectively this is a [Ralph Wiggum Loop⁠(opens in a new window)](https://ghuntley.com/loop/)). Codex uses our standard development tools directly (gh, local scripts, and repository-embedded skills) to gather context without humans copying and pasting into the CLI.

Humans may review pull requests, but aren’t required to. Over time, we’ve pushed almost all review effort towards being handled agent-to-agent.

## Increasing application legibility

As code throughput increased, our bottleneck became human QA capacity. Because the fixed constraint has been human time and attention, we’ve worked to add more capabilities to the agent by making things like the application UI, logs, and app metrics themselves directly legible to Codex.

For example, we made the app bootable per git worktree, so Codex could launch and drive one instance per change. We also wired the Chrome DevTools Protocol into the agent runtime and created skills for working with DOM snapshots, screenshots, and navigation. This enabled Codex to reproduce bugs, validate fixes, and reason about UI behavior directly.

![Diagram titled “Codex drives the app with Chrome DevTools MCP to validate its work.” Codex selects a target, snapshots the state before and after triggering a UI path, observes runtime events via Chrome DevTools, applies fixes, restarts, and loops re-running validation until the app is clean.](https://images.ctfassets.net/kftzwdyauwt9/1Gu58eNlqDEuITmbqJDmq9/1e2e62f7e15fb16d2da0da5407240564/fig_1__codex_drives_the_app_.png?w=3840&q=90&fm=webp)

We did the same for observability tooling. Logs, metrics, and traces are exposed to Codex via a local observability stack that’s ephemeral for any given worktree. Codex works on a fully isolated version of that app—including its logs and metrics, which get torn down once that task is complete. Agents can query logs with LogQL and metrics with PromQL. With this context available, prompts like “ensure service startup completes in under 800ms” or “no span in these four critical user journeys exceeds two seconds” become tractable.
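Concretely, such prompts bottom out in queries like the following. The label, field, and metric names here are invented for illustration; the post does not describe the team's actual telemetry schema:

```
# LogQL: error lines from this worktree's app instance, parsed as JSON
{service="app", worktree="feature-x"} |= "error" | json

# LogQL: startup log lines whose parsed duration field exceeds 800ms
{service="app"} |= "startup" | json | duration_ms > 800

# PromQL: p95 duration for one user journey, assuming a histogram
# metric named journey_duration_seconds with a `journey` label
histogram_quantile(0.95,
  sum by (le) (rate(journey_duration_seconds_bucket{journey="checkout"}[5m])))
```

Because the observability stack is ephemeral and scoped to one worktree, the agent can run queries like these before and after a change and compare the results directly.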

![Diagram titled “Giving Codex a full observability stack in local dev.” An app sends logs, metrics, and traces to Vector, which fans out data to an observability stack containing Victoria Logs, Metrics, and Traces, each queried via LogQL, PromQL, or TraceQL APIs. Codex uses these signals to query, correlate, and reason, then implements fixes in the codebase, restarts the app, re-runs workloads, tests UI journeys, and repeats in a feedback loop.](https://images.ctfassets.net/kftzwdyauwt9/4Xr18TZ5G4Bh8zIgsTFIVK/f7ae689ddd8c31664e39d809b0973425/OAI_Harness_engineering_Giving_Codex_a_full_observability_stack_desktop-light__1_.svg?w=3840&q=90)

We regularly see single Codex runs work on a single task for upwards of six hours (often while the humans are sleeping).

## We made repository knowledge the system of record

Context management is one of the biggest challenges in making agents effective at large and complex tasks. One of the earliest lessons we learned was simple: **give Codex a map, not a 1,000-page instruction manual.**

We tried the “one big [`AGENTS.md`⁠(opens in a new window)](https://agents.md/)” approach. It failed in predictable ways:

-   **Context is a scarce resource.** A giant instruction file crowds out the task, the code, and the relevant docs—so the agent either misses key constraints or starts optimizing for the wrong ones.
-   **Too much guidance becomes** **_non-guidance_****.** When everything is “important,” nothing is. Agents end up pattern-matching locally instead of navigating intentionally.
-   **It rots instantly.** A monolithic manual turns into a graveyard of stale rules. Agents can’t tell what’s still true, humans stop maintaining it, and the file quietly becomes an attractive nuisance.
-   **It’s hard to verify.** A single blob doesn’t lend itself to mechanical checks (coverage, freshness, ownership, cross-links), so drift is inevitable.

So instead of treating `AGENTS.md` as the encyclopedia, we treat it as **the table of contents**.

The repository’s knowledge base lives in a structured `docs/` directory treated as the system of record. A short `AGENTS.md` (roughly 100 lines) is injected into context and serves primarily as a map, with pointers to deeper sources of truth elsewhere.
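For flavor, a table-of-contents-style `AGENTS.md` might look something like this sketch. The section names and rules are illustrative; the post does not publish the team's actual file:

```markdown
# AGENTS.md — a map, not a manual (illustrative sketch)

## Where to look first
- Architecture and package layering: ARCHITECTURE.md
- Product behavior and acceptance criteria: docs/product-specs/index.md
- Active and completed execution plans: docs/exec-plans/
- Design decisions and core beliefs: docs/design-docs/index.md

## Hard rules (enforced by lint and CI)
- Parse data shapes at domain boundaries; never pass unvalidated input inward.
- Respect the layer order: Types → Config → Repo → Service → Runtime → UI.
- Structured logging only; see RELIABILITY.md.

## Before opening a PR
- Run local checks, then request agent review and iterate until clean.
```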

```
AGENTS.md
ARCHITECTURE.md
docs/
├── design-docs/
│   ├── index.md
│   ├── core-beliefs.md
│   └── ...
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── product-specs/
│   ├── index.md
│   ├── new-user-onboarding.md
│   └── ...
├── references/
│   ├── design-system-reference-llms.txt
│   ├── nixpacks-llms.txt
│   ├── uv-llms.txt
│   └── ...
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md
```

In-repository knowledge store layout.

Design documentation is catalogued and indexed, including verification status and a set of core beliefs that define agent-first operating principles. [Architecture documentation⁠(opens in a new window)](https://matklad.github.io/2021/02/06/ARCHITECTURE.md.html) provides a top-level map of domains and package layering. A quality document grades each product domain and architectural layer, tracking gaps over time.

Plans are treated as first-class artifacts. Ephemeral lightweight plans are used for small changes, while complex work is captured in [execution plans⁠(opens in a new window)](https://cookbook.openai.com/articles/codex_exec_plans) with progress and decision logs that are checked into the repository. Active plans, completed plans, and known technical debt are all versioned and co-located, allowing agents to operate without relying on external context.

This enables **progressive disclosure**: agents start with a small, stable entry point and are taught where to look next, rather than being overwhelmed up front.

We enforce this mechanically. Dedicated linters and CI jobs validate that the knowledge base is up to date, cross-linked, and structured correctly. A recurring “doc-gardening” agent scans for stale or obsolete documentation that does not reflect the real code behavior and opens fix-up pull requests.
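One such mechanical check can be quite small. The sketch below implements a hypothetical rule that every document in a docs directory must be reachable from that directory's `index.md`; the team's real linters are not published, so this only illustrates the shape of the enforcement:

```typescript
import * as fs from "fs";
import * as path from "path";

// Hypothetical doc-gardening check: every markdown file in a docs
// directory must be linked from that directory's index.md, so agents
// can always navigate to it from the entry point.
function findUnlinkedDocs(docsDir: string): string[] {
  const indexPath = path.join(docsDir, "index.md");
  const index = fs.existsSync(indexPath)
    ? fs.readFileSync(indexPath, "utf8")
    : "";
  return fs
    .readdirSync(docsDir)
    .filter((name) => name.endsWith(".md") && name !== "index.md")
    .filter((name) => !index.includes(name)); // not referenced in the index
}
```

A CI job can fail the build when this returns a non-empty list, printing the orphaned file names so the fixing agent knows exactly what to link.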

## Agent legibility is the goal

As the codebase evolved, Codex’s framework for design decisions needed to evolve, too.

Because the repository is entirely agent-generated, it’s optimized first for _Codex’s legibility_. In the same way teams aim to improve navigability of their code for new engineering hires, our human engineers’ goal was making it possible for an agent to reason about the full business domain **directly from the repository itself.**

From the agent’s point of view, anything it can’t access in-context while running effectively doesn’t exist. Knowledge that lives in Google Docs, chat threads, or people’s heads is not accessible to the system. Repository-local, versioned artifacts (e.g., code, markdown, schemas, executable plans) are all it can see.

![Diagram titled “The limits of agent knowledge: What Codex can’t see doesn’t exist.” Codex’s knowledge is shown as a bounded bubble. Below it are examples of unseen knowledge—Google Docs, Slack messages, and tacit human knowledge. Arrows indicate that to make this information visible to Codex, it must be encoded into the codebase as markdown.](https://images.ctfassets.net/kftzwdyauwt9/7uWHsJIC6o3uQPsnQ2Avz9/8be3e321892054bd215afb2b250a176a/OAI_Harness_engineering_The_limits_of_agent_knowledge_desktop-light.png?w=3840&q=90&fm=webp)

We learned that we needed to push more and more context into the repo over time. That Slack discussion that aligned the team on an architectural pattern? If it isn’t discoverable to the agent, it’s illegible in the same way it would be unknown to a new hire joining three months later.

Giving Codex more context means organizing and exposing the right information so the agent can reason over it, rather than overwhelming it with ad-hoc instructions. In the same way you would onboard a new teammate on product principles, engineering norms, and team culture (emoji preferences included), giving the agent this information leads to better-aligned output.

This framing clarified many tradeoffs. We favored dependencies and abstractions that could be fully internalized and reasoned about in-repo. Technologies often described as “boring” tend to be easier for agents to model due to composability, API stability, and representation in the training set. In some cases, it was cheaper to have the agent reimplement subsets of functionality than to work around opaque upstream behavior from public libraries. For example, rather than pulling in a generic `p-limit`-style package, we implemented our own map-with-concurrency helper: it’s tightly integrated with our OpenTelemetry instrumentation, has 100% test coverage, and behaves exactly the way our runtime expects.
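A minimal version of such a helper looks something like the sketch below. The OpenTelemetry span wiring the team describes is omitted, and the names are illustrative:

```typescript
// Map-with-concurrency sketch: run `fn` over `items` with at most
// `limit` calls in flight, preserving input order in the results.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (limit < 1) throw new RangeError("limit must be >= 1");
  const results = new Array<R>(items.length);
  let next = 0; // next unclaimed index; claims happen before any await, so no races

  // Each worker repeatedly claims the next index until items run out.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Any single failure rejects the whole call, as with `Promise.all`; a production helper would likely add tracing spans around each `fn` call and a policy for error aggregation or cancellation.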

Pulling more of the system into a form the agent can inspect, validate, and modify directly increases leverage—not just for Codex, but for other agents (e.g. [Aardvark](https://openai.com/index/introducing-aardvark/)) that are working on the codebase as well.

## Enforcing architecture and taste

Documentation alone doesn’t keep a fully agent-generated codebase coherent. **By enforcing invariants, not micromanaging implementations, we let agents ship fast without undermining the foundation.** For example, we require Codex to [parse data shapes at the boundary⁠(opens in a new window)](https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/), but are not prescriptive on how that happens (the model seems to like Zod, but we didn’t specify that specific library).
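“Parse at the boundary” means untrusted input is converted into a typed value exactly once, at the edge, and the rest of the code only ever sees the parsed type. A dependency-free sketch, with invented field names (a Zod schema would play the same role):

```typescript
// "Parse, don't validate": turn unknown input into a typed value
// exactly once, or fail loudly with an actionable message.
interface AppSettings {
  theme: "light" | "dark";
  maxRetries: number;
}

function parseAppSettings(input: unknown): AppSettings {
  if (typeof input !== "object" || input === null) {
    throw new TypeError("settings: expected an object");
  }
  const o = input as Record<string, unknown>;
  if (o.theme !== "light" && o.theme !== "dark") {
    throw new TypeError('settings.theme: expected "light" | "dark"');
  }
  if (
    typeof o.maxRetries !== "number" ||
    !Number.isInteger(o.maxRetries) ||
    o.maxRetries < 0
  ) {
    throw new TypeError("settings.maxRetries: expected a non-negative integer");
  }
  // Past this point, the rest of the app only ever sees AppSettings.
  return { theme: o.theme, maxRetries: o.maxRetries };
}
```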

Agents are most effective in environments with [strict boundaries and predictable structure⁠(opens in a new window)](https://bits.logic.inc/p/ai-is-forcing-us-to-write-good-code), so we built the application around a rigid architectural model. Each business domain is divided into a fixed set of layers, with strictly validated dependency directions and a limited set of permissible edges. These constraints are enforced mechanically via custom linters (Codex-generated, of course!) and structural tests.

The diagram below shows the rule: within each business domain (e.g. App Settings), code can only depend “forward” through a fixed set of layers (Types → Config → Repo → Service → Runtime → UI). Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. Anything else is disallowed and enforced mechanically.

![Diagram titled “Layered domain architecture with explicit cross-cutting boundaries.” Inside the business logic domain are modules: Types → Config → Repo, and Providers → Service → Runtime → UI, with App Wiring + UI at the bottom. A Utils module sits outside the boundary and feeds into Providers.](https://images.ctfassets.net/kftzwdyauwt9/4Rlip1H3T9apPlSmWs7Wr8/7708c176bfbe11951e06ad8e2b83bf01/OAI_Harness_engineering_Layered_domain_architecture_with_explicit_cross-cutting_boundries_desktop-light.png?w=3840&q=90&fm=webp)

This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allow speed without decay or architectural drift.

In practice, we enforce these rules with custom linters and structural tests, plus a small set of “taste invariants.” For example, we statically enforce structured logging, naming conventions for schemas and types, file size limits, and platform-specific reliability requirements with custom lints. Because the lints are custom, we write the error messages to inject remediation instructions into agent context.
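The shape of such a lint, with remediation text baked into the failure message, can be sketched as follows. Layer names mirror the diagram above; the team's actual linters are not published:

```typescript
// Structural lint sketch: within a domain, a layer may only depend on
// layers earlier in the fixed chain.
const LAYERS = ["types", "config", "repo", "service", "runtime", "ui"] as const;
type Layer = (typeof LAYERS)[number];

// Returns null when the import is legal, or an error message that
// includes remediation instructions for the agent to act on.
function checkImport(importer: Layer, imported: Layer): string | null {
  if (LAYERS.indexOf(imported) <= LAYERS.indexOf(importer)) return null;
  return (
    `layer-violation: "${importer}" may not import "${imported}". ` +
    `Dependencies must point backward along ${LAYERS.join(" → ")}. ` +
    `Remediation: move the shared code into "${importer}" or an earlier ` +
    `layer, or route the dependency through a Provider interface.`
  );
}
```

Because the message names the exact fix, an agent that hits this lint in CI can usually self-correct without a human restating the architecture.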

In a human-first workflow, these rules might feel pedantic or constraining. With agents, they become multipliers: once encoded, they apply everywhere at once.

At the same time, we’re explicit about where constraints matter and where they do not. This resembles leading a large engineering platform organization: enforce boundaries centrally, allow autonomy locally. You care deeply about boundaries, correctness, and reproducibility. Within those boundaries, you allow teams—or agents—significant freedom in how solutions are expressed.

The resulting code does not always match human stylistic preferences, and that’s okay. As long as the output is correct, maintainable, and legible to future agent runs, it meets the bar.

Human taste is fed back into the system continuously. Review comments, refactoring pull requests, and user-facing bugs are captured as documentation updates or encoded directly into tooling. When documentation falls short, we promote the rule into code.

## Throughput changes the merge philosophy

As Codex’s throughput increased, many conventional engineering norms became counterproductive.

The repository operates with minimal blocking merge gates. Pull requests are short-lived. Test flakes are often addressed with follow-up runs rather than blocking progress indefinitely. In a system where agent throughput far exceeds human attention, corrections are cheap, and waiting is expensive.

This would be irresponsible in a low-throughput environment. Here, it’s often the right tradeoff.

## What “agent-generated” actually means

When we say the codebase is generated by Codex agents, we mean everything in the codebase.

Agents produce:

-   Product code and tests
-   CI configuration and release tooling
-   Internal developer tools
-   Documentation and design history
-   Evaluation harnesses
-   Review comments and responses
-   Scripts that manage the repository itself
-   Production dashboard definition files

Humans always remain in the loop, but work at a different layer of abstraction than we used to. We prioritize work, translate user feedback into acceptance criteria, and validate outcomes. When the agent struggles, we treat it as a signal: identify what is missing—tools, guardrails, documentation—and feed it back into the repository, always by having Codex itself write the fix.

Agents use our standard development tools directly. They pull review feedback, respond inline, push updates, and often squash and merge their own pull requests.

## Increasing levels of autonomy

As more of the development loop was encoded directly into the system—testing, validation, review, feedback handling, and recovery—the repository recently crossed a meaningful threshold where Codex can drive a new feature end-to-end.

Given a single prompt, the agent can now:

-   Validate the current state of the codebase
-   Reproduce a reported bug
-   Record a video demonstrating the failure
-   Implement a fix
-   Validate the fix by driving the application
-   Record a second video demonstrating the resolution
-   Open a pull request
-   Respond to agent and human feedback
-   Detect and remediate build failures
-   Escalate to a human only when judgment is required
-   Merge the change

This behavior depends heavily on the specific structure and tooling of this repository and should not be assumed to generalize without similar investment—at least, not yet.

## Entropy and garbage collection

**Full agent autonomy also introduces novel problems.** Codex replicates patterns that already exist in the repository—even uneven or suboptimal ones. Over time, this inevitably leads to drift.

Initially, humans addressed this manually. Our team used to spend every Friday (20% of the week) cleaning up “AI slop.” Unsurprisingly, that didn’t scale.

Instead, we started encoding what we call “golden principles” directly into the repository and built a recurring cleanup process. These principles are opinionated, mechanical rules that keep the codebase legible and consistent for future agent runs. For example: (1) we prefer shared utility packages over hand-rolled helpers to keep invariants centralized, and (2) we don’t probe data “YOLO-style”—we validate boundaries or rely on typed SDKs so the agent can’t accidentally build on guessed shapes. On a regular cadence, we have a set of background Codex tasks that scan for deviations, update quality grades, and open targeted refactoring pull requests. Most of these can be reviewed in under a minute and automerged.

This functions like garbage collection. Technical debt is like a high-interest loan: it’s almost always better to pay it down continuously in small increments than to let it compound and tackle it in painful bursts. Human taste is captured once, then enforced continuously on every line of code. This also lets us catch and resolve bad patterns on a daily basis, rather than letting them spread in the code base for days or weeks.

## What we’re still learning

This strategy has so far worked well up through internal launch and adoption at OpenAI. Building a real product for real users helped anchor our investments in reality and guide us towards long-term maintainability.

What we don’t yet know is how architectural coherence evolves over years in a fully agent-generated system. We’re still learning where human judgment adds the most leverage and how to encode that judgment so it compounds. We also don’t know how this system will evolve as models continue to become more capable over time.

What’s become clear: building software still demands discipline, but the discipline shows up more in the scaffolding than in the code. The tooling, abstractions, and feedback loops that keep the codebase coherent are increasingly important.

**Our most difficult challenges now center on designing environments, feedback loops, and control systems** that help agents accomplish our goal: build and maintain complex, reliable software at scale.

As agents like Codex take on larger portions of the software lifecycle, these questions will matter even more. We hope that sharing some early lessons helps you reason about where to invest your effort so [you can just build things](https://openai.com/codex/).
