Key Highlights
- ✓ Mac-native app for orchestrating parallel AI coding agents across projects
- ✓ Automations: scheduled cloud agents that code while you sleep — unique in the market
- ✓ Built-in worktrees: multiple agents work on the same repo without conflicts
- ✓ Open-source CLI built in Rust — cross-platform, lightweight, free
- ✓ Included in ChatGPT Plus ($20/mo) — no separate subscription needed
- ✓ 56.8% SWE-Bench Pro: honest about failing nearly half of professional tasks
The Agent That Never Logs Off
Every AI coding tool makes the same pitch: it writes code so you don't have to. Codex makes a different one: it writes code while you're not even there.
OpenAI's Codex launched its Mac desktop app on February 2, 2026, and the headline feature isn't the AI model or the interface — it's Automations. You define a set of instructions, attach optional skills, set a schedule, and Codex runs in the cloud on that schedule. No laptop required. No terminal open. The agent works, the results land in a review queue, and you deal with them when you're ready.
No other coding tool does this. Claude Code needs your terminal running. Cursor needs your editor open. Codex, at least in theory, operates continuously.
That's the theory. The practice is more complicated.
What Codex Actually Is
Codex exists across four surfaces, which is both its greatest strength and its biggest source of confusion:
The Mac App is a focused desktop experience for managing coding threads in parallel. Each agent runs in its own thread, organized by project. You can switch between tasks without losing context, review diffs inline, comment on changes, and open them in your preferred editor. Built-in worktree support means multiple agents can work on the same repository without stepping on each other.
The CLI is open source, built in Rust, and cross-platform. Install it via npm or Homebrew. It launches a full-screen terminal UI where you iterate with Codex in real time — closer to the Claude Code experience but with OpenAI's models.
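For reference, getting started is a one-liner either way. The package and formula names below reflect the current public distributions, but treat this as a sketch and verify against OpenAI's docs before copying:

```shell
# Install the Codex CLI via npm (Node.js required) ...
npm install -g @openai/codex

# ... or via Homebrew on macOS.
brew install codex

# Launch the full-screen terminal UI in the current project directory.
codex
```

Both channels ship the same Rust binary; pick whichever package manager you already use.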
The Web Interface lives inside ChatGPT. You connect a GitHub repo, describe a task, and Codex spins up a cloud sandbox, clones your code, makes changes, runs your test suite, and presents a clean diff.
The VS Code Extension brings Codex into your existing editor workflow.
Four surfaces. Same underlying capability. But the fragmentation is confusing — developers on X are actively debating which surface to use for what, and even OpenAI employees are still building out feature parity between them.
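The built-in worktree support mentioned above sits on top of a standard git feature. The sketch below uses plain git (no Codex required) to show the underlying mechanism: each worktree is a separate checkout of the same repository on its own branch, so parallel edits never collide. The `agent-a`/`agent-b` names are illustrative only:

```shell
# Create a throwaway repo to demonstrate (assumes git >= 2.5).
cd "$(mktemp -d)"
git init -q demo && cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Give each hypothetical agent its own checkout on its own branch.
git worktree add -q -b agent-a ../agent-a
git worktree add -q -b agent-b ../agent-b

# Lists the main checkout plus the two agent worktrees.
git worktree list
```

This is roughly what "multiple agents on the same repository without stepping on each other" amounts to: isolated working directories sharing one object store, merged back like any other branches.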
The Model: Honest Numbers
GPT-5.3-Codex, released February 5, combines the coding power of its predecessor with stronger reasoning. It's 25% faster and uses fewer tokens. It sets a new industry high on SWE-Bench Pro and Terminal-Bench.
But here's the number that matters: 56.8% on SWE-Bench Pro. That's the best score in the industry — and it means the model still fails nearly half of professional-level software engineering tasks.
This is the uncomfortable truth about every AI coding agent in 2026. The demos are impressive. The benchmarks improve quarterly. But on real-world professional tasks — the kind with ambiguous requirements, complex dependencies, and subtle edge cases — the failure rate is still coin-flip territory.
OpenAI deserves credit for publishing these numbers transparently. Many competitors don't.
The Automation Thesis
Automations are what separate Codex from the pack. The concept: define repeating coding workflows — dependency updates, test generation, documentation maintenance, lint fixes — and schedule them to run automatically.
Each automation combines instructions with optional Skills (reusable instruction bundles plus scripts). Codex runs them in cloud sandboxes on your schedule, and results appear in a review queue.
The vision is compelling: a fleet of AI agents maintaining your codebase in the background, handling the routine work that compounds into technical debt when humans neglect it. Dependency updates every Monday. Test coverage gaps filled every Wednesday. Documentation synced after every release.
The limitation is equally clear: those cloud sandboxes cannot access private registries, internal packages, or company-specific tooling. If your React frontend imports from an internal component library — as most enterprise apps do — Codex will fail on every task that touches those imports.
This is a fundamental architectural constraint, not a bug to be patched. Cloud sandboxes trade access for isolation. For open-source and greenfield projects, it works beautifully. For enterprise codebases with proprietary dependencies, it's a dealbreaker.
The Competitive Frame
The $20/month coding agent war has three contenders with radically different philosophies:
Claude Code ($20/mo via Claude Pro) runs locally in your terminal. It sees your actual file system, uses your actual toolchain, and explains its reasoning in real time. No cloud sandbox limitations. Revenue: $2.5B annualized. The market leader by a wide margin.
Cursor ($20/mo Pro) wraps AI into a polished IDE experience. Tab completion, visual diffs, multi-file editing. Revenue: $1B+ annualized. The "make AI feel familiar" approach.
Codex ($20/mo via ChatGPT Plus) uses cloud sandboxes for isolation and adds scheduled automations. Revenue: not disclosed separately. The "agents that work autonomously" approach.
Each philosophy implies a different bet about the future. Claude Code bets that developers want raw reasoning and local control. Cursor bets they want visual polish and IDE familiarity. Codex bets they want delegation and autonomy.
The revenue numbers suggest the market currently favors raw reasoning over autonomy. But Codex's automation thesis hasn't had time to prove itself yet — the Mac app is barely a month old.
The Spark Preview
GPT-5.3-Codex-Spark, a faster variant, is quietly rolling out to ChatGPT Pro users. It's appearing in usage dashboards but isn't selectable in the CLI or app yet. No official announcement, no documentation — just backend infrastructure showing up where it shouldn't yet.
This quiet rollout suggests OpenAI is iterating rapidly on model variants specifically optimized for different coding workflows. Spark appears to be a speed-optimized variant for quick tasks, while the full GPT-5.3-Codex handles complex reasoning.
The Verdict
Codex's automation feature is genuinely novel and points to where AI coding tools are heading. The idea that agents maintain your codebase on a schedule — not just when you ask — is the logical next step.
But the product is new, the surfaces are fragmented, and the cloud sandbox limitation makes it impractical for most enterprise workflows. The 56.8% professional task success rate is the best in the industry and still not reliable enough for unsupervised automation.
Codex is a strong addition to a ChatGPT Plus subscription you're already paying for. It's not yet a reason to switch from Claude Code or Cursor as your primary coding tool.
Use Codex if: You already have ChatGPT Plus, work on open-source or greenfield projects, and want to experiment with scheduled AI coding.
Skip it if: You need private registry access, prefer local execution, or want a mature daily-driver tool. Claude Code and Cursor are both further along on that front.