OpenAI Launches GPT-5.4: Major Leap for AI Agents


I watched a demo where GPT-5.4 opened a terminal, wrote a script, and then clicked through a desktop to run it. You could see the room split between curiosity and a cold, practical worry. I asked the engineer whether this was progress or a problem waiting to happen.

A product manager on my Slack posted a clip of GPT-5.4 opening Excel and running a macro.

OpenAI is pitching GPT-5.4 as more than a smarter chatbot. The company says this release is its first general-use model with native computer-use capabilities: it can write code, execute tasks, and issue keyboard and mouse commands to interact with apps across your machine. It will appear in ChatGPT, Codex, and the OpenAI API, with access tiered by plan: GPT-5.4 Thinking for Plus, Teams, and Pro subscribers, and GPT-5.4 Pro for the API and ChatGPT Enterprise and Edu customers.

That access model marks a behavioral shift. Giving a model the ability to act on a desktop is like handing a careful intern the office key: powerful if you trust them, risky if you don't. The product play is clear: build agents that can complete multi-step professional work without the user micromanaging every click.

What can GPT-5.4 do that previous models couldn’t?

Short answer: autonomous computer tasks. Beyond improved reasoning and coding, OpenAI says GPT-5.4 can navigate operating systems, run scripts it writes, and chain actions across apps. For developers, that means API access to a model that can both generate and execute solutions; for enterprise customers, it promises agentic workflows that move work forward with less human orchestration.
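The "generate and execute" workflow described above boils down to a plan-act-observe loop: the model proposes the next action, the harness runs it, and the result feeds back into the next proposal. The sketch below shows that basic shape with stubbed functions; `propose_step` and `execute` are illustrative placeholders, not part of any OpenAI API.

```python
# Hypothetical plan-act-observe loop: the basic shape of an agentic
# workflow. Both functions below are stubs for illustration only.

def propose_step(goal, history):
    # Stub: a real agent would ask the model for the next action,
    # passing the goal and the observations gathered so far.
    if not history:
        return ("write_script", "echo done")
    return ("finish", None)

def execute(action, arg):
    # Stub executor: a real agent would run the action in a sandbox
    # and return what actually happened as the next observation.
    return f"ran {action}"

def run_agent(goal, max_steps=5):
    """Loop until the model signals completion or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        action, arg = propose_step(goal, history)
        if action == "finish":
            break
        history.append((action, execute(action, arg)))
    return history

print(run_agent("generate a report"))
```

The step budget (`max_steps`) is the simplest guardrail in this pattern: it bounds how far an agent can chain actions before a human looks at the trace.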

A benchmarking engineer ran APEX-Agents and watched OpenAI jump to the top of the leaderboard.

OpenAI didn't just release buzz; it released metrics. The company claims GPT-5.4 topped Mercor's APEX-Agents leaderboard (professional-services tasks) and led OSWorld-Verified and WebArena Verified for computer use. In user-facing terms, OpenAI says individual responses are 33% less likely to contain errors than GPT-5.2's, and the model is 18% less likely to make mistakes overall, with fewer hallucinations.

Benchmarks matter because they translate into hours saved and fewer follow-ups. If those numbers hold across real customers, from ChatGPT Plus users to Codex dev teams and API integrators, the product could shave significant friction from workflows inside companies and across integrations with tools like Slack, Zapier, and Microsoft Office.

Is it safe to give GPT-5.4 control of my computer?

There’s no simple yes or no. OpenAI frames the capability as governed by access controls and subscription tiers, but giving a model permissions introduces new attack surfaces and policy questions. Enterprises will want strict auditing, whitelists, and human-in-the-loop policies before allowing automated actions on sensitive systems.
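The controls enterprises will reach for, an action allowlist, an audit trail, and a human approval step, can be composed into a single gate in front of every action an agent requests. Here is a minimal sketch of that pattern; all names are illustrative and nothing here is an OpenAI API.

```python
import time

# Minimal sketch of the enterprise controls described above:
# an allowlist of permitted actions, an audit log of every request,
# and a human-in-the-loop approval hook for anything off-list.

ALLOWLIST = {"open_spreadsheet", "read_file"}
AUDIT_LOG = []

def gate_action(action: str, approver=None) -> bool:
    """Record the requested action and decide whether it may run."""
    entry = {"action": action, "ts": time.time()}
    if action in ALLOWLIST:
        entry["decision"] = "auto-approved"
    elif approver is not None and approver(action):
        entry["decision"] = "human-approved"
    else:
        entry["decision"] = "blocked"
    AUDIT_LOG.append(entry)  # every request is logged, approved or not
    return entry["decision"] != "blocked"

# Example: a conservative approver that rejects anything destructive.
deny_deletes = lambda a: "delete" not in a

print(gate_action("read_file"))                     # allowlisted
print(gate_action("delete_records", deny_deletes))  # blocked
```

The key design choice is that the log is written before the decision is returned, so even blocked requests leave an audit trail for later review.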

In the company Slack, an employee posted a resignation note and a line about the Department of Defense—reactions spiked.

Product wins don’t sit in a vacuum. After OpenAI announced DoD work—and after rival Anthropic publicly refused to weaken safeguards—the company reportedly lost about 1.5 million users and faced internal unrest. Sam Altman had to address staff directly as defections and dissent became visible. Public trust and developer goodwill aren’t measured in benchmark points.

Technical excellence moves fast, but reputation runs on a different clock: policy, PR, and workforce reaction form a lock that better benchmark results alone can't pick. The company is betting that improved reliability and concrete, demonstrable safety controls will pull users back to ChatGPT, Codex, and paying API customers.

For you—if you build with the API, manage teams that use ChatGPT Enterprise, or just keep an eye on AI risk—GPT-5.4 is a material step. It raises the practical bar for what agentic AI can do, while reopening hard debates about who gets the keys and how they’re monitored. Is a faster, quieter model enough to fix a public relations fracture, or will the politics of trust outweigh technical progress?