I heard the computer click through three apps while I sipped my coffee. The assistant finished a spreadsheet, patched a snippet of code, and then pulled up the exact paper I needed. I sat back and realized the cursor wasn’t just moving — it had a plan.
I’ve tested models that talk fast and models that act fast; this one both plans and executes. You’ll want to know what that means for your workflows, your team, and the next time a deadline lands on your desk.
Office teams are reopening files after meetings — GPT-5.4 improves agentic workflows, research, and knowledge work
When a project manager tells you “the deck needs to be done by noon,” the clock suddenly matters. GPT-5.4 folds reasoning, the coding strengths inherited from GPT-5.3-Codex, and agentic workflows into one model so you can ask it to build slides, edit spreadsheets, and draft documents without switching tools.
I watched it orchestrate multi-step tasks across apps like a backstage tech running cues, and the difference is not cosmetic. The Codex app now supports a context window up to 1 million tokens, which means long research threads, full reports, and complex spreadsheets can stay in one session. Tool search is sharper, so the model finds the right utility instead of guessing.
Benchmarks back the claim. GPT-5.4 scored 83% on GDPval, which measures professional knowledge work across 44 occupations; Anthropic’s Claude Opus 4.6 scored 78%. On SWE-Bench Pro, a coding benchmark, GPT-5.4 hit 57.7%. OSWorld-Verified — a test of practical computer use — records 75% accuracy. Numbers don’t tell the whole story, but they nudge you toward rethinking who does what in the office.
Can GPT-5.4 run my desktop apps?
Yes — and with more reliability than previous releases. The ChatGPT Thinking variant demonstrates native computer use and multi-step automation; it executes app actions, chains commands, and improves tool selection. For product teams, that means fewer handoffs and fewer “you do it; I’ll review” loops.
One late-night search turned into a resolved citation — GPT-5.4 gets better at agentic web browsing and facts
At 2 a.m., I asked it to find an obscure methods paper and verify a statistic; it surfaced the source and outlined the methodology. GPT-5.4’s web-browsing agents can chase hard-to-locate information and return organized evidence instead of a list of links.
In scientific fact tests, GPT-5.4 scored 92.8% on GPQA Diamond; Google’s Gemini 3.1 Pro posted 94.3%. OpenAI also reports a reduction in hallucinations: 33% fewer false claims and 18% fewer errors versus GPT-5.2. That’s not perfect, but the model is getting measurably more dependable.
The model behaves like a skilled librarian with a flashlight — it sifts, checks, and hands you the most relevant volumes. Use it for literature reviews, citation harvesting, or vetting product claims, but keep a skeptic’s eye on edge cases and novel claims.
How accurate is GPT-5.4 compared to Gemini and Claude?
Short answer: competitive. GPQA Diamond: GPT-5.4 at 92.8% vs Gemini 3.1 Pro at 94.3%. GDPval: GPT-5.4 at 83% vs Claude Opus 4.6 at 78%. Each model has strengths — Gemini slightly ahead on some scientific QA, Claude strong on specific benchmarks — but GPT-5.4 narrows gaps while adding native computer control.
Your teammate just clicked “Upgrade” — rollout, pricing, and what it means for developers
In a product meeting, someone inevitably asks: “Who’s paying for this?” OpenAI is rolling GPT-5.4 Thinking to ChatGPT Plus and Team users; ChatGPT Pro subscribers get the more powerful GPT-5.4 Pro. For API buyers, pricing is set at $2.50 per one million input tokens (€2) and $15 per one million output tokens (€14).
That price-per-token framing matters because GPT-5.4 improves token efficiency and reduces redundant queries; in practice you may spend less to get more done. It’s also a signal: OpenAI wants this model to be used inside workflows, not just as a chat partner.
Developers and product teams should test agentic flows in staging before pushing to production. Integration points include ChatGPT, Codex, and platforms such as GitHub Copilot and Microsoft’s AI stack; if you’re building internal automation or customer-facing tools, this model changes the cost-benefit math for automation projects.
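One practical way to act on that advice is a staging smoke test: run the agent against a handful of tasks with verifiable outcomes before granting it production access. The sketch below assumes a hypothetical `run_agent` wrapper around whatever SDK you integrate with (that wrapper and the stub agent are illustrative, not part of any real API):

```python
# Minimal staging smoke test for an agentic flow. `run_agent` is a
# hypothetical wrapper you would implement around your real client;
# the pattern is what matters: run prompts where errors are expensive
# and check each output against a verifiable expectation.
from typing import Callable, Dict

def smoke_test(run_agent: Callable[[str], str],
               cases: Dict[str, str]) -> Dict[str, bool]:
    """Run each prompt and record pass/fail against an expected substring."""
    results = {}
    for prompt, expected in cases.items():
        output = run_agent(prompt)
        results[prompt] = expected in output
    return results

# Stub agent for illustration only; swap in a real staging client.
def fake_agent(prompt: str) -> str:
    return "TOTAL: 42" if "sum" in prompt else "done"

print(smoke_test(fake_agent, {"sum column B": "TOTAL"}))  # {'sum column B': True}
```

Keeping the harness this small makes it easy to grow the `cases` dictionary as you discover the failure modes that matter for your team.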
How much does GPT-5.4 cost on the API?
API pricing: $2.50 per 1M input tokens (€2) and $15 per 1M output tokens (€14). Factor in higher throughput from fewer hallucinations and better tool use — you may see ROI sooner than with older models.
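For budgeting, those rates reduce to simple arithmetic. A minimal sketch, using the published USD prices (the token counts in the example are illustrative, not from the source):

```python
# Rough cost estimate for a GPT-5.4 API workload at the published rates:
# $2.50 per 1M input tokens, $15 per 1M output tokens.
INPUT_RATE = 2.50 / 1_000_000    # USD per input token
OUTPUT_RATE = 15.00 / 1_000_000  # USD per output token

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for one request or batch."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a long research session with 800k input and 50k output tokens.
print(round(estimate_cost(800_000, 50_000), 2))  # 2.75
```

Note the asymmetry: output tokens cost six times as much as input tokens, so verbose agents are where the bill grows fastest.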
You’ll want to test agents on the tasks where errors are expensive and measure whether the model’s reductions in false claims and task friction actually lower your headcount or your time-to-delivery. Then comes the harder question: who keeps the keys?