Sam Altman: 0.1% More Excited About ChatGPT’s Future

I clicked the new GPT-5.5 toggle at midnight and watched the cursor stutter—then answer. The response arrived faster than last week’s build, but it didn’t feel like a revelation. I walked away 0.1% more excited about ChatGPT’s future, which is oddly specific and oddly promising.

I’ll be blunt: you and I have seen this movie before. You’ve read the press release, you’ve scrolled past the praise, and you’ve saved a test prompt for the moment you have uninterrupted time. I’m here to tell you what actually shifted, what didn’t, and why a fraction of progress still matters.

At my desk last week, an engineer dropped a chart showing milliseconds shaved from inference times.

The headline from OpenAI is simple: GPT-5.5 runs faster and, according to their blog, shows gains in agentic coding, computer use, knowledge work, and early scientific research. Sam Altman posted on X, “I personally like it.” There’s genuine pride behind that line—he praised the inference team and called out a strategic shift: “to a significant degree, we have to become an AI inference company now.” That’s not marketing puff; it’s an operational signal.

What is GPT-5.5?

GPT-5.5 is the latest incremental update to OpenAI’s flagship model. Think of it as faster reasoning, with better behavior in tasks that require acting over time—agents, automation, multi-turn workflows. It’s not a leap; it’s precision tuning of latency, memory, and task persistence, which matters if you want an assistant that can actually complete multi-step jobs.

In a Slack channel yesterday, someone pasted a Claude Mythos benchmark and the thread lit up.

Benchmarks dominate headlines because they’re easy to compare. OpenAI says GPT-5.5 beats Anthropic’s Claude Opus 4.7 on several cybersecurity and computer-use tests, while Anthropic points to Mythos as the real standard. You should read both press releases with your skepticism on.

How does GPT-5.5 compare to Claude?

Short answer: it depends on the test and access level. Opus performs well on many public benchmarks; Mythos (Anthropic’s limited-access model) dominates the internal charts Anthropic shared last week. OpenAI’s win in autonomous agent benchmarks is meaningful for product engineers building task-driven assistants; Anthropic’s gains in some coding tests show there are trade-offs. Menlo Ventures’ Deedy Das even noted GPT-5.5 doesn’t hit the highest mark on coding benchmarks—so pick the model by what you need it to do.

At a coffee shop, a founder told me they’d already rolled an experiment with workspace agents.

OpenAI didn’t release GPT-5.5 alone. In the same week they shipped an image-generation update, workspace agents that complete tasks autonomously, a privacy filter for PII, and a Codex update for coding agents. That output signals a rhythm: frequent releases, iterative polish, and a product stack that keeps gaining practical features. The announcement cadence makes the company feel less like a research lab and more like a software shop that ships patches weekly.

Should I upgrade to GPT-5.5?

If you build tools that need lower latency, better agent behavior, or improved persistence across steps, yes—this is worth testing. If your primary need is raw, creative code synthesis, wait and compare—benchmarks show mixed results. The pragmatic move is to benchmark your own workflows: let the models run your real tasks, not synthetic tests.
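That bench-your-own-workflows advice can be sketched as a small timing harness. This is a minimal sketch under stated assumptions: `run_task` below is a hypothetical stand-in for whatever you actually do with a model (an API call, an agent step, a code-gen request), not a real SDK function. The point is to time repeated runs of your real task and compare medians, not single samples or published leaderboard numbers.

```python
import statistics
import time

def bench(task_fn, runs=5):
    """Time repeated runs of a real task and report latency stats.

    task_fn is any zero-argument callable that performs one of your
    actual workflows. Median resists one-off network hiccups; worst
    case tells you what your users will occasionally feel.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        task_fn()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "worst_s": max(samples),
    }

# Hypothetical stand-in for a model call; swap in your own workflow.
def run_task():
    time.sleep(0.01)  # placeholder for a real request

result = bench(run_task, runs=3)
print(f"median {result['median_s']:.3f}s, worst {result['worst_s']:.3f}s")
```

Run the same harness against each model you’re considering, on identical tasks under identical conditions, and let those numbers decide—not the press releases.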

Let me offer two mental models. First: GPT-5.5 feels like upgrading a phone’s camera by a fraction—sharper but not revolutionary. Second: the AI arms race is a treadmill that keeps speeding up. Two images, one truth: incremental improvements compound.

Benchmarks can be gamed. Labs train to the tests, and models still make obvious mistakes when confronted with novel prompts. That doesn’t make the improvements irrelevant; it makes them conditional. You don’t buy the model; you buy the fit between model behavior and your workload.

So where does that leave us? OpenAI is iterating fast—Sam Altman’s public enthusiasm and the flood of updates show a team optimizing inference efficiency, product utility, and safety tooling. Anthropic continues to push, and other players will keep shifting the map. For practitioners, the imperative is practical: run real tasks, measure results, and be alert to where a small margin becomes decisive.

I’m 0.1% more excited because I see utility creeping forward—enough to change how we delegate work to software, not enough to call it a new kind of intelligence. What test will you run first to prove whether GPT-5.5 matters for your work?