OpenAI’s o3 price plunge changes everything for vibe coders
Wednesday, June 18, 2025, 11:00, by InfoWorld
On June 10, OpenAI slashed the list price of its flagship reasoning model, o3, by roughly 80%: from $10 per million input tokens and $40 per million output tokens to $2 and $8, respectively. API resellers reacted immediately: Cursor now counts one o3 request the same as a GPT-4o call, and Windsurf lowered the “o3-reasoning” tier to a single credit as well. For Cursor users, that’s a ten-fold cost cut overnight.
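Run the numbers and the scale of the cut is obvious. Here is a quick back-of-the-envelope sketch in Python using the list prices above; the 4,000-in / 1,600-out request shape matches the architectural prompt discussed below, and the helper function is purely illustrative.

```python
# Back-of-the-envelope cost per o3 request, before and after the June 10 cut.
OLD = {"input": 10.00, "output": 40.00}  # $ per million tokens, pre-cut
NEW = {"input": 2.00, "output": 8.00}    # $ per million tokens, post-cut

def request_cost(prices, tokens_in, tokens_out):
    """Dollar cost of one request at the given per-million-token prices."""
    return (tokens_in * prices["input"] + tokens_out * prices["output"]) / 1_000_000

# A typical architectural prompt: 4k tokens in, 1.6k tokens out.
before = request_cost(OLD, 4_000, 1_600)
after = request_cost(NEW, 4_000, 1_600)
print(f"before ${before:.3f} / after ${after:.3f} / cut {1 - after / before:.0%}")
# before $0.104 / after $0.021 / cut 80%
```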
Latency improved in parallel. OpenAI hasn’t published new latency figures, and third-party dashboards still show time to first token (TTFT) in the 15- to 20-second range for long prompts, but thanks to fresh Nvidia GB200 clusters and a revamped scheduler that shards long prompts across more GPUs, o3 feels snappier in real use. It is still slower than lightweight models, but no longer coffee-break slow.

Claude 4 is fast yet sloppy

Much of the community’s oxygen has gone to Claude 4. It’s undeniably quick, and its 200k context window feels luxurious. Yet in day-to-day coding, I, along with many Reddit and Discord posters, keep tripping over Claude’s action bias: It happily invents stubbed functions instead of real implementations, fakes unit tests, or rewrites mocks it was told to leave alone. The speed is great; the follow-through often isn’t.

o3: Careful, deliberate, and suddenly affordable

o3 behaves almost the opposite way. It thinks first, asks clarifying questions, and tends to produce code that actually compiles. Until last week, that deliberation was priced like a Jeff Bezos-style turducken superyacht. Now it’s a used Honda Civic. Specifically, the same 4k in / 1.6k out architectural prompt fell from about $0.10 to about $0.02, exactly mirroring OpenAI’s official 80% cut.

Tool calling: When ‘reasoning’ goes full Rube Goldberg

One caveat: o3 loves tool calls, often too much. Windsurf users complain it “overuses unnecessary tool calls and still fails to write the code.” In my own sessions, o3 peppers the planner with diff, run-tests, search, and even file-system reads more aggressively than Claude does. Claude (and smaller models) often infer the answer without explicit calls; o3 prefers to see the facts for itself. That’s great until it isn’t. Keep a finger on the kill switch. I usually tell o3 to create a series of contained subtasks so that it doesn’t try to take in the whole picture when it’s time to write code. Some tips, with the first sketched in code just after this list:

- Throttle calls: Set hard caps, e.g., “Use a maximum of 8 tool calls.”
- Demand minimal scope: Remind o3, e.g., “Touch only these two files.”
- Review diffs and commit often: As with any model, it is far from perfect.
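The first tip can be enforced mechanically, not just in prose. Below is a minimal sketch of a hard tool-call budget wrapped around the OpenAI Python SDK’s standard tool-calling loop; the tool schema, the budget value, and the run_tool() executor are my own illustrative assumptions, not an o3 or vendor feature.

```python
# A minimal sketch of the "throttle calls" tip around the OpenAI Python SDK's
# standard tool-calling loop. The tool schema, budget value, and run_tool()
# executor are illustrative assumptions, not an official o3 or vendor feature.
from openai import OpenAI

client = OpenAI()
MAX_TOOL_CALLS = 8  # hard cap, mirroring the system prompt below

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project test suite and return the output.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

messages = [
    {"role": "system", "content": "Use a maximum of 8 tool calls. "
                                  "Touch only these two files: api.py, test_api.py."},
    {"role": "user", "content": "Rename OrderService to InvoiceService; keep tests green."},
]

calls_used = 0
over_budget = False
while True:
    reply = client.chat.completions.create(
        model="o3",
        messages=messages,
        tools=TOOLS,
        tool_choice="none" if over_budget else "auto",  # kill switch once the cap is hit
    )
    msg = reply.choices[0].message
    if not msg.tool_calls:
        print(msg.content)  # final answer, no further tool calls requested
        break
    messages.append(msg)  # keep the assistant turn so tool results line up
    for call in msg.tool_calls:
        calls_used += 1
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool(call),  # run_tool(): your executor (hypothetical)
        })
    if calls_used >= MAX_TOOL_CALLS:
        over_budget = True
        messages.append({"role": "user",
                         "content": "Tool budget exhausted. Finish with what you have."})
```

The tool_choice="none" flip is the programmatic kill switch: once the budget is spent, the model can no longer request tools and has to answer with whatever facts it has already gathered.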
Why reasoning matters for coding

Reasoning-heavy models excel at multi-hop constraints, such as renaming a domain class, updating a database migration, fixing integration tests, and keeping the logic intact in one go. Research on chain-of-thought for code generation shows that structured reasoning improves pass@1 accuracy by double-digit percentages on benchmarks such as HumanEval and Mostly Basic Python Problems (MBPP). Smaller models falter after the third hop; o3 keeps more invariants in working memory, so its first draft passes more often.

But is this “real” thinking?

Apple’s recent paper “The Illusion of Thinking” argues that so-called large reasoning models (LRMs) don’t really reason; they just pattern-match longer chains of tokens. The authors show that LRMs plateau on synthetic hard-mode puzzles. That echoes what most practitioners already know: A chain of thought is powerful but not magic. Whether you label it “reasoning” or “very fancy autocomplete,” the capability boost is real, and the price drop makes that boost usable.

Under the hood: Subsidies, silicon, and scale

OpenAI can’t possibly turn a profit at $2 in / $8 out if o3 inference still costs last winter’s rates. Two forces make the math less crazy:

- Hardware leaps: Nvidia’s GB200 NVL72 promises 30× inference throughput versus equal-node H100 clusters at big energy cuts.
- Capital strategy: Oracle’s 15-year, $40B chip lease to OpenAI spreads capex over more than a decade, turning GPU spend into cloud-like opex.

Even so, every major vendor is in land-grab mode, effectively subsidizing floating point operations per second (FLOPS) to lock in developers before regulation and commoditization.

Rising alternatives keep the pressure on

OpenAI’s rivals smell opportunity:

- BitNet b1.58 (Microsoft Research): A 1-bit model that runs respectable code generation on CPUs, slashing infrastructure costs.
- Qwen3-235B-A22B (Alibaba): An Apache 2.0 mixture-of-experts (MoE) giant with Claude-level reasoning and only 22B parameters active per token.

BitNet does not match o3 or GPT-4o in raw capability, yet it runs on modest hardware and reminds us that progress is not limited to ever-larger models. As smaller architectures improve, they may never equal the absolute frontier, but they can reach the level most tasks demand. Qwen continues to trail o3 in both speed and skills, but the pricing trend is clear: Advanced reasoning is becoming a low-cost commodity. Vendor subsidies might never be recouped, and without strong lock-in, cheaper hardware and rapid open-source releases could drive the marginal cost of top-tier reasoning toward zero.

Practical workflow adjustments

- Promote o3 to your main coder and planner. The latency is now bearable, the price is sane, and the chain of thought pays off. What you pay in wait time, you make back in rework you don’t have to do.
- Retain a truly lightweight fallback. No one wants to wait on a reasoning model for simple chores like “make a branch” or “start docker.” Many models can handle these; pick something light and cheap. (A routing sketch closes out this article.)
- Tame tool mania. Explicitly set a max-tool-calls rule per request, and lean on diff reviews before merging.
- Prompt economically. Even at $2 in / $8 out, sloppy prompts still burn money. Use terse system messages and reference context IDs rather than pasting full files.
- Watch latency spikes. Subsidy era or not, usage surges can reintroduce throttles. Have a backup model keyed in your IDE just in case.
- Consider alternatives to heavyweights like Cursor and Windsurf. I started building my own AI coding assistant in part so I could use different models and combinations of models, but you can do a version of this with more open-source alternatives like Roo Code or Cline. They have their own issues, but being able to access the whole catalog of models on OpenRouter is eye-opening.

The bottom line

A few weeks ago I told readers o3 was “too slow and too expensive for daily coding.” Today it’s neither. It’s still no REPL, but the brains-per-buck ratio just flipped. o3 out-codes Claude in reliability, and its cost finally lets you keep it on speed-dial without lighting your wallet on fire. Christmas came in June, and Santa Sam stuffed our stockings with subsidized FLOPS. Fire up o3, keep your prompts tight, and let the model over-think so you don’t have to.
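As promised in the workflow list, here is a minimal sketch of the lightweight-fallback idea over OpenRouter’s OpenAI-compatible endpoint. The model IDs and the triviality heuristic are illustrative assumptions on my part, not recommendations from any benchmark.

```python
# Hedged sketch of the "lightweight fallback" tip: route trivial chores to a
# cheap model and real work to o3 through OpenRouter's OpenAI-compatible API.
# Model IDs and the triviality heuristic are illustrative assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter speaks the OpenAI wire protocol
    api_key="sk-or-...",                      # your OpenRouter key
)

HEAVY = "openai/o3"           # main coder and planner
LIGHT = "openai/gpt-4o-mini"  # cheap model for "make a branch"-class chores

CHORE_PREFIXES = ("make ", "start ", "run ", "git ", "docker ")

def ask(prompt: str) -> str:
    """Send trivial one-liners to the light model, everything else to o3."""
    trivial = len(prompt) < 80 and prompt.lower().startswith(CHORE_PREFIXES)
    reply = client.chat.completions.create(
        model=LIGHT if trivial else HEAVY,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

print(ask("make a branch called fix/login"))                       # -> light model
print(ask("Refactor the payment flow so retries are idempotent."))  # -> o3
```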
https://www.infoworld.com/article/4008535/openais-o3-price-plunge-changes-everything-for-vibe-coders...