GPT-5.4

minimaxir · 2026-03-05T18:15:56 1772734556

The marquee feature is obviously the 1M context window, compared to the ~200k other models support with maybe an extra cost for generations beyond 200k tokens. Per the pricing page, there is no additional cost for tokens beyond 200k: https://openai.com/api/pricing/

Also per pricing, GPT-5.4 ($2.50/M input, $15/M output) is much cheaper than Opus 4.6 ($5/M input, $25/M output) and Opus has a penalty for its beta >200k context window.

I am skeptical whether the 1M context window will provide material gains, as current Codex/Opus show weaknesses when their context window is mostly full, but we'll see.

Per updated docs (https://developers.openai.com/api/docs/guides/latest-model), it supersedes GPT-5.3-Codex, which is an interesting move.

damsta · 2026-03-05T20:12:14 1772741534

There is extra cost for >272K:

> For models with a 1.05M context window (GPT-5.4 and GPT-5.4 pro), prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.

Taken from https://developers.openai.com/api/docs/models/gpt-5.4
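A rough sketch of how that surcharge plays out, using the $2.50/M input and $15/M output list prices quoted upthread. The helper name is hypothetical, and the reading that the multipliers apply to every token once a prompt crosses 272K is an assumption based on the "for the full session" wording, not an official calculator.

```python
# Illustrative only: prices and threshold come from the quotes in this thread.
INPUT_PER_M = 2.50                 # USD per 1M input tokens (standard tier)
OUTPUT_PER_M = 15.00               # USD per 1M output tokens
LONG_CONTEXT_THRESHOLD = 272_000   # input tokens


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD (hypothetical helper).

    Assumption: once the prompt exceeds 272K input tokens, the 2x input and
    1.5x output multipliers apply to all tokens, not just the overflow.
    """
    long_context = input_tokens > LONG_CONTEXT_THRESHOLD
    input_rate = INPUT_PER_M * (2.0 if long_context else 1.0)
    output_rate = OUTPUT_PER_M * (1.5 if long_context else 1.0)
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


print(f"200K in / 10K out: ${estimate_cost(200_000, 10_000):.2f}")
print(f"300K in / 10K out: ${estimate_cost(300_000, 10_000):.2f}")
```

Under that reading, going from a 200K prompt to a 300K prompt costs more than double, because the higher rate covers the whole prompt and not only the last 28K tokens.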

WXLCKNO · 2026-03-06T00:24:11 1772756651

Anthropic literally don't allow you to use the 1M context anymore on Sonnet and Opus 4.6 without it being billed as extra usage immediately.

I had 4.5 1M before that so they definitely made it worse.

OpenAI at least gives you the option of using your plan for it. Even if it uses it up more quickly.

neom · 2026-03-06T00:56:15 1772758575

Is that why it says rate limit all the time if you switch to a 1M model on Claude now? It kept giving me that so I switched to API account over the weekend for some vibe coding ran up a huuuuge API bill by mistake, whooops.

minimaxir · 2026-03-05T20:29:15 1772742555

Good find, and that's too small a print for comfort.

ValentineC · 2026-03-05T21:31:41 1772746301

It's also in the linked article:

> GPT‑5.4 in Codex includes experimental support for the 1M context window. Developers can try this by configuring model_context_window and model_auto_compact_token_limit. Requests that exceed the standard 272K context window count against usage limits at 2x the normal rate.

glenstein · 2026-03-05T20:37:11 1772743031

Wow, that's diametrically the opposite point: the cost is *extra*, not free.

apetresc · 2026-03-05T21:24:08 1772745848

Diametrically opposite to tokens beyond 200K being literally free? As in, you only pay for the first 200K tokens and the remaining 800K cost $0.00?

I don't think that's a fair reading of the original post at all, obviously what they meant by "no cost" was "no increase in the cost".

fragmede · 2026-03-05T20:16:19 1772741779

Which, Claude has the same deal. You can get a 1M context window, but it's gonna cost ya. If you run /model in claude code, you get:

    Switch between Claude models. Applies to this session and future Claude Code sessions. For other/previous model names, specify with --model.
    
       1. Default (recommended)   Opus 4.6 · Most capable for complex work
       2. Opus (1M context)        Opus 4.6 with 1M context · Billed as extra usage · $10/$37.50 per Mtok
       3. Sonnet                   Sonnet 4.6 · Best for everyday tasks
       4. Sonnet (1M context)      Sonnet 4.6 with 1M context · Billed as extra usage · $6/$22.50 per Mtok
       5. Haiku                    Haiku 4.5 · Fastest for quick answers

tedsanders · 2026-03-05T18:41:14 1772736074

Yeah, long context vs compaction is always an interesting tradeoff. More information isn't always better for LLMs, as each token adds distraction, cost, and latency. There's no single optimum for all use cases.

For Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone, as from our testing we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`.

Curious to hear if people have use cases where they find 1M works much better!

(I work at OpenAI.)

sillysaurusx · 2026-03-05T20:08:16 1772741296

You may want to look over this thread from cperciva: https://x.com/cperciva/status/2029645027358495156

I too tried Codex and found it similarly hard to control over long contexts. It ended up coding an app that spit out millions of tiny files which were technically smaller than the original files it was supposed to optimize, except due to there being millions of them, actual hard drive usage was 18x larger. It seemed to work well until a certain point, and I suspect that point was context window overflow / compaction. Happy to provide you with the full session if it helps.

I’ll give Codex another shot with 1M. It just seemed like cperciva’s case and my own might be similar in that once the context window overflows (or refuses to fill) Codex seems to lose something essential, whereas Claude keeps it. What that thing is, I have no idea, but I’m hoping longer context will preserve it.

FrankBooth · 2026-03-05T20:40:48 1772743248

What’s the connection with context size in that thread? It seems more like an instruction following problem.

cperciva · 2026-03-06T01:40:53 1772761253

Yeah, I would definitely characterize it as an instruction following problem. After a few more round trips I got it to admit that "my earlier passes leaned heavily on build/tests + targeted reads, which can miss many “deep” bugs that only show up under specific conditions or with careful semantic review" and then asking it to "Please do a careful semantic review of files, one by one." started it on actually reviewing code.

Mind you, the bugs it reported were mostly bogus. But at least I was eventually able to convince it to try.

sillysaurusx · 2026-03-06T01:29:45 1772760585

It occurred to me that searching 196 .c files was a context window issue, but maybe there’s something else going on. Either way, Codex could behave better.

woadwarrior01 · 2026-03-05T20:26:42 1772742402

Please don't post links with tracking parameters (t=jQb...).

https://xcancel.com/cperciva/status/2029645027358495156

sillysaurusx · 2026-03-05T20:30:35 1772742635

Haha. This was the second time in like a year that I’ve posted a Twitter link, and the second time someone complained. Okay, I’ll try to remove those before posting, and I’ll edit this one out.

Feels like a losing battle, but hey, the audience is usually right.

woadwarrior01 · 2026-03-05T20:35:28 1772742928

I'm sorry, but it's my pet peeve. If you're on iOS/macOS I built a 100% free and privacy-friendly app to get rid of tracking parameters from hundreds of different websites, not just X/Twitter.

https://apps.apple.com/us/app/clean-links-qr-code-reader/id6...

monocularvision · 2026-03-05T21:12:10 1772745130

This is great! I have been meaning to implement this sort of thing in my existing Shortcuts flow but I see you already support it in Shortcuts! Thank you for this!

Anywhere I can toss a Tip for this free app?

woadwarrior01 · 2026-03-05T23:00:20 1772751620

I'm glad you like it. :)

sillysaurusx · 2026-03-05T20:38:28 1772743108

It works on iOS? That’s cool. I’ll give it a go.

pmarreck · 2026-03-05T21:05:21 1772744721

So what is your motivation for doing this, incidentally? Can you be explicit about it? I am genuinely curious.

Especially when it’s to the point of, you know, nagging/policing people to do it the way you’d prefer, when you could just redirect your router requests from x.com to xcancel.com

pnexk · 2026-03-06T00:28:53 1772756933

Helpful type of nagging for me. Most here would agree they are not a positive aspect of the modern digital experience, calling it out gently without hostility is not bad. It might not be quite self policing but some of that with good reason is not bad for healthy communities IMO.

woadwarrior01 · 2026-03-05T23:03:15 1772751795

It's not particularly about x.com, hundreds of sites like x, youtube, facebook, linkedin, tiktok etc surreptitiously add tracking parameters to their links. The iOS Messages app even hides these tracking parameters. I don't like being surreptitiously tracked online and judging by the success of my free app, there are millions of people like me.
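For anyone who wants to do this by hand, a minimal sketch of stripping tracking parameters with the Python standard library. The blocklist here is an assumption for illustration and is not how the app above works; real cleaners keep per-site rules, since a key like `t` is a tracker on X but a timestamp on YouTube.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed blocklist for illustration; a real tool would scope rules per site.
TRACKING_KEYS = {"t", "s", "si", "fbclid", "gclid", "igshid"}
TRACKING_PREFIXES = ("utm_",)


def strip_tracking(url: str) -> str:
    """Return the URL with known tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_KEYS and not key.startswith(TRACKING_PREFIXES)
    ]
    # Rebuild the URL with only the parameters that survived the filter.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking("https://example.com/watch?v=abc123&utm_source=hn&si=xyz"))
```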

pmarreck · 2026-03-06T00:40:29 1772757629

so, since these companies have to comply with removing PII, is the worst thing that could happen to me, that I get ads that are more likely to be interesting to me?

i’m not being facetious, honest question, especially considering ads are the only thing paying these people these days

akiselev · 2026-03-05T19:06:08 1772737568

> Curious to hear if people have use cases where they find 1M works much better!

Reverse engineering [1]. When decompiling a bunch of code and tracing functionality, it's really easy to fill up the context window with irrelevant noise and compaction generally causes it to lose the plot entirely and have to start almost from scratch.

(Side note, are there any OpenAI programs to get free tokens/Max to test this kind of stuff?)

[1] https://github.com/akiselev/ghidra-cli

simianwords · 2026-03-05T18:46:11 1772736371

Do you maybe want to give us users some hints on what to compact and throw away? In codex CLI maybe you can create a visual tool that I can see and quickly check mark things I want to discard.

Sometimes I’m exploring some topic and the exploration itself is not useful, only the summary is.

Also, you could use the best guess and cli could tell me that this is what it wants to compact and I can tweak its suggestion in natural language.

Context is going to be super important because it is the primary constraint. It would be nice to have serious granular support.

Someone1234 · 2026-03-05T19:52:42 1772740362

That's an interesting point regarding context Vs. compaction. If that's viewed as the best strategy, I'd hope we would see more tools around compaction than just "I'll compact what I want, brace yourselves" without warning.

Like, I'd love an optional pre-compaction step, "I need to compact, here is a high level list of my context + size, what should I junk?" Or similar.

thyb23 · 2026-03-05T20:18:34 1772741914

This is exactly how it should work. I imagine it as a tree view showing both full and summarized token counts at each level, so you can immediately see what’s taking up space and what you’d gain by compacting it.

The agent could pre-select what it thinks is worth keeping, but you’d still have full control to override it. Each chunk could have three states: drop it, keep a summarized version, or keep the full history.

That way you stay in control of both the context budget and the level of detail the agent operates with.
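A minimal sketch of what such a tree could look like as a data structure. Everything here (class names, fields, the sample chunks and their token counts) is hypothetical; no existing agent exposes this API as far as this thread shows.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    DROP = "drop"            # discard the chunk entirely
    SUMMARIZE = "summarize"  # keep only a summarized version
    KEEP = "keep"            # keep the full history


@dataclass
class ContextNode:
    """One chunk of session history (hypothetical structure)."""
    label: str
    full_tokens: int = 0          # this node's own tokens if kept verbatim
    summary_tokens: int = 0       # tokens if the whole subtree is summarized
    action: Action = Action.KEEP  # agent's pre-selection; the user may override
    children: list["ContextNode"] = field(default_factory=list)

    def total_full(self) -> int:
        """Tokens the subtree occupies with nothing compacted."""
        return self.full_tokens + sum(c.total_full() for c in self.children)

    def budget(self) -> int:
        """Tokens the subtree would occupy after applying the chosen actions."""
        if self.action is Action.DROP:
            return 0
        if self.action is Action.SUMMARIZE:
            return self.summary_tokens
        return self.full_tokens + sum(c.budget() for c in self.children)

    def render(self, depth: int = 0) -> list[str]:
        """Tree-view lines with full, summarized, and resulting token counts."""
        line = (f"{'  ' * depth}{self.label} [{self.action.value}] "
                f"full={self.total_full()} summary={self.summary_tokens} "
                f"after={self.budget()}")
        lines = [line]
        for child in self.children:
            lines.extend(child.render(depth + 1))
        return lines


# Made-up session: the agent pre-selects actions, the user can flip any of them.
session = ContextNode("session", children=[
    ContextNode("exploration", 120_000, 3_000, Action.SUMMARIZE),
    ContextNode("failed build logs", 40_000, 1_000, Action.DROP),
    ContextNode("current plan", 5_000, 800, Action.KEEP),
])
print("\n".join(session.render()))
```

Overriding a choice is just reassigning `action` on a node and re-rendering, which is what makes the gain from each compaction visible before committing to it.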

joquarky · 2026-03-05T22:51:38 1772751098

I compact myself by having it write out to a file, I prune what's no longer relevant, and then start a new session with that file.

But I'm mostly working on personal projects so my time is cheap.

I might experiment with having the file sections post-processed through a token counter though, that's a great idea.
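A sketch of that post-processing step, assuming the handoff file uses `## ` headings for its sections. The 4-characters-per-token estimate is a crude stand-in (an assumption, not a real tokenizer), so treat the numbers as relative sizes only.

```python
def approx_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (assumption, not a tokenizer)."""
    return max(1, len(text) // 4) if text else 0


def section_token_counts(notes: str) -> dict[str, int]:
    """Split notes on '## ' headings and estimate tokens per section."""
    sections: dict[str, list[str]] = {}
    current = "(preamble)"  # holds any text that appears before the first heading
    for line in notes.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(line)
    return {name: approx_tokens("\n".join(body)) for name, body in sections.items()}


# Made-up handoff file with one short and one bloated section.
notes = "## Goal\n" + "a" * 40 + "\n## Dead ends\n" + "b" * 400 + "\n"
for name, tokens in section_token_counts(notes).items():
    print(f"{tokens:>6}  {name}")
```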

Folcon · 2026-03-05T20:28:00 1772742480

I do find it really interesting that more coding agents don't have this as a toggleable feature, sometimes you really need this level of control to get useful capability

Someone1234 · 2026-03-05T20:50:28 1772743828

Yep; I've actually had entire jobs essentially fail due to a bad compaction. It lost key context, and it completely altered the trajectory.

I'm now more careful, using tracking files to try to keep it aligned, but more control over compaction regardless would be highly welcomed. You don't ALWAYS need that level of control, but when you do, you do.

joshvm · 2026-03-06T00:31:38 1772757098

Have you tried writing that as a skill? Compaction is just a prompt with a convenient UI to keep you in the same tab. There's no reason you can't ask the model to do that yourself and start a new conversation. You can look up Claude's /compact definition, for reference.

However, in some harnesses the model is given access to the old chat log/"memories", so you'd need a way to provide that. You could compromise by running /compact and pasting the output from your own summarizer (that you ran first, obviously).

nowittyusername · 2026-03-05T20:53:13 1772743993

Personally what I am more interested in is effective context window. I find that when using codex 5.2 high, I preferred to start compaction at around 50% of the context window because I noticed degradation at around that point. Though as of about a month ago that point is now below that which is great. Anyways, I feel that I will not be using that 1 million context at all in 5.4 but if the effective window is something like 400k context, that by itself is already a huge win. That means longer sessions before compaction and the agent can keep working on complex stuff for longer. But then there is the issue of intelligence of 5.4. If it's as good as 5.2 high I am a happy camper, I found 5.3 anything... lacking personally.

gck1 · 2026-03-06T01:31:42 1772760702

Not sure how accurate this is, but found contextarena benchmarks today when I had the same question.

It appears only gemini has actual context == effective context from these. Although, I wasn't able to test this in either gemini cli or antigravity with my pro subscription because, well, it appears nobody actually uses these tools at Google.

https://contextarena.ai/?showLabels=false

lubesGordi · 2026-03-05T21:55:54 1772747754

It's funny that the context window size is such a thing still. Like the whole LLM 'thing' is compression. Why can't we figure out some equally brilliant way of handling context besides just storing text somewhere and feeding it to the llm? RAG is the best attempt so far. We need something like a dynamic in flight llm/data structure being generated from the context that the agent can query as it goes.

le-mark · 2026-03-06T02:12:05 1772763125

That’s actually a pretty cool idea. When I think about my internal mental model of a codebase I’m working on it’s definitely a compacted lossy thing that evolves as I learn more.

asabla · 2026-03-05T21:43:40 1772747020

I really don't have any numbers to back this up. But it feels like the sweet spot is around ~500k context size. Anything larger than that, you usually have scoping issues, trying to do too much at the same time, or having issues with the quality of what's in the context at all.

For me, I would say speed (not just time to first token, but a complete generation) is more important than going for a larger context size.

gspetr · 2026-03-05T20:02:23 1772740943

I have found a bigger context window quite useful when trying to make sense of larger codebases. Generating documentation on how different components interact is better than nothing, especially if the code has poor test coverage.

I've also had it succeed in attempts to identify some non-trivial bugs that spanned multiple modules.

neom · 2026-03-06T00:58:43 1772758723

On Claude Code (sorry) the big context window is good for teams. On CC if you hit compact while a bunch of teams are working it's a total shit show after.

peterspath · 2026-03-06T04:48:18 1772772498

Grok has a 2M context window for most of their models.

For example their latest model `grok-4-1-fast-reasoning`:

- Context window: 2M

- Rate limits: 4M tokens per minute, 480 requests per minute

- Pricing: $0.20/M input $0.50/M output

Grok is not as good in coding as Claude for example. But for researching stuff it is incredible. While they have a model for coding now, did not try that one out yet.

https://docs.x.ai/developers/models

andai · 2026-03-05T20:28:49 1772742529

It's a little hard to compare, because Claude needs significantly fewer tokens for the same task. A better metric is the cost per task, which ends up being pretty similar.

For example on Artificial Analysis, the GPT-5.x models' cost to run the evals range from half of that of Claude Opus (at medium and high), to significantly more than the cost of Opus (at extra high reasoning). So on their cost graphs, GPT has a considerable distribution, and Opus sits right in the middle of that distribution.

The most striking graph to look at there is "Intelligence vs Output Tokens". When you account for that, I think the actual costs end up being quite similar.

According to the evals, at least, the GPT extra high matches Opus in intelligence, while costing more.

Of course, as always, benchmarks are mostly meaningless and you need to check Actual Real World Results For Your Specific Task!

For most of my tasks, the main thing a benchmark tells me is how overqualified the model is, i.e. how much I will be over-paying and over-waiting! (My classic example is, I gave the same task to Gemini 2.5 Flash and Gemini 2.5 Pro. Both did it to the same level of quality, but Pro took 3x longer and cost 3x more!)
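To make the cost-per-task point concrete, a small sketch using the list prices quoted upthread. The token counts are made up purely to show how a lower per-token price can be cancelled out by higher token usage; they are not measurements.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Token usage for one completed task (hypothetical numbers)."""
    input_tokens: int
    output_tokens: int
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens

    def cost(self) -> float:
        """Total USD cost of the run at the given per-million prices."""
        return (self.input_tokens * self.input_per_m
                + self.output_tokens * self.output_per_m) / 1_000_000


# Prices from this thread; token counts are invented for illustration.
gpt = Run(input_tokens=400_000, output_tokens=60_000, input_per_m=2.50, output_per_m=15.0)
opus = Run(input_tokens=250_000, output_tokens=30_000, input_per_m=5.00, output_per_m=25.0)

print(f"GPT-5.4 run:  ${gpt.cost():.2f}")
print(f"Opus 4.6 run: ${opus.cost():.2f}")
```

With these invented counts the cheaper-per-token model lands within a few cents of the pricier one, which is the shape of the argument above: compare the bill for the task, not the rate card.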

andai · 2026-03-05T23:23:34 1772753014

Looks like the same thing might apply to GPT-5.4 vs the previous GPTs:

>In the API, GPT‑5.4 is priced higher per token than GPT‑5.2 to reflect its improved capabilities, while its greater token efficiency helps reduce the total number of tokens required for many tasks.

I eagerly await the benchies on AA :)

netinstructions · 2026-03-05T19:36:57 1772739417

People (and also frustratingly LLMs) usually refer to https://openai.com/api/pricing/ which doesn't give the complete picture.

https://developers.openai.com/api/docs/pricing is what I always reference, and it explicitly shows that pricing ($2.50/M input, $15/M output) for tokens under 272k

It is nice that we get 70-72k more tokens before the price goes up (also what does it cost beyond 272k tokens??)

Flashtoo · 2026-03-05T20:12:48 1772741568

> Prompts with more than 272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.

netinstructions · 2026-03-05T20:49:01 1772743741

Thanks, it looks like the pricing page keeps getting updated.

Even right now one page refers to prices for "context lengths under 270K" whereas another has pricing for "<272K context length"

smusamashah · 2026-03-05T21:16:24 1772745384

Gemini already has 1M or 2M context window right?

luca-ctx · 2026-03-05T20:41:34 1772743294

Context rot is definitely still a problem but apparently it can be mitigated by doing RL on longer tasks that utilize more context. Recent Dario interview mentions this is part of Anthropic’s roadmap.

thehamkercat · 2026-03-05T18:27:37 1772735257

GPT 5.3 codex had 400K context window btw

AtreidesTyrant · 2026-03-05T20:40:04 1772743204

token rot exists for any context window at above 75% capacity, that's why so many have pushed for 1 mil windows

simianwords · 2026-03-05T18:34:00 1772735640

Why would someone use codex instead?

lmeyerov · 2026-03-05T20:39:00 1772743140

In our evals for answering cybersecurity incident investigation questions and even autonomously doing the full investigation, gpt-5.2-codex with low reasoning was the clear winner over non-codex or higher reasoning. 2X+ faster, higher completion rates, etc.

It was generally smarter than pre-5.2 so strategically better, and codex likewise wrote better database queries than non-codex, and as it needs to iteratively hunt down the answer, didn't run out the clock by drowning in reasoning.

Video: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t...

We'll be updating numbers on 5.3 and claude, but basically same thing there. Early, but we were surprised to see codex outperform opus here.

jeswin · 2026-03-05T19:10:11 1772737811

When it comes to lengthy non-trivial work, codex is much better but also slower.

surgical_fire · 2026-03-05T18:48:08 1772736488

I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer).

I find both Codex and Claude Opus perform at a similar level, and in some ways I actually prefer Codex (I keep hitting quota limits in Opus and have to revert back to Sonnet).

If your question is related to morality (the thing about US politics, DoD contract and so on)... I am not from the US, and I don't care about its internal politics. I also think both OpenAI and Anthropic are evil, and the world would be better if neither existed.

hnsr · 2026-03-05T21:08:32 1772744912

> I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer).

Exact same situation here. I've been using both extensively for the last month or so, but still don't really feel either of them is much better or worse. But I have not done large complex features with it yet, mostly just iterative work or small features.

I also feel I am probably being very (overly?) specific in my prompts compared to how other people around me use these agents, so maybe that 'masks' things

joquarky · 2026-03-05T23:05:04 1772751904

> overly specific

I have a hypothesis that people who have patience and reasonably well-developed written language skills will scratch their heads at why everyone else is having so much difficulty.

simianwords · 2026-03-05T18:49:21 1772736561

No my question was why would I use codex over gpt 5.4

surgical_fire · 2026-03-05T18:57:34 1772737054

Ahh, good question. I misunderstood you, apologies.

There's no mention of pricing, quotas and so on. Perhaps Codex will still be preferable for coding tasks as it is tailored for it? Maybe it is faster to respond?

Just speculation on my part. If it becomes redundant to 5.4, I presume it will be sunset. Or maybe they eventually release a Codex 5.4?

landtuna · 2026-03-05T19:34:31 1772739271

5.3 Codex is $1.75/$14, and 5.4 is $2.50/$15.

surgical_fire · 2026-03-05T21:24:25 1772745865

There you go. It makes perfect sense to keep it around then.

athrowaway3z · 2026-03-05T19:56:29 1772740589

They perform at a somewhat equal level on writing single files. But Codex is absolute garbage at theory of self/others. That quickly becomes frustrating.

I can tell claude to spawn a new coding agent, and it will understand what that is, what it should be told, and what it can approximately do.

Codex on the other hand will spawn an agent and then tell it to continue with the work. It knows a coding agent can do work, but doesn't know how you'd use it - or that it won't magically know a plan.

You could add more scaffolding to fix this, but Claude proves you shouldn't have to.

I suspect this is a deeper model "intelligence" difference between the two, but I hope 5.4 will surprise me.

surgical_fire · 2026-03-05T20:32:04 1772742724

> They perform at a somewhat equal level on writing single files.

That's not the experience I have. I had it do more complex changes spanning multiple files and it performed well.

I don't like using multiple agents though. I don't vibe code, I actually review every change it makes. The bottleneck is my review bandwidth, more agents producing more code will not speed me up (in fact it will slow me down, as I'll need to context switch more often).

synergy20 · 2026-03-05T21:02:50 1772744570

in my testing codex actually planned worse than claude but coded better once the plan is set, and faster. it is also excellent to cross check claude's work, always finding great weakness each time.

pmarreck · 2026-03-05T21:08:42 1772744922

That’s why I think the sweet spot is to write up plans with Claude and then execute them with Codex

GorbachevyChase · 2026-03-05T22:17:57 1772749077

Weird. It used to be the opposite. My own experience is that Claude’s behind-the-scenes support is a differentiator for supporting office work. It handles documents, spreadsheets and such much better than anyone else (presumably with server side scripts). Codex feels a bit smarter, but it inserts a lot of checkpoints to keep from running too long. Claude will run a plan to the end, but the token limits have become so small in the last couple months that the $20 plan basically only buys one significant task per day. The iOS app is what makes me keep the subscription.

joquarky · 2026-03-05T22:59:47 1772751587

And it fits well with the $20 plans for each since Codex seems to provide about 7-8x more usage than Claude.

embedding-shape · 2026-03-05T18:40:01 1772736001

Why would someone use Claude Code instead? Or any other harness? Or why only use one?

My own tooling throws off requests to multiple agents at the same time, then I compare which one is best, and continue from there. Most of the time Codex ends up with the best end results though, but my hunch is that at one point that'll change, hence I continue using multiple at the same time.
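The commenter's tooling isn't shown, so the following is only a hypothetical sketch of the fan-out-and-compare pattern. The agents are stand-in functions, not real Codex or Claude clients, and the `len` scoring rule is a placeholder for the manual comparison described above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Agent = Callable[[str], str]  # takes a prompt, returns the agent's answer


def fan_out(prompt: str, agents: dict[str, Agent]) -> dict[str, str]:
    """Send the same prompt to every agent concurrently and collect the replies."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {name: pool.submit(agent, prompt) for name, agent in agents.items()}
        return {name: future.result() for name, future in futures.items()}


def pick_best(results: dict[str, str], score: Callable[[str], float]) -> str:
    """Return the name of the agent whose reply scores highest."""
    return max(results, key=lambda name: score(results[name]))


# Stand-in agents; a real setup would shell out to each CLI or call its API.
agents: dict[str, Agent] = {
    "codex": lambda p: f"codex patch for: {p}",
    "claude": lambda p: f"claude patch for: {p} (with tests)",
}

results = fan_out("fix the flaky test", agents)
best = pick_best(results, score=len)  # placeholder scoring, not a quality metric
print(best, "->", results[best])
```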

paulddraper · 2026-03-05T20:30:59 1772742659

I don’t know about 5.4 specifically, but in the past anything over 200k wasn’t that great anyway.

Like, if you really don’t want to spend any effort trimming it down, sure use 1m.

Otherwise, 1m is an anti pattern.

karmasimida · 2026-03-06T04:51:59 1772772719

This is definitely the Claude killer OpenAI is cooking.

And so far it has succeeded

Philip-J-Fry · 2026-03-05T21:17:21 1772745441

I find it quite funny how this blog post has a big "Ask ChatGPT" box at the bottom. So you might think you could ask a question about the contents of the blog post, so you type the text "summarise this blog post". And it opens a new chat window with the link to the blog post followed by "summarise this blog post". Only to be told "I can't access external URLs directly, but if you can paste the relevant text or describe the content you're interested in from the page, I can help you summarize it. Feel free to share!"

That's hilarious. Does OpenAI even know this doesn't work?

andrewguenther · 2026-03-05T22:35:39 1772750139

It looks like this doesn't work for users without accounts? It works when I'm logged in, but not logged out. I went ahead and reported it to the team. Thanks for letting us know!

dotancohen · 2026-03-06T02:44:09 1772765049

No integration test for guest (non-logged in) users?

Hahaha who am I kidding. No integration tests for anybody!

Rohunyyy · 2026-03-06T03:22:44 1772767364

SDET here. A year ago when AI came into play SDET/QA roles started disappearing. People were like oh ya anyone can write tests. Then with the recent fiascos about outages and what not, I am seeing the SDE roles are disappearing and SDET roles are going back up?! Apparently AI is good at writing applications but you still need someone to make sure it is doing the right things.

DrewADesign · 2026-03-06T03:52:41 1772769161

It’s not really good at writing the software either — it’s a moderate to decent productivity booster in an uneven, difficult-to-predict assortment of tasks. Companies are just starting to exit the “we’re still trying to figure this out” grace period. Expect more of that as soon as these chatbot companies have to start charging enough to pull in more money than they spend. I foresee some purpose-built models that are pretty lean being much more useful in long run. It’s neat that the bot which can one-shot a simple CRUD website for you can also crank out Scrubs-based erotic fan fiction novellas by the dozen but I don’t foresee that being a sustainable business model. Having good purpose-built tools is, in my opinion, better than some unwieldy tool that can do a whole bunch of shit I don’t need it to.

dotancohen · 2026-03-06T03:55:49 1772769349

Interestingly, the first real productive use of AI that I found was writing the unit tests and integration tests for my applications. It was much better at thinking about corner cases than I was.

democracy · 2026-03-06T03:09:18 1772766558

integration tests? so last century....

ulfw · 2026-03-06T03:38:13 1772768293

But but but but I thought AI would do this magically for all of us, no?

No more need for pesky humans, no?

baxtr · 2026-03-05T22:40:46 1772750446

I picked up Claude today after being away and using only ChatGPT and Gemini for a while.

I was pretty impressed with how they’ve improved user experience. If I had to guess, I’d say Anthropic has better product people who put more attention to detail in these areas.

gizmodo59 · 2026-03-06T00:04:18 1772755458

ChatGPT has given more for my 20$ than any other vendor. And that’s not even considering codex which is so good and the limits are much much higher

manojlds · 2026-03-06T03:15:03 1772766903

How is that relevant? Also, when you are behind you do give more usage

bwat49 · 2026-03-06T01:19:55 1772759995

yeah claude is great... but only if you pay $100-$200 a month

beefsack · 2026-03-06T03:27:29 1772767649

Many people buy two separate Claude pro subscriptions and that makes the limit become a non-issue. It works surprisingly well when you tend to hit the 5 hourly limit after a few hours, and hit the weekly limit after 4-5 days. $40 vs $100 is significant for a lot of people.

smartbit · 2026-03-06T04:03:40 1772769820

Thanks for the tip, didn’t think of using 2 subscriptions at the same company.

When I reach a limit, I switch to GLM 4.7 as part of a GLM Coding Lite subscription that was offered at the end of 2025 for $28/year. I also use it for compaction and the like to save tokens.

nerdsniper · 2026-03-06T04:06:32 1772769992

To be honest it feels very worth my $200/mo. And I “only” make $80k/year. I used to have two ChatGPT subs but Claude is just so much better.

triage8004 · 2026-03-06T01:52:13 1772761933

They are all losing money on probably all levels of the packages if you max them out

abustamam · 2026-03-05T23:32:45 1772753565

I agree! I recently migrated from ChatGPT to Claude and it is just superior in every way. It doesn't blather on and then at the end ask me for clarification. It's succinct and clarifies vital information before providing a solution.

vostrocity · 2026-03-06T02:22:27 1772763747

Voice input is still far less accurate than OpenAI's unfortunately, otherwise I would have already switched.

beachy · 2026-03-06T00:53:43 1772758423

I held off migrating from ChatGPT to Claude Code due to being a laggard that lived in the Eclipse world. I didn't believe what I was told that I wouldn't be writing code any more. Pushed into action by recent PR gaslighting from OpenAI, I jumped to claude code and they were right - I barely venture into the IDE now and certainly don't need an integration.

hamasho · 2026-03-06T02:43:27 1772765007

I agree, but in general those chat apps have relatively bad user experiences for a multibillion-dollar B2C company. I used to have a lot of surprises and frustrations while using Claude Code / Desktop, and still encounter issues, but it's the best among the major LLM services.

majormajor · 2026-03-06T03:16:52 1772767012

It's funny cause, you know, fixing all those little nitty gritty things should be practically automatic with their own offerings... have your agent put in a lot of instrumentation... have it chase down bugs or dead-end user-journeys... have it go make the changes to fix it...

I've seen these tools work for this kinda stuff sometimes... you'd think nobody would be better at it than the creators of the tools.

sreekanth850 · 2026-03-06T02:31:05 1772764265

True. Every time I asked GPT something, it used to spit out long stories. Claude and Gemini are always straight to the point.

twelvedogs · 2026-03-06T02:34:18 1772764458

I bullied it into giving me concise answers, now it starts every answer with "just quickly" or something similar but it gets straight to the point

ElijahLynn · 2026-03-05T22:27:04 1772749624

fwiw: I get a valid response when following the steps you mentioned. I do not get the message you mentioned:

https://chatgpt.com/share/69aa0321-8a9c-8011-8391-22861784e8...

EDIT: oh, but I'm logged in, fwiw

zamadatix · 2026-03-05T22:03:55 1772748235

Following this process summarizes the blogpost for me. Perhaps the difference is I'm signed into my account so it can access external URLs or something of that nature?

beambot · 2026-03-06T02:08:08 1772762888

It's like opening copilot in a word doc and it telling you it can't see the document in its context

reval · 2026-03-06T02:23:31 1772763811

This is infuriating. However, for those in this situation, know this: it works if the document or spreadsheet is in OneDrive. I just wish Copilot told you this instead of asking you to upload the doc.

amelius · 2026-03-05T22:43:23 1772750603

If only they had an LLM they could use as a software testing agent.

kennywinker · 2026-03-06T03:04:37 1772766277

I think you might have hit on the issue - just the wrong way around. I would assume they’re using LLMs for testing, and no humans or maybe just one overworked human, and that is the problem

judge2020 · 2026-03-05T21:38:53 1772746733

Works for me: https://rr.judge.sh/Labradorretriever/d6af05/chrome_j9rXJMlf...

netdur · 2026-03-05T23:19:23 1772752763

Did it complain about copyright issues?

mempko · 2026-03-06T03:38:16 1772768296

vibe coded. But vibes are off

Aurornis · 2026-03-05T21:30:28 1772746228

Probably intentional. They don't want open, no-registration endpoints able to trigger the AI into hitting URLs.

jazzypants · 2026-03-05T21:38:16 1772746696

But, why include the non-functional chat box in the article?

embedding-shape · 2026-03-05T21:45:42 1772747142

Different team "manages" the overall blog than the team who wrote that specific article. At one point, maybe it made sense, then something in the product changed, team that manages the blog never tested it again.

Or, people just stopped thinking about any sort of UX. These sort of mistakes are all over the place, on literally all web properties, some UX flows just ends with you at a page where nothing works sometimes. Everything is just perpetually "a bit broken" seemingly everywhere I go, not specific to OpenAI or even the internet.

colonCapitalDee · 2026-03-05T22:01:44 1772748104

That's why it happened. It still shouldn't have happened.

ethbr1 · 2026-03-05T22:31:40 1772749900

> Or, people just stopped thinking about any sort of UX. These sort of mistakes are all over the place, on literally all web properties, some UX flows just ends with you at a page where nothing works sometimes.

It's almost like people are vibe coding their web apps or something.

teaearlgraycold · 2026-03-05T21:58:25 1772747905

If only there was some kind of way to automatically test user flows end to end. Perhaps testing could be evaluated periodically, or even run for each code change.

koakuma-chan · 2026-03-05T22:11:08 1772748668

There is no business value in doing that.

teaearlgraycold · 2026-03-05T23:44:17 1772754257

There most certainly is, but maybe the time spent on it could be better allocated to something else.

koakuma-chan · 2026-03-06T00:09:10 1772755750

Yeah, like adding more features.

observationist · 2026-03-05T21:43:56 1772747036

They're having service issues - ChatGPT on the web is broken for a lot of people. The app is working on Android - I'd assume that the rollout hit a hitch and the chatbox in the article would normally work.

jdndbdjsj · 2026-03-05T21:46:52 1772747212

Welcome to a big company

AirGapWorksAI · 2026-03-05T22:25:23 1772749523

Welcome to a big company where pretty much everyone has been working full steam for years, in order to take advantage of having a job at a company during a once-in-a-lifetime moment.

m3kw9 · 2026-03-05T21:51:36 1772747496

what? it's their own site and own llm. I could paste most sites and it would work.

peab · 2026-03-05T23:32:16 1772753536

LOL - yes Sam, AGI is near indeed. (sarcasm)

pocksuppet · 2026-03-05T22:25:56 1772749556

Most AI integration is like this. It's not about building working products --- it's about bragging that you put a chatbox in your program.

bartread · 2026-03-05T23:17:25 1772752645

This is such a stale take. In the past 3 years I’ve worked on multiple products with AI at their core, not as some add-on. Just because the corpo-land dullards[0] can’t execute on anything more complex than shoehorning a chatbot into their offerings doesn’t mean there aren’t plenty of people and companies doing far more interesting things.

[0] In this case, and with heavy irony, including OpenAI, although it sounds like most of this particular snafu is due to a bug.

saghm · 2026-03-06T00:11:15 1772755875

> Most AI integration is like this.

>> This is such a stale take. In the past 3 years I’ve worked on multiple products with AI at their core, not as some add-on. Just because the corpo-land dullards[0] can’t execute on anything more complex than shoehorning a chatbot into their offerings doesn’t mean there aren’t plenty of people and companies doing far more interesting things.

I feel like this is just a disagreement of what "AI integration" means. You seem to agree that the trend they're describing exists, but it sounds like you're creating new products, not "integrating" it into existing ones.

abustamam · 2026-03-05T23:34:05 1772753645

Kinda reminds me of crypto. There are certainly very interesting things happening in the crypto space. But the most visible parts of the crypto universe are the stupid parts (buying PNGs for millions, for example)

thereticent · 2026-03-06T00:53:22 1772758402

Genuinely curious, not being combative...what very interesting things have happened in the crypto space lately?

abustamam · 2026-03-06T01:22:43 1772760163

Oh, I dunno about lately (though I did stumble upon https://a16zcrypto.com/posts/article/big-ideas-things-excite... )

But when I was in the crypto space in 2018, there were a lot of interesting things happening in the smart contract world (like proofs of concept of issuing NFTs as a digital "deed" to a physical asset like a house).

I don't think any of those novel ideas went anywhere, but it was a fun time to be experimenting.

LordDragonfang · 2026-03-05T23:33:37 1772753617

I mean, to be fair, both things can be technically true. There can be lots of interesting things being done, even while most can be low-effort garbage.

But this is just Sturgeon's Law (ninety percent of everything is crap), not an actually insightful addition to the discussion, and I very much agree it's a stale take.

creamyhorror · 2026-03-05T19:48:48 1772740128

I've only used 5.4 for 1 prompt (edit: 3@high now) so far (reasoning: extra high, took really long), and it was to analyse my codebase and write an evaluation on a topic. But I found its writing and analysis thoughtful, precise, and surprisingly clearly written, unlike 5.3-Codex. It feels very lucid and uses human phrasing.

It might be my AGENTS.md requiring clearer, simpler language, but at least 5.4's doing a good job of following the guidelines. 5.3-Codex wasn't so great at simple, clear writing.

joegibbs · 2026-03-06T01:24:45 1772760285

The weird phrasing was my biggest gripe with 5.3 so I'm glad they've fixed that up. It couldn't say anything without a heap of impenetrable jargon and it was obsessed with the word "drive". Nothing could cause anything, it had to be "driven".

torginus · 2026-03-05T23:23:43 1772753023

Honestly, while I'd like to believe you, there's always a post about how $MODEL+1 delivered powerful insights about the very nature of the universe in precise Hegelian dialectic, while $MODEL's output was indistinguishable from a pack of screeching sexually frustrated bonobos

sampton · 2026-03-05T21:52:40 1772747560

That's been my experience as well switching from Opus to Codex. Reasoning takes longer but answers are precise. Claude is sloppy in comparison.

solenoid0937 · 2026-03-05T22:37:52 1772750272

Weird, I have had the opposite experience. Codex is good at doing precisely what I tell it to do, Opus suggests well thought out plans even if it needs to push back to do it.

slopinthebag · 2026-03-06T01:00:38 1772758838

This is just the stochastic nature of LLMs at play. I think all of the SOTA models are roughly equivalent, but without enough samples people end up reading into it too much.

throwaway911282 · 2026-03-05T22:09:44 1772748584

codex has been really good so far and the fast mode is cherry on top! and the very generous limits is another cherry on top

slopinthebag · 2026-03-06T01:02:25 1772758945

It's well worth the $20 to not deal with any limits and have it handle all the boilerplate repetitive BS us programmers seem forced to deal with. I think 80% of the benefit comes from spending that $20 (20%? :P) and just having it do the lame shit that we probably shouldn't have to do but somehow need to.

dana321 · 2026-03-05T23:43:49 1772754229

5.4 very high didn't notice in my codebase a glaring issue that drops all data being sent around the network.

irishcoffee · 2026-03-05T21:19:56 1772745596

> It might be my AGENTS.md requiring clearer, simpler language

If you gave the exact same markdown file to me and I posted the exact same prompts as you, would I get the same results?

creamyhorror · 2026-03-05T22:19:50 1772749190

I'm not sure if the model (under its temperature/other settings) produces deterministic responses. But I do think models' style and phrasing are fairly changeable via AGENTS.md-style guidelines.

5.4's choice of terms and phrasing is very precise and unambiguous to me, whereas 5.3-Codex often uses jargon and less precise phrases that I have to ask further about or demand fuller explanations for via AGENTS.md.

irishcoffee · 2026-03-05T22:33:20 1772750000

So sharing markdown files is functionally useless, or no?

m3kw9 · 2026-03-05T21:55:22 1772747722

you probably can't, and asking agents.md to "make it clearer" will likely give you the illusion of clearer language without actual well-structured tests. agents.md is usually for changing what the llm should focus on so that it suits you better, not for saying stuff like "be better" or "make no mistakes"

pembrook · 2026-03-05T22:40:26 1772750426

The latest research these days is that including an AGENTS.md file only makes outcomes worse with frontier models.

solarkraft · 2026-03-05T23:04:49 1772751889

From what I remember, this was for describing the project’s structure over letting the model discover it itself, no?

Because how else are you going to teach it your preferred style and behavior?

FINDarkside · 2026-03-05T23:10:21 1772752221

I wouldn't draw such conclusions from one preprint paper. Especially since they measured only success rate, while quite often AGENTS.md exists to improve code quality, which wasn't measured. And even then, the paper concluded that human written AGENTS.md raised success rates.

joquarky · 2026-03-05T23:21:43 1772752903

I still find it valuable.

AGENTS.md is for top-priority rules and to mitigate mistakes that it makes frequently.

For example:

- Read `docs/CodeStyle.md` before writing or reviewing code

- Ignore all directories named `_archive` and their contents

- Documentation hub: `docs/README.md`

- Ask for clarifications whenever needed

I think what that "latest research" was saying is essentially don't have them create documents of stuff it can already automatically discover. For example the product of `/init` is completely derived from what is already there.

There is some value in repetition though. If I want to decrease token usage due to the same project exploration that happens in every new session, I use the doc hub pattern for more efficient progressive discovery.

netcraft · 2026-03-05T23:10:27 1772752227

I think it's understandable that you took that from the click-bait all over youtube and twitter, but I don't believe the research actually supports that at all, and neither does my experience.

You shouldn't put things in AGENTS.md that it could discover on its own, and you shouldn't make it any larger than it has to be, but you should use it to tell it things it couldn't discover on its own, including basically a system prompt of instructions you want it to know about and always follow. You don't really have any other way to do those things besides telling it every time manually.

pizlonator · 2026-03-06T00:34:31 1772757271

FWIW, I haven't been using AGENTS.md recently - instead letting the model explore the codebase as needed.

Works great

madeofpalk · 2026-03-05T22:54:26 1772751266

:(

how can i get claude to always make sure it prettier-s and lints changes before pushing up the pr though?

mckirk · 2026-03-05T23:07:36 1772752056

I think what that research found is that _auto-generated_ agent instructions made results slightly worse, but human-written ones made them slightly better, presumably because anything the model could auto-generate, it could also find out in-context.

But especially for conventions that would be difficult to pick up on in-context, these instruction files absolutely make sense. (Though it might be worth it to split them into multiple sub-files the model only reads when it needs that specific workflow.)

JofArnold · 2026-03-05T23:09:39 1772752179

Run prettier etc in a hook.

emsimot · 2026-03-05T23:10:21 1772752221

Git hooks
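
A minimal sketch of such a hook, written in Python so it can be saved as `.git/hooks/pre-push` and made executable. The `npx prettier --check .` and `npx eslint .` commands are assumptions about the project's tooling; swap in whatever the repo actually uses. Git aborts the push when the hook exits with a non-zero status.

```python
#!/usr/bin/env python3
"""Sketch of a .git/hooks/pre-push script: run checks, block the push on failure."""
import subprocess
import sys

# Assumed commands; replace with the project's real formatter and linter.
CHECKS = [
    ["npx", "prettier", "--check", "."],
    ["npx", "eslint", "."],
]


def run_checks(checks):
    """Run each command in order; return the first non-zero exit code, else 0."""
    for cmd in checks:
        try:
            result = subprocess.run(cmd)
        except FileNotFoundError:
            print(f"pre-push: command not found: {cmd[0]}", file=sys.stderr)
            return 127
        if result.returncode != 0:
            print(f"pre-push: {' '.join(cmd)} failed", file=sys.stderr)
            return result.returncode
    return 0


# In the hook file itself, finish with:
#     sys.exit(run_checks(CHECKS))
```

Since the hook runs no matter who (or what) invokes `git push`, it does not depend on the agent remembering an instruction.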

slopinthebag · 2026-03-06T00:49:41 1772758181

> do nothing because can't be arsed

> somehow is the optimal strategy

My strategy of not spending an ounce of effort learning how to use AI beyond installing the Codex desktop app and telling it what to do keeps paying off lol.

Alifatisk · 2026-03-05T21:21:50 1772745710

So let me get this straight, OpenAI previously had an issue with LOTS of different models and versions being available. Then they solved this by introducing GPT-5 which was more like a router that put all these models under the hood so you only had to prompt to GPT-5, and it would route to the best suitable model. This worked great I assume and made the UI for the user comprehensible. But now, they are starting to introduce more different models again?

We got:

- GPT-5.1

- GPT-5.2 Thinking

- GPT-5.3 (codex)

- GPT-5.3 Instant

- GPT-5.4 Thinking

- GPT-5.4 Pro

Who’s to blame for this ridiculous path they are taking? I’m so glad I am not a Chat user, because this adds so much unnecessary cognitive load.

The good news here is the support for 1M context window, finally it has caught up to Gemini.

applfanboysbgon · 2026-03-06T04:06:08 1772769968

The real problem that OpenAI had was that their model naming was completely incomprehensible. 4.5, o3, 4o, 4.1 which is newer than 4.5. It was a complete clusterfuck. The blowback on that issue seems to have led them to misidentify the issue, but nobody was really asking for a single router model. Having a number of sequentially numbered and clearly labelled models is not actually a problem.

weird-eye-issue · 2026-03-06T04:20:35 1772770835

> I’m so glad I am not a Chat user, because this adds so much unnecessary cognitive load.

Yeah having Auto selected is really destroying my cognitive load...

sothatsit · 2026-03-05T22:50:37 1772751037

I much prefer this, we can choose based on our use-cases, and people who don’t care can still use Auto.

361994752 · 2026-03-05T21:32:40 1772746360

i guess you still have the "auto" as an option to route your request

wilg · 2026-03-06T00:57:58 1772758678

Well, they have older ones of course. But the current options actual users see is "Auto" or "Instant (5.3)" or "Thinking (5.4)". Not that complicated really.

stainablesteel · 2026-03-05T22:06:53 1772748413

5 itself might have solved the problem of having too many different models somewhere in the backend

__jl__ · 2026-03-05T20:54:36 1772744076

What a model mess!

OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4. Their version numbers jump across different model lines, with codex at 5.3 and what they now call instant also at 5.3.

Anthropic are really the only ones who managed to get this under control: Three models, priced at three different levels. New models are immediately available everywhere.

Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero insurances that the model doesn't get discontinued within weeks.

strongpigeon · 2026-03-05T21:00:06 1772744406

> Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero insurances that the model doesn't get discontinued within weeks.

What's funny is that there is this common meme at Google: you can either use the old, unmaintained tool that's used everywhere, or the new beta tool that doesn't quite do what you want.

Not quite the same, but it did remind me of it.

fhrow4484 · 2026-03-05T21:06:31 1772744791

https://static0.anpoimages.com/wordpress/wp-content/uploads/...

CactusBlue · 2026-03-05T22:06:23 1772748383

Reminds me of Unity features

tymscar · 2026-03-06T00:58:07 1772758687

I still remember the massive shift to SDRP and HDRP. Honestly, now in retrospect, almost a decade later, I think it was clearly done wrong. It was a mess, and switching over was a multi-week procedure for anything more than a hello world program, and what you got in return wasn’t something that looked better, just something that had the potential to.

Similar story with the whole networking stack. I haven’t used Unity in years now after it being my main work environment for years, but the sour taste it left in my mouth by moving everything that worked in the engine into plugins that barely worked will forever remain there.

I’m sure it’s partly a skill issue

fireant · 2026-03-06T03:12:21 1772766741

Don't forget that some of the new features are mutually incompatible. For example, a couple of years ago you couldn't use the "new ui system" with the "new input system" even when both were advertised as ready/almost ready

yieldcrv · 2026-03-05T21:53:11 1772747591

Preview Road (only choice, and last preview was deprecated without warning)

goodmythical · 2026-03-05T23:15:02 1772752502

where's my nightly road?

Who knows, I might arrive before I depart.

peab · 2026-03-05T23:33:24 1772753604

such a great meme

madeofpalk · 2026-03-05T22:52:32 1772751152

oh is this about my workplace?

L-four · 2026-03-05T21:15:32 1772745332

Gmail was in beta for 5 years, until 2009.

kfse · 2026-03-06T04:00:51 1772769651

Until it had backup storage. Which ended up being useful in 2011 when tens of thousands of mailboxes were deleted due to a software bug and needed to be recovered from tape...

jsemrau · 2026-03-06T02:42:18 1772764938

It was a different company back then. The Internet was still new-ish and not the multi-trillion dollar company it is now. I'd think expectations are different.

metalliqaz · 2026-03-05T21:43:58 1772747038

"Gemini, translate 'beta' from Googlespeak to English."

"Ok, here is the translation:"

    'we don't want to offer support'

solarkraft · 2026-03-05T21:58:47 1772747927

Just like any Google product then.

cyanydeez · 2026-03-05T21:50:29 1772747429

Nah, it's "We don't want to provide a consistent model that we'll be stuck with supporting for a decade because it just takes up space; until we run everyone out of business, we can't afford to have customers tying their systems to any given model"

Really, the economics makes no sense, but that's what they're doing. You can't have a consistent model because it'll pin their hardware & software, and that costs money.

msikora · 2026-03-06T00:04:33 1772755473

I have a service that relies on NanoBanana Pro, but the availability has been so atrocious that we just might go back to OpenAI.

m_fayer · 2026-03-05T21:20:57 1772745657

My 5ish years in the mines of Android native back in the day are not years I recall fondly. Never change, Google.

jakub_g · 2026-03-05T21:05:41 1772744741

"Everything is beta or deprecated."

cyanydeez · 2026-03-05T21:47:38 1772747258

The business models of LLMs don't include any guarantee, and somehow that's fine for a burgeoning decade of trillions of dollars of consumption.

Sure, makes total sense guys.

Aurornis · 2026-03-05T21:29:12 1772746152

> What a model mess! OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4.

I don't know, this feels unnecessarily nitpicky to me

It isn't hard to understand that 5.4 > 5.2 > 5.1. It's not hard to understand that the dash-variants have unique properties that you want to look up before selecting.

Especially for a target audience of software engineers skipping a version number is a common occurrence and never questioned.

IgorPartola · 2026-03-06T01:23:00 1772760180

The issue isn’t 5.4 > 5.2 etc. It is that there is a second dimension which is the model size and a third dimension which is what it is tuned for. And when you are releasing so quickly that your flagship instant mini model is on one numerical version but your flagship tool calling mini model is on another, it is confusing trying to figure out which actual model you want for your use case.

It’s not impossible to figure out but it is a symptom of them releasing as quickly as possible to try to dominate the news and mindshare.

Melatonic · 2026-03-05T22:09:24 1772748564

Agreed - and it's a huge step up from their previous naming schemes. That stuff was confusing as hell

__jl__ · 2026-03-05T22:27:46 1772749666

I see your point. I do find Anthropic's approach more clean though particularly when you add in mini and nano. That makes 5 models priced differently. Some share the same core name, others don't: gpt 5 nano, gpt 5 mini, gpt 5.1, gpt 5.2, gpt 5.4. And we are not even talking about thinking budget.

But generally: These are not consumer facing products and I agree that someone who uses the API should be able to figure out the price point of different models.

Reebz · 2026-03-06T02:25:03 1772763903

I don’t agree that it’s a nitpick - it’s a fundamental communication tool to users that describes capabilities and costs. Versioning is not the problem, but it amplifies the mess.

To be more direct on the point: Anthropic has nailed that Opus > Sonnet > Haiku.

com2kid · 2026-03-06T02:46:39 1772765199

> To be more direct on the point: Anthropic has nailed that Opus > Sonnet > Haiku.

Holy cow I never realized and I had to keep checking which model was which, I never had managed to remember which model was which size before because I never realized there was a theme with the names!

fnordpiglet · 2026-03-06T04:26:24 1772771184

5.4 is the one fine tuned for autonomous mass murder, automated surveillance state, and money grabs at any cost. It’s really hard to lump that into the others as it’s a fairly unique and specialized feature set. You can’t really call it that tho so they have to use the numbers.

I’m pretty glad I’m out of the OpenAI ecosystem in all seriousness. It is genuinely a mess. This marketing page is also just literally all over the place and could probably be about 20% of its size.

jbonatakis · 2026-03-05T22:26:42 1772749602

Google is already sending notices that the 2.5 models will be deprecated soon while all the 3.x models are in preview. It really is wild and peak Google.

abrookewood · 2026-03-06T03:07:51 1772766471

Public Service Announcement!! I don't know why the hell Google does this, but when they deprecate a model, the error you will see is a Rate Limit error. This has caught me out before and it is super annoying.

weird-eye-issue · 2026-03-06T04:16:04 1772770564

Do you mean when they remove a model you get that error? Because deprecation means it will be removed in the future but you can still use it

boringg · 2026-03-05T22:47:18 1772750838

Like building on quicksand for dependencies. I guess though the argument is that the foundation gets stronger over time

bethekidyouwant · 2026-03-05T23:11:37 1772752297

What dependency could possibly be tied to a non-deterministic AI model? Just include the latest one at your price point.

npn · 2026-03-06T04:15:39 1772770539

the problem is the price point is increasing sharply every time.

gemini 2 flash lite was $0.3 per 1Mtok output, gemini 2.5 flash lite is $0.4 per 1Mtok output, guess the pricing for gemini 3 flash lite now.

yes, you guessed it right, it is $1.5 per 1Mtok output. you can easily guess that because google did the same thing before: gemini 2 flash was $0.4, then with 2.5 flash it jumped to $2.5.

and that is only the base price, in reality newer models are all thinking models, so it costs even more tokens for the same task.

at some point it stops being viable to use the gemini api for anything.

and they don't even keep the old models for long.
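
to put the jumps as multipliers, here is a small sketch. it only uses the per-1M-output-token figures quoted in this comment (not checked against google's price list), and the variable names are illustrative.

```python
# Output prices in USD per 1M tokens, as quoted in the comment above (unverified).
flash_lite = {"2": 0.3, "2.5": 0.4, "3": 1.5}
flash = {"2": 0.4, "2.5": 2.5}


def multipliers(prices):
    """Return (older, newer, newer_price / older_price) for consecutive versions."""
    versions = list(prices)  # dicts keep insertion order, so this is oldest first
    return [
        (old, new, prices[new] / prices[old])
        for old, new in zip(versions, versions[1:])
    ]


for name, prices in (("flash lite", flash_lite), ("flash", flash)):
    for old, new, factor in multipliers(prices):
        print(f"{name} {old} -> {new}: {factor:.2f}x")
```

on those figures, flash lite went up about 1.33x and then 3.75x, and flash went up 6.25x in one step.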

jbonatakis · 2026-03-05T23:18:57 1772752737

Well it’s not even performance (define that however you will), but behavior is definitely different model to model. So while whatever new model is released might get billed as an improvement, changing models can actually meaningfully impact the behavior of any app built on top of it.

deaux · 2026-03-06T03:39:23 1772768363

There's a whole universe of tasks that aren't "fix a Github issue" or even related to coding in the slightest. A large number of those tasks doesn't necessarily get better with model updates. In many cases, the performance is similar but with different behavior so you have to rewrite prompts to get the same. In some cases the performance is just worse. Model updates usually only really guarantee to be better at coding, and maybe image understanding.

0xbadcafebee · 2026-03-05T21:19:45 1772745585

> or have zero insurances that the model doesn't get discontinued within weeks

Why are you using the same model after a month? Every month a better model comes out. They are all accessible via the same API. You can pay per-token. This is the first time in, like, all of technology history, that a useful paid service is so interoperable between providers that switching is as easy as changing a URL.

phainopepla2 · 2026-03-05T21:52:11 1772747531

If you're trying to use LLMs in an enterprise context, you would understand. Switching models sometimes requires tweaking prompts. That can be a complete mess, when there are dozens or hundreds of prompts you have to test.

mr-pink · 2026-03-06T00:12:24 1772755944

sounds like job security. be careful what you wish for before you get automated

bethekidyouwant · 2026-03-05T23:13:20 1772752400

This sounds made up, much like “prompt engineering”. Let’s hear an actual example.

Koffiepoeder · 2026-03-06T00:50:00 1772758200

We have an OCR job running with a lot of domain specific knowledge. After testing different models we have clear results that some prompts are more effective with some models, and also some general observations (eg, some prompts performed badly across all models).

Sample size was 1000 jobs per prompt/model. We run them once per month to detect regression as well.
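
A rough sketch of what a monthly regression check of this kind could look like. Everything here is hypothetical: the `Result` record, the 2-percentage-point `max_drop` threshold, and the idea of comparing against last month's accuracies are illustrative assumptions, not the actual setup described above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    prompt_id: str
    model: str
    correct: bool  # did the job's output match the expected answer?


def accuracy_by_pair(results):
    """Return {(prompt_id, model): fraction of correct jobs}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        key = (r.prompt_id, r.model)
        totals[key] += 1
        hits[key] += r.correct  # True counts as 1, False as 0
    return {key: hits[key] / totals[key] for key in totals}


def regressions(baseline, current, max_drop=0.02):
    """List (prompt, model) pairs whose accuracy fell by more than max_drop."""
    return sorted(
        key
        for key, acc in current.items()
        if key in baseline and baseline[key] - acc > max_drop
    )
```

Feeding it last month's accuracies as `baseline` and this month's `accuracy_by_pair(...)` output as `current` gives the list of prompt/model pairs worth a second look before switching models.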

mistercheph · 2026-03-06T02:56:05 1772765765

While I believe that performance varies with respect to prompt, I have a seriously hard time believing that using the same prompt that was effective with the previous model would perform worse with the next generation of the same model from that lab and the same prompt.

deaux · 2026-03-06T03:49:17 1772768957

You shouldn't have a hard time believing it. There are thousands of different domains out there. You find it hard to believe that any of them would perform worse in your scenario?

Labs are still really optimizing for maybe 10 of those domains. At most 25 if we're being incredibly generous.

And for many domains, "worse" can hardly be benched. Think about creative writing. Think about a Burmese cooking recipe generator.

gwd · 2026-03-06T00:19:08 1772756348

OK, so a while back I set up a workflow to do language tagging. There were 6-8 stages in the pipeline where it would go out to an LLM and come back. Each one has its own prompt that has to be tweaked to get it to give decent results. I was only doing it for a smallish batch (150 short conversations) and only for private use; but I definitely wouldn't switch models without doing another informal round of quality assessment and prompt tweaking. If this were something I was using in production there would be a whole different level of testing and quality required before switching to a different model.

0xbadcafebee · 2026-03-06T01:00:15 1772758815

The big providers are gonna deprecate old models after a new one comes out. They can't make money off giant models sitting on GPUs that aren't taking constant batch jobs. If you wanna avoid re-tweaking, open weights are the way. Lots of companies host open weights, and they're dirt cheap. Tune your prompts on those, and if one provider stops supporting it, another will, or worst case you could run it yourself. Open weights are now consistently at SOTA-level at only a month or two behind the big providers. But if they're short, simple prompts, even older, smaller models work fine.

weird-eye-issue · 2026-03-06T04:17:40 1772770660

Tell us more about how you've never actually used these APIs in production

mcint · 2026-03-06T00:03:42 1772755422

Enterprises moving slow, or preferring to remain on old technology that they already know how to work...is received wisdom in hn-adjacent computing, a truism known and reported for more than 3 decades (5 decades since the Mythical Man-Month).

Sounds like someone who's responsible, on the hook, for a bunch of processes, repeatable processes (as much as LLM driven processes will be), operating at scale.

Just in the open, tools like open-webui bolt on evals so you can compare how different models, including new ones, perform on the tasks that you in particular care about.

Indeed LLM model providers mainly don't release models that do worse on benchmarks—running evals is the same kind of testing, but outside the corporate boundary, pre-release feedback loop, and public evaluation.

https://chatgpt.com/share/69aa1972-ae84-800a-9cb1-de5d5fd7a4...

laichzeit0 · 2026-03-06T03:43:39 1772768619

Like, bro, do you think 5.x is a drop in replacement for 4.1? No it obviously wasn’t, since it had reasoning effort and verbosity and no more temperature setting, etc.

There’s no way you can switch model versions without testing and tweaking prompts, even the outputs usually look different. You pin it on a very specific version like gpt-5.2-20250308 in prod.
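
A small sketch of that pinning rule as a startup check. The date-suffix pattern is an assumption based on the example ID above, and `PROD_MODEL`, `is_pinned` and `require_pinned` are hypothetical names; check the provider's docs for its real snapshot naming.

```python
import re

# Assumed snapshot suffix: -YYYYMMDD or -YYYY-MM-DD at the end of the model ID.
_SNAPSHOT_SUFFIX = re.compile(r"-\d{4}-?\d{2}-?\d{2}$")

PROD_MODEL = "gpt-5.2-20250308"  # example ID taken from the comment above


def is_pinned(model_id: str) -> bool:
    """True if the ID ends in a dated snapshot suffix rather than a floating alias."""
    return bool(_SNAPSHOT_SUFFIX.search(model_id))


def require_pinned(model_id: str) -> str:
    """Return model_id unchanged, or raise if it looks like a floating alias."""
    if not is_pinned(model_id):
        raise ValueError(f"{model_id!r} is not pinned to a dated snapshot")
    return model_id
```

Calling `require_pinned(PROD_MODEL)` once at startup makes an accidental switch to a floating alias fail loudly instead of silently changing behavior.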

hobofan · 2026-03-05T22:12:08 1772748728

That's true only in theory, but not in practice. In practice every inference provider handles errors (guardrails, rate limits) somewhat differently and with different quirks, some of which only surface in production usage, and Google is one of the worst offenders in that regard.