There is another big change in gpt-4o-2024-08-06: it supports 16k output tokens, compared to 4k before. (I believe the 16k limit was previously only available in beta.) So gpt-4o-2024-08-06 actually brings three changes, which is pretty significant for API users:
1. Reliable structured outputs
2. Reduced costs by 50% for input, 33% for output
3. Up to 16k output tokens compared to 4k
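To make the first and third items concrete, here is a minimal sketch of a request using the structured-outputs mode and the new output ceiling, based on the announced API shape. The "math_answer" schema and the prompt are made-up examples, not anything from the announcement:

```python
# Sketch: building kwargs for OpenAI's chat completions endpoint
# with structured outputs (response_format type "json_schema",
# strict mode). The schema below is a made-up example.

def build_request(user_prompt: str) -> dict:
    """Build kwargs that would be passed to
    client.chat.completions.create(**build_request(...))."""
    return {
        "model": "gpt-4o-2024-08-06",
        "messages": [{"role": "user", "content": user_prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "math_answer",
                "strict": True,  # output is guaranteed to match the schema
                "schema": {
                    "type": "object",
                    "properties": {
                        "steps": {"type": "array", "items": {"type": "string"}},
                        "answer": {"type": "string"},
                    },
                    "required": ["steps", "answer"],
                    "additionalProperties": False,
                },
            },
        },
        "max_tokens": 16384,  # the new 16k output-token ceiling
    }

req = build_request("Solve 2x + 3 = 11")
```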
I’ve noticed that lately GPT has gotten more and more verbose. I’m wondering if it’s a subtle way to “raise prices”: the average response incurs more tokens, which of course makes any API conversation keep growing in tokens (each IN message concatenates the previous OUT messages).
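The compounding effect is easy to see with a toy model: every request re-sends the whole history, so longer replies inflate every subsequent request's input bill. The token counts here are illustrative, not real usage numbers:

```python
# Toy model: in a chat API, each new request includes all prior
# user and assistant messages, so input tokens grow every turn
# even when the user's messages stay short.

def cumulative_input_tokens(user_tokens, assistant_tokens):
    """Total input tokens billed over a conversation where each
    request re-sends the entire history."""
    history = 0
    total_in = 0
    for u, a in zip(user_tokens, assistant_tokens):
        history += u          # new user message joins the history
        total_in += history   # the whole history is sent as input
        history += a          # the reply joins the history too
    return total_in

# Five turns of 20-token questions: terse replies (50 tokens each)
# vs. verbose replies (200 tokens each).
terse = cumulative_input_tokens([20] * 5, [50] * 5)
verbose = cumulative_input_tokens([20] * 5, [200] * 5)
```

Under these made-up numbers, the verbose model nearly triples the cumulative input-token bill, even though the user typed exactly the same thing.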
GPT has indeed been getting more verbose, but revenue has zero bearing on that decision. There's always a tradeoff here, and we do our imperfect best to pick a default that makes the most people happy.
I suspect the reason why most big LLMs have ended up in a pretty verbose spot is that it's easier for users to scroll & skim than to ask follow-up questions (which requires formulation + typing + waiting for a response).
With regard to this new gpt-4o model: you'll find it actually bucks the recent trend and is less verbose than its predecessor.
> I suspect the reason why most big LLMs have ended up in a pretty verbose spot is that it's easier for users to scroll & skim than to ask follow-up questions
Maybe it's a 'technical' user divide, but that seems wrong to me. I would much rather have a succinct answer that I can probe further or clarify if necessary.
Lately it's been going against my custom prompt (profile, whatever it's called) - which tells it to assume some level of competence, gives a bit about my background, and asks it to keep things brief - and it's worse than it was when I created that out of annoyance with it.
Like earlier I asked something about some detail of AWS networking and using reachability analyser with VPC endpoints/peering connections/Lambda or something, and it starts waffling on like 'first, establish the ID of your Virtual Private Cloud Endpoint. Step 1. To locate the ID, go to ...'
I’ve noticed this as well with coding questions. I will give it problematic code and ask a question about behavior, but it will attempt to reply with a solution to a problem. And even if I prompt it to avoid providing solutions, it ignores my instruction and blasts out huge blocks of useless and typically incorrect code. And once it overwhelms my subtle inquiries with nonsense, it gets stuck repeating itself and I just have to start a new session over.
For me this is one of the strongest motivators for running LLMs locally - even if they're measurably worse, they're a far better tool because they don't change behavior over time.
My description was of me as a human user of ChatGPT fwiw, not the OpenAI API.
I had it again earlier:
Me: give me a bucket policy for write access from alb
CGPT: [waffle about IP ranges that is totally incorrect; then starts telling me ALB doesn't typically write to S3 because it's usually an intermediary between clients and backend services like EC2 instances or Lambda functions - it already knows from chat context I am using the latter]
Me: [whacks stop because it's rapidly getting out of hand] yes it does for access and connection logs
CGPT: To allow Application Load Balancer (ALB) to write access and connection logs to an S3 bucket, you need to set up a bucket policy that [waffle waffle waffle]
Me: [stop] yes I know that's what I asked for
CGPT: Here is an example of an S3 bucket policy [...]
Me: invalid principal [as far as I can tell, a complete hallucination]
CGPT: [tries again]
Me: yes I already tried that, valid policy but ALB still doesn't have permission
CGPT: [nonsense intensifies]
In the end I sorted it much quicker from AWS docs, which is sort of saying something, because I do often struggle with them. Thought I'd give ChatGPT a chance here but it really wasn't helpful.
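For reference, the policy the AWS docs prescribe for ALB access logs has a specific shape: the logs are delivered by a region-specific Elastic Load Balancing account, so the bucket policy must grant `s3:PutObject` on the `AWSLogs/<account-id>/` prefix to that account's root ARN. Here is a sketch with hypothetical bucket and account names; `127311923021` is the ELB log-delivery account for us-east-1 only (other regions use different IDs, and regions launched after August 2022 use the `logdelivery.elasticloadbalancing.amazonaws.com` service principal instead - check the docs for yours):

```python
import json

# Sketch of an S3 bucket policy for ALB access logs.
# "my-alb-logs" and account 123456789012 are hypothetical;
# 127311923021 is the ELB log-delivery account for us-east-1.
BUCKET = "my-alb-logs"
MY_ACCOUNT = "123456789012"
ELB_ACCOUNT = "127311923021"  # region-specific; us-east-1 shown

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # The principal is the ELB account, NOT your own account -
            # getting this wrong produces exactly the "valid policy
            # but ALB still doesn't have permission" symptom above.
            "Principal": {"AWS": f"arn:aws:iam::{ELB_ACCOUNT}:root"},
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/AWSLogs/{MY_ACCOUNT}/*",
        }
    ],
}

print(json.dumps(policy, indent=2))
```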
I’ve especially noticed this with gpt-4o-mini [1], and it’s a big problem. My particular use case involves keeping a running summary of a conversation between a user and the LLM, and 4o-mini has a really bad tendency to invent details in order to hit the desired summary word limit. I didn’t see this with 4o or earlier models.
Fwiw my subjective experience has been that non-technical stakeholders tend to be more impressed with / agreeable to longer AI outputs, regardless of underlying quality. I have lost count of the number of times I’ve been asked to make outputs longer. Maybe this is just OpenAI responding to what users want?
> You may output only up to 500 words; if the best summary is less than 500 words, that's totally fine. If details are unclear, do not fill in gaps - leave them out of the summary instead.
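A prompt like that only addresses the length side of the problem, and only softly. One mitigation (a sketch of my own, not something from the thread, and it does nothing about invented details) is to enforce the limit mechanically after the call rather than trusting the prompt alone:

```python
# Sketch: post-check a model-generated summary against a word
# limit instead of relying on the prompt to enforce it.

def enforce_word_limit(summary: str, limit: int = 500) -> str:
    """Return the summary unchanged if within the limit;
    otherwise hard-truncate. (In practice, a retry with a
    stricter prompt would be gentler than truncation.)"""
    words = summary.split()
    if len(words) <= limit:
        return summary
    return " ".join(words[:limit])

ok = enforce_word_limit("a short summary", limit=500)
clipped = enforce_word_limit("word " * 600, limit=500)
```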
They also spend more to generate more tokens. The more obvious reason is that people seem to rate responses better the longer they are. Lmsys demonstrated that GPT tops the leaderboard because it tends to give much longer and more detailed answers, and it seems like OpenAI is optimizing for lmsys.
Agree with this take, though in an even broader way; they're optimizing for the leaderboards and benchmarks in general. Longer outputs lead to better scores on those. Even in this thread I see a lot of comments bringing them up, so it works for marketing.
My take is that the leaderboards and benchmarks are still very flawed if you're using LLMs for any non-chat purpose. In the product I'm building, I have to use all of the big 4 models (GPT, Claude, Llama, Gemini), because for each of them there is at least one task that it performs much better than the other 3.
I have not been able to get it to output anywhere close to the max though (even setting max tokens high). Are there any hacks to use to coax the model to produce longer outputs?
[1] https://platform.openai.com/docs/models/gpt-4o