
There is another big change in gpt-4o-2024-08-06: it supports 16k output tokens, compared to 4k before. I think that was only available in beta until now. So gpt-4o-2024-08-06 actually brings three changes, which is pretty significant for API users:

1. Reliable structured outputs

2. Reduced costs: 50% for input, 33% for output

3. Up to 16k output tokens, compared to 4k

https://platform.openai.com/docs/models/gpt-4o
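A minimal sketch of what the first and third changes look like from the Python SDK. The schema here is illustrative (not from the announcement), and actually running the call requires an OPENAI_API_KEY:

```python
# Sketch: a Chat Completions request using the new model with
# Structured Outputs and the raised 16k output-token cap.
# The weather schema is made up for illustration.
import json

schema = {
    "name": "weather_report",
    "strict": True,  # strict mode is what makes the output "reliable"
    "schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "temperature_c": {"type": "number"},
        },
        "required": ["city", "temperature_c"],
        "additionalProperties": False,
    },
}

request = {
    "model": "gpt-4o-2024-08-06",
    "max_tokens": 16384,  # new 16k output cap (was 4k)
    "messages": [{"role": "user", "content": "Weather in Oslo as JSON"}],
    "response_format": {"type": "json_schema", "json_schema": schema},
}

# Uncomment to actually call the API:
# from openai import OpenAI
# completion = OpenAI().chat.completions.create(**request)
# print(completion.choices[0].message.content)
print(json.dumps(request["response_format"], indent=2))
```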



I’ve noticed that lately GPT has gotten more and more verbose. I wonder if it’s a subtle way to “raise prices”: the average response incurs more tokens, which in turn makes every API conversation keep growing in tokens, since each IN message concatenates all the previous OUT messages.
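The compounding effect described above can be sketched with made-up token counts: if every request resends the full history, cumulative billed input tokens grow quadratically with the number of turns, so verbose replies inflate cost well beyond their own output tokens.

```python
# Illustrative only: cumulative input tokens billed over a multi-turn
# chat where each request resends the whole history. Token counts
# are invented for the sketch.
def total_input_tokens(turns, user_tokens=50, reply_tokens=300):
    total = 0
    history = 0
    for _ in range(turns):
        history += user_tokens   # new user message joins the history
        total += history         # the whole history is billed as input
        history += reply_tokens  # the model's reply joins the history
    return total

# Doubling reply verbosity nearly doubles cumulative input cost,
# even though the user's own messages are unchanged:
print(total_input_tokens(10, reply_tokens=300))
print(total_input_tokens(10, reply_tokens=600))
```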


GPT has indeed been getting more verbose, but revenue has zero bearing on that decision. There's always a tradeoff here, and we do our imperfect best to pick a default that makes the most people happy.

I suspect the reason why most big LLMs have ended up in a pretty verbose spot is that it's easier for users to scroll & skim than to ask follow-up questions (which requires formulation + typing + waiting for a response).

With regard to this new gpt-4o model: you'll find it actually bucks the recent trend and is less verbose than its predecessor.


> I suspect the reason why most big LLMs have ended up in a pretty verbose spot is that it's easier for users to scroll & skim than to ask follow-up questions

Maybe it's a 'technical' user divide, but that seems wrong to me. I would much rather have a succinct answer that I can probe further or clarify if necessary.

Lately it's been going against my custom prompt/profile, whatever it's called - the one where I tell it to assume some level of competence, give it a bit about my background etc., and ask it to keep things brief - and it's worse than it was when I created that out of annoyance with it.

Like earlier I asked something about some detail of AWS networking and using reachability analyser with VPC endpoints/peering connections/Lambda or something, and it starts waffling on like 'first, establish the ID of your Virtual Private Cloud Endpoint. Step 1. To locate the ID, go to ...'


I’ve noticed this as well with coding questions. I will give it problematic code and ask a question about behavior, but it will attempt to reply with a solution to a problem. And even if I prompt it to avoid providing solutions, it ignores my instruction and blasts out huge blocks of useless and typically incorrect code. And once it overwhelms my subtle inquiries with nonsense, it gets stuck repeating itself and I just have to start a new session over.

For me this is one of the strongest motivators for running LLMs locally- even if they’re measurably worse, they’re a far better tool because they don’t change behavior over time.


There’s an interesting discrepancy here.

Human users are charged by the number of messages, so longer responses are preferable because follow up questions use up your message allowance.

APIs are charged by token so shorter messages are preferable as you don’t pay for unnecessary tokens.
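As a rough sketch of the per-token side, using the announced gpt-4o-2024-08-06 prices ($2.50 per million input tokens, $10 per million output tokens; the token counts below are invented):

```python
# Cost of a single API call at the announced gpt-4o-2024-08-06 rates.
def call_cost(input_tokens, output_tokens,
              in_per_million=2.50, out_per_million=10.00):
    return (input_tokens * in_per_million
            + output_tokens * out_per_million) / 1_000_000

# Same question, terse vs verbose answer: the output portion of the
# bill scales linearly with verbosity, so API users pay directly
# for padding that a flat-rate chat user never sees.
print(call_cost(500, 250))    # terse answer
print(call_cost(500, 1000))   # verbose answer
```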


My description was of me as a human user of ChatGPT fwiw, not the OpenAI API.

I had it again earlier:

Me: give me a bucket policy for write access from alb

CGPT: [waffle about IP ranges that is totally incorrect; then starts telling me ALB doesn't typically write to S3 because it's usually an intermediary between clients and backend services like EC2 instances or Lambda functions - it already knows from chat context I am using the latter]

Me: [whacks stop because it's rapidly getting out of hand] yes it does for access and connection logs

CGPT: To allow Application Load Balancer (ALB) to write access and connection logs to an S3 bucket, you need to set up a bucket policy that [waffle waffle waffle]

Me: [stop] yes I know that's what I asked for

CGPT: Here is an example of an S3 bucket policy [...]

Me: invalid principal [as far as I can tell, a complete hallucination]

CGPT: [tries again]

Me: yes I already tried that, valid policy but ALB still doesn't have permission

CGPT: [nonsense intensifies]

In the end I sorted it much quicker from AWS docs, which is sort of saying something, because I do often struggle with them. Thought I'd give ChatGPT a chance here but it really wasn't helpful.
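For what it's worth, the policy the AWS docs arrive at grants s3:PutObject to the regional ELB service account rather than to the ALB itself, which may be why "invalid principal" guesses hallucinate. A sketch, with placeholders: 127311923021 is the ELB account for us-east-1 only (other regions use different IDs; check the docs), and the bucket and account numbers are made up:

```python
# Sketch of an S3 bucket policy letting an ALB write its access logs.
# The principal is the *regional* ELB service account, not the ALB.
import json

ELB_ACCOUNT = "127311923021"   # us-east-1 only; region-specific value
BUCKET = "my-alb-logs"         # placeholder bucket name
AWS_ACCOUNT = "123456789012"   # placeholder account ID

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{ELB_ACCOUNT}:root"},
        "Action": "s3:PutObject",
        # ALB writes under AWSLogs/<account-id>/ in the bucket
        "Resource": f"arn:aws:s3:::{BUCKET}/AWSLogs/{AWS_ACCOUNT}/*",
    }],
}
print(json.dumps(policy, indent=2))
```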


Do changes in verbosity tuning have a meaningful impact on the average "correctness" of the responses?

Also your about page is very suspicious for someone at an AI company ;).


I’ve especially noticed this with gpt-4o-mini [1], and it’s a big problem. My particular use case involves keeping a running summary of a conversation between a user and the LLM, and 4o-mini has a really bad tendency of inventing details in order to hit the desired summary word limit. I didn’t see this with 4o or earlier models.

Fwiw my subjective experience has been that non-technical stakeholders tend to be more impressed with / agreeable to longer AI outputs, regardless of underlying quality. I have lost count of the number of times I’ve been asked to make outputs longer. Maybe this is just OpenAI responding to what users want?

[1] https://sophiabits.com/blog/new-llms-arent-always-better#exa...


Did you try giving the model an "out"?

> You may output up to 500 words; if the best summary is less than 500 words, that's totally fine. If details are unclear, do not fill in gaps - leave them out of the summary instead.


It's a subtle way to make it smarter. Making it write out the "thinking process" and decisions has always helped with reliability and quality.


They also spend more to generate more tokens. The more obvious reason is that people seem to rate responses better the longer they are. Lmsys demonstrated that GPT tops the leaderboard because it tends to give much longer and more detailed answers, and it seems like OpenAI is optimizing for the lmsys leaderboard.


Agree with this take, though in an even broader way; they're optimizing for the leaderboards and benchmarks in general. Longer outputs lead to better scores on those. Even in this thread I see a lot of comments bringing them up, so it works for marketing.

My take is that the leaderboards and benchmarks are still very flawed if you're using LLMs for any non-chat purpose. In the product I'm building, I have to use all of the big 4 models (GPT, Claude, Llama, Gemini), because for each of them there is at least one task that it performs much better than the other 3.


That's actually pretty impressive... if they didn't dumb it down that is, which only time will tell.


I have not been able to get it to output anywhere close to the max though (even setting max tokens high). Are there any hacks to use to coax the model to produce longer outputs?



