Migrating Systems to GPT-5: Tricks and Pitfalls
GPT-5, the highly anticipated latest version of OpenAI's flagship model, hit the streets a few weeks ago. Despite some breathless commentary from influencers who had been given early access, the eventual release was a bit underwhelming (in a way that only something that would have seemed like science fiction a few short years ago, but now seems passé, can be). Aside from the quality of the model itself, which some people have claimed was more about lowering OpenAI's costs than delivering a better result, the change to GPT-5 has introduced some issues when integrating it into a product, which we thought we should share.
Problem 1: You can’t accurately assess cost anymore, or set max output tokens
OpenAI have been accused of confusing model naming in the past, with similar-sounding but different names like GPT-4o vs. o4, and adjectives like mini, turbo, pro, nano, thinking and "deep research" appended to model names. GPT-5 attempted to resolve this (kind of) by offering a single API that routes to different models under the covers. The problem is not completely resolved, because they ALSO offer mini and nano versions of GPT-5. All of the GPT-5 family are reasoning models: you can't disable reasoning, and reasoning uses up tokens. How many tokens? You can't control that – there is no max reasoning tokens field, only max output tokens.
Although you can't disable reasoning altogether, you can set the reasoning effort to "low", which is a hint to the model about how much it should reason. In our tests the model ALWAYS decided to reason, burning a minimum of 1,000 tokens even for a tiny single-word message. If you've set max output tokens to 500 and GPT-5 decides to "reason" about your very simple message, burning through 1,000 tokens on the way, it ends up returning an empty response.
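As a sketch of what this failure mode looks like in practice (assuming the OpenAI Python SDK's Responses API and its reasoning / max_output_tokens parameters; the prompt and cap here are illustrative), the problem is at least detectable by inspecting the token usage and completion status:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical illustration: a trivial prompt with a small output cap.
# If the model spends the budget on reasoning tokens, the visible
# output can come back empty even though the tokens are still billed.
response = client.responses.create(
    model="gpt-5",
    input="Reply with one word: is the sky blue?",
    reasoning={"effort": "low"},  # a hint, not a guarantee
    max_output_tokens=500,        # caps reasoning + visible output combined
)

usage = response.usage
print("reasoning tokens:", usage.output_tokens_details.reasoning_tokens)
print("total output tokens:", usage.output_tokens)
print("visible text:", repr(response.output_text))

# When reasoning consumes the whole budget, the response is flagged
# as incomplete rather than raising an error.
if response.status == "incomplete":
    print("ran out of tokens:", response.incomplete_details.reason)
```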
OpenAI, in their advice on prompting reasoning models, suggest allocating at least 25,000 tokens for reasoning and adjusting down from there; but even if the final number you arrive at is only a tenth of that, it's still a lot of tokens, and cost, for each request.
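To put rough numbers on that overhead (the prices below are placeholders, not current rates; substitute the published pricing for whichever model you use):

```python
# Hypothetical per-token price -- substitute the current published rate.
PRICE_PER_OUTPUT_TOKEN = 10.00 / 1_000_000  # $10 per 1M output tokens (placeholder)

visible_output_tokens = 50  # the short answer you actually wanted
reasoning_tokens = 2_500    # a tenth of the suggested 25,000 budget

# Reasoning tokens are billed as output tokens, so a 50-token answer
# is priced like a 2,550-token one.
cost = (visible_output_tokens + reasoning_tokens) * PRICE_PER_OUTPUT_TOKEN
overhead = reasoning_tokens / visible_output_tokens
print(f"cost per request: ${cost:.4f} ({overhead:.0f}x token overhead)")
```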
Problem 2: Model Speed
We gave GPT-5 the benefit of five days' grace after the release, but in our tests the performance of the model was poor and extremely unpredictable, even with reasoning set to low. The quickest response took 8 seconds, whilst the slowest took close to 35-40 seconds. This can be mitigated by streaming the response, but users will probably still tire of a response that streams this slowly. It is in sharp contrast to GPT-4.1 mini, which responded in a predictable 3-4 seconds and felt lightning-fast by comparison when streamed.
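If you do stream, it's worth measuring time-to-first-token separately from total time, since the initial silence is what users actually perceive. A rough harness, again assuming the Responses API's streaming events (the event type name reflects the current SDK and may differ in other versions):

```python
import time
from openai import OpenAI

client = OpenAI()

start = time.monotonic()
first_token_at = None

# Stream the response and record when the first visible token arrives.
# Reasoning happens silently before the first delta, which is where
# the long pause comes from.
stream = client.responses.create(
    model="gpt-5",
    input="Summarise the plot of Hamlet in two sentences.",
    reasoning={"effort": "low"},
    stream=True,
)
for event in stream:
    if event.type == "response.output_text.delta" and first_token_at is None:
        first_token_at = time.monotonic()

end = time.monotonic()
print(f"time to first token: {first_token_at - start:.1f}s")
print(f"total time: {end - start:.1f}s")
```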
Problem 3: Model Intelligence
GPT-5 launched to much fanfare and expectations regarding its capabilities, with some wide-eyed accelerationists believing it could be the first example of Artificial General Intelligence (AGI). OpenAI’s Sam Altman said:
"We think you will love using GPT-5 much more than any previous Al. It is useful it is smart it is fast [and] intuitive. GPT-3 was sort of like talking to a high school student. There were flashes of brilliance lots of annoyance but people started to use it and get some value out of it. GPT-4o maybe it was like talking to a college student… With GPT-5 now it's like talking to an expert - a legitimate PhD level expert in anything any area you need on demand they can help you with whatever your goals are."
Although some typical stumbling blocks, like counting the number of times the letter 'r' appears in the word "strawberry", had been special-cased, it wasn't long before the usual set of problems that LLMs struggle with had been identified and called out, and nearly 5,000 people successfully petitioned OpenAI to keep access to the older GPT-4o models in ChatGPT.
In our tests, GPT-5 nano, mini and regular with medium reasoning (which consumed 1,000-2,000 tokens) and small text inputs all failed in comparison to GPT-4.1 mini. Instructions that we explicitly said not to include in the output were included anyway. Reasoning seemed to be a wild card here – some of our tests passed and then failed on subsequent runs, and it was hard to get consistent output.
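A cheap way to catch this kind of regression before migrating is to run the same prompt several times per model and check each output against your hard constraints. A minimal sketch (the model list, prompt, and banned phrase are all illustrative, not the tests we actually ran):

```python
from openai import OpenAI

client = OpenAI()

MODELS = ["gpt-4.1-mini", "gpt-5-mini"]  # candidates to compare
PROMPT = "List three fruits. Do not mention apples."
BANNED = "apple"                         # constraint the output must respect
RUNS = 5                                 # repeat to expose inconsistency

for model in MODELS:
    failures = 0
    for _ in range(RUNS):
        response = client.responses.create(model=model, input=PROMPT)
        if BANNED in response.output_text.lower():
            failures += 1
    print(f"{model}: {failures}/{RUNS} runs violated the instruction")
```

Repeating each prompt matters here: a model that passes once and fails on the next run is exactly the inconsistency we saw, and a single-shot test will miss it.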
Conclusion
We'd advise anyone thinking of migrating to GPT-5 to hold fire until some of these issues have been explored further, or at the very least to run a suite of tests evaluating the quality, speed, and cost of the model relative to others. The additional, uncontrollable cost of reasoning tokens could be mitigated by OpenAI allowing reasoning to be disabled altogether, though it's possible this would further degrade the quality of GPT-5's responses compared to the GPT-4 family of models.