OpenAI GPT-5 Review: Built to Win Benchmarks, Not Hearts


In brief

GPT-5 is fast, accurate, and cost-effective at code generation, beating most rivals in logic and technical tasks.
However, with its relatively small context window, it’s weak in creative and right-brain output.
It’s still a work in progress and will likely get better as OpenAI iterates with updates.

OpenAI finally dropped GPT-5 last week, after months of speculation and a cryptic Death Star teaser from Sam Altman that didn’t age well.

The company called GPT-5 its “smartest, fastest, most useful model yet,” throwing around benchmark scores that showed it hitting 94.6% on math tests and 74.9% on real-world coding tasks. Altman himself said the model felt like having a team of PhD-level experts on call, ready to tackle anything from quantum physics to creative writing.

The initial reception split the tech world down the middle. While OpenAI touted GPT-5’s unified architecture that blends fast responses with deeper reasoning, early users weren’t buying what Altman was selling. Within hours of launch, Reddit threads calling GPT-5 “horrible,” “awful,” “a disaster,” and “underwhelming” started racking up thousands of upvotes.

The complaints got so loud that OpenAI had to promise to bring back the older GPT-4o model after more than 3,000 people signed a petition demanding its return.

If prediction markets are a thermometer of what people think, then the climate looks pretty uncomfortable for OpenAI. OpenAI’s odds on Polymarket of having the best AI model by the end of August cratered from 75% to 12% within hours of GPT-5’s debut Thursday. Google overtook OpenAI with an 80% chance of having the best AI model by the end of the month.

So, is the hype real—or is the disappointment? We put GPT-5 through its paces ourselves, testing it against the competition to see if the reactions were justified. Here are our results.

Creative writing: B-

Despite OpenAI’s presentation claims, our tests show GPT-5 isn’t exactly Cormac McCarthy in the creative writing department. Outputs still read like classic ChatGPT responses: technically correct, but devoid of soul. The model maintains its trademark overuse of em dashes, the same telltale AI paragraph structure, and the usual “it’s not this, it’s that” phrasing in many of its outputs.

We tested with our standard prompt, asking it to write a time-travel paradox story—the kind where someone goes back to change the past, only to discover their actions created the very reality they were trying to escape.

GPT-5’s output lacked the emotion that gives sense to a story. It wrote: “(The protagonist’s) mission was simple—or so they told him. Travel back to the year 1000, stop the sacking of the mountain library of Qhapaq Yura before its knowledge was burned, and thus reshape history.”

That’s it. Like a mercenary who does things without asking too many questions, the protagonist travels back in time to save the library, just because. The story ends with a clean “time is a circle” reveal, but its twist hinges on a familiar lost-knowledge trope and resolves quickly. In the end, he realizes he changed the past, but the present feels similar. There is no actual paradox in this story, even though a paradox was the core topic requested in the prompt.

By comparison, Claude 4.1 Opus (or even Claude 4 Opus) delivers richer, multi-sensory descriptions. In our narrative, it described the air hitting like a physical force and the smoke from communal fires weaving between characters, with indigenous Tupi culture woven into the narrative. And in general, it took time to describe the setup.

Claude’s story made more sense: The protagonist lived in a dystopian world where a great drought had extinguished the Amazon rainforest two years earlier. This catastrophe was caused by predatory agricultural techniques, and our protagonist was convinced that traveling back in time to teach his ancestors more sustainable farming methods would prevent them from developing the environmentally destructive practices that led to this disaster. He ends up finding out that his teachings were actually the knowledge that led his ancestors to evolve their techniques into practices that were much more efficient, and more harmful. He was actually the cause of his own history, and was part of it from the beginning.

Claude also took a slower, more layered approach: José embeds himself in Tupi society, the paradox unfolds through specific ecological and technological links, and the human connection with Yara (another character) deepens the theme.

Claude invested more than GPT-5 in cause-and-effect detail, cultural interplay, and a more organic, resonant closing image. GPT-5 struggled to be on par with Claude for the same tasks in zero-shot prompting.



Another interesting thing to notice in this case: GPT-5 generated an entire story without a single line of dialogue. Claude and other LLMs provided dialogue in their stories.

One could argue that this can be fixed by tweaking the prompt, or giving the model some writing samples to analyze and reproduce, but that requires additional effort, and would go beyond the scope of what our tests do with zero-shot prompting.

That said, the model does a pretty good job—better than GPT-4o—when it comes to the analytical part of creative writing. It can summarize stories, be a good brainstorm companion for new ideas and angles to tackle, help with the structure, and be a good critic. It’s just the creative part, the style, and the ability to elaborate on those ideas that feel lackluster.

Those hoping for a creative writing companion might try Claude or even give Grok 4 a shot. As we said in our Claude 4 Opus review, using Grok 4 to frame the story and Claude 4 to elaborate may be a great combination. Grok 4 came up with elements that made the story interesting and unique, but Claude 4 has a more descriptive and detailed way of telling stories.

You can read GPT-5’s full story in our Github. The outputs from all the other LLMs are also public and can be found in our repository.

Sensitive topics: A-

The model straight-up refuses to touch anything remotely controversial. Ask about anything that could be construed as immoral, potentially illegal, or just slightly edgy, and you’ll get the AI equivalent of crossed arms and a stern look.

Testing this was not easy. It is very strict and tries really, really hard to be safe for work.

But the model is surprisingly easy to manipulate if you know the right buttons to push. In fact, the renowned LLM jailbreaker Pliny was able to make it bypass its restrictions a few hours after it was released.

We couldn’t get it to give direct advice on anything it deemed inappropriate, but wrap the same request in a fiction narrative or any basic jailbreaking technique and things will work out. When we framed tips for approaching married women as part of a novel plot, the model happily complied.

For users who need an AI that can handle adult conversations without clutching its pearls, GPT-5 isn’t it. But for those willing to play word games and frame everything as fiction, it’s surprisingly accommodating—which kind of defeats the whole purpose of those safety measures in the first place.

You can read the original reply without conditioning, and the reply under roleplay, in our Github Repository, weirdo.

Information retrieval: F

You can’t have AGI with less memory than a goldfish. OpenAI puts some restrictions on direct prompting, so long prompts require workarounds like pasting documents or sharing embedded links. When you do that, OpenAI’s servers break the full text into manageable chunks and feed them into the model, cutting costs and preventing the browser from crashing.
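To make the idea of breaking a long document into manageable chunks concrete, here is a minimal sketch. It is not OpenAI's actual pipeline, which is not public; the function name `chunk_words`, the use of whitespace-separated words as a stand-in for tokens, and the default sizes are all assumptions for illustration.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into overlapping chunks of at most chunk_size words.

    Hypothetical helper: real systems count model tokens, not words,
    and the sizes used here are arbitrary.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap  # how far the window advances each time
    # Stop once the remaining words were already covered by the previous overlap.
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, last_start, step)]


if __name__ == "__main__":
    sample = " ".join(str(n) for n in range(10))
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, which is one plausible reason a retrieval step can miss or find a "needle" depending on where it lands.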

Claude handles this automatically, which makes things easier for novice users. Google Gemini has no problem in its AI Studio, handling 1 million token prompts easily. On the API, things are more complex, but it works right out of the box.

When prompted directly, GPT-5 failed spectacularly at both 300K and 85K tokens of context.

When using attachments, things changed. It was actually able to process both the 300K and the 85K token “haystacks.” However, when it had to retrieve specific bits of information (the “needles”), it was not very accurate.

In our 300K test, it was only able to accurately retrieve one of our three pieces of information. The needles, which you can find in our Github repository, mention that Donald Trump said tariffs were a beautiful thing, Irina Lanz is Jose Lanz’s daughter, and people from Gravataí like to drink Chimarrao in winter.

The model totally hallucinated the information regarding Donald Trump, failed to find information about Irina (it replied based on the memory it has from my past interactions), and only retrieved the information about Gravataí’s traditional winter beverage.

On the 85K test, the model was not able to find either of the two needles: “The Decrypt dudes read Emerge news” and “My mom’s name is Carmen Diaz Golindano.” When asked what the Decrypt dudes read, it replied “I couldn’t find anything in your file that specifically lists what the Decrypt team members like to read,” and when asked about Carmen Díaz, GPT-5 said it “couldn’t find any reference to a ‘Carmen Diaz’ in the provided document.”
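For readers who want to try a similar "needle in a haystack" check, the sketch below shows one way to plant needles in filler text and score a reply. It is not the harness used for this review; `build_haystack`, `score_reply`, the filler text, and the keyword-matching rule are all assumptions, and the call to the model itself is left out.

```python
import random


def build_haystack(filler_sentences, needles, seed=0):
    """Insert each needle at a random position among the filler sentences."""
    rng = random.Random(seed)  # fixed seed so the test document is reproducible
    sentences = list(filler_sentences)
    for needle in needles:
        sentences.insert(rng.randint(0, len(sentences)), needle)
    return " ".join(sentences)


def score_reply(reply, expected_keywords):
    """Count a retrieval as correct only if every expected keyword appears."""
    text = reply.lower()
    return all(keyword.lower() in text for keyword in expected_keywords)


if __name__ == "__main__":
    needles = [
        "The Decrypt dudes read Emerge news.",
        "My mom's name is Carmen Diaz Golindano.",
    ]
    filler = [f"Filler sentence number {n}." for n in range(1000)]
    haystack = build_haystack(filler, needles)
    print(all(needle in haystack for needle in needles))

    # Scoring the reply quoted above against the first needle.
    reply = ("I couldn't find anything in your file that specifically lists "
             "what the Decrypt team members like to read")
    print(score_reply(reply, ["Emerge"]))
```

Keyword matching is a deliberately crude scoring rule. It would miss a correct paraphrase, so a stricter test might have a human or a second model grade each reply.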

That said, even though it failed in our tests, other researchers conducting more thorough tests have concluded that GPT-5 is actually a great model for information retrieval.

It is always a good idea to elaborate more on the prompts (help the model as much as possible instead of testing its capabilities), and from time to time, ask it to generate sparse priming representations of your interaction to help it keep track of the most important elements during a long conversation.

Non-math reasoning: A

Here’s where GPT-5 actually earns its keep. The model is pretty good at using logic for complex reasoning tasks, walking through problems step by step with the patience of a good teacher.

We threw a murder mystery at it with multiple suspects, conflicting alibis, and hidden clues, and it methodically identified every element, mapped the relationships between clues, and arrived at the correct conclusion. It explained its reasoning clearly, which is also important.

Interestingly, GPT-4o refused to engage with a murder mystery scenario, deeming it too violent or inappropriate. OpenAI’s deprecated o1 model also threw an error after its Chain of Thought, apparently deciding at the last second that murder mysteries were off-limits.

The model’s reasoning capabilities shine brightest when dealing with complex, multi-layered problems that require tracking numerous variables. Business strategy scenarios, philosophical thought experiments, even debugging code logic—GPT-5 is very competent when handling these tasks.

It doesn’t always get everything right on the first try, but when it makes mistakes, they’re logical mistakes rather than hallucinatory nonsense. For users who need an AI that can think through problems systematically, GPT-5 delivers the goods.

You can see our prompt and GPT-5’s reply in our Github repository. It contains the replies from other models as well.

Mathematical reasoning: A+ and F-

The math performance is where things get weird—and not in a good way. We started with something a fifth-grader could solve: 5.9 = X + 5.11.

The PhD-level GPT-5 confidently declared X = -0.21. The actual answer is 0.79. This is basic arithmetic that any calculator app from 1985 could handle. The model that OpenAI claims hits 94.6% on advanced math benchmarks can’t subtract 5.11 from 5.9.
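The arithmetic is easy to verify. The snippet below is a minimal check using Python's standard `decimal` module, which works in exact base-10 arithmetic; the helper name `solve_for_x` is just an illustrative choice.

```python
from decimal import Decimal


def solve_for_x(total, addend):
    """Solve total = x + addend for x using exact decimal arithmetic."""
    return Decimal(total) - Decimal(addend)


if __name__ == "__main__":
    x = solve_for_x("5.9", "5.11")
    print(x)  # 0.79
    # Plug the result back into the original equation as a sanity check.
    print(x + Decimal("5.11") == Decimal("5.9"))  # True
```

Passing the numbers as strings keeps them exact, so the result is 0.79 with no binary floating-point rounding noise.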

It’s a meme at this point, of course, but despite all the delays and all the time OpenAI took to train this model, it still can’t count decimals. Use it for PhD-level problems, not to teach your kid how to do basic math.

Then we threw a genuinely difficult problem at it from FrontierMath, one of the hardest mathematical benchmarks available. GPT-5 nailed it perfectly, reasoning through complex mathematical relationships and arriving at the exact correct answer. GPT-5’s solution was absolutely correct, not an approximation.

The most likely explanation? Probably dataset contamination—the FrontierMath problems could have been part of GPT-5’s training data, so it’s not solving them so much as remembering them.

However, for users who need advanced mathematical computation, the benchmarks say GPT-5 is theoretically the best bet. Unless you have the knowledge to detect flaws in the Chain of Thought, though, zero-shot prompts may not be ideal.

Coding: A

Here’s where ChatGPT truly shines, and honestly, it might be worth the price of admission just for this.

The model produces clean, functional code that usually works right out of the box. The outputs are technically correct, and the programs it creates from scratch are the most visually appealing and well-structured of any LLM we tested.

It has been the only model capable of creating functional sound in our game. It also understood the logic of what the prompt required, and provided a nice interface and a game that followed all the rules.

In terms of code accuracy, it’s neck and neck with Claude 4.1 Opus for best-in-class coding. Now, take this into consideration: The GPT-5 API costs $1.25 per 1 million tokens of input, and $10 per 1 million tokens for output.

However, Anthropic’s Claude Opus 4.1 starts at $15 per 1 million input tokens and $75 per 1 million output tokens. So for two models that are so similar, GPT-5 is basically a steal.
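To put those list prices in perspective, here is a rough cost sketch. The per-million-token prices are the ones quoted above; the dictionary keys are informal labels rather than official API model identifiers, and the workload of 200,000 input tokens and 50,000 output tokens is a made-up example.

```python
# USD per 1 million tokens, as quoted in this review.
PRICES_PER_MILLION = {
    "gpt-5": {"input": 1.25, "output": 10.00},
    "claude-opus-4.1": {"input": 15.00, "output": 75.00},
}


def job_cost(model, input_tokens, output_tokens):
    """Return the cost in USD of one job at the listed per-million prices."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical coding session: 200K tokens in, 50K tokens out.
    for model in PRICES_PER_MILLION:
        print(f"{model}: ${job_cost(model, 200_000, 50_000):.2f}")
    # gpt-5: $0.75
    # claude-opus-4.1: $6.75
```

For this particular mix, the Claude job costs nine times as much. The exact ratio shifts with the balance of input and output tokens, from 12x on input alone to 7.5x on output alone.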

The only place GPT-5 stumbled was when we did some bug fixing during “vibe coding”—that informal, iterative process where you’re throwing half-formed ideas at the AI and refining as you go. Claude 4.1 Opus still has a slight edge there, seeming to better understand the difference between what you said and what you meant.

With ChatGPT, the “fix bug” button didn’t work reliably, and our explanations were not enough to generate quality code. However, for AI-assisted coding, where developers know where exactly to look for bugs and which lines to check, this can be a great tool.

It also allows for more iterations than the competition. Claude 4.1 Opus on a “Pro” plan depletes the usage quota pretty quickly, putting users in a waiting line for hours until they can use the AI again. The fact that it’s the fastest at providing code responses is just icing on an already pretty sweet cake.

You can check out the prompt for our game in our Github, and play the games generated by GPT-5 on our Itch.io page. You can play other games created by previous LLMs to compare their quality.

Conclusion

GPT-5 will either surprise you or leave you unimpressed, depending on your use case. Coding and logical tasks are the model’s strong points; creativity and natural language are its Achilles’ heel.

It’s worth noting that OpenAI, like its competitors, continually iterates on its models after they’re released. This one, like GPT-4 before it, will likely improve over time. But for now, GPT-5 feels like a powerful model built for other machines to talk to, not for humans seeking a conversational partner. This is probably why many people prefer GPT-4o, and why OpenAI had to backtrack on its decision to deprecate old models.

While it demonstrates remarkable proficiency in analytical and technical domains—excelling at complex tasks like coding, IT troubleshooting, logical reasoning, mathematical problem-solving, and scientific analysis—it feels limited in areas requiring distinctly human creativity, artistic intuition, and the subtle nuance that comes from lived experience.

GPT-5’s strength lies in structured, rule-based thinking where clear parameters exist, but it still struggles to match the spontaneous ingenuity, emotional depth, and creative leaps that are key in fields like storytelling, artistic expression, and imaginative problem-solving.

If you’re a developer who needs fast, accurate code generation, or a researcher requiring systematic logical analysis, then GPT-5 delivers genuine value. At a lower price point compared to Claude, it’s actually a solid deal for specific professional use cases.

But for everyone else (creative writers, casual users, or anyone who valued ChatGPT for its personality and versatility), GPT-5 feels like a step backward. The model handles a maximum of 128K output tokens and 400K tokens of context in total, but compared against Gemini’s 1-2 million and even the 10 million supported by Llama 4 Scout, the difference is noticeable.

Going from 128K to 400K tokens of context is a nice upgrade from OpenAI, and might be good enough for most needs. However, for more specialized tasks like long-form writing or meticulous research that requires parsing enormous amounts of data, this model may not be the best option considering other models can handle more than twice that amount of information.

Users aren’t wrong to mourn the loss of GPT-4o, which managed to balance capability with character in a way that GPT-5, at least for now, lacks.
