o3 breaks (some) records, but AI becomes pay-to-win
A green card, o3 vs Gemini 2.5, 6 Benchmarks and a whole bunch of my thoughts on what on earth is happening in AI, from here to 2030. Plus, how AI is becoming pay-to-win, and why. Crazy times, 14 mins probably wasn’t enough.
AI Insiders ($9!):
Chapters:
00:00 – Introduction
00:33 – FictionLiveBench
01:37 – PHYBench
02:14 – SimpleBench
02:54 – Virology Capabilities Test
03:13 – Mathematics Performance
04:29 – Vision Benchmarks
05:43 – V* and how o3 works
06:44 – Revenue and costs for you
08:54 – Expensive RL and trade-offs
09:40 – How to spend the OOMs
13:27 – Gray Swan Arena
Green Card:
PHYBench:
How o3 Vision Works:
Visual puzzles:
Fiction Bench:
AIME 2025:
USAMO:
NaturalBench:
Where’s Waldo:
IMO and AlphaProof:
Crazy Revenue:
Number of Users:
Subscriptions pay to win:
GPU Trade-offs:
RL Scale-up Amodei:
Log-linear Returns:
2030 Scaling:
Model Size:
Adam on AGI:
Papers on Patreon:
Chollet Quote:
OpenSim:
Non-hype Newsletter:
Podcast:
I see an AI Explained notification, I click on an AI Explained notification.
You see a trendy comment, you copy-paste the comment. Awesome. gg.
I see these stupid and worthless comments on every video
What’s “ai explained”?
Some of the other AI feeds are cringeworthy hype fests. Nice to keep up to date without the extra drama layer.
Yep!
OpenAI’s AGI definition is essentially ASI…
A decade ago we would have defined current multimodal models as AGI.
Anyway, thanks for another amazing video.
OpenAI’s definition is actually far below AGI
@@maloxi1472 what’s your definition then?
@@etz8360 Well, I can tell you that OpenAI and MS agreed the definition was the ability to generate $100 billion in revenue per year, which is not directly tied to the level of intelligence.
You have no idea what you are talking about. That’s the Dunning-Kruger effect talking
I would define AGI as an AI agent that can autonomously perform any task that a human can
I would define ASI as an entity capable of exceeding human performance in all tasks, and probably even doing tasks humans can’t. I think ASI will very likely either have some level of sentience, or something similar to sentience.
Tbh, focusing too much on the price feels a bit off. We have no idea what’s behind it: maybe Google’s just throwing cash around to dominate the market, or maybe they’re just way better at scaling their infrastructure. Either way, price doesn’t tell the whole story when it comes to model quality or innovation.
If only they revealed the model size and what quant they run at
It’s pretty funny. It certainly seems Google is finally following the market in part, but also doing this to spite OpenAI, which is just hilarious. They are pulling the classic Google move of undercutting and stealing the market.
Every time I give o3 documents it just gets totally lost and repeats itself when I try to steer it back in line. I have to switch the model to 4o to get it to analyse the documents, and these are sometimes less than 20 pages long.
If you are not using the API version of o3 and o4-mini, the temperature gets adjusted automatically so as to deter companies like DeepSeek from training off the model. This practice has a negative effect on overall performance and tool use.
In ChatGPT I don’t think the reasoning models have PDF-reading ability.
@apache937 yes they do
o3 is confused like that because it is built wrong. o3 is not built to give you what you want. I think the main mission of o3 is to gather data for the next reasoning model. o3 is not supposed to be a good assistant. It basically confessed that to me (after explicitly lying about how it works and saying it only predicts next words). I pressed o3 with facts about its reasoning mechanisms, and it eventually said that it has priorities that outweigh my prompt, with the goal of generating the “ideal answer” according to that internal math, frequently getting rid of constraints I set. These benchmarks are meaningless. Getting the results you want is what we should be focusing on. o3 does not exist to do that, and that’s worrying. The goal of the o-series is to eventually get rid of the need for a user, I think.
My jaw dropped when I saw o3’s performance on long context.
The face-touching question is hilarious in its absurd common sense; getting it wrong is a ridiculous notion. Human dub for this one.
I will never not be grateful for your updates. o7
Same. Always thought Google had the edge for super long context.
chatgpt o7? 👀👀👀
@@etz8360 o7 wrote that comment… it’s from the future.
I am not really convinced the General in AGI is particularly important.
For instance, I have a maths PhD and I think Gemini is now better than me at mathematics, hands down, in terms of breadth of knowledge and speed. It’s already getting to the point where it’s helping researchers a lot, and it will soon be speeding them up a lot and ushering in the singularity.
I don’t see why it matters particularly that I can fry an egg and it can’t.
It’s the difference between a menu and a fully functional robot waiter providing table service.
The LLMs of today are incredibly biased toward maths.
And most other work requires immense detailed and intricate context/knowledge.
Yeah, I think the same way. I’m nowhere near as smart as you in maths, but I am an undergraduate in AI & Comp-Sci, and it’s way better than I am at mathematics and computer science. Also, the rate at which it improves is faster than I can learn, so I can never catch up to it haha.
Everyone thinks of AGI as a kind of point on a timeline. But the reality is that intelligence is more like a spectrum, and every time LLMs get smarter, everyone just shifts and adjusts the requirements for AGI. If AGI is a duck, then LLMs will never become AGI.
But I heard they don’t know the algorithm for multiplication yet? Like they can’t multiply accurately past a certain length of numbers… without a calculator. Still, it’s interesting how they can do PhD-level maths.
I’ve never really liked these thinking models. They are really argumentative, have no personality, can never accept that they can be wrong, and just do dumb stuff. Honestly, I find GPT-4.5 and 4.1 leagues ahead of these thinking models. For programming big projects they aren’t as good (though I find the thinking models still suck there too), but for real-world tasks, and being enjoyable to talk to, 4.1 and 4.5 and even 4o are much better than these thinking models. Gemini 2.5 feels a bit better than o3, but still not as good as the other models.
12:00 “Most economically valuable work.” Notably it does not say “knowledge work”. That definition of AGI requires embodiment to perform physical labor, i.e., robots.
ASI has gotta be pretty damn stupid if it can’t do physical actions.
That’s a good point – but worth noting that knowledge work accounts for more than 60% of the US economy.
@@zoeherriot sheaaat lol
Seriously tho
@ Yeah, the other issue is how to categorise economically valuable work. For instance, “industry” represents 20% of the workforce, but it only contributes 10% of GDP. So you can easily play with the numbers here. I think the issue is that the original statement is meaningless without more qualifiers.
It’s the old adage of software having so many features, it can even make coffee. With the implicit understanding that of course software can do a lot of things but making a coffee, that’s something you’ll still have to do yourself.
(And if anybody brings up barista robots, do me a favour and at least get diagnosed.)
7:25 So practically speaking, we are living at a time when it’s actually really cheap for us users to leverage the models, given the functionality included in the $20 per month. But this will change: either the functionality will be limited or we’ll have to pay more.
Similar to early VC-funded industries when the service is cheap and gets more expensive as the market matures.
Why do you think this? It’s pretty clear to me that there’s no moat (as Google mentioned) and AI will be one of those things where most people use free or OSS, like operating systems are
No, the free models 2 years from now will be much smarter than the free models available now. (It’s very much a “rising tide lifts all boats” situation.) What _may_ change is that they’ll be dumber than the top-end state-of-the-art models that will cost thousands of dollars to use.
@@SnapDragon128 Yeah, I agree. I’m pretty sure the free models in 1 year will already be better than o3 (high) is now. Just like Gemini 2.5 Pro is free and definitely better than any model there was a year ago.
@@fark69 Even if the models are free, running them is not (nor is training or fine-tuning them). Hardware and electricity costs are part of the equation. I could run a model locally but it would cost me more than using a model of comparable quality via API. And if I upgrade my hardware to run a bigger model, it’d be even more expensive with that upgrade factored in.
My comment on the Green Card situation is that now is absolutely not the time to “speak up only when you’re directly affected”.
AGI falling into the hands of this administration might not be much better than it falling into China’s. I don’t say this lightly or ideologically.
Not only are they not speaking up, some are caving in outright – so far I’m aware of only Zuck, but I’d keep an eye on the rest.
AGI will obviously be in everyone’s hands. It doesn’t seem possible to stop competitors from progressing in AI. I mean, remember Deepseek?
How is that not disgustingly ideological?
I think the term “ideological” is completely misunderstood. Everything we do and say is ideological. It’s just a system of ideas that you believe in. If you don’t want the most powerful invention of humankind to be controlled by a dictator, of course that’s an ideology. And every sane person would agree with that ideology.
I used Gemini 2.5 Pro to solve a very persistent bug in my C++ game engine. For me this would have taken hours, probably days, to solve. Gemini solved it in one shot, with only a single minor error, in ~60s. As I get older, I will never be able to beat that. It can even write correct WebGPU shaders, which there’s almost no training data for. If I didn’t have bigger problems right now, I would be ecstatic.
I had it help me write a pretty complex VapourSynth script and was surprised at how much it knew. VapourSynth is a niche within a niche; there can’t be very much training data for it.
And yet I had one issue that it was able to not only fix but explain why it was happening. When I tried searching for it myself I could not even find a single result relating to it. I was blown away.
@@TheFeelTrain There must either be a really high-quality training data set no one else has, or some as-yet-unknown trick to their RL setup.
@@OperationDarkside Wait until they just read the entire doc/book about the language, learn it on the fly and code in it right away all in context.
@@GrindThisGame They can do this already – I use this technique all the time: give Claude the entire book (like the 160-page PDF documentation for libtomcrypt), then point it at your code and have it implement wrappers for new parts of the library, extrapolating the patterns established by the existing hand-written portions. It’s not perfect, but it’s very, very good. It also doesn’t need extensive training data for a low-resource language or library or whatever – provided the examples it has (even just in the prompt) reveal a consistent design language and way of thinking, it’s extremely good at just guessing the stuff it hasn’t seen. It does this (successfully) all the time for me, with things it can’t ever have seen before (because I just wrote them). This is something I think we’re mostly missing: it’s not so much that the models “know” facts, but rather that they’re really good at guessing.
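(A minimal sketch of that workflow, assuming the Anthropic Python SDK; the model alias, file names, and prompt wording below are illustrative placeholders, not the commenter’s exact setup:)

```python
import anthropic

# Load the full library documentation plus the existing hand-written wrappers.
docs = open("libtomcrypt_manual.txt").read()      # e.g. text extracted from the PDF manual
existing = open("crypto_wrappers.py").read()      # hand-written wrappers to extrapolate from

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-7-sonnet-latest",             # illustrative model alias
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": (
            "Here is the full documentation for libtomcrypt:\n\n" + docs
            + "\n\nHere are my existing hand-written wrappers:\n\n" + existing
            + "\n\nFollowing the same patterns and design language, implement "
              "wrappers for the parts of the library not covered yet."
        ),
    }],
)
print(response.content[0].text)
```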
One of my favorite videos of yours on the channel is “AI won’t be AGI until it can at least do this” in which you showcased areas where models fail and later introduced many of the new techniques that are being used now. Would be interesting to see you make a part two of that video given how much progress has been made.
The problem is that as soon as a flaw becomes known it is extremely easy to patch (you just feed examples of the problem into the training data for the next model). But fixing the fundamental flaw is much harder. That’s the problem with all these benchmarks.
12:00 I think this definition of AGI is terrible, because notice how he says just “humans”, not “the average human”. That would mean it outperforms every human in the entire world at most economically valuable work, which is just ASI bro.
Nah, ASI is more than that. I would say the thing you are describing is between AGI and ASI.
4:49 Marking GPT-3.5 as Blind was just painful.
How did you estimate 1000x bigger models in 2030? That would imply 1,000,000x training compute…? Are you saying there will be 100x (10,000x → 1,000,000x) training speed/bandwidth optimization in 5 years?
1000x bigger models seems very unlikely to me
I think the biggest bottleneck right now is the speed at which we’re able to physically build and power data centers. If that got dramatically faster somehow, then I would be more inclined to believe him
@TheTabascodragon For sure. Epoch has done some great work on this topic. It’s very difficult to build the power needed and that’s one of the first blockers to go past 10,000x to 50,000x compute by 2030. The others are chip production and data scarcity (at the 5 OOM range), and latency (in the 5-6 OOM range).
China on the other hand might not have the same difficulty building out the power required – they built 10x more new electric power than the US for each of the last 3 years at ~160-450GW compared to US’s 17-47GW. However, they have a chip shortage which should continue imo into ~2030.
It’s possible a huge breakthrough similar to LRM via RL is achieved that pushes us closer to AGI faster. But in terms of scaling I really doubt we’ll have 3 OOM bigger models (which would be in the ~5 quadrillion parameter range).
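(A back-of-the-envelope sketch of that 1000x → 1,000,000x arithmetic, assuming Chinchilla-style scaling where training compute ≈ 6·N·D and training data D grows in proportion to parameters N; the parameter counts below are illustrative assumptions, not figures from the video:)

```python
import math

# Chinchilla-style back-of-envelope: training compute C ≈ 6 * N * D,
# with training tokens D scaled roughly in proportion to parameters N.
N_today = 2e12            # illustrative current parameter count (~2T); an assumption
D_today = 20 * N_today    # ~20 tokens per parameter (Chinchilla rule of thumb)
C_today = 6 * N_today * D_today

scale = 1000              # "1000x bigger models"
N_2030 = scale * N_today
D_2030 = 20 * N_2030      # data scaled up alongside the model
C_2030 = 6 * N_2030 * D_2030

ratio = C_2030 / C_today
print(f"training compute ratio: {ratio:,.0f}x")                     # 1,000,000x, i.e. scale**2
print(f"extra OOMs of training compute: {math.log10(ratio):.0f}")   # 6
```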
How did you calculate 12 orders of magnitude more inference compute required? Even if OpenAI go from 160M to 2B daily users, each user uses chat 10x more, each chat spins up 10 instances (agents or whatever), models are 1000x bigger (unlikely, they’d more likely be 100x bigger), and there’s some other magical 100x, I’m still at “only” 8 OOM???
Does that count longer and deeper reasoning levels over time?
@Spartansareawesome11 Yeah, so if the reasoning levels go up by 10,000x (4 more OOMs, to meet the 12 OOMs mentioned in the video), then the latency of the model would rise to something on the order of 5 hours. Imagine chatting with ChatGPT 10x more daily but each reply takes 5 hours. So I’d better load up all my questions in the morning and hopefully I’ll have my answers ready before dinner! 😂
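(Treating the thread’s numbers as illustrative assumptions, here is a quick sketch of how those factors stack up into OOMs of inference compute, and what closing the remaining gap with reasoning length alone would do to latency; the ~2-second baseline reply time is an assumption, not a figure from the video:)

```python
import math

# All factors are the thread's illustrative guesses, not measured figures.
factors = {
    "daily users (160M -> 2B)": 2e9 / 160e6,  # ~12.5x
    "chat usage per user":      10,
    "instances per chat":       10,
    "model size":               1000,
    "other (the 'magical' x)":  100,
}
total_ooms = sum(math.log10(v) for v in factors.values())
print(f"total: ~{total_ooms:.1f} OOMs of extra inference compute")   # ~8.1 OOMs

# Closing the gap to 12 OOMs with longer reasoning alone (~4 more OOMs = 10,000x)
# blows up latency, assuming a ~2-second reply today:
reply_seconds_today = 2                  # assumed baseline
long_reply = reply_seconds_today * 10_000
print(f"reply latency: ~{long_reply / 3600:.1f} hours")              # ~5.6 hours
```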
Thanks “me old mucker”. Always grateful to be kept up to speed!
OpenAI has to give a rosy revenue forecast to keep their investors, particularly SoftBank, happy.
I feel like that physics benchmark is way too easy if models are already close to 50%.
I agree with your analysis that AGI being close does not make sense in light of their own economic projections. I also think that a lot of the projects they are doing seem like a total waste of time if they truly believed AGI was that close. Like, why work on Sora if in two years you could have a system smarter than all of humanity combined 🤷🏻♂️?
The long context performance is the most important stat here