o3 breaks (some) records, but AI becomes pay-to-win

A green card, o3 vs Gemini 2.5, 6 Benchmarks and a whole bunch of my thoughts on what on earth is happening in AI, from here to 2030. Plus, how AI is becoming pay-to-win, and why. Crazy times, 14 mins probably wasn’t enough.

AI Insiders ($9!):

Chapters:
00:00 – Introduction
00:33 – FictionLiveBench
01:37 – PHYBench
02:14 – SimpleBench
02:54 – Virology Capabilities Test
03:13 – Mathematics Performance
04:29 – Vision Benchmarks
05:43 – V* and how o3 works
06:44 – Revenue and costs for you
08:54 – Expensive RL and trade-offs
09:40 – How to spend the OOMs
13:27 – Gray Swan Arena

Green Card:
PHYBench:
How o3 Vision Works:
Visual puzzles:
Fiction Bench:

AIME 2025:
USAMO:
NaturalBench:
Where’s Waldo:
IMO and AlphaProof:
Crazy Revenue:
Number of Users:
Subscriptions pay to win:
GPU Trade-offs:
RL Scale-up Amodei:
Log-linear Returns:
2030 Scaling:
Model Size:
Adam on AGI:
Papers on Patreon:

Chollet Quote:
OpenSim:

Non-hype Newsletter:

Podcast:

Joe Lilli
 

  • @CalConrad says:

    I see an AI Explained notification, I click on an AI Explained notification.

  • @OnigoroshiZero says:

    OpenAI’s AGI definition is essentially ASI…
    A decade ago we would define current multimodal models as AGI.

    Anyway, thanks for another amazing video.

    • @maloxi1472 says:

      OpenAI’s definition is actually far below AGI

    • @etz8360 says:

      @@maloxi1472 what’s your definition then?

    • @zoeherriot says:

      @@etz8360 Well I can tell you that OpenAI and MS agreed that the definition was the ability to generate 100 billion in revenue per year. Which is not directly tied to the level of intelligence.

    • @mr.nicolas4367 says:

      You have no idea what you are talking about. That’s the Dunning-Kruger effect talking

    • @TheTabascodragon says:

      I would define AGI as an AI agent that can autonomously perform any task that a human can

      I would define ASI as an entity capable of exceeding human performance in all tasks, and probably even doing tasks humans can’t. I think ASI will very likely either have some level of sentience, or something similar to sentience.

  • @George-h9j says:

    Tbh, focusing too much on the price feels a bit off. We have no idea what’s behind it: maybe Google’s just throwing cash around to dominate the market, or maybe they’re just way better at scaling their infrastructure. Either way, price doesn’t tell the whole story when it comes to model quality or innovation.

    • @apache937 says:

      If only they revealed the model size and what quant they run at

    • @pmuz482 says:

      It’s pretty funny. It certainly seems Google is partly following the market, finally, but also doing this to spite OpenAI, which is just hilarious. They are doing the Google CLASSIC of undercutting and stealing the market.

  • @viyye says:

    Every time I give o3 documents it just gets totally lost and repeats itself when I try to steer it back in line. I have to switch the model to 4o to get it to analyse the documents, and these are sometimes less than 20 pages long.

    • @GeoMeridium says:

      If you are not using the API version of o3 and o4-mini, the temperature gets adjusted automatically, so as to deter companies like Deepseek from training off the model. This practice has a negative effect on overall performance and tool use.

    • @apache937 says:

      In ChatGPT, I don’t think the reasoning models have PDF reading ability.

    • @viyye says:

      @apache937 Yes, they do.

    • @juliankohler5086 says:

      o3 is confused like that because it is built wrong. o3 is not built to give you what you want. I think the main mission of o3 is to gather data for the next reasoning model. o3 is not supposed to be a good assistant. It basically confessed that to me (after explicitly lying about how it works and saying it only predicts next words). I pressed o3 with facts about its reasoning mechanisms, and it eventually said that it has priorities that outweigh my prompt, with the goal of generating the “ideal answer” according to that internal math, frequently discarding constraints I set. These benchmarks are meaningless. Getting the results you want is what we should be focusing on. o3 does not exist to do that, and that’s worrying. The goal of the o-series is to eventually get rid of the need for a user, I think.

  • @Dannnneh says:

    My jaw dropped when I saw o3’s performance on long context.
    The face-touching question is hilarious in its absurd common sense, getting it wrong is a ridiculous notion, human dubby for this one.
    I will never not be grateful for your updates. o7

  • @jonp3674 says:

    I am not really convinced the General in AGI is particularly important.

    For instance, I have a maths PhD and I think Gemini is now better than me at mathematics, hands down, in terms of breadth of knowledge and speed. It’s already getting to the point where it’s helping researchers a lot and will soon be speeding them up a lot and ushering in the singularity.

    I don’t see why it matters particularly that I can fry an egg and it can’t.

    • @Gafferman says:

      It’s the difference between a menu and a fully functional robot waiter providing table service.

    • @konstantinlozev2272 says:

      The LLMs of today are incredibly biased toward maths.
      And most other work requires immensely detailed and intricate context/knowledge.

    • @facts-ec4yi says:

      Yeah, I think the same way. I’m nowhere near as smart as you in maths, but I am an undergraduate in AI & Comp-Sci, and it’s way better than I am at mathematics and computer science. Also, the rate at which it improves is faster than I can learn, so I can never catch up to it haha.

    • @lucid8302 says:

      Everyone thinks of AGI as a kind of point on a timeline. But the reality is that intelligence is more like a spectrum, and every time LLMs get smarter, everyone just shifts and adjusts the requirements for AGI. If AGI is a duck, then LLMs will never become AGI.

    • @anearthian894 says:

      But I heard they don’t know the algorithm for multiplication yet? Like they can’t multiply accurately beyond a certain length of numbers… without a calculator. Still, it’s interesting how they can do PhD-level maths.

  • @sircramthel8664 says:

    I’ve never really liked these thinking models. They are really argumentative, have no personality, can never accept that they can be wrong, and just do dumb stuff. Honestly, I find GPT-4.5 and 4.1 leagues ahead of these thinking models. For programming big projects they aren’t as good (though I find the thinking models still suck there too), but for real-world tasks, and for being enjoyable to talk to, 4.1 and 4.5 and even 4o are much better than these thinking models. Gemini 2.5 feels a bit better than o3, but still not as good as the other models.

  • @r0bophonic says:

    12:00 “Most economically valuable work.” Notably it does not say “knowledge work”. That definition of AGI requires embodiment to perform physical labor, i.e., robots.

    • @apache937 says:

      ASI has gotta be pretty damn stupid if it can’t do physical actions.

    • @zoeherriot says:

      That’s a good point – but worth noting that knowledge work accounts for more than 60% of the US economy.

    • @theWACKIIRAQI says:

      @@zoeherriot sheaaat lol

      Seriously tho

    • @zoeherriot says:

      @ Yeah, the other issue is how to categorise economically valuable work. For instance, “industry” represents 20% of the workforce, but it only contributes 10% of GDP. So you can easily play with the numbers here. I think the issue is the original statement is meaningless without more qualifiers.

    • @unvergebeneid says:

      It’s the old adage of software having so many features, it can even make coffee. With the implicit understanding that of course software can do a lot of things but making a coffee, that’s something you’ll still have to do yourself.

      (And if anybody brings up barista robots, do me a favour and get at least diagnosed.)

  • @maks_st says:

    7:25 So practically speaking, we are living at a time when it’s actually really cheap for us users to leverage the models, given the functionality included in the $20 per month. But this will change: either the functionality will be limited or we’ll have to pay more.

    Similar to early VC-funded industries when the service is cheap and gets more expensive as the market matures.

    • @fark69 says:

      Why do you think this? It’s pretty clear to me that there’s no moat (as Google mentioned) and AI will be one of those things where most people use free or OSS, like operating systems are

    • @SnapDragon128 says:

      No, the free models 2 years from now will be much smarter than the free models available now. (It’s very much a “rising tide lifts all boats” situation.) What _may_ change is that they’ll be dumber than the top-end state-of-the-art models that will cost thousands of dollars to use.

    • @Yasmina-n3u9x says:

      @@SnapDragon128 Yeah, I agree. I’m pretty sure the free models in 1 year will already be better than o3 (high) now. Just like Gemini 2.5 Pro is free and definitely better than any model there was a year ago.

    • @moozooh says:

      @@fark69 Even if the models are free, running them is not (nor is training or fine-tuning them). Hardware and electricity costs are part of the equation. I could run a model locally but it would cost me more than using a model of comparable quality via API. And if I upgrade my hardware to run a bigger model, it’d be even more expensive with that upgrade factored in.

  • @OverLordGoldDragon says:

    My comment on the Green Card situation is that now is absolutely not the time to “speak up only when you’re directly affected”.
    AGI falling into the hands of this admin. might not be much better than it falling into China’s. I don’t say this lightly or ideologically.
    Not only are they not speaking up, some are caving in outright – so far I’m aware of only Zuck, but I’d keep an eye on the rest.

    • @fark69 says:

      AGI will obviously be in everyone’s hands. It doesn’t seem possible to stop competitors from progressing in AI. I mean, remember Deepseek?

    • @Houshalter says:

      How is that not disgustingly ideological?

    • @missoats8731 says:

      I think the term “ideological” is completely misunderstood. Everything we do and say is ideological. It’s just a system of ideas that you believe in. If you don’t want the most powerful invention of humankind to be controlled by a dictator, of course that’s an ideology. And every sane person would agree with that ideology.

  • @OperationDarkside says:

    I used Gemini 2.5 Pro to solve a very persistent bug in my C++ game engine. For me this would have taken hours, probably days, to solve. Gemini solved it in one shot, with only a single minor error, in ~60 s. As I get older, I will never be able to beat that. It can even write correct WebGPU shaders, which there’s almost no training data for. If I didn’t have bigger problems right now, I would be ecstatic.

    • @TheFeelTrain says:

      I had it help me write a pretty complex vapoursynth script and was surprised at how much it knew. Vapoursynth is a niche within a niche, there can’t be very much training data for it.

      And yet I had one issue that it was able to not only fix but explain why it was happening. When I tried searching for it myself I could not even find a single result relating to it. I was blown away.

    • @OperationDarkside says:

      @@TheFeelTrain There must be either a really high quality training data set no one else has or there’s some yet unknown trick to their RL setup.

    • @GrindThisGame says:

      @@OperationDarkside Wait until they just read the entire doc/book about the language, learn it on the fly and code in it right away all in context.

    • @CyanOgilvie says:

      @@GrindThisGame They can do this already – I use this technique all the time: give Claude the entire book (like the 160 page pdf documentation on libtomcrypt), then point it at your code and have it implement wrappers for new parts of the library, extrapolating the patterns established by the existing hand-written portions. It’s not perfect but it’s very, very good. Also it doesn’t need extensive training data for a low-resourced language or library or whatever – provided the examples it has (even just in the prompt) reveal a consistent design language and thinking it’s extremely good at just guessing the stuff it hasn’t seen. It does this (successfully) all the time for me, with things it can’t ever have seen before (because I just wrote them). This is something that I think we’re missing mostly – it’s not so much that the models “know” facts, but rather that they’re really good at guessing.

  • @malikmartin7410 says:

    One of my favorite videos of yours on the channel is “AI won’t be AGI until it can at least do this” in which you showcased areas where models fail and later introduced many of the new techniques that are being used now. Would be interesting to see you make a part two of that video given how much progress has been made.

    • @fark69 says:

      The problem is that as soon as a flaw becomes known, it is extremely easy to patch (you just feed examples of the problem into the training data for the next model). But fixing the fundamental flaw is much harder. That’s the problem with all these benchmarks.

  • @pigeon_official says:

    12:00 I think this definition of AGI is terrible because notice how he says just “humans” not “the average human” no, just humans. That would mean it outperforms every human in the entire world at most economically valuable work, which is just ASI bro

  • @nicdemai says:

    4:49 Marking GPT-3.5 as Blind was just painful.

  • @2hcy says:

    How did you estimate 1000x bigger models in 2030? That would imply 1,000,000x training compute…? Are you saying there will be a 100x training speed/bandwidth optimization in 5 years (10,000x → 1,000,000x)?

    1000x bigger models seems very unlikely to me
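
    A rough sketch of the arithmetic behind that objection (not from the video; it assumes Chinchilla-style scaling, where training tokens grow in proportion to parameter count, so training compute grows roughly with the square of model size):

        # Back-of-the-envelope: why "1000x bigger models" implies ~1,000,000x training compute
        params_multiplier = 1_000                # models 1000x larger
        tokens_multiplier = params_multiplier    # compute-optimal data scales with params
        compute_multiplier = params_multiplier * tokens_multiplier
        print(f"{compute_multiplier:,}x")        # 1,000,000x, vs the ~10,000x compute build-out cited above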

    • @TheTabascodragon says:

      I think the biggest bottleneck right now is the speed at which we’re able to physically build and power data centers. If that got dramatically faster somehow, then I would be more inclined to believe him

    • @2hcy says:

      @TheTabascodragon  For sure. Epoch has done some great work on this topic. It’s very difficult to build the power needed and that’s one of the first blockers to go past 10,000x to 50,000x compute by 2030. The others are chip production and data scarcity (at the 5 OOM range), and latency (in the 5-6 OOM range).

      China, on the other hand, might not have the same difficulty building out the power required – they have built roughly 10x more new electric capacity than the US in each of the last 3 years, at ~160-450 GW compared to the US’s 17-47 GW. However, they have a chip shortage which imo should continue into ~2030.

      It’s possible a huge breakthrough similar to LRM via RL is achieved that pushes us closer to AGI faster. But in terms of scaling I really doubt we’ll have 3 OOM bigger models (which would be in the ~5 quadrillion parameter range).

  • @2hcy says:

    How did you calculate 12 orders of magnitude more inference compute required? Even if OpenAI go from 160M to 2B daily users, each user uses chat 10x more, each chat spins up 10 instances (agents or whatever), models are 1000x bigger (unlikely, they’d more likely be 100x bigger), and there’s some other magical 100x, I’m still at “only” 8 OOM???
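
    Tallying the commenter’s own multipliers as orders of magnitude (a sketch; every factor below is taken from the comment above, none are figures from the video):

        import math

        factors = {
            "daily users (160M -> 2B)": 2e9 / 160e6,   # ~12.5x
            "chats per user": 10,
            "agent instances per chat": 10,
            "model size": 1_000,
            "other 'magical' factor": 100,
        }
        total_ooms = sum(math.log10(f) for f in factors.values())
        print(round(total_ooms, 1))                    # ~8.1 OOM, well short of 12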

    • @Spartansareawesome11 says:

      Does that count longer and deeper reasoning levels over time?

    • @2hcy says:

      @Spartansareawesome11 Yeah, so, if the reasoning levels go up by 10,000x (4 OOM, to meet the 12 OOM mentioned in the video), then the latency of the model would rise to the order of 5 hours. Imagine chatting with ChatGPT 10x more daily but each reply takes 5 hours. So I better load up all my questions in the morning and hopefully I’ll have my answers ready before dinner! 😂
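
      The 5-hour figure is simple arithmetic, assuming a reply today takes on the order of 2 seconds to generate and per-token throughput stays flat (both assumptions, not numbers from the video):

          # Rough latency estimate if per-query reasoning compute grows by another 4 OOM
          current_reply_seconds = 2          # assumed typical reply time today
          reasoning_multiplier = 10_000      # the extra 4 OOM of test-time compute
          hours = current_reply_seconds * reasoning_multiplier / 3600
          print(round(hours, 1))             # ~5.6 hours per reply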

  • @humunu says:

    Thanks “me old mucker”. Always grateful to be kept up to speed!

  • @annaczgli2983 says:

    OpenAI has to give a rosy revenue forecast to keep their investors, particularly SoftBank, happy.

  • @CasualTortoise says:

    I feel like that physics benchmark is way too easy if models are already close to 50%.

    I agree with your analysis that AGI being close does not make sense in light of their own economic projections. I also think that a lot of the projects they are doing seem like a total waste of time if they truly believed AGI was that close. Like why work on Sora if in two years you could have a system smarter than all of humanity combined 🤷🏻‍♂️?

  • @eyeofthetiger7 says:

    The long context performance is the most important stat here
