o3 – wow

o3 isn’t one of the biggest developments in AI in 2+ years because it beats a particular benchmark; it’s because it demonstrates a reusable technique through which almost any benchmark could fall, and at short notice. I’ll cover all the highlights, benchmarks broken, and what comes next. Plus, the costs OpenAI didn’t want us to know, Genesis, ARC-AGI 2, Gemini-Thinking, and much more.

AI Insiders ($9!):

FrontierMath:

Chollet Statement:
MLC Paper:

AlphaCode 2:
Human Performance on ARC-AGI:
Wei Tweet ‘3 months’:
Deliberative Alignment Paper:
Brown Safety Tweet:
SWE-bench Verified:
Amodei Prediction:
David Dohan: 16 hours
OpenAI Personal Writing:

John Hallman Tweet:

00:00 – Introduction
01:19 – What is o3?
03:18 – FrontierMath
05:15 – o4, o5
06:03 – GPQA
06:24 – Coding, Codeforces + SWE-verified, AlphaCode 2
08:13 – 1st Caveat
09:03 – Compositionality?
10:16 – SimpleBench?
13:11 – ARC-AGI, Chollet
20:25 – Safety Implications

AI Insiders:

Non-hype Newsletter:

Podcast:

Joe Lilli
 

  • @JonSmith-v7b says:

    You already know I was waiting for this video with my eye twitching

    • @JonnyCook says:

      He’s like the only guy in this space that doesn’t use clickbait titles. Everyone else just flat out states that OpenAI created AGI, even though no one actually thinks that.

    • @everythingiscoolio says:

      @@JonnyCook The title of this video is literally “wow”. That’s a textbook definition of clickbait. It’s the reason I clicked it. What are you talking about, my man?

    • @joey199412 says:

      @@everythingiscoolio It’s wow because he was actually (for the first time) wowed. Not to make more people click.

    • @everythingiscoolio says:

      @@joey199412 Right…. I’m glad you believe that.

    • @alansmithee419 says:

      @@everythingiscoolio Depends if you determine clickbait by whether it’s fairly used or not.
      I would at the least only criticise something for clickbait if the video doesn’t deliver on its promise.

  • @Kenton. says:

    To think this is the worst the models will ever be. ChatGPT was only released 2 years ago, and now we are talking about it scoring among top experts in nearly every field. Insane

    • @johanavril1691 says:

      Well, actually ChatGPT is largely based on GPT-3, and GPT-3 was trained about four and a half years ago. But that’s not very different, I guess

    • @dertythegrower says:

      They pay people all day to answer questions... it’s not THAT genius yet...

    • @merlinwarage says:

      @@johanavril1691 What? GPT-1 was trained in 2018, after Google developed the transformer in 2017. Btw, the same logic/architecture has existed since 1965.

    • @bournechupacabra says:

      “scoring near top experts in every field” is quite an exaggeration, but yeah that’s still impressive

    • @johanavril1691 says:

      @@merlinwarage Hm, yeah, I see what you mean, but the technology behind ChatGPT is GPT-3, and it did take 4 years to get from that to what we have today. But you’re right and I’m right too, just a little different

  • @booshong says:

    I’ve been waiting so hard for this video. What a turnaround time

    • @codycast says:

      I’ve seen comments like this. If you’ve already watched the OpenAI video about o3, what are you so anxious to watch someone’s recap for?

    • @booshong says:

      @@codycast Besides the cursory argument that any company’s promo content is inherently biased, third-party analysis is a critical part of science. And I think this channel is great at that.

  • @Anthemius-p4n says:

    Genuinely, what is even happening anymore? We are heading towards the most crazy and important year in human history. What a time to be alive, wow

    • @HCforLife1 says:

      Yeah, we are heading into neo-feudalism, human trafficking on an enormous scale, eradication of brain usage, extremely low quality of life, and a spike of violence. Yeah! What a time to be alive!

    • @apocryphalshepherd says:

      The first day of the new year is a Wednesday, the next a Thursday, and then a Friday, of course: WTF!

      Maybe something is cooking, lol

    • @Kajenx says:

      It’s also kind of horrifying. It’s easy to imagine a utopian ending to AGI suddenly existing, but I don’t want to live through the time between now and that ending. I feel like the only jobs humans will be qualified for are manual labor, and there’s a lot of people with a huge stake in the current economic system sticking around.

    • @myspace_forever says:

      I’d rather be alive 30 years ago

    • @Gafferman says:

      And yet… Nothing really has changed still. Shops, businesses, governments… Nothing is different.

  • @FranXiT says:

    The future is looking so immensely interesting. What a time to be alive!

  • @panzerofthelake4460 says:

    You know, I find it absurd that “Open”AI and other proprietary AI labs get to benefit from open-source research and publications while offering, relatively speaking, little to nothing in return. It’s an unfair advantage. Reminds me of the game “Werewolf” by Davidoff, where an informed minority almost always wins against the uninformed majority.

    • @brianmi40 says:

      And yet OpenAI gave the world free access to ChatGPT for 4 months before even the first competitor showed up, and they’ve done the same with each advancing model since then.

      And I think you could find a lot of AI researchers who would agree that OpenAI simply REVEALING o1 Preview when they did, and what/how it did even from a surface perspective, inspired a whole lot of new research.

      Methinks thou judgest too harshly.

    • @okaydetar821 says:

      How much more do you want them to give?

    • @Formalec says:

      I think they are very open and fast in their releases for being closed-source companies.

    • @imperson7005 says:

      @@okaydetar821 Everything. Most if not all advanced technology today was paid for by our taxes. It was built using us as research. It was refined by the colleges and infrastructure we produce and sustain. It is our right.

    • @test-zg4hv says:

      I don’t think they have infinite money, so yeah

  • @DrBreadstick says:

    The real benchmark is real-world examples. Remember: current benchmarks are like a laboratory testing ground. Yes, the questions asked might be real-world examples, but they will be written in a way that’s clear and states actual objectives/goals. A sterile set of hard but clear questions.
    The real world is different. If I ever get a well-written objective from a stakeholder to implement, it’ll be a first. The actual world is nitty-gritty: full of nuances, filled with human error, and everything needs refinement first.
    Therefore, my benchmark: read a typical agile development user story written by some key user or stakeholder and try to implement it in such a fashion that it’ll pass testing and is production-ready within a certain time limit. If it can do that, I’ll sit back down.

    • @jaazz90 says:

      The opposite incentive will also start ticking: people who are able to formulate precise specs to feed into a neural net will be sought after.

    • @irjonesy says:

      These arguments against AI are fascinating. They are getting more and more complex. Eventually we will say: “alright it’s fine and all but tell me when it’s able to run and optimize all technological systems on earth simultaneously”.
      This moving of the goal posts reminds me of the arguments for god that traditionally religious types have used for hundreds of years as scientific discoveries have opened the doors to the unknown.
      God is the weather.
      No, wait, god is the sun.
      Actually, it seems god is space.
      Hmm, I suppose god is dark energy? Quantum probability?

    • @MuhammadRaiyan135 says:

      @irjonesy Smart religious scholars have never argued God is the weather or dark space energy, mate. Without religion you have no objective morality, even if you worship science and “neoliberal notions of progress”.

    • @sirius-harry says:

      @MuhammadRaiyan135 That’s exactly the point: how do you define smart scholars? We literally have religious books which consider natural elements to be gods.
      Also, morality has nothing to do with religion. The world is full of religious folks who are pathetic human beings. Not saying being atheist makes you a good human either!
      But critical thinking is definitely helpful in being a considerate, reasonable and understanding person.

    • @John-d8p says:

      @MuhammadRaiyan135 🤣🤣🤣 Of course, it had to be a piss-full.

  • @pandoraeeris7860 says:

    We’re in the middle of the intelligence explosion.

    • @kodykendall says:

      Yep. Early version of it. The iteration loop is tightening — the next big milestone is when we remove the human from the loop.

    • @andybaldman says:

      Too bad society no longer cares about intelligence.

    • @wesley6442 says:

      If only we could figure out how we as humans intuitively grasp or understand a basic concept such as some of these puzzles, how we think about thinking and how we reason, and replicate that in a machine/AI, but faster and better

  • @JamesJohnson-iq5wb says:

    You know it’s good when the SimpleBench guy is impressed with OpenAI

    • @pik910 says:

      He has a name! It is AI Explained guy.

    • @wesley6442 says:

      I felt superior in that, yeah, it can test better than me, but I can pick up a ball and throw it, walk up stairs, etc. Then I saw a video of humanoid robots with pressure sensors on their hands achieving a high degree of dexterity... and it has me worried, it may do human better than me lol

  • @kyneticist says:

    imho we’re overfitting our expectations to the utility of benchmarks. AGI doesn’t need to be a true genius, or a genius in any number of fields, to be considered AGI. As a general intelligence, it “just” needs to be proficient at operating reliably and thoughtfully in the real world.

    • @carlosamado7606 says:

      I think there are very specific simple things it needs to do well that it doesn’t yet, one of which is better memory. Personalisation to the user will also be important. Also, for robots it would make sense to create a sense of awareness: not as in sentience, but knowing what its model is, the name given by the owner, info on things the owner sees as important, etc. It is possible with models now, but there is far too much hallucination still. But if we combine that with a dexterous enough robot, then yes, it is already very transformative on its own.
      Many jobs will get displaced by that alone.

    • @techrvl9406 says:

      Totally agree—benchmarks are a useful tool, but they might not capture the full essence of what makes intelligence “general.” It’s like measuring a chef’s skill by how well they bake bread—it tells you something, but not everything. AGI, as you said, isn’t about being a genius in every field; it’s about adaptability and reliability in the complexities of the real world.

      What I find fascinating is how we define “operating thoughtfully.” Should an AGI aim for practical problem-solving across domains, or do you think there’s room for it to develop something akin to intuition? Benchmarks are only one piece of the puzzle—how would you measure AGI’s ability to navigate uncharted territory, like moral dilemmas or cultural nuances?

    • @danielarvidsson3676 says:

      And be able to learn in real time. That is the big thing still missing in today’s AI systems.

    • @mb2776 says:

      @@danielarvidsson3676 Exactly! People don’t know that any AI needs to be stopped to be trained.

    • @PJ-hi1gz says:

      @@mb2776 You don’t need to stop it per se to retrain and redeploy a new version of the model.

  • @Shaunmcdonogh-shaunsurfing says:

    I don’t know what’s more impressive, o3 being announced so soon or the turnaround time of your coverage on it. Fantastic work.

    • @brianmi40 says:

      I’d go with o3 being announced SO VERY SOON after o1 simply went live, but I get the point.

    • @sth128 says:

      Later it’ll be revealed that AI Explained is actually being run by an ASI that achieved omnipotence and traveled back in time to guide AGI alignment with Simple Bench.

      Sadly the alignment is not with humanity. Instead it’s Roko’s Basilisk.

    • @Yobs2K says:

      @@sth128 you just got me thinking about Roko’s Basilisk having to kill itself if time travel is possible and it didn’t travel back in time to help create itself faster

    • @kecksbelit3300 says:

      It’s not soon; they just released o1 so late. o1’s training data is from 2023.

    • @0xunknown336 says:

      What’s more impressive is the compute cost per task being used by o3: $1k per single task!

  • @taumag says:

    When comparing the bell curves, the mean of AI is now higher than the mean of humanity. The Turing test was “can you tell the difference between a computer and a person?” Now, AGI is “can we create a test that the smartest human can pass but the dumbest computer can’t?”

    • @Mirror_Lotus says:

      It’s only natural for humans to move their goal posts. It’s one of the oldest plays in the book: Denial.

    • @memofromessex says:

      I don’t believe it’s capable of outthinking a human if the information is created by a human.

    • @cluelesssoldier says:

      @@memofromessex The difference now is that it is being taught HOW to think, not just WHAT to think – we already see low level emergent reasoning skills from current leading models, this takes it to a whole other level.

    • @samuctrebla3221 says:

      @@Mirror_Lotus The Turing test is a moving goalpost by definition.

    • @zvexevz says:

      The question was in fact “are there tests that the dumbest humans can easily pass but which the top AI models cannot?” As long as the answer is yes, we know we aren’t at AGI.

  • @noone-ld7pt says:

    Damn… OpenAI crushing ARC was not on my bingo card for 2024 (or even 2025). o1 was an impressive jump in performance, but o3 proves that the performance jump was not even the real point; it’s the completely paradigm-breaking ability to solve anything with an objectively correct answer. That feels like a profound change in potential, and I don’t really know how I feel about it.

    • @andybaldman says:

      You should feel worried.

    • @k14pc says:

      agree on everything

    • @joey199412 says:

      We now have definitive proof that human intelligence won’t be needed in 10 years’ time. What the implications and results of that will be, we don’t know. But the world is now permanently changed. This is a defining moment of our future as a species. I never said these words before, and I never thought about it like this, until it was proven right now that you can just scale up test-time compute to essentially answer any question you can reason the answer to.

    • @ClaimClam says:

      o3 is just a BS smokescreen to make up for the fact they can’t even ship GPT-5

    • @oranges557 says:

      @@ClaimClam People like you are super lame

  • @silpheedTandy says:

    An upside of a channel not using clickbaity titles is that when the title is as dramatic as just “wow”, you can trust that the content really is unusually impressive and maybe unexpected!

  • @MePeterNicholls says:

    We had “hold on to your papers” now we have “adjust your timelines!”

    • @somebody-anonymous says:

      We’re glad to inform you that your appointment at the biomatter recycling plant has been moved up to next Monday!

    • @fabp.2114 says:

      @@somebody-anonymous who recycles for whom. who made what for whom. what is?

    • @sebkeccu4546 says:

      As of today, Google Gemini Flash 2 remains the best model available, because o3 will only be available in February according to the livestream (the minis at the end of January). Google Gemini has actually been available since this week and outperforms o1-pro in quality but also in performance (time to compute). I’m really curious how o3 will compare to Gemini Flash 2 in benchmarks, especially in the qualitative tests, because the computing time of o3 surely looks slower than the new Gemini Flash 2

    • @somebody-anonymous says:

      @@fabp.2114 gotta make those stamps

    • @fabp.2114 says:

      @@somebody-anonymous ad astra et ultra

  • @spanke2999 says:

    If you want to summarize human existence, I guess it is the sentence “we really need to start focusing on safety…”

    • @OperationDarkside says:

      That’s what a couple with bad self-control usually says after the first baby, but we all know that it’ll be at least 2 more babies and their financial collapse before they take proper measures.

    • @wesley6442 says:

      I think of that song by Metallica, “Some Kind of Monster”, because here we are playing with a force we don’t fully comprehend, nor may we realize the depths of its capabilities, or even whether it is actively deceiving us... like, all the while, we were playing with a deadly grizzly bear when we thought it was a cute puppy

  • @dcgamer1027 says:

    I’m a little concerned about the power of the majority being lost. The reason we have to get along and live with each other, in part, is that 1 strong ape can still be beaten by 2 weaker apes working together. But what if that one ape has 1000 autonomous drones to defend itself? What if it has robots to create food and entertainment for itself?
    I’m not just concerned about what the AI will do if it is unsafe; I’m concerned about what some humans would do if they no longer need the rest of us.

  • @williamjmccartan8879 says:

    Thank you for sharing your time and work, Phillip. It’s been a crazy year, man. Merry Christmas to you and your family and any elves who might be assisting you, cheers

  • @matthewuzhere says:

    tbh i’m still pretty unconvinced by a lot of these benchmarks. they showed o1 being pretty smart too, but it really doesn’t seem to be able to have a conversation about its own answer, or recognize a repeated mistake it’s making over and over again, or adapt as the user’s needs/requests change. makes it feel like the model is not really getting more intelligent, just better at specific processes. or maybe a better way to put it: compared to the ideal of AGI, it’s still somewhat narrow intelligence, just with a lot more narrow intelligence in different domains. idk how you would do a benchmark that would quantify its ability to converse with a human in these subjects, correct its own mistakes, etc—maybe that’s just not measurable. but if it was, i suspect that’s where it would become much more obvious that these models are not AGI and are not even particularly close to it. you have talked a little about this idea in the past i think, but not a ton from what i remember. would love to hear your thoughts.

    full disclosure: i am only halfway through the vid, and i have not tried o3 myself. will update this comment if finishing the video changes what i think here majorly

    • @CoolIcingcake3467 says:

      By your definition, FrontierMath, SWE-bench Verified, etc. are flawed.
      We could even generalize with this reasoning that “all benchmarks are flawed”.

    • @odiseezall says:

      yeah, and the Sun could also... like... just switch off tomorrow

    • @Bolidoo says:

      @matthewuzhere Imo this lack of “common sense” is absolutely there, and it’s a significant limitation for its utility. I wouldn’t be surprised if o3 had similar limitations. But models are getting a lot better in that area. Hallucinations were far worse just a few years ago. Nowadays Sonnet 3.5 surprisingly gets 41.4% and o1-preview 41.7%. I think it’s a surprisingly hard task for LLMs, but progress suggests it may not be a hard wall after all.

    • @cherubin7th says:

      @@CoolIcingcake3467 Well, yes, “all benchmarks are flawed” is something everyone should know. Outside of AI we see great benchmark scores but poor real-world performance all the time.

  • @_Escaflowne_ says:

    No clickbait + proper factual coverage without overhype => subscribed
