o3 – wow

o3 isn’t one of the biggest developments in AI in 2+ years because it beats a particular benchmark; it’s because it demonstrates a reusable technique through which almost any benchmark could fall, and at short notice. I’ll cover all the highlights, benchmarks broken, and what comes next. Plus, the costs OpenAI didn’t want us to know, Genesis, ARC-AGI 2, Gemini-Thinking, and much more.

AI Insiders ($9!):

FrontierMath:

Chollet Statement:
MLC Paper:

AlphaCode 2:
Human Performance on ARC-AGI:
Wei Tweet ‘3 months’:
Deliberative Alignment Paper:
Brown Safety Tweet:
SWE-bench Verified:
Amodei Prediction:
David Dohan: 16 hours
OpenAI Personal Writing:

John Hallman Tweet:

00:00 – Introduction
01:19 – What is o3?
03:18 – FrontierMath
05:15 – o4, o5
06:03 – GPQA
06:24 – Coding, Codeforces + SWE-verified, AlphaCode 2
08:13 – 1st Caveat
09:03 – Compositionality?
10:16 – SimpleBench?
13:11 – ARC-AGI, Chollet
20:25 – Safety Implications

AI Insiders:

Non-hype Newsletter:

Podcast:

Joe Lilli
 

  • @JonSmith-v7b says:

    You already know I was waiting for this video with my eye twitching

    • @JonnyCook says:

      He’s like the only guy in this space that doesn’t use clickbait titles. Everyone else just flat out states that OpenAI created AGI, even though no one actually thinks that.

    • @everythingiscoolio says:

      @@JonnyCook The title of this video is literally “wow”. That’s a textbook definition of clickbait. It’s the reason I clicked it. What are you talking about, my man?

    • @joey199412 says:

      @@everythingiscoolio It’s wow because he was actually (for the first time) wowed. Not to make more people click.

    • @everythingiscoolio says:

      @@joey199412 Right…. I’m glad you believe that.

    • @alansmithee419 says:

      @@everythingiscoolio Depends if you determine clickbait by whether it’s fairly used or not.
      I would at the least only criticise something for clickbait if the video doesn’t deliver on its promise.

  • @Kenton. says:

    To think this is the worst the models will ever be. ChatGPT was only released 2 years ago, and now we are talking about it scoring among top experts in nearly every field. Insane

    • @johanavril1691 says:

      Well, actually ChatGPT is largely based on GPT-3, and GPT-3 was trained about four and a half years ago. But that’s not very different, I guess

    • @dertythegrower says:

      They pay people all day to answer questions... it’s not THAT genius yet...

    • @merlinwarage says:

      @@johanavril1691 What? GPT-1 was trained in 2018, after Google developed the transformer in 2017. Btw, the same logic/architecture has existed since 1965.

    • @bournechupacabra says:

      “scoring near top experts in every field” is quite an exaggeration, but yeah that’s still impressive

    • @johanavril1691 says:

      @@merlinwarage Hm, yeah, I see what you mean, but the technology behind ChatGPT is GPT-3, and it did take 4 years to get from that to what we have today. But you’re right and I’m right too, just a little different

  • @booshong says:

    I’ve been waiting so hard for this video. What a turnaround time

    • @codycast says:

      I’ve seen comments like this. If you’ve already watched the OpenAI video about o3, what are you so anxious to watch someone’s recap for?

    • @booshong says:

      @@codycast Besides the cursory argument that any company’s promo content is inherently biased, third-party analysis is a critical part of science. And I think this channel is great at that.

  • @Anthemius-p4n says:

    Genuinely, what is even happening anymore? We are heading towards the most crazy and important year in human history. What a time to be alive, wow

    • @HCforLife1 says:

      Yeah, we are heading into neo-feudalism, human trafficking on an enormous scale, eradication of brain usage, extremely low quality of life, and a spike of violence. Yeah! What a time to be alive!

    • @apocryphalshepherd says:

      The first day of the new year is a Wednesday, the next a Thursday, and then a Friday, of course: WTF!

      Maybe something is cooking, lol

    • @Kajenx says:

      It’s also kind of horrifying. It’s easy to imagine a utopian ending to AGI suddenly existing, but I don’t want to live through the time between now and that ending. I feel like the only jobs humans will be qualified for are manual labor, and there’s a lot of people with a huge stake in the current economic system sticking around.

    • @myspace_forever says:

      I’d rather be alive 30 years ago

    • @Gafferman says:

      And yet… Nothing really has changed still. Shops, businesses, governments… Nothing is different.

  • @FranXiT says:

    The future is looking so immensely interesting. What a time to be alive!

  • @panzerofthelake4460 says:

    You know, I find it absurd that “Open”AI and other proprietary AI labs get to benefit from open-source research and publications while offering, relatively speaking, little to nothing in return. It’s an unfair advantage. Reminds me of the game “Werewolf” by Davidoff, where an informed minority almost always wins against the uninformed majority.

    • @brianmi40 says:

      And yet OpenAI gave the world free access to ChatGPT for 4 months before even the first competitor showed up, and they’ve done the same with each advancing model since then.

      And I think you could find a lot of AI researchers who would agree that OpenAI simply REVEALING o1 Preview when they did, and what/how it did even from a surface perspective, inspired a whole lot of new research.

      Methinks thou judgest too harshly.

    • @okaydetar821 says:

      How much more do you want them to give?

    • @Formalec says:

      I think they are very open and fast in their releases for being closed-source companies.

    • @imperson7005 says:

      @@okaydetar821 Everything. Most if not all advanced technology today was paid for by our taxes. It was built using us as research. It was refined by the colleges and infrastructure we produce and sustain. It is our right.

    • @test-zg4hv says:

      I don’t think they have infinite money, so yeah

  • @DrBreadstick says:

    The real benchmark is real-world examples. Remember: current benchmarks are like a laboratory testing ground. Yes, the questions asked might be real-world examples, but they will be written in a way that’s clear and states actual objectives/goals. A sterile set of hard but clear questions.
    The real world is different. If I ever get a well-written objective from a stakeholder to implement, it’ll be a first. The actual world is nitty-gritty: full of nuances, filled with human error, and everything needs refinement first.
    Therefore, my benchmark: read a typical agile development user story written by some key user or stakeholder and try to implement it in such a fashion that it’ll pass testing and is production-ready within a certain time limit. If it can do that, I’ll sit back down.

    • @jaazz90 says:

      The opposite incentive will also start ticking: people who are able to formulate precise specs to feed into a neural net will be sought after.

    • @irjonesy says:

      These arguments against AI are fascinating. They are getting more and more complex. Eventually we will say: “alright it’s fine and all but tell me when it’s able to run and optimize all technological systems on earth simultaneously”.
      This moving of the goal posts reminds me of the arguments for god that traditionally religious types have used for hundreds of years as scientific discoveries have opened the doors to the unknown.
      God is the weather.
      No, wait, god is the sun.
      Actually, it seems god is space.
      Hmm, I suppose god is dark energy? Quantum probability?

    • @MuhammadRaiyan135 says:

      @irjonesy Smart religious scholars have never argued God is the weather or dark space energy, mate. Without religion you have no objective morality, even if you worship science and “neoliberal notions of progress”.

    • @sirius-harry says:

      @MuhammadRaiyan135 That’s exactly the point: how do you define smart scholars? We literally have religious books which consider natural elements to be gods.
      Also, morality has nothing to do with religion. The world is full of religious folks who are pathetic human beings. Not saying being atheist makes you a good human either!
      But critical thinking is definitely helpful in being a considerate, reasonable and understanding person.

    • @John-d8p says:

      @MuhammadRaiyan135 🤣🤣🤣 Of course, it had to be a piss-full.

  • @pandoraeeris7860 says:

    We’re in the middle of the intelligence explosion.

    • @kodykendall says:

      Yep. Early version of it. The iteration loop is tightening — the next big milestone is when we remove the human from the loop.

    • @andybaldman says:

      Too bad society no longer cares about intelligence.

    • @wesley6442 says:

      If only we could figure out how we as humans intuitively grasp or understand a basic concept such as some of these puzzles, how we think about thinking and how we reason, and replicate that in a machine/AI, but faster and better

  • @JamesJohnson-iq5wb says:

    You know it’s good when the SimpleBench guy is impressed with OpenAI

    • @pik910 says:

      He has a name! It is AI Explained guy.

    • @wesley6442 says:

      I felt superior in that, yeah, it can test better than me, but I can pick up a ball and throw it, walk up stairs, etc. Then I saw a video of humanoid robots with pressure sensors on their hands achieving a high degree of dexterity... and it has me worried, it may do human better than me lol

  • @kyneticist says:

    imho we’re overfitting our expectations to the utility of benchmarks. AGI doesn’t need to be a true genius, or a genius in any number of fields, to be considered AGI. As a general intelligence, it “just” needs to be proficient at operating reliably and thoughtfully in the real world.

    • @carlosamado7606 says:

      I think there are very specific simple things it needs to do well that it doesn’t yet, one of which is better memory. Personalisation to the user will also be important. Also, for robots it would make sense to create a sense of awareness: not as in sentience, but knowing what its model is, the name given by the owner, info on things the owner sees as important, etc. It is possible with models now, but there is far too much hallucination still. But if we combine that with a dexterous enough robot, then yes, it is already very transformative on its own.
      Many jobs will get displaced by that alone.

    • @techrvl9406 says:

      Totally agree—benchmarks are a useful tool, but they might not capture the full essence of what makes intelligence “general.” It’s like measuring a chef’s skill by how well they bake bread—it tells you something, but not everything. AGI, as you said, isn’t about being a genius in every field; it’s about adaptability and reliability in the complexities of the real world.

      What I find fascinating is how we define “operating thoughtfully.” Should an AGI aim for practical problem-solving across domains, or do you think there’s room for it to develop something akin to intuition? Benchmarks are only one piece of the puzzle—how would you measure AGI’s ability to navigate uncharted territory, like moral dilemmas or cultural nuances?

    • @danielarvidsson3676 says:

      And be able to learn in real time. That is the big thing still missing in today’s AI systems.

    • @mb2776 says:

      @@danielarvidsson3676 Exactly! People don’t know that any AI needs to be stopped to be trained.

    • @PJ-hi1gz says:

      @@mb2776 You don’t need to stop it per se to retrain and redeploy a new version of the model.

  • @Shaunmcdonogh-shaunsurfing says:

    I don’t know what’s more impressive, o3 being announced so soon or the turnaround time of your coverage on it. Fantastic work.

    • @brianmi40 says:

      I’d go with o3 being announced SO VERY SOON after o1 simply went live, but I get the point.

    • @sth128 says:

      Later it’ll be revealed that AI Explained is actually being run by an ASI that achieved omnipotence and traveled back in time to guide AGI alignment with Simple Bench.

      Sadly the alignment is not with humanity. Instead it’s Roko’s Basilisk.

    • @Yobs2K says:

      @@sth128 you just got me thinking about Roko’s Basilisk having to kill itself if time travel is possible and it didn’t travel back in time to help create itself faster

    • @kecksbelit3300 says:

      It’s not soon; they just released o1 so late. o1’s training data is from 2023.

    • @0xunknown336 says:

      What’s more impressive is the compute cost per task being used by o3: $1k per single task!

  • @taumag says:

    When comparing the bell curves, the mean of AI is now higher than the mean of humanity. The Turing test was “can you tell the difference between a computer and a person?” Now, AGI is “can we create a test that the smartest human can pass but the dumbest computer can’t?”

    • @Mirror_Lotus says:

      It’s only natural for humans to move their goal posts. It’s one of the oldest plays in the book: Denial.

    • @memofromessex says:

      I don’t believe it’s capable of outthinking a human if the information is created by a human.

    • @cluelesssoldier says:

      @@memofromessex The difference now is that it is being taught HOW to think, not just WHAT to think – we already see low level emergent reasoning skills from current leading models, this takes it to a whole other level.

    • @samuctrebla3221 says:

      @@Mirror_Lotus The Turing test is a moving goalpost by definition.

    • @zvexevz says:

      The question was in fact “are there tests that the dumbest humans can easily pass but which the top AI models cannot?” As long as the answer is yes, we know we aren’t at AGI.

  • @noone-ld7pt says:

    Damn… OpenAI crushing ARC was not on my bingo card for 2024 (or even 2025). o1 was an impressive jump in performance, but o3 proves that the performance jump was not even the real point; it’s the completely paradigm-breaking ability to solve anything with an objectively correct answer. That feels like a profound change in potential, and I don’t really know how I feel about it.

    • @andybaldman says:

      You should feel worried.

    • @k14pc says:

      agree on everything

    • @joey199412 says:

      We now have definitive proof that human intelligence won’t be needed in 10 years’ time. What the implications and results of that will be, we don’t know. But the world is now permanently changed. This is a defining moment of our future as a species. I never said these words before, and I never thought about it like this, until it was proven right now that you can just scale up test-time compute to essentially answer any question you can reason the answer to.

    • @ClaimClam says:

      o3 is just a BS smokescreen to make up for the fact they can’t even ship GPT-5

    • @oranges557 says:

      @@ClaimClam People like you are super lame

  • @silpheedTandy says:

    An upside of a channel not using clickbaity titles is that when the title is as dramatic as just “wow”, you can trust that the content really is unusually impressive and maybe unexpected!

  • @MePeterNicholls says:

    We had “hold on to your papers” now we have “adjust your timelines!”

    • @somebody-anonymous says:

      We’re glad to inform you that your appointment at the biomatter recycling plant has been moved up to next Monday!

    • @fabp.2114 says:

      @@somebody-anonymous who recycles for whom. who made what for whom. what is?

    • @sebkeccu4546 says:

      As of today, Google Gemini Flash 2 remains the best model available, because o3 will only be available in February according to the livestream (the minis at the end of January). Google Gemini has actually been available since this week and outperforms o1-pro in quality but also in performance (time to compute). I’m really curious how o3 will compare to Gemini Flash 2 in benchmarks, especially in the qualitative tests, because the computing time of o3 surely looks slower than the new Gemini Flash 2

    • @somebody-anonymous says:

      @@fabp.2114 gotta make those stamps

    • @fabp.2114 says:

      @@somebody-anonymous ad astra et ultra

  • @spanke2999 says:

    If you want to summarize human existence, I guess it is the sentence “we really need to start focusing on safety…”

    • @OperationDarkside says:

      That’s what a couple with bad self-control usually says after the first baby, but we all know that it’ll be at least 2 more babies and their financial collapse before they take proper measures.

    • @wesley6442 says:

      I think of that song by Metallica, “Some Kind of Monster”, because here we are playing with a force we don’t fully comprehend, nor may we realize the depths of its capabilities, or even whether it is actively deceiving us... like, all the while, we were playing with a deadly grizzly bear when we thought it was a cute puppy

  • @dcgamer1027 says:

    I’m a little concerned about the power of the majority being lost. The reason we have to get along and live with each other, in part, is that 1 strong ape can still be beaten by 2 weaker apes working together. But what if that one ape has 1000 autonomous drones to defend itself? What if it has robots to create food and entertainment for itself?
    I’m not just concerned about what the AI will do if it is unsafe; I’m concerned about what some humans would do if they no longer need the rest of us.

  • @williamjmccartan8879 says:

    Thank you for sharing your time and work, Phillip. It’s been a crazy year, man. Merry Christmas to you and your family and any elves who might be assisting you, cheers

  • @matthewuzhere says:

    tbh i’m still pretty unconvinced by a lot of these benchmarks. they showed o1 being pretty smart too, but it really doesn’t seem to be able to have a conversation about its own answer, or recognize a repeated mistake it’s making over and over again, or adapt as the user’s needs/requests change. makes it feel like the model is not really getting more intelligent, just better at specific processes. or maybe a better way to put it: compared to the ideal of AGI, it’s still somewhat narrow intelligence, just with a lot more narrow intelligence in different domains. idk how you would do a benchmark that would quantify its ability to converse with a human in these subjects, correct its own mistakes, etc—maybe that’s just not measurable. but if it was, i suspect that’s where it would become much more obvious that these models are not AGI and are not even particularly close to it. you have talked a little about this idea in the past i think, but not a ton from what i remember. would love to hear your thoughts.

    full disclosure: i am only halfway through the vid, and i have not tried o3 myself. will update this comment if finishing the video changes what i think here majorly

    • @CoolIcingcake3467 says:

      By your definition, FrontierMath, SWE-bench Verified, etc. are flawed.
      We could even generalize with this reasoning that “all benchmarks are flawed”.

    • @odiseezall says:

      yeah, and the Sun could also... like... just switch off tomorrow

    • @Bolidoo says:

      @matthewuzhere Imo this lack of “common sense” is absolutely there, and it’s a significant limitation for its utility. I wouldn’t be surprised if o3 had similar limitations. But models are getting a lot better in that area. Hallucinations were far worse just a few years ago. Nowadays Sonnet 3.5 surprisingly gets 41.4% and o1-preview 41.7%. I think it’s a surprisingly hard task for LLMs, but progress suggests it may not be a hard wall after all.

    • @cherubin7th says:

      @@CoolIcingcake3467 Well, yes, “all benchmarks are flawed” is something everyone should know. Outside of AI we see great benchmark scores but poor real-world performance all the time.

  • @_Escaflowne_ says:

    No clickbait + proper factual coverage without overhype => subscribed
