o3 – wow
o3 isn’t one of the biggest developments in AI in 2+ years because it beats a particular benchmark. It’s one because it demonstrates a reusable technique through which almost any benchmark could fall, and at short notice. I’ll cover all the highlights, the benchmarks broken, and what comes next. Plus, the costs OpenAI didn’t want us to know, Genesis, ARC-AGI 2, Gemini-Thinking, and much more.
AI Insiders ($9!):
FrontierMath:
Chollet Statement:
MLC Paper:
AlphaCode 2:
Human Performance on ARC-AGI:
Wei Tweet ‘3 months’:
Deliberative Alignment Paper:
Brown Safety Tweet:
Swe-Bench Verified:
Amodei Prediction:
David Dohan: 16 hours
OpenAI Personal Writing:
John Hallman Tweet:
00:00 – Introduction
01:19 – What is o3?
03:18 – FrontierMath
05:15 – o4, o5
06:03 – GPQA
06:24 – Coding, Codeforces + SWE-verified, AlphaCode 2
08:13 – 1st Caveat
09:03 – Compositionality?
10:16 – SimpleBench?
13:11 – ARC-AGI, Chollet
20:25 – Safety Implications
AI Insiders:
Non-hype Newsletter:
Podcast:
You already know I was waiting for this video with my eye twitching
He’s like the only guy in this space who doesn’t use clickbait titles. Everyone else just flat out states that OpenAI created AGI, even though no one actually thinks that.
@@JonnyCook The title to this video is literally “wow”. That’s a textbook definition of clickbait. It’s the reason I clicked it. What are you talking about, my man?
@@everythingiscoolio It’s wow because he was actually (for the first time) wowed. Not to make more people click.
@@joey199412 Right…. I’m glad you believe that.
@@everythingiscoolio Depends whether you judge clickbait by how fairly it’s used or not.
I would, at the least, only criticise something as clickbait if the video doesn’t deliver on its promise.
To think this is the worst the models will ever be. ChatGPT only released 2 years ago, now we are talking about it scoring among top experts in nearly every field. Insane
well actually ChatGPT is largely based on GPT-3, and GPT-3 was trained about 4 and a half years ago. But that’s not very different ig
They pay people all day to answer questions.. it’s not THAT genius yet..
@@johanavril1691 What? GPT-1 was trained in 2017, after Google developed the transformer. Btw the same logic/architecture has existed since 1965.
“scoring near top experts in every field” is quite an exaggeration, but yeah that’s still impressive
@@merlinwarage hm yeah I see what you mean, but the technology behind ChatGPT is GPT-3, and it did take 4 years to get from that to what we have today. But also you’re right and I’m right too, just a lil different
I’ve been waiting so hard for this video. What a turnaround time
I’ve seen comments like this. If you already watched the OpenAI video about o3, just what are you so anxious to watch someone’s recap for?
@@codycast Besides the cursory argument that any company’s promo content is inherently biased, third-party analysis is a critical part of science. And I think this channel is great at that.
Genuinely what is happening anymore. We are heading towards the most crazy and important year in human history. What a time to be alive, wow
yeah, we are heading into neo-feudalism, human trafficking on an enormous scale, the eradication of brain usage, extremely low quality of life and a spike in violence. Yeah! What a time to be alive!
The first day of the new year is a wednesday, the next thursday and then friday, of course: WTF!
Maybe something is cooking, lol
It’s also kind of horrifying. It’s easy to imagine a utopian ending to AGI suddenly existing, but I don’t want to live through the time between now and that ending. I feel like the only jobs humans will be qualified for are manual labor, and there’s a lot of people with a huge stake in the current economic system sticking around.
I’d rather be alive 30 years ago
And yet… Nothing really has changed still. Shops, businesses, governments… Nothing is different.
The future is looking so immensely interesting. What a time to be alive!
Hold on to your papers! 🧻
Wait… aren’t you the Forsen comment troll 😶🌫️😶🌫️
Interesting is one way to put it.
*This is SO fucking boring. What the hell? This was supposed to be the future. Where is the teleportation and levitation and telekinesis? This is nonsense!!*
Frightening.
You know, I find it absurd that “Open”AI and other proprietary AI labs get to benefit from open-source research and publications while offering, relatively speaking, little to nothing in return. It’s an unfair advantage. Reminds me of the game “Werewolf” by Davidoff, where an informed minority almost always wins against the uninformed majority.
And yet OpenAI gave the world free access to ChatGPT for 4 months before even the first competitor showed up, and they’ve done the same with each advancing model since then.
And I think you could find a lot of AI researchers who would agree that OpenAI simply REVEALING o1 Preview when they did, and what/how it did from a surface perspective, inspired a whole lot of new research.
Methinks thou judgest too harshly.
How much more do you want them to give?
I think they are very open and fast in releases for being close source companies.
@@okaydetar821 Everything. Most if not all advanced technology today was paid for by our taxes. It was built using us as research. It was refined by the colleges and infrastructure we produce and sustain. It is our right
I don’t think they have infinite money so yeah
The real benchmark is real world examples. Remember: Current benchmarks are like a laboratory testing ground. Yes, the questions asked might be real world examples, but they will be written in a way that’s clear and state actual objectives/goals. A sterile set of hard but clear questions.
The real world is different. If I get a well written objective from a stakeholder for me to implement, it’ll be a first. The actual world is nitty gritty. Full of nuances, filled with human error and everything needs refinement first.
Therefore, my benchmark: read a typical agile development user story written by some key user or stakeholder and try to implement it in such a fashion that it’ll pass testing and be production ready within a certain time limit. If it can do that, I’ll sit back down.
The opposite incentive will also start ticking: people who are able to formulate precise specs to feed into a neural net will be sought after.
These arguments against AI are fascinating. They are getting more and more complex. Eventually we will say: “alright it’s fine and all but tell me when it’s able to run and optimize all technological systems on earth simultaneously”.
This moving of the goal posts reminds me of the arguments for god that traditionally religious types have used for hundreds of years as scientific discoveries have opened the doors to the unknown.
God is the weather.
No, wait, god is the sun.
Actually, it seems god is space.
Hmm, I suppose god is dark energy? Quantum probability?
Smart religious scholars have never argued God is the weather or dark space energy, mate. Without religion you have no objective morality, even if you worship science and “neoliberal notions of progress” @irjonesy
@MuhammadRaiyan135 that’s exactly the point, how do you define smart scholars? We literally have religious books which consider natural elements as gods.
Also, morality has nothing to do with religion. The world is full of religious folks who are pathetic human beings. Not saying being atheist makes you a good human either!
But critical thinking is definitely helpful in being a considerate, reasonable and understanding person.
@MuhammadRaiyan135 🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣 Of course, it had to be a piss-full.
We’re in the middle of the intelligence explosion.
Yep. Early version of it. Iteration loop is tightening — the next big milestone is when we remove the human in the loop.
Too bad society no longer cares about intelligence.
If only we could figure out how we as humans intuitively grasp or understand a basic concept such as some of these puzzles, how we think about thinking, how we reason, and replicate that in a machine/AI but faster and better
You know it’s good when the simple bench guy is impressed with OpenAI
He has a name! It is AI Explained guy.
I felt superior in that, yeah, it can test better than me, but I can pick up a ball and throw it, walk up stairs, etc.. Then I saw a video of robotic humanoids with pressure sensors on their hands achieving a high degree of dexterity.. and it has me worried, it may do human better than me lol
imho we’re overfitting our expectations to the utility of benchmarks. AGI doesn’t need to be a true genius, or a genius in any number of fields to be considered AGI – As a general intelligence, it “just” needs to be proficient at operating reliably and thoughtfully in the real world.
I think there are very specific simple things it needs to do well that it doesn’t yet. One of which being better memory. Personalisation to the user will also be important. Also, for robots it would make sense to create a sense of awareness. Not as in sentience, but knowing what its model is, the name given by the owner, info on important things the owner sees as important, etc. It is possible with models now, but there is far too much hallucination still. But if we combine that with a dexterous enough robot, then yes, it is already very transformative on its own.
Many jobs will get displaced by that alone.
Totally agree—benchmarks are a useful tool, but they might not capture the full essence of what makes intelligence “general.” It’s like measuring a chef’s skill by how well they bake bread—it tells you something, but not everything. AGI, as you said, isn’t about being a genius in every field; it’s about adaptability and reliability in the complexities of the real world.
What I find fascinating is how we define “operating thoughtfully.” Should an AGI aim for practical problem-solving across domains, or do you think there’s room for it to develop something akin to intuition? Benchmarks are only one piece of the puzzle—how would you measure AGI’s ability to navigate uncharted territory, like moral dilemmas or cultural nuances?
And be able to learn in real time. That is the big thing still missing in todays AI systems.
@@danielarvidsson3676 exactly! People don’t know that any AI needs to be stopped to train.
@@mb2776you don’t need to stop it per se to retrain and redeploy a new version of the model.
I don’t know what’s more impressive, o3 being announced so soon or the turnaround time of your coverage on it. Fantastic work.
Imagine, with o3 announced SO VERY SOON after o1 simply went live, but I get the point.
Later it’ll be revealed that AI Explained is actually being run by an ASI that achieved omnipotence and traveled back in time to guide AGI alignment with Simple Bench.
Sadly the alignment is not with humanity. Instead it’s Roko’s Basilisk.
@@sth128 you just got me thinking about Roko’s Basilisk having to kill itself if time travel is possible and it didn’t travel back in time to help create itself faster
it’s not soon, they just released o1 so late; o1’s training data is from 2023
What’s more impressive is the compute cost per task being used by o3: 1k USD per single task!
When comparing the bell curves, the mean of AI is now higher than the mean of humanity. The Turing test was “can you tell the difference between a computer and a person?” Now, AGI is “can we create a test that the smartest human can pass but the dumbest computer can’t?”
It’s only natural for humans to move their goal posts. It’s one of the oldest plays in the book: Denial.
I don’t believe it’s capable of outthinking a human if the information is created by a human.
@@memofromessex The difference now is that it is being taught HOW to think, not just WHAT to think – we already see low level emergent reasoning skills from current leading models, this takes it to a whole other level.
@@Mirror_Lotus the Turing test is a moving goalpost by definition.
The question was in fact “are there tests that the dumbest humans can easily pass but which the top AI models cannot?” As long as the answer is yes, then we know we aren’t at AGI.
Damn… OpenAI crushing ARC was not on my bingo card for 2024 (or even 2025). o1 was an impressive jump in performance, but o3 proves that the performance jump was not even the real point; it’s the completely paradigm breaking ability of being able to solve anything with an objectively correct answer. That feels like a profound change in potential and I don’t really know how I feel about it.
You should feel worried.
agree on everything
We now have definitive proof that human intelligence won’t be needed in 10 years time. What the implications and results of that will be, we don’t know. But the world is now permanently changed. This is a defining moment of our future as a species. I never said these words before and I never thought about it like this, until it was proven right now that you can just scale up test time compute to essentially answer any question you can reason the answer to.
o3 is just a BS smokescreen to make up for the fact they can’t even ship GPT-5
@@ClaimClam people like you are super lame
an upside of a channel not using clickbaity titles is that when the title is as dramatic as just “wow”, you can trust that the contents really are unusually impressive and maybe unexpected!
Matthew Berman take note
@@not_a_sp00k glad someone else feels that way too bro it’s actually ridiculous
@@not_a_sp00k lmao he’s gotten click happy
Yeah, unlike delusional channels going like ‘oMg AgI aChIeVeD!!!’
I know. As soon as I saw that title on this channel, I paid attention.
We had “hold on to your papers” now we have “adjust your timelines!”
We’re glad to inform you that your appointment at the biomatter recycling plant has been moved up to next Monday!
@@somebody-anonymous who recycles for whom. who made what for whom. what is?
As of today, Google Gemini Flash 2 remains the best model available, because o3 will only be available in February according to the livestream (the minis at the end of January). Google Gemini has actually been available since this week and outperforms o1-pro in quality but also in performance (time to compute). I’m really curious how o3 will compare to Gemini Flash 2 in benchmarks, especially the qualitative tests, because the computing time of o3 surely looks slower than the new Gemini Flash 2
@@fabp.2114 gotta make those stamps
@@somebody-anonymous ad astra et ultra
if you want to summarize human existence, I guess it is the sentence “we really need to start focusing on safety…”
That’s what a couple with bad self-control usually says after the first baby, but we all know that it’ll be at least 2 more babies and their financial collapse, before they take proper measures.
I think of that song by Metallica, “Some Kind of Monster”, because here we are playing with a force we don’t fully comprehend, nor may we realize the depths of its capabilities, or even whether it is actively deceiving us.. like, all the while, we were playing with a deadly grizzly bear that we thought was a cute puppy
I’m a little concerned about the power of the majority being lost. The reason we have to get along and live with each other, in part, is that 1 strong ape can still be beaten by 2 weaker apes working together. But what if that one ape has 1000 autonomous drones to defend itself? What if it has robots to create food and entertainment for itself?
I’m not just concerned about what the AI will do if it is unsafe, I’m concerned with what some humans would do if they no longer need the rest of us.
I quite like the idea that the strong no longer need to live with the weak and the jealous
Beautiful story about that written by Cixin Liu: https://en.wikipedia.org/wiki/For_the_Benefit_of_Mankind
*This* is the AI safety problem, and there’s a reason that elites don’t talk about it. The whole trajectory of this paradigm shift is toward genocide.
@@tsiryoliva6636 Cool, bro. Hope the leopards don’t eat your face
Thank you for sharing your time and work, Phillip, it’s been a crazy year man. Merry Christmas to you and your family and any elves who might be assisting you, cheers
Merry Christmas Bill, and everyone
tbh i’m still pretty unconvinced by a lot of these benchmarks. they showed o1 being pretty smart too, but it really doesn’t seem to be able to have a conversation about its own answer or recognize a repeated mistake it’s making over and over again. or to adapt as the user’s needs/requests change. makes it feel like the model is not really getting more intelligent, just better at specific processes. or maybe a better way to put it: compared to the ideal of AGI, it’s still somewhat narrow intelligence, it just has a lot more narrow intelligence in different domains. idk how you would do a benchmark that would quantify its ability to converse with a human in these subjects, correct its own mistakes, etc—maybe that’s just not measurable. but if it was, i suspect that’s where it would become much more obvious that these models are not AGI and are not even particularly close to it. you have talked a little about this idea in the past I think, but not a ton from what i remember. would love to hear your thoughts.
full disclosure: i am only halfway through the vid, and i have not tried o3 myself. will update this comment if finishing the video changes what i think here majorly
by your definition, FrontierMath, SWE-bench Verified, etc… are flawed.
we could even generalize with this reasoning that ‘all benchmarks are flawed’,
yeah and the Sun could also.. like.. just switch off tomorrow
@matthewuzhere Imo this lack of ‘common sense’ is absolutely there, and it’s a significant limitation for its utility. I wouldn’t be surprised if o3 had similar limitations. But models are getting a lot better in that area. Hallucinations were far worse just a few years ago. Nowadays Sonnet 3.5 surprisingly gets 41.4% and o1-preview 41.7%. I think it’s a surprisingly hard task for LLMs, but progress suggests it may not be a hard wall after all.
@@CoolIcingcake3467 Well, yes, ‘all benchmarks are flawed’ is something everyone should know. Outside of AI we see great benchmark scores but poor real-world performance all the time.
No clickbait + proper factual coverage without overhype => subscribed