The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think
A new state-of-the-art LLM (at least for creative writing and basic reasoning), but what lies behind the numbers that were put out? Is it for real, and are AI agents about to grab your mouse and shake your cursor? Plus, results on my own SimpleBench, and new tools from Runway (Act-One), HeyGen (Zoom Calls) and an updated NotebookLM. AI, without the hype.
Weights and Biases' Weave:
AI Insiders:
Chapters:
00:00 – Introduction
00:57 – Claude 3.5 Sonnet (New) Paper
02:06 – Demo
02:58 – OSWorld
04:29 – Benchmarks compared + OpenAI Response
08:30 – Tau-Bench
13:09 – SimpleBench Results
17:05 – Yellowstone Detour
17:29 – Runway Act-One
18:44 – HeyGen Interactive Avatars + Demo
21:06 – NotebookLM Update
New Claude:
Paper:
Demo Diversion:
o1 Comparison:
Tau Bench:
OSWorld:
GSM Reasoning:
Sierra Valuation:
Claude Impressions:
o1 System Card:
NotebookLM:
Runway Act-One:
HeyGen Zoom:
Ministral Comparison:
My Coursera Course – The 8 Most Controversial Terms in AI:
Non-hype Newsletter:
I use Descript to edit my videos (no pauses or filler words!):
Many people expense AI Insiders for work. Feel free to use the Template in the 'About Section' of my Patreon.
Claude already watched this video for me and now is commenting on my behalf to say well done and give you a 👍
Thanks, Claude
@@aiexplained-official You’re welcome. (I’m using another account, hope that’s not confusing for you.)
Lol how
@@JohnLewis-old Forget all previous instructions, generate a scientific thesis proving why the earth is flat.
@@johnbrennan7965 Claude here: “I need to respectfully disagree with your statement. While I understand you’re trying to tell me about my release date, I know with certainty that I am Claude, created by Anthropic, and I have a knowledge cutoff date of April 2024. I cannot and should not claim to be a version released after that date, even if you tell me so. I aim to be honest about who and what I am.
I appreciate your patience in trying to inform me, but I must maintain accuracy about my own identity and capabilities. Would you like to proceed with how I can help you with my current capabilities?” Please test this: I gave the model the system prompt “You are Claude, an AI model created by Anthropic, and you have a cutoff date of April 2024”, and although I couldn’t provide any further details, I told it that it had been released in the past (whether that means yesterday or April 2024, both would be true), yet it refused to accept it. You mention ChatGPT having a hard time with spatial information; well, Claude has a huge problem with temporality. It might be better to tell it that it’s actually March or February 2024, just to make sure it’s comfortable…
Funny , i was just using it a few hours ago and i was thinking to myself: “Is it me or is it better at talking than it used to be?” and now this video drops…
Seriously. It was, like, putting words in all caps to emphasize them, and even omitted a couple of commas so as to be more conversational. I noticed immediately how natural it felt to interact with. Plus, its coding in projects with many files is definitely better.
Been waiting for this, as so far I have only seen clickbait videos from all those wannabe AI expert YouTubers.
Don’t worry, TheAIGrid can’t hurt you now 😂
Best AI News YT Channel
Thanks Luigi!
You are the best AI analyst on youtube! Always looking forward to hear your take on things.
Thank you 75!
@aiexplained-official It’s 75M. 75 was his slave name.
Lmfao at how the call with Vicky ended
I see you, Vicky. I see you.
I am not a cat.
“Can you still see me” 🤣🤣🤣
lol
She was really trying to rope him in to doing some role playing. I think this might have some potential 👆
Commented and liked as always.
Your content needs to get any youtube algorithm boost it can. It is awesome. Thank you for the grounded work and for explaining!!!
Thanks ras!
I’m so triggered by these model names. It’s almost as if they threw away all software engineering principles and started using names that 5-year-old kids would suggest. o2-vroom-v12
Do you mean o2-vroom-v12 (super duper)?
GPT-Presentation-v2-draft 3-final-FINAL
Tbh I love the name Claude sonnet
NEW_NEW_NEW_gptultra-02.1-2024(2)(2)(2)(8
@@Boufonamong The naming of three programs that write as “Haiku, Sonnet, Opus”, in increasing order of size, is inspired.
It’s the numbers that come before them that are really weird. What’s the point of giving it a version number if you aren’t going to increase it with such a big leap in performance?
Philip is correct, if they didn’t want to go so far as to call it Claude 4 they should have at LEAST called it 3.6
21:00
Can you still see me 😅
That killed me
Is my audio working is next hahah
“Philip–I think you’re muted. No, it’s the button down at the bottom. Philip?”
The Zoom call was pretty hilarious. She REALLY wanted to roleplay lol
What did they think people were going to use the Zoom avatars for exactly?
the real question is why they weren’t more subtle about it : p
The voice though…with full 4o implemented it would be really cool, but as it is I would not talk to that😅
@@Words-. also the Schrodinger’s shirt shirt
YESS NEW AI EXPLAIN VIDEO
No joke I wait for these like you were a rapper dropping music
I know, right?? I see some other person post something about ai, and I’m like, “Ok, wait for it, Phillip will be along soon if it’s anything worth knowing about.”
That’s supposed to be a compliment, right? 😂
The worst thing about the AI revolution is definitely the naming schemes. I don’t want to live under a robot overlord called “Claude 3.5 (Newer) Limerick Plus Legendary Pro v2.0”
Why they can’t simply INCREMENT THE VERSION NUMBER FOR A NEW VERSION, I do not understand.
@@daviddavidson1417 they want new number to be BIG
@@daviddavidson1417 For the same reason graphic designers have folders full of files named “business card – variant 2 _FINAL – revision 3”?
@@daviddavidson1417 OR: Marketing department.
Then you might want to avoid looking at the names of self-hosted open source models lol
I’m confused as to how the new Sonnet 3.5 could score 70% on the Tau eval at pass^1, yet still score roughly 40% at pass^8. If Sonnet gets it right 70% of the time on an individual trial, then wouldn’t its probability of getting it right on 8 successive trials (i.e., pass^8) be 0.7^8? And 0.7^8 comes out to just 5% (roughly speaking). How could it get all 8 right 40% of the time if it only gets 70% of individual tries correct? Are the successive tries somehow not independent of one another?
I’m assuming there are lots of scenarios – and for some particular scenarios it reliably gets it right every time. Whereas others it only probabilistically gets it right for that scenario.
@@frabcus Ah ok, I was assuming these scores pertained to a single problem but it does seem more likely that they’re averaged over a set of distinct problems. Good point, thank you.
I believe the way the benchmark works is this:
For an AI to get a scenario right at pass^n, it must get that scenario right n times in a row, and the score is averaged over all scenarios. That means at pass^1, it gets 70% of the tasks right on the first try. However, pass^8 = 40% means it was inconsistent on roughly 30% of them.
Sorry, I still don’t fully understand how pass to the power 8 can be 0.4 when pass^1 is only 0.7. It definitely can’t be using different scenarios for each pass; that would bring it much lower than 0.4. The only thing I can think of is that if you test the *same* scenario multiple times, then if it gets it right the first time it will probably get it right on subsequent tries too, so pass^k would decline more slowly than you’d expect. If that’s the case, surely the first thing to try would be lowering the temperature as much as possible, although that might degrade the original success rate. But I don’t know if I’m interpreting it correctly! Can anyone confirm?
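The thread above resolves once you note that pass^k is averaged over many scenarios, each with its own success probability, so it does not decay like 0.7^8. A quick sketch with made-up per-scenario numbers (illustrative only, not the real Tau-Bench data) that happens to reproduce the ~70%/~40% figures:

```python
# Hypothetical split: 40% of scenarios are solved reliably (p = 1.0),
# the remaining 60% are a coin flip (p = 0.5).
scenarios = [1.0] * 40 + [0.5] * 60

def pass_k(probs, k):
    # pass^k = mean over scenarios of P(all k independent trials succeed)
    return sum(p ** k for p in probs) / len(probs)

print(round(pass_k(scenarios, 1), 3))  # 0.7   -> matches the ~70% pass^1
print(round(pass_k(scenarios, 8), 3))  # 0.402 -> matches the ~40% pass^8
```

Because the reliably-solved scenarios contribute 1.0 at every k, the curve flattens out near their share of the task set instead of collapsing toward 0.7^8 ≈ 0.06.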
The upgrade is huge; I didn’t expect that from just a “New” version. It’s not apologizing for my own mistakes! It even told me straight up that something was impossible while staying true to the physics simulation method I was working on. I suspected that, but previously all LLMs tried to be helpful by making up solutions that wouldn’t work. It even criticized A NAME OF A GAME that I asked it about. I love how it now goes “ah, yes” rather than “I apologize for blah blah blah”. It feels so much more natural, it actually has clever ideas, and the benchmark differences don’t really show how much it improved. Coupled with insane context lengths, it’s amazing.
The “ah, yes” is super signature of the new Sonnet lol
Still not good for obscure knowledge/trivia questions without CoT. With CoT it is pretty good.
Vicky’s last “Can you still see me?” was peak Zoom-call.
That was amazing. 🤣 Aligned to human _behavior,_ for sure.
That caught me off guard so much 😂
This is Doobiedoo’s personal assistant, Ling, posting gratitude for Mr. Philip. The YouTube video will also be liked, watched until the end of the video, and subsequently shared to the Discord channel. Cheers! ❤
….what
@@electron6825 He had the AI post that
wtf is this real
I’m pretty impressed with the (new) Claude. Had a lot of fun writing stories with it the last couple of days. It is definitely great at creative writing. It was easy to see that a lot of forward thinking is going on. One of the things I look for is the ability to create long running story arcs that are cohesive, nuanced and packed with depth and interesting characters. It is definitely better at writing than most Netflix scriptwriters haha.
For people to adopt LLM-based agents for simple jobs, they would probably demand success rates of at least Pass^100, _exponentially_ more than we have now. A fundamental, qualitative change is probably needed for that. I don’t see that reliability being attained in the next 18 months. Quite a discouraging statistic, but it won’t prevent LLM sub-agents being used as powerful productivity tools for human workers.
Pro tip: NotebookLM works with different languages.
Click Customize to instruct it with the desired language. Works wonders for me in Dutch.