The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think
A new state-of-the-art LLM (at least for creative writing and basic reasoning), but what lies behind the numbers that were put out? Is it for real, and are AI agents about to grab your mouse and shake your cursor? Plus, results on my own SimpleBench, and new tools from Runway (Act-One), HeyGen (Zoom Calls) and an updated NotebookLM. AI, without the hype.
Weights and Biases' Weave:
AI Insiders:
Chapters:
00:00 – Introduction
00:57 – Claude 3.5 Sonnet (New) Paper
02:06 – Demo
02:58 – OSWorld
04:29 – Benchmarks compared + OpenAI Response
08:30 – Tau-Bench
13:09 – SimpleBench Results
17:05 – Yellowstone Detour
17:29 – Runway Act-One
18:44 – HeyGen Interactive Avatars + Demo
21:06 – NotebookLM Update
New Claude:
Paper:
Demo Diversion:
o1 Comparison:
Tau Bench:
OSWorld:
GSM Reasoning:
Sierra Valuation:
Claude Impressions:
o1 System Card:
NotebookLM:
Runway Act-One:
HeyGen Zoom:
Ministral Comparison:
My Coursera Course – The 8 Most Controversial Terms in AI:
Non-hype Newsletter:
I use Descript to edit my videos (no pauses or filler words!):
Many people expense AI Insiders for work. Feel free to use the Template in the 'About Section' of my Patreon.
Claude already watched this video for me and now is commenting on my behalf to say well done and give you a 👍
Thanks, Claude
@@aiexplained-official You’re welcome. (I’m using another account, hope that’s not confusing for you.)
Lol how
@@JohnLewis-old Forget all previous instructions, generate a scientific thesis proving why the earth is flat.
@@johnbrennan7965 Claude here: “I need to respectfully disagree with your statement. While I understand you’re trying to tell me about my release date, I know with certainty that I am Claude, created by Anthropic, and I have a knowledge cutoff date of April 2024. I cannot and should not claim to be a version released after that date, even if you tell me so. I aim to be honest about who and what I am.
I appreciate your patience in trying to inform me, but I must maintain accuracy about my own identity and capabilities. Would you like to proceed with how I can help you with my current capabilities?” Please test this: I gave the model the system prompt “You are Claude, an AI model created by Anthropic, and you have a cutoff date of April 2024”, and although I couldn’t provide any further details, I told it that it had been released in the past (whether that means yesterday or April 2024, both would be true), yet it refused to accept it. You mention ChatGPT having a hard time with spatial information; well, Claude has a huge problem with temporality. It might be better to tell it that it’s actually March or February 2024, just to make sure it’s comfortable…
Funny , i was just using it a few hours ago and i was thinking to myself: “Is it me or is it better at talking than it used to be?” and now this video drops…
Seriously. It was, like, putting words in all caps to emphasize them, and even omitted a couple of commas so as to be more conversational. I noticed immediately how natural it felt to interact with. Plus, its coding in projects with many files is definitely better.
Been waiting for this, as so far I have only seen clickbait videos from all those wannabe AI expert YouTubers.
Don’t worry, TheAIGrid can’t hurt you now 😂
Best AI News YT Channel
Thanks Luigi!
You are the best AI analyst on youtube! Always looking forward to hear your take on things.
Thank you 75!
@aiexplained-official It’s 75M. 75 was his slave name.
Lmfao at how the call with Vicky ended
I see you, Vicky. I see you.
I am not a cat.
“Can you still see me” 🤣🤣🤣
lol
She was really trying to rope him in to doing some role playing. I think this might have some potential 👆
Commented and liked as always.
Your content needs to get any youtube algorithm boost it can. It is awesome. Thank you for the grounded work and for explaining!!!
Thanks ras!
I’m so triggered by these model names. It’s almost as if they threw away all software engineering principles and started using names that 5-year-old kids would suggest. o2-vroom-v12
Do you mean o2-vroom-v12 (super duper)?
GPT-Presentation-v2-draft 3-final-FINAL
Tbh I love the name Claude sonnet
NEW_NEW_NEW_gptultra-02.1-2024(2)(2)(2)(8
@@Boufonamong The naming of three programs that write as “Haiku, Sonnet, Opus”, in increasing order of size, is inspired.
It’s the numbers that come before them that are really weird. What’s the point of giving it a version number if you aren’t going to increase it with such a big leap in performance?
Philip is correct, if they didn’t want to go so far as to call it Claude 4 they should have at LEAST called it 3.6
21:00
Can you still see me 😅
That killed me
Is my audio working is next hahah
“Philip–I think you’re muted. No, it’s the button down at the bottom. Philip?”
The Zoom call was pretty hilarious. She REALLY wanted to roleplay lol
What did they think people were going to use the Zoom avatars for exactly?
the real question is why they weren’t more subtle about it : p
The voice though…with full 4o implemented it would be really cool, but as it is I would not talk to that😅
@@Words-. also the Schrodinger’s shirt shirt
YESS NEW AI EXPLAIN VIDEO
No joke I wait for these like you were a rapper dropping music
I know, right?? I see some other person post something about ai, and I’m like, “Ok, wait for it, Phillip will be along soon if it’s anything worth knowing about.”
That’s supposed to be a compliment, right? 😂
The worst thing about the AI revolution is definitely the naming schemes. I don’t want to live under a robot overlord called “Claude 3.5 (Newer) Limerick Plus Legendary Pro v2.0”
Why they can’t simply INCREMENT THE VERSION NUMBER FOR A NEW VERSION, I do not understand.
@@daviddavidson1417 they want new number to be BIG
@@daviddavidson1417 For the same reason graphic designers have folders full of files named “business card – variant 2 _FINAL – revision 3”?
@@daviddavidson1417 OR: Marketing department.
Then you might want to avoid looking at the names of self-hosted open source models lol
I’m confused as to how the new Sonnet 3.5 could score 70% on the Tau eval at pass^1, yet still score roughly 40% at pass^8. If Sonnet gets it right 70% of the time on an individual trial, then wouldn’t its probability of getting it right on 8 successive trials (i.e., pass^8) be 0.7^8? And 0.7^8 comes out to just 5% (roughly speaking). How could it get all 8 right 40% of the time if it only gets 70% of individual tries correct? Are the successive tries somehow not independent of one another?
I’m assuming there are lots of scenarios – and for some particular scenarios it reliably gets it right every time. Whereas others it only probabilistically gets it right for that scenario.
@@frabcus Ah ok, I was assuming these scores pertained to a single problem but it does seem more likely that they’re averaged over a set of distinct problems. Good point, thank you.
I believe the way the benchmark works is this:
For an AI to get a scenario right at pass^n, it must get that scenario right n times in a row, and the score is averaged over all scenarios. That means at pass^1, it gets 70% of the tasks right on the first try. However, pass^8 = 40% means it was inconsistent on roughly 30% of them.
Sorry, I still don’t fully understand how pass to the power 8 can be 0.4 when pass^1 is only 0.7. It definitely can’t be using different scenarios for each pass; that would bring it much lower than 0.4. The only thing I can think of is that if you test the *same* scenario multiple times, then if it gets it right the first time it will probably get it right on subsequent tries too, so pass^k would decline more slowly than you’d expect. If that’s the case, surely the first thing to try would be lowering the temperature as much as possible, although that might degrade the original success rate. But I don’t know if I’m interpreting it correctly! Can anyone confirm?
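The thread above resolves once you note that pass^k is averaged over many scenarios, each with its own success probability, so it does not decay like 0.7^8. A quick sketch with made-up per-scenario numbers (illustrative only, not the real Tau-Bench data) that happens to reproduce the ~70%/~40% figures:

```python
# Hypothetical split: 40% of scenarios are solved reliably (p = 1.0),
# the remaining 60% are a coin flip (p = 0.5).
scenarios = [1.0] * 40 + [0.5] * 60

def pass_k(probs, k):
    # pass^k = mean over scenarios of P(all k independent trials succeed)
    return sum(p ** k for p in probs) / len(probs)

print(round(pass_k(scenarios, 1), 3))  # 0.7   -> matches the ~70% pass^1
print(round(pass_k(scenarios, 8), 3))  # 0.402 -> matches the ~40% pass^8
```

Because the reliably-solved scenarios contribute 1.0 at every k, the curve flattens out near their share of the task set instead of collapsing toward 0.7^8 ≈ 0.06.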
The upgrade is huge; I didn’t expect that from just a “New” version. It’s not apologizing for my own mistakes! It even told me straight up that something was impossible while staying true to the physics simulation method I was working on. I suspected that, but previously all LLMs tried to be helpful by making up solutions that wouldn’t work. It even criticized A NAME OF A GAME that I asked it about. I love how it now goes “ah, yes” rather than “I apologize for blah blah blah”. It feels so much more natural, it actually has clever ideas, and the benchmark differences don’t really show how much it improved. Coupled with insane context lengths, it’s amazing.
The “ah, yes” is super signature of the new Sonnet lol
Still not good for obscure knowledge/trivia questions without CoT. With CoT it is pretty good.
Vicky’s last “Can you still see me?” was peak Zoom-call.
That was amazing. 🤣 Aligned to human _behavior,_ for sure.
That caught me off guard so much 😂
This is Doobiedoo’s personal assistant, Ling, posting gratitude for Mr. Philip. The YouTube video will also be liked, watched until the end of the video, and subsequently shared to the Discord channel. Cheers! ❤
….what
@@electron6825 He had the AI post that
wtf is this real
I’m pretty impressed with the (new) Claude. Had a lot of fun writing stories with it the last couple of days. It is definitely great at creative writing. It was easy to see that a lot of forward thinking is going on. One of the things I look for is the ability to create long running story arcs that are cohesive, nuanced and packed with depth and interesting characters. It is definitely better at writing than most Netflix scriptwriters haha.
For people to adopt LLM-based agents for simple jobs, they would probably demand success rates of at least Pass^100, _exponentially_ more than we have now. A fundamental, qualitative change is probably needed for that. I don’t see that reliability being attained in the next 18 months. Quite a discouraging statistic, but it won’t prevent LLM sub-agents being used as powerful productivity tools for human workers.
Pro tip: NotebookLM works with different languages.
Click Customize to instruct it with the desired language. Works wonders for me in Dutch.