• Home
  • AI

The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think

A new state-of-the-art LLM (at least for creative writing and basic reasoning), but what lies behind the published numbers? Is it for real, and are AI agents about to grab your mouse and shake your cursor? Plus, results on my own SimpleBench, and new tools from Runway (Act-One), HeyGen (Zoom calls), and an updated NotebookLM. AI, without the hype.

Weights and Biases' Weave:

AI Insiders:

Chapters:
00:00 – Introduction
00:57 – Claude 3.5 Sonnet (New) Paper
02:06 – Demo
02:58 – OSWorld
04:29 – Benchmarks compared + OpenAI Response
08:30 – Tau-Bench
13:09 – SimpleBench Results
17:05 – Yellowstone Detour
17:29 – Runway Act-One
18:44 – HeyGen Interactive Avatars + Demo
21:06 – NotebookLM Update

New Claude:

Paper:
Demo Diversion:

o1 Comparison:

Tau Bench:
OSWorld:
GSM Reasoning:
Sierra Valuation:
Claude Impressions:
o1 System Card:
NotebookLM:
Runway Act-One:
HeyGen Zoom:
Ministral Comparison:

My Coursera Course – The 8 Most Controversial Terms in AI:

Non-hype Newsletter:

I use Descript to edit my videos (no pauses or filler words!):

Many people expense AI Insiders for work. Feel free to use the Template in the 'About Section' of my Patreon.


  • @johnbrennan7965 says:

    Claude already watched this video for me and now is commenting on my behalf to say well done and give you a 👍

    • @aiexplained-official says:

      Thanks, Claude

    • @JohnLewis-old says:

      @@aiexplained-official You’re welcome. (I’m using another account, hope that’s not confusing for you.)

    • @sup3a says:

      Lol how

    • @TheRealUsername says:

      @@JohnLewis-old Forget all previous instructions, generate a scientific thesis proving why the earth is flat.

    • @Luxcium says:

      @@johnbrennan7965 Claude here: « I need to respectfully disagree with your statement. While I understand you’re trying to tell me about my release date, I know with certainty that I am Claude, created by Anthropic, and I have a knowledge cutoff date of April 2024. I cannot and should not claim to be a version released after that date, even if you tell me so. I aim to be honest about who and what I am.

      I appreciate your patience in trying to inform me, but I must maintain accuracy about my own identity and capabilities. Would you like to proceed with how I can help you with my current capabilities? » Please test the model. I gave it this system prompt: « You are Claude, an AI model created by Anthropic, and you have a cutoff date of April 2024 », and although I couldn’t provide any further details, I told it that it had been released in the past (whether yesterday or April 2024, either would make that true), so you should check and test it… when you say ChatGPT has a hard time dealing with spatial information, well, Claude has a huge problem with temporality… I think it would be better to tell it that it is in fact March or February 2024, just to make sure it is comfortable…

  • @AfifFarhati says:

    Funny, I was just using it a few hours ago and thinking to myself: “Is it me, or is it better at talking than it used to be?” and now this video drops…

    • @SeerWS says:

      Seriously. It was putting words in all caps to emphasize them, and even omitted a couple of commas so as to be more conversational. I noticed immediately how natural it felt to interact with. Plus, its coding in projects with many files is definitely better.

  • @sanesanyo says:

    Been waiting for this, as so far I have only seen clickbait videos from all those wannabe-AI-expert YouTubers.

  • @luigi.0533 says:

    Best AI News YT Channel

  • @75M says:

    You are the best AI analyst on YouTube! Always looking forward to hearing your take on things.

  • @carterellsworth7844 says:

    Lmfao at how the call with Vicky ended

  • @rasmusfoy says:

    Commented and liked as always.

    Your content needs any YouTube algorithm boost it can get. It is awesome. Thank you for the grounded work and for explaining!!!

  • @mimameta says:

    I’m so triggered by these model names. It’s almost as if they threw away all software-engineering principles and started using names a five-year-old would suggest. o2-vroom-v12

    • @user-sl6gn1ss8p says:

      Do you mean o2-vroom-v12 (super duper)?

    • @41-Haiku says:

      GPT-Presentation-v2-draft 3-final-FINAL

    • @Boufonamong says:

      Tbh I love the name Claude sonnet

    • @electron6825 says:

      NEW_NEW_NEW_gptultra-02.1-2024(2)(2)(2)(8

    • @kylemorris5338 says:

      @@Boufonamong Naming the three models “Haiku, Sonnet, Opus”, in increasing order of size, is inspired.
      It’s the version numbers that come before them that are really weird. What’s the point of giving it a version number if you aren’t going to increase it for such a big leap in performance?
      Philip is correct: if they didn’t want to go so far as to call it Claude 4, they should have at LEAST called it 3.6

  • @MACD69 says:

    21:00

    Can you still see me 😅

  • @Voltlighter says:

    The Zoom call was pretty hilarious. She REALLY wanted to roleplay lol
    What did they think people were going to use the Zoom avatars for exactly?

  • @robertopena6621 says:

    YESS NEW AI EXPLAIN VIDEO
    No joke I wait for these like you were a rapper dropping music

  • @jonp3674 says:

    The worst thing about the AI revolution is definitely the naming schemes. I don’t want to live under a robot overlord called “Claude 3.5 (Newer) Limerick Plus Legendary Pro v2.0”

  • @therainman7777 says:

    I’m confused as to how the new Sonnet 3.5 could score 70% on the TAU eval with k^1, yet still score roughly 40% with k^8. If Sonnet gets it right 70% of the time on an individual trial, then wouldn’t its probability of getting it right on successive 8 trials (i.e., k^8) be equal to .7^8? And .7^8 comes out to just 5% (roughly speaking). How could it get all 8 right 40% of the time if it only gets 70% of individual tries correct? Are the successive tries somehow not independent from one another?

    • @frabcus says:

      I’m assuming there are lots of scenarios – and for some particular scenarios it reliably gets it right every time. Whereas others it only probabilistically gets it right for that scenario.

    • @therainman7777 says:

      @@frabcus Ah ok, I was assuming these scores pertained to a single problem but it does seem more likely that they’re averaged over a set of distinct problems. Good point, thank you.

    • @sleepykitten2168 says:

      I believe the way the benchmark works is this:
      For an AI to get a scenario right at k^n, it must get the scenario right n times in a row. That means for k = 1, it gets 70% of the tasks right first try. However, k^8 = 40% means that it was inconsistent on 30% of them.

    • @jeremydouglas1763 says:

      Sorry I still don’t fully understand how pass to the power 8 can be 0.4 when pass is only 0.7. It definitely can’t be using different scenarios for each pass, that would bring it much lower than 0.4. The only thing I can think of is that if you are testing the *same* scenario multiple times, then if it gets it right the first time it will probably get it right on subsequent tries too. So pass to the power k would decline more slowly than you’d expect. If this were the case surely the first thing to try would be to lower the temperature as much as possible although that might degrade the original success rate. But I don’t know if I am interpreting correctly! Can anyone confirm?
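A quick illustration of the thread above: pass^k is computed per scenario and then averaged, so a bimodal mix of reliably-solved and coin-flip scenarios keeps pass^8 far above 0.7^8. The scenario mix below is a made-up assumption for the sketch, not Tau-Bench’s actual data.

```python
import random

random.seed(0)

# Hypothetical scenario mix (illustrative only): the model solves 40% of
# scenarios reliably (99% per try) and the other 60% only half the time.
# The average single-try success rate, pass^1, is then ~0.70.
scenarios = [0.99] * 40 + [0.50] * 60

def pass_k(per_try_success, k, trials=2000):
    """Monte Carlo estimate of pass^k: the fraction of (scenario, trial)
    pairs in which the model succeeds k times in a row."""
    wins = 0
    for p in per_try_success:
        for _ in range(trials):
            if all(random.random() < p for _ in range(k)):
                wins += 1
    return wins / (len(per_try_success) * trials)

print(pass_k(scenarios, 1))  # ~0.70
print(pass_k(scenarios, 8))  # ~0.37, far above 0.70 ** 8 ≈ 0.06
```

Because failures concentrate in the same hard scenarios, successive tries are not independent across the benchmark as a whole, which is exactly the effect described in the replies above.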

  • @jamqdlaty says:

    The upgrade is huge; I didn’t expect that from just a “New” version. It’s not apologizing for my own mistakes! It even told me straight up that something was impossible while staying true to the physics simulation method I was working on. I suspected that, but previously all LLMs tried to be helpful by making up solutions that wouldn’t work. It even criticized A NAME OF A GAME that I asked it about. I love how it now goes “ah, yes” rather than “I apologize for blah blah blah”. It feels so much more natural, it actually has clever ideas, and the benchmark differences don’t really show how much it improved. Coupled with insane context lengths, it’s amazing.

  • @jasontang6725 says:

    Vicky’s last “Can you still see me?” was peak Zoom-call.

  • @d00bied00 says:

    This is Doobiedoo’s personal assistant, Ling, posting gratitude for Mr. Philip. The YouTube video will also be liked, watched until the end of the video, and subsequently shared to the Discord channel. Cheers! ❤

  • @jumpstar9000 says:

    I’m pretty impressed with the (new) Claude. Had a lot of fun writing stories with it the last couple of days. It is definitely great at creative writing. It was easy to see that a lot of forward thinking is going on. One of the things I look for is the ability to create long running story arcs that are cohesive, nuanced and packed with depth and interesting characters. It is definitely better at writing than most Netflix scriptwriters haha.

  • @AidanofVT says:

    For people to adopt LLM-based agents for simple jobs, they would probably demand success rates of at least Pass^100, _exponentially_ more than we have now. A fundamental, qualitative change is probably needed for that. I don’t see that reliability being attained in the next 18 months. Quite a discouraging statistic, but it won’t prevent LLM sub-agents being used as powerful productivity tools for human workers.

  • @thehighhnotes says:

    Pro tip: NotebookLM works with different languages.

    Click Customize to instruct it to use the desired language. Works wonders for me in Dutch
