
Llama 405b: Full 92 page Analysis, and Uncontaminated SIMPLE Benchmark Results

Llama 3.1 is here, and if anything, its paper is even more impressive. It’s as if Meta wants to reveal the secret sauce of LLMs. I go through the highlights of all 92 pages and test Llama 405B on SIMPLE Bench, my new private, vetted general-intelligence benchmark, against GPT-4o and Turbo, Gemini 1.5 Pro, and Claude 3.5 Sonnet.

Weights and Biases Link:  

AI Insiders:

Llama 3.1 Paper:

Llama 405B Web Release:

Zuckerberg Money Run Out Interview:

Zuckerberg Llama 4:

OpenAI Losses:

Nvidia Nims:

Open Source Definition:

Data Disappearing:

Infinite Bench:

ARC-AGI Challenge:

Adversarial Examples:

‘Systemic Risk’:

Zuckerberg-Hawley Letter:

Groq Demo:

Altman Letter:

Lmsys Response:

Scale Private Leaderboard:

Non-hype Newsletter:

GenAI Hourly Consulting:

Joe Lilli
 

  • @rickandelon9374 says:

    Best Artificial Intelligence reporter on the planet. Period.

  • @Lishtenbird says:

3:31 Companies like Reddit may not have had permission to sell “their” data either.

    • @YeeLeeHaw says:

The entire data-ownership thing is completely bonkers to begin with. Imagine if we charged each other in real life for all the information we give away in everyday conversations.

  • @Radicoly says:

> Meta drops a 90-page, 12-hour-long manifesto of dense, technical computer-science literature

    > AI Explained 10^-100000ths of a second later. “So I read the whole thing. Here’s an entire twenty minute video essay on it.”

    How do we know YOU aren’t the AI?

  • @Neomadra says:

It’s gonna be a busy week. Mistral Large 2 just released with 123B params, and it’s supposedly almost on par with Llama 3.1 405B.

  • @novachromatic says:

    3:17 Zuckerberg used the word open-source more than he used the word AI in that paragraph 😂

  • @saipien says:

    – 00:00 🦙 Llama 3.1 model intro and comparison with competitors.

    – 03:26 🧠 AI challenges and data filtering.

    – 07:59 📊 Benchmark scaling laws and challenges with hardware.

    – 11:40 💡 Private Benchmark and model performance comparison.

    – 15:02 🛡 Adversarial tests impact and contamination detection.

    – 17:48 💬 Safety metrics, refusal rates, and model vulnerabilities.

    – 22:49 🧠 Llama 3 performance vs. competitors.

    – 23:30 📹 Insights on data training using Instagram reels.

    – 24:29 🍽 Mention of additional experiments and toolkits for AI applications.

  • @michaelroberts9587 says:

Mistral just released Mistral Large 2, 123B, and it trades blows with Llama 405B.

  • @AI_native says:

    _”They are language models, not reality simulators”_

    This is a super pertinent takeaway!

  • @revo2499 says:

    Could you please test the new Mistral Large 2 model with your SIMPLE Bench? I checked a dozen tricky questions and this model answered almost all of them correctly. I am very curious to see what score it will get.

  • @timseguine2 says:

What I liked about this release is that it is a lot more scientific in its approach than most of the major LLM work lately. I feel like this is finally a pretty good characterization of what the decoder-only transformer architecture is fully capable of.

And I think the open source thing is an important point. But among the large AI labs, it is only fair to give Meta credit for trying to be more open than the others. At least they have full source code for inference and open weights. And their license terms have historically improved with every LLM they have released.

  • @unvergebeneid says:

    4:11 just back from a skiing trip are we?

  • @artemiyshadrin1980 says:

    I can imagine your frustration (probably mixed with excitement) when you were already finishing this video and noticed that Mistral had just dropped their new large model lol

  • @psylocyn says:

You prove that clickbait isn’t necessary; a channel can succeed on merit.

  • @londonl.5892 says:

    As per usual, the consistency and speed are incredible. Well done!

  • @marc_frank says:

Good job creating a new test where the models score low. Everybody is boasting about getting over 90%, but that’s exactly the point where they should set new goals.

  • @MrSchweppes says:

‘Substantial further improvements of these models are on the horizon’ – this quote captures the paper’s most important point. All major players in the field agree: we are not nearing the plateau of scaling laws. Great video, Philip! It was a pure joy to watch! 👍

    • @IvanSoregashi says:

      Not sure they are referring to scaling here.

    • @MrSchweppes says:

      @@IvanSoregashi I’m pretty sure that the next generation of models will use at least one order of magnitude more compute. It’s very doubtful that this trend will stop anytime soon.

    • @leonfa259 says:

@@MrSchweppes One order of magnitude more compute is not that much in AI; according to the scaling laws, you can only expect about 15% less loss. There were 4 orders of magnitude between GPT-3 and GPT-4.

    • @MrSchweppes says:

      @@leonfa259 Hence “at least”

    • @MrSchweppes says:

@@leonfa259 I’m not sure you are right. 4 orders of magnitude is 10,000X more compute. GPT-3 was trained on a supercomputer of 10,000 V100s. GPT-4 was trained 2 years later on a supercomputer of 25,000 A100s. Despite the improvements in both hardware and software, I very much doubt that amounts to 10,000X more compute.
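The back-of-envelope arithmetic in this thread can be sketched in a few lines. Note the exponent below is an illustrative assumption (on the order of published power-law fits such as Kaplan et al.), not a figure from the video or the Llama paper:

```python
# Rough scaling-law arithmetic: assume loss scales as L ∝ C^(-alpha),
# where C is training compute. alpha = 0.05 is an assumed, illustrative exponent.

def loss_reduction(compute_ratio: float, alpha: float = 0.05) -> float:
    """Fractional loss reduction from multiplying compute by compute_ratio."""
    return 1.0 - compute_ratio ** (-alpha)

# One order of magnitude more compute:
print(f"10x compute:     {loss_reduction(10):.1%} less loss")
# Four orders of magnitude (the GPT-3 -> GPT-4 gap claimed above):
print(f"10,000x compute: {loss_reduction(10_000):.1%} less loss")
```

Under this assumed exponent, 10x compute buys roughly a 10–15% loss reduction, which is the ballpark the commenters are debating; the exact figure depends entirely on which fitted exponent you plug in.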

  • @JohnVance says:

    Maybe I’m just old, but it’s still wild to me to hear actual fiduciaries at real public companies discuss AGI not only as possible, but as a strategic goal. To think just two years ago many experts were still arguing AGI isn’t possible even in principle. I know folks are complaining about everything slowing down, but to this old 40-year-old things are still moving at a breakneck pace.

  • @sofia.eris.bauhaus says:

    thanks for clearing up the open source thing. open weights (under actually free licenses) models are still important, and we probably won’t get an open source training set that is competitive with the entirety of the internet and whatnot.

    but it’s also important that we get models whose training is actually fully reproducible, and where people can potentially check everything that goes into it. in particular more open source mechanisms for synthetic data and self-training.

  • @ZalexMusic says:

    The weirdest part about this era is how Zuck is returning to human form.

    • @chrisanderson7820 says:

      Side benefit of Meta’s training data improvement work, the Zuckborg also gets more human.

  • @Jack-gl2xw says:

Zuckerberg with the style… damn. Looking like a suave surfer dude.
