
Gemini 2.5 Pro – It’s a Darn Smart Chatbot … (New Simple High Score)

Gemini sets a new record on Simple Bench, and on several other benchmarks. I’ll go deep to explore its nuances, including how it deceptively reverse-engineers answers, does better on certain coding benchmarks than others, and may have a universal ‘conceptual language’ …

… and more. Plus practical tips, a note on security and Kling vs Veo 2 guest appearance.

AI Insiders ($9!):

Chapters:
00:00 – Introduction
00:36 – Fiction Bench
02:41 – Practicality – YouTube URLs + Security – cut-off date
03:42 – Coding
06:22 – WeirdML Bench
07:01 – Simple Bench Record High
11:23 – Reverse Engineering!
13:22 – Anthropic Paper
17:49 – 3 Caveats

Gemini 2.5 Updated:

Fiction Live Bench:

WeirdML:

Anthropic Thoughts:

Search Study:

Live bench:
Paper:

LiveCode Bench:

SWE-Verified:

Non-hype Newsletter:

Podcast:

Joe Lilli
 

  • @philswims says:

    Caught a video at the notification! I’ve been impressed by 2.5 pro, so I’m looking forward to hearing about your experience.

  • @rakibhasan6218 says:

    thank you for all the work you are doing for the community.

  • @GambitRaps says:

    This thing argued with me for 10 minutes that it is 2024 and I am wrong about it being 2025, even after having it conduct multiple searches on Google confirming from several sources the current date. It tried to tell me the search results must be skewed by my misunderstanding of the current date. Never been gaslit by an LLM before but here we are

    • @TheVistastube says:

      It’s very good at working within a codebase, solving problems, or troubleshooting – but terrible with facts. Also, it’s good to have a mode that’s not a complete sycophant. I literally feel like I’m walking on eggshells while talking to Claude or GPT because I might influence the direction of troubleshooting – so far this seems really good; we’ll see if Google lets us keep the good model.

    • @andutei says:

      Are you sure you selected Gemini 2.5 Pro? I just tried and it gave the exact date (with grounding search enabled).

    • @tylermoore4429 says:

      Grok nails this. But I think your point holds that, as frozen models, LLMs are generally not good at temporality – or at self-doubt, uncertainty handling, and saying “I don’t know.”

    • @Zek23 says:

      The LLM’s internal knowledge is inherently not capable of knowing the present date, but it ought to be able to interpret the tools it has access to that can give it that information. Like a search, or hell just a basic clock tool.

    • @GambitRaps says:

      @andutei Yes, it was 2.5 Pro, I told it that it is Gemini 2.5 Pro which just came out on March 25th 2025 and it responded basically like, “I think you’re mistaken, the most recent model of Gemini is 1.5, and today is March 26th, 2024. I understand that you believe it is currently 2025, but you’re off by one year.” I asked it like, Occam’s Razor, do you think it’s more likely that a human user is literally wrong about what year it is, and that its own searches confirm the wrong date, or that the model itself is wrong? It said the simpler explanation is actually that I, the user, was mistaken about the current year. lol
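The “basic clock tool” idea raised in this thread can be sketched with a function-calling interface. A minimal sketch, assuming a JSON-schema tool format of the kind several chat APIs use – the schema shape and the `get_current_date` name are illustrative, not any specific vendor’s API:

```python
# Illustrative sketch (assumed names/schema): expose the current date to a
# model as a callable tool, so it never has to rely on frozen training
# knowledge for "what day is it?".
from datetime import datetime, timezone

# Tool declaration in the JSON-schema style many chat APIs accept.
clock_tool = {
    "name": "get_current_date",
    "description": "Returns today's date in ISO 8601 format (UTC).",
    "parameters": {"type": "object", "properties": {}},
}

def get_current_date() -> str:
    # The runtime invokes this when the model requests the tool, then feeds
    # the result back into the conversation as a tool response.
    return datetime.now(timezone.utc).date().isoformat()
```

With a tool like this registered, a grounded model has no excuse to argue about the year: the date comes from the runtime, not from its weights.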

  • @antoineferrere3968 says:

    @Philip, Gemini 2.5 is amazing at coding. I have a 40k codebase with very complex features and architecture and… it is actually better than Sonnet 3.7. It’s not just the context window. It’s the quality of ‘thought’ and the ability to find the issue, debug, step back, and not simply throw more code and logs all the time. I am genuinely impressed.

  • @jamesbrock_au5997 says:

    Gemini 2.5 Pro Experimental can also handle MP3s. I dumped an entire album into it, and asked it to write up a review – it wrote an in depth track by track review of the entire album, even providing information such as the key, BPM, genre and style. It’s literally mind boggling!

    • @Dannnneh says:

      Wait what!?

    • @Neomadra says:

      Did it guess the key, bpm, genre / style correctly?

    • @GeekProdigyGuy says:

      Did you strip the metadata? It seems far more likely that it simply memorized facts about most popular music.

    • @jamesbrock_au5997 says:

      @Neomadra Yes, I did 3 different albums before I ran out of quota. Got everything spot on.

    • @jamesbrock_au5997 says:

      @GeekProdigyGuy No, they were ripped from Apple Music 🤫 and then uploaded. One I dumped all individual files, another I used an online site to combine the files into one large MP3.

      Two of the albums were released after its knowledge cutoff date.

  • @IN-pr3lw says:

    2.5 Pro just fixed my Linux laptop kernel issues in under 30 minutes. Before that, I spent over 3 hours each with models like o3-mini, Sonnet 3.7, and 4o, and they kept going round in circles. Gemini 2.5 Pro instantly understood what to do, and it was effortless.

  • @johnyharris says:

    This channel is one of the very few reasons I’m still on YouTube. Thanks for your efforts, they are hugely appreciated.

  • @OriginalRaveParty says:

    It completed a coding task that no other model has been able to complete for me.

    That’s my anecdotal evidence that it’s the current best in the world, and I absolutely love what Claude 3.7 can accomplish.

  • @noah-m1r2v says:

    As a scientist I’m blown away by the knowledge of Gemini 2.5. Obviously it is a niche use case, but the way it answers questions about detailed/obscure PhD-level biology and correctly reasons using that information kind of gives me chills. And I have used a lot of other models including Claude 3.7, o1 pro, NotebookLM with source material, etc. Clearly building models to advance science is a major goal of DeepMind, and I think this shows considerable progress in that area. Coding is saturated in the sense that it is an optimization target for many companies, so the differences there are going to be tighter.

    • @noah-m1r2v says:

      Just to expand, I would compare claude 3.7 to a very smart and highly motivated undergrad – great for brainstorming and correct at a high level, but starts to break down when you dig into the details. Mostly right most of the time, but not really so knowledgeable that it would help me in areas where I am an expert, nor great at coming up with new ideas (Don’t get me wrong, I love using claude for most tasks and think it is excellent at coding).

    • @kvinkn588 says:

      Coding is not saturated in that way – I had Gemini one-shot a bugfix in a large codebase that neither I nor Claude, DeepSeek, or o3-mini could figure out for several days. Blew me away, really…

    • @noah-m1r2v says:

      @ Wow, that’s awesome to hear! I haven’t yet used Gemini 2.5 much for coding, but I’m excited to!

    • @Tahazif_TheCool22 says:

      I’m blown away too with its capabilities!
      As a student prepping for JEE Advanced, which typically deals with advanced-level physics, chemistry, and maths, Google really released a beast for me! Its visual capabilities are top notch; it beats o1 and o3-mini-high in every test I have given so far within my field of study.
      A more interesting thing is that it is really good at organic chemistry now – neither o1 nor o3-mini-high could solve those problems or even get close. It got every difficult question right, and even gave me new insights while I was testing the problems!
      I don’t know, but I think they nerfed o1 – now it cannot even understand basic positions on benzene (like ortho, meta, para) and incorrectly identifies the substituents. Same with o3-mini-high.
      Google really cooked and we should appreciate it.

    • @theyreatinthecatsndogs says:

      I’m a street scientist

  • @mitchrobinsonau says:

    The importance of your work cannot be overstated, Philip. It is vital for the public to have a grounded source of information on the developments of AI – something which is going to become more and more intrinsic to our lives moving forward. All the best from Australia!

  • @EdwardVanWinkle says:

    As a writer, I just wanna say that 120k tokens covers a lot of full-length novels. Novellas usually hover around 32k tokens. I appreciate your vids, btw!

    • @psylocyn says:

      Not enough! I want to put in all the Aubrey Maturin novels so I can have long conversations with the characters

  • @AdvantestInc says:

    These kinds of benchmark-driven evaluations help ground the hype. Appreciate the transparency in walking through both wins and gaps.

  • @ced1401 says:

    I love to talk about maths and theoretical physics with AIs. Since the release of GPT-3, I have been asking the models questions to test them. Since Grok 3, and even more so with Gemini 2.5 Pro, I now ask questions to learn. Both of them have helped me grasp concepts and the geometrical and physical meaning of certain equations that I previously considered “maths I will never understand.” These new-gen models are simply incredible.

    • @UncoveredTruths says:

      They get so, so much wrong. I would be cautious if I were you – you wouldn’t know any better if you were a beginner.

    • @ced1401 says:

      @UncoveredTruths I agree – I’ve noticed I tend to be easily impressed by models when talking about things I know nothing about. But for me, where they shine is precisely when talking about things you know well. They can help you put new words to old ideas and create links between concepts you never thought to associate. I’m 54, not the brightest bulb in the room, so I don’t understand things deeply, but I know quite a bit of maths and physics. And since I have the skill to check the maths and read/write/check the equations, there’s not much risk of being bullshitted on those subjects. So you’re right about remaining cautious, and I am too: AIs are incredibly efficient at helping me understand the things I do know. And honestly, maybe I’m lucky (or stupid), but I don’t find them wrong in technical matters all that often.

  • @eugenes9751 says:

    A better test I found for this is giving it a detailed manual and asking it to list the chapter titles, or the first word of every paragraph, letting it read through the entire document until it runs out of output. This shows you exactly how much it has read and can remember at once, as it’s forced to go through the entire doc word by word. Gemini 2.0 could handle about 80k of context this way; 2.5 actually handles all 1M at once. It’s incredible. This is a true game changer.
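The recall probe described in that comment is easy to score mechanically: build the ground-truth answer key locally, then diff the model’s output against it to see where recall drops off. A minimal sketch, with the `first_words` helper and the sample document being illustrative assumptions:

```python
# Illustrative sketch: build a ground-truth answer key for the context-recall
# probe described above. Split a long document into paragraphs and record the
# first word of each one; the model's list over the same document can then be
# compared against this key.

def first_words(text: str) -> list[str]:
    """Return the first word of every non-empty paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n")]
    return [p.split()[0] for p in paragraphs if p]

# Tiny stand-in for the "detailed manual" in the comment.
doc = "Chapter 1: Setup\n\nInstall the firmware first.\n\nThen reboot the device."
key = first_words(doc)
# key == ["Chapter", "Install", "Then"]
```

Comparing the model’s answer list against `key` element by element shows exactly at which point in the document its recall fails.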

  • @existenceisillusion6528 says:

    I was waiting for Google to get back on top, and it’s finally happened. I tried Gemini 2.5 Pro, and it’s as good as previous Gemini models were bad. I agree, they probably won’t be king of the hill for much longer.

  • @drhxa says:

    Congrats to the Google team for getting 50%+ on Simple Bench! Brilliant. I honestly expected it to underperform – around 30%, based on using it for some coding tasks that are simple for Claude 3.7 w/ thinking. But then again, Gemini does very well at some things. It’s definitely good at a different set of tasks than some other models.

    These differences are fascinating. Thanks for the video

  • @thecoolbeanz10 says:

    Your development as a channel and creator has been stellar to see Phil – I can imagine you having a docuseries that’s wildly popular even with the general public in future.

    Truly a voice of reason in this wild period of the world since GPT-3 sir!

  • @3meiju says:

    There aren’t many YT content creators I can listen to for 20 mins straight… tbh, rn, I can’t recall even one other :))
    Great video

  • @ghulammahboobahmadsiddique8272 says:

    The fact that AI has a universal ‘conceptual language’ is just further evidence that LLMs do actually reason. Same with the poem example that Anthropic showed. And I’ve used Gemini 2.5 Pro extensively the past few days. In 99% of cases, it’s way better than any other model even though for a few prompts, Flash Thinking somehow still beat it. And the reason seems to be that Gemini 2.5 Pro refuses to think for more than 20-30 seconds while most other models are more than happy to think for minutes on end. Still quite impressive that it’s so much better even with that probably self-imposed constraint. And it’s been fascinating to see Simple Bench scores slowly rise and now finally cross the 50% threshold. At this rate, it’ll beat the human average within a few months. And you’re definitely right about one thing: Google may have taken the crown but in this race, no lab manages to keep the crown for long.

    • @Raulikien says:

      That’s one of the key insights that got me interested. Humans are really making another form of universal intelligence, huh

  • @raj34 says:

    I found this channel when I was trying to figure out what Q* was all about. AI Explained is the gold standard for AI videos. Thank you so much for all you do!
