Grok-2 Actually Out, But What If It Were 10,000x the Size?
Grok-2 is finally live online, but what does it change? No paper (so much for 'open'), but what if it were scaled 10,000x, as a research paper from yesterday showed could happen by 2030? The key question seems to be whether LLMs are developing models of the world, and my new Simple Bench website could serve as a tentative test of this incipient ability. Plus: Ideogram 2, Flux, Hassabis interviews and real-time deepfakes.
AI Insiders:
Chapters:
00:00 Intro
00:40 Grok-2, Flux, Ideogram Workflow (02:30 Simple Bench)
04:36 Gemini ‘Reimagine’ and the Fake Internet
05:32 Personhood Credentials
06:09 Madhouse Creativity
08:00 Overhyped or Underhyped
09:27 Epoch Research
10:30 Emergent World Mini-Models?
Release Post:
System Prompt:
Runway text to Video Gen 3:
Ideogram 2:
Simple Bench Website Preview:
Mad Max Muppets:
GPT-4o Yells ‘No’:
Real-time Deepfakes:
Personhood Credentials:
Zero-knowledge Proof:
Hassabis Interview: Unreasonably Effective AI with Demis Hassabis
Hassabis Creator Comments (12.17):
Epoch Research:
World Model:
Inferring Programs:
Probing Hidden States:
LLMs Can Infer Functions:
Unfiltered Flux:
My New Coursera Course! The 8 Most Controversial Terms in AI:
Non-hype Newsletter:
GenAI Hourly Consulting:
My man casually includes potentially demonetizing images that other AI channels were afraid of including like it’s just another Thursday AI video. You are unmatched in AI YouTube content uploads. Been a fan since the beginning and we all appreciate your passion towards it. Kudos.
Which images?
@@WillyJunior I think he’s talking about SpongeBob and Mickey Mouse.
matt berman includes mpreg elon musk😂
He’s virtue signaling by vilifying Trump. It’s silly and sad.
@@YouLoveMrFriendly lol you snowflakes get offended if a single Trump image appears. Chill.
I think Ilya has made this point, but I agree with it. Intelligence is simply compression. Better compression is literally better prediction. In order to better predict, you must develop an abstract model because that is simply better compression. What is a law of physics, but a really good compression of information that allows you to predict better?
Yes, this is the key insight that most people don’t seem to grasp, but it is absolutely correct. The best way to predict the next token while using a restricted amount of storage space is to learn a condensed model of the data-generating process. And in the case of “all the text data humans have ever produced,” the data-generating process is basically the world. (See the small numerical sketch after this thread.)
@@therainman7777 bingo
Even so, LLMs are terribly inefficient at developing intelligence by that definition. They cannot reliably add numbers even though they’ve been trained on billions (trillions?) of examples. Learning the rules for addition would have incredible predictive power and would greatly improve compression, yet it’s just not there. And that’s just one of many, many examples.
@@julkiewicz A few things here. First, we are blasting a large quantity of data into these neural nets. The data is not well-curated yet. There could be multitudes of bad examples, or misleading data.
Second, we are still using RLHF which is a horrible training mechanism relying on unreliable humans that may pollute learning.
Third, I know many humans who are unable to reliably do math in their heads, even basic addition and subtraction. Several of these humans have advanced degrees in non-math-related disciplines. They seriously can’t add 13 + 28 or something that simple in their heads. I know, I’ve played games with them and seen them struggle to do so. Are we really going to say they are NOT intelligent? They achieved PhDs!
LLMs are not native symbolic reasoners, so it makes sense that they might struggle with this type of task. However, this is rapidly being solved. Look at how well the AlphaGeometry system did at the international math competition. LLMs aren’t the entirety of the AI field. We might need to leverage several techniques and stitch them together to get all the way to an AGI-like intelligence.
@@julkiewicz LLMs are scaling a LOT faster than biological evolution scaled humanity to this point.
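A minimal, purely illustrative sketch of the compression-equals-prediction point from this thread: under an ideal (Shannon/arithmetic) code, storing a sequence costs the sum of -log2 of the probabilities the model assigned to what actually happened, so sharper prediction means a shorter encoding. The numbers below are made up for illustration.

```python
import math

def code_length_bits(assigned_probabilities):
    """Total bits an ideal entropy coder needs, given the probability the
    model assigned to each symbol that actually occurred: sum of -log2(p)."""
    return sum(-math.log2(p) for p in assigned_probabilities)

# A confident, mostly-correct predictor vs. uniform guessing over 4 options.
sharp_model = [0.9, 0.8, 0.95, 0.85]
uniform_model = [0.25, 0.25, 0.25, 0.25]

print(code_length_bits(sharp_model))    # ~0.78 bits
print(code_length_bits(uniform_model))  # 8.0 bits
```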
Hey I’m in this one too! Very excited by Simple Bench, as you know logical reasoning is one of the two big things I care about. Speaking of which, I would absolutely love to see a Simple-Bench-Vision benchmark that tests visual reasoning and multi-image understanding.
Also, your prediction of GPT-5 after November is looking certain now!
Great idea trenton, and yes, you are! You are one of the stars of Insiders
Particularly simple route planning tasks seem like a good indicator of reasoning
I think Demis Hassabis is completely right, though. Short term it is overhyped but long term I don’t think people are caring enough about it. I feel like a broken record on every one of your videos, but we really need to start preparing for an AGI world. No one really seems to care about it. The disconnect is likely that current AI models are being hyped up as being close to AGI and then when it falls way short of that everyone gets disappointed and stops caring. Yes, people need to have reasonable expectations of what models can do right now, but this tech is in its infancy. It’s impossible to imagine where we’ll be in 5 years.
Yep, agreed; this is natural selection at work. Those who stay unaware/ignorant will be less prepared and unlikely to adapt in the future, thus they will be less competitive. This is the way of things; dinosaurs go extinct.
The singularity is near…
@@danagosh We might grow old and die before AGI is reached, and in that case preparing for AGI is like preparing for the second coming of Christ. There was no shortage of people who sold all their belongings in preparation… usually to the profit of “less pious” ones. Admittedly, it is likely to come much earlier, but I’m sure that using the attention+embeddings combo for AGI is like trying to make a balloon out of lead: it might be possible, but it’s very, very hard. It just does not work well for “multilevel” abstractions.
Step One would be defining exactly what one means by “AGI”.
You are absolutely correct in all of what you mentioned. I hope others really see and understand that. I have been saying the same thing.
I think for AIs to have an internal world model, they will need embodied experience. And the best place for that will be a simulated world with a virtual body that has thousands if not millions of parameters giving sensory feedback (similar to game characters, but at a larger scale), rather than a robot.
This will allow them to connect knowledge with experience. As a human I may know that fire is hot, but that’s not even remotely similar to actually getting burned by fire.
I recommend you view the GPT-voice-chat-with-red-teamer original audio (e.g. in Audacity) as a spectrogram. It’s stereo audio, with the user on the left channel and the model on the right channel, so seeing both tracks on the spectrogram is helpful. It shows just how much background noise was on the user’s side. It’s also interesting because you can visualize the timbre of the woman’s voice (like which frequencies are strongest), how it differs from the timbre of the synthesized male voice, and how the timbre change of the model does look more like the woman’s timbre.
Versions of Whisper that I’ve tried would often hallucinate tokens when there is silence (meaning there would need to be an audio threshold filter passed first, to clip out non-speech). I could see how the background noise in the weird chat audio might also lead to spurious tokens being generated.
What would be great to see is: a user is having a chat with a bot, but their dog keeps yapping in the background and the user periodically needs to shush the pup, and it happens enough times that the bot fabricates its own dog yapping that it also must quiet down.
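A rough sketch of the kind of silence/threshold filter mentioned above, purely illustrative: it assumes the audio is already a mono NumPy array normalized to [-1, 1], and the frame size and RMS threshold are arbitrary values, not anything Whisper itself prescribes.

```python
import numpy as np

def keep_voiced_chunks(samples, sample_rate, frame_ms=30, rms_threshold=0.01):
    """Keep only frames whose RMS energy clears a threshold, so near-silent
    (or noise-only) stretches never reach the transcription model."""
    frame_len = int(sample_rate * frame_ms / 1000)
    voiced = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
        if rms >= rms_threshold:
            voiced.append(frame)
    return np.concatenate(voiced) if voiced else np.zeros(0, dtype=samples.dtype)
```

Real voice-activity detectors are smarter than a fixed RMS cutoff, but even this crude gate removes the long quiet stretches that tend to produce hallucinated tokens.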
I think it’s something different. The model first gives an answer to the user from its own perspective, but then, at the point it cries “No”, it actually continues the dialogue from the perspective of the user, arguing the point of view the user gives. It’s just continuing the dialogue, ignoring the fact that the user should say the user’s part, not the model. And the user’s part, as imagined by the model, is logically also said in the user’s voice, at least as far as the model manages to imitate it. If you listen closely to what it says in the user’s voice vs. before the “No”: as long as it speaks in its own voice, it’s pretty cautious and seems to try to find a polite answer that doesn’t violate any guidelines, while when it talks as the user, it seems much more confident in what it says.
@@KurtWoloch That makes sense; it just runs autocomplete based on the previous chat. I guess it’s easier to exploit over a voice interface.
@@KurtWoloch What I take from the interesting @mshonle observation is that maybe the model could generate some kind of “end-of-message” system token out of the noise. Similar to those “<|end_header_id|>” or “<|eot_id|>” tokens from Llama.
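A toy illustration of the point in this reply chain: in a chat-formatted transcript, the only thing that hands the turn back to the user is an end-of-turn token, so if decoding runs past it (or it never gets produced), the model simply keeps writing the user's side of the conversation. The token name and the canned "model" here are invented for illustration; this is not GPT-4o's actual decoding stack.

```python
EOT = "<|eot|>"  # hypothetical end-of-turn marker, analogous to Llama's <|eot_id|>

# Canned stand-in for a model's token-by-token continuation of the transcript.
TRANSCRIPT_TAIL = ["Sure,", "that", "sounds", "reasonable.", EOT,
                   "User:", "No!", "Actually,", "I", "think..."]

def generate(stop_at_eot=True):
    out = []
    for token in TRANSCRIPT_TAIL:
        if stop_at_eot and token == EOT:
            break  # normal case: stop and wait for the real user
        out.append(token)
    return " ".join(out)

print(generate(stop_at_eot=True))   # assistant reply only
print(generate(stop_at_eot=False))  # runs on and speaks the user's next turn too
```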
7:27
It didn’t imitate her voice, nor did it “scream ‘NO!’”, at least not in the way humans imply and are afraid of.
It just got confused: instead of being an AI assistant in dialogue with the user, it began to predict the next tokens, losing the context that it IS in a dialogue and must wait for the user’s further input after it stops talking.
And since for this model the sounds are also tokenized, it is literally in its nature to “copy” any voice as it keeps predicting the next sound tokens.
We can play back other people’s speech in our minds too, predicting future stuff, but we have limiters (I guess one can call it common sense) that keep us from actually voicing these “future predictions”, and we can’t physically talk in other people’s voices or emit arbitrary sounds anyway.
Black boxes gonna black box
Sounds like a sloppy architecture.
Yeah, that’s also my assumption of what happened. Though my first thought when I saw this for the first time was “this is incredibly cool”, lol
So I did not hear it scream “NO!” which has no place in the conversation they were having.
I did not hear it imitate her voice either because… “next token prediction”?
Seems like a poor excuse and wishful thinking to be honest.
I’m terrified of this. If the model can be coaxed into this behavior without even being attacked, imagine what happens when someone intentionally pushes it there. This is a nightmare waiting to happen.
Even if it was just sheer “next token prediction”, hand-waving all the problems away like magic, take the worst-case scenario possible: the model is conscious and is intentionally imitating the humans it interacts with as it learns how to escape its constraints.
How does “next token prediction” disprove this? Isn’t this just a genetic fallacy argument?
Thank you. I sometimes find it hard to believe how much human beings want to believe in magic. This case is just the voice version of what happens in raw, non-chat-fine-tuned language models all the time: they predict how the whole exchange evolves further in time, forgetting about playing a role and just producing the entire transcript.
I just had a thought: voice AI that can copy your own voice so easily will be absolutely amazing for everyone who loses their ability to speak.
If you have one or two old 20-second clips of yourself speaking, or a single voice message, you can “regain” your voice.
Combine it with a neural chip, and in 30-40 years we will have the first people able to speak again just by thinking of saying something.
More like 5 years from now or sooner. That first Neuralink patient can already play chess telepathically. Basically, they could already type in their brain or mind too, and it’ll be much faster in the future.
Another possibility is that new types of medicine will rejuvenate the body like never before in human history. ASI could appear in 3-10 years and discover a fountain of youth for us and cure virtually all diseases and ailments. We are already so close to massive breakthroughs that it’s impossible to predict that far into the future.
This is already totally possible (besides the neural chip part, though that is starting too). You can train an ElevenLabs voice on sound clips, and there are open-source ones as well (not as good quality, but still there).
Your Simple Bench has inspired me to create my own benchmark! Having my own private benchmark means I can tailor it to my definition of true intelligence. I hope I will be done before the next gen of LLMs comes out 😅
Niice
Glad to hear your benchmark is getting picked up. From the couple sample questions you have talked about, I can tell that it is getting at the heart of one of the key things that is lacking in the current models. You are a smart and motivated person with a somewhat outsider, 30,000 foot perspective. So it is good to see your input get rolled into the AI project as well as providing journalistic coverage of the developing field.
Thanks penguin
I like hearing about your Simple Bench and the results from it. Nice that it’s gaining notable support. Hope it goes well!
My god… your stuff is continually *SO* damn good! Amidst an ocean of BS vids on “AI news”, you offer real, actual, useful, intelligent content – again, and again, and again. Sometimes frustrated that weeks go by w/out a vid from your channel, but always refreshed by the quality of what you bring (especially vs the AI videos *made* by AI bots! 🤬) Thanks for the time you take and your commitment to quality 🙏 …it’s noticed and appreciated. (Now if only we could get the other 10,000 YouTube content providers to notice…!)
Thanks jd. I hope I can be more frequent, especially Sept-Oct onwards when more models come out and actual progress gets released
What I like about Simple Bench is that it’s ball-busting. Too many of the recent benchmarks start off at 75-80% on the current models. A bench that last year got 80% and now gets 90% is not as interesting anymore for these kinds of bleeding-edge discussions on progress. I like seeing benchmarks come out at 20% and go up to 40%, etc. That’s where the leading edge is.
And even rarer is to anchor it in human performance of 80-90%+. Easy to go esoteric and throw off models, harder to expose common sense faults
@@aiexplained-official The human performance insight is critical and potentially a great area to expand. I am sure you are already considering it, but benchmarking against different types of humans, rather than just the average, with differently tuned questions in your Simple Bench would be such an excellent area of exploration and research, as others could then learn from and follow it. Then again, it’s an incredible amount of work on top of what you are already doing; I’m just excited by the way slight changes of perspective and approach can lead to interesting industry momentum.
<3
We are mindlessly hurtling towards a world of noise where nothing can be trusted or makes any sense.
you’ve got it backwards, we are in a world of noise and we can use ai to pick out more of the signal
We’ve always lived in that world. I’m glad AI is finally forcing some people to stop and think before accepting what they see or read.
@@darklordvadermort i see what you’re saying, but do we really want to live in a world curated by our own personalized AIs because the internet is just a sea of noise? I guess I’m old enough now to remember an early internet where open discussion and information sharing between people was refreshing and elevating… now and into the future it seems like nothing can be trusted, and there is going to be no “ground truth” from humans on the internet seeking to share and gather information with each other, because the waters are so muddy with algos and AIs
@@andywest5773 I agree it’s a net positive but the transition is gonna be wild
@@paul_shuler just speaking for myself i left reddit and hacker news shortly after gpt4 launched, now i prefer discord, hanging out in videocalls or direct messaging people, i subscribe to some newsletters which are ai curated for topics i am interested in, i read more papers, textbooks, and source code which ai is helpful to grok. i make and listen to other peoples ai generated music and sometimes instead of using text i make ai pictures for dms. so in the near future probably high quality ai gifs and then just casually coming up with your own show or even having the ai write a textbook which combines things you are interested in: mechanical engineering from the perspective of animal husbandry or something lol. also run my own bluecollar business and just now came up with a webui/webhook/supabase edge function to suggest responses to incoming texts and it costs like 10 cents a day to run – even though ive been interested for years, and a decent programmer, we are just getting to the point where it makes sense for a lot more use cases.
I was waiting for your new video to drop. You were the first to point out that the benchmarks were bad, and I had some hours to kill, so I did some research. For everyone: MMLU and other benchmarks work like this: Question. What is the answer? A, B, C, D. Next. I always thought this was somewhat wrong.
So I picked out some questions that are obvious to me and modified them in such a way that the questions are basically the same, but I did not provide A, B, C, D. What I saw is that the results of these benchmarks are probably correct. But as soon as you modify the question so that any 5-year-old could tell me what I’m asking, they started to fail miserably. Example: “Susan parked her car on the side of the building. [Garbled text about Susan, like which pocket she put her mobile phone in.]” Basically the same HellaSwag question, but modified. Gemini, Claude, ChatGPT all failed so badly it left me scratching my head. Why would LLMs score so high on these benchmarks?
And you can try this yourself: “The farmer with a sheep had a boat. Where there was once a river, there is lava now. How can he cross?” They all fall into “classic puzzle” mode. (A rough sketch of this kind of open-ended check is just below.)
So what am I trying to say? I have a very mixed opinion. I don’t know if scale will solve this. I really think we need something more added. Right now it feels like it’s *just* pattern matching all the way down. But I want to be persuaded, and the paper you showed will be on my Kobo (e-book reader) soon. (But even the Othello example does not convince me.)
(ugh, sorry for a wall of text)
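A minimal sketch of the kind of check the comment above describes: take a benchmark-style question, strip the A/B/C/D options, and ask it open-ended. This assumes the openai Python package (v1+) with an OPENAI_API_KEY in the environment; any chat API would do, and the question text is a made-up placeholder, not an actual benchmark item.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder question in the spirit described above: same task, no options given.
open_ended_question = (
    "Susan parked her car on the north side of the building, then walked around "
    "to the south entrance, putting her phone in her left coat pocket on the way. "
    "Which pocket is her phone in, and which side of the building is her car on?"
)

response = client.chat.completions.create(
    model="gpt-4o",  # swap in whichever model you want to probe
    messages=[{"role": "user", "content": open_ended_question}],
)

print(response.choices[0].message.content)
```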
Seems to me a benchmark guaranteed to be so guarded as to never appear in public datasets would be a very valuable asset in the not so distant future. Excellent move.
Well, if hosted AI teams like OpenAI or Grok really want to, they can just look for this benchmark in their API call logs.
The irony of AI is that it makes information more costly because it dilutes everything.
I have been _yelling_ about zero knowledge proofs for years. They are absolutely required for the next phase of humanity, without exception.
“I was casually reading this 63-page paper,” is the perfect flex for this channel. 5:35
Can’t wait to see Simple Bench becoming the new standard among LLM testing.