Q* – Clues to the Puzzle?

Are these some clues to the Q* (Q star) mystery? Drawing on barely noticed references, YouTube videos, article exclusives, and more, I put together a theory about OpenAI’s apparent breakthrough. Join me for the journey and let me know what you think at the end.

AI Explained Bot:

AI Explained Twitter:

Lukasz Kaiser Videos:

Let’s Verify Step by Step:
The Information Exclusive:
Reuters Article:
Original Test Time Compute Paper:
OpenAI Denial:
DeepMind Music:
Altman Angelo:
Karpathy:
STaR:
Noam Brown Tweets:
Q Policy:
Sutskever Alignment:

Non-Hype, Free Newsletter:


  • @gaborfuisz9516 says:

    Who else is addicted to this channel?

  • @DevinSloan says:

    Ah, the Q* video I have been waiting for, from the only YouTuber I really trust on the subject. Thanks!

  • @SaInTDomagos says:

    Dude woke up, thought to himself, “How thorough will I be today?” and said: “Yes!” You definitely should get some interviews with those top researchers.

  • @nathanfielding8587 says:

    I’m truly grateful for this channel. Finding accurate news about almost anything is hard as heck, and having accurate AI news is especially important. We can’t afford to be misled.

  • @Peteismi says:

    The idea of Q* as an optimizing search through the action space sounds quite plausible, much like the A* algorithm, which is a generic optimal pathfinding algorithm.

    • @adfaklsdjf says:

      ohhh that Q* / A* link is very interesting!

    • @productjoe4069 says:

      This was my thought too. Possibly using edits of the step-by-step reasoning as the edges, or some more abstract model. You could then weight the edges using a verifier that only needs to see a bounded context (the original, the edit, and the prompt) to say whether or not the edit is of high quality. It’s sort of like graph-of-thought, but more efficient (a rough sketch of this idea follows the thread below).

    • @ZeroUm_ says:

      A* was my first thought as well; it’s such a famous algorithm from graduate-level CompSci.
      (Sagittarius A* is also the name of the Milky Way’s central supermassive black hole)

    • @mawungeteye657 says:

      Even if it’s just speculative, it’s a decent idea for an actual study. I wish someone would test it.

    • @sensorlock says:

      I was thinking something along this line too. Is there a way to prune chains of thought, the way alpha-beta pruning trims a minimax search?
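
To make this thread’s idea concrete: below is a minimal sketch, in Python, of best-first search over chains of thought with verifier-based pruning. Every name in it (`propose_steps`, `verifier_score`, `is_solution`) is a hypothetical stand-in for illustration, not a confirmed detail of Q*.

```python
import heapq
from itertools import count

# Hypothetical stand-ins, NOT a confirmed OpenAI interface: in practice
# propose_steps(chain) would sample candidate next reasoning steps from an
# LLM, and verifier_score(chain) would be a trained process-reward model
# (in the spirit of "Let's Verify Step by Step") scoring a partial chain.

def search_chain_of_thought(prompt, propose_steps, verifier_score,
                            is_solution, beam_width=4, max_expansions=100):
    """Best-first search over partial chains of thought, pruning branches
    the verifier scores poorly. A guess at what 'search plus verifier'
    could look like, not a description of Q* itself."""
    tie = count()  # tiebreaker so the heap never compares chains directly
    frontier = [(0.0, next(tie), (prompt,))]  # (negated score, tie, chain)
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, chain = heapq.heappop(frontier)  # most promising chain so far
        if is_solution(chain):
            return chain
        # Expand, then keep only the top-scoring continuations: this is the
        # pruning step, analogous to discarding unpromising graph edges.
        candidates = [chain + (step,) for step in propose_steps(chain)]
        candidates.sort(key=verifier_score, reverse=True)
        for cand in candidates[:beam_width]:
            heapq.heappush(frontier, (-verifier_score(cand), next(tie), cand))
    return None  # no verified solution within the compute budget
```

One design point worth noting: the verifier only ever scores a bounded context (a single partial chain), as @productjoe4069 suggests above, which is what keeps the pruning cheap.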

  • @DaveShap says:

    This is way better than breaking AES-192.

  • @pedxing says:

    THIS was the technical dive I’ve wanted to find for the last few days. Thank you so much for taking the time to dig into the development of these papers and the technologies they represent.

  • @bobtivnan says:

    Wow. Very impressive investigative journalism. No other AI channel does their homework better than you. Well done sir.

  • @a.s8897 says:

    You are my first source for AI news; you go deep into the details and do not cut corners, like a true teacher.

  • @Madlintelf says:

    We all spent the last week watching the soap-opera drama and listening to wild ideas, and nobody put it all together in a nice package with a bow on it until you posted this video. It is a theory, but one that is well thought out, has references, and seems extremely logical. Thanks for putting so much work into this; it’s not falling on deaf ears, we truly appreciate you. Thanks, Bill Borgeson

  • @aiexplained-official says:

    My computer crashed 7 times while making this video and I had a hard deadline to get a flight. There is little of my normal editing in here, or captions, just my raw investigation! Do follow the links for more details.

  • @zandrrlife says:

    I would say he’s actually understating the dramatic impact CoT has on multi-modal output. Also, things get wacky when you combine vertical CoT iteratively reflecting on horizontal CoT outputs (the actual outputted tokens). Increasing the model’s inner monologue (computation width) across layers is definitely the wave.

    Again, this is why I think synthetic/hybrid data curation costs will soon match model pretraining costs. Even if you’re perturbing existing data, you can lift its salient density to better fit this framework. It’s also why I keep saying local models are the way, and why I’ve been obsessed with increasing representational capacity in smaller models.

  • @dcgamer1027 says:

    I’d expect the Q to refer to Q-learning. Human beings think/function by predicting the future and acting upon those predictions, at least at a subconscious level. The way we make these predictions is by simulating our environment and observing what would happen in different variations of that simulation given the different choices we make. We then pick the future we feel is best and take the actions to manifest that future.
    I think a good example might be walking through a messy room with legos everywhere. You observe the environment (the room), identify the hazards (legos), then plan out a course through the room of where you can step to be safe (not step on a lego). You would imagine that stepping in one spot would mean you are stuck or would step on a lego, so that whole route is bad and you try another. Repeat till you find a solution, or decide there isn’t one and just pick some legos up, or give up, or whatever. Of course, not everyone does this; some people just walk on through without thought and either accept stepping on legos or regret that they did not stop to think. These emotional responses, accepting the consequences or regretting them, are more akin to reinforcement learning, imo. There are times when you need to act without thought; for example, if the room were on fire you might not have the time (or compute) to plan it all out.

    The Q learning stuff, in the context of these LLMs, seems like it would be their version of simulating the future/environment. It would generate a whole bunch of potential options(futures) then pick the best one. The difficult task there is creating a program that knows what the best option actually is, but they apparently already have that figured out.

    My bet is we will need to add in a few different systems of ‘thought’ that the AI can choose from given different contexts and circumstances. These different methods of decision-making will become tools for the AI to use and deploy, and at that point it will really look like AGI. That’s just my guess, and who knows how many tools it will even need.
    Either way it’s cool to see progress and all this stuff is so cool and exciting.
    Now to go look for some mundane job so I can eat and pay off student loans lmao, post-money world come quickly plz XD.

    • @gregoryallen0001 says:

      Normally a long post like this would be trash, so THANK YOU for this helpful and engaging response ❤

    • @RichardGrigonis says:

      Many years ago, AI researchers speculated about how to represent “thoughts.” One approach was to treat them essentially as “mental objects”; the other was to resort to possible-worlds theory.

    • @GS-tk1hk says:

      What you described is just reinforcement learning. Q-learning is a specific algorithm for solving the RL objective, and the “Q” refers to the Q-function, which has a specific meaning in RL. It seems likely that Q* refers to the Q-function (and star generally means “optimal”), but not necessarily to the Q-learning algorithm (the standard definition is sketched just after this thread).

    • @kokopelli314 says:

      But if you have the whole world in Q-learning, you can just use your intelligence to make money and pay someone to sweep up the room.

    • @lucasblanc1295 says:

      Anyone who has played a bit with these LLMs knows that intuitively already. I prompt it all the time with chain-of-thought and other reasoning methods, like “Write a truth table to check for errors in our logic”. The major issue I always arrive at is that it ends up getting stuck somewhere along its line of reasoning and needs human intervention. This happens exactly because it was never taught how to think and structure its thoughts; that was just a side-effect of language. I believe once it’s able to reason through mathematical problems with proper proofs, it will be able to generalize to any field thanks to lateral knowledge transfer. So they will just need to keep fine-tuning the model in that direction, effectively creating a feedback loop that improves its capability to reason correctly, so that it requires fewer parameters and less compute for the same quality. Adding new breakthroughs on top of that, such as bigger context windows, AGI is just a matter of quantity and quality of the same technique.

      Just run that thing in a loop, because that’s how thinking happens: it’s a trial-and-error process. Then fine-tune it to be better at trial-and-error processes, instead of simply giving seemingly useful answers. We were simply being lazy by tuning it towards being useful quickly, without caring about how it’s doing it in the first place.

      It is already AGI, but it’s severely misaligned, just like GPT-3 was impressive before Chat fine-tuning. Now, we are fine-tuning Chat as Q*. It’s just a step.

      After Q*, it will probably be fine-tuned to generalize further, instead of staying in the domain of math/programming.

      This will be tricky to train: humans don’t generate textual content for the sake of thinking through it; perhaps only mathematical proofs come close, and those are extremely time-consuming. Because we make assumptions about the reader’s pre-existing intelligence, we convey information through text without ever showing our full thought process.

      In other words, we are truly starting to fine-tune it to use text for thinking, not simply to generate cute answers that fool humans. This may seem obvious, but I don’t think people get it.
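
For reference, here is the standard definition @GS-tk1hk points at, with the usual caveat that nobody outside OpenAI knows whether their Q* means this. Conventionally, Q* denotes the optimal action-value function, the fixed point of the Bellman optimality equation:

```latex
% gamma is the discount factor, s' the next state drawn from the dynamics P
Q^*(s, a) \;=\; \mathbb{E}_{s' \sim P(\cdot \mid s, a)}
  \left[\, r(s, a) + \gamma \max_{a'} Q^*(s', a') \,\right]
```

Acting greedily with respect to Q*, i.e. picking the action that maximizes Q*(s, a) in every state, yields an optimal policy; the star-for-optimal convention is where the “optimal Q-function” reading comes from.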

  • @apester2 says:

    I was in two minds about whether to take the Q* thing seriously until you posted about it. Now I accept that it is at least not just sensational hype. Thanks for keeping us up to date!

  • @caiorondon says:

    This channel outpaces ANY other AI-news channel on YouTube in quality. The way you try your best to keep the hype out and reduce the amount of speculation is really something to be proud of, and it’s really what makes your content so different from other creators’.

    You, sir, run the only channel on the topic where I am happy to watch (and like) every video. ❤

    Cheers from Brazil!

  • @nescirian says:

    At 17:20 Lukasz Kaiser says multi-modal chain of thought would basically be a simulation of the world. Unpacking this, you can think of our own imaginations as essentially a multi-modal “next experience predictor”, which we run forwards as part of planning future actions. We imagine a series of experiences, evaluate the desirability of those experiences, and then make choices to select the path to the desired outcome. This description of human planning sounds a lot like Q-learning: modeling the future experience space as a graph of nodes, where the nodes are experiences and the edges are choices, then evaluating paths through that space based on expected reward. An A* algorithm could also be used to navigate the space of experiences and choices, possibly giving rise to the name Q*. But it’s been many years since I formally studied abstract pathfinding as a planning method for AI, and as far as I can tell from googling just now over my morning coffee, the A* algorithm would not be an improvement over the Markov decision process traditionally used to map the state space underlying Q-learning.

    My extrapolation gets a bit muddy at that point, but maybe there’s something there. To me, a method that allows AI to choose a path to a preferred future experience would seem a valuable next step in AI development, and a possible match for both the name Q* and the thoughts of a researcher involved with it.
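
For readers who want the mechanics behind the comments above, here is a textbook sketch of tabular Q-learning. The environment interface (`reset`, `step`, `actions`) is an assumption of the example, not a real library API, and nothing here is specific to OpenAI.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Textbook tabular Q-learning (Watkins, 1989). Q(s, a) estimates the
    expected discounted return of taking action a in state s; with enough
    exploration the table converges to the optimal Q-function, Q*.
    `env` is assumed to expose a minimal, hypothetical interface:
    reset() -> state, step(a) -> (next_state, reward, done), and a list
    env.actions. That interface is an assumption for this sketch."""
    Q = defaultdict(float)  # maps (state, action) -> estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Bellman update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a').
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```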

  • @rcnhsuailsnyfiue2 says:

    18:49 I believe Q* is a reference to the “A* search algorithm” in graph theory. Machine learning is fundamentally described by graph theory, and an algorithm like A* (which finds optimal paths through a graph efficiently) would make total sense.

    • @bl2575 says:

      It was also my thought when I heard the algorithm name. A* is basically a cost-minimization algorithm for reaching a target node. The difficult part in this context is figuring out what heuristic to use to evaluate whether one step of reasoning is closer to answering the question than another. Maybe that’s where the Q-learning policy plays a role.
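
And for comparison, a minimal sketch of classic A* itself; `neighbors` and `heuristic` are assumptions of the example. As @bl2575 notes, the open question in any “A* over reasoning steps” reading of Q* is what the heuristic would be; possibly a learned verifier or Q-function.

```python
import heapq
from itertools import count

def a_star(start, goal, neighbors, heuristic):
    """Classic A* search: repeatedly expands the node minimizing
    f(n) = g(n) + h(n), where g is the cost accumulated so far and h is an
    admissible (never overestimating) estimate of the remaining cost.
    neighbors(node) is assumed to yield (next_node, edge_cost) pairs.
    This is the textbook algorithm the thread compares Q* to, nothing more."""
    tie = count()  # tiebreaker so the heap never compares nodes directly
    frontier = [(heuristic(start), next(tie), 0, start, [start])]
    best_g = {start: 0}  # cheapest known cost to reach each node
    while frontier:
        _, _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nxt, cost in neighbors(node):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + heuristic(nxt),
                                          next(tie), new_g, nxt, path + [nxt]))
    return None  # goal unreachable
```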

  • @gmmgmmg says:

    The New York Times or another major newspaper should hire you, seriously. The amount and quality of research, and the way you explain and convey AI news and information, are truly remarkable. You are currently my favourite YouTube channel.

  • @MasonPayne says:

    A* is an algorithm mainly used in pathfinding, which works very similarly to what you described as Q. Imagine the idea landscape as a set of information you need to search through to find a path to the answer. That is what I think they mean by Q*.

  • @DavidsKanal says:

    “You need to give the model the ability to think longer than it has layers” is what really sticks with me; it’s such an obvious next step for LLMs, which currently run in constant time. Let’s see where this leads!
