Orca: The Model Few Saw Coming

The first model set to be open-sourced that actually comes close to ChatGPT, and it's just 13B (small enough for a laptop). The 51-page report from Microsoft was released just 48 hours ago, but I have gone through it all and bring in relevant insights from 5 other papers.

By imitating the logic and explanations of GPT-4 (and using GPT-3.5 as an assistant), and by training on diverse tasks and an order of magnitude more examples, we get Orca. I will showcase it on a dozen benchmarks and go through in detail how and why it works.

I will also end with comments from Sam Altman and Ilya Sutskever on whether open source will catch up…

Orca Paper:
False Promise Paper:

FLAN:
Vicuna:
No Moat memo:
LLM Leaderboard:
AGIEval:
BIG-Bench Hard:
Language Models as Tool Makers:
Altman Interview:
DERA Paper:
Let's Verify Step by Step:

Non-Hype, Free Newsletter:

Joe Lilli
 

  • @laslog says:

    Matching gpt4 performance in some areas with 13B! INSANE!

    • @aiexplained-official says:

      Yeah just a couple. More of a GPT 3.5 rival though.

    • @EddieBurke says:

      @@aiexplained-official considering I was using GPT-2 to help me write stuff like 2 years ago, it is still fucking unreal how fast open source has caught up to commercial.

    • @blablabic2024 says:

      It will definitely surpass closed source in a couple of years. OpenAI is a great platform that will become obsolete in about five years, and along with it the market cap of the software giants. Even the hardware giants will get displaced in market share by a slew of open-source hardware (RISC-V et al.). We’re not talking about moats, we’re talking about dams that just got breached…

    • @EddyLeeKhane says:

      @@EddieBurke have they though?

      Still looks like rote memorization by fine-tuning to me

    • @dv_interval42 says:

      @@EddyLeeKhane Yep. Real progress would be apparent, we wouldn’t have to bend over backwards to justify actual leaps!

  • @michabrugger7664 says:

    This is top quality content! Thanks for keeping me up to date 🙂

  • @atpray says:

    If a 13B-parameter model can do that, I cannot imagine what GPT-4 with further improvements can do.

    • @zyansheep says:

      @@fontende more data isn’t the only way to improve a model…

    • @MindFactoryAI says:

      @@zyansheep Right, essentially they have done the easiest thing, text prediction, with a bit of inverse RL. There is a much more complex space of composite models, inductive biases, reasoning processes and loss functions as yet unexplored.

    • @martiddy says:

      @Phobos Deimos Actually Meta is working on a multimodal AI that includes images, text, audio, depth information and more.

    • @bartpelle3460 says:

      @@fontende *beep beep boop, I am a bot and this action was performed by CommentGPT*

  • @DaveShap says:

    “Everyone and their gramma used it for whatever” – spoken in an erudite English accent. My life is infinitely more complete for having heard you utter these words. Thank you.

    • @aiexplained-official says:

      Thanks David. My BBC British accent is a perfect disguise if I don’t actually know something. No one would guess.

    • @DaveShap says:

      We Americans have been thoroughly trained – Oh yeah this guy sounds SUPER credible! 😀

    • @GuinessOriginal says:

      @@DaveShap it’s Grandma David 😉 just an FYI in case you weren’t sure

  • @FrancoisPesce says:

    I find it intriguing how this research has leveraged imitation learning to essentially ‘distill’ the extensive knowledge of large language models such as ChatGPT and GPT-4 into Orca. My interpretation is that the choice of 5 million distilled examples essentially creates a filtration process, condensing and harnessing the most valuable insights from the sea of information these large models have processed.

    Remember, these large language models have been trained on an incredibly diverse range of data, which includes valuable knowledge, but also less useful or even misleading information. The challenge in training these models has been to sift through this ‘noise’, identifying the truly useful signals. When we reach training convergence, these models have either generated a functional approximation of the data’s meaning, or perhaps have pruned the absurdities in the dataset by plateauing towards the end of the training process, thereby not allocating further weight to insignificant data.

    To provide an analogy, consider the breadth of Wikipedia, with its 6 million English articles. Only a fraction of these, about 0.1% (or 6000), are ‘Featured Articles’, denoting the highest quality. I would compare the approach of this research to understanding the world through these 6000 best articles. By focusing on content that has been consensually deemed as superior, you likely cover a vast spectrum of world knowledge.

    It’s as if the Orca model is absorbing the distilled wisdom of the large language models, somewhat akin to a student learning from a world-class tutor. And while this may not fully capture the intricate reasoning process of the original models, it clearly leads to significant improvements in performance, as demonstrated in the zero-shot reasoning benchmarks and academic tests mentioned in the paper. This suggests that learning from step-by-step explanations, whether they originate from humans or advanced AI models, holds great promise for advancing model capabilities and skills.
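
    To make that distillation loop concrete, here is a minimal sketch of how I picture the data collection, using the openai Python client; the system message, field names and file path are my own illustrative choices, not the paper’s:

    import json
    import openai

    SYSTEM = ("You are a helpful assistant. Think step by step and "
              "justify your answer before giving it.")

    def collect_example(question: str) -> dict:
        # Ask the teacher model for a full explanation, not just the final answer.
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": question}],
        )
        return {"system": SYSTEM,
                "question": question,
                "explanation": resp["choices"][0]["message"]["content"]}

    # Millions of such (system, question, explanation) triples are then used to
    # fine-tune the 13B student with an ordinary causal-LM loss.
    with open("distilled.jsonl", "a") as f:
        f.write(json.dumps(collect_example("Why does ice float on water?")) + "\n")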

    • @GuinessOriginal says:

      I suspect you’re on the right track, but it might be more akin to the simpler 80-20 rule. It’s likely that 80% of user requests can be met using only 20% of the training data. The exact numbers aren’t important but you get my point. I wonder how Meta’s MegaByte hierarchical architecture will affect this too; it looked very interesting to me.

    • @ivan24zg says:

      Once a neural network “groks” something you can remove activations that were used only for memorization before generalization occurred. But these redundant activations have NOT been pruned from models like ChatGPT after training is done; they are still there. The distillation process done by ORCA fast-tracks the generalizations, ultimately reducing the size of the network needed to learn something. ChatGPT can probably be reduced to 10% of its size (or less) if they pruned it after training. The 50%-400% gains that ORCA has over Vicuna are absurd, and indicate that we are nowhere near the diminishing returns threshold. Once all the algorithmic optimizations are done, the consumer-grade LLMs will probably end up more powerful than anyone ever imagined. OpenAI is free to spend millions to train the network, but they CANNOT prevent extraction of the knowledge from the network to produce better models. And that guy at the end claiming that there will always be a “gap” is deluded – we only need to produce AGI *once*, and after that it’s in self-driving mode.
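
      (If anyone wants to see what that pruning step even looks like mechanically, here is a rough PyTorch sketch; the 90% figure is just the number I claimed above, not something I have validated:)

      import torch.nn as nn
      import torch.nn.utils.prune as prune

      def magnitude_prune(model: nn.Module, amount: float = 0.9) -> nn.Module:
          # Zero out the smallest-magnitude weights in every linear layer and
          # make the mask permanent. Whether quality survives such aggressive
          # pruning is exactly the open question - this only shows the mechanics.
          for module in model.modules():
              if isinstance(module, nn.Linear):
                  prune.l1_unstructured(module, name="weight", amount=amount)
                  prune.remove(module, "weight")
          return model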

    • @GuinessOriginal says:

      @@ivan24zg is what you’re talking about similar to having a sparse architecture in your neural network, or is that something completely different?

    • @sebastianjost says:

      While my initial reaction was to expect some major knowledge gaps in Orca, I now notice that humans are often taught in a similar way.

      Consider a mathematical theorem. There are often many different proofs, developed and refined over hundreds of years. Students are usually taught the most elegant proof, not the original or, worse, all the different proofs. While the remaining proofs could still provide valuable insights, if you see enough proofs of different theorems you should still learn enough. That’s what we seem to expect from humans, at least.

      While the objectivity of maths is rarely found elsewhere, the same principle should apply to other areas as well. I think that helps in understanding why Orca is so good/competitive.

    • @clray123 says:

      I wonder if Orca also loves to go in loops like a mad parrot when you disable sampling and go for greedy token generation. This sort of absurd behavior (with probability of the next tokens being amplified by whatever the model has already spit out) makes me quite skeptical of whether the small models really are very “smart” … or just more successful at parroting what’s contained in the smart model.
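
      (For anyone who wants to reproduce that check at home, here is roughly how I would do it with Hugging Face transformers; Vicuna stands in below only because Orca itself has not been released:)

      from transformers import AutoModelForCausalLM, AutoTokenizer

      name = "lmsys/vicuna-13b-v1.3"  # stand-in checkpoint, swap for whatever you can run
      tok = AutoTokenizer.from_pretrained(name)
      model = AutoModelForCausalLM.from_pretrained(name)

      ids = tok("Explain why the sky is blue.", return_tensors="pt").input_ids
      greedy = model.generate(ids, max_new_tokens=200, do_sample=False)   # prone to loops
      sampled = model.generate(ids, max_new_tokens=200, do_sample=True,
                               temperature=0.7, top_p=0.9)                # usually varied
      print(tok.decode(greedy[0]), tok.decode(sampled[0]), sep="\n---\n")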

  • @octia2817 says:

    I would love more open-source model content!

  • @ahtoshkaa says:

    Yes please! More about open source LLMs would be great from you, since you study everything in so much detail. Don’t be too hasty to edit stuff out. I’m sure people will love listening to a 30-40+ minute video when it comes from you.

    • @TarninTheGreat says:

      Oh yeah, I’d watch hour long videos from him every day. I don’t think that’s what he wants to make; but yeah man, don’t worry about length, you’re making the best content on the subject, how long it goes is not a concern.

    • @StevenAkinyemi says:

      I would too. Absolutely

    • @OlebileWareus says:

      i would

    • @ParameterGrenze says:

      Agreed. Your viewers are not the 10-minute-attention-span crowd. Don’t try to game your vids with best practices optimized for mainstream viewers.

  • @nihilistoner says:

    You know this already, but we all appreciate your work so much. Thank you! 🙂

  • @GoldenBeholden says:

    This field has been an absolute joy to be a part of these past few months.

  • @reinerheiner1148 says:

    This paper is more important than one might think. It could lead the way to AI learning like humans, because it already shows that a) learning easier stuff before more difficult stuff improves learning even in an LLM, and b) the better the explanation for an answer, the more the model will understand through reasoning. Going down that path could mean hugely decreased training times while further improving the LLM’s reasoning capabilities. It is like memorizing vs understanding. And yes, please talk more about open source LLMs, especially the ones that work with LangChain. Thanks for the video!

    • @Silduril says:

      Exactly what I was thinking! Super exciting stuff 😀

    • @clray123 says:

      It is naive to assume that the AI is “learning like humans”. Humans do not learn by memorizing millions of text examples.

    • @jeff__w says:

      @@clray123 Absolutely. And I doubt the model “understands” anything by reasoning. What these explanations do is allow the model to refine its neural net so that its verbal output emulates the verbal behavior that we would call “reasoning” in humans.

    • @reinerheiner1148 says:

      @@clray123 you did not understand my message. What I was getting at was that currently, yes, the AI is memorizing by going through loads of data, unlike humans. But this paper shows that memorizing is inferior to understanding, i.e. providing extensive reasoning for each sample so the AI can learn why the answer is correct. This, and the fact that the AI learns better when given simple problems before more complex ones, shows that the exclusively brute-force approach is inferior. Any researcher reading this paper will realize that there is probably a lot of potential to optimize training, so that the model will need less training to learn the same thing. Which in theory could lead to models learning just as fast as humans, if we are able to optimize the training methods (and probably the model structure) enough. After all, it’s possible, because humans can do it. And I have an example for you of how learning can be massively improved in another branch of machine learning, reinforcement learning: look at OpenAI’s paper on hindsight training, which hugely decreased the amount of samples needed to learn a task. It’s not a far stretch that we can have similar progress with LLMs. So yeah, I don’t think I am naive… but I am well aware of where we are right now.
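
      A toy sketch of the easy-before-hard ordering I mean, on an Orca-style dataset (the file name and the length-as-difficulty proxy are placeholders I made up):

      import json

      def difficulty(example: dict) -> int:
          # Crude proxy: longer teacher explanations ~ harder reasoning.
          return len(example["explanation"].split())

      with open("orca_style_data.jsonl") as f:
          data = [json.loads(line) for line in f]

      curriculum = sorted(data, key=difficulty)  # feed to the trainer in this order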

    • @reinerheiner1148 says:

      @@jeff__w In the end, reasoning is also a consequence of refining a neural network in humans as well, so… you oversell humans. Yes, we are superior, but it’s not magic. At the end of the day we are the sum of our learned information plus biological biases (hormones, neurotransmitters, brain structure). Which is not pleasant for many to realize, because it limits the concept of free will as well.

  • @msylvestre says:

    I really feel lucky to have found your channel. I genuinely think it’s the best source for no-shill, in-depth, AI and LLMs news.

    • @RandyHawkinsMD says:

      I couldn’t agree more.

    • @skierpage says:

      This channel is really very good. There’s also Yannic Kilcher’s ML News, which is a sporadic but excellent summary of what’s been going on, not just in papers but in new software releases and other AI-related goings-on. He also does deep dives into particular papers.

      Let us not speak of Two Minute Papers and KZ-F’s mindless recycled eye-candy visuals with no information about the paper itself or what’s novel in it.

    • @Fru1tpunch says:

      oh god since ai is in the news now it feels like all the channels are like crypto shillers

    • @ToriKo_ says:

      +

  • @Cacti_hipster says:

    The web of lies at 11:05 flew over my head until I thought of it more as a recursive base case. Mad props to this team!

  • @lukeg1680 says:

    Your content is deeply moving, thank you. I get this sense of vertigo, as the ground shifts beneath our feet, steadied only by your academic tone and deep commitment to the facts. I’ve never seen awe-inspiring breaking news told through academic papers before; gripping.

  • @harveyhutsby7697 says:

    I have a large appetite for this kind of content and you are by far the best source I’ve found on youtube, so personally I would like to see more.

  • @sbondi says:

    Actually, as to the “more open source?” comment, because “you have a lot more to say about it”, I say “YES, PLEASE”! Everything you say in every video is done with such intelligence and quality that it ALWAYS has great value! I can see that you put your heart into these videos, and I really appreciate all the heavy lifting that you are doing for all of us! 😃

  • @alpha007org says:

    Stephen Wolfram explained his thoughts about LLMs, where he says that LLMs (like ChatGPT) can be distilled down to a much smaller size. It was on the Lex Fridman podcast.

  • @deciphrai says:

    Timestamps courtesy of Deciphr AI 👨‍💻

    0:02:32 – Orca’s 13 billion parameters
    0:04:12 – Orca leveraged system instructions
    0:05:58 – Task complexity and diverse examples
    0:07:18 – Orca matches text-davinci-003 on the SAT, LSAT, GRE and GMAT
    0:08:03 – Orca reaches parity with ChatGPT
    0:08:39 – Microsoft’s involvement in Orca’s research
    0:09:04 – Orca vs Vicuna
    0:11:28 – Orca in common sense reasoning questions
    0:12:10 – Orca’s potential for improvement
    0:15:22 – Gap between open source and private models
    0:16:32 – Sam Altman’s perspective on OpenAI’s unique moat
    0:17:23 – Possible future videos on open source models

  • @TheMirrorslash says:

    I have a feeling that your theory about why Microsoft conducted this research is spot on. The fact that using LLMs to train other models was called “a false promise” to begin with is wild. It feels 100% like the logical step you’d take to build on existing models. And the fact that models can be “robbed” like this just shows that this technology will be everywhere, no matter what format you release it in.

    • @electrolove9538 says:

      It’s still a bit bizarre that MS would publish these findings. Why not keep them private? It also may rub OpenAI the wrong way and hurt their relationship… still doesn’t make sense to me 🤔

    • @TheJackiMonster says:

      I think it definitely shows that companies cannot really sell their models as easily as they thought with the current legality around AI training. But then the problem is their models are all built around the fact that they could train them without caring about the copyright of the information used as training sets.

      So either everyone gets robbed or AI might not be profitable because paying for copyright will be too expensive when it comes to generalized networks.

      Either way it’s interesting to watch, I think.

    • @mohammednisham7126 says:

      ​@@TheJackiMonster it might be more like foundational models won’t be profitable, but applications built on top of them definitely could be

  • @BodyMusicification says:

    Please more coverage on open source models. This is the most uplifting, hope-inspiring video I’ve watched of yours yet.

  • @RandyHawkinsMD says:

    This channel constitutes a highly valued way for me to benefit from our moderator’s experience and efforts. His insights are thoughtful, his research is current, and both guide my investigations. Many thanks. And by the way, open source developments are of particular interest.
