GPT 4 is Smarter than You Think: Introducing SmartGPT

In this video, I will not only show you how to get smarter results from GPT 4 yourself, but I will also showcase SmartGPT, a system which I believe, with evidence, might help beat state-of-the-art MMLU benchmark results.

This should serve as your ultimate guide for boosting the automatic technical performance of GPT 4, without even needing few-shot exemplars.

The video will cover papers published in the last 72 hours, like Automatically Discovered Chain of Thought, which beats even 'Let's think Step by Step' and the approach that combines it all.

Yes, the video also touches on the OpenAI DeepLearning Prompt Engineering Course, but the highlights come more from my own experiments using the MMLU benchmark, drawing upon insights from the recent Boosting Theory of Mind and Let's Work This Out Step By Step papers, and combining them with Reflexion and Dialogue-Enabled Resolving Agents.

Prompt Frameworks:

Answer: Let's work this out in a step by step way to be sure we have the right answer

You are a researcher tasked with investigating the X response options provided. List the flaws and faulty logic of each answer option. Let's work this out in a step by step way to be sure we have all the errors:

You are a resolver tasked with 1) finding which of the X answer options the researcher thought was best 2) improving that answer, and 3) Printing the improved answer in full. Let's work this out in a step by step way to be sure we have the right answer:
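
For anyone who wants to wire these prompts together themselves, here is a minimal sketch of how the chain might look in Python, assuming the official openai client. The model name, temperatures, number of answer options, and the exact way the question and options are stitched into each prompt are illustrative assumptions, not settings taken from the video.

```python
# Minimal sketch of the generator -> researcher -> resolver chain.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4"    # illustrative; use whichever model you have access to
NUM_OPTIONS = 3    # how many independent answer options to generate


def ask(prompt: str, temperature: float = 1.0) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def smart_gpt(question: str) -> str:
    # 1. Generator: sample several independent step-by-step answers.
    options = [
        ask(
            f"{question}\n"
            "Answer: Let's work this out in a step by step way to be sure we "
            "have the right answer"
        )
        for _ in range(NUM_OPTIONS)
    ]
    joined = "\n\n".join(
        f"Answer option {i + 1}:\n{option}" for i, option in enumerate(options)
    )

    # 2. Researcher: list the flaws and faulty logic of each option.
    research = ask(
        f"Question: {question}\n\n{joined}\n\n"
        f"You are a researcher tasked with investigating the {NUM_OPTIONS} "
        "response options provided. List the flaws and faulty logic of each "
        "answer option. Let's work this out in a step by step way to be sure "
        "we have all the errors:"
    )

    # 3. Resolver: pick the best option, improve it, print it in full.
    return ask(
        f"Question: {question}\n\n{joined}\n\n"
        f"Researcher findings:\n{research}\n\n"
        f"You are a resolver tasked with 1) finding which of the {NUM_OPTIONS} "
        "answer options the researcher thought was best, 2) improving that "
        "answer, and 3) printing the improved answer in full. Let's work this "
        "out in a step by step way to be sure we have the right answer:",
        temperature=0,
    )


if __name__ == "__main__":
    print(smart_gpt("5 t-shirts take 5 hours to dry in the sun. "
                    "How long do 30 t-shirts take?"))
```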

Automatically Discovered Chain of Thought:
Karpathy Tweet:
Best prompt: Theory of Mind:
Few Shot Improvements:
Dera Dialogue Paper:
MMLU:
GPT 4 Technical report:
Reflexion paper:
Why AI is Smart and Stupid:
Lennart Heim Video:

Non-Hype, Free Newsletter:

  • @markm0000 says:

    AI is developing so fast it’s past the scary part and I’m just bracing for impact right now.

    • @hanskraut2018 says:

      It's extremely slow. After a century where it's unclear whether there are problems with incentives or patents, they finally made a language model that mostly forgets what you say within 3 answers. It is really amazing in some respects though. But the claim that "hype" stops development is not true and yeah... lack of imagination

    • @vectoralphaSec says:

      You're likely to get more seriously injured if you brace for impact.

    • @iBuyBitcoin says:

      I'm all for the AI takeover, people and humanity suck anyway lol

    • @augustuslxiii says:

      What does “impact” mean, here?

    • @theeternalnow6506 says:

      Same. We are not ready for what’s coming. Just saw a new open source model that can take 65000 tokens as well. Trained in 9.5 days for 200k…

  • @Fs3i says:

    In my experience: the significantly easier “reflect on your answer” and then “considering the reflection, answer again” improves answers by an insane amount.

    It’s quite nice, because it’s easy to improve
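
For illustration, the two-turn "reflect, then answer again" pattern described here can be run as a short script rather than by hand; this is only a sketch, and the model name and follow-up wording are assumptions rather than the commenter's exact prompts.

```python
# Sketch of the "reflect on your answer" / "answer again" pattern.
from openai import OpenAI

client = OpenAI()


def reflect_and_answer(question: str, model: str = "gpt-4") -> str:
    messages = [{"role": "user", "content": question}]
    for follow_up in ("Reflect on your answer.",
                      "Considering the reflection, answer again."):
        reply = client.chat.completions.create(model=model, messages=messages)
        messages.append({"role": "assistant",
                         "content": reply.choices[0].message.content})
        messages.append({"role": "user", "content": follow_up})
    # Final call returns the answer produced after the reflection step.
    final = client.chat.completions.create(model=model, messages=messages)
    return final.choices[0].message.content
```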

    • @aiexplained-official says:

      It's gonna keep getting smarter. On the 'easier' point, I think you misunderstand: this is a single prompt on your end, you don't have to do anything extra.

    • @Fs3i says:

      @@aiexplained-official Yeah, but using the ChatGPT premium subscription I pay for is also quite convenient, and I have the site bookmarked everywhere. Convenience is important for me, and a Python program isn't always that, unfortunately.

      That said, I might use this for stuff that’s a bit harder to answer!

    • @RossShaw says:

      @@Fs3i I find the same problem. The web interface is still more convenient for me most of the time. However, with time I’ll probably deploy my own specialty services to Linode as a way to get the best of both worlds.

    • @HoD999x says:

      @@aiexplained-official prompt:
      “question: 5 tshirts take 5 hours to dry in the sun. how long do 30 tshirts take?
      give an answer, explain it, then try to improve it.”
      response:
      “Answer: 30 t-shirts take 30 hours to dry in the sun.

      Explanation: If 5 t-shirts take 5 hours to dry, then it means that each t-shirt takes 1 hour to dry (5 hours / 5 t-shirts). So, for 30 t-shirts, it would take 30 hours to dry (1 hour/t-shirt * 30 t-shirts).

      Improved Answer: While it is true that if you dry each t-shirt one at a time, it would take 30 hours, in reality, you can dry multiple t-shirts simultaneously. If you have enough space to lay out all 30 t-shirts at once, then it would still only take 5 hours to dry all 30 t-shirts. If you have limited space, you can calculate the drying time based on how many t-shirts you can dry at once. For example, if you can dry 10 t-shirts at a time, it would take 3 cycles of 5 hours each to dry all 30 t-shirts, totaling 15 hours.”

      I guess you can compress the entire process you described into a single multi-step prompt

    • @petersonclipshare says:

      The devs mentioned that if it answers with “my best guess is…”, it reduces GPT hallucinations by 80%

  • @DisturbedNeo says:

    I love this about current AI models, both LLMs and Image generators. You literally get better results by explicitly asking for them. You ask GPT-4 for a piece of code to do something, or Stable Diffusion for a landscape with a castle, you might get some decent results, but you’re much more likely to get much better results if you add in stuff like “Make sure the code follows best practices” or for images, “Best quality, Masterpiece” etc.

    Like the AI was just sat there, happily generating garbage for you, and then it suddenly realises “OH! You wanted my output to be _good_ ! Silly me, here you go.”

    I guess that’s what happens when you create something incapable of taking any pride in what it produces.

    • @aiexplained-official says:

      Exactly

    • @larsfaye292 says:

      Which, if you think like a computer, makes perfect sense. The robot will do EXACTLY as it's requested; no protests or even suggestions to improve (most of the time). The more explicit you are, the better it can perform its job.

    • @MrAchterbahnfahrer says:

      There is actually a paper called “Large Language Models and the Reverse Turing Test” by Terrence Sejnowski in which the so-called mirror hypothesis is proposed. This hypothesis suggests that the response of LLMs somewhat mirrors the intelligence of the prompter. This hypothesis may also explain why special sentences such as “Let’s think step by step.” or “Let’s work this out in a step by step way to be sure we have the right answer.” lead to better results, as these prime the LLM with information about it being expected to give a good and correct answer and as they also lead to the benefits of chain-of-thought.

    • @sidequestsally says:

      @@MrAchterbahnfahrer Thanks! I’m half way through the paper! Cheers for the suggestion.

    • @McDonaldsCalifornia says:

      Interesting thought.

      In a way the taking pride part could be added on directly.

      An AI with an ego might put in more effort lol

  • @stevenking871 says:

    I'm a data scientist and language model researcher for a large international company. I also have access to GPT-4 and I would love to help implement this with LangChain.

    • @aiexplained-official says:

      Hey Steven, awesome, do email in the channel description.

    • @draken5379 says:

      LangChain already invokes the AI to 'think' or have 'thoughts' about what it is going to do or say. It's pretty much the base of the concept they came up with, and the one most use to this day.

      You can see it all at work if you dive into the internal prompts being used by agents in langchain.

    • @tylerk3130 says:

      @@aiexplained-official Mr. King beat me to it, but I’m in a similar position both with regards to what I do for a living, and level of access. Hope you don’t mind me sending you an email as well.

    • @wendten2 says:

      @@aiexplained-official I got it about 3 days after it was released, and I'm a complete nobody. Try to reapply for it and say you do academic research; that's what got me access.

    • @mr.wonder8168 says:

      Something I think about when prompting is the three-dimensional space the weights are distributed in. I may be wrong, but the vector is the connection between multiple points of reference for the model. I am guessing the model has some sort of range of points to consider before supplying the answer. The "Let's" brings the model into the area of teaching or tutorials and therefore more considered train-of-thought answers. Considering the models were developed on GPUs, I am guessing the vectors are actually like the polygons used to render 3D images. Having the model focus in a more consolidated area might allow for a familiar pattern/pipeline of probable correct answers within the "time out or time allotted" equation for the model. It would be interesting to have a 3D-polygon-centric programmer look at the vector space as polygons and see how they would optimize how we structure the questions we ask the models. Not a scientist or anything, but it might be a different approach.
      Thanks for the videos. Peace:)

  • @jackmagill8579 says:

    Phillip – your videos are simply amazing. Thank you for all you do and keep coming out with the best content on YouTube!

  • @ExplorersXRotmg says:

    One test prompt that may be worth doing is to tell GPT something like, “Here are some common logical pitfalls/mistakes that are made when solving problems like these. Please ensure to account for these common mistakes in your answer” Then you could catalog some generalized errors that the current model being tested runs into and feed that as an input to further boost results.

    Heck, you could have GPT generate its own list of generalized logical errors by giving it a set of more understandable tasks, asking it to compile a list of logical errors that could be systemic/generalized, then have it use that list as its own self-feedback prompt on the tougher questions.
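
A rough sketch of that self-feedback idea, under the assumption that the error catalog is generated once from a handful of easier questions and then prepended to harder ones; the prompt wording, model name, and calibration questions below are illustrative, not taken from the video.

```python
# Sketch: have GPT compile its own list of common logical errors, then reuse
# that list as a preamble when answering tougher questions.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4"


def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


# 1. Build the checklist once from easier, well-understood tasks.
calibration_tasks = [
    "5 t-shirts take 5 hours to dry in the sun. How long do 30 t-shirts take?",
    # ...more simple questions the model is known to fumble
]
checklist = ask(
    "Here are some problems:\n" + "\n".join(calibration_tasks) + "\n\n"
    "Compile a short list of generalized logical errors or pitfalls a solver "
    "might fall into on problems like these."
)


# 2. Reuse the checklist as a self-feedback preamble on tougher questions.
def answer_with_checklist(question: str) -> str:
    return ask(
        "Here are some common logical pitfalls/mistakes that are made when "
        f"solving problems like these:\n{checklist}\n\n"
        "Please ensure to account for these common mistakes in your answer.\n\n"
        f"Question: {question}"
    )
```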

  • @solaawodiya7360 says:

    Thanks for bringing back an update on GPT-4, Philip. Tbh, I don't think any developer of LLMs would know the full capability of their models before they release them to the public, probably due to their organisational bias. As Sam Altman said, these language models can't be tested only in the labs; they have to be tested by the public to know their full capabilities. I think moving forward, the research papers they release with their new models will be just a way to gauge or estimate how powerful the models are.

    • @aiexplained-official says:

      Totally get your point Sola, thanks again for the comment

    • @nonesomanynone says:

      I’m not sure that it’s organizational bias as much as it is that we really don’t have any predictive framework for LLMs, at least as far as I know.

      That's not to say there won't ever be a way to predict performance and capability. Hell, it'll probably be another LLM built to analyze LLMs that turns out to be the solution.

    • @theawebster1505 says:

      Sam Altman said it in his interview: we need to get the model out now, while we can still modify and "teach" it with comparatively minor consequences. (As opposed to later, when it will be a big black box.)

  • @alwayslistening3340 says:

    We need to get you a GPT 4 API key, ASAP!

    Thank you for continuing to report on AI.

  • @theeternalnow6506 says:

    You have a great channel btw. You go much deeper into the weeds and papers than others. Much more scientific approach than just "look at this stuff! This is nuts!".

    I appreciate it a lot.

    • @teachingtotechcareer says:

      was gonna say exactly this. appreciate the level of depth in these vids!

    • @thanX says:

      I agree. This is probably the only channel on this topic where I subscribed with the "all notifications" bell activated.
      Unfortunately YouTube keeps showing me the "look at this, this is nuts" kind of videos on the home stream… I'd love, in this age of artificial intelligence, a solution to filter out all the clickbait videos from the feed…

    • @thanX says:

      @@sachoslks I know and I use it. Nevertheless, on one hand it is not enough to instruct the algorithm; on the other, it might be too harsh in some cases, since some YouTubers provide decent content. I don't want to avoid their videos, I just want them to rank worse if they use manipulative techniques.

    • @benhallo1553 says:

      What a great comment

    • @a.thales7641 says:

      I agree.

  • @JazevoAudiosurf says:

    1. grounding
    2. CoT
    3. reflection
    4. dialogue reflection
    5. long term memory using vector db

    • @GrindThisGame says:

      6. Shared memory between a group of smart GPT-4 “scientists”.

    • @alzblb1417 says:

      nice

    • @Martineski says:

      @Jorge Espinoza Maybe he meant lowering the model's temperature to limit its "creativity"?

    • @ronnetgrazer362 says:

      Yes, but perhaps also doing different runs, while changing the order around a bit and adding or leaving out steps and seeing which method works best for a given problem.

      It could also use this trying out of different solving orders as a fast and coarse preliminary step, to figure out what kind of problem it is trying to solve based on the prelim answers. Then it would have an indication which logical errors to be extra wary of.

      But I could be overthinking it, by using too many steps myself. I suspect the optimum is going to look stupid simple method-wise, just on a massive scale.

    • @aelolul says:

      that list sounds like dialectical behavioral therapy tbh

  • @fsmvda says:

    To me the reason for this improvement is giving the model working memory so that it uses the external loop to get more computational steps than the depth of the model. It’s like that paper that said transformer models + external memory is Turing complete.

    • @ericstromquist9458 says:

      I think you are correct, and I find it interesting that ordinary human language may be the right way to implement recurrence, memory and even world models in AI that’s based on LLMs. Another longer term advantage of using human language to implement these things is that from a safety standpoint, departures from alignment with our interests should be much easier to detect when the memory or world model is in the form of language. Signs of potential misalignments could then be uncovered by monitoring the memory or world model of the AI using simpler LLMs focused on that task.

  • @ct5471 says:

    Parallel response generation is also interesting in the context of hallucination. Nvidia recently presented their guardrails framework; the second part, after the fact check, is similar: if you feed a query to multiple models and feed the different answers to another LLM instance and ask it to cluster them and check for consistency, you can distinguish knowledge from hallucination. If the model knows something, the answers will be similar; if it is making things up, the hallucinations will usually go in different (inconsistent) directions. The consistent result can be fed back into a feedback loop for self-reflection. If no answer is consistent, the model doesn't know, and that could be given out as the answer. Or the model could then actively search online or be tasked to identify a procedure to get the answer (an experiment or something), which would go in the direction of Auto-GPT or the recent paper on autonomous science in chemistry.
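
The consistency check described in this comment might be sketched roughly as follows, using repeated samples from a single model rather than several different models; this is the general sample-and-cluster pattern, not Nvidia's actual guardrails implementation, and the model name and judging prompt are illustrative.

```python
# Sketch: sample several answers to the same question, then ask another model
# instance to cluster them and judge whether they are consistent.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4"


def sample_answers(question: str, n: int = 5) -> list[str]:
    response = client.chat.completions.create(
        model=MODEL,
        n=n,               # n independent samples of the same prompt
        temperature=1.0,   # enough randomness for hallucinations to diverge
        messages=[{"role": "user", "content": question}],
    )
    return [choice.message.content for choice in response.choices]


def consistency_check(question: str) -> str:
    answers = sample_answers(question)
    numbered = "\n\n".join(f"Answer {i + 1}:\n{a}" for i, a in enumerate(answers))
    judge_prompt = (
        f"Question: {question}\n\n{numbered}\n\n"
        "Cluster these answers and check them for consistency. If most of them "
        "agree, state the consistent answer. If they contradict each other, "
        "reply with exactly: INCONSISTENT"
    )
    verdict = client.chat.completions.create(
        model=MODEL,
        temperature=0,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return verdict.choices[0].message.content
```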

    • @ChaoticNeutralMatt says:

      Fascinating... actually this is related to drift in telephone games and to knowledge from different historical sources which overlap. There's a more specific idea I had, but I can't put it into words; they aren't forthcoming. Anyway, food for thought.

    • @eugenes9751 says:

      You wouldn’t even need to prompt multiple models at that point. Just ask yourself the same question multiple times and see if the results agree.
      Humans do this kind of hallucinating all the time, where we think we’re right and our brain fills in the reasons for “why we MUST be right”. Especially during an emotional argument.

    • @ct5471 says:

      @@eugenes9751 Yes, right now with the same model. When you have smaller models like Alpaca which you can fine-tune yourself, at some later point it may make sense to have multiple and train them independently in specialized directions. The brain seems to do something similar, by having a large number of so-called cortical columns that specialize, with overlapping but not redundant areas of expertise. It is interesting that, looking at the architecture of transformer models, they have a number of similarities, if you for instance look at the work of Numenta. Sensor-motor integration resembles the vector encoding in LLMs where you bring together semantic information and position. The attention mechanism plus the semantic vector encoding resemble the principle of SDRs (sparse distributed representations). Also, that would bring down training costs.

    • @cholst1 says:

      I was just thinking something similar yesterday myself.

    • @Naokarma says:

      This reminds me of the time I heard that NASA rockets run 4 copies of the same software simultaneously, and run a “majority vote” style choice for which computer to follow. This is due to the way muons and stuff can change an output in computers outside of the atmosphere, leading to interference, so it makes sense to do the same type of redundancies for any other system with randomness involved.

  • @xSugknight says:

    My take on the “step by step” and why it works:

    An LLM is just predicting word by word, and if you present it with a task that requires a chain of reasoning to get the final result (e.g. A -> B -> C -> Result), it is way harder to get all the intermediate steps right implicitly. If you ask it to write complete answers, it will go over B first, then from this sequence of tokens realize that C makes sense, and so on.

    Anyway, you are one of the YouTubers whose videos I instantly click on! And I really like your scientific take and presentation in this video, including your SmartGPT, which is a great idea btw!

    • @aiexplained-official says:

      Thanks so much Markus, great theory

    • @travischristensen5385 says:

      I'm behind this theory. When I as a human solve a step-by-step problem, I can also guess the answer and then do the work of finding the real answer. It is the function of my memory that allows me to do a problem in my head and then share the answer and follow it with the explanation.

    • @TomFranklinX says:

      It was said that the legendary mathematician Ramanujan could solve difficult problems by manifesting the answer directly in his mind; but still, it doesn’t always work, and chain-of-thought reasoning vastly improved his effectiveness.

    • @ticketforlife2103 says:

      @@TomFranklinX There's no such thing as "manifestation" just like that; his brain already did the step-by-step thinking even without his conscious awareness.

    • @Andytlp says:

      @@ticketforlife2103 The teacher is still gonna ask you to show how you got to the conclusion even if the answer is right. This must be making GPT go through all the steps without skipping. Skipping is faster but it can be wrong.

  • @alanburgess6467 says:

    You appear to have several offers, but I'm happy to let you have a key. The thought of you doing this process manually shows your dedication 😊. Such an interesting topic. Must admit that as I get older, I'm doing a similar process with my brain 😂

    • @daveyseeman says:

      I know I’m just a random dude from the middle of nowhere, but can I please have a key to do some testing of my own? I have an idea that should be able to increase the performance quite a bit. It involves asking GPT-4 to give itself a role (e.g., a mathematician) and then re-send the prompt to GPT-4 with its new role appended to the start of the prompt. Thanks

    • @Vysair says:

      @@daveyseeman An account created today huh, are you an AI or a researcher?

    • @daveyseeman says:

      @@Vysair I created this account to make the request

    • @alanburgess6467 says:

      @@daveyseeman I’ve only just seen this request! A bit reluctant to release a key that will rack up costs to my account. I don’t think you will need to wait much longer, OpenAI seem to be opening the API to people now.

  • @mparkerchavez says:

    If you still need a GPT 4 API key, I’d be more than happy to share it with you to help with your research. I’m a product designer and venture architect currently working to help my organization understand how we may leverage this technology for the next 3-5 years. I also have some more socially focused ideas and goals that originally drove me to do a deep dive into this technology. Love the work you do!

    • @Magnetar_StarHeart says:

      I appreciate the socially focused part of your comment. As a more abstract and visionary type of thinker, I'm excited to attempt to push forward on socially minded blueprints I've had for a while.

      Yet it’s still a deep dive to make these ideas and plans a true reality.

  • @Thomasfrank says:

    Recently got access to the GPT-4 API; let me know if you’d like to collab on some tests! (Can’t believe they haven’t given you access yet)

    • @TheMirrorslash says:

      The one and only. Great to see other brilliant creators on here.

    • @test-ml9wr says:

      He literally just repeated what was in the paper. This isn’t original research.

    • @timseguine2 says:

      @@test-ml9wr He combined the methods of a few papers, actually.

    • @test-ml9wr says:

      @@timseguine2 no he didn’t, actually.

    • @timseguine2 says:

      @@test-ml9wr Yes he did. He lists them clearly on his flow chart (He even specifically names the titles of all of the papers he is using). He is using:

      Zhou et al 2023, Shinn et al 2023, Nair et al 2023

      all simultaneously to give a better answer. And his original implementation was using Kojima et al 2022 instead of Zhou.

      This is clearly an amalgamation of techniques.

  • @bujin5455 says:

    I’d say this channel is quickly becoming my favorite AI channel, but honestly, at this point, I think it’s simply moving into its own league.

  • @piotr_sss says:

    This one addition to the original prompts of this lady gave me correct and concise answers to both questions: "Let's work this out in a step by step way to find all possible answers and to finally pick the most possible one."

  • @dcgamer1027 says:

    A friend of mine, who is more of a casual observer of all this AI stuff, mentioned that he really wished ChatGPT would ask questions more often rather than just give answers. A lot of the time when we ask it a question or give it a prompt, there is an implicit assumption that we know what we're asking or have communicated it well, but if ChatGPT asked more questions to clarify user intent, that could enhance the user experience much more. I thought it was a great idea and wanted to share.

    • @skierpage says:

      I agree. It doesn't seem to know when it's unclear on some detail of what you're asking. Imagine how good they'd be at common-sense reasoning if they could ask "Do you have enough room outside to dry 30 shirts at once in the same manner you dried 5 shirts?" Guessing the next word won't produce that sort of conversation; it gets filtered out before it arrives in the vast corpora that LLMs ingest. Also, I don't think the harried human evaluators giving Reinforcement Learning from Human Feedback scores have the time or means to encourage that sort of multi-step back and forth.

      Does adding “Ask questions more often to clarify my intent rather than just give me answers” to prompts help?

    • @dcgamer1027 says:

      @@skierpage I’ve been using ‘ask for any needed clarification before answering the prompt’ which has been good so far with gpt4

    • @aintaintaword666 says:

      @@dcgamer1027 Yes indeed, that helps a lot. Also you can tell it to be a “prompt engineer” and iteratively ask relevant questions about what you want before giving you the final prompt. This video contains lots of comments with similar ideas: youtu.be/OgYQAS9LY3o

    • @ciaopizzabella says:

      Did you try suggesting it in your prompts to ask questions if needed?

    • @samhaun3929 says:

      They recently introduced persistent system prompts so it should get easier, i.e. you don’t have to tell it “ask for clarification” every single time.

  • @TheTwober says:

    I took your step-by-step approach and the idea of AutoGPT to generate an example game with GPT, which it was previously incapable of doing, and it not only works wonders, it could be the way to actually get it to write any kind of software you can think of. The core idea is this prompt:

    “You end every answer with a suggestion for a prompt that I should give you as follow-up that will either fix issues, optimize the code, make the game better, or develop it further. Always write sophisticated, high quality code, as a professional would do.”

    Which causes GPT to answer the question and then deliver a continuation prompt. Now, if someone had access to the API, you could extract that prompt, have GPT spit out something like Git patch files, and with that assemble the entire piece of software step by step. I did that with about 20 prompts on that example game, and it became quite playable, with a lot of well-working content.

    GPT eventually loses track of the code it wrote so far when the internal memory is full, and then it starts to make a lot of mistakes, but if that were somehow fixed, you would have a complete code-generator template for any kind of problem.
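
A rough sketch of that continuation-prompt loop is below. The "NEXT PROMPT:" marker is an assumption added here so the follow-up suggestion can be extracted mechanically; it is not part of the commenter's original instruction, and the model name and round count are illustrative.

```python
# Sketch: ask GPT to end every answer with a suggested follow-up prompt, then
# feed that suggestion straight back in for the next round.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4"

SYSTEM = (
    "You end every answer with a suggestion for a prompt that I should give "
    "you as follow-up that will either fix issues, optimize the code, make "
    "the game better, or develop it further. Always write sophisticated, "
    "high quality code, as a professional would do. Put that suggestion on a "
    "final line beginning with 'NEXT PROMPT:'."
)


def run_loop(first_prompt: str, rounds: int = 20) -> list[str]:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": first_prompt}]
    answers = []
    for _ in range(rounds):
        reply = client.chat.completions.create(model=MODEL, messages=messages)
        text = reply.choices[0].message.content
        answers.append(text)
        messages.append({"role": "assistant", "content": text})
        # Extract the suggested continuation and send it as the next prompt.
        if "NEXT PROMPT:" in text:
            next_prompt = text.rsplit("NEXT PROMPT:", 1)[-1].strip()
        else:
            next_prompt = "Continue developing the game."
        messages.append({"role": "user", "content": next_prompt})
    return answers
```

Because the whole conversation is resent each round, this sketch eventually runs into the same context limit the comment describes, at which point older code would have to be summarized or dropped.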
