GPT 4 is Smarter than You Think: Introducing SmartGPT
In this video, I will not only show you how to get smarter results from GPT-4 yourself, but also showcase SmartGPT, a system which I believe, with evidence, might help beat state-of-the-art results on the MMLU benchmark.
This should serve as your ultimate guide for boosting the automatic technical performance of GPT-4, without even needing few-shot exemplars.
The video will cover papers published in the last 72 hours, like Automatically Discovered Chain of Thought, which beats even 'Let's think step by step', and the approach that combines it all.
Yes, the video also touches on the OpenAI / DeepLearning.AI Prompt Engineering course, but the highlights come more from my own experiments using the MMLU benchmark, drawing upon insights from the recent Boosting Theory of Mind and Let’s Work This Out Step by Step papers, and combining them with Reflexion and Dialogue-Enabled Resolving Agents.
Prompt Frameworks:
Answer: Let's work this out in a step by step way to be sure we have the right answer
You are a researcher tasked with investigating the X response options provided. List the flaws and faulty logic of each answer option. Let's work this out in a step by step way to be sure we have all the errors:
You are a resolver tasked with 1) finding which of the X answer options the researcher thought was best 2) improving that answer, and 3) Printing the improved answer in full. Let's work this out in a step by step way to be sure we have the right answer:
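For those who want to experiment, here is a minimal sketch of how these three prompts might be chained together with the openai Python package (pre-1.0 SDK style). The `ask` helper, the draft count, and the exact wiring are my own illustrative choices, not the precise implementation shown in the video:

```python
import openai  # pip install "openai<1.0"; pre-1.0 SDK style assumed throughout

openai.api_key = "YOUR_API_KEY"

def ask(prompt, n=1, temperature=1.0):
    """Send one user message to GPT-4 and return the n reply texts."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=temperature,
    )
    return [choice.message.content for choice in response.choices]

def smart_gpt(question, n_drafts=3):
    # Step 1: several step-by-step draft answers (the first prompt framework).
    drafts = ask(
        f"{question}\nAnswer: Let's work this out in a step by step way "
        "to be sure we have the right answer",
        n=n_drafts,
    )
    options = "\n\n".join(
        f"Answer option {i + 1}:\n{draft}" for i, draft in enumerate(drafts)
    )

    # Step 2: the researcher critiques every draft (the second framework).
    critique = ask(
        f"{question}\n\n{options}\n\n"
        f"You are a researcher tasked with investigating the {n_drafts} response "
        "options provided. List the flaws and faulty logic of each answer option. "
        "Let's work this out in a step by step way to be sure we have all the errors:"
    )[0]

    # Step 3: the resolver picks the best draft and improves it (the third framework).
    return ask(
        f"{question}\n\n{options}\n\nResearcher findings:\n{critique}\n\n"
        f"You are a resolver tasked with 1) finding which of the {n_drafts} answer "
        "options the researcher thought was best 2) improving that answer, and "
        "3) Printing the improved answer in full. Let's work this out in a step "
        "by step way to be sure we have the right answer:"
    )[0]

print(smart_gpt("5 t-shirts take 5 hours to dry in the sun. How long do 30 t-shirts take?"))
```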
Automatically Discovered Chain of Thought:
Karpathy Tweet:
Best prompt: Theory of Mind:
Few Shot Improvements:
Dera Dialogue Paper:
MMLU:
GPT-4 Technical Report:
Reflexion paper:
Why AI is Smart and Stupid:
Lennart Heim Video:
Non-Hype, Free Newsletter:
AI is developing so fast it’s past the scary part and I’m just bracing for impact right now.
It’s extremely slow; after a century in which it’s unclear whether the problems were with incentives or patents, they finally made a language model that mostly forgets what you say within three answers. It is really amazing in some respects, though. But the claim that “hype” stops development is not true and, well….. lack of imagination
You’re likely to get more seriously injured if you brace for impact.
I’m all for the AI takeover, people and humanity suck anyway lol
What does “impact” mean, here?
Same. We are not ready for what’s coming. Just saw a new open source model that can take 65000 tokens as well. Trained in 9.5 days for 200k…
In my experience: the significantly easier “reflect on your answer” and then “considering the reflection, answer again” improves answers by an insane amount.
It’s quite nice, because it’s easy to improve
It’s gonna keep getting smarter. On the ‘easier’ point, I think you misunderstand: this is a single prompt on your end, you don’t have to do anything extra.
@@aiexplained-official Yeah, but using the ChatGPT premium subscription I pay for is also quite convenient, and I have the site bookmarked everywhere. Convenience is important for me, and a Python program isn’t always that, unfortunately.
That said, I might use this for stuff that’s a bit harder to answer!
@@Fs3i I find the same problem. The web interface is still more convenient for me most of the time. However, with time I’ll probably deploy my own specialty services to Linode as a way to get the best of both worlds.
@@aiexplained-official prompt:
“question: 5 tshirts take 5 hours to dry in the sun. how long do 30 tshirts take?
give an answer, explain it, then try to improve it.”
response:
“Answer: 30 t-shirts take 30 hours to dry in the sun.
Explanation: If 5 t-shirts take 5 hours to dry, then it means that each t-shirt takes 1 hour to dry (5 hours / 5 t-shirts). So, for 30 t-shirts, it would take 30 hours to dry (1 hour/t-shirt * 30 t-shirts).
Improved Answer: While it is true that if you dry each t-shirt one at a time, it would take 30 hours, in reality, you can dry multiple t-shirts simultaneously. If you have enough space to lay out all 30 t-shirts at once, then it would still only take 5 hours to dry all 30 t-shirts. If you have limited space, you can calculate the drying time based on how many t-shirts you can dry at once. For example, if you can dry 10 t-shirts at a time, it would take 3 cycles of 5 hours each to dry all 30 t-shirts, totaling 15 hours.”
I guess you can compress the entire process you described into a single multi-step prompt.
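A minimal sketch of that compression as a scripted multi-turn exchange, assuming the pre-1.0 openai SDK; the follow-up wordings come from the “reflect on your answer” comment above:

```python
import openai  # pre-1.0 SDK style assumed

openai.api_key = "YOUR_API_KEY"

def chat(messages):
    response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content

question = "question: 5 tshirts take 5 hours to dry in the sun. how long do 30 tshirts take?"

# Turn 1: first attempt.
messages = [{"role": "user", "content": question}]
first = chat(messages)

# Turn 2: ask the model to reflect on its own answer.
messages += [{"role": "assistant", "content": first},
             {"role": "user", "content": "Reflect on your answer."}]
reflection = chat(messages)

# Turn 3: answer again in light of the reflection.
messages += [{"role": "assistant", "content": reflection},
             {"role": "user", "content": "Considering the reflection, answer again."}]
print(chat(messages))
```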
The devs mentioned that if it answers with “my best guess is…”, it reduces GPT hallucinations by 80%
I love this about current AI models, both LLMs and Image generators. You literally get better results by explicitly asking for them. You ask GPT-4 for a piece of code to do something, or Stable Diffusion for a landscape with a castle, you might get some decent results, but you’re much more likely to get much better results if you add in stuff like “Make sure the code follows best practices” or for images, “Best quality, Masterpiece” etc.
Like the AI was just sat there, happily generating garbage for you, and then it suddenly realises “OH! You wanted my output to be _good_ ! Silly me, here you go.”
I guess that’s what happens when you create something incapable of taking any pride in what it produces.
Exactly
Which, if you think like a computer, makes perfect sense. The robot will do EXACTLY as it’s requested; no protests or even suggestions to improve (most of the time). The more explicit you are, the better it can perform its job.
There is actually a paper called “Large Language Models and the Reverse Turing Test” by Terrence Sejnowski in which the so-called mirror hypothesis is proposed. This hypothesis suggests that the response of LLMs somewhat mirrors the intelligence of the prompter. This hypothesis may also explain why special sentences such as “Let’s think step by step.” or “Let’s work this out in a step by step way to be sure we have the right answer.” lead to better results, as these prime the LLM with information about it being expected to give a good and correct answer and as they also lead to the benefits of chain-of-thought.
@@MrAchterbahnfahrer Thanks! I’m halfway through the paper! Cheers for the suggestion.
Interesting thought.
In a way the taking pride part could be added on directly.
An AI with an ego might put in more effort lol
I’m a Data scientist and Language model researcher for a large international company. I also have access to GPT-4 and I would love to help implement this with Langchain.
Hey Steven, awesome, do email in the channel description.
Langchain already invokes the AI to ‘think’ or have ‘thoughts’ about what it is going to do or say. It’s pretty much the basis of the concept they came up with, and most use it to this day.
You can see it all at work if you dive into the internal prompts being used by agents in langchain.
@@aiexplained-official Mr. King beat me to it, but I’m in a similar position both with regards to what I do for a living, and level of access. Hope you don’t mind me sending you an email as well.
@@aiexplained-official I got it about 3 days after it was released, and I’m a complete nobody. Try to reapply for it and say you do academic research; that’s what got me access.
Something I think about when prompting is the three-dimensional space the weights are distributed in. I may be wrong, but the vector is the connection between multiple points of reference for the model. I am guessing the model has some sort of range of points to consider before supplying the answer. The “Let’s” brings the model into the area of teaching or tutorials, and therefore more considered, train-of-thought answers. Considering the models were developed on GPUs, I am guessing the vectors are actually like the polygons used to render 3D images. Having the model focus in a more consolidated area might allow for a more focused area, or a familiar pattern/pipeline of probable correct answers, within the “time out or time allotted” equation for the model. It would be interesting to have a polygon-centric 3D programmer look at the vector space as polygons and see how they would optimize how we structure the questions we ask the models. Not a scientist or anything, but it might be a different approach.
Thanks for the videos. Peace:)
Phillip – your videos are simply amazing. Thank you for all you do and keep coming out with the best content on YouTube!
Thanks so much Jack!
You need OpenAI to give you a GPT-4 api key as soon as possible!
One test prompt that may be worth doing is to tell GPT something like, “Here are some common logical pitfalls/mistakes that are made when solving problems like these. Please ensure to account for these common mistakes in your answer” Then you could catalog some generalized errors that the current model being tested runs into and feed that as an input to further boost results.
Heck, you could have GPT generate its own list of generalized logical errors by giving it a set of more understandable tasks, asking it to compile a list of logical errors that could be systemic/generalized, then have it use that list as its own self-feedback prompt on the tougher questions.
This is a good idea!
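A rough sketch of that two-stage idea (pre-1.0 openai SDK; the warm-up task list and prompt wording are illustrative):

```python
import openai  # pre-1.0 SDK style assumed

openai.api_key = "YOUR_API_KEY"

def ask(prompt):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Stage 1: have GPT compile its own catalogue of generalized errors
# from a set of easier warm-up tasks (this list is illustrative).
warm_up_tasks = [
    "5 t-shirts take 5 hours to dry in the sun. How long do 30 t-shirts take?",
    "A bat and a ball cost $1.10 and the bat costs $1 more than the ball. "
    "How much does the ball cost?",
]
pitfalls = ask(
    "Solve these problems:\n" + "\n".join(warm_up_tasks) + "\n\n"
    "Now compile a list of common logical pitfalls/mistakes that could be "
    "systemic/generalized when solving problems like these."
)

# Stage 2: feed that catalogue back in as a self-feedback prompt on a harder question.
hard_question = "..."  # the tougher question goes here
print(ask(
    "Here are some common logical pitfalls/mistakes that are made when solving "
    f"problems like these:\n{pitfalls}\n\n"
    "Please ensure to account for these common mistakes in your answer.\n\n"
    + hard_question
))
```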
Thanks for bringing back an update on GPT-4, Philip. Tbh, I don’t think any developer of LLMs would know the full capability of their models before releasing them to the public, probably due to organisational bias. As Sam Altman said, these language models can’t be tested only in the labs; they have to be tested by the public to learn their full capabilities. I think moving forward, the research papers they release with their new models will be just a way to gauge or estimate how powerful their models are.
Totally get your point Sola, thanks again for the comment
I’m not sure that it’s organizational bias as much as it is that we really don’t have any predictive framework for LLMs, at least as far as I know.
That’s not to say there won’t ever be a way to predict performance and capability. Hell, it’ll probably be another LLM built to analyze LLMs that turns out to be the solution.
Sam Altman said it in his interview – We need to get the model out now, while we can still modify and “teach” it with minor comparative consequences. (As opposed to later when it will be a big black box).
We need to get you a GPT 4 API key, ASAP!
Thank you for continuing to report on AI.
Thanks, already had a lot of kind offers
You have a great channel btw. You go much deeper into the weeds and papers than others. A much more scientific approach than just “look at this stuff! This is nuts!”.
I appreciate it a lot.
was gonna say exactly this. appreciate the level of depth in these vids!
I agree. This is probably the only channel on this topic I’ve subscribed to with the “all notifications” bell activated.
Unfortunately YouTube keeps on showing me the “look at this, this is nuts” kind of videos on the home stream… I’d love, in this age of artificial intelligence, a solution to filter out all the click-baiter videos from the feed…
@@sachoslks I know and I use it. Nevertheless, on one hand it is not enough to instruct the algorithm; on the other, it might be too harsh in some cases, as some YouTubers provide decent content. I don’t want to avoid their videos, I just want them to rank worse if they use manipulative techniques.
What a great comment
I agree.
1. Grounding
2. CoT
3. Reflection
4. Dialogue reflection
5. Long-term memory using a vector DB (see the sketch after this thread)
6. Shared memory between a group of smart GPT-4 “scientists”.
nice
@Jorge Espinoza Maybe he meant lowering the model’s temperature to limit its “creativity”?
Yes, but perhaps also doing different runs, while changing the order around a bit and adding or leaving out steps and seeing which method works best for a given problem.
It could also use this trying out of different solving orders as a fast and coarse preliminary step, to figure out what kind of problem it is trying to solve based on the prelim answers. Then it would have an indication which logical errors to be extra wary of.
But I could be overthinking it, by using too many steps myself. I suspect the optimum is going to look stupid simple method-wise, just on a massive scale.
that list sounds like dialectical behavioral therapy tbh
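On point 5 of the list above, a minimal in-memory sketch of long-term memory with embeddings (pre-1.0 openai SDK; a real deployment would swap the plain list for an actual vector database such as Pinecone or Chroma, and the embedding model name is just the common default of the time):

```python
import numpy as np
import openai  # pre-1.0 SDK style assumed

openai.api_key = "YOUR_API_KEY"

def embed(text):
    """Embed text with OpenAI's embedding endpoint."""
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

memory = []  # list of (vector, text) pairs; a stand-in for a real vector DB

def remember(text):
    memory.append((embed(text), text))

def recall(query, k=3):
    """Return up to k stored memories ranked by cosine similarity to the query."""
    q = embed(query)
    ranked = sorted(
        memory,
        key=lambda pair: -float(np.dot(pair[0], q))
        / (np.linalg.norm(pair[0]) * np.linalg.norm(q)),
    )
    return [text for _, text in ranked[:k]]

remember("The user prefers step-by-step explanations.")
print(recall("How should answers be formatted?"))
```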
To me the reason for this improvement is giving the model working memory so that it uses the external loop to get more computational steps than the depth of the model. It’s like that paper that said transformer models + external memory is Turing complete.
I think you are correct, and I find it interesting that ordinary human language may be the right way to implement recurrence, memory and even world models in AI that’s based on LLMs. Another longer term advantage of using human language to implement these things is that from a safety standpoint, departures from alignment with our interests should be much easier to detect when the memory or world model is in the form of language. Signs of potential misalignments could then be uncovered by monitoring the memory or world model of the AI using simpler LLMs focused on that task.
Parallel response generation is also interesting in the context of hallucination. Nvidia recently presented their guardrails framework, and the second part, after the fact check, is similar: if you feed a question to multiple models, feed the different answers to another LLM instance, and ask it to cluster them and check for consistency, you can distinguish knowledge from hallucination. If the model knows something, the answers will be similar; if it is making things up, the hallucinations will usually go in different (inconsistent) directions. The consistent result can be fed back into a feedback loop for self-reflection. If no answer is consistent, the model doesn’t know, which could be given out as the answer. Or the model could then actively search online or be tasked with identifying a procedure to get the answer (an experiment or something), which would go in the direction of AutoGPT or the recent paper on autonomous science in chemistry.
Fascinating… actually this is related to drift in telephone games and… knowledge from different historical sources which overlap… there’s a more specific idea I had, but I can’t put it into words; they aren’t forthcoming. Anyway, food for thought.
You wouldn’t even need to prompt multiple models at that point. Just ask yourself the same question multiple times and see if the results agree.
Humans do this kind of hallucinating all the time, where we think we’re right and our brain fills in the reasons for “why we MUST be right”. Especially during an emotional argument.
@@eugenes9751 Yes, right now with the same model. When you have smaller models like Alpaca which you can fine-tune yourself, at some later point it may make sense to have multiple models and train them independently into specialized directions. The brain seems to do something similar, by having a large number of so-called cortical columns that specialize, with overlapping but not redundant areas of expertise. It is interesting that the architecture of transformer models has a number of similarities, if you for instance look at the work of Numenta. Sensor-motor integration resembles the vector encoding in LLMs, where you bring together semantic information and position. The attention mechanism plus the semantic vector encoding resemble the principle of SDRs (sparse distributed representations). That would also bring down training costs.
I was just thinking something similar yesterday myself.
This reminds me of the time I heard that NASA rockets run 4 copies of the same software simultaneously, and run a “majority vote” style choice for which computer to follow. This is due to the way muons and stuff can change an output in computers outside of the atmosphere, leading to interference, so it makes sense to do the same type of redundancies for any other system with randomness involved.
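In that same majority-vote spirit, a crude sketch of the consistency check described a few comments up (pre-1.0 openai SDK; exact string matching and the 60% threshold are my simplifications of the LLM-based clustering the guardrails framework uses):

```python
from collections import Counter

import openai  # pre-1.0 SDK style assumed

openai.api_key = "YOUR_API_KEY"

def sample_answers(question, n=5, temperature=1.0):
    """Ask the same question n times at non-zero temperature."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
        n=n,
        temperature=temperature,
    )
    return [choice.message.content for choice in response.choices]

def consistent_answer(question, threshold=0.6):
    answers = sample_answers(question)
    # Exact-match voting; the guardrails idea instead asks another LLM
    # instance to cluster semantically similar answers.
    counts = Counter(answer.strip() for answer in answers)
    best, freq = counts.most_common(1)[0]
    if freq / len(answers) >= threshold:
        return best  # the samples agree: likely knowledge
    return None  # the samples diverge: likely hallucination

print(consistent_answer("What year was the MMLU benchmark introduced?"))
```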
My take on the “step by step” prompt and why it works:
An LLM is just predicting word by word, and if you present it with a task that requires a chain of reasoning to get the final result (e.g. A -> B -> C -> Result), it is way harder to get all the intermediate steps right implicitly. If you ask it to write complete answers, it will go over B first, and then, from this sequence of tokens, realize that C makes sense, and so on.
Anyway, you are one of the YouTubers whose videos I instantly click on! And I really like your scientific take and presentation in this video, including your SmartGPT, which is a great idea btw!
Thanks so much Markus, great theory
I’m behind this theory. When I as a human solve a step-by-step problem, I can also guess the answer and then do the work of finding the real answer. It is the function of my memory that allows me to do a problem in my head and then be able to share the answer and follow it with the explanation.
It was said that the legendary mathematician Ramanujan could solve difficult problems by manifesting the answer directly in his mind; but still, it didn’t always work, and chain-of-thought reasoning vastly improved his effectiveness.
@@TomFranklinX There’s no such thing as “manifestation” just like that; his brain already does the step-by-step thoughts even without his conscious awareness.
@@ticketforlife2103 The teacher is still gonna ask you to show how you got to the conclusion even if the answer is right. This must be making GPT go through all the steps without skipping. Skipping is faster, but it can be wrong.
You appear to have several offers, but I’m happy to let you have a key. The thought of you doing this process manually shows your dedication 😊. Such an interesting topic. Must admit that as I get older, I’m doing a similar process with my brain 😂
I know I’m just a random dude from the middle of nowhere, but can I please have a key to do some testing of my own? I have an idea that should be able to increase performance quite a bit. It involves asking GPT-4 to give itself a role (e.g., a mathematician) and then re-sending the prompt to GPT-4 with its new role prepended to the start of the prompt. Thanks
@@daveyseeman An account created today huh, are you an AI or a researcher?
@@Vysair i created this account to make the request
@@daveyseeman I’ve only just seen this request! A bit reluctant to release a key that will rack up costs to my account. I don’t think you will need to wait much longer, OpenAI seem to be opening the API to people now.
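For what it’s worth, the role-assignment idea from the comment above takes only two API calls to try (pre-1.0 openai SDK; prompt wording and the example question are illustrative):

```python
import openai  # pre-1.0 SDK style assumed

openai.api_key = "YOUR_API_KEY"

def ask(prompt):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

question = "Prove that the sum of two even numbers is always even."

# Call 1: let GPT-4 pick a role for itself (e.g., "a mathematician").
role = ask(
    "In one short phrase, name the expert best suited to answer the following "
    f"question:\n{question}"
)

# Call 2: re-send the prompt with the chosen role prepended.
print(ask(f"You are {role}.\n\n{question}"))
```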
If you still need a GPT 4 API key, I’d be more than happy to share it with you to help with your research. I’m a product designer and venture architect currently working to help my organization understand how we may leverage this technology for the next 3-5 years. I also have some more socially focused ideas and goals that originally drove me to do a deep dive into this technology. Love the work you do!
I appreciate the socially focused part of your comment. As a more abstract and visionary type thinker. I’m excited to attempt to push forward on socially minded blueprints I’ve had for a while.
Yet it’s still a deep dive to make these ideas and plans a true reality.
Recently got access to the GPT-4 API; let me know if you’d like to collab on some tests! (Can’t believe they haven’t given you access yet)
The one and only. Great to see other brilliant creators on here.
He literally just repeated what was in the paper. This isn’t original research.
@@test-ml9wr He combined the methods of a few papers, actually.
@@timseguine2 no he didn’t, actually.
@@test-ml9wr Yes he did. He lists them clearly on his flow chart (He even specifically names the titles of all of the papers he is using). He is using:
Zhou et al. 2023, Shinn et al. 2023, and Nair et al. 2023,
all simultaneously, to give a better answer. And his original implementation used Kojima et al. 2022 instead of Zhou.
This is clearly an amalgamation of techniques.
I’d say this channel is quickly becoming my favorite AI channel, but honestly, at this point, I think it’s simply moving into its own league.
This one addition to the original prompts of this lady gave me correct and concise answers to both questions: “Let’s work this out in a step by step way to find all possible answers and to finally pick the most possible one.”
A friend of mine, who is more of a casual observer of all this AI stuff, mentioned that he really wished ChatGPT would ask questions more often rather than just give answers. A lot of the time when we ask it a question or give it a prompt, there is an implicit assumption that I know what I’m asking or communicated it well, but if ChatGPT asked more questions to clarify user intent, that could enhance the user experience much more. I thought it was a great idea and wanted to share.
I agree. It doesn’t seem to know when it’s unclear on some detail of what you’re asking. Imagine how good they’d be at common-sense reasoning if they could ask “Do you have enough room outside to dry 30 shirts at once in the same manner you dried 5 shirts?” Guessing the next word won’t produce that sort of conversation; it gets filtered out before it arrives in the vast corpora that LLMs ingest. Also, I don’t think the harried human evaluators giving Reinforcement Learning from Human Feedback scores have the time or means to encourage that sort of multi-step back and forth.
Does adding “Ask questions more often to clarify my intent rather than just give me answers” to prompts help?
@@skierpage I’ve been using ‘ask for any needed clarification before answering the prompt’ which has been good so far with gpt4
@@dcgamer1027 Yes indeed, that helps a lot. Also you can tell it to be a “prompt engineer” and iteratively ask relevant questions about what you want before giving you the final prompt. This video contains lots of comments with similar ideas: youtu.be/OgYQAS9LY3o
Did you try suggesting it in your prompts to ask questions if needed?
They recently introduced persistent system prompts, so it should get easier, i.e. you don’t have to tell it “ask for clarification” every single time.
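A minimal sketch of that persistent instruction as a system message (pre-1.0 openai SDK; the drying question reuses the example from earlier in the thread):

```python
import openai  # pre-1.0 SDK style assumed

openai.api_key = "YOUR_API_KEY"

messages = [
    # The persistent system prompt: set once, applies to the whole conversation.
    {"role": "system",
     "content": "Ask for any needed clarification before answering the prompt."},
    {"role": "user",
     "content": "How long do 30 t-shirts take to dry in the sun?"},
]
response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
# The hope: a clarifying question (e.g., whether all 30 shirts fit outside
# at once) rather than an immediate "30 hours".
print(response.choices[0].message.content)
```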
I took your step-by-step approach and the idea of AutoGPT to generate an example game with GPT, which it was previously incapable of doing, and it not only works wonders, it could be the way to actually get it to write any kind of software that you can think of. The core idea is this prompt:
“You end every answer with a suggestion for a prompt that I should give you as follow-up that will either fix issues, optimize the code, make the game better, or develop it further. Always write sophisticated, high quality code, as a professional would do.”
This causes GPT to answer the question and then deliver a continuation prompt. Now if someone had access to the API, you could extract that prompt, have GPT spit out something like Git patch files, and with that assemble the entire piece of software step by step. I did that with about 20 prompts on that example game, and it became quite playable, with a lot of well-working content.
GPT eventually loses track of the code it wrote so far when the internal memory is full, and then it starts to make a lot of mistakes, but if that were somehow fixed, you’d have a complete code-generator template for any kind of problem.
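A hedged sketch of what that loop could look like over the API (pre-1.0 openai SDK). The “NEXT:” marker is my own addition so the follow-up prompt can be extracted mechanically; it is not part of the original prompt above, and the context window will still eventually overflow, as the comment notes:

```python
import openai  # pre-1.0 SDK style assumed

openai.api_key = "YOUR_API_KEY"

SYSTEM = (
    "You end every answer with a suggestion for a prompt that I should give you "
    "as follow-up that will either fix issues, optimize the code, make the game "
    "better, or develop it further. Always write sophisticated, high quality "
    "code, as a professional would do. Prefix the suggestion with 'NEXT:'."
)

def chat(messages):
    response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content

messages = [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": "Write a simple Pong game in Python."}]

for step in range(20):  # roughly the 20 prompts mentioned above
    answer = chat(messages)
    print(f"--- step {step} ---\n{answer}")
    # Extract the suggested follow-up and feed it straight back in.
    follow_up = answer.rsplit("NEXT:", 1)[-1].strip()
    messages += [{"role": "assistant", "content": answer},
                 {"role": "user", "content": follow_up}]
# Caveat: once the conversation exceeds the context window, GPT loses track
# of earlier code, exactly as described in the comment above.
```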