From the course: Generative AI: Working with Large Language Models

PaLM

- [Instructor] In April 2022, Google released PaLM, or to give it its full name, the Pathways Language Model. Now, there are a couple of key takeaways from this model. Comparing the number of parameters, we can see that PaLM is the largest of the dense-parameter models at 540 billion parameters. It dwarfs GPT-3's 175 billion parameters and Gopher's 280 billion, and just edges out Megatron-Turing NLG at 530 billion parameters. Now, Google used the Pathways system, a new AI architecture that it revealed at the end of 2021. Using this framework allows many more chips to be used for model training, with PaLM being trained on 6,144 hardware accelerators versus the smaller numbers of chips used for previous large language models. And finally, if we look at model FLOPs utilization, you can see that it has increased going from GPT-3 to PaLM; PaLM has effectively doubled it. The higher the number, the more efficiently a model can be trained, and these gains are possible because of improvements over the years across model and compiler technology. Now, PaLM was trained on an enormous 780 billion tokens using a multilingual corpus with text from over 100 languages, although about 78% of this training data was in English. Around 50% of the training data is multilingual social media conversations, just over a quarter is filtered webpages, and then we have the usual contents we've seen so far: books, GitHub, Wikipedia, and the news. Now, another really interesting phenomenon that the Google team picked up on was scaling: it looked like the models could only perform certain tasks once a certain scale was reached. Here, 8-billion-parameter models could perform certain tasks such as question answering, language understanding, and arithmetic. It was only when the model was scaled up to 62 billion parameters that more tasks such as translation, summarization, and common-sense reasoning became possible. 
But it then required a much bigger jump to 540 billion parameters for the model to be able to perform tasks such as general knowledge, reading comprehension, and joke explanation, amongst others. Yes, I did say joke explanation. Let me give you an example. As with any few-shot learning setup, you can give the model a couple of solved examples as part of your prompt. So we provide the first example of explaining a joke. The prompt is: I will explain these jokes. The problem with kleptomaniacs is that they always take things literally. And we then provide a sample explanation. So the explanation is: this joke is wordplay. Someone who takes things literally is someone who doesn't fully understand social cues and context, which is a negative trait. But the definition of kleptomania is someone who literally takes things. We can then provide a second example of a joke: always borrow money from a pessimist; they'll never expect it back. And finally, we provide an explanation of this joke. So the explanation goes: most people expect you to pay them back when you borrow money. However, a pessimist is someone who always assumes the worst, so if you borrow money from them, they will expect that you won't pay them back anyway. And now we provide our joke as the input: I was going to fly to visit my family on April 6th. My mom said, "Oh great, your stepdad's poetry reading is that night!" So now I'm flying in on April 7th. And remarkably, the model returns this as output: the joke is that the speaker's mother is trying to get them to go to their stepdad's poetry reading, but the speaker doesn't want to go, so they're changing their flight to the day after the poetry reading. Now let's see what results we get from GPT-3 with joke explanation. Just so you know, I have absolutely no idea what the GPT-3 model will output. Let me provide the two jokes with the explanations as examples, and then let's go ahead and see what explanation GPT-3 comes up with. 
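The few-shot prompt described above is really just string assembly: an instruction, the solved examples, and then the new input. Here's a minimal sketch in Python that builds the two-example joke prompt from this lesson; the helper name `build_joke_prompt` is illustrative, and the actual call to a model API is deliberately left out:

```python
# Sketch of the two-shot "explain the joke" prompt from the lesson.
# The jokes and explanations come from the transcript; sending the
# prompt to a model is omitted, since that depends on your API.

EXAMPLES = [
    ("The problem with kleptomaniacs is that they always take things literally.",
     "This joke is wordplay. Someone who takes things literally is someone "
     "who doesn't fully understand social cues and context, which is a "
     "negative trait. But the definition of kleptomania is someone who "
     "literally takes things."),
    ("Always borrow money from a pessimist. They'll never expect it back.",
     "Most people expect you to pay them back when you borrow money. "
     "However, a pessimist is someone who always assumes the worst, so if "
     "you borrow money from them, they will expect that you won't pay them "
     "back anyway."),
]

def build_joke_prompt(new_joke: str) -> str:
    """Assemble a few-shot prompt: instruction, solved examples, new input."""
    parts = ["I will explain these jokes."]
    for joke, explanation in EXAMPLES:
        parts.append(f"Joke: {joke}\nExplanation: {explanation}")
    # The final example is left open for the model to complete.
    parts.append(f"Joke: {new_joke}\nExplanation:")
    return "\n\n".join(parts)

prompt = build_joke_prompt(
    "I was going to fly to visit my family on April 6th. My mom said, "
    '"Oh great, your stepdad\'s poetry reading is that night!" '
    "So now I'm flying in on April 7th."
)
print(prompt)
```

The point of the pattern is that nothing is fine-tuned; the two solved examples in the prompt are all the model sees before completing the final, open-ended "Explanation:".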
So the response back from GPT-3 is: this joke is a play on words. The person is saying that they were originally going to fly in on April 6th, but their stepdad's poetry reading is that night, so they decided to fly in on April 7th instead. This is funny because it's a play on the words fly and poetry, which are both pronounced the same way. So clearly this isn't correct, and GPT-3 didn't get this right. Let's head back to the findings from the PaLM model researchers. They also found that sometimes standard prompting didn't work. So if you give the model the example: Roger has five tennis balls. He buys two more cans of tennis balls. Each can has three tennis balls. How many tennis balls does he have now? And you give the model the answer: the answer is 11. And if you then ask the question: the cafeteria had 23 apples; if they used 20 to make lunch and bought six more, how many apples do they have? It would sometimes get the wrong answer, and you can see here the incorrect answer provided: the model returned the answer 50. Now instead, if you provided how you came up with the answer as part of your prompt, for example: Roger started with five balls. Two cans of three tennis balls each is six tennis balls. Five plus six equals 11. Then the output from the model would mimic your chain-of-thought reasoning and come up with the correct answer. So for the question about the cafeteria, the model would come up with: the cafeteria had 23 apples originally. They used 20 to make lunch, so they had 23 minus 20, which is three. They bought six more apples, so they have a total of three plus six, which is nine. The answer is nine. So let's wrap up the section by adding PaLM to our list of models. You can see it's the largest dense language model by parameter count to date, and it has the best overall performance.
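The chain-of-thought prompt can be sketched the same way: the only change from standard prompting is that the exemplar's answer spells out the intermediate steps instead of just the final number. The variable names below are illustrative, the question text follows the lesson, and the model call is again omitted; the sketch also sanity-checks the arithmetic the reasoning chain should reproduce:

```python
# Chain-of-thought prompting: the exemplar answer shows its working,
# so the model's completion tends to show its working too.

cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)

question = (
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"
)

prompt = cot_exemplar + "\n\n" + question

# Verify the arithmetic each reasoning chain walks through:
roger = 5 + 2 * 3        # 11 tennis balls
cafeteria = 23 - 20 + 6  # 9 apples
print(roger, cafeteria)  # 11 9
```

Compare this with the standard prompt, where the exemplar answer is just "The answer is 11": same question, but without the worked steps the model sometimes jumps straight to a wrong number like 50.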
