From the course: Generative AI: Working with Large Language Models

Chinchilla

- [Instructor] Up to this point, we've seen that the trend has been to increase the model size. Interestingly, the number of training tokens used for most of these models has been around 300 billion. Now, the DeepMind team's hypothesis was that Gopher was too large: given the same compute budget, a smaller model trained on more data will perform better. They tested this hypothesis by training over 400 language models, ranging from 70 million to over 16 billion parameters, on datasets of 5 to 500 billion tokens. They then trained Chinchilla, a 70 billion parameter model, on 1.4 trillion training tokens, and Chinchilla outperforms Gopher, which has 280 billion parameters, GPT-3, with its 175 billion parameters, and Megatron-Turing NLG, with its 530 billion parameters, on a large range of downstream evaluation tasks. And because Chinchilla is a smaller model, less compute is required for fine-tuning and inference.

Now, let's think back to what we learned about scaling laws. One of the key insights from the earlier OpenAI scaling-laws paper is that if you're training a large language model and you get a tenfold increase in computational budget, the majority of that increase should go towards the size of the model: the model size should increase by five and a half times and the number of training tokens by 1.8 times. The conclusion that the DeepMind team came to was very different. For a tenfold increase in computational budget, the model size and the number of training tokens should be scaled in equal proportions.

Let's take a look at how they demonstrated this. On the X-axis we have FLOPs, or floating point operations, which are a measure of computation, and on the Y-axis we have the number of parameters for the models. On the graph, you can see the GPT-3 model represented by the red star, the Gopher model by the yellow star, the Megatron-Turing model by the purple star, and finally the Chinchilla model by the green star. Now, we know that Chinchilla performs better than any of these models, but what's interesting is that it was trained with roughly the same amount of compute as these much larger models. The point is that you don't need these larger models, because they've all been undertrained. We can see this in the table: Chinchilla has 70 billion parameters, which is far fewer than any of the other models. Compare this to GPT-3's 175 billion parameters, Gopher's 280 billion parameters, and so on. What is different is that while most of the large language models have been trained on around 300 billion tokens, Chinchilla has been trained on 1.4 trillion tokens, which is almost five times as many tokens as has been the norm.

The DeepMind team set out to answer this question: given a fixed FLOPs budget, how should one trade off model size and the number of training tokens? Now, if you're given the number of FLOPs, then you can figure out the number of parameters and the number of training tokens. Here the smaller models are shown in purple and the largest models in yellow. For each FLOPs count, we can figure out which model has the lowest training loss, take all of those lowest training losses, and plot them against FLOPs. This way, given a certain number of FLOPs, we can predict how large the model needs to be.
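To make the difference between the two prescriptions concrete, here is a minimal Python sketch. It assumes the common approximation that training compute scales with parameters times tokens (often written as C ≈ 6 · N · D), which is not stated in the video; under that assumption, scaling both quantities "in equal proportions" for a tenfold compute increase means multiplying each by the square root of 10, roughly 3.16. The starting model size and token count below are purely hypothetical.

```python
import math

def scale_allocation(n_params, n_tokens, compute_multiplier=10.0, rule="deepmind"):
    """Scale a (parameters, tokens) pair for a larger compute budget.

    Assumes training compute is roughly proportional to parameters x tokens
    (the common C ~ 6 * N * D approximation, not stated in the video).
    """
    if rule == "openai":
        # Split quoted in the video for the earlier OpenAI scaling laws:
        # ~5.5x model size and ~1.8x training tokens for 10x compute.
        return n_params * 5.5, n_tokens * 1.8
    # DeepMind / Chinchilla conclusion: scale both in equal proportions,
    # so each grows by sqrt(compute_multiplier).
    factor = math.sqrt(compute_multiplier)
    return n_params * factor, n_tokens * factor

# Hypothetical starting point: a 1B-parameter model trained on 20B tokens.
base_params, base_tokens = 1e9, 20e9
for rule in ("openai", "deepmind"):
    p, t = scale_allocation(base_params, base_tokens, rule=rule)
    print(f"{rule:>8}: {p / 1e9:.1f}B params, {t / 1e9:.0f}B tokens")
```

Note that 5.5 × 1.8 ≈ 10, so both rules spend roughly the same extra compute; they just split it very differently between model size and data.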
With the example of Chinchilla, we know that the number of FLOPs was over 10 to the power of 23, so the optimal number of parameters is going to be around 70 billion. Similarly, we can take the number of training tokens and plot that against the number of FLOPs and determine, for a specific number of FLOPs, how many training tokens we need. From the graph earlier, we know that the number of FLOPs for Chinchilla was a little less than 10 to the power of 24. This means that if we know the compute budget, we can figure out the number of parameters for an optimal model and the number of training tokens. So let's confirm this for Chinchilla. If we draw a line for this number of FLOPs, we can determine that any model using this many FLOPs should have approximately 67 billion parameters, and similarly, we need around 1.5 trillion training tokens. And if we look here, that isn't surprising, as Chinchilla has 70 billion parameters and was trained on 1.4 trillion tokens.

The DeepMind team took things a step further: for a given FLOPs budget, what is the optimal parameter count? This time we have nine different training budgets, which correspond to the nine different FLOP counts and the curves that you can see in this diagram. We can plot these lowest loss values and the number of parameters in a model against FLOPs. Similarly, we can plot the number of training FLOPs against the number of training tokens. And again, a compute budget for Gopher of a little less than 10 to the power of 24 gives us an optimal parameter size of 63 billion, and we need 1.4 trillion tokens to train on.

This then begs the question: are the massive language models we are seeing today oversized? Let's say we use the Gopher model as our baseline, with its compute budget of 5.76 times 10 to the power of 23 FLOPs. The optimal model parameter size should be 67 billion, and the number of training tokens 1.5 trillion. Now, we know that the number of parameters for Gopher was 280 billion. To train a model of that size optimally, the compute budget would have needed to be about 17 times larger, and it would have required 5.9 trillion training tokens. That doesn't mean that you can't train these larger models; it's just that these models have not been trained on enough data to be compute optimal. Remember that the Chinchilla model, which has 70 billion parameters and was trained on 1.4 trillion tokens, significantly outperformed GPT-3, which has 175 billion parameters, Gopher, which has 280 billion parameters, and Megatron-Turing NLG, which has 530 billion parameters.
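As a rough sanity check on these numbers, here is a small Python sketch. It again relies on the C ≈ 6 · N · D approximation, plus the frequently cited Chinchilla-style rule of thumb of roughly 20 training tokens per parameter; both are assumptions rather than anything stated in the video, so treat the output as an approximation of the 67 billion parameters and 1.5 trillion tokens quoted above.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # assumption: training compute C ~ 6 * N * D
TOKENS_PER_PARAM = 20       # rough Chinchilla-style rule of thumb (assumption)

def compute_optimal(flops_budget):
    """Estimate compute-optimal parameters and tokens for a FLOPs budget.

    With C = 6 * N * D and D = 20 * N, we get C = 120 * N**2,
    so N = sqrt(C / 120) and D = 20 * N.
    """
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# The Gopher/Chinchilla compute budget quoted in the video: 5.76e23 FLOPs.
n, d = compute_optimal(5.76e23)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")
# Prints roughly 69B parameters and 1.4T tokens, close to the 67 billion
# parameters and 1.5 trillion tokens discussed above.
```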
So where does this leave us with the scaling laws that we looked at earlier? The DeepMind team designed an interesting experiment to compare their findings with those scaling laws. Given a compute budget of 10 to the power of 21 FLOPs, determine the number of parameters required and how much data is required to train on, using the scaling laws prescribed by OpenAI and the ones determined by DeepMind; whichever approach produces the more performant model is better. With the scaling laws from the OpenAI team, a 10 to the power of 21 FLOPs budget recommends a 4.68 billion parameter model, while DeepMind's approach recommends a 2.8 billion parameter model. The Y-axis is the training loss, so the lower the better, and the X-axis represents the number of training tokens. You can see that if we stopped at the number of training tokens recommended by OpenAI, it would appear that the OpenAI-sized model has a lower training loss and is the better model. However, because DeepMind's 2.8 billion parameter model needs to be trained on more data, it ends up with a lower overall training loss after seeing that additional data. Similarly, if we plot the training loss against the number of training FLOPs for the two models, the 4.68 billion parameter model from OpenAI's recipe and the 2.8 billion parameter model from DeepMind's, you can see that we get a lower loss with the DeepMind model. They concluded that you can end up with a more performant model by using a smaller model with more training data.

So let's wrap up this section by adding Chinchilla to our list of models. Our biggest takeaway is that the current large language models are significantly undertrained, and from the table, Chinchilla is trained on more than four times as much data as any other large language model. It's the smallest, but it also has the best performance results.
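To illustrate the comparison at 10 to the power of 21 FLOPs, here is a sketch using the parametric training-loss fit reported in the Chinchilla paper, approximately L(N, D) = 1.69 + 406.4 / N^0.34 + 410.7 / D^0.28. Those constants come from the paper rather than the video and are quoted from memory, so treat them as illustrative; the token counts are derived from the same C ≈ 6 · N · D assumption used above.

```python
def chinchilla_loss(n_params, n_tokens):
    """Parametric training-loss fit from the Chinchilla paper
    (approximate constants; illustrative only, not exact)."""
    return 1.69 + 406.4 / n_params**0.34 + 410.7 / n_tokens**0.28

BUDGET = 1e21  # FLOPs budget used in the video's comparison
for label, n_params in [("OpenAI-sized (4.68B)", 4.68e9),
                        ("DeepMind-sized (2.8B)", 2.8e9)]:
    n_tokens = BUDGET / (6 * n_params)   # assumption: C ~ 6 * N * D
    loss = chinchilla_loss(n_params, n_tokens)
    print(f"{label}: {n_tokens / 1e9:.0f}B tokens, loss ~ {loss:.2f}")
# For the same compute, the smaller model sees more tokens and ends up
# with the lower (better) loss, matching the comparison described above.
```

The point of the sketch is the same as the plot in the video: with compute held fixed, the smaller model simply gets to train on more tokens, and that extra data is what pushes its final loss below the larger model's.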
