From the course: Generative AI: Working with Large Language Models

Scaling laws

- [Instructor] Up to this point, we've looked at a couple of models, but now is a good time to try to understand why we have models with so many parameters. Around the time of the release of GPT-3, the OpenAI team published some results on what they called the scaling laws for large models. They suggested that the performance of large models is a function of the number of model parameters, the size of the dataset, and the total amount of compute available for training. They performed several experiments on language models, so let's take a look at some of the results.

On the y-axis is the test loss. The test loss converges for each of the models, and the lower the test loss, the better performing the model. Across the x-axis is the number of parameters in the model. You can increase the size of these models by making them wider or by increasing the number of layers. So as we go across, we move from models with a hundred thousand parameters, to 10 million, to 1 billion parameters. We can see here that the larger the model, the better you would expect it to perform.

The graph in the middle plots the test loss against the dataset size. The OpenAI researchers used a large model, so that model size would not limit performance, and they also used a training technique called early stopping, which halts training once the test loss stops improving. From the graph in the middle, you can see that there is again a striking power-law trend, much like the one we saw when plotting test loss against the number of parameters: the larger the dataset, the lower the test loss.

And finally, if we look at compute, the x-axis is the number of petaflop/s-days of compute, where a petaflop/s-day is around 10 to the power of 20 operations. In the diagram, each of the blue lines corresponds to the learning curve of a model of a different size. The curves shift to the right because bigger models require more computation: in a large dense model, every parameter is involved in some way in processing each input token, so more compute is needed. But you can see that the test loss keeps dropping, meaning that the bigger models perform better. The orange line is the optimal frontier, the best performance you can get for a given amount of compute. So what this graph is telling you is to match the model to your compute budget: with a larger compute budget, use a larger language model, and with a smaller compute budget, use a smaller one.

If we take all three graphs together, they are saying that language modeling performance improves as we increase the model size, the dataset size, and the amount of compute used for training, and that for best performance, all three factors must be scaled up together. The OpenAI team then goes on to ask where you should allocate additional compute as it becomes available: towards training a larger model, using larger batch sizes, or training for more steps. The conclusion they came to was that most of the increase should go towards increasing the model size. There is some benefit to using more data and larger batch sizes, but training for more steps contributes very little. One of the reasons why model sizes have only gotten bigger since GPT-3 is that these scaling laws suggest that increasing the model size gives you the biggest benefit.
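To make the power-law trends concrete, here is a minimal Python sketch of the loss curves described above. The functional form L(x) = (x_c / x)^alpha and the constants used are approximate values from the Kaplan et al. scaling-laws paper; they are included purely for illustration, not as exact reproductions of the plots in the video.

```python
# A minimal sketch of the scaling-law relationships described above.
# Constants are approximate values reported by Kaplan et al. (2020),
# used here only for illustration.

def loss_vs_params(n_params):
    """Test loss as a function of (non-embedding) parameter count N."""
    N_c, alpha_N = 8.8e13, 0.076
    return (N_c / n_params) ** alpha_N

def loss_vs_data(n_tokens):
    """Test loss as a function of dataset size D, in tokens."""
    D_c, alpha_D = 5.4e13, 0.095
    return (D_c / n_tokens) ** alpha_D

def loss_vs_compute(pf_days):
    """Test loss as a function of compute C, in petaflop/s-days."""
    C_c, alpha_C = 3.1e8, 0.050
    return (C_c / pf_days) ** alpha_C

# The power-law trend: each 100x increase in model size shaves a roughly
# constant amount off the predicted test loss.
for n in [1e5, 1e7, 1e9]:
    print(f"N = {n:.0e} params -> predicted test loss ~ {loss_vs_params(n):.2f}")
```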
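And here is a rough sketch of the "where should extra compute go?" conclusion. Kaplan et al. report that, as the compute budget grows, the optimal allocation scales roughly as model size ~ C^0.73, batch size ~ C^0.24, and serial training steps ~ C^0.03; these exponents are approximate and shown only to illustrate why most of the increase goes to model size.

```python
# Approximate allocation exponents from Kaplan et al., for illustration only.
def allocation_growth(compute_multiplier):
    """How much each factor should grow when compute grows by `compute_multiplier`."""
    exp_model, exp_batch, exp_steps = 0.73, 0.24, 0.03
    return {
        "model size": compute_multiplier ** exp_model,
        "batch size": compute_multiplier ** exp_batch,
        "training steps": compute_multiplier ** exp_steps,
    }

# With 10x more compute, most of the increase goes to model size,
# some to batch size, and almost none to extra training steps.
for factor, growth in allocation_growth(10).items():
    print(f"{factor}: grow by ~{growth:.1f}x")
```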
