I'm wondering what the sweet spot for parameters will be. Right now it feels lik...

lhl · on April 19, 2023

I think "sweet spot" is going to depend on your task, but here's a good recent paper that may give you some more context on thinking about training and model sizes: https://www.harmdevries.com/post/model-size-vs-compute-overh...

There have also been quite a few developments on sparsity lately. For example, here's a technique called SparseGPT, which suggests that you can prune 50% of parameters with almost no loss in performance: https://arxiv.org/abs/2301.00774
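
To give a rough feel for what "prune 50% of parameters" means mechanically, here's a minimal sketch of plain unstructured magnitude pruning. To be clear, this is not the SparseGPT algorithm (which, as I understand it, does a more careful layer-wise reconstruction instead of simple thresholding); the function name `magnitude_prune` and the toy weights are made up for illustration.

```python
def magnitude_prune(weights, sparsity=0.5):
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero.

    Illustrative magnitude pruning only; NOT the SparseGPT method.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Hypothetical toy "layer" with six weights.
weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.4]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

Half of the entries end up zeroed. Whether the model still performs well afterwards is the hard part, and that's what the paper is actually about.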

version_five · on April 19, 2023

I was wondering if the benefit from longer training is a similar phenomenon to the double descent we see in other deep learning models. Training for a really long time can improve generalization (as can adding more parameters), but I don't know enough about LLM architecture to know if that's relevant here. My skim of the blog post led me to think it's proposing a different mechanism (scaling laws).

Taek · on April 20, 2023

Well, based on all the data we have available now, it seems like you don't get much benefit yet from going above 200 billion parameters.