Quote:
Originally Posted by verticaldoug
Using Meta's claim for their latest model Llama 3.1 405B, they say they trained it on a cluster of 16,000 H100s, which took 30.94 million GPU hours.
I think this equals about 20 GWh over 80 days.
I think this is roughly the annual electricity consumption of 5,300 homes.
That's the training phase, where all the model weights (these are the "billions of parameters") are updated over many passes through the data. Once trained, these models can be deployed in production.
Hugging Face, for example, hosts hundreds of pretrained open-source models. What is often done is to take a pretrained model, freeze most of the weights, and train only, say, the top layers on data specific to your use case; a rough sketch is below. That is much quicker. Or simply predict using the entire model with no training at all. The point being that the training is more of a one-off deal. Not that prediction isn't computationally intense: you still need a GPU to make it practical.
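
A minimal sketch of the "freeze most weights, train only the top layers" idea, assuming the Hugging Face transformers library and a DistilBERT checkpoint (both are illustrative choices, not anything specific from the post above):

```python
# Sketch: load a pretrained model, freeze its body, and leave only the
# small task head trainable. Assumes `pip install transformers torch`.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"  # hypothetical choice; any checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Freeze the pretrained transformer body; only the classification head
# on top keeps requires_grad=True and gets updated during fine-tuning.
for param in model.base_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Training {trainable:,} of {total:,} parameters")

# From here you would run a normal training loop (or transformers.Trainer)
# on your task-specific data; only the head's weights change, which is
# vastly cheaper than the original pretraining run.
```

For pure prediction with no training at all, you would skip the freezing step entirely and just call the model (or a `pipeline`) on new inputs; that still benefits from a GPU, but there are no weight updates.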