  #46  
Yesterday, 03:59 AM
marciero
Senior Member
 
Join Date: Jun 2014
Location: Portland Maine
Posts: 3,320
Quote:
Originally Posted by verticaldoug
Using Meta's figures for their latest model, Llama 3.1 405B: they say they trained it on a cluster of 16,000 H100s, which took 30.94 million GPU hours.

I think this equals roughly 20 GWh over about 80 days.

I think this is about the annual consumption of 5,300 homes.
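As a rough sanity check on those figures (the ~700 W per GPU below is my own assumption, in the ballpark of an H100's board power, not a number from Meta), the quoted energy and duration are about the right magnitude:

[CODE]
# Back-of-the-envelope check of the quoted Llama 3.1 training figures.
# The 700 W per GPU is an assumed board-power draw, excluding cooling and
# other datacenter overhead, so treat the result as a rough lower bound.
gpu_hours = 30.94e6      # quoted total GPU hours
num_gpus = 16_000        # quoted cluster size
watts_per_gpu = 700      # assumption: roughly an H100's power draw

energy_gwh = gpu_hours * watts_per_gpu / 1e9   # ~21.7 GWh
wall_clock_days = gpu_hours / num_gpus / 24    # ~80.6 days
print(f"{energy_gwh:.1f} GWh over {wall_clock_days:.0f} days")
[/CODE]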
That's the training, where all the model weights (these are the "billions of parameters") are updated over many passes through the data. Once trained, these models can be deployed in production. Hugging Face, for example, hosts hundreds of pretrained open-source models.

What is often done is to take a pretrained model, freeze most of the weights, and train only, say, the top layers on data specific to your use case. That is much quicker. Or simply predict with the entire model as-is, with no training at all. The point being that the training is more of a one-off deal. Not that prediction isn't computationally intense; you still need GPUs to make it practical.
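For anyone curious what "freeze most of the weights and train only the top layers" looks like in practice, here is a minimal sketch using Hugging Face Transformers with PyTorch. The checkpoint name, label count, and toy sentences are placeholder assumptions for illustration, not anything specific to Meta's setup.

[CODE]
# Minimal fine-tuning sketch: load a pretrained model, freeze the base,
# and train only the newly added classification head.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased"   # assumed pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Freeze every parameter in the pretrained base; only the classification
# head (randomly initialized on top) stays trainable.
for param in model.base_model.parameters():
    param.requires_grad = False

# The optimizer only sees the parameters that are still trainable.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=5e-4
)

# One illustrative training step on made-up labeled sentences.
model.train()
batch = tokenizer(["great ride today", "flat tire again"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
[/CODE]

Prediction with no training at all is even simpler; for example, the transformers pipeline() helper will download a pretrained checkpoint and run inference in a couple of lines. The GPU is still what makes it fast, but there is no weight updating involved.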

Last edited by marciero; Yesterday at 04:07 AM.