Green AI
Green AI — How the financial and environmental costs of NLP models could change the future of the NLP community.
Introduction
Recent NLP models have achieved notable improvements in accuracy. However, these improvements are not “free”: they require heavy computational resources, are costly, and consume a significant amount of energy. The paper “Energy and Policy Considerations for Deep Learning in NLP” by Emma Strubell et al. [1] attempts to quantify the financial and environmental costs of such models. (I previously mentioned this paper in my SpacyIRL post.)
To measure energy consumption, Strubell et al. [1] trained state-of-the-art NLP models using the default settings provided and sampled GPU and CPU power draw during training. They trained each model for one day and multiplied the measured daily consumption by the total training time reported in each model’s paper. Finally, the total consumption was multiplied by the Power Usage Effectiveness (PUE) coefficient, which accounts for the additional energy the infrastructure needs beyond the hardware itself.
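The accounting above can be sketched in a few lines. This is only an illustration of the arithmetic (sampled average power, total training time, PUE multiplier); the function name and the example wattages are mine, not figures from the paper, apart from the paper’s PUE coefficient of 1.58:

```python
def total_energy_kwh(avg_cpu_watts, avg_gpu_watts, avg_dram_watts,
                     num_gpus, hours, pue=1.58):
    """Estimate total training energy in kilowatt-hours.

    avg_*_watts: average power draw sampled during training
    num_gpus:    number of GPUs used
    hours:       total training time reported for the model
    pue:         Power Usage Effectiveness coefficient (1.58 in [1])
    """
    # combined average draw across CPU, DRAM, and all GPUs
    watts = avg_cpu_watts + avg_dram_watts + num_gpus * avg_gpu_watts
    # watt-hours -> kilowatt-hours, scaled by data-center overhead
    return pue * watts * hours / 1000.0

# illustrative (made-up) numbers: 8 GPUs at 250 W each, for 24 hours
print(total_energy_kwh(100, 250, 30, num_gpus=8, hours=24))
```

Multiply the result by a grid emissions factor and you get the CO₂ estimates the paper reports.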
To give you a feel for the numbers, the paper lists the following energy-consumption comparisons:
So yes, training a BERT model is roughly equivalent to a flight from NY to SF.
Aside from the disturbing numbers, I think as an NLP community this raises interesting questions.
For NLP researchers, this has an impact on the direction of research. Each year we increase the number of parameters, build more complex architectures, and require more and more data to train. I believe the future of NLP research will focus on training with much smaller datasets and on simpler solutions that obtain similar results.
An interesting paper that moves in this direction is “Distilling the Knowledge in a Neural Network” by Hinton et al. [2]. The idea is simple: take the full output of a large network such as BERT and train a small network to predict that full output vector. The small network learns to mimic the large one. The results are surprisingly good, coming pretty close to those of the large networks.
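The core of the distillation idea is the training objective: instead of hard labels, the student matches the teacher’s temperature-softened output distribution. Here is a minimal, dependency-free sketch of that loss; the function names and the temperature value are illustrative choices, not the paper’s exact setup:

```python
import math

def softened_softmax(logits, temperature=2.0):
    # temperature > 1 smooths the distribution, exposing the teacher's
    # "dark knowledge" about relative similarities between classes
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # cross-entropy between the softened teacher and student distributions;
    # minimized when the student reproduces the teacher's full output
    p = softened_softmax(teacher_logits, temperature)
    q = softened_softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]
print(distillation_loss(teacher, [3.1, 0.9, 0.1]))  # student close to teacher
print(distillation_loss(teacher, [0.2, 1.0, 3.0]))  # student far from teacher
```

In practice this soft-target term is usually combined with the ordinary cross-entropy on the true labels, but the sketch captures why the student can learn from the large network’s full output rather than from one-hot labels alone.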
For applied NLP scientists, this emphasizes that we should always keep these trade-offs in mind while designing solutions. If I am building a solution on top of an English-to-German machine translation model, and NAS achieves a new state-of-the-art BLEU score of 29.7 for that task, an improvement of just 0.1, is it worth it? The answer is obviously no.
To sum up, the current financial and environmental costs of NLP models are very high. I hope that, as a community, the focus on finding and using simple, elegant, small-data, and ecological solutions will become just as important as creating and using the “next BERT”.
Some food for thought!
Until next time,
Noa Lubin.
References
[1] Energy and Policy Considerations for Deep Learning in NLP, Emma Strubell, Ananya Ganesh, Andrew McCallum
[2] Distilling the Knowledge in a Neural Network, Geoffrey Hinton, Oriol Vinyals, Jeff Dean