Accuracy is NOT all that you need!

badri nath
Published in Ushur Engineering
4 min read · Jun 3, 2021

Research papers on deep learning published at premier AI conferences such as ACL, NeurIPS, and AAAI are exploding in number. While many papers do push the boundaries of the state of the art, a significant majority of accepted papers aim to improve performance on a task through ever-increasing model complexity. The model is usually a combination of prior methods, a tweak to an existing model, or a more sophisticated version that adds layers to a prior model. This class of "performance" papers most often reports a single performance metric: accuracy. Accuracy in classification, accuracy in sentiment analysis, or accuracy in a well-known NLP task. There is nothing inherently wrong with reporting accuracy improvements of proposed models, except that the improvement in accuracy is often the only result shown, and often it is an improvement in the second or third decimal place. Many years ago, and probably today as well, the systems research community would have rejected papers that report an infinitesimal improvement in a single performance metric for complex models involving a multitude of random variables.

When you read the performance section of these deep learning research papers, the accuracy of the proposed method is often shown in bold font, as though bold type could lend weight to a meager improvement in the second or third decimal place. For some reason, the results are always presented in a table, with the bold column showing the accuracy of the proposed method and the other columns showing the accuracy of prior approaches. No attempt is made to show confidence intervals on the accuracy. When the improvement is so slight, a confidence interval would let the reader know whether the performance is truly better than prior methods in a statistically significant way. Without confidence interval bands, and given the effect of the variance of so many random variables in the experiment, how does one know if this is truly an improvement?
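As a minimal sketch of what such a comparison could look like, one can run each model several times with different random seeds and report a confidence interval and a significance test rather than a single bold number. The accuracy values below are hypothetical placeholders, not results from any paper:

```python
import numpy as np
from scipy import stats

# Accuracies from repeated runs of each model with different random seeds
# (hypothetical numbers for illustration).
baseline = np.array([0.912, 0.908, 0.915, 0.910, 0.906])
proposed = np.array([0.914, 0.909, 0.917, 0.911, 0.913])

def mean_ci(samples, confidence=0.95):
    """Mean accuracy with a t-distribution confidence interval."""
    mean = samples.mean()
    sem = stats.sem(samples)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, len(samples) - 1)
    return mean, half_width

for name, acc in [("baseline", baseline), ("proposed", proposed)]:
    mean, hw = mean_ci(acc)
    print(f"{name}: {mean:.3f} ± {hw:.3f}")

# A two-sample t-test indicates whether the gap is statistically significant.
t_stat, p_value = stats.ttest_ind(proposed, baseline)
print(f"p-value: {p_value:.3f}")
```

If the intervals of the proposed and prior methods overlap substantially, or the p-value is large, the reported gain may well be noise rather than a genuine improvement.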

One wonders why the AI community continues to accept research papers whose results purport an improvement in accuracy but could just be an artifact of the randomness present in the experiment. In systems research, the performance section typically consists of several graphs that show how the dependent variable (the performance metric) varies as a function of a selected set of independent model parameters. These graphs depict performance over a range of values for several independent variables of the model. In many deep learning research papers, no attempt is made to provide performance graphs that show how the measured accuracy varies when any of the other independent variables are varied. This lack of insight into the operating range of the model, let alone other dimensions of performance, makes the research less applicable to any practitioner who wants to incorporate the approach in a real-world deployment.
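The sketch below shows the kind of figure systems papers routinely include: accuracy reported over a range of one independent variable (here, training set size) with error bars, instead of a single point estimate. All numbers are illustrative placeholders:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sweep: accuracy measured while one independent variable
# (training set size) is varied, with error bars from repeated runs.
train_sizes = np.array([1_000, 5_000, 10_000, 50_000, 100_000])
mean_acc = np.array([0.81, 0.86, 0.88, 0.91, 0.92])      # placeholder values
ci_half_width = np.array([0.03, 0.02, 0.015, 0.01, 0.008])

plt.errorbar(train_sizes, mean_acc, yerr=ci_half_width,
             fmt="o-", capsize=3, label="proposed model")
plt.xscale("log")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.title("Accuracy as a function of one independent variable")
plt.show()
```

The same treatment could be repeated for other independent variables (model depth, input length, class imbalance) to expose the operating range in which the reported accuracy actually holds.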

Further, with just one evaluation metric, accuracy, and no other metric such as stability (the retraining overhead incurred due to data drift), it is difficult to judge the usefulness of a reported result. With ever-increasing model layers and parameters comes the associated cost of training and retraining. It is not just a one-time training cost: periodic retraining is needed when data drift occurs. Many complex models in deployment today require retraining several times a day, placing huge demands on resources and thus incurring long-term technical debt. Does the incremental improvement in accuracy result in an increase in debt? Results showing the relationship between added debt and improvement in accuracy would lead to a better understanding of the cost-benefit tradeoffs of deploying complex machine learning models. A good research question to ask is: given a metric by which data drift can be measured, does the model incur only a proportional cost of additional retraining? If so, how should the layers be structured so that an incremental change in the drift measure requires only an incremental cost to service the technical debt?
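To make that question concrete, one needs a measurable drift metric to track retraining cost against. The sketch below uses the Population Stability Index (PSI), one common drift measure; the feature distributions and the 0.2 threshold are illustrative assumptions, not claims from this article:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Population Stability Index (PSI) over a shared binning of a single
    feature: compares the training-time distribution to the live one."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log-of-zero in empty bins.
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))

# Hypothetical feature values at training time vs. in production.
train_feature = np.random.normal(0.0, 1.0, 10_000)
live_feature = np.random.normal(0.3, 1.1, 10_000)  # drifted distribution

psi = population_stability_index(train_feature, live_feature)
# Rule of thumb (an assumption, not a standard): PSI > 0.2 signals drift.
if psi > 0.2:
    print(f"PSI={psi:.3f}: significant drift, consider retraining")
else:
    print(f"PSI={psi:.3f}: distribution stable")
```

With a metric like this in hand, a paper could report retraining cost as a function of measured drift, which is exactly the proportionality question posed above.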

I hope the deep learning community begins to demand a more thorough performance evaluation than a single number in bold font showing an accuracy improvement in the second or third decimal place. Accuracy, stability, technical debt, and other suitable metrics all have to be presented to judge the benefits of using a model for any underlying AI/ML task.

References:

  1. D. Sculley et al., "Hidden Technical Debt in Machine Learning Systems," NIPS 2015.
  2. A. Vaswani et al., "Attention Is All You Need," NIPS 2017.
