Why Deep Learning Is Not the Holy Grail of Data Science

Faizan Ahemad
6 min read · Jan 9, 2018


How silly is your Neural Network?

If you are getting into Data Science right now, don't just do Deep Learning: do some stats, some regression and some SQL too. Then you can do more than just distinguish between cats and dogs.

Every day I see multiple deep learning articles popping up on Medium: articles about detecting cats and dogs, playing chess and Go, even self-driving cars.

But then the question arises: are statistics and other machine learning techniques useless? Is Linear Regression dead? Is Data Engineering an obsolete craft?

To answer these questions we need to consider the limitations of deep learning. When was the last time you needed to recognise images or play Go in your job? Here are a few limitations of deep learning I have seen:

  • ETL (Extract, Transform, Load) is still needed: Deep Learning systems can't pull data from several data stores into one clean format on their own, especially when some of those stores are unstructured files like logs. Once the data is extracted, the right transformations are still up to a human to design. You still have to normalise your data, even for DL.
  • Feature engineering is supposedly not needed, since deep learning learns feature importances itself. But we still need to convert every feature into a numeric type; isn't that feature engineering? What about cases where features are collinear (hint: gradient descent won't play well)? Too many features also make training much slower, so you either need more resources or you need to select the right set of features.
  • Getting feature importances out of a network is tough (see here and here for hints on how to do it); the sketch just below shows one rough workaround.
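
To make the three points above concrete, here is a minimal sketch in Python. It is my own illustration, not code from this post or any production system, and the dataframe, column names and numbers are made up. It shows the numeric encoding and normalisation a network still needs, plus permutation importance (from scikit-learn) as one rough way to get feature importances back out of a trained model.

```python
# A minimal sketch (mine, not from this post): the "boring" work a network still
# needs, plus permutation importance as one rough route to feature importances.
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Made-up rows standing in for data pulled out of several stores.
df = pd.DataFrame({
    "age":    [25, 40, 31, 58, 22, 45, 36, 50],
    "city":   ["delhi", "mumbai", "delhi", "pune", "pune", "mumbai", "delhi", "pune"],
    "income": [30, 120, 55, 90, 28, 110, 60, 95],
    "bought": [0, 1, 0, 1, 0, 1, 1, 1],
})

# Categorical -> numeric. This is feature engineering, deep learning or not.
X = pd.get_dummies(df.drop(columns="bought"), columns=["city"]).astype(float)
y = df["bought"]

# Normalise the continuous columns so gradient descent behaves.
X[["age", "income"]] = StandardScaler().fit_transform(X[["age", "income"]])

net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)

# Permutation importance: shuffle one column at a time and watch the score drop.
result = permutation_importance(net, X, y, n_repeats=20, random_state=0)
for name, importance in zip(X.columns, result.importances_mean):
    print(f"{name}: {importance:.3f}")
```

Permutation importance is model-agnostic: shuffling a column and measuring how much the score drops works on a neural network just as well as on a forest, which is why it is a common workaround when the model itself gives you nothing.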
Does this NN look simple to your business??
  • Simplicity: How do you plan to explain to your business the gradient descent that happened in a 1000-dimensional hyperspace? Could you at least explain ReLU and sigmoid to them? Naive Bayes and other models, such as decision trees, are much easier to explain and understand.
  • Statistical significance: With Naive Bayes it is possible to say how significant a result is and how confident you are in a prediction (90% confident vs 60% confident); see the sketch a little further down.
  • Domain knowledge: No matter how good a neural network is, it will only optimise the loss function you give it. For a skewed dataset like fraud, where the split may be 99:1, a network asked to optimise accuracy will simply predict the majority class every time. You need to understand your domain to design a good error function or metric: RMSE, RMSLE, recall, F1, or something very specific to your problem.
  • Large amounts of data needed: Deep learning is most useful where you have huge amounts of data. My experience with xgboost vs DNNs suggests a DNN only starts to show promise beyond roughly a million examples. With anything less you have a much higher chance of overfitting, especially if you have a large number of features. Try a DNN with 1k features and only 10k examples to see how badly it can overfit.
  • Extrapolation to unknown data: This is a challenge all ML algorithms face, and DL is not free of it either. A practitioner needs to check whether their sample is representative of the real world and proceed accordingly. Check the images below, and the sketch that follows them, for a model trying to predict X².
Good Fit while Training
Fails to extrapolate to New Data
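
Here is a rough sketch of what these two plots illustrate, using a small scikit-learn MLP rather than whatever network produced the figures: the model fits x → x² nicely inside the training range and falls apart outside it.

```python
# Sketch of the extrapolation failure above (not the code behind the plots).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
x_train = rng.uniform(-1, 1, size=(2000, 1))   # training range: [-1, 1]
y_train = (x_train ** 2).ravel()

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(x_train, y_train)

# Inside the training range the fit is fine; outside it the predictions
# drift well away from the true values of 4 and 16.
x_test = np.array([[0.5], [2.0], [4.0]])
print(model.predict(x_test))
```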
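
Going back to the statistical-significance and domain-knowledge bullets, here is a tiny hedged example, again with synthetic numbers: the everyday stand-in for "90% vs 60% confident" is predict_proba on a Naive Bayes model, and on a 99:1 fraud-style split accuracy rewards a model that never flags fraud at all.

```python
# Sketch only: class probabilities from Naive Bayes, and accuracy vs recall
# on a heavily skewed dataset. The data is synthetic.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
n = 10_000
y = (rng.rand(n) < 0.01).astype(int)        # roughly 1% "fraud"
X = rng.randn(n, 3) + y[:, None] * 0.5      # weak signal

nb = GaussianNB().fit(X, y)
print(nb.predict_proba(X[:3]))              # per-class confidence for each row

always_zero = np.zeros(n, dtype=int)        # a model that never flags fraud
print("accuracy:", accuracy_score(y, always_zero))   # ~0.99, looks great
print("recall:  ", recall_score(y, always_zero))     # 0.0, catches nothing
```

Accuracy looks wonderful and the model is useless; picking the right metric for that 99:1 split is exactly the domain knowledge the network cannot supply for you.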
  • Non-labelled data, or no data to begin with: Suppose you launched a social network recently and want to sort posts by some logic. You might say you will generate post importance by feeding likes/clicks/claps into a model and training it. But then it dawns on you that you just launched, so there is no data and hence no DL. You can go for a simpler linear model with pre-decided weights here (see the sketch after these points) and improve on it once you have more data.
  • Theoretical knowledge needed: You need to know a lot of theory to use a DNN effectively. Compare that to xgboost or random forests, where you can just grid-search over a few hyperparameters and get a good result. For a DNN you need to understand Adam vs SGD, dropout vs other forms of regularisation, how many layers to use and how many neurons in each layer. And if you are modelling time series, then LSTMs are a different beast altogether.
  • Lots of computational resources needed: You will need a high-end rig with GPUs to train neural networks at any appreciable speed. Compare that to random forests, which can be trained in a distributed manner on a cheap cluster.
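
For the cold-start point above, this is the kind of hand-weighted linear scorer that bullet has in mind; the feature names and weights are invented for illustration, and the whole thing exists only until there is real engagement data to learn from.

```python
# Hypothetical cold-start ranking: a linear score with hand-picked weights,
# used only until there is real engagement data to train a model on.
from dataclasses import dataclass

# Invented weights; in practice these come from product intuition and A/B tests.
WEIGHTS = {"recency_hours": -0.5, "author_followers": 0.3, "has_image": 2.0}

@dataclass
class Post:
    recency_hours: float
    author_followers: float
    has_image: float

def score(post: Post) -> float:
    return sum(w * getattr(post, name) for name, w in WEIGHTS.items())

posts = [Post(2, 100, 1), Post(30, 5000, 0), Post(1, 10, 0)]
for p in sorted(posts, key=score, reverse=True):
    print(p, round(score(p), 1))
```

Once a few months of likes and clicks exist, the same features can feed a learned model and these hard-coded weights can be retired.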
Debugging these cases is hard
  • Deep Learning models are black boxes (debugging): How do you explain to your business partners how the model works? How do you say that a certain outcome was predicted for a certain reason? Example: you are classifying whether a person is eligible for a loan, and an application gets rejected. Now you need to know why. Say your training set had five similar rejections, and the only thing those applicants had in common was that they owned cats; now any cat owner gets his loan rejected. But you never added another field that was also common to those five cases: their permanent address. It turns out they are from the same family and have committed multiple loan frauds. The local bank knew this and rejected their loans, but your training data only captured their love of cats, not their locality. Since your DL system is a black box, you will not diagnose this easily, or maybe ever. Garbage in, garbage out.
  • Inability to understand, store and transfer conceptual knowledge: You know the process of driving, and you can teach it to a friend in under 20 hours, because they already know the concepts of steering, roads and cars. Your mind builds up a hierarchy of concepts. Why, then, do Google and Uber find it so challenging to build a self-driving car? Because there is no way to teach basic concepts to a neural network; you always start from scratch. Transfer learning, in its current state, mostly helps only in image-recognition-type problems (a rough sketch follows below).
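
As a closing illustration of the one place transfer learning does routinely help, image-style problems, here is a minimal Keras sketch. It is my own example, it assumes a TensorFlow/Keras setup, and the ten-class head is made up: reuse a network pretrained on ImageNet, freeze it, and train only a small new head on your own labelled images.

```python
# Sketch of image-style transfer learning: reuse ImageNet features, train a new head.
# Assumes TensorFlow/Keras is installed; the 10-class task is hypothetical.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # keep the pretrained features frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),   # new task: 10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # with your own labelled images
```

This works because low-level visual features transfer between image tasks; as the bullet says, there is no equivalent trick yet for handing a network the concept of steering or a road.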
