In my daily work in Machine Learning and Full-Stack Software Engineering, I touch many aspects of Amazon Web Services (AWS), and I am constantly surprised by how many new services are becoming available in the Machine Learning space, as well as by how well they integrate with the more bread-and-butter compute and storage services across the sprawling AWS platform. Furthermore, price-performance is improving exponentially.
This post articulates my thoughts on moving up the value chain in Machine Learning: how an individual or small team can leverage their talent, knowledge, and limited time to do far more than before, and what this implies for architecting systems to maximize productivity. It starts at the lower levels of the value chain and moves towards increasingly higher ones.
1. Commodity Cloud Infrastructure in Preparation for Machine Learning
Except in special cases, there’s no reason for an individual or organization to buy, set up, and maintain individual servers for computing tasks such as processing or storage, since it can all be abstracted away through virtual machines and various services. Not only is it vastly cheaper and less labor-intensive to spin up a virtual machine, but it permits near-infinite scaling (limited only by capital), with value-added services such as automated backups, enhanced security, etc.
Indeed, infrastructure-as-a-service or datacenter-as-a-service is where it all begins. This has been the case for over a decade, and it’s become so commonplace, cheap, and efficient that not much attention is paid here. I believe this effect will move rapidly up the value chain.
AWS even provides disk images that can be used to spin up virtual machines in the state of that image, with software packages pre-installed, saving even the simple act of installing free open-source software and all its dependencies. Getting a machine to a usable operating state this way abstracts away entire job functions and lets you start fairly far up the value chain, compared to even a few years ago, when a programmer would need to configure Linux, install Python, install packages, and get all the versioning and dependencies correct. A nightmare indeed, and one that is now abstracted away by tools such as Anaconda or Pip.
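To make this concrete, here is a minimal sketch of launching a virtual machine from such a disk image with boto3, AWS's Python SDK. The AMI ID and instance type below are placeholders, not real resources; building the request parameters is separated out so the sketch can be inspected without an AWS account.

```python
# Sketch: launching a pre-configured virtual machine from a machine image.
# The AMI ID below is a placeholder, not a real image.

def build_launch_request(ami_id: str, instance_type: str = "t3.micro") -> dict:
    """Build the keyword arguments for boto3's EC2 run_instances call."""
    return {
        "ImageId": ami_id,          # e.g. a Deep Learning AMI with frameworks pre-installed
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
    }

request = build_launch_request("ami-0123456789abcdef0")

# With AWS credentials configured, the actual launch is a single call:
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.run_instances(**request)
```

The point is how little remains to do by hand: choosing an image and an instance size replaces hours of operating-system and dependency configuration.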
Databases hosted on servers must be set up and maintained, adding a layer of manual labor. This too can be abstracted away via various services. I focus on the canonical SQL and NoSQL solutions.
Amazon has Redshift, among other SQL solutions, which can be coarsely viewed as an infinitely scalable relational database. Queries are blazingly fast, and you can search tens of millions of rows in seconds. Amazon has DynamoDB, among other NoSQL solutions, with the functionality you would expect from a non-relational database.
It takes seconds to configure these and have them up and running. The hardest and most time-consuming part is simply getting oriented and getting permissions correct. This frees the programmer to focus on adding, organizing, and thinking through the data.
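As a sketch of how little code the managed NoSQL path requires, here is a hypothetical DynamoDB write. The table name and record fields are made up; the helper converts a plain Python record into the typed attribute format DynamoDB's low-level client expects ("S" for strings, "N" for numbers).

```python
# Sketch: storing a record in a managed NoSQL table.
# Table name and attributes are hypothetical examples.

def to_dynamodb_item(record: dict) -> dict:
    """Convert a plain Python record into DynamoDB's typed attribute format."""
    item = {}
    for key, value in record.items():
        if isinstance(value, (int, float)):
            item[key] = {"N": str(value)}   # DynamoDB sends numbers as strings
        else:
            item[key] = {"S": str(value)}
    return item

item = to_dynamodb_item({"user_id": "u-42", "score": 97})

# With credentials configured and a table created:
#   import boto3
#   dynamodb = boto3.client("dynamodb")
#   dynamodb.put_item(TableName="example-table", Item=item)
```

No servers, replication, or backup schedules to manage; the permissions and the data model are the real work.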
2. Commodity Machine Learning
With the technical infrastructure for computing and data storage in place, the true task of machine learning begins. Roughly 80%-90% of the effort goes simply into cleaning and prepping the data so it is well-structured for use in the machine learning process. I suspect this part of the value chain may be very difficult to abstract away, as it requires actual human effort.
There exist many ways to mitigate the cleanup and structuring process, such as having very clear business requirements, with very precise rules to gather, validate, and structure the data far upstream.
Machine Learning Processes: Training
In Machine Learning, there are various techniques for holding out a training set, a validation set, and a test set. The results are then compared with various assessment metrics, such as a confusion matrix from which precision, recall, accuracy, the F2-score, etc. are calculated, to determine how good the model is and in what way.
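These assessment metrics are all simple functions of the confusion-matrix counts. A minimal sketch for the binary case, using made-up example counts:

```python
# Sketch: computing standard assessment metrics from binary
# confusion-matrix counts (tp/tn/fp/fn are hypothetical numbers).

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f_beta(p, r, beta=1.0):
    """General F-score; beta=2 weights recall more heavily (the F2-score)."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

tp, tn, fp, fn = 80, 90, 10, 20   # hypothetical test-set counts
p, r = precision(tp, fp), recall(tp, fn)
acc = accuracy(tp, tn, fp, fn)    # (80 + 90) / 200 = 0.85
f2 = f_beta(p, r, beta=2.0)
```

Each metric answers a different question: precision penalizes false alarms, recall penalizes misses, and the F-score trades the two off.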
Ideally, these commodity machine learning processes could be abstracted away so the whole workflow is viewed simply as:

1. Training
2. Testing
3. Predicting (in Reinforcement Learning, this would be modified to maximize some scalar reward, instead of a metric such as accuracy.)
Amazon SageMaker allows these processes to be abstracted away so you can focus on the overall problem.
Parameter and Hyperparameter Tuning
If the training, testing, or prediction doesn’t proceed according to expectations and various statistical metrics are not met, the process is repeated in successive iterations. This may involve tuning the parameters of specific machine learning models, such as support vector machines, or the starting weights in neural networks. It may also involve hyperparameters such as the regularization strength, the learning rate of a neural network, and so on.
Previously, this had to be done “by hand”: the engineer wrote loops to step through the state space of parameters, defining minimums, maximums, and increments. Now SageMaker has processes to do this automatically, saving time and freeing the developer. This matters because tuning is not a creative or deep process but a shallow, superficial, mind-numbing one, exactly the kind of thing that should be abstracted away if possible.
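The “by hand” version looks something like the following sketch: a nested loop over a grid of hyperparameters, keeping whichever combination yields the lowest validation loss. The loss function here is a toy stand-in for a real train-and-validate cycle.

```python
# Sketch: hand-written hyperparameter grid search, the kind of loop
# that automatic model tuning replaces. The loss function is a toy
# surrogate for an actual training + validation run.

def validation_loss(learning_rate, reg):
    # Toy surrogate, minimized at learning_rate=0.1, reg=0.01.
    return (learning_rate - 0.1) ** 2 + (reg - 0.01) ** 2

learning_rates = [0.001, 0.01, 0.1, 1.0]   # grid: min, max, increments
regs = [0.0, 0.01, 0.1, 1.0]

best = None
for lr in learning_rates:
    for reg in regs:
        loss = validation_loss(lr, reg)
        if best is None or loss < best[0]:
            best = (loss, lr, reg)

best_loss, best_lr, best_reg = best
```

A real run replaces `validation_loss` with a full training job per grid point, which is exactly why automating (and parallelizing) this loop is so valuable.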
4. Data is the True Advantage
In all these cases, training a model and optimizing it by tuning a range of parameters and hyperparameters may improve prediction accuracy, but it faces the fundamental limit of insufficient data.
In fact, a superior tuning and classifier-training process with less data often performs worse than an inferior one with more data. Basically, more data can pick up the slack where a model or process falls short. This is the true advantage: more data is better data.
Data is an even greater competitive advantage in a niche field, where it matters more than the sophistication of your machine learning process or your computational resources.
This is where individuals or companies can differentiate themselves and hold a true competitive advantage. The AI startup Blue River Technology, which mounted cameras on tractors to identify plants and weeds, was acquired by the agricultural machinery company John Deere.
Basically, using Deep Learning, Blue River became the world’s expert at determining whether something caught on camera is a plant or a weed: a plant is sprayed with fertilizer, a weed with herbicide. By enhancing the ability to identify plants in this specific context, they were able to reduce the volume of chemicals sprayed by up to 90%, minimizing the chemical footprint in our food chain. The key advantage was having data, in the form of camera images of plants and weeds to train deep learning models on, and having more of that data than anyone else in the world. This allowed Blue River to outperform Google, Amazon, and other large tech companies that easily have orders of magnitude more computational power and far more manpower.
5. Machine Learning: Models and Algorithms
If you don’t have the data and/or the computational resources for training, it can be possible to abstract away the training process entirely by jumping to the end and obtaining a trained model instead. Indeed, ImageNet, with 14 million images across 20,000 categories, is the underlying data set for various convolutional neural network models, whose learning is captured in the weight parameters of their nodes. These are available as free, open-source files that you can import and start classifying with. By having access to the weights of trained convolutional network models, you bypass the need to access the original data, the massive computational resources required, and the human software development and infrastructure behind them.
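In miniature, using a pretrained model means “load published weights, run the forward pass,” with no data set or training loop of your own. A toy sketch, where the hypothetical weights stand in for a downloaded weight file from a model zoo:

```python
import math

# Sketch: classification with already-trained weights, no training required.
# The weights below are hypothetical stand-ins for a downloaded weight file;
# a real pretrained ImageNet model works the same way at vastly larger scale.

PRETRAINED = {"w": [2.0, -1.0], "b": 0.0}   # pretend these were published by others

def predict(features, params=PRETRAINED):
    """Forward pass of a tiny logistic classifier with fixed, imported weights."""
    z = sum(w * x for w, x in zip(params["w"], features)) + params["b"]
    prob = 1.0 / (1.0 + math.exp(-z))
    return 1 if prob > 0.5 else 0

labels = [predict([3.0, 1.0]), predict([0.0, 2.0])]
```

All the cost (data collection, compute, tuning) is embedded in the weights; the consumer only pays for inference.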
You move up the value chain simply by having access to trained classifiers. This is useful even if you have access to all the original components for training, such as the original data set and the open-source algorithms, since it bypasses that labor and computation, which can be viewed as fungible and interconvertible with money.
As for algorithms, it’s interesting to note that simply importing packages that encode algorithms developed through intense research effort already moves you far up the value chain, leveraging thousands of hours of work by brilliant minds. And even this can be abstracted away further, by using pretrained models and bypassing the step of creating models with those algorithms altogether.
Interestingly, the AWS Marketplace actually sells trained classifiers. For example, one suite of classifiers from Perception Health can be rented to predict the risk of various diseases, such as breast cancer, heart failure, fibromyalgia, colorectal cancer, and a few dozen others. The data run through the classifiers consists of timestamps, ICD-10 procedure codes, HCPCS procedure codes, diagnoses, and a few other data points accessible from patient histories. These models were trained on billions of medical claims, which would be effectively impossible for a machine learning developer to access. Even with access, it might not be practical to run them through the entire training process.
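Consuming such a rented model typically means serializing a record and calling a hosted inference endpoint. A hedged sketch, where the endpoint name and the record fields are hypothetical examples loosely following the claim-history inputs described above:

```python
import json

# Sketch: invoking a hosted, already-trained classifier instead of training one.
# The endpoint name and all field values are hypothetical examples.

record = {
    "timestamp": "2019-03-14T10:00:00Z",
    "icd10_codes": ["I50.9"],      # example diagnosis code
    "hcpcs_codes": ["G0008"],      # example procedure code
}
payload = json.dumps(record)

# With credentials configured and an endpoint deployed, inference is one call:
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint(
#       EndpointName="example-risk-model",
#       ContentType="application/json",
#       Body=payload,
#   )

decoded = json.loads(payload)      # round-trip check of the serialized record
```

The developer never touches the billions of claims or the training cluster; the entire lower value chain is behind the endpoint.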
6. True Advantage: Iteration and Moving to Problem Domain
So it appears that we can move rapidly up the value chain by abstracting away the collection of data, the data itself, the training of classifiers by tuning parameters and hyperparameters, the classifiers themselves, and all the code and attention required to step through these processes.
When confronting problems that involve Machine Learning, it’s important to think and strategize about how far up the value chain you can start by using existing tools and services, before you even begin solving the problem in the problem domain, rather than starting from the bottom, working your way up, and taking pride in a job well done in the technical/coding domain. When you wield powerful technical tools, it’s easy to view problems through those tools and reach for them, rather than think about how you can bypass their very use.
This is important because maximizing performance at the lower levels of the value chain does not necessarily imply beneficial progress at the higher levels, which ultimately hold the most potential to minimize development time and attendant capital costs, and to maximize value to increase revenue and profit.
For example, even if you have the most elegant and efficient code to loop through and train convolutional neural networks on ImageNet, and to assess every part of the training process to maximize your success metrics, there’s no guarantee you can come up with better trained models than those that already exist and are freely available to import and use.
The true advantage of moving up the value chain this way is that you can iterate quickly and move on to solving the actual problem in the problem domain, rather than intermediate, inconsequential, artificial, and/or purely technical problems in the infrastructure, machine learning, and software engineering domains. There’s no sense in re-inventing the wheel. Always seek leverage to accomplish the most through thinking, while minimizing the actual “doing,” especially since we all have such limited time and energy, and there are so many problems in the world to solve.