Demystifying the ML, AI, and Data Science development ecosystem (Part 1: Build)
This blog post is part 1 of a 3-part series explaining the landscape of what we call the quantitative-oriented developer (QoD) pipeline, which encompasses machine learning, artificial intelligence, and data science.
There are already many tutorials for getting started with individual modeling frameworks or algorithms, but explanations of how all of the moving parts fit together in these broader workflows are lacking. Countless tools, concepts, and algorithms are being discussed on any given day, and that creates a high barrier to entry for newcomers who are trying to find their place in this space. We think it’s essential that developers have access to resources that help them understand how all of these exciting pieces fit together from a broader perspective, in line with our mission to truly democratize AI.
For the sake of staying on topic, we’re not going to dive into specific performance tradeoffs between algorithms. That will be for another time!
Part 1: Build
1. Frame the question, approach, or goal
Regardless of what type of end product you’re trying to produce, building a data-driven product relies on structuring quantitative problems that we can “solve”, or at least problems where we can quantify success and then strive to optimize it.
Some examples of question-solution pairs are as follows:
Are we trying to optimize an existing process (machine learning)?
- Predicting defective products in a factory (predictive modeling)
- User-tailored content (targeted advertising)
Are we trying to develop an insight or better understand our data (data science)?
- Discovering user preferences (statistics, feature extraction)
- Assessing consumer financial patterns (developing insights)
Are we trying to build a data-driven automation solution (AI)?
- Chatbot (NLP, maybe Voice UX)
- Recommendation engine (Semantic analysis, NLP)
- Identifying features in images (Computer Vision)
2. Acquire Data
Once you’ve framed your question, you’ll need to identify and acquire the data necessary to solve it. Sometimes this requires finding a way to acquire new data; other times it involves cleaning data from a firehose that your company is already processing.
Regardless, the developer needs a way to retrieve, store, access, and manipulate data, so the data engineering pipeline becomes important to understand. Quantitative developers need to express their data needs in a way that their infrastructure-facing team can act on. Below is an overview of the data pipeline from Insight Data Science that relates all of the entities in a data pipeline, from ingestion/acquisition to a database solution that is queryable by the language of choice for doing quantitative development. They have an awesome interactive chart available here!
There are two primary ways to get access to data that you don’t already have, assuming someone isn’t going to just send you a dataset. The first is through an Application Programming Interface (API). APIs are gatekeepers for data that is hosted by someone else: a set of technical rules and procedures for how developers can access and use it. Almost all languages have libraries for calling APIs, thanks to widely adopted conventions such as REST and SOAP. Here’s a guide that shows how to use REST in 8 different languages.
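As a rough illustration of what calling a REST API looks like, here is a minimal sketch that uses only Python's standard library. The function name `fetch_json` and its `opener` parameter are our own hypothetical choices, not part of any particular API, and a real service will usually also need authentication and error handling.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(base_url, params=None, opener=urllib.request.urlopen):
    """GET a JSON document from a REST endpoint and return it as Python data.

    `opener` is a hypothetical hook: it defaults to the real network call,
    but can be swapped for a stub so the function runs without a network.
    """
    # Encode query parameters, e.g. {"limit": 2} becomes "?limit=2".
    query = "?" + urllib.parse.urlencode(params) if params else ""
    request = urllib.request.Request(
        base_url + query,
        headers={"Accept": "application/json"},
    )
    # The response body is bytes; decode it and parse the JSON payload.
    with opener(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would look something like `fetch_json("https://api.example.com/v1/items", {"limit": 2})`, where the URL is a placeholder rather than a real service.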
The alternative is data mining/web scraping, where a script traverses a medium (website, document, book, etc.) and extracts a particular type of data from the source in a usable format. Examples range from web crawlers that scrape top news headlines to computer vision APIs that can count appearances of people in a video (trust me, there are lots of different options).
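To make the scraping idea concrete, here is a small sketch that pulls headline text out of an HTML page using Python's built-in `html.parser`. It assumes, purely for illustration, that headlines live in `<h2>` tags; the sample page is made up, and real sites will need their own selectors (and a check of their terms of use).

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of every <h2> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_h2 = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <h2>.
        if self._in_h2:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self.headlines.append("".join(self._buffer).strip())
            self._in_h2 = False


def extract_headlines(html):
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines


# A made-up page standing in for a downloaded news site.
SAMPLE_PAGE = """
<html><body>
  <h2>Markets rally</h2><p>Body text.</p>
  <h2>New <em>battery</em> record</h2>
</body></html>
"""

print(extract_headlines(SAMPLE_PAGE))  # prints ['Markets rally', 'New battery record']
```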
Whether you’ve ingested data from an external source or already had it on hand, you need some way of accessing the data programmatically. Depending on the size and type of the data, there are different solutions one could take. Sometimes data is small enough to fit on the user’s local filesystem, but then making sure that there is enough RAM to manipulate the data becomes a concern. A single database is often enough to address this case, but sometimes your data is so large that you need a distributed system where you can perform multiple queries simultaneously for the sake of speed, or use internally facing APIs to query only the parts of the data that are relevant for your current work. This is where distributed access frameworks help: they serve as a wrapper for the databases that allows your processing scripts to handle data without making infrastructure/hardware constraints a nightmare.
There are many different solutions for internal data storage/access systems that fall under the ETL (extract, transform, load) umbrella, each with intrinsic tradeoffs that need to be considered based on an individual company’s storage and usage behaviors.
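As a small-scale sketch of "query only the parts you need, without loading everything into RAM", the example below streams rows from a database in fixed-size chunks. It uses an in-memory SQLite database as a stand-in for whatever storage system your team runs; the table, column names, and the `iter_rows` helper are all hypothetical.

```python
import sqlite3


def iter_rows(connection, query, params=(), chunk_size=1000):
    """Yield rows from a query in fixed-size chunks instead of all at once."""
    cursor = connection.execute(query, params)
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        yield from chunk


# Toy data: ten purchases spread across three users.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE purchases (user_id INTEGER, amount REAL)")
connection.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(i % 3, float(i)) for i in range(10)],
)

# Filter in the database (WHERE clause) and stream the matching rows.
total = 0.0
for user_id, amount in iter_rows(
    connection,
    "SELECT user_id, amount FROM purchases WHERE user_id = ?",
    (1,),
    chunk_size=2,
):
    total += amount

print(total)  # prints 12.0
```

The same pattern (filter at the source, then iterate in chunks) carries over to larger systems, although the client library and query syntax will differ.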
3. Data Pre-processing
Data preprocessing refers to all of the steps where raw data is transformed into a state that is directly usable by our data frameworks and programming languages. More often than not, data cleaning steps occur in the same language that the model will eventually be built in (although this is not necessary). Further, it’s valuable to do preprocessing steps that improve the ‘starting point’ for yourself and others who may be using this dataset, cutting down on redundant work.
In the real world, data is messy. Although we wish we could always have large, complete datasets, very often there are gaps in parts of our entries, a small sample size, imbalanced classes, or non-numeric variables that need to be converted into numeric values for use in machine learning algorithms.
Incomplete entries can either be omitted (if we have a large enough number of complete entries) or interpolated (estimated based on known properties of the other entries’ behavior).
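Both options can be sketched in a few lines of plain Python. The example below assumes missing values are stored as `None` and uses the column mean as the estimate, which is only one of many possible choices; the function names and sample rows are hypothetical.

```python
from statistics import mean


def drop_incomplete(rows):
    """Option 1: keep only the rows that have no missing values."""
    return [row for row in rows if all(value is not None for value in row.values())]


def impute_mean(rows, column):
    """Option 2: fill missing values in `column` with the mean of the known ones."""
    known = [row[column] for row in rows if row[column] is not None]
    fill = mean(known)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


rows = [
    {"age": 30, "income": 50000},
    {"age": None, "income": 62000},  # incomplete entry
    {"age": 40, "income": 58000},
]

print(len(drop_incomplete(rows)))       # prints 2
print(impute_mean(rows, "age")[1])      # prints {'age': 35, 'income': 62000}
```

Dropping rows is simplest but shrinks the dataset; filling in values keeps every row at the cost of baking an assumption into the data.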
Imbalanced classes occur when there is a large gap between the number of entries for two classes in a dataset. For example, if I were trying to build an algorithm that distinguishes legitimate credit card purchases from fraudulent ones, I would ideally want a 50–50 split between the two, but in practice the vast majority (>95%) are legitimate. Simply excluding most of the larger class doesn’t really work here the way omission does with incomplete entries, so more data for the smaller class is generated using quantifiable properties of the data we already have. The validity of conclusions is then tied to how closely the smaller class’s assumed behavior matches its actual behavior.
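The sketch below shows the simplest form of rebalancing: random oversampling, which just duplicates existing minority-class entries until the classes are the same size. Note that this is a simpler stand-in for the synthetic generation described above (techniques such as SMOTE create new points by interpolating between existing ones). The function name and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Toy dataset: 95 legitimate purchases and 5 fraudulent ones.
rows = [{"amount": float(i)} for i in range(100)]
labels = ["legit"] * 95 + ["fraud"] * 5

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))  # prints Counter({'legit': 95, 'fraud': 95})
```

Because the added rows are copies, a model can overfit to them, so rebalancing should be applied to the training data only, never to the data used for evaluation.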
Feature selection/extraction refers to the use of statistics for filtering a subset of factors, or generating significant factors based on variables from your dataset. Extracted features are not always “real” or tangible entities, but rather represent interactions between variables that prove to be strongly correlated with the outcome variable of choice.
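One common way to filter factors is to keep only those whose correlation with the outcome clears some threshold. The sketch below does this with a hand-written Pearson correlation; the threshold of 0.5, the function names, and the toy columns are assumptions for illustration, and correlation is only one of several possible selection criteria.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.5):
    """Keep the columns whose absolute correlation with the target is high."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


columns = {
    "hours_used": [1, 2, 3, 4, 5],
    "noise": [3, 1, 4, 1, 5],
    "constant": [7, 7, 7, 7, 7],
}
target = [2, 4, 6, 8, 10]

print(select_features(columns, target))  # prints ['hours_used']
```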
Data normalization is conducted so that the scale of variables doesn’t disrupt the way certain algorithms measure relationships. Examples include clustering, recommendation engines, and other methods where the relative distance between data points is what we really care about, but where differences in absolute scale would otherwise let one variable dominate the model.
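Two widely used normalization schemes are min-max scaling (squash values into the range 0 to 1) and standardization (shift to mean 0 and scale to standard deviation 1). Here is a minimal sketch of both; the function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the smallest is 0.0 and the largest is 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # avoid dividing by zero on constant data
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


incomes = [20000, 50000, 80000]  # very different scale from, say, ages
print(min_max_scale(incomes))    # prints [0.0, 0.5, 1.0]
```

After scaling, an income column and an age column contribute on comparable terms to any distance calculation, rather than income swamping age simply because its numbers are bigger.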
4. Building the Model
This is where the number of choices can start to feel overwhelming. Often there is a certain framework you wish to use, only to find out that it is available only in a particular language. Or your company’s stack may require one language in particular, so you have to find a package/framework that fits it because of technical or infrastructure constraints.
The answers to many of the following questions will depend on a few things, from objective differences between tools like features, limitations, and performance differences, to subjective things like preferences and tool knowledge (which ultimately convert into things that are more objective like time-to-completion and quality of code). Ultimately, the end goal is to build a model that solves your initial problem, so those initial conditions are what should be used for guiding your pro/con weighing of each implementation option.
The best way to navigate the seemingly limitless possibilities is to ask yourself a series of questions that will help you narrow in on a set of tools that suit your needs.
What datatypes are you working with (images, text, video, numerical, or some combination)?
What type of output from your model are you expecting (binary result of yes/no, a multiple outcome classification, or predicting an expected value for time series data like stock pricing)?
Which language will we use?
Which framework will we use?
Here you’ll be limited by which language you’ve committed to, as well as by what kind of data you have and what type of solution you’re looking for. Some of these frameworks were built around particular models as their specialty. Here’s a repository that offers an expansive directory of data modeling frameworks, sorted by language.
Some standout generalizable frameworks include:
- Python: TensorFlow, XGBoost, scikit-learn, MLlib, CNTK, auto_ml, PyTorch, Keras, Caffe/Caffe2
- C++: OpenCV (computer vision), CNTK, CUDA
Which type of learning, and consequently, which type of model?
First, the type of data will determine what type of learning methods you will have access to. Sometimes, you may not know what your question is, but you still wish to do exploratory analysis to generate insights on data you already have. This is referred to as unsupervised learning, where there is no known target data against which you can score your model; it simply tries to measure relatedness and uncover relationships.
Conversely, supervised learning involves using data entries that have known results in order to train a model that can most accurately predict which class a given datapoint belongs to (or what value it should take), as dictated by the training data’s known behavior.
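To make "learning from entries with known results" concrete, here is a sketch of one of the simplest supervised methods, a 1-nearest-neighbor classifier: a new point is given the label of the closest labeled training point. This is just one illustrative algorithm among many, and the function name and toy data are hypothetical.

```python
from math import dist


def predict_nearest(train_points, train_labels, query):
    """Return the label of the training point closest to `query`."""
    nearest_index = min(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),  # Euclidean distance
    )
    return train_labels[nearest_index]


# Labeled training data: two small groups of 2D points.
train_points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (4.8, 5.2)]
train_labels = ["a", "a", "b", "b"]

print(predict_nearest(train_points, train_labels, (1.1, 0.9)))  # prints a
print(predict_nearest(train_points, train_labels, (5.1, 4.9)))  # prints b
```

An unsupervised method would receive only `train_points`, with no `train_labels`, and would have to discover the two groups on its own.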
Second, the type of data you have (continuous or categorical) will drive you towards a specific algorithm. Each algorithm has different properties, from model inputs and outputs, to tradeoffs of bias, variance, class balance, normality, conservativeness, and beyond.
Once all of these things have been decided, you can code, and consequently, build your model. The code writing and ‘compilation’ steps differ mainly by language.
Next steps: Validation and Tests
Congratulations, you’ve now gone from ideation to a trained model! Now, while your model may not have errored during compilation/build, there is still a lot that goes into testing and validating it, ranging from assessing whether it actually modeled your data in the way you thought it did (validation with training subset/holdout data) to whether it holds up on unknown/live data in the future. Stay tuned for ‘Part 2: Test’ to learn more, where we’ll explain all of these aspects of your model and how to assess it!