The Convergence of Data Science and Engineering
This is an opinion piece. I have been working almost exclusively with data for the last 10 years. I have had the privilege to work with some of the most nimble, innovative, and hard-working engineers as we built custom applications and insights for our data-savvy customers to consume.
My goal with this article is to articulate what has happened in the world of data science over the last 3–5 years, as seen through my lens.
Picture what a BMW 7 Series sedan looked like under the hood 25 years ago versus now.
Time an average driver had to spend with the hood open 25 years ago — maybe an hour every month? Time spent today — maybe an hour every 3–4 years. Modern internal combustion engines have become that good over the decades. Precision in the design and manufacturing of all components big and small in a car is exceptionally high. One can get significantly higher lifetime mileage from a modern average sedan than an old expensive saloon. Furthermore, engineering services and subject matter expertise for ensuring the upkeep of the car are available at commoditized prices.
It is the same with modern Data Science — Artificial Intelligence — Machine Learning.
The Definition
Allow me to digress for a bit. Since there is no authority like ISO to define these terms, here are my definitions. I use Data Science as a catch-all term. If it were a list, it would contain these items: ['descriptive analytics', 'business intelligence', 'predictive analytics', 'artificial intelligence', 'machine learning', 'prescriptive analytics', 'natural language processing', 'vision computing', 'statistics']. There are more items on that list. Artificial intelligence is an experience for the end-user of the product or service. It looks good in presentations and sounds fancy in conversations. It could be a rudimentary rule engine behind the API, or it could be a genetic algorithm finding the optimal combination of parameters to run a factory, saving millions of dollars in operational expenditure. As long as that service's users get an impression of magic, it is AI. The same thing is called Machine Learning on the implementation side when hiring people to build such systems. Now, back to the topic.
The Organization
The size of the organization determines the roles it offers to build such systems. A big organization may have specialized roles on offer. What matters next is where an organization sits on the digitization spectrum: the more digital the industry or organization, the more specialized the roles on offer.
I am not going to stir up the hornet's nest that is vacancies listed by recruiters from one-year-old, recently funded startups wanting to hire a Ph.D. with over 10 years of experience in — SQL, Scala, Spark, Python, R, time series analysis, natural language processing, convolutional neural nets, and so on and so forth.
A non-tech firm with a few hundred employees would have a designation called Data Scientist. The organization would want her to start by writing its business intelligence pipeline using an ETL tool as it moves rightward on the data-maturity scale. Next, she would do descriptive reporting and dashboarding using spreadsheets, Qlik, or Tableau. As the data demands increase, she has to explore and adopt an analytical data warehouse like Redshift that allows horizontal scaling. She would model a simple data mart and connect the reports to it using SQL. Now they want her to forecast the revenue of their product lines for the next 2 quarters. She sets up a notebook platform like JupyterHub to build an ARIMA model using Python. Then she is expected to package it into a REST API using Flask, deployed on a Kubernetes service. The same individual is expected to cover the entire spectrum of data requirements. Based on anecdotal evidence (I like oxymorons), this is the role offered 90% of the time regardless of the quoted designation.
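That last step, wrapping a fitted model in a Flask REST API, can be sketched roughly as follows. The endpoint name, the stub model, and its predict() interface are all hypothetical, chosen only for illustration:

```python
# Hypothetical sketch: exposing a fitted forecasting model over HTTP with Flask.
# StubModel stands in for a real model loaded from disk; its toy linear
# "forecast" is an assumption made purely for this example.
from flask import Flask, jsonify, request

app = Flask(__name__)

class StubModel:
    def predict(self, periods):
        # Toy forecast: a flat level plus a constant trend per period.
        return [100.0 + 2.5 * i for i in range(periods)]

model = StubModel()

@app.route("/forecast")
def forecast():
    # e.g. GET /forecast?periods=3 returns a 3-period forecast as JSON.
    periods = int(request.args.get("periods", 2))
    return jsonify({"forecast": model.predict(periods)})

if __name__ == "__main__":
    app.run(port=5000)
```

In practice the model object would be serialized during training and loaded at startup, and the container running this app is what ends up deployed on Kubernetes.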
A non-tech firm with thousands of employees that is data-mature would have these folks on its payroll — product analyst, business analyst, data engineer, machine learning engineer, applied data scientist, decision scientist, research analyst, etc. These specific roles are 5% of the listings.
The remaining 5% of vacancies are from those one-year-old startups. To be fair to them, I have seen such job descriptions from seemingly mature organizations as well. I attribute them to a complete disconnect between talent acquisition and the hiring manager.
The Commodity
Compared with cars, the digital world moved far faster, as one would expect. Instead of taking 5–7 decades, the analogous transition happened within the last 5–7 years.
The lead time from the publication of a research paper to the algorithm's availability as a commodity function on cloud vendors for public consumption is short. Complex mathematical algorithms are now encapsulated in libraries like scikit-learn and SciPy.
As an implementer, one needs to know the applicability of algorithms to the problem statement. Once that is known, she gives the algorithm clean input in the format it expects. She needs to know what happens within, but only at a zoomed-out level. Finally, she keeps tabs on the output. Her goal is to ensure an ROI above the baseline; if something doesn't look right, she should be able to point to the faulty part. All modern libraries keep these steps simple. They give clear visibility into what goes in and what comes out while keeping the middle adequately abstracted away.
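That workflow can be sketched with scikit-learn on synthetic data: clean input goes in, the middle stays a black box, and the output is checked against a do-nothing baseline before anyone trusts it. All data and numbers here are illustrative assumptions:

```python
# Sketch (not from the article): compare a fitted model against a naive
# baseline to confirm the pipeline beats doing nothing.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: a known linear signal plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predict the training mean. Any real model must beat this.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

print("baseline R^2:", baseline.score(X_test, y_test))  # near zero
print("model R^2:", model.score(X_test, y_test))        # well above baseline
```

If the model fails to clear the baseline, the faulty part is usually the input preparation or the choice of algorithm, not the library internals.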
Here are a few examples of commoditization of math as a service -
Statistical Analysis and Modeling
One doesn't have to be a statistician to build an ARMA/time-series model that forecasts revenue. As an engineer, one can study the vertical slice of statistics that helps solve the current problem. A few things to study ahead of time: whether the data represents a stationary time series, which helps isolate the trend, and whether cyclicity and seasonality are components of the series, since they must be accounted for as well.
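As a minimal illustration of the stationarity point, a trending series can be differenced so that its level no longer depends on time. The data here is synthetic, and the slope of 5.0 is an assumption made for the example:

```python
# Sketch: first-differencing removes a linear trend, which is exactly what
# the "I" (integrated) part of ARIMA does before the ARMA terms are fit.
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(120)
series = 5.0 * t + rng.normal(scale=1.0, size=t.size)  # strong upward trend

diffed = np.diff(series)  # first difference of the series

# The raw series drifts upward forever; the differenced series hovers near
# the constant slope regardless of which half of it you look at.
first_half = diffed[: diffed.size // 2]
second_half = diffed[diffed.size // 2:]
print(round(first_half.mean(), 1), round(second_half.mean(), 1))
```

Checking that summary statistics stay stable across windows like this is the intuition behind formal stationarity tests such as the augmented Dickey–Fuller test.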
Albeit simplified, this is what the code will look like — with carefully made parameter choices.
```python
# Note: recent statsmodels releases moved ARIMA from statsmodels.tsa.arima_model
# to statsmodels.tsa.arima.model, and fit() no longer takes the disp argument.
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(df_log, order=(2, 1, 2))
results = model.fit()
```
One could further simplify the job by using Facebook's Prophet library for time-series forecasting.
```python
# In newer releases the package is named prophet: from prophet import Prophet
from fbprophet import Prophet

prophet = Prophet()
prophet.fit(df)
future = prophet.make_future_dataframe(periods=12 * 6, freq="M")
forecast = prophet.predict(future)
```
It works well. Well enough to get results that are ‘reasonably close’ to a very hand-crafted model.
Other Applications
The question is: how closely is the specific model tied to your firm's core competency? If it is not, it is better to go with a less hands-on approach. One will be able to show positive results faster, making it more likely that the higher-ups sign off on the production use-case.
As the table above shows, even the slower approaches are a far cry from having to write the algorithm oneself in Java or C. The difference is that they give more knobs to turn between each set of inputs and outputs as the data is processed from left to right.
The Transition
Cloud vendors like AWS, GCP, and Azure have accelerated this transition of Data Science to Engineering. Machine learning features are available out of the box on analytical databases like BigQuery, and they are incredibly useful. SQL is the standard language, and the data already resides in such databases, so one does not have to pull it out to wrangle it in Python. Features comparable to what's available in NumPy, pandas, and even scikit-learn can be used as SQL functions, and user-defined functions take it to the next level. Then there are managed machine learning services, which I have not tried enough, but I am sure they further help with the accessibility of both the math and the engineering.
Meet the business objective; otherwise, it is either a research project or a hobby. Most firms want to apply things as quickly as possible: make their customers' lives simpler, reduce costs at their end, and keep their stakeholders happy by showing the strength of their existing products. It is important to let them know what is working and what is not as early as possible. I read that 70% of machine learning projects end up abandoned. The shift from 'science' to engineering is helping this number go down.
An engineer or analyst using commodity tools can help reduce a firm’s marketing expenses by 60% by doing the right targeting with the help of the right models. Often, there are diminishing marginal returns from that point onwards. To shave off every extra 1% of that marketing expenditure, the amount of money spent on specialized tools and individuals increases.
80% of the business use-cases, ranging from simple to complex in terms of technical challenge, can be covered by engineering. The remaining 20% may require mathematical research. But unless you are a large enterprise flush with funds, you may want to consider cheaper solutions to achieve those business objectives for all stakeholders. A smaller firm could choose to hire a Ph.D./specialist for a few critical assignments, but then do they expect them to do descriptive analytics and pipeline design as well?
The convergence has already happened. Modern Data Science is now Engineering for more than 95% of us.
The Individual
An engineer with her background in data is rightly placed to take advantage of this newfound accessibility. These libraries have solid documentation. There are checklists published by seasoned professionals calling out what to do and what not to do in commonly encountered situations. Research papers published and reviewed by experts on a vast variety of topics are available. They act as beacons. Everything else in between can be found on — Stack Overflow, Medium, YouTube, Stats.StackExchange, etc.
Anybody with curiosity and drive can thoughtfully use such libraries and packages and bend them to their use cases. Engineers are used to stacking publicly available packages like Lego blocks to build what solves the business problem. Given that good engineers vastly outnumber candidates from math, statistics, and actuarial backgrounds, regardless of the latter's quality, hiring managers like me can fill Data Scientist positions.
A modern Data Scientist can build a model atop the transactional data using the following SQL and immediately start running predictions:
```sql
CREATE OR REPLACE MODEL m
  TRANSFORM(
    ML.FEATURE_CROSS(STRUCT(f1, f2)) AS cross_f,
    ML.QUANTILE_BUCKETIZE(f3) OVER() AS buckets,
    label_col
  )
  OPTIONS(model_type='linear_reg', input_label_cols=['label_col'])
AS SELECT * FROM t
```
Reference — BigQuery REGRESSION — CREATE MODEL
My experience so far says that behavior matters more than the degree. The ability to change one's mind as new information unfolds is important. Curiosity drives one toward solutions and fuels the ability to search the web. Strong communication skills are a must. The ability to confront people with data points moves conversations in the right direction. A good Data Science candidate is often well-read. That convergence of technical know-how, coding skills, and domain knowledge helps with applicability. The software has kept up and made Data Scientists incredibly productive. These traits make it easy for an individual to use such software without breaking a sweat.