Innovation in the Data Science Workflow

Fusion Fund · Dec 12, 2019

As the global Data Science Platform Market is expected to reach $115B by 2023 (29% CAGR) according to Market Research Future, the entire data science workflow is ripe for acceleration and disruption. The workflow consists of multiple components, and as more data is generated, its highly manual processes are becoming harder to maintain. Data processing follows data sourcing and access, and aims to turn source data into a clean form for use in the later modeling stage. Data processing needs span a variety of markets, including streaming analytics, valued at $5.34B in 2018 and projected to reach $29.04B by 2024 (32.67% CAGR)[i], and AI/ML, valued at $12B in 2017 and projected to reach $57.6B by 2021.[ii]

Today, data processing is an essential component of the data science workflow, yet it still leaves major room for improvement, particularly in reformatting and cleaning data, the most tedious yet unavoidable parts of the workflow. Lacking handy tools, data scientists still clean raw data riddled with semantic errors, missing entries, and inconsistent formatting by writing one-off scripts or editing it manually in a spreadsheet. These needs for easy processing go unmet as more data is generated every second and as data scientists must think ahead to smooth the transition from modeling to deployment.
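
A minimal sketch of the kind of ad-hoc cleaning script described above, using pandas; the file names, column names, and cleaning policies (median imputation, dropping rows with missing IDs) are hypothetical illustrations rather than a prescribed recipe:

```python
import pandas as pd

# Hypothetical raw export containing missing entries and inconsistent formatting.
df = pd.read_csv("raw_customers.csv")

# Normalize inconsistently formatted strings (mixed case, stray whitespace).
df["country"] = df["country"].str.strip().str.title()

# Fill missing numeric entries with the column median instead of dropping rows.
df["age"] = df["age"].fillna(df["age"].median())

# Drop rows whose key field is missing, since they cannot be joined downstream.
df = df[df["customer_id"].notna()]

df.to_csv("clean_customers.csv", index=False)
```

Scripts like this work, but they are one-off and hard to maintain, which is exactly the pain point the tools discussed below try to address.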

How is data processing currently solved for and what are some example processing tools?

As mentioned earlier, the aim of data processing is to turn source data into a clean form. Since data usually comes with various features, the gist of data processing is to engineer those features so the data is usable for modeling. One good practice is to organize the processing as an explicitly described computation graph, which represents a mathematical function as a graph of operations connected by edges along which data flows. It is quite common for a neural network to have more than one million edges in its computation graph, so the architecture of the graph has a big impact on how well the model can perform.[iii]
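
To make the computation-graph idea concrete, here is a toy sketch in Python with hypothetical feature names; real graphs in deep learning frameworks are enormously larger, but the dependency-resolution idea is the same:

```python
import math

# Raw source columns (toy, hypothetical data).
raw = {"age": [25, -3, 40], "income": [50_000.0, 72_000.0, 61_000.0]}

# Each node maps to (function, upstream node names); raw columns are the leaves.
graph = {
    "age_clean": (lambda d: [max(a, 0) for a in d["age"]], ["age"]),
    "income_log": (lambda d: [math.log(x) for x in d["income"]], ["income"]),
    "income_age_ratio": (
        lambda d: [i / (a + 1) for a, i in zip(d["age_clean"], d["income_log"])],
        ["age_clean", "income_log"],
    ),
}

def evaluate(node, cache):
    """Resolve a node by recursively evaluating its upstream dependencies."""
    if node not in cache:
        fn, deps = graph[node]
        cache[node] = fn({dep: evaluate(dep, cache) for dep in deps})
    return cache[node]

# Seeding the cache with the raw columns makes them the leaves of the graph.
print(evaluate("income_age_ratio", dict(raw)))
```

Because every edge is explicit, such a graph can be inspected, cached, parallelized, or restructured, which is why its architecture matters so much for performance.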

There are multiple current tools for data processing, including Makefiles, Data Version Control (DVC), and workflow management systems such as Luigi and Airflow. A cookiecutter route is to use Makefiles: each step is implemented as a script that produces new data files from the original inputs, letting users compile the project cleanly. Another option, Data Version Control (DVC), adds convenience features such as easy sharing of files. DVC mirrors the process of researching and iterating toward a great model (and pipeline), making it a good fit for iterative ML work. Workflow management systems such as Luigi and Airflow can then monitor and optimize existing models, especially once a good model has been discovered with DVC.[iv] A particularly versatile approach is to store all source data in a SQL database as a set of tables and implement all the feature extraction logic in SQL views. The versatility comes from the fact that, until data scientists actually query the features, they exist only as code and can be easily tracked without storing huge derived data tables. Moreover, such a strategy makes deployment to production more straightforward. A minimal sketch of this views-as-features pattern appears below.
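
The sketch uses Python's built-in sqlite3 module as a stand-in for a real warehouse; the table, column, and view names are hypothetical:

```python
import sqlite3

# In-memory stand-in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 30.0), (1, 45.5), (2, 12.0)],
)

# The feature lives as code (a view), not as a materialized table: it can be
# tracked in version control and is only computed when actually queried.
conn.execute("""
    CREATE VIEW user_features AS
    SELECT user_id,
           COUNT(*)    AS order_count,
           AVG(amount) AS avg_order_value
    FROM orders
    GROUP BY user_id
""")

print(conn.execute("SELECT * FROM user_features").fetchall())
```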

What are some of the needs not met with current data processing tools?

As data grows from terabytes to petabytes and beyond, data processing consumes ever more resources and time. Companies need to be aware of two opposing trends: public cloud vendors like AWS are all about centralized data centers and transporting data to the cloud, while edge computing delivers faster analytics by keeping data close to the compute resources. Moreover, as companies build out separate local/edge clusters for data teams spread around the world, they need central management to enhance operational efficiency and streamline deployment. Infrastructure will inevitably span multiple sites to reduce latency and satisfy country-specific data regulations, but a good central management system can help companies navigate those diverse environments and ultimately leverage raw data better with less effort spent on processing.[v]

There are also additional steps data scientists can take during the processing stage to prepare for deployment. For instance, if data scientists express data processing in a specially designed Domain Specific Language (DSL) rather than in free-form Python, they can then translate the DSL into Java or an intermediate format like the Predictive Model Markup Language (PMML) to speed up the workflow.[vi]
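
As a hedged illustration of why a DSL helps, the sketch below expresses a pipeline as plain data rather than free-form Python; the operation names and toy interpreter are hypothetical, not a real DSL or the actual PMML format, but a spec like this is exactly what a deploy-time compiler could translate into Java or PMML:

```python
import json
import math
import statistics

# Hypothetical declarative pipeline spec: data, not arbitrary Python code.
pipeline = [
    {"op": "impute", "column": "age", "strategy": "median"},
    {"op": "log", "column": "income"},
]

def run(pipeline, rows):
    """Toy in-Python interpreter; another backend could emit Java or PMML."""
    for step in pipeline:
        col = step["column"]
        if step["op"] == "impute":
            med = statistics.median(r[col] for r in rows if r[col] is not None)
            for r in rows:
                r[col] = med if r[col] is None else r[col]
        elif step["op"] == "log":
            for r in rows:
                r[col] = math.log(r[col])
    return rows

rows = [{"age": 25, "income": 50_000.0}, {"age": None, "income": 72_000.0}]
print(run(pipeline, rows))
print(json.dumps(pipeline))  # the artifact that could ship across languages
```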

What is Fusion Fund looking for?

Though there are bottlenecks to solve throughout the data science workflow, addressing the gaps in the data processing stage is key to accelerating the entire process. At Fusion Fund, we are looking to meet companies that are disrupting the space with a technological edge and strong market traction. These companies may either obtain proprietary, industry-specific data to solve pressing problems within a vertical or build industry-agnostic products proven viable through PoC testing. As novel technology alone is not enough, leveraging a strong business model to grow market recognition will be the cornerstone of long-term success. Startups also need to build good sales channels if they target enterprise accounts, given the long sales cycles and high barriers to entry.

Fusion Fund has invested in solutions that accelerate the workflow from modeling to deployment by scaling Python AI to High Performance Computing and providing real-time analytics from Cloud to Edge. We’ve also invested in a semiconductor architecture company that develops a fully integrated package solution combining both 5G connectivity and edge computing capability. By combining hardware with software, they are uniquely positioned to provide the infrastructure that can accelerate data workflow and deliver powerful results. We are excited about the technological innovations and disruptions in the data science space, especially as it is so intertwined with pushing AI/ML breakthroughs that can create major societal impacts. If you have a technologically differentiated startup addressing problems in the data science workflow, we would love to chat!

[i] https://www.globenewswire.com/news-release/2019/06/21/1872425/0/en/Insights-Into-the-Worldwide-Streaming-Analytics-Market-2019-2024-Retail-to-Hold-a-Significant-Share.html

[ii] https://www2.deloitte.com/content/dam/Deloitte/global/Images/infographics/technologymediatelecommunications/gx-deloitte-tmt-2018-intense-machine-learning-report.pdf

[iii] https://medium.com/tebs-lab/deep-neural-networks-as-computational-graphs-867fcaa56c9

[iv] https://blog.dataversioncontrol.com/how-a-data-scientist-can-improve-his-productivity-730425ba4aa0

[v] https://www.apmdigest.com/big-data-pain-points-1

[vi] https://towardsdatascience.com/the-data-science-workflow-43859db0415
