From Data Management Systems to Data Science Environments: State of the Art

DBMS architectures have evolved toward service-based infrastructures in which services are adapted and coordinated to implement ad hoc data management functions (storage, fragmentation, replication, analysis, decision making, data mining). These functions are adapted and tuned for managing huge, distributed, multiform multimedia data collections. Applications can extend the functionality of a DBMS through specific tasks that the data management system has to provide. These tasks are called services, and they allow interoperability between DBMS and other applications [11].

A service-based DBMS externalizes the functions of the different system layers and enables personalized data management systems to be programmed as services. To enable broad adoption and integration of data operations, one simply needs to create a web-hosted HTTP endpoint or “service”. Putting compute behind a service allows different system components to scale independently, which minimizes bottlenecks. If services reside on the same machine, one can use local networking capabilities to bypass internet data transfer costs and come closer to the latency of normal function dispatch. This pattern is referred to as a “micro-service” architecture, and it powers many of today’s large-scale applications [35]. Thus, a service-based DBMS makes it possible to couple the characteristics of the data model with well-adapted management functions that can themselves be programmed in an ad hoc manner.
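As a minimal sketch of what putting a data operation behind an HTTP endpoint can look like, the following uses only Python's standard library. The operation `count_records`, the JSON payload shape, and the helper `make_server` are hypothetical names chosen for illustration; they do not come from any of the systems cited here.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def count_records(payload: dict) -> dict:
    """Hypothetical data operation: count records and sum their 'value' field."""
    records = payload.get("records", [])
    total = sum(r.get("value", 0) for r in records)
    return {"count": len(records), "total": total}


class DataServiceHandler(BaseHTTPRequestHandler):
    """Exposes count_records as a POST endpoint that accepts and returns JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(count_records(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host: str = "127.0.0.1", port: int = 0) -> HTTPServer:
    """Bind the service; port 0 lets the operating system pick a free port."""
    return HTTPServer((host, port), DataServiceHandler)
```

Calling `make_server().serve_forever()` would start the service. Binding to `127.0.0.1` corresponds to the same-machine case described above, where calls stay on the local network stack.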

A service-based DBMS remains a general-purpose system that can be personalized, thanks to service composition, to provide ad hoc data management. Services can then be deployed in architectures that make them available to applications in a simple way (e.g., cluster, cloud). Requirements concerning data management performance versus volume, and the effort of constructing the data collections themselves, have driven the evolution of DBMS toward efficiency. The three-level architecture, which encouraged program-data independence through a series of transformations among layers, seems ill-suited to meeting performance requirements, so architectures are narrowing the gaps between levels. The principle is that the fewer transformations the data require, the more efficient the data management functions are, particularly querying, accessing, and processing. Independence between programs and data management appears to be an expensive quality whose cost is not worth paying in certain situations.

Many companies, such as Microsoft, Amazon, IBM, and Google, have embraced model deployment with web services to provide pre-built intelligent algorithms for a wide range of applications [17, 27, 15]. This standardization enables easy use of cloud intelligence and abstracts away implementation details, environment setup, and compute requirements. Furthermore, intelligent services allow application developers to quickly use existing state-of-the-art models to prototype ideas. Tools for deploying these technologies as distributed real-time web services are emerging[1].

Parallel data processing environments

Google’s technical response to the challenges of Web-scale data management and analysis was the Google File System (GFS) [12]. To handle the challenge of processing the data in such large files, Google pioneered its Map-Reduce programming model and platform [9]. This model enabled Google’s developers to process large collections of data by writing two user-defined functions, map and reduce. The Map-Reduce framework applies map to the individual instances and reduce to the sorted groups of instances that share a common key, similar to the sort of partitioned parallelism used in shared-nothing parallel query processing [3].
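The division of labour between the two user-defined functions and the framework can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not of Google's or Hadoop's implementation; the names `map_fn`, `reduce_fn`, and `run_map_reduce` are chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_fn(line):
    """User-defined map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(key, values):
    """User-defined reduce: combine all values that share the same key."""
    return key, sum(values)


def run_map_reduce(records, map_fn, reduce_fn):
    """Framework part: apply map, group by key (the shuffle), then apply reduce."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(r) for r in records):
        groups[key].append(value)
    # Keys are processed in sorted order, mirroring the sorted groups in the text.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


lines = ["big data big systems", "data science"]
word_counts = run_map_reduce(lines, map_fn, reduce_fn)
print(word_counts)
```

In a real deployment the map calls and the per-key reduce calls would run in parallel on different machines. Here they run sequentially, so only the structure of the computation is shown.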

Yahoo!, Facebook, and other large Web companies followed. Taking Google’s GFS and Map-Reduce papers as rough technical specifications, open-source equivalents were developed, and the Apache Hadoop Map-Reduce platform and its underlying file system, HDFS, emerged[2]. Microsoft technologies include a parallel runtime system called Dryad [16] and two higher-level programming models, DryadLINQ [39] and the SQL-like SCOPE language [5]. The Hadoop community developed a set of higher-level declarative languages for writing queries and data analysis pipelines that are compiled into Map-Reduce jobs and then executed on the Hadoop Map-Reduce platform. Popular languages include Pig from Yahoo! [28], Jaql from IBM[3], and Hive from Facebook[4]. Pig is relational-algebra-like in nature and is reportedly used for over 60% of Yahoo!’s Map-Reduce use cases; Hive is SQL-inspired and reportedly used for over 90% of Facebook’s Map-Reduce use cases [3].

Once the Map-Reduce model was consolidated, several works studied it to identify its limitations and its pertinence for implementing data processing algorithms such as relational operators (e.g., join). Other dataflow-oriented platforms, such as Spark [40], propose alternatives for data processing that requires computing resources, while Storm and Flink deal with streams (i.e., Big Data velocity). Spark supports a broad range of workloads and applications, such as fault-tolerant, distributed map, reduce, filter, and aggregation-style programs. Spark improves on its predecessors MapReduce and Hadoop by reducing disk I/O through in-memory computing and by whole-program optimization [9, 34]. Spark clusters can adaptively resize to compute a workload efficiently (elasticity) and can run on resource managers such as YARN, Mesos, and Kubernetes [30,32], or on manually created clusters. In recent years, Spark has expanded its scope to support SQL, streaming, machine learning, and graph-style computations [2,24,37,38].
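To make the question of expressing relational operators in Map-Reduce concrete, here is a sketch of one common strategy, usually called a reduce-side (repartition) join. Rows from both inputs are tagged with their origin in the map phase, grouped by join key, and paired in the reduce phase. The table contents and function names are invented for this example, and everything runs in one process.

```python
from collections import defaultdict


def map_tagged(table_name, rows, key_field):
    """Map phase: emit (join_key, (table_name, row)) so rows can be told apart later."""
    return [(row[key_field], (table_name, row)) for row in rows]


def reduce_join(key, tagged_rows, left_name, right_name):
    """Reduce phase: pair every left row with every right row sharing the key."""
    left = [row for name, row in tagged_rows if name == left_name]
    right = [row for name, row in tagged_rows if name == right_name]
    return [{**l, **r} for l in left for r in right]


def reduce_side_join(left_rows, right_rows, key_field):
    """Inner equi-join of two lists of dict rows on key_field."""
    groups = defaultdict(list)
    pairs = map_tagged("L", left_rows, key_field) + map_tagged("R", right_rows, key_field)
    for key, tagged in pairs:  # shuffle: group intermediate pairs by join key
        groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key], "L", "R"))
    return joined


users = [{"uid": 1, "name": "ana"}, {"uid": 2, "name": "bo"}]
orders = [{"uid": 1, "item": "disk"}, {"uid": 1, "item": "cpu"}, {"uid": 3, "item": "ram"}]
joined = reduce_side_join(users, orders, "uid")
print(joined)
```

The sketch hints at one limitation such studies examine: every row of both inputs crosses the shuffle, even rows that never find a join partner (here `uid` 2 and `uid` 3).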

Big Data Analytics Stacks

New data analytics stacks have emerged as environments that provide the underlying infrastructure needed for giving access to data, implementing data processing workflows that transform them, and executing data analytics operations (statistics, data mining, knowledge discovery, computational science processes) on top of them.

One of the most prominent is the Berkeley Data Analytics Stack (BDAS) from the AMPLab project at Berkeley. BDAS is a multi-layered architecture that provides tools for virtualizing resources and for addressing storage, data processing, and querying as underlying tools for Big Data-aware applications. Another important Big Data stack is AsterixDB[5] from the Asterix project. AsterixDB is a scalable, open-source Big Data Management System (BDMS).

Data lake environments also deal with Big Data management and analytics through integrated environments designed as toolkits. A data lake is a shared data environment consisting of multiple repositories. It provides data to an organization for a variety of analytics processing, including discovery and exploration of data, simple ad hoc analytics, complex analysis for business decisions, reporting, and real-time analytics. Industrial solutions are on the market today, from vendors such as Microsoft (Azure Data Lake), IBM, and Teradata.

Machine Learning Environments

Data science pipelines that use different machine learning methods are far from automated. Since no algorithm can achieve good performance on all possible learning problems with equal importance, every aspect of a data science pipeline, such as feature engineering, model selection, and algorithm selection, needs to be carefully configured. This usually involves human experts heavily. Taking humans out of these pipelines can enable fast deployment of solutions across organizations, as well as quick validation and benchmarking of the performance of deployed solutions. It also lets people focus on problems that depend on the application and the business. Data science pipelines that include machine learning tasks can thereby become much more available for real-world use, leading to new levels of competence and customization whose impact can be dramatic.

In recent years, automated machine learning (AutoML) has emerged as a new sub-area of machine learning. Specifically, AutoML attempts to reduce human assistance in the design, selection, and implementation of the various machine learning tools used in an application’s pipeline. It has received increasing attention not only in machine learning but also in computer vision, data mining, and natural language processing. AutoML has already been applied successfully to many important problems: automatic model selection, with Auto-sklearn and Auto-WEKA [10,21]; neural architecture search, with Google’s Cloud [42,22]; and automatic feature engineering, with FeatureLab [19,18], Data Science Machine [18], ExploreKit [19], and FeatureHub [36].
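The core loop behind automatic model selection can be illustrated with a toy example: enumerate candidate configurations, score each on held-out data, and keep the best. The sketch below does this for the neighbourhood size of a one-dimensional k-nearest-neighbour regressor. It is an assumption-laden illustration of the idea only. The tools cited above use far more elaborate search strategies, and all names and data here are invented.

```python
def knn_predict(train, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def validation_error(train, valid, k):
    """Mean squared error of a k-NN model on held-out data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in valid) / len(valid)


def select_model(train, valid, candidate_ks):
    """Automated selection: score every candidate and keep the best one."""
    scores = {k: validation_error(train, valid, k) for k in candidate_ks}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Toy data: y = 2x, split into training and validation points.
data = [(float(x), 2.0 * x) for x in range(20)]
train, valid = data[::2], data[1::2]
best_k, scores = select_model(train, valid, candidate_ks=[1, 2, 3, 5])
print(best_k, scores)
```

Exhaustive enumeration is the simplest possible search. It stops being practical once the configuration space also covers feature engineering and algorithm choice, which is where dedicated AutoML methods come in.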

Many machine learning workflows rely on deploying learned models as web endpoints for use in front-end applications. However, existing deployment frameworks all compromise on either the breadth of models they export or the latency of their deployed services. Microsoft’s machine learning environment builds upon the SparkML API, which is similar to that of the popular Python machine learning library scikit-learn [4]. As in scikit-learn, all SparkML models share the same API, which makes it easy to create, substitute, and compose machine learning algorithms into “pipelines”. In addition, Spark clusters can use a wide variety of hardware SKUs[6], making it possible to leverage modern advances in GPU-accelerated frameworks like TensorFlow, CNTK, and PyTorch [1,33,29]. These properties make the SparkML API a natural and principled choice for unifying the APIs of other machine learning frameworks.
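Why a shared API makes algorithms easy to substitute and compose can be seen in a toy version of the pattern. The classes below are written from scratch for illustration and are not the scikit-learn or SparkML classes. They only mimic the convention that every step offers `fit`, and either `transform` or `predict`.

```python
class Standardize:
    """Transformer: rescale values to zero mean and unit variance."""

    def fit(self, xs, ys=None):
        self.mean = sum(xs) / len(xs)
        var = sum((x - self.mean) ** 2 for x in xs) / len(xs)
        self.std = var ** 0.5 or 1.0  # fall back to 1.0 for constant input
        return self

    def transform(self, xs):
        return [(x - self.mean) / self.std for x in xs]


class LeastSquaresLine:
    """Estimator: fit y = a * x + b by ordinary least squares."""

    def fit(self, xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        self.a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        self.b = my - self.a * mx
        return self

    def predict(self, xs):
        return [self.a * x + self.b for x in xs]


class Pipeline:
    """Chains transformers and a final estimator behind one fit/predict interface."""

    def __init__(self, steps):
        self.transformers, self.estimator = steps[:-1], steps[-1]

    def fit(self, xs, ys):
        for step in self.transformers:
            xs = step.fit(xs, ys).transform(xs)
        self.estimator.fit(xs, ys)
        return self

    def predict(self, xs):
        for step in self.transformers:
            xs = step.transform(xs)
        return self.estimator.predict(xs)


model = Pipeline([Standardize(), LeastSquaresLine()]).fit([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
prediction = model.predict([5.0])[0]
print(prediction)
```

Because `Pipeline` relies only on the method names, any other object with the same methods could replace `LeastSquaresLine` without changing the surrounding code.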

Three main types of data science pipelines supported by machine learning processes can be identified, each with specific underlying execution requirements. First, pipelines relying on deep learning create symbolic computation graphs that are automatically differentiated and compiled to machine code. Tools like Cognitive Toolkit (CNTK) [33,13,14], TensorFlow, PyTorch, and MXNet liberate developers and data scientists from the difficult task of deriving training algorithms and writing GPU-accelerated code. With CNTK on Spark, users can embed any deep network into parallel maps, SQL queries, and streaming pipelines. Environments also contribute and host a large cloud repository of trained models and tools to perform image classification with transfer learning [13,14]. Databricks’ “Deep Learning Pipelines” provide integration of Spark and TensorFlow [8]. Microsoft’s Machine Learning environment shares the same API, making it easy to use CNTK and/or TensorFlow inside SparkML pipelines without code changes.
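The idea of a computation graph that differentiates itself can be shown at very small scale. The sketch below records, for each intermediate value, which inputs produced it and with what local derivative, then applies the chain rule backwards. It supports only addition and multiplication of scalars and does no compilation, so it illustrates the principle rather than how CNTK, TensorFlow, or PyTorch are built. `Node` and `backward` are names chosen here.

```python
class Node:
    """A value in a computation graph that remembers how it was produced."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value, ((self, other.value), (other, self.value)))


def backward(output):
    """Reverse-mode differentiation: propagate gradients from output to inputs."""
    order, seen = [], set()

    def visit(node):  # depth-first topological sort of the graph
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad  # chain rule


x, y = Node(2.0), Node(3.0)
z = x * y + x  # z = x*y + x, so dz/dx = y + 1 and dz/dy = x
backward(z)
print(z.value, x.grad, y.grad)
```

The user writes only the forward expression `x * y + x`, and the derivatives follow from the recorded graph. This is the sense in which such tools spare developers from deriving training algorithms by hand.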

The second type of pipeline is based on gradient boosting and decision trees. To learn tree- and forest-based models efficiently, many turn to GPU-enabled gradient boosting libraries such as XGBoost or LightGBM [6,20]. Microsoft’s Machine Learning environment integrates LightGBM into Spark to enable large-scale, optimized gradient boosting within SparkML pipelines. LightGBM is one of the most performant decision tree frameworks and can use socket or Message Passing Interface (MPI) communication schemes, which communicate much more efficiently than SparkML’s Gradient Boosted Tree implementation. This integration allows users to create performant models for classification, quantile regression, and other applications that excel in discrete feature domains.
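For readers unfamiliar with gradient boosting, the principle fits in a short sketch. Start from a constant prediction, then repeatedly fit a very small tree (here a one-split "stump") to the current residuals and add a damped version of it to the model. This toy version handles one numeric feature and squared error only, and it assumes at least two distinct feature values. It is not the algorithm used by XGBoost or LightGBM, which add many optimizations.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on x that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosting(xs, ys, rounds=50, learning_rate=0.3):
    """Each round fits a stump to the residuals, the negative gradient of squared loss."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the damped contributions of all stumps."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(l if x <= t else r for t, l, r in stumps)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = fit_boosting(xs, ys)
print(predict(model, 2.0), predict(model, 5.0))
```

The split search in `fit_stump` scans every candidate threshold over all data points. In a distributed setting that search has to be coordinated across workers, which is why the communication scheme mentioned above matters.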

Finally, the third type of pipeline concerns model interpretability. By integrating frameworks into Spark through transfers of control, machine learning environments have also expanded SparkML’s native library of algorithms [41]. One example is LIME, a distributed implementation of Local Interpretable Model Agnostic Explanations [31]. LIME provides a way to “interpret” the predictions of any model without reference to that model’s functional form. Azure Search is a cloud database that supports rapid information retrieval and query execution on heterogeneous, unstructured data [26]. Azure Search leverages elastic search to index documents and provides REST APIs for document search based on linguistic similarity and a variety of other filters and logical constraints.
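The phrase "without reference to that model's functional form" can be made concrete with a simplified local-surrogate sketch. The model is treated as a callable black box, queried at randomly perturbed points around the instance of interest, and a proximity-weighted linear model is fitted to the answers. The sketch omits parts of the published LIME method and is not the distributed implementation mentioned above. The function names, the Gaussian kernel, and the sample count are assumptions made for illustration.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def explain_locally(black_box, point, n_samples=200, scale=0.5, seed=0):
    """Fit a proximity-weighted linear surrogate around `point`; return its slopes."""
    rng = random.Random(seed)
    dim = len(point)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        offsets = [rng.gauss(0.0, scale) for _ in range(dim)]
        sample = [p + d for p, d in zip(point, offsets)]
        rows.append([1.0] + offsets)  # intercept column plus local offsets
        targets.append(black_box(sample))  # the model is only queried, never inspected
        distance_sq = sum(d * d for d in offsets)
        weights.append(math.exp(-distance_sq / (2 * scale ** 2)))
    # Weighted least squares via the normal equations (X^T W X) beta = X^T W y.
    size = dim + 1
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(size)]
            for i in range(size)]
    xtwy = [sum(w * r[i] * t for r, t, w in zip(rows, targets, weights)) for i in range(size)]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[1:]  # one local slope per feature


def opaque_model(features):
    """Stand-in for any trained model; the explainer only calls it."""
    return 3.0 * features[0] - 2.0 * features[1] + 1.0


slopes = explain_locally(opaque_model, point=[1.0, 2.0])
print(slopes)
```

Each slope estimates how strongly the prediction reacts to one feature near the chosen point. For a nonlinear model the slopes would change from point to point, which is what makes the explanation local.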

Discussion

Data science pipelines that combine different machine learning and deep learning methods are the new query types, and they have specific needs regarding the way data must be structured and managed. The “one-size-fits-all” approach to data structures and their associated management functions is no longer suited to data science queries. Indeed, every query has a specific objective (modelling, prediction), and its design depends entirely on the input dataset. A data science query is not based on clear prior knowledge of the data: it includes tasks devoted to understanding the data mathematically, and the partial results of those tasks then determine the design of other tasks devoted to computing a model that represents some hidden knowledge. Given statistical and machine learning methods and a target objective, data scientists rely on libraries whose methods they combine to define a data science pipeline. The results obtained by this pipeline are never definitive; they are always close to the target only to some degree.

Data science and machine learning environments provide all the necessary methods, and they are supported by enactment stacks that deal with the storage, fragmentation, indexing, and distribution of the data required and produced by the tasks composing a pipeline. Yet data scientists still have to decide how to combine these high- and low-level tools to “compose” their pipeline and ensure that it will run at scale when processing datasets of different sizes. Deciding which ML methods are best is complex, so automatic ML tools [7,41] have been proposed to address this challenge. Data science pipelines are often programmed as ad hoc solutions, which hinders the reuse of tasks, or at least of the strategies implemented for addressing target problems with specific methods. Moreover, decision-making tools that help choose the best-suited data management strategies for every step of a data science pipeline are still to come.

[1] Code and documentation for MMLSpark can be found through https://aka.ms/spark

[2] http://hadoop.apache.org

[3] http://code.google.com/p/jaql/

[4] http://hive.apache.org

[5] https://asterixdb.apache.org

[6] S(tock-)K(eeping) U(nit), a code that consists of letters, numbers, symbols or any combination thereof that uniquely identifies a product or service, https://www.techopedia.com/definition/1606/stock-keeping-unit-sku.

Bibliography

[1] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

[2] Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A., and Zaharia, M. Spark sql: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pp. 1383–1394, New York, NY, USA, 2015. ACM. ISBN 978–1–4503–2758–9. doi: 10.1145/2723372.2742797. URL http://doi.acm.org/10.1145/2723372.2742797.

[3] Borkar V, Carey MJ, Li C (2012) Inside big data management: ogres, onions, or parfaits? In: Proceedings of the 15th international conference on extending database technology. ACM, pp 3–14

[4] Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., and Varoquaux, G. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122, 2013.

[5] Chaiken R, Jenkins B, Larson PÅ, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endow 1(2):1265–1276

[6] Chen, T. and Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 785–794. ACM, 2016.

[7] Combust, I. MLeap. http://mleap-docs.combust.ml/.

[8] Deep learning pipelines for apache https://github.com/databricks/spark-deep-learning. Accessed: 2019–01–20.

[9] Dean, J. and Ghemawat, S. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[10] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, “Efficient and robust automated machine learning,” in Advances in Neural Information Processing Systems, 2015, pp. 2962–2970.

[11] Geppert A, Scherrer S, Dittrich KR (1997) Construction of database management systems based on reuse. University of Zurich, KIDS

[12] Ghemawat S, Gobioff H, Leung ST (2003) The Google file system. In: ACM SIGOPS operating systems review, vol 37. ACM, pp 29–43

[13] Hamilton, M., Raghunathan, S., Annavajhala, A., Kirsanov, D., Leon, E., Barzilay, E., Matiach, I., Davison, J., Busch, M., Oprescu, M., Sur, R., Astala, R., Wen, T., and Park, C. Flexible and scalable deep learning with MMLSpark. In Hardgrove, C., Dorard, L., and Thompson, K. (eds.), Proceedings of The 4th International Conference on Predictive Applications and APIs, volume 82 of Proceedings of Machine Learning Research, pp. 11–22, Microsoft NERD, Boston, USA, 24–25 Oct 2018. PMLR. URL http://proceedings.mlr.press/v82/hamilton18a.html.

[14] Hamilton, Mark, et al. “MMLSpark: Unifying Machine Learning Ecosystems at Massive Scales.” Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019.

[15] High, R. The era of cognitive systems: An inside look at IBM Watson and how it works. IBM Corporation, Redbooks, 2012.

[16] Isard M, Budiu M, Yuan Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

[17] Jackson, K. R., Ramakrishnan, L., Muriki, K., Canon, S., Cholia, S., Shalf, J., Wasserman, H. J., and Wright, N. J. Performance analysis of high performance computing applications on the amazon web services cloud. In 2nd IEEE international conference on cloud computing technology and science, pp. 159–168. IEEE, 2010.

[18] J. M. Kanter and K. Veeramachaneni, “Deep feature synthesis: Towards automating data science endeavors,” in IEEE International Conference on Data Science and Advanced Analytics, 2015, pp. 1–10.

[19] G. Katz, E. C. R. Shin, and D. Song, “Explorekit: Automatic feature generation and selection,” in International Conference on Data Mining, 2016, pp. 979–984.

[20] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30, December 2017.

[21] L. Kotthoff, C. Thornton, H. Hoos, F. Hutter, and K. Leyton-Brown, “Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA,” Journal of Machine Learning Research, vol. 18, no. 1, pp. 826–830, 2017.

[22] C. Liu, B. Zoph, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search,” in European Conference on Computer Vision, 2018.

[24] Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., et al. Mllib: Machine learning in apache spark. The Journal of Machine Learning Research, 17(1):1235–1241, 2016.

[26] Microsoft. Azure search. https://azure.microsoft.com/en-us/services/search/. Accessed: 2019–01–20.

[27] Microsoft. Cognitive services. https://azure.microsoft.com/en-us/services/cognitive-services/. Accessed: 2019–01–20.

[28] Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, pp 1099–1110

[29] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

[30] Rensin, D. K. Kubernetes - Scheduling the Future at Cloud Scale. 1005 Gravenstein Highway North Sebastopol, CA 95472, 2015. http://www.oreilly.com/webops-perf/free/kubernetes.csp.

[31] Ribeiro, M. T., Singh, S., and Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 1135–1144, New York, NY, USA, 2016. ACM. ISBN 978–1–4503–4232–2. doi: 10.1145/2939672.2939778. URL http://doi.acm.org/10.1145/2939672.2939778.

[32] Sayfan, G. Mastering Kubernetes. Packt Publishing Ltd, 2017.

[33] Seide, F. and Agarwal, A. Cntk: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2135–2135. ACM, 2016.

[34] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium on, pp. 1–10. IEEE, 2010.

[35] Sill, A. The design and architecture of microservices. IEEE Cloud Computing, 3(5):76–80, 2016.

[36] M. J. Smith, R. Wedge, and K. Veeramachaneni, “FeatureHub: Towards collaborative data science,” in IEEE International Conference on Data Science and Advanced Analytics, 2017, pp. 590–600.

[37] Xin, R. Project hydrogen: Unifying state-of-the-art ai and big data in apache spark. https://databricks.com/session/databricks-keynote-2. Accessed: 2019–01–20.

[38] Xin, R. S., Gonzalez, J. E., Franklin, M. J., and Stoica, I. Graphx: A resilient distributed graph system on spark. In First International Workshop on Graph Data Management Experiences and Systems, pp. 2. ACM, 2013.

[39] Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson Ú, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

[40] Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M. J., Ghodsi, A., Gonzalez, J., Shenker, S., and Stoica, I. Apache spark: A unified engine for big data processing. Commun. ACM, 59(11):56–65, October 2016. ISSN 0001–0782. doi: 10.1145/2934664. URL http://doi.acm.org/10.1145/2934664.

[41] Zaharia, M., Chen, A., Davidson, A., Ghodsi, A., Hong, S. A., Konwinski, A., Murching, S., Nykodym, T., Ogilvie, P., Parkhe, M., et al. Accelerating the machine learning lifecycle with mlflow. Data Engineering, pp. 39, 2018.

[42] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” in International Conference on Learning Representations, 2017.
