IBM - Data Science Experience, what is it?
By: Ramarao Kothamasu (IBM) - Opinions are my own
Data scientists are tasked with turning raw data into meaningful insight. The latest Forrester Research report states “Data scientists have been plodding through the same process for 20 years” and concludes that the current process is reaching the limits of productivity and scalability.
The new generation of data scientists is looking to open source technologies for innovation. Currently, they spend a lot of time configuring their own environments and find themselves having to source critical capabilities from multiple places. In a business environment, this often creates silos and makes it hard to share work within teams.
IBM Data Science Experience (DSX) is a one-stop platform where data scientists, data engineers and others can learn about new tools and trends, create valuable insights using the best of open source tools with IBM’s value additions (Figure 1), and collaborate on projects within their teams and the broader data science community.
Cloud-based pre-configured environment:
DSX is built on top of IBM’s Bluemix cloud technology, where multiple services are pre-configured and ready to use. Through a web browser (http://datascience.ibm.com), a new user can start working right away in an analytics environment that includes popular open source tools such as Spark. This saves data scientists a lot of time and effort, which they can put toward more valuable endeavors like the actual analysis of data instead of environment setup.
The most valuable feature of DSX is that it unifies numerous open source components on one platform to help data scientists become better at what they do. Some of the challenges in using data science tools include installation, deployment and maintenance; with DSX these are already taken care of.
Open Source Tools & Technologies with IBM Value additions:
As highlighted in Figure 2, DSX currently includes Apache Spark, Jupyter Notebooks and RStudio. This initial set of tools will continue to grow but it is a powerful initial foundation for data science.
Apache Spark, a fast and general compute engine for large-scale data processing, provides over 80 high-level operators that make it easy to build parallel applications, and data scientists can use it interactively from the Scala, Python and R shells. All the machine learning libraries from Spark are included (Figure 3), as well as SparkR, a lightweight front end for using Spark from the open source R language. SparkR provides a distributed data frame implementation that supports operations like selection, filtering and aggregation on small or large datasets.
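To make the idea of data frame operations on partitioned data concrete, here is a minimal sketch in plain Python. It is not Spark or SparkR code: the dataset, the column names and the helper functions are all hypothetical, and a thread pool stands in for a cluster. It only illustrates the general pattern of filtering and aggregating each partition independently and then combining the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy dataset, already split into three partitions.
partitions = [
    [{"city": "Austin", "sales": 10}, {"city": "Boston", "sales": 5}],
    [{"city": "Austin", "sales": 7}, {"city": "Boston", "sales": 0}],
    [{"city": "Chicago", "sales": 3}],
]


def partial_totals(rows):
    """Filter one partition and aggregate sales per city locally."""
    totals = Counter()
    for row in rows:
        if row["sales"] > 0:  # keep only rows with positive sales
            totals[row["city"]] += row["sales"]  # group by city, sum sales
    return totals


def total_sales_by_city(parts):
    """Process partitions in parallel, then merge the partial aggregates."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_totals, parts))
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds counts together
    return dict(combined)


result = total_sales_by_city(partitions)
print(result)
```

A real engine such as Spark applies the same split, process and combine idea across many machines and hides it behind its data frame operators, so the user writes only the selection and aggregation.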
Jupyter Notebook is a web application that allows users to create documents that contain live code, equations, visualizations and rich explanatory text/HTML. Data scientists can use notebooks for data cleaning, transformation, numerical simulation, statistical modeling, machine learning and much more.
Data scientists can create notebooks and collaborate on them with other team members (Figure 4), building a community that works together in their preferred programming language: Scala, Python and/or R. This support for multiple languages (Figure 5) lets data scientists continue using the languages and libraries they already know.
RStudio, a popular open source integrated development environment (IDE) for statistical analysis, is part of DSX and allows data scientists to develop R scripts with greater productivity (Figure 6).
Ultimately, the ability to use each of these open source languages and a variety of techniques, together with IBM innovation, makes DSX a unique and excellent environment for data scientists.
DSX is built for scalability, serving anyone from individuals who are learning about the platform to small or large enterprises. The cloud technology on which DSX is built allows the Spark environment to scale. Enterprise features such as projects for organizing and collaborating on multiple notebooks, the ability to assign different types of permissions, and integration with a wide variety of data sources make it easy for enterprises to build their data science practice.
Work with a wide variety of data sources:
DSX supports a wide variety of data sources, listed below, from which data can either be pulled or to which it can connect. In addition to these data sources, streaming data from Kafka topics is also supported.
· Amazon Redshift
· Apache Hive
· Cloudera Impala
· IBM DB2®
· IBM Informix®
· IBM Netezza®
· IBM dashDB™
· IBM Watson™ Analytics
· Microsoft Azure
· Microsoft SQL Server
· Pivotal Greenplum
· Sybase IQ
Projects create a space for you to collect and share notebooks, connect to data sources, and add data sets all in one place. Clicking the Projects tab (Figure 7) takes you to your personal project area. The name and description that you give your project appear in the list of projects that you share with collaborators. Each project must have an associated Spark instance and storage type before you can add a notebook.
Projects enable effective collaboration because you can freely share knowledge and resources, flexibly shift workloads, and help one another complete jobs. Users can share their projects and notebooks by adding collaborators and granting them different permissions (Figure 8). Users can also enable cell masking (Figure 9) to protect sensitive data like user credentials.
The following permissions/roles are supported:
· Viewer: View the project.
· Editor: Control project assets.
· Admin: Control project assets and collaborators.
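The three roles above can be modeled as a simple lookup from role to allowed actions. The sketch below is illustrative only and is not DSX's API: the action names and the `can` helper are hypothetical, and it assumes that Editors and Admins can also view the project, which the list implies but does not state.

```python
from enum import Enum


class Role(Enum):
    VIEWER = "viewer"
    EDITOR = "editor"
    ADMIN = "admin"


# Hypothetical action names; the real permission model may be finer-grained.
# Assumption: each role also has the rights of the role listed before it.
PERMISSIONS = {
    Role.VIEWER: {"view_project"},
    Role.EDITOR: {"view_project", "control_assets"},
    Role.ADMIN: {"view_project", "control_assets", "control_collaborators"},
}


def can(role: Role, action: str) -> bool:
    """Return True if the given role is allowed to perform the action."""
    return action in PERMISSIONS[role]


print(can(Role.EDITOR, "control_assets"))         # Editors manage assets
print(can(Role.EDITOR, "control_collaborators"))  # only Admins manage collaborators
```

Keeping the mapping in a single table makes it easy to see at a glance what separates an Editor from an Admin: only the right to manage collaborators.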
Built-in Learning
Use the built-in learning to get started or go the distance. Join a vibrant community of data scientists across industries, functions, and organization types (Figure 10). Take advantage of shared data sets, notebooks, and tutorials. Share your work with your team and your peers. Start a course, start from a sample, or start from scratch.
DSX helps data scientists collaborate with peers on projects to find better solutions together. They can share their knowledge and their code and help accelerate the advancement of data science for others - or get input from peers on their own work.
Data scientists can fork their notebooks and share them with the entire community (Figure 11) to demonstrate successful approaches to others or get feedback on their work. In addition, DSX includes shared data sets, multiple tutorials and how-to articles, so that new data scientists have what they need to get started and experienced data scientists can try out new analytics platforms such as DSX.
IBM is working with a community of data scientists to shape the future of DSX and intends to evolve this platform based on the community’s feedback.
DSX is one of the first integrated development environments that is optimized and built around Spark and other open source analytical tools to provide strong collaboration capabilities. This platform enables data scientists to build real-time, high-performance analytics or machine learning apps with Spark using the skill set of their choice. This type of open and collaborative platform is new in the data science world, and it is well positioned to increase the adoption of data science and machine learning in every organization.
Note: At present IBM Data Science Experience is in Beta and users can register for a 30-day free account at http://datascience.ibm.com.