Making Big Data Science “Easy” - Hadoop & Spark & IBM

Open Source Tech Talks
6 min read · Nov 18, 2016


By: Michael Harkins (IBM) - Opinions are my own.

Hadoop and Spark allow data scientists to analyze Big Data by applying processing frameworks to ever-expanding sets of data. The task of the data scientist is manifold.

Working with Line of Business (LoB) teams, including business analysts, the data scientist(s) formulate an approach to the “business problem.” This could mean taking large quantities of data and classifying it for a specific use (for example, sorting articles into categories such as sports, business, science, and politics for an online news feed), analyzing historical data to recommend products or services, or predicting outcomes or “next best actions.”

http://www.ibmbigdatahub.com/blog/why-we-need-methodology-data-science

http://www.slideshare.net/JohnBRollinsPhD/foundational-methodology-for-data-science

https://bigdatauniversity.com/courses/data-science-methodology-2/

This blog will discuss how Hadoop and Spark are used in two phases of the Data Science Methodology, as outlined by IBM: gathering data and analyzing data.

Beginning in the late 1990s, grid computing started to be used as an approach to process large amounts of data in parallel. Apache Hadoop’s MapReduce and HDFS components were derived, respectively, from two seminal papers: Google’s MapReduce and Google File System (GFS). Hadoop, comprising MapReduce and HDFS, was the first “killer app” for the grid and was created in 2005. These two frameworks, working together, provided the storage and compute layers for Big Data processing. The Hadoop ecosystem grew over the next decade to encompass additional frameworks for capturing and organizing data, performing data discovery, and analyzing/modeling data.

https://opensource.com/life/14/8/intro-apache-hadoop-big-data
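
To make the split between the compute layer (MapReduce) and the storage layer (HDFS) concrete, here is a minimal word-count sketch written against Hadoop’s Java MapReduce API. It follows the standard driver/mapper/reducer pattern; the input and output locations are whatever HDFS paths are passed on the command line.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in the input split.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce phase: sum the counts emitted for each word.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // HDFS input and output paths, supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }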

Gather the Data

Data Requirements

Identifying the location of data needed to solve the “business problem” is a joint effort of data scientists, LoB personnel, business analysts, and data engineers. Once the various forms of data are identified, movement into a central data store (HDFS) is planned.

Data Collection

Data is gathered and placed in HDFS, the Hadoop Distributed File System. The data may reside in RDBMS systems, already highly structured (aligned and labeled in rows and columns with specific data types), or it may be unstructured, in log files or documents. Moving this data into HDFS is done using “ingest” tools and frameworks, such as Sqoop, Flume, and the HDFS copy utilities.
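
As a small illustration of the last option, the HDFS Java client can copy local files directly into the cluster. In the sketch below, the NameNode address and the file paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCopy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
          // Copy a local log file into a landing directory in HDFS.
          fs.copyFromLocalFile(
              new Path("/var/log/app/app.log"),      // local source
              new Path("/data/raw/logs/app.log"));   // HDFS destination
        }
      }
    }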

Data Understanding

Once the data is in HDFS, the MapReduce framework can be used to manipulate it to “discover” whether the data provided can be used to answer the business problem, and to inspect the quality of the data: Is anything missing? Are values consistent? Is the data “clean”? Pig and Hive can be used in this step; more on this below.

Data Preparation

The data is then formatted to be of use to modeling techniques in subsequent steps.

Analyzing Data

Modeling Data

Machine learning and graph processing techniques can be applied to the data using Mahout and Giraph, respectively; both are implemented on top of Hadoop’s MapReduce framework. Mahout is a set of libraries for recommendation, classification, and clustering (predicting what product to buy, sorting, and grouping). Giraph is used to process graphs. The obvious use case is social networking: I’m friends with you, you’re friends with Mary, so maybe I should be friends with Mary. But a more interesting set of graphs can be built from a business process, or from the flow through a web page, to determine the fastest path to quicker or better outcomes.

Evaluation

Evaluation determines whether the results solve the business problem. The output of the modeling needs to demonstrate that it consistently resolves the problem. Tables and graphs, typically produced with Hive, are used to indicate success.

Let’s take a step back and review.

Big Data Science with Hadoop

Collecting Data

Gathering data from databases and log files can be accomplished with Sqoop and Flume.

SQOOP

Sqoop (SQL to Hadoop) is used to import data from existing RDBMS systems and to export results back to them. Sqoop uses MapReduce to move data.

Tables and whole databases (sets of tables) can be moved into Hadoop’s file system (HDFS): not just the current data, but the entire history of data that may have been archived, allowing for analysis over long periods of time. Multiple databases can be collocated in HDFS, providing a 360-degree view of all data from all RDBMS systems.

This can span an entire LoB, or the entire enterprise.
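
As a sketch of what such an import looks like, the following launches a Sqoop table import from Java (Sqoop is normally run directly from the command line). The JDBC URL, credentials, table name, and target directory are placeholders.

    import java.util.Arrays;
    import java.util.List;

    public class SqoopImportExample {
      public static void main(String[] args) throws Exception {
        // Equivalent to running `sqoop import ...` at the command line.
        // The connection string, credentials, and paths are placeholders.
        List<String> command = Arrays.asList(
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost:3306/sales",
            "--username", "etl_user",
            "--password-file", "/user/etl/.db_password",
            "--table", "orders",
            "--target-dir", "/data/raw/sales/orders",
            "--num-mappers", "4");

        Process process = new ProcessBuilder(command)
            .inheritIO()   // stream Sqoop's MapReduce progress to the console
            .start();
        System.exit(process.waitFor());
      }
    }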

FLUME

Flume is used to capture system and application logs, in addition to other sources.

Logs from operating systems, network devices, security systems, applications, and other sources can be collected into HDFS. Imagine the logs of an application spanning dozens or even hundreds of servers, collected into a single data store. A single LoB may consist of dozens of applications. Having all that data in a single repository can provide a unique view of the data.

Understanding Data

HIVE

Hive was developed to act as a data warehouse on top of Hadoop. Writing MapReduce jobs in Java to manipulate tabular data was a laborious effort. Business analysts, already familiar with SQL, were the target audience for Hive, which uses a subset of standard SQL to perform most core SQL functions (queries, joins, etc.) against data in HDFS.
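
Because Hive exposes a SQL interface, it can also be queried programmatically over JDBC through HiveServer2. The following is a minimal sketch: the server address and the page_views table are assumptions for illustration, and the Hive JDBC driver must be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
      public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint -- host, port, and database are placeholders.
        String url = "jdbc:hive2://hiveserver:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement()) {

          // HiveQL: count page views per category from a table defined over HDFS data.
          ResultSet rs = stmt.executeQuery(
              "SELECT category, COUNT(*) AS views "
            + "FROM page_views "
            + "GROUP BY category "
            + "ORDER BY views DESC");

          while (rs.next()) {
            System.out.println(rs.getString("category") + "\t" + rs.getLong("views"));
          }
        }
      }
    }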

Preparing Data

PIG

Pig uses a scripting language to parse data and reformat it for use by Hive. ETL workers familiar with scripting languages such as Perl and Python can easily pick up Pig Latin (yes, that is really its name) and transform logs into tables. Pig can be used to improve data quality and to transform and enrich data.
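
Pig Latin scripts are usually run with the pig command, but Pig can also be embedded in Java through PigServer. The sketch below assumes a tab-separated web log in HDFS with made-up field names; it filters out failed requests and writes a cleaned copy that Hive could then expose as a table.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigPrepareExample {
      public static void main(String[] args) throws Exception {
        // Run Pig Latin statements on the cluster (embedded Pig).
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Load tab-separated web logs from HDFS; the schema is illustrative.
        pig.registerQuery(
            "logs = LOAD '/data/raw/logs' USING PigStorage('\\t') "
          + "AS (ts:chararray, uid:chararray, url:chararray, status:int);");

        // Keep only successful requests and project the fields needed downstream.
        pig.registerQuery("ok = FILTER logs BY status == 200;");
        pig.registerQuery("cleaned = FOREACH ok GENERATE ts, uid, url;");

        // Write the cleaned data back to HDFS, ready to be exposed as a Hive table.
        pig.store("cleaned", "/data/prepared/page_views");
      }
    }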

Using Hive and Pig, data scientists can get a good understanding of the data, and prepare it for modeling.

Modeling Data

Mahout

Mahout is a library containing algorithms that can be incorporated into MapReduce jobs to perform machine learning on data stored in HDFS. Data scientists can create recommendation engines, generate predictions, and classify and group data.
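
For example, Mahout’s Taste API can build a user-based recommender from a simple userID,itemID,preference file. The sketch below is illustrative: the file path, neighborhood size, and user ID are made up, and these particular classes run locally (Mahout also provides MapReduce drivers for running its algorithms at scale on HDFS).

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderExample {
      public static void main(String[] args) throws Exception {
        // CSV of userID,itemID,preference -- the path is a placeholder.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // User-based collaborative filtering: find similar users, then their items.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 42.
        List<RecommendedItem> items = recommender.recommend(42L, 3);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }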

Giraph

Giraph is a library containing algorithms that can be incorporated into MapReduce jobs to perform graph processing. These algorithms are used for page ranking, k-means clustering, and other complex problems that lend themselves to graph processing.
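
The sketch below shows the shape of a Giraph computation: a simplified PageRank written against Giraph’s BasicComputation API, following the structure of Giraph’s standard examples. The superstep limit and damping factor are hard-coded for illustration, and vertex values are assumed to be initialized by the input format.

    import java.io.IOException;

    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;

    // Simplified PageRank: vertex ids are longs, vertex values hold the rank,
    // edges carry no value, and messages are partial rank contributions.
    public class SimplePageRank
        extends BasicComputation<LongWritable, DoubleWritable, NullWritable, DoubleWritable> {

      private static final int MAX_SUPERSTEPS = 30;   // illustrative iteration limit
      private static final double DAMPING = 0.85;

      @Override
      public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
                          Iterable<DoubleWritable> messages) throws IOException {
        // From superstep 1 onward, fold in the rank mass sent by neighbors.
        if (getSuperstep() > 0) {
          double sum = 0;
          for (DoubleWritable msg : messages) {
            sum += msg.get();
          }
          double rank = (1 - DAMPING) / getTotalNumVertices() + DAMPING * sum;
          vertex.setValue(new DoubleWritable(rank));
        }

        if (getSuperstep() < MAX_SUPERSTEPS) {
          // Spread this vertex's current rank evenly across its outgoing edges.
          long numEdges = vertex.getNumEdges();
          if (numEdges > 0) {
            sendMessageToAllEdges(vertex,
                new DoubleWritable(vertex.getValue().get() / numEdges));
          }
        } else {
          vertex.voteToHalt();
        }
      }
    }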

Hadoop Conclusion

The Hadoop ecosystem provides a set of tools and frameworks that allow data scientists to gather and store, understand, prepare, and model data to solve business problems.

The follow-on parts of this blog will cover Spark, IBM’s contributions to Hadoop and Spark, and IBM value-add technologies. Additional blogs will go into various use cases, demonstrating all of these components of the Hadoop and Spark ecosystems, and more. I hope you will follow along.

Additional resources:

Hadoop, Spark, and Data Science courses (Free):

Big Data University: https://bigdatauniversity.com/courses/

IBM Data Science: http://www.ibm.com/analytics/us/en/technology/data-science/
