Hadoop in Action

A few real-life use cases to get you in the mood for distributed storage and computing.

Update: Gluent and Cloudera are presenting this course across the US. Next stop, St. Louis, Missouri!

Here at Gluent we just delivered our first public training course, Hadoop for Database Professionals. This one-day remote training offers a view of Apache Hadoop from the perspective of a typical relational database administrator, developer, or architect. We provide basic introductions to the new concepts and then jump into SQL processing on Hadoop, comparing and contrasting with the RDBMS along the way. The first and second runs of the course both went well, with great questions and feedback from the attendees. We even had several attendees ask if there could be a second day to continue learning more! In the future, we just might add one. ;)

“For everyone who is trying extend its RDBMS knowledge to the modern Hadoop world this webinar is really a godsend. Thanks a lot.” — Hadoop for Database Professionals attendee

Towards the end of the training course, we like to share several Hadoop use cases from various industries. There are loads of different reasons companies rely on Hadoop for their data storage, processing, and analytics back-end, and many of them are quite compelling.

Enterprise Data Hub

Tubular Labs

When your organization is responsible for providing clients with real-time analytics about their customers' online video viewership, you should probably have a high-powered computation engine driving the solution. That's why Tubular Labs decided to build an enterprise data hub (EDH) using Cloudera's Hadoop distribution. Tubular Labs "analyze over 2.5 billion videos and the viewing habits of more than 400 million consumers across 30+ platforms to produce actionable insights that result in winning online video strategies." (tubularlabs.com/team) Millions of data points are ingested daily from dozens of data sources, leading to a complex web of data to be mined for insights.

Using Hadoop as the enterprise data hub, and Apache Impala for real-time SQL-on-Hadoop queries, Tubular now has the analytic flexibility necessary to perform at scale. With so many different data points being ingested, and the need to join them in complex, ad-hoc queries, Tubular turned to Impala hosted on the Amazon EC2 cloud. Impala is a SQL-on-Hadoop engine built for fast, interactive analytics. It uses a SQL syntax similar to that of Apache Hive and also shares the Hive Metastore for tracking and storing metadata. Impala, however, behaves more like a massively parallel processing (MPP) analytic database on top of Hadoop. It's built to handle low-latency, interactive queries and performs well under concurrent load from multiple users.
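The MPP idea behind Impala can be illustrated with a toy Python sketch (this is conceptual, not Impala's actual code; the partitioned view-count data is invented for illustration): each worker aggregates only its own data partition in parallel, and a coordinator merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# Hypothetical per-platform view counts, partitioned across "nodes"
partitions = [
    [("youtube", 120), ("facebook", 80)],
    [("youtube", 200), ("twitch", 50)],
    [("facebook", 40), ("twitch", 10)],
]

def partial_aggregate(rows):
    """Each worker aggregates only its local partition."""
    counts = Counter()
    for platform, views in rows:
        counts[platform] += views
    return counts

def merge(partials):
    """The coordinator merges the workers' partial results."""
    total = Counter()
    for p in partials:
        total.update(p)
    return dict(total)

# Partial aggregation runs on every partition in parallel,
# then only the small partials travel to the coordinator.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_aggregate, partitions))
print(merge(partials))  # → {'youtube': 320, 'facebook': 120, 'twitch': 60}
```

The design point is that each node scans its local data and only small partial aggregates cross the network, which is why MPP engines like Impala stay interactive even as the underlying dataset grows.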

After light "massaging" of the data via Hive scripts, Tubular's analytics tool queries it through Impala in real time. As the case study states, "Tubular customers are growing their YouTube subscriber base 56% faster than non-Tubular users, and are growing their views 52% faster". That's quite an impact being made by Hadoop, Impala, and the Cloudera implementation.

Store and Access Everything

Arizona State University

Data is complicated. Nobody knows that better than Dr. Kenneth Buetow, director of Computational Sciences and Informatics for the Complex Adaptive Systems Initiative (CASI) at Arizona State University. “CASI’s research mission is to develop and promote a new type of science that embraces the complexity of natural systems.” hortonworks.com/customers/arizona-state-university. Included in this program is Dr. Buetow’s research on the genetic basis of cancer.

The complexity of the subject, cancer genomics, and the data that was generated by the research led to a challenge in data storage and processing at ASU. Essentially, each researcher would spin up an individual system for a specific funded project. If data was necessary from a different project, it would take days, even weeks, to perform the integration between the proprietary research systems. What ended up happening was referred to as “lamp-posting”, meaning cancer researchers could only “look where the light was”, while the potential answer could remain “in the dark”.

Dr. Buetow and the team at CASI decided to implement Hadoop using the Hortonworks Data Platform distribution. Once it was up and running, the gains were immediate. The department could now store and process huge amounts of data. Not only was the data shared across research projects within CASI, but it could also be shared outside of the department and university. All of this was made possible by the ability to scale as the genomics dataset continues to grow by petabytes. Sharing and collaborating on CASI's cancer research "data lake" was now possible with anyone in the world.

Hadoop not only provided the storage capability for the research data, but also allowed access to information that was previously out of reach.

The Hybrid World

Have you ever wondered how a parolee wearing an ankle bracelet is tracked? Well, Securus Technologies provides the answer. Their leading-edge civil and criminal justice technology solutions, including the Satellite Tracking of People (STOP) solution, help to improve public safety while modernizing the incarceration experience. STOP is an application that captures geolocation information from a GPS monitoring device (an ankle bracelet) attached to an offender. This data is loaded into an Oracle database and presented to supervising agents and parole officers via a cloud-based application. Questions such as "did the offender leave the designated area?" or "did the offender violate a restriction placed upon him/her?" can be answered via the application.

Securus Technologies Satellite Tracking of People (STOP)

Securus built their STOP solution well before the advent of big data technologies, which placed an increasing number of restrictions on the Oracle database's capabilities. The database continued to grow, already at 150 TB with several years of history. On top of that, new analytics and capabilities were needed but could not be implemented. The architecture limited queries to a maximum of 2 days of history, because the application itself cached those 2 days in memory.

With a data storage and access bottleneck, Securus turned to Gluent for a hybrid solution. The Gluent Data Platform has the ability to offload data from a relational database to Hadoop and also present that data back to the RDBMS, as if it had never been moved in the first place. The Securus team decided a proof of concept of the Gluent Data Platform was necessary. With the entire dataset offloaded 100% to Hadoop, Securus began running their STOP application directly against the hybrid environment.
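The hybrid concept can be sketched in a few lines of Python (a conceptual toy, not Gluent's implementation; the in-memory stores, the offload cutoff date, and all row values are assumptions for illustration): older rows live in the offloaded "cold" store, recent rows in the "hot" relational store, and a single query function unions results from both so callers never see where a row physically resides.

```python
from datetime import date

# Toy stores standing in for Oracle (hot) and Hadoop (cold, offloaded).
OFFLOAD_CUTOFF = date(2017, 1, 1)  # rows older than this were offloaded

rdbms_rows = [  # recent data kept in the RDBMS
    {"offender_id": 1, "ts": date(2017, 6, 1), "zone": "A"},
    {"offender_id": 1, "ts": date(2017, 6, 2), "zone": "B"},
]
hadoop_rows = [  # historical data offloaded to Hadoop
    {"offender_id": 1, "ts": date(2015, 3, 9), "zone": "A"},
    {"offender_id": 1, "ts": date(2016, 11, 20), "zone": "C"},
]

def query_history(offender_id, start, end):
    """Answer a date-range query against whichever stores the range
    touches, then merge the rows into one transparent result set."""
    sources = []
    if start < OFFLOAD_CUTOFF:   # range reaches into offloaded history
        sources.append(hadoop_rows)
    if end >= OFFLOAD_CUTOFF:    # range reaches into recent data
        sources.append(rdbms_rows)
    return sorted(
        (r for src in sources for r in src
         if r["offender_id"] == offender_id and start <= r["ts"] <= end),
        key=lambda r: r["ts"],
    )

# A multi-year query spans both stores; a 2-day query touches only one.
rows = query_history(1, date(2015, 1, 1), date(2017, 12, 31))
```

The point of the sketch is the routing decision: short recent queries hit only the relational side, while long historical queries push most of the scanning and computation down to Hadoop, which is why the deep-history queries described below became feasible.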

Query response time (in seconds) between datasets situated 100% in Oracle vs. 100% in Hadoop. Note: the 365-day query in Oracle did not complete within the designated time window.

The results were astounding. Not only did the STOP application continue to function as it had previously, but it was also able to return 6 days, and even 30 days, of history with nearly the same response time as the original 2-day query. Even a query of data from 2–3 years in the past returned within several minutes. The difference from the Oracle-only solution was that Hadoop was now performing the majority of the computation and heavy lifting when returning these historical datasets.

With the Gluent Data Platform and a hybrid Oracle / Hadoop environment, Securus is now able to rethink what was possible with their analytic applications and has a serious competitive advantage in the marketplace.

There is a lot of chatter within the data and analytics world right now claiming that the buzz around Hadoop has died and the technology is on a downward slope. There may be several reasons for this thinking: failed big data projects, large Hadoop clusters holding very small amounts of data, and so on. But I do know that if Hadoop is implemented in the correct way, as an appropriate solution to a real technical challenge, then Hadoop, and its ever-changing and rapidly growing ecosystem, will be around for quite some time.

Join Gluent and Cloudera for the next Hadoop for Database Professionals complimentary training course, September 7, 2017 @ 10am-2pm CST in St. Louis, to learn more about the technology and why we believe the new world is here to stay.