Real Time Analysis with Apache Hadoop

Jim Scott
The Ramp
Nov 25, 2015 · 4 min read

Apache Hadoop is revolutionizing big data in more than one way. While the Hadoop platform introduced reliable distributed storage and processing, various packages such as Spark on top of Hadoop make it possible to build applications and analyze data much faster. Here are some cool ways the Hadoop stack is being used right now.

Fraud Detection

If you’ve been to your local grocery store lately, you might have noticed a new PIN pad has been installed. Why would they do that when the old credit card readers worked just fine? The main reason is to guard against credit card fraud.

Credit card companies are switching from the old magnetic stripe and signature method over to EMV (Europay, MasterCard, and Visa), also known as the “Chip-and-PIN” system. It’s the same system that has been in use for several years in Europe. The new cards have a small chip that generates a one-time code, which you use together with a PIN, just like with your ATM card. Some credit card companies use Chip-and-Signature instead, but the premise is the same.

Credit card companies have also put pressure on merchants: as of autumn 2015, liability for fraudulent purchases shifts onto merchants who refuse to upgrade their readers.

Still, the system isn’t perfect, with proof-of-concept hacks and even card readers that have been tampered with. In any case, it’s in everybody’s interest to detect fraud while it happens. Credit card companies are doing just that, learning their customers’ patterns and catching anything that looks out of character, like a purchase in a foreign country when you just filled up at a gas station a few minutes ago.

Some companies now analyze millions of transactions in real time to flag those that are potentially fraudulent, saving themselves and their customers a great deal of money.
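The out-of-character detection described above can be illustrated with a minimal sketch. This is plain Python rather than a Spark job, and the "impossible travel" rule (same card, different country, within an hour) is one hypothetical example of the patterns a real system would learn from data:

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    card_id: str
    country: str
    timestamp: float  # seconds since epoch

def flag_suspicious(transactions, window_seconds=3600):
    """Flag a transaction when the same card appears in a different
    country within `window_seconds` of its previous purchase — a
    simple 'impossible travel' rule."""
    last_seen = {}  # card_id -> (country, timestamp)
    flagged = []
    for tx in transactions:
        prev = last_seen.get(tx.card_id)
        if prev is not None:
            prev_country, prev_ts = prev
            if tx.country != prev_country and tx.timestamp - prev_ts < window_seconds:
                flagged.append(tx)
        last_seen[tx.card_id] = (tx.country, tx.timestamp)
    return flagged
```

A production system would replace the single hand-written rule with models trained on each customer’s history, but the streaming shape — keep a small amount of state per card, score each event as it arrives — is the same.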

User Profiles

If you listen to new tunes on Pandora or binge-watch your favorite shows on Netflix, the service is building a profile of what you’ve liked and predicting what you’ll want to watch or hear next.

These companies want to keep you happy so you will remain subscribed and contribute to their profit margin. The best way to do that is to keep serving up content that you want. But these services have lots of users, and processing user preferences can take a long time.

MapR has built a proof of concept of such a system that holds up under real-world loads.
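The core of a preference profile is comparing users and recommending what similar users enjoyed. A minimal sketch of that idea, in plain Python with made-up item names rather than any particular service’s engine:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating dicts (item -> score)."""
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(target, others, limit=3):
    """Suggest items the most similar user rated highly that the
    target user has not rated yet."""
    best_user = max(others, key=lambda u: cosine_similarity(target, u))
    picks = [item for item, _ in sorted(best_user.items(), key=lambda kv: -kv[1])
             if item not in target]
    return picks[:limit]
```

At Netflix or Pandora scale this pairwise comparison is exactly the part that gets expensive, which is why distributed platforms like Spark on Hadoop matter: the same similarity computation is sharded across a cluster instead of run in one loop.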

Business Analytics

Business moves fast in the 21st century. While computers have been part of modern businesses since the middle of the last century, it always seems that they’re just not fast enough. Sure, a company could store vast amounts of data, but the process to access it was just too slow — until Spark.

Quantium, an Australian data analytics company, uses Spark and Hadoop to offer fast analytics to companies such as Woolworths, National Australia Bank, and Foxtel. Quantium uses Spark to generate insights in near real time: database queries return in under 50 milliseconds, supporting live interactive use in settings such as call centers.

The company’s whole approach is to favor interactive use rather than batch processing. The ability to move fast allows businesses to create new products and respond to market changes that much faster.
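The queries behind interactive analytics are usually aggregations: group millions of rows, total them, and surface the top results. A toy sketch of that shape in plain Python (the store/amount schema is invented for illustration; Spark SQL would express the same thing as a GROUP BY over a distributed dataset):

```python
from collections import defaultdict

def sales_by_store(rows):
    """Aggregate (store, amount) rows into per-store totals —
    the kind of query an analyst runs interactively."""
    totals = defaultdict(float)
    for store, amount in rows:
        totals[store] += amount
    return dict(totals)

def top_stores(rows, n=2):
    """Return the n stores with the highest total sales."""
    totals = sales_by_store(rows)
    return sorted(totals.items(), key=lambda kv: -kv[1])[:n]
```

Keeping such aggregations fast enough for a 50-millisecond budget is what distinguishes the interactive approach from overnight batch reports.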

Drug Discovery

The Novartis Institute for Biomedical Research uses Spark for Next Generation Sequencing.

Genomic sequencing is a new frontier in modern medicine. Genetic information allows researchers to create drugs tailored to an individual patient. A genetic sample may be physically tiny, but it contains vast amounts of information.

Novartis uses Spark on Hadoop to plow through public datasets and discover new drugs. Spark on Hadoop offers much more flexibility than other solutions. Novartis can combine the batch processing with machine learning for even more powerful analysis.
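One basic building block of next-generation sequencing pipelines is k-mer counting: tallying every overlapping substring of length k in a read. It is a simple computation that parallelizes naturally across a cluster, which is part of why map-reduce-style platforms suit genomics. A minimal single-machine sketch:

```python
from collections import Counter

def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in a DNA read. Counting k-mers is a
    standard primitive in sequencing pipelines, and each read can be
    processed independently — ideal for distributed processing."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
```

Run over billions of reads, these counts feed downstream steps like assembly and error correction; on a cluster, each node counts its shard of reads and the partial counts are merged.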

Global Warming

There’s no doubt that global warming is changing the planet, and that humans are contributing to it. Data centers consume a lot of energy, which in turn produces greenhouse gases. Companies like Google work to build more efficient data centers, but real-time Big Data processing can also help address climate change directly.

There are a number of public datasets available from organizations like NASA, but crunching this data can take a long time. Apache Spark Streaming allows for near instantaneous data mining. Researchers can mine data from public datasets and also incorporate data from sensors in real time, producing an even more powerful and accurate model.
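Streaming systems like Spark Streaming process sensor feeds in sliding windows: each new reading updates a rolling summary instead of waiting for a batch job. A minimal pure-Python sketch of that windowing idea (a rolling mean over a stream of readings, not Spark’s actual API):

```python
from collections import deque

def windowed_means(readings, window=3):
    """Rolling mean over a stream of sensor readings — the same
    sliding-window idea Spark Streaming applies at cluster scale."""
    buf = deque(maxlen=window)  # keeps only the last `window` readings
    out = []
    for r in readings:
        buf.append(r)
        out.append(sum(buf) / len(buf))
    return out
```

With windows like this over temperature or pollution sensors, a model can react to conditions as they change during the day rather than after a nightly recompute.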

With real-time data, environmentalists and planners can see how pollution affects the atmosphere over the course of a day and find new ways to reduce humanity’s impact on the planet. Yes, the Apache Hadoop stack could very well help save the planet.

Conclusion

When you have the power of Apache Hadoop, you can tackle the complex problems in your own world.

Originally published at www.mapr.com.
