Published in Geek Culture

Data lakes: a better solution for real-time data analytics

Photo by Ricardo Frantz on Unsplash

The ever-increasing use of IoT, social media, and the rise of Industry 4.0 have pushed organizations to evaluate their need for real-time data storage infrastructure. The basic concept of a data lake is that data can be stored in any form, without first being structured as a data warehouse requires. Different types of analytics can then be run on it, including machine learning, big data processing, and real-time analytics. This lets an organization perform new kinds of analysis, for example running machine learning over new sources such as log files, social media, and clickstream data.

Data lakes help companies react sooner to boost business growth, retain customers, increase productivity, and proactively monitor devices.

A data lake stores relational data from business applications alongside non-relational data from mobile apps, IoT devices, and social media. The structure of the schema is not defined when the data is captured. Seeing these benefits, many organizations are starting to implement data lakes alongside conventional data warehouses for better insights.
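That "schema is not defined at capture" idea can be sketched in a few lines of Python. In this toy sketch (all file names and fields are made up), heterogeneous records are written to the lake exactly as they arrive, and a structure is imposed only when a particular analysis reads them back:

```python
import json
import tempfile
from pathlib import Path

def ingest(lake_dir: Path, source: str, records: list) -> Path:
    """Append records to the lake as-is -- no schema is enforced at capture time."""
    lake_dir.mkdir(parents=True, exist_ok=True)
    path = lake_dir / f"{source}.jsonl"
    with path.open("a") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path

def read_with_schema(path: Path, fields: list) -> list:
    """Schema-on-read: project each raw record onto the fields this analysis needs."""
    rows = []
    with path.open() as f:
        for line in f:
            record = json.loads(line)
            rows.append({k: record.get(k) for k in fields})
    return rows

# Relational-style business records and free-form IoT events land in the same lake.
lake = Path(tempfile.mkdtemp()) / "raw"
ingest(lake, "crm", [{"customer": "a1", "spend": 120}])
p = ingest(lake, "sensors", [{"device": "t-7", "temp_c": 71.5, "fw": "2.1"}])
print(read_with_schema(p, ["device", "temp_c"]))
```

Note that the `"fw"` field was stored but simply ignored at read time; a warehouse would have forced a schema decision about it up front.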

Key elements of a data lake

Photo by Maria Ziegler on Unsplash


A data lake allows various people in the company to use the analytics tools or frameworks of their choice, including Presto, Apache Hadoop, Apache Kafka, Apache Spark, etc. It lets you run analytics in place, without moving the data to a separate analytic system.
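The "analytics in place" point can be illustrated with a minimal sketch, with plain Python standing in for engines like Presto or Spark that query lake files where they sit: the analysis streams over the raw files directly instead of first loading them into a separate system. The file and column names here are hypothetical.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

# A raw CSV file sitting in the lake (hypothetical sensor readings).
lake = Path(tempfile.mkdtemp())
(lake / "readings.csv").write_text("device,temp_c\nt-1,70.0\nt-2,80.0\nt-1,72.0\n")

def mean_temp_by_device(lake_dir: Path) -> dict:
    """Aggregate directly over the files in the lake -- no copy into a warehouse."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for path in lake_dir.glob("*.csv"):
        with path.open() as f:
            for row in csv.DictReader(f):
                sums[row["device"]] += float(row["temp_c"])
                counts[row["device"]] += 1
    return {device: sums[device] / counts[device] for device in sums}

print(mean_temp_by_device(lake))  # {'t-1': 71.0, 't-2': 80.0}
```

A real query engine does the same conceptual scan, only distributed across many machines and many files.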

Machine learning applications

Receiving data faster through a data lake lets organizations run machine learning algorithms that produce better forecasts and suggest prescribed actions to achieve optimal results.
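As a toy illustration of that idea (not any organization's actual model), fresher data in the lake feeds a forecast that can trigger a prescribed action; here a least-squares trend over recent sensor readings flags a pump for maintenance. The readings, threshold, and action are all invented.

```python
def linear_forecast(ys: list, steps_ahead: int) -> float:
    """Fit y = a + b*x by least squares over x = 0..n-1, then extrapolate."""
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    b = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    a = y_mean - b * x_mean
    return a + b * (n - 1 + steps_ahead)

# Hourly vibration readings streamed into the lake (hypothetical).
vibration = [1.0, 1.1, 1.2, 1.3, 1.4]
predicted = linear_forecast(vibration, steps_ahead=3)
action = "schedule maintenance" if predicted > 1.5 else "no action"
print(predicted, action)  # predicted ≈ 1.7, so maintenance is scheduled
```

The point is the loop, not the model: fresher data in, forecast out, prescribed action taken before the failure happens.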

Data Movement

Data can be moved from any source to the desired destination as-is, saving data teams the time otherwise spent defining data structures, schemas, and transformations up front.
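A minimal sketch of that movement pattern (ELT rather than ETL, with invented paths and source names): the raw bytes are landed in a source- and date-partitioned lake folder untouched, and any transformation is deferred to later processing.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path

def land_raw(src: Path, lake_root: Path, source_name: str, day: date) -> Path:
    """Copy a source file into the lake unchanged, partitioned by source and date."""
    dest_dir = lake_root / source_name / f"dt={day.isoformat()}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)  # byte-for-byte: no schema, no transformation
    return dest

# Demo with a throwaway source file.
work = Path(tempfile.mkdtemp())
src = work / "events.json"
src.write_text('{"clicks": 3}')
landed = land_raw(src, work / "lake", "clickstream", date(2021, 5, 1))
print(landed)
```

Partitioning by `dt=YYYY-MM-DD` is a common lake-layout convention that later processing jobs can prune on; the key property is that landing the data required no schema decisions at all.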

DaVita Inc., Stanley Black & Decker, Inc., Whitehall Resources Limited, and Loren Technologies are a few of the top companies that use Azure Data Lake for their analytics solutions.

In the big data world, it is common to hear of the 3 V's: Velocity, Variety, and Volume. Capturing all three is difficult, so implementing data warehouses alongside data lakes will enhance big data analytics.

If you have read this far and are equally excited to learn more about how big data works, the following case study gives a better view of how it aids organizations.

How BP uses big data and AI in practice

Photo by Chris LeBoutillier on Unsplash

AI would be “one of the most critical digital technologies to drive new levels of performance” in the industry, said Morag Watson, chief digital innovation officer at BP. The company is investing millions into big data technology to improve the use of its resources, safety, and reliability of oil and gas production and refining.

More than 99% of BP’s oil and gas wells have sensors installed that continuously generate data. This helps the BP team, wherever they are located, understand the real conditions at each site, optimize the performance of equipment, and monitor maintenance needs to prevent breakdowns, allowing the company to realize tremendous cost savings.

In addition to BP’s Center for High-Performance Computing (CHPC), a supercomputer with extraordinary data-crunching capability, and nearly 2,000 kilometers of fiber optic cable that can carry 5 million data points every minute, BP has invested in big data technology to enhance its data streaming, storage, and processing capabilities. BP is also in the midst of an expansion that will increase its data capacity from about 1 petabyte to 6 petabytes by 2020.

BP’s sensors are collecting enormous amounts of data about temperature, chemicals, vibration, and more from oil and gas wells, rigs, and facilities. Streaming technologies for large volumes of data such as Kafka, Apache NiFi, Apex, Amazon Kinesis, and Google Pub/Sub can carry the data from BP’s sensors to a datastore ready for processing. The very large data sets these sensors create require scalable data stores, such as Parquet files on the Hadoop Distributed File System (HDFS).
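A rough sketch of that pipeline's shape, in plain Python (a real deployment would use Kafka or Kinesis for transport and Parquet on HDFS for storage; neither is assumed here, and the readings are simulated): sensor data is consumed from a stream in micro-batches and flushed to an append-only store, one file per batch.

```python
import json
import tempfile
from pathlib import Path

def micro_batches(stream, batch_size: int):
    """Group an unbounded stream of readings into fixed-size batches."""
    batch = []
    for reading in stream:
        batch.append(reading)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the trailing partial batch
        yield batch

def flush(store: Path, batch_id: int, batch: list) -> Path:
    """Write one batch as a file; a real system would write Parquet to HDFS."""
    path = store / f"batch-{batch_id:05d}.jsonl"
    path.write_text("".join(json.dumps(r) + "\n" for r in batch))
    return path

store = Path(tempfile.mkdtemp())
readings = ({"rig": "r1", "temp_c": 60 + i} for i in range(7))  # simulated sensors
files = [flush(store, i, b) for i, b in enumerate(micro_batches(readings, 3))]
print(len(files))  # 7 readings in batches of 3 -> 3 files (3, 3, 1)
```

Batching is the essential trade-off in such pipelines: larger batches mean fewer, bigger files (friendlier to HDFS and Parquet), at the cost of higher latency before a reading becomes queryable.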

Once data is collected and stored, it needs to be processed and acted upon to deliver business advantages such as cost savings and operational efficiencies. That’s where big data tools such as Apache Spark, the most popular open-source processing engine, and Hadoop come into play.
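To show the kind of processing engines like Spark and Hadoop apply at scale, here is a single-machine sketch of the map-reduce pattern they are built on: each record is mapped to a key-value pair, then values are reduced per key. The rig names and readings are invented; Spark would run the same logic distributed across a cluster.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal map-reduce: map each record to (key, value), then reduce per key."""
    grouped = defaultdict(list)
    for record in records:
        key, value = mapper(record)
        grouped[key].append(value)
    return {key: reducer(values) for key, values in grouped.items()}

readings = [
    {"rig": "r1", "temp_c": 61.0},
    {"rig": "r2", "temp_c": 75.0},
    {"rig": "r1", "temp_c": 64.0},
]
# Peak temperature per rig -- the kind of aggregation Spark distributes over a cluster.
peaks = map_reduce(readings, lambda r: (r["rig"], r["temp_c"]), max)
print(peaks)  # {'r1': 64.0, 'r2': 75.0}
```

In Spark the equivalent would be a `groupBy`/aggregate over a DataFrame; the conceptual shape of the computation is the same.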

One of the innovations BP credits with improving the reliability of its exploration and production facilities has been the creation of a “digital twin” where BP engineers can test critical engineering work through virtual reality before implementing it on real facilities.

In June 2017, BP invested $20 million in Beyond Limits, a start-up that adapts software originally developed by NASA and the US Department of Defense for robotic space exploration to commercial use. Beyond Limits’ cognitive computing systems focus on automating human decision processes — they can even fill in missing pieces from data sets.
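"Filling in missing pieces from data sets" can be illustrated with a far simpler technique than Beyond Limits' cognitive systems use (this is purely a toy stand-in for the idea): linear interpolation across gaps in a sensor series.

```python
from typing import Optional

def fill_gaps(series: list) -> list:
    """Fill None gaps by linear interpolation between the nearest known values."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

# Two readings lost between 10.0 and 16.0 are reconstructed from the trend.
print(fill_gaps([10.0, None, None, 16.0]))  # [10.0, 12.0, 14.0, 16.0]
```

Real systems would use richer models that account for correlated sensors and physics, but the goal is the same: make incomplete data usable for downstream decisions.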

BP expects Beyond Limits’ technology to provide a new level of operational insight, help it locate and develop reservoirs, enhance how it produces and refines crude oil, increase process automation and operational efficiency, and even optimize business activities such as how it markets its products. The software can aid decision-making and help manage operational risks.

Below is an interesting case study on how organizations benefit from implementing data lakes:


[1] Privately Owned Multinational Energy Corporation Streamlines Enterprise Data Visibility Using Advanced Cloud Data Lake Solution


