Using Enterprise Data Lakes for Modern Analytics and Business Intelligence

“Big data” and “data lake” only matter to an organization’s vision when they solve business problems by enabling data democratization, re-use, exploration, and analytics. Big data architectures built on the data lake concept are being used to improve search and analytics in innovative ways.

What Is a Data Lake?

A data lake is a large storage repository that holds a vast amount of raw data in its native format until it is needed. An “enterprise data lake” (EDL) is simply a data lake for enterprise-wide information storage and sharing.

What Are the Benefits of a Data Lake?

The main benefit of a data lake is the centralization of disparate content sources. Once gathered together (from their “information silos”), these sources can be combined and processed using big data, search, and analytics techniques that would otherwise have been impossible to apply. These sources often contain proprietary and sensitive information, so the data lake must implement appropriate security measures.

Security in the data lake can be assigned so that certain information is available to data lake users who do not have access to the original content source. These users are entitled to the information but cannot access it at its source for one reason or another.

Some users may not need to work with the data in the original content source, only to consume the data produced by processes built into those sources. There may be a licensing limit on the original content source that prevents some users from getting their own credentials. In some cases, the original content source has been locked down, is obsolete, or will be decommissioned soon, yet its content is still valuable to users of the data lake.
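
To make the idea of lake-level entitlements concrete, here is a minimal sketch in Python. It assumes a simple group-based read list attached to each data set in the lake; the class, the group names, and the "legacy_mes" source are all hypothetical and are not taken from any particular product.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LakeDataset:
    """A data set copied into the lake, with its own lake-level read list."""
    name: str
    source_system: str
    # Groups allowed to read this data set in the lake (hypothetical names).
    lake_readers: set[str] = field(default_factory=set)


def can_read(dataset: LakeDataset, user_groups: set[str]) -> bool:
    """Return True if any of the user's groups is on the lake-level read list.

    The check never asks whether the user holds credentials for
    dataset.source_system: entitlement is decided in the lake itself.
    """
    return bool(dataset.lake_readers & user_groups)


batch_yields = LakeDataset(
    name="batch_yields",
    source_system="legacy_mes",  # e.g. a locked-down or license-limited source
    lake_readers={"manufacturing", "research"},
)

print(can_read(batch_yields, {"research"}))          # entitled, no source login needed
print(can_read(batch_yields, {"customer_support"}))  # not on the read list
```

The point of the sketch is only that the access decision lives with the copy in the lake, so an entitled user does not need an account on the original system.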

Once the content is in the data lake, it can be normalized and enriched. This can include metadata extraction, format conversion, augmentation, entity extraction, cross-linking, aggregation, de-normalization, or indexing. Data is prepared “as needed,” which reduces preparation costs compared with the up-front processing that a data warehouse would require. A big data compute fabric makes it possible to scale this processing to the largest enterprise-wide data sets.
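
As an illustration of “as needed” preparation, the following Python sketch keeps records in their raw form and only converts and enriches a record the first time someone asks for it. The record ids, the JSON layout, and the batch identifier pattern are assumptions made up for this example.

```python
import json
import re
from functools import lru_cache

# Raw records kept in their native format (here: JSON text), keyed by a
# hypothetical record id. Nothing is parsed or enriched at load time.
RAW_STORE = {
    "ticket-1": '{"body": "Batch B-1042 failed the yield test on line 3."}',
    "ticket-2": '{"body": "No issues reported."}',
}

# Assumed format of a batch identifier, used for simple entity extraction.
BATCH_ID = re.compile(r"\bB-\d+\b")


@lru_cache(maxsize=None)
def prepare(record_id: str) -> dict:
    """Normalize and enrich one record the first time it is requested."""
    doc = json.loads(RAW_STORE[record_id])      # format conversion
    body = doc["body"]
    return {
        "id": record_id,
        "body": body,
        "length": len(body),                    # metadata extraction
        "batch_ids": BATCH_ID.findall(body),    # entity extraction
    }


print(prepare("ticket-1")["batch_ids"])
```

Because of the cache, a second call for the same record returns the already prepared result, and records nobody asks for are never processed at all. A real deployment would distribute this work across a cluster rather than run it in one process.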

Users from different departments, potentially scattered around the globe, can have flexible access to the data lake and its content from anywhere. This increases re-use of the content and helps the organization more easily collect the data required to drive business decisions.

Information is power, and a data lake puts enterprise-wide information into the hands of many more employees to make the organization as a whole smarter, more agile, and more innovative.

Searching the Data Lake

A data lake can hold tens of thousands of tables and files and billions of records. Even worse, much of this data is unstructured and varies widely in format.

In this environment, search is a necessary tool:

  • To find tables that you need – based on table schema and table content
  • To extract sub-sets of records for further processing
  • To work with unstructured (or unknown-structured) data sets
  • And most importantly, to handle analytics at scale
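
The first item in the list, finding tables by schema and content, can be sketched with a toy inverted index over a table catalog. This is a simplified illustration in Python, not how any specific search engine is implemented; the catalog, its table names, and the sampled values are hypothetical.

```python
from collections import defaultdict

# Hypothetical catalog: table name -> column names and a few sampled values.
CATALOG = {
    "batch_tests": {
        "columns": ["batch_id", "test_name", "result"],
        "samples": ["B-1042", "yield", "pass"],
    },
    "support_tickets": {
        "columns": ["ticket_id", "customer", "body"],
        "samples": ["T-7", "Acme", "yield question"],
    },
    "hvac_sensors": {
        "columns": ["sensor_id", "temperature", "timestamp"],
        "samples": ["S-3", "21.5"],
    },
}


def tokenize(text):
    """Lowercase, split on whitespace, and treat underscores as spaces."""
    return text.lower().replace("_", " ").split()


def build_index(catalog):
    """Map each token from column names and sampled values to its tables."""
    index = defaultdict(set)
    for table, info in catalog.items():
        for text in info["columns"] + info["samples"]:
            for token in tokenize(text):
                index[token].add(table)
    return index


def find_tables(index, query):
    """Return the tables that match every term in the query, sorted by name."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


index = build_index(CATALOG)
print(find_tables(index, "yield"))     # matches on sampled content
print(find_tables(index, "batch id"))  # matches on column names
```

A query on "yield" finds tables through their content, while "batch id" finds a table through its schema, which are the two routes the first bullet describes.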

Only search engines can perform real-time analytics at billion-record scale at a reasonable cost.

Search engines are the ideal tool for managing the enterprise data lake because:

  • Search engines are easy to use – Everyone knows how to use a search engine.
  • Search engines are schema-free – Schemas do not need to be pre-defined. Search engines can handle records with varying schemas in the same index.
  • Search engines naturally scale to billions of records.
  • Search can sift through wholly unstructured content.
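
To show what “schema-free” means in practice, here is a small Python sketch of an index that accepts records with different fields, lets you search any field, and counts values for simple analytics. It is a toy, single-process model written for this article; real search engines add sharding, text analysis, and ranking that are not shown, and the class and field names are hypothetical.

```python
from collections import Counter, defaultdict


class SchemaFreeIndex:
    """Toy index: no schema is declared, fields are discovered per record."""

    def __init__(self):
        self.docs = []
        self.postings = defaultdict(set)  # (field, token) -> record positions

    def add(self, record):
        """Index every field the record happens to have."""
        doc_id = len(self.docs)
        self.docs.append(record)
        for field, value in record.items():
            for token in str(value).lower().split():
                self.postings[(field, token)].add(doc_id)

    def search(self, field, token):
        """Return records whose given field contains the token."""
        ids = self.postings.get((field, token.lower()), set())
        return [self.docs[i] for i in sorted(ids)]

    def facet(self, field):
        """Count the values of a field across the records that have it."""
        return Counter(doc[field] for doc in self.docs if field in doc)


index = SchemaFreeIndex()
index.add({"batch_id": "B-1042", "result": "fail", "line": 3})
index.add({"ticket_id": "T-7", "body": "Yield question", "status": "open"})
index.add({"batch_id": "B-1043", "result": "pass"})

print(index.search("body", "yield"))
print(index.facet("result"))
```

Batch records and support tickets share one index even though they have no fields in common, and the facet count is a miniature version of the aggregations that search-based analytics relies on.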

The State of Data Lake Adoption

Radiant Advisors and Unisphere Research recently released “The Definitive Guide to the Data Lake,” a joint research project with the goal of clarifying the emerging data lake concept.

Two of the high-level findings from the research were:

  • Data lakes are increasingly recognizable as both a viable and compelling component within a data strategy, with small and large companies continuing to adopt.
  • Governance and security are still top-of-mind as key challenges and success factors for the data lake.

More and more research on data lakes is becoming available as companies are taking the leap to incorporate data lakes into their overall data management strategy. It is expected that, within the next few years, data lakes will be common and will continue to mature and evolve.

Using Data Lakes in Biotech and Health Research – Two Enterprise Data Lake Examples

We are currently working with two worldwide biotechnology and health research firms. These organizations have many different departments, and employees have access to many different content sources from different business systems stored all over the world. The data includes:

  • Manufacturing data (batch tests, batch yields, manufacturing line sensor data, HVAC and building systems data);
  • Research data (electronic notebooks, research runs, test results, equipment data);
  • Customer support data (tickets, responses); and
  • Public data sets (chemical structures, drug databases, MeSH headings, proteins).

Our projects focus on making structured and unstructured data searchable from a central data lake. The goal is to provide data access to business users in near real-time and improve visibility into the manufacturing and research processes. These enterprise data lake and big data architectures are built on Cloudera, which collects and processes all the raw data in one place and then indexes that data into Cloudera Search, Impala, and HBase to give end-users a unified search and analytics experience.
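
The overall flow described above (collect raw data, process it, then publish it to several stores) can be sketched in Python as follows. This is a rough illustration only: the three in-memory structures are stand-ins for a full-text index, a SQL-queryable table, and a key-value store, and none of the function names correspond to Cloudera APIs.

```python
from __future__ import annotations

raw_zone: list[dict] = []       # raw records, stored exactly as received
search_docs: list[dict] = []    # stand-in for a full-text search index
sql_rows: list[tuple] = []      # stand-in for a SQL-queryable table
kv_store: dict[str, dict] = {}  # stand-in for a key-value store


def ingest(record: dict) -> None:
    """Land a record in the raw zone without changing it."""
    raw_zone.append(record)


def process(record: dict) -> dict:
    """Return a normalized copy; the raw record is left untouched."""
    return {**record, "text": record.get("text", "").strip().lower()}


def publish(doc: dict) -> None:
    """Fan one processed record out to all three serving stores."""
    search_docs.append({"id": doc["id"], "text": doc["text"]})
    sql_rows.append((doc["id"], doc["source"]))
    kv_store[doc["id"]] = doc


def run_pipeline() -> None:
    """Process and publish everything currently in the raw zone."""
    for record in raw_zone:
        publish(process(record))


ingest({"id": "t-1", "source": "support", "text": "  Yield question  "})
ingest({"id": "b-1", "source": "manufacturing", "text": "Batch test PASSED"})
run_pipeline()
print(kv_store["b-1"]["text"])
```

Keeping the raw zone unchanged means the same records can be reprocessed later with different rules, while each serving store answers the kind of query it is best suited to.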

Like what you read? Give Alex Baretto a round of applause.