Organize your data lake using Lighthouse

Gergely Soti
Oct 31 · 4 min read

Lighthouse is an open source library (using Apache Spark and Scala) that we developed at Data Minded, with the aim of providing a way to organize your data lake and your Spark jobs in an extensible and modular format. Like every library, this is going to be opinionated. We have found that this way works for us. If it works for you as well, even better!

In the diagram above, data comes from one or more external systems, passes through the ingestion pipeline, and ends up in long-term storage such as AWS S3. The raw data is then further processed according to project requirements, and this project-specific data also ends up in long-term storage. What the diagram does not show is that the project-specific data is ultimately exported for actual use, e.g. to a database backing a web app. Of course, this is a very simplified picture; for a more detailed overview, check out the Lighthouse Concepts.

Lighthouse is concerned with the “spark jobs” part of the diagram. It eases the problems of:

  • environment management
  • data source identification and specification

In the following, we will take a deep dive into the library by exploring the little example project we have created.

Project structure

Depending on project size and the number of projects, different levels of separation are necessary. For our small example, it suffices to separate the classes related to the data lake, the pipelines and the case classes. Once the data lake grows, one can create a separate module containing only the data lake definition and the case classes. This way other projects can easily import those definitions and use the same raw data for other purposes. The degree of code separation is always a popular discussion topic, and we often lean towards the mono-repo approach; however, that is not the point of this post.
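As an illustration, this separation could look roughly like the layout below. Only the jobs package appears later in the article (in the spark-submit call); the datalake and models package names are our assumption.

src/main/scala/com/companyname/
  datalake/   // data lake definition and environment classes
  models/     // case classes describing the datasets
  jobs/       // the Spark jobs themselves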

Data lake definition

Environment definition

It should be easy to specify which environment is targeted. In our case that means specifying the location of input and output files, e.g. having different S3 buckets for test and prod. At some point one needs to hard-code such things, and we found that the best way of doing this is to define an abstract class declaring all the necessary variables for your environment (bucket names in AWS, storage container names in Azure, SQL connection strings, etc.). One then creates a concrete class for each environment.
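A minimal sketch of this pattern could look as follows; the class name, the fields and the concrete values are illustrative assumptions, not the ones used in the example project.

// Sketch of the environment pattern described above; names are illustrative.
abstract class Environment {
  def rawBucket: String            // e.g. the S3 bucket holding raw ingested data
  def projectBucket: String        // bucket for project-specific output
  def sqlConnectionString: String  // connection used when exporting results
}

object TestEnvironment extends Environment {
  val rawBucket = "s3://company-test-raw"
  val projectBucket = "s3://company-test-project"
  val sqlConnectionString = "jdbc:postgresql://test-db:5432/analytics"
}

object ProdEnvironment extends Environment {
  val rawBucket = "s3://company-prod-raw"
  val projectBucket = "s3://company-prod-project"
  val sqlConnectionString = "jdbc:postgresql://prod-db:5432/analytics"
}

Picking an environment then boils down to choosing one of the concrete implementations.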

Defining the data

val myData = new TypedDataLink[CaseClassTwo](
  new FileSystemDataLink("path/to/data", Parquet)
)

Here, CaseClassTwo is the type associated with the DataLink. A call to readTyped() will return a Dataset[CaseClassTwo]. Similarly, a call to writeTyped() will expect an argument of type Dataset[CaseClassTwo].
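For example, assuming CaseClassTwo had an id and a value field (these fields are made up purely for illustration), a job could use the link like this:

import org.apache.spark.sql.Dataset

// Hypothetical schema, for illustration only
case class CaseClassTwo(id: Long, value: String)

// readTyped() returns a strongly typed Dataset
val input: Dataset[CaseClassTwo] = myData.readTyped()

// writeTyped() only accepts a Dataset[CaseClassTwo]
myData.writeTyped(input.filter(_.value.nonEmpty))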

Setting up the data lake

Our Datalake class contains a simple object hierarchy which organizes the different data sources (TypedDataLinks in our case) according to the different areas, e.g. raw, clean, project.
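As a sketch of what such a hierarchy might look like, reusing the TypedDataLink construction from above: the areas, the dataset names and the way the environment is resolved are our assumptions, and only the class name TypedCompanyDataLake is taken from the job description below.

class TypedCompanyDataLake(environment: String) {
  // Resolve the environment name to one of the concrete environments
  private val env: Environment = environment match {
    case "test" => TestEnvironment
    case "prod" => ProdEnvironment
  }

  object raw {
    val events = new TypedDataLink[CaseClassTwo](
      new FileSystemDataLink(s"${env.rawBucket}/events", Parquet))
  }

  object project {
    val report = new TypedDataLink[CaseClassTwo](
      new FileSystemDataLink(s"${env.projectBucket}/report", Parquet))
  }
}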

Writing jobs

The job expects two arguments:

  • --environment or -e: this value will be passed to the TypedCompanyDataLake constructor, so make sure it matches one of the defined environments exactly.
  • --date or -d: the date for which the data will be read, in YYYY-MM-DD format.

The final spark-submit call would look something like:

spark-submit --class com.companyname.jobs.ProjectJob <path_to_jar_file> --environment <env> --date <date>
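As a bare-bones sketch, the job's entry point might handle these two arguments roughly as shown below; the real example project may well use a dedicated argument parser, and the body of the job is only hinted at.

object ProjectJob {
  def main(args: Array[String]): Unit = {
    // Collect "--flag value" (or "-f value") pairs into a map
    val parsed = args.grouped(2).collect {
      case Array("--environment" | "-e", env) => "environment" -> env
      case Array("--date" | "-d", date)       => "date" -> date
    }.toMap

    // The environment name selects the right buckets, as described above
    val datalake = new TypedCompanyDataLake(parsed("environment"))

    // The date determines which slice of the raw data is processed
    val date = java.time.LocalDate.parse(parsed("date")) // expects YYYY-MM-DD

    // ... read the raw data for `date`, transform it, and write the
    // project-specific result back to the data lake
  }
}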

Conclusions
