Organize your data lake using Lighthouse

Gergely Soti
Oct 31 · 4 min read

Lighthouse is an open source library (using Apache Spark and Scala) that we developed at Data Minded, with the aim of providing a way to organize your data lake and your Spark jobs in an extensible and modular format. Like every library, this is going to be opinionated. We have found that this way works for us. If it works for you as well, even better!

In the diagram above, data comes from one or more external systems, passes through the ingestion pipeline, and ends up in long-term storage such as AWS S3. The raw data is then further processed according to project requirements, and this project-specific data also ends up in long-term storage. What the diagram does not show is that the project-specific data is ultimately exported for actual use, e.g. to a database backing a web app. Of course, this is a very simplified picture; for a more detailed overview, check out the Lighthouse Concepts.

Lighthouse is concerned with the “spark jobs” part of the diagram. It eases the problems of:

  • environment management
  • data source identification and specification

In the following, we will take a deep dive into the library by exploring the little example project we have created.

Project structure

Depending on project size and the number of projects, different levels of separation are necessary. For our small example, it suffices to separate the classes related to the data lake, the pipelines and the case classes. Once the data lake grows, one can create a separate module containing only the data lake definition and the case classes. This way other projects can easily import those definitions and use the same raw data for other purposes. The degree of code separation is always a popular discussion topic, and we often lean towards the mono-repo approach; however, that is not the point of this post.
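As an illustration, this separation could look roughly like the layout below. Only the jobs package appears later in the article (in the spark-submit call); the datalake and models package names are our assumption.

src/main/scala/com/companyname/
  datalake/   // data lake definition and environment classes
  models/     // case classes describing the datasets
  jobs/       // the Spark jobs themselves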

Data lake definition

Environment definition

It should be easy to specify which environment is targeted. In our case that means specifying the location of input and output files, e.g. having different S3 buckets for test and prod. At some point one needs to hard-code such things, and we found that the best way of doing this is to define an abstract class declaring all the necessary variables for your environment (bucket names in AWS, storage container names in Azure, SQL connection strings, etc.). One then creates a concrete class for each environment.
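A minimal sketch of this pattern could look as follows; the class name, the fields and the concrete values are illustrative assumptions, not the ones used in the example project.

// Sketch of the environment pattern described above; names are illustrative.
abstract class Environment {
  def rawBucket: String            // e.g. the S3 bucket holding raw ingested data
  def projectBucket: String        // bucket for project-specific output
  def sqlConnectionString: String  // connection used when exporting results
}

object TestEnvironment extends Environment {
  val rawBucket = "s3://company-test-raw"
  val projectBucket = "s3://company-test-project"
  val sqlConnectionString = "jdbc:postgresql://test-db:5432/analytics"
}

object ProdEnvironment extends Environment {
  val rawBucket = "s3://company-prod-raw"
  val projectBucket = "s3://company-prod-project"
  val sqlConnectionString = "jdbc:postgresql://prod-db:5432/analytics"
}

Picking an environment then boils down to choosing one of the concrete implementations.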

Defining the data

val myData = new TypedDataLink[CaseClassTwo](
  new FileSystemDataLink("path/to/data", Parquet)
)

Here, CaseClassTwo is the type associated with the DataLink. A call to readTyped() will return a Dataset[CaseClassTwo]. Similarly, a call to writeTyped() will expect an argument of type Dataset[CaseClassTwo].
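For example, assuming CaseClassTwo had an id and a value field (these fields are made up purely for illustration), a job could use the link like this:

import org.apache.spark.sql.Dataset

// Hypothetical schema, for illustration only
case class CaseClassTwo(id: Long, value: String)

// readTyped() returns a strongly typed Dataset
val input: Dataset[CaseClassTwo] = myData.readTyped()

// writeTyped() only accepts a Dataset[CaseClassTwo]
myData.writeTyped(input.filter(_.value.nonEmpty))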

Setting up the data lake

Our Datalake class contains a simple object hierarchy which organizes the different data sources (TypedDataLinks in our case) according to the different areas, e.g. raw, clean, project.
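As a sketch of what such a hierarchy might look like, reusing the TypedDataLink construction from above: the areas, the dataset names and the way the environment is resolved are our assumptions, and only the class name TypedCompanyDataLake is taken from the job description below.

class TypedCompanyDataLake(environment: String) {
  // Resolve the environment name to one of the concrete environments
  private val env: Environment = environment match {
    case "test" => TestEnvironment
    case "prod" => ProdEnvironment
  }

  object raw {
    val events = new TypedDataLink[CaseClassTwo](
      new FileSystemDataLink(s"${env.rawBucket}/events", Parquet))
  }

  object project {
    val report = new TypedDataLink[CaseClassTwo](
      new FileSystemDataLink(s"${env.projectBucket}/report", Parquet))
  }
}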

Writing jobs

The job expects two arguments:

  • --environment or -e: this value will be passed to the TypedCompanyDataLake constructor, so make sure it matches one of the defined environments exactly.
  • --date or -d: the date for which the data will be read, in YYYY-MM-DD format.

The final spark-submit call would look something like:

spark-submit --class com.companyname.jobs.ProjectJob <path_to_jar_file> --environment <env> --date <date>
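As a bare-bones sketch, the job's entry point might handle these two arguments roughly as shown below; the real example project may well use a dedicated argument parser, and the body of the job is only hinted at.

object ProjectJob {
  def main(args: Array[String]): Unit = {
    // Collect "--flag value" (or "-f value") pairs into a map
    val parsed = args.grouped(2).collect {
      case Array("--environment" | "-e", env) => "environment" -> env
      case Array("--date" | "-d", date)       => "date" -> date
    }.toMap

    // The environment name selects the right buckets, as described above
    val datalake = new TypedCompanyDataLake(parsed("environment"))

    // The date determines which slice of the raw data is processed
    val date = java.time.LocalDate.parse(parsed("date")) // expects YYYY-MM-DD

    // ... read the raw data for `date`, transform it, and write the
    // project-specific result back to the data lake
  }
}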

Conclusions
