Part I: Is a Data Lake Right for Me?

Stefan Lopez
Oct 15, 2020 · 5 min read

If you’re anything like me, your first exposure to the world of data science probably started with a single SQL query. With good reason too, as relational databases provide a very intuitive, user-friendly way to interact with large amounts of data. But what happens when your business matures, the problems become more complex, and your relational databases can’t keep up? We’ve been experiencing these growing pains at NITS Solutions lately and have finally settled on building a data lake as our preferred solution.

What is a Data Lake?

Interest in data lakes has grown quite a bit in the past few years, and the term has joined the ranks of machine learning, predictive analytics, and A.I. as an industry-defining buzzword. Don’t just take my word for it; take a look at this Google Trends chart:

So what’s all the fuss about? Like any topic worth exploring, the answer is never so simple, but for now this definition should suffice:

data lake: a repository for the consumption of large amounts of structured and unstructured data, usually within the context of a parallelized, cloud-based reporting solution.

If you were anything like me when first learning about this stuff, that definition was probably a lot to chew on. Let’s break it down further by getting a few more definitions out of the way.

structured data: data that has a set number of fields and pre-defined data types for all the fields it contains.

unstructured data: data that has a potentially variable number of fields and does not necessarily have known data types.

parallel computing: computational tasks that are distributed across many different computers that are all working on the same problem at the same time.

cloud computing: computational tasks that are run on computers that do not reside locally, but are instead rented out from a remote data center.
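To make the first two definitions concrete, here is a minimal Python sketch. The field names (`order_id`, `amount`, `notes`, `photos`) are invented for illustration only. Strictly speaking, the second list is closer to "semi-structured" data; truly unstructured content such as images or PDFs has no fields at all, but the variable-shape idea is the same.

```python
from dataclasses import dataclass, fields

# Structured: every record has the same fields with known types.
@dataclass
class Order:
    order_id: int
    amount: float
    ordered_on: str  # ISO date kept as a string for simplicity

structured = [
    Order(1, 10.50, "2020-10-01"),
    Order(2, 4.50, "2020-10-02"),
]

# Loosely structured: the set of fields varies from record to record
# and the types are not declared ahead of time.
unstructured = [
    {"order_id": 3, "amount": "5.00"},
    {"order_id": "A-4", "notes": "gift", "photos": ["front.jpg"]},
]

def field_names(record):
    """Return the set of field names for either kind of record."""
    if isinstance(record, dict):
        return set(record)
    return {f.name for f in fields(record)}

# Count how many distinct "shapes" each collection contains.
structured_shapes = {frozenset(field_names(r)) for r in structured}
unstructured_shapes = {frozenset(field_names(r)) for r in unstructured}

print(len(structured_shapes))    # one shape shared by every record
print(len(unstructured_shapes))  # a different shape per record here
```

A relational table can only hold the first kind of record without extra work, which is the limitation the next section comes back to.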

OK, now I think we’re ready to take a deeper dive into the benefits and drawbacks of implementing a data lake solution.

Advantages of a Data Lake

1. More Data Variety

One of the sticking points with relational databases like Oracle, MySQL, etc. is that they require structured data with a well-defined schema (i.e., data type definitions). That is not necessarily the case with a data lake, which opens up a lot more possibilities for potential data feeds. All of a sudden, data objects like images, PDFs, and streaming content, which would be a hassle in a relational database, are now handled with ease.

2. Unlock the Power of Parallelization

Data lakes are all about storing and crunching massive amounts of data, and there is no better way to do that than through the use of parallelization. With cloud-based services like AWS’s EMR clusters, tasks that would take hours to do on one computer can now be accomplished in minutes. Summarizations, M.L. training, file processing, you name it! Everything gets a speed boost. What’s more, once you’re done crunching the numbers in your data lake, you can simply terminate your EMR cluster and pay only for the compute time you used.
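The pattern behind that speed boost is split, summarize, combine. The sketch below shows it on a single machine with Python's standard library. It is not how EMR or Spark is programmed: threads stand in for cluster nodes, and the `split` and `summarize` helpers are hypothetical names made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(chunk):
    """Summarize one slice of the data as (row count, total)."""
    return len(chunk), sum(chunk)

def split(values, parts):
    """Cut a list into `parts` roughly equal slices."""
    size = -(-len(values) // parts)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]

values = list(range(1, 1001))  # stand-in for a large data set
chunks = split(values, parts=4)

# Each worker summarizes its own slice; on a real cluster each slice
# would go to a different machine rather than a different thread.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(summarize, chunks))

# Combine the partial results into one answer.
row_count = sum(n for n, _ in partials)
total = sum(t for _, t in partials)
print(row_count, total, total / row_count)
```

The key design point is that `summarize` needs nothing but its own slice, so the slices can be processed in any order, or all at once, and the partial results still combine into the same answer.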

3. Nimble Processes

Perhaps most importantly, data lakes are extremely flexible. There is no requirement to set up relationships between tables, program endless procedures and triggers, deal with the proprietary idiosyncrasies of something like Oracle, etc. You don’t even need to keep your data locally; in fact, many companies store their data lake files in AWS S3 buckets. It’s bare bones: just your raw data files and the programs you write to consume them.
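Here is a rough idea of what "raw data files and the programs you write to consume them" can look like, as a minimal Python sketch. A temporary local folder stands in for an S3 bucket, and the file names, the `amount` field, and the `read_raw_file` and `total_amount` helpers are all assumptions invented for this example.

```python
import csv
import json
import tempfile
from pathlib import Path

def read_raw_file(path):
    """Yield each record in a raw file as a dict, based on its extension."""
    if path.suffix == ".csv":
        with path.open(newline="") as handle:
            yield from csv.DictReader(handle)
    elif path.suffix == ".jsonl":
        with path.open() as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)

def total_amount(folder):
    """Sum the `amount` field across every readable file in the folder."""
    total = 0.0
    for path in sorted(Path(folder).iterdir()):
        for record in read_raw_file(path):
            # Types are decided here, at read time, not at load time.
            total += float(record.get("amount", 0))
    return total

# Build a throwaway "lake" so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as lake:
    Path(lake, "orders_2020_10.csv").write_text(
        "order_id,amount\n1,10.50\n2,4.50\n"
    )
    Path(lake, "orders_2020_11.jsonl").write_text(
        '{"order_id": 3, "amount": 5}\n{"order_id": 4, "note": "no amount"}\n'
    )
    lake_total = total_amount(lake)

print(lake_total)
```

Notice that nothing was declared up front: no tables, no column types, no relationships. The files differ in format and even in which fields they carry, and the reading program decides how to interpret them.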

That’s not to say that data lakes don’t have their drawbacks, and you should think carefully before deciding to take the time to create one of your own.

Disadvantages of a Data Lake

1. Steep Learning Curve

If your team hasn’t worked with this technology previously, it can be quite overwhelming. At minimum, they will have to handle their own data file storage, have an understanding of Hadoop, know how to run parallel computation tasks in Spark, and above all else, possess the architectural prowess to connect all these dots together. If your team doesn’t have a firm grasp on either Java or Scala (or perhaps Python in conjunction with PySpark), learning those languages on top of everything else will be a tall order.

2. Process Reset

The fundamental architecture and final output of your ETL processes will be changed, and that has ramifications for everyone. Those JDBC connections your data analysts, business analysts, and product managers have been using to view their data may be replaced by connections to AWS Athena. The SQL client they’ve been using may not support Athena connections (check out the SQuirreL client if so). The very tables and columns they are used to querying may get completely rearranged. While not insurmountable, these issues take time to get sorted, and your organization should factor that in when drawing up timelines.

3. Square Pegs in Round Holes

Because data lakes have been such a hot topic, it seems like everyone wants to have one regardless of need. The truth is, if you have a relatively small amount of data whose format and content change infrequently, relational databases are still probably the way to go for you. It’s easy for management to want the latest and greatest, but if your organization’s tech stack does not align with its use case, those growing pains that come from adopting a data lake will be for nothing.

Wrapping It Up

Data lakes have become popular lately for good reason, as there is perhaps no more accessible way to efficiently consume massive amounts of data. By relying on raw data files and parallelization instead of a monolithic database, much more flexibility and processing power can be achieved while ensuring your code base stays light. Perhaps most importantly, data lakes offer you the ability to take on new projects that simply aren’t practical in a traditional relational database. If you think your organization could benefit from moving to a data lake solution, check out Part II: Understanding the Data Lake Tech Stack to get a better understanding of the tools and skills needed for a successful data lake solution.

Written by Stefan Lopez

Data Team Manager at NITS Solutions. When I’m not solving tech problems you can probably find me on the disc golf course. https://stefanlopez.tech
