If you’re anything like me, your first exposure to the world of data science probably started with a single SQL query. With good reason too, as relational databases provide a very intuitive, user-friendly way to interact with large amounts of data. But what happens when your business matures, the problems become more complex, and your relational databases can’t keep up? We’ve been experiencing these growing pains at NITS Solutions lately and have finally settled on building a data lake as our preferred solution.
What is a Data Lake?
Interest in data lakes has grown quite a bit in the past few years, joining the ranks of machine learning, predictive analytics, and A.I. as industry-defining buzzwords. Don’t just take my word for it, take a look at this Google Trends chart:
So what’s all the fuss about? Like any topic worth exploring the answer is never so simple, but for now this definition should suffice:
data lake: a repository that stores large amounts of structured and unstructured data, usually consumed within the context of a parallelized, cloud-based reporting solution.
If you were anything like me when first learning about this stuff, that definition was probably a lot to chew on. Let’s break it down further by getting a few more definitions out of the way.
structured data: data that has a set number of fields and pre-defined data types for all the fields it contains.
unstructured data: data that has a potentially variable number of fields and does not necessarily have known data types.
parallel computing: computational tasks that are distributed across many different computers that are all working on the same problem at the same time.
cloud computing: computational tasks that are run on computers that do not reside locally, but are instead rented from a remote data center.
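To make the structured/unstructured distinction concrete, here is a small sketch in Python. The records and field names are made up for illustration; the point is only the shape of the data:

```python
import json

# Structured: every record has the same fields with known types,
# exactly like a row in a relational table.
structured_rows = [
    {"dealer_id": 101, "region": "East", "sales": 25000.0},
    {"dealer_id": 102, "region": "West", "sales": 31500.0},
]

# Unstructured/semi-structured: field counts and types vary per record.
# A fixed relational schema handles this poorly; a data lake accepts it as-is.
unstructured_records = [
    json.loads('{"dealer_id": 101, "notes": "called twice"}'),
    json.loads('{"dealer_id": "102-B", "attachments": ["scan.pdf"], "score": 4.5}'),
]

# A relational table needs one schema up front; here each record defines its own.
fields_per_record = [set(r) for r in unstructured_records]
print(fields_per_record)
```

Notice that even the same field (dealer_id) changes type between records, which is perfectly legal in the second list but would break an INSERT into a typed column.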
OK, now I think we’re ready to take a deeper dive into the benefits and drawbacks of implementing a data lake solution.
Advantages of a Data Lake
1. More Data Variety
One of the sticking points with relational databases like Oracle, MySQL, etc. is that they require structured data with a well-defined schema (i.e. data type definitions). That is not necessarily the case with a data lake, which opens up a lot more possibilities for potential data feeds. All of a sudden, data objects that would be a hassle in a relational database, such as images, PDFs, and streaming content, are handled with ease.
2. Unlock the Power of Parallelization
Data lakes are all about storing and crunching massive amounts of data, and there is no better way to do that than through the use of parallelization. With cloud-based services like AWS’s EMR clusters, tasks that would take hours to do on one computer can now be accomplished in minutes. Summarizations, M.L. training, file processing, you name it! Everything gets a speed boost. What’s more, once you’re done crunching the numbers in your data lake you can simply terminate your EMR cluster and only pay for the compute time you used.
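The split-work-then-combine idea behind that speed boost can be sketched on a single machine with Python’s standard library. This is only a local illustration using threads; Spark on an EMR cluster applies the same pattern across many machines, which is where the real gains come from:

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(chunk):
    # Work on one partition of the data, independent of the others.
    return sum(chunk)

data = list(range(1_000_000))
# Split the data into four partitions...
chunks = [data[i::4] for i in range(4)]

# ...process every partition at the same time, then combine the results.
# Spark applies this same map-then-reduce pattern, but across the many
# machines of a cluster rather than worker threads on one computer.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(summarize, chunks))
total = sum(partials)

print(total == sum(data))  # True: partitioned work gives the same answer
```

The key property is that each partition is processed without needing to see the others, so adding more workers (or more machines) divides the wall-clock time instead of just shuffling it around.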
3. Nimble Processes
Perhaps most importantly, data lakes are extremely flexible. There is no requirement to set up relationships between tables, program endless amounts of procedures and triggers, deal with the proprietary idiosyncrasies of something like Oracle, etc. You don’t even need to keep your data locally; in fact, many companies store their data lake files in AWS S3 buckets. It’s bare bones: just your raw data files and the programs you write to consume them.
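That “raw files plus programs” idea can be sketched locally. In practice the files would sit in an S3 bucket and the consumer might be a Spark job, but the shape is the same; the file names and fields below are hypothetical:

```python
import json
import tempfile
from pathlib import Path

# Stand-in for a data lake "bucket": just a directory of raw files.
lake = Path(tempfile.mkdtemp())

# Land raw data as files -- no tables, triggers, or relationships to set up first.
(lake / "sales_2019_01.json").write_text(json.dumps({"region": "East", "total": 100}))
(lake / "sales_2019_02.json").write_text(json.dumps({"region": "West", "total": 250}))

# The "program you write to consume them": read every raw file on demand.
grand_total = sum(
    json.loads(f.read_text())["total"] for f in lake.glob("sales_*.json")
)
print(grand_total)  # 350
```

Adding a new data feed is just dropping more files into the bucket; the schema decisions happen in the consuming program, not up front in the storage layer.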
That’s not to say that data lakes don’t have their drawbacks, and you should think carefully before deciding to take the time to create one of your own.
Disadvantages of a Data Lake
1. Steep Learning Curve
If your team hasn’t worked with this technology previously it can be quite overwhelming. At minimum, they will have to handle their own data file storage, have an understanding of Hadoop, know how to run parallel computation tasks in Spark, and above all else, possess the architectural prowess to connect all these dots. If your team doesn’t have a firm grasp on either Java or Scala (or perhaps Python in conjunction with PySpark), learning those languages on top of everything else will be a tall order.
2. Process Reset
The fundamental architecture and final output of your ETL processes will be changed, and that has ramifications for everyone. Those JDBC connections your data analysts, business analysts, and product managers have been using to view their data may be replaced by connections to AWS Athena. The SQL client they’ve been using may not support Athena connections (check out the SQuirreL client if so). The very tables and columns they are used to querying may get completely rearranged. While not insurmountable, these issues take time to get sorted and your organization should factor that in when drawing up timelines.
3. Square Pegs in Round Holes
Because data lakes have been such a hot topic, it seems like everyone wants to have one regardless of need. The truth is, if you have a relatively small amount of data whose format and content change infrequently, relational databases are still probably the way to go for you. It’s easy for management to want the latest and greatest, but if your organization’s tech stack does not align with its use case, those growing pains that come from adopting a data lake will be for nothing.
Wrapping It Up
Data lakes have become popular lately for good reason, as there is perhaps no more accessible way to efficiently consume massive amounts of data. By relying on raw data files and parallelization instead of a monolithic database, much more flexibility and processing power can be achieved while ensuring your code base stays light. Perhaps most importantly, data lakes offer you the ability to take on new projects that simply aren’t practical in a traditional relational database. If you think your organization could benefit from moving to a data lake solution, check out Part II: Understanding the Data Lake Tech Stack to get a better understanding of the tools and skills needed for a successful data lake solution.