The Good, the Bad and the Ugly

Tiểu Đông Tà
Life of a Senior Data Engineer
8 min read · Aug 2, 2020

In recent years, as organizations strive to be more transparent, more efficient and more data-driven, there has been a surge in demand for a stable analytics platform for internal departments. Long gone is the age when the Business department comes to the IT team and says:

Will you please extract this month's data to CSV format? I need to run some stats on it

Trust me, in those days engineers hated these requests. They would struggle to dump a full month of production data, then for some reason the connection would drop and they would have to run the export all over again. And after all of that, oops, the business team comes back to say: “Well, in fact we need another table exported, there is something missing”…

“Hi, can you export this month's data to CSV for us?”

Thankfully, nowadays every organization has a dedicated analytics database for those jobs (hopefully). Depending on the technologies you have chosen for it, you may call it a Data Warehouse, a Data Lake, or a “Cemetery of Data”. These are effectively dedicated data servers that replicate production data, apply a lot of Extract-Transform-Load (ETL), and serve as an analytical database (OLAP, Online Analytical Processing).

After working with different companies, ranging from startups to corporations, and with the different technologies they have chosen as the backbone of their data warehouse systems, I think it is worth noting down my reflections on those technologies.

At first, the company, and notably the tech team, is very excited about the technology. But in the end, all of those nice pieces of technology have to serve the organization in some way. Let’s be honest, business is about “minimum effort, maximum efficiency”: you need to choose the best technology to get the job done, and those “best choices” are not always the “cool” ones.

Thus, I am starting this small series to share my experience with the different Data Warehouse technologies I have worked with.

Part I: The Bad

Part II: The Ugly (coming)

Part III: The Good (coming)

PART I: The Bad

The Bad

The legendary technology of the first generation of the age that has been named “Big Data” is nothing new: it is Hadoop, and in our particular case, Apache Hive.

Apache Hive

If you have ever spent time in analytical teams in big organizations, then at some point you have definitely used Apache Hive (https://hive.apache.org).

After Google Research published the paper MapReduce: Simplified Data Processing on Large Clusters, MapReduce became a famous trend, and with the arrival of the open-source Apache Hadoop ecosystem, open-source MapReduce officially opened the so-called “big data age”. Organizations quickly adopted this new piece of technology.

Apache Hive itself is not a database. For the commoners, it is just a tool that translates your normal SQL queries into MapReduce jobs that run on the distributed Hadoop File System (HDFS). However, in a not-so-strict sense, Apache Hive serves its purpose just as any analytical database does. So let’s consider it in that sense.

Any database management system consists of the following components.

Structures of a database management system

In any database management system, you interact with its driver (WebUI, driver connections, etc.) to send your query. The query is then passed to the query parser/validator, and at this level the system drafts an execution plan for your query.

Once the execution plan is formed, it is sent to the executors, which actually launch the workload. At the lowest level, the data is pulled out of the physical files stored on disk or in memory. The data is then returned in a readable format for consumption.

Now, keeping that in mind, let’s take a look into how Hive works.

Schema of Hive’s design. Yellow blocks are Hive components and pink blocks are tasks that are the responsibility of Hadoop.

Let’s describe what happens when you want to query something from your file.

Imagine you have a file called clients.csv. The file looks like:

John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
“John “”Da Man”””,Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,”7452 Terrace “”At the Plaza”” road”,SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
“Joan “”the bone””, Anne”,Jet,”9th, at Terrace plc”,Desert City,CO,00123

Then when you say to Hive:

CREATE TABLE Client (firstname string, lastname string, address string, city string, state string, postal int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/clients.csv' OVERWRITE INTO TABLE Client;

Let’s trace the flow of this query:

  1. The query is written in the Hive CLI shell, or sent via JDBC, or via the WebUI.
  2. The query is then parsed and optimized, and an execution plan is formed.
  3. Since your query creates a table with the given columns, metadata comes into play. Metadata includes the table’s name, the table’s columns, the columns’ types, etc. This information, although vital, is not stored in Hive itself, but sent to a proper database, normally MySQL, for storage and fast retrieval.
  4. The execution plan is then sent to Hadoop. It is the Hadoop executors that actually translate the plan into MapReduce code and execute the workload. In this case, the data file is loaded into the Hadoop file system, and an entry indicating the new “Hive table” is created in the metadata storage. (You can inspect both sides yourself, as shown below.)
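
A minimal sketch of that inspection, using the Client table created above. DESCRIBE FORMATTED and EXPLAIN are standard HiveQL, though the exact output layout varies between Hive versions and execution engines:

-- Show the metadata Hive keeps in the metastore (typically MySQL): column names
-- and types, the HDFS location of the data files, the input/output formats.
DESCRIBE FORMATTED Client;

-- Show the execution plan Hive hands over to Hadoop: for an aggregation like this
-- one, the plan contains map and reduce stages rather than a simple table scan.
EXPLAIN
SELECT city, COUNT(*) AS nb_clients
FROM Client
GROUP BY city;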

As you can see, Hive does not have its own “executors” and does not even manage the data files directly; it relies entirely on Hadoop for those tasks. Thus, Hive itself is not a proper database management system.

What is the reason behind it? Why Hive?

Every time I have to use Hive, I keep asking myself those questions. You may argue: why do you need Hive anyway? It does nothing good, or does it?

In fact, if your organization has moved most of its analytical data to the Hadoop File System, then you have a few choices:

A. Write code to directly interact with Hadoop File System

B. Write an SQL query and forget it, take your coffee and just wait for the results.

Write simple SQL that we love, then just wait

Clearly, if you ask your analytical teams to write a bunch of complicated code just to get data out of your Hadoop File System, they are not going to do it. Or even worse, they are going to write it once, just to dump the data from your Hadoop File System to CSV, and then they will just work with plain CSV (here we go again…).
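
That one-off dump is often nothing more than a single Hive statement. A sketch of the anti-pattern in HiveQL, with a hypothetical output path:

-- Dump the whole table out of HDFS once, then live in CSV land forever.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/clients_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM Client;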

So, in short, Apache Hive is an extension of Hadoop, a friendly arm which enables everyone on the team to happily work with data stored on Hadoop.

Life is good then… Or is it?

The pain in the b*tt!

The problem arises from the fact that everything you say in Hive depends on MapReduce jobs and on Hadoop itself.

Since MapReduce and the Hadoop ecosystem itself are Java-based, a single query has to go through a series of transformations and translations before being executed in a distributed fashion on Java Virtual Machines.

What’s wrong with that? you may ask… But as a guy who has “been there, done that”, I can tell you that it is a big pain. Name it:

  • You finish two coffees and your jobs are still being planned.
  • Weird errors randomly pop up from obscure parts of the system, everything starting with:

“A NullPointerException was thrown at line number 120045689 of planet Scorpion Alpha X67Y , generated by machine Tetra 76, failed at task 24….”

That is pretty much how you are going to debug your code, even when all you wrote was an extremely simple SQL join of two tables (see the sketch after this list).

  • The whole thing is just slow, difficult to debug, and a nightmare for analytical users.
  • It is hard for the DevOps team to maintain the clusters.
  • It is costly to maintain those clusters.
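
To make the “simple join” point concrete, here is a sketch of such a query in HiveQL (the Orders table and its columns are hypothetical). On Hive over MapReduce, even this compiles into at least one full map/shuffle/reduce cycle, and any failure surfaces as a Java stack trace like the one above:

-- A "simple" join of two tables: Hive still plans and runs it as MapReduce stages.
SELECT c.city,
       o.order_id,
       o.amount
FROM Client c
JOIN Orders o
  ON o.client_firstname = c.firstname;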

Do we need it?

In fact, sometimes I ask myself: “Do we really need that Hadoop ecosystem? Do I need to write SQL on Hive? Can I just write plain SQL on a relational database management system like Postgres or SQL Server instead?”

In most cases, I figure out that technically, the answer is NO, we do not really need it. Most (not all) of the organizations I have worked with do not actually need MapReduce at all. Maybe it was the trend of the Big Data era that drove its adoption. MapReduce lets jobs run over huge datasets in a very cheap (but not necessarily fast) way. However, the fact is that most of the organizations using it do not have that huge a dataset.

Sometimes I have seen cases where people put just 500,000 lines of data into HDFS and run Hive SQL queries on it. Technically, 500,000 lines of data is nowhere near my definition of “BIG”.
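
For a dataset of that size, plain SQL on an ordinary relational database does the same job in seconds, on a single machine, without a cluster. A minimal sketch in Postgres (the file path is hypothetical; COPY with FORMAT csv is standard Postgres):

-- The same client data, loaded into plain Postgres instead of HDFS + Hive.
CREATE TABLE client (
  firstname TEXT, lastname TEXT, address TEXT,
  city TEXT, state TEXT, postal TEXT
);

COPY client FROM '/home/clients.csv' WITH (FORMAT csv);

-- The same aggregation, executed by a single local process, no MapReduce involved.
SELECT city, COUNT(*) AS nb_clients
FROM client
GROUP BY city;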

Big (data) ?

To be fair, I have indeed seen a few organizations that truly have huge datasets, on which running extensive queries is really costly, and there Hive becomes very handy: you may have to wait 30 minutes for the execution, but at least it gets the job done. Instead of having nothing, you have something. It is a step ahead.

Conclusion

Apache Hive and its brother Apache Hadoop are listed in my category of the Bad guy.

But remember, sometimes you still need the Bad guy to get the job done.

Next: The Ugly
