Docker Hive Scripts

Helper scripts for EMR in Docker

I’ve forked the docker-hive project here: https://github.com/alex-ber/docker-hive
Basically, it is a single-node EMR 5.25.0 Hadoop cluster Docker image, with Amazon Linux, Hadoop 2.8.5, and Hive 2.3.5.

It mimics AWS EMR in a Docker container. You can read about Docker here: https://github.com/alex-ber/AlexBerDocs/tree/master/Docker/Windows

I’ve recently added some bash scripts that I want to explain here.

This article is long. It is written using a top-down approach, and I recommend reading it at least twice. On the first reading, all footnotes can be skipped; they contain more detailed explanations. On the second reading, you can optionally read bottom-up.

run-hive.sh

This script is intended to be run on the host machine.

  • If there is no running docker container, we run one and wait until it is responsive (see the checkisup.sh section below).
  • Otherwise, we run the given bash scripts inside the running container.
bash run-hive.sh

By default, they will be “reinit-hdfs.sh && reinit-metastore.sh”.
You can also pass them explicitly, if you like (quoted, so that && is passed to the script rather than being interpreted by the host shell):

bash run-hive.sh "reinit-hdfs.sh && reinit-metastore.sh"
  • You can also pass as a parameter a single bash script to run. For example,
bash run-hive.sh reinit-hdfs.sh

will reinit only HDFS.

bash run-hive.sh reinit-metastore.sh

will reinit only the Hive Metastore.
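
For orientation, here is a rough sketch of that flow. It is not the script from the repository verbatim; the container name, ports, and default scripts are taken from this article, the rest is my assumption.

#!/bin/bash
# Sketch of run-hive.sh (not the exact script from the repo).
# Default scripts to run; can be overridden by the first argument.
SCRIPTS="${1:-reinit-hdfs.sh && reinit-metastore.sh}"

if [ -z "$(docker ps -q -f name=alex-local-hive)" ]; then
    # No running container: start one and wait until Hive is responsive.
    docker run -p 10000:10000 -p 10002:10002 -d \
        --name alex-local-hive alexberkovich/docker-hive
    docker exec alex-local-hive checkisup.sh
else
    # Container is already running: run the requested scripts inside it.
    docker exec alex-local-hive bash -c "$SCRIPTS"
fi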

checkisup.sh

Intended usage example is:

docker exec alex-local-hive checkisup.sh

where alex-local-hive is the name of the container that was created from the alexberkovich/docker-hive image.¹

Use cases:

  • When you (re)start the docker container. If you can’t connect to Hive, you should wait.
  • When you format the metastore and/or (re)start the Hive Server. It takes time for the Hive Server to become responsive.

Basically, this bash script is a busy-wait loop that tries to connect to the Hive service with Beeline (a CLI tool to connect to Hive). It makes 10 such attempts with a sleep between them. In each attempt it waits for output from Beeline; if the output is not ready yet, there are 10 inner retries to read the output (with some sleep in between). If it succeeds, return code 0 is returned. If after 10 attempts the connection wasn’t established, a non-zero return code is returned.²
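
A minimal sketch of such a busy-wait loop, without the inner retry loop (the real script differs in details; see footnote ² for them):

#!/bin/bash
# Sketch of the checkisup.sh idea, not the script itself.
for attempt in $(seq 1 10); do
    sleep 2
    # Try to connect with Beeline; success means the Hive Server is up.
    if beeline -u jdbc:hive2://localhost:10000 -e 'show databases;' \
            >/tmp/checkisup.out 2>/tmp/checkisup.err; then
        echo "Hive Server is up (attempt $attempt)"
        exit 0
    fi
done
echo "Hive Server is still not up after 10 attempts" >&2
exit 7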

Reinit to Empty State

You can remove the docker container and create a new one as described here¹. This will work, but it has a couple of downsides:

  1. It takes a lot of time.
  2. You will lose the state. Maybe you have some files on the docker filesystem (not HDFS!) that you want to reuse; maybe you want to see the logs of previous runs. All of that will be gone.
  3. Maybe it is sufficient for your use case to reinit only HDFS or only the Hive Metastore.
  4. Maybe it is sufficient for your use case to reinit both HDFS and the Hive Metastore, but leave everything else (for example docker’s filesystem, or processes unrelated to HDFS & Hive) untouched.
docker exec alex-local-hive bash -c "reinit-hdfs.sh && reinit-metastore.sh && checkisup.sh"

The code snippet above will reinit both HDFS and the Hive Metastore and will return when the Hive Server is up and available to handle requests.

Note:

  • Docker doesn’t have a built-in ability to run more than one bash script. ‘bash -c’ is the recommended workaround for this use case. If there is only one script, ‘bash -c’ may be dropped (assuming CMD/ENTRYPOINT is /bin/bash or /bin/sh, which is the default anyway).
  • You can do it also in opposite order
docker exec alex-local-hive bash -c "reinit-metastore.sh && reinit-hdfs.sh && checkisup.sh"

It works this way too, but I think the order above is more readable.

docker exec alex-local-hive reinit-hdfs.sh

reinit-hdfs.sh doesn’t affect the Hive Server, so you can omit checkisup.sh from the command.

docker exec alex-local-hive bash -c "reinit-metastore.sh && checkisup.sh"

Apache Hive bundles the Hive Metastore and the Hive Server together. So, in order to reinit the Hive Metastore, the reinit-metastore.sh script also restarts the Hive Server. That is why you should check when the Hive Server is up and available to handle requests.

reinit-hdfs.sh

Intended usage is:

docker exec alex-local-hive reinit-hdfs.sh

where alex-local-hive is the name of the container that was created from the alexberkovich/docker-hive image.¹

Use case:

  • You want to start with an empty state in Hive. For example, in order to run unit tests, you can format HDFS between two unit tests.

Note:

  1. This script doesn’t format the metastore, so the metastore and HDFS will be out of sync. See reinit-metastore.sh below.
  2. On my machine this takes ~70 seconds. After this script finishes running, you can use HDFS in the regular way.

Basically, we stop all HDFS and Yarn services. We format the namenode and remove all data from the datanode, we remove other leftovers from the previous run, then we restart the HDFS and Yarn services and recreate the folders that the Hive Service expects to be present.³
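
In outline, that is roughly the following (a sketch assuming the stock Hadoop 2.x sbin scripts; the paths are those listed in footnote ³):

#!/bin/bash
# Sketch of reinit-hdfs.sh; the real script may differ.
stop-yarn.sh                       # stop Yarn daemons
stop-dfs.sh                        # stop HDFS daemons
hadoop-daemon.sh stop namenode     # stop-dfs.sh misses the namenode
rm -rf /tmp/hadoop-root /tmp/hsperfdata_root /tmp/*_resources
rm -rf /home/hadoop/hadoopdata/*   # namenode + datanode data
hdfs namenode -format -force       # recreate namenode metadata
hadoop-daemon.sh start namenode
start-dfs.sh
start-yarn.sh
# ...then recreate the folders Hive expects (see footnote 3).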

Hive Metastore Note

  1. Apache Hive runs as a regular process. There is no service/bash wrapper around it.
  2. The Hive Metastore and the Hive Service are bundled together.
  3. When the Hive Service runs for the first time, it creates the Hive Metastore tables.
  4. If we want to format Hive’s Metastore, we have to stop the Hive Service first.
  5. The process of recreating Hive’s Metastore is split between schematool and starting up the Hive Service (which creates the Version table, etc.). This script wraps these two things together.

reinit-metastore.sh

Intended usage is:

docker exec alex-local-hive reinit-metastore.sh

where alex-local-hive is the name of the container that was created from the alexberkovich/docker-hive image.¹

Use case:

  • You want to start with an empty state in Hive. For example, in order to run unit tests, you can format Hive’s metastore between two unit tests.

Note:

  1. This script doesn’t format HDFS, so the metastore and HDFS will be out of sync. See reinit-hdfs.sh above.
  2. On my machine this takes ~36 seconds. After this script finishes running, the Hive Service is available.

Basically, we stop the Hive Server, then init the metastore, and then start the Hive Server in the mode that creates metastore_db.
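
In outline (a sketch; footnote ⁴ has the details):

#!/bin/bash
# Sketch of reinit-metastore.sh; see footnote 4.
stop-hiveserver2.sh       # kill the running HiveServer2 process
init-metastore.sh         # wipe metastore_db and run schematool -initSchema
# Start HiveServer2 in the mode that creates the Metastore (Version table, etc.):
start-hiveserver2.sh 'javax.jdo.option.ConnectionURL=jdbc:derby:metastore_db;create=true'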

init-metastore.sh

It is intended for internal usage.

It deletes the metastore_db directory and runs initSchema via schematool; a sketch follows the notes below. It uses Derby as storage; the data is stored in the metastore_db directory. See also the Hive Metastore Note.

Note:

  • init-metastore.sh can technically be run while the Hive Service is up, but this should be avoided.
  • init-metastore.sh will not create the Version table, etc. (that is done only when the Hive Service starts up).
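
A sketch of what it does (assuming metastore_db lives in Hive’s working directory, Derby being the storage):

#!/bin/bash
# Sketch of init-metastore.sh.
rm -rf metastore_db                    # drop the embedded Derby database
schematool -dbType derby -initSchema   # recreate the Metastore schema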

stop-hiveserver2.sh

It is intended for internal usage.

See Hive Metastore Note.

Basically, we look for the Hive process using the ps utility and an identification string, then we use kill -9.⁵
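
A condensed sketch of that, using the identification string from footnote ⁵ directly instead of the findpid/killit helpers from func.sh (described below):

#!/bin/bash
# Sketch of stop-hiveserver2.sh.
pid=$(ps aux | grep "[j]ava[[:blank:]]-Xmx256m[[:blank:]]-Djava.net.preferIPv4Stack=true" | awk '{print $2}')
if [ -z "$pid" ]; then
    echo "HiveServer2 process not found"
    exit 1
fi
kill -9 "$pid"   # the real script uses kill -9, see footnote 5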

start-hiveserver2.sh

It is intended for internal usage.

See Hive Metastore Note.

This script can be run in two modes:

  • Without any parameter.
  • With a parameter.

Basically, we look for the Hive process using the ps utility and some identification string. If we find one, we first of all stop it.⁶

  • If start-hiveserver2.sh is run without a parameter, we start up the Hive Server with the existing Hive Metastore.
  • If start-hiveserver2.sh is run with a parameter (it is intended to be an indication to create the Hive Metastore, but technically it can be anything), it is passed through to hiveserver2 (see the sketch below).
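
A condensed sketch (footnote ⁶ has the details; passing the parameter via --hiveconf is my assumption about how it reaches hiveserver2):

#!/bin/bash
# Sketch of start-hiveserver2.sh.
# If a HiveServer2 process is already running, stop it first.
pid=$(ps aux | grep "[j]ava[[:blank:]]-Xmx256m[[:blank:]]-Djava.net.preferIPv4Stack=true" | awk '{print $2}')
[ -n "$pid" ] && kill -9 "$pid"

if [ $# -eq 0 ]; then
    hiveserver2 &                    # start with the existing Metastore
else
    hiveserver2 --hiveconf "$1" &    # pass the parameter through
fi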

func.sh

It is intended for internal usage.

pdate — a function that prints the current timestamp.

echoerr, echowarn, echoinfo — functions that mimic logger output.
echoinfo sends a message (with the current timestamp) to stdout.
echowarn and echoerr send a message (with the current timestamp) to stderr.

killit — takes a process_id as a parameter and sends it a kill signal.

If that fails, it sends kill -9.

If it still fails, it exits with return code 1.

findpid — a helper function to find a process_id.

It takes as a parameter a string that is used to grep the output of ps aux.
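
A sketch of these helpers (simplified; for example, the real killit may check differently whether the process is still alive):

#!/bin/bash
# Sketch of func.sh helpers.
pdate()    { date '+%Y-%m-%d %H:%M:%S'; }             # current timestamp
echoinfo() { echo "$(pdate) INFO $*"; }               # to stdout
echowarn() { echo "$(pdate) WARN $*" >&2; }           # to stderr
echoerr()  { echo "$(pdate) ERROR $*" >&2; }          # to stderr

findpid()  { ps aux | grep "$1" | awk '{print $2}'; } # grep string -> pid

killit() {
    local pid="$1"
    kill "$pid" 2>/dev/null                        # polite kill first
    sleep 1
    kill -0 "$pid" 2>/dev/null && kill -9 "$pid"   # escalate if still alive
    kill -0 "$pid" 2>/dev/null && exit 1           # give up
}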

P.S. You may also be interested in my Git Tutorial: https://medium.com/@alex_ber/git-tutorial-40697ec6683f

Footnotes:

¹ A simple way to create the docker container is to run the following command from the command line:

docker run -p 8030-8033:8030-8033 -p 8040:8040 -p 8042:8042 -p 8088:8088 -p 10000:10000 -p 10002:10002 -d --name alex-local-hive alexberkovich/docker-hive

You can compose this in the following simple script:

docker rm -f alex-local-hive && 
docker run -p 8030-8033:8030-8033 -p 8040:8040 -p 8042:8042 -p 8088:8088 -p 10000:10000 -p 10002:10002 -d --name alex-local-hive alexberkovich/docker-hive &&
docker exec alex-local-hive checkisup.sh

This code snippet will stop & remove the existing docker container (if it exists), create a new docker container from the docker image, start it, and return when the Hive Server is up and available to handle requests.

Of course, you can build the docker image from source; see https://github.com/alex-ber/docker-hive for more details.

Alternatively, you can copy & paste the following docker-compose.yml.
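
It is reproduced below as a minimal sketch: the image, container name, and port mappings mirror the docker run command above; everything else in the real file may differ.

version: '3'
services:
  alex-local-hive:
    image: alexberkovich/docker-hive
    container_name: alex-local-hive
    ports:
      - "8030-8033:8030-8033"
      - "8040:8040"
      - "8042:8042"
      - "8088:8088"
      - "10000:10000"
      - "10002:10002"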

Note that there are differences in networking and in logging between these two approaches.

Then, in the same directory, run the following command:

docker-compose up -d

The -d flag means run in detached mode (run the process in the background and immediately return to the shell).

or

docker-compose up -d --force-recreate

--force-recreate — the latter command will also forcibly recreate the service/container.

Moreover, you can write a simple script:

docker-compose up -d --force-recreate && 
sleep 60 &&
docker-compose exec -T alex-local-hive checkisup.sh

This code snippet will recreate the existing service/container (if it exists), start docker-compose (which starts the docker container), and return when the Hive Server is up and available to handle requests. The -T flag means don’t allocate a pseudo-TTY (terminal); this matches the default mode of docker exec.

Note: the sleep is needed between the two docker-compose commands because the first command returns immediately and the service may not yet be available (for example, the docker container may not be created yet). The 60 seconds can be reduced. It is a fairly big number because it takes time to start HiveServer2, so we just save some resources (instead of busy-waiting, we sleep). This is an optimization, but you do need some sleep.

Note: while the code snippet above works, you can use the following more efficient one (you will gain ~9 seconds):

(docker rm -f alex-local-hive || true) && 
(docker-compose kill alex-local-hive || true) &&
docker-compose up -d &&
sleep 60 &&
docker-compose exec -T alex-local-hive checkisup.sh

The first line removes the docker container (if found). The second line kills the docker-compose service (if running). The third line starts docker-compose, which starts the docker container, and the snippet returns when the Hive Server is up and available to handle requests. The -T flag means don’t allocate a pseudo-TTY (terminal); this matches the default mode of docker exec.


² Here I will describe checkisup.sh in more detail:

First of all, I use func.sh (described below) for logging. You will see in stdout and stderr what’s going on. If you use a TTY with color, you will see the stderr output in red.

The main loop makes up to 10 attempts, with a (little) sleep before each attempt. If all 10 attempts fail, there is an indication of failure in stderr and we exit with return code 7.

We run the beeline client in the background in non-interactive mode (no stdin) and try to connect to localhost:10000 without username & password. We redirect stdout to out_filename and stderr to err_filename.

If we receive a non-zero return code when launching beeline, we abort our main loop with the same return code. This should be very rare in practice.

We have an inner loop that runs up to 10 attempts, with a couple of seconds of sleep before each retry.

  • If we find ‘jdbc:hive2://’ in out_filename, we print the line where we found this string to stdout and exit with return code 0.
  • If we find ‘beeline’ in out_filename, we print to stdout that beeline is not ready yet, and we make another attempt (we continue the main loop).
  • If neither is found, we print the current retry attempt to stdout and retry (a new inner-loop iteration).

Before making a new attempt, we check whether there is an ‘Error’ string in err_filename. If found, the line with ‘Error’ is printed to stdout.

If the inner loop exceeds 10 attempts, there is some indication in stderr and return code 3 is returned.

If the main loop exceeds 10 attempts, there is some indication in stderr and return code 7 is returned.

Limitations:

  1. err_filename and out_filename use fixed names. This means that if this script runs in parallel, there will be a race condition. One possible solution is to attach a UID to the name (see the sketch after this list).
  2. The pause is a constant (and hard-coded) amount of time. If there are other Hive clients and they have the same timeout, we can end up pausing at the same time. One possible solution is to implement exponential back-off; another is to pause for a random amount of time within some bounded interval.
  3. There is a hard-coded dependency on specific strings. If the Hive version changes, these messages can change and this script can fail.
  4. The number of attempts in the main loop (10) and the number of attempts in the inner loop (10) are hard-coded.
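
As an illustration of the fix for the first limitation, here is a sketch using mktemp to generate unique file names per run (mktemp is my suggestion; the article only mentions attaching a UID):

# Give each run its own out/err files, so parallel runs don't race.
out_filename=$(mktemp /tmp/checkisup.out.XXXXXX)
err_filename=$(mktemp /tmp/checkisup.err.XXXXXX)
trap 'rm -f "$out_filename" "$err_filename"' EXIT   # clean up on exit
beeline -u jdbc:hive2://localhost:10000 >"$out_filename" 2>"$err_filename" &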

³ Here I will describe reinit-hdfs.sh in more detail:

Note: it appears that stop-dfs.sh doesn’t stop the namenode, so we do it manually.

  • We stop the namenode and all other HDFS daemons, and we stop Yarn (we don’t want anybody altering HDFS in the meantime).
  • We manually delete the HDFS datanode data, Yarn leftovers, etc.:

/tmp/hadoop-root — HDFS root scratch directory for Hive jobs. Here we delete leftovers of MapReduce jobs.

/tmp/hsperfdata_root — this directory is part of Java’s performance counters. It is a log directory created by the JVM while running code. jps, jstat and other tools use this folder to avoid connecting to the JVM (for example, to learn which Java processes are running). Removing this folder while all Java processes are stopped ensures that start/stop scripts that rely on the jps tool will work correctly. I’m not sure whether this step is 100% required, but it is cleanup.

/tmp/*_resources — shared resources. When all Java processes are stopped, it is safe to delete this folder. Cleanup.

/home/hadoop/hadoopdata/ is a custom value set in hdfs-site.xml; it is the root directory of dfs.namenode.name.dir and dfs.datanode.data.dir. The first parameter determines where on the local filesystem the DFS name node stores its data. The second determines where on the local filesystem a DFS data node stores its blocks. Deleting it ensures a fresh HDFS, with no leftovers from the previous run.

  • The next step is to format the namenode. While all DFS processes are down, we can format the namenode. This recreates the namenode metadata files.
  • Now I start the namenode, because start-dfs.sh doesn’t do this.
  • Next, we start up all DFS-related daemons.
  • Next, we start up the Yarn-related daemons.
  • At last, I recreate the empty folders (with permission 777) that the Hive server expects to exist (sketched after this list):
    /tmp
    /user/hive/warehouse
    /tmp/hadoop-yarn
    /tmp/hadoop-yarn/staging
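
That last step is roughly the following (a sketch; the exact flags in the real script may differ):

# Recreate the HDFS folders Hive expects, world-writable (777).
hdfs dfs -mkdir -p /tmp /user/hive/warehouse /tmp/hadoop-yarn/staging
hdfs dfs -chmod -R 777 /tmp /user/hive/warehouse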

⁴ Here I will describe reinit-metastore.sh in more detail:

See Hive Metastore Note.

First, we stop the Hive Service (basically, we use kill -9 for this). Then we call the internal script init-metastore.sh (basically, we delete the metastore data and run schematool -initSchema). Then we call start-hiveserver2.sh with the parameter ‘javax.jdo.option.ConnectionURL=jdbc:derby:metastore_db;create=true’ (basically, it starts the Hive Service in the mode that creates the Metastore).

⁵ Here I will describe stop-hiveserver2.sh in more detail:

See Hive Metastore Note.

I use the func.sh internal script for findpid and killit. See the func.sh section for a detailed description.

  1. Look for the Hive process using the ps utility and the string “[j]ava[[:blank:]]-Xmx256m[[:blank:]]-Djava.net.preferIPv4Stack=true” as the grep regular expression.
  2. If such a process is not found, we print an error message to stdout and exit with return code 1.
  3. If it is found, we take its process_id and pass it to the killit function (which basically sends kill -9).

Note:

The [j] bracket expression in the pattern is used to omit the ‘grep java’ process itself from the match.

Note:

  • This script depends on how the start-up parameters of the Hive server are reflected in the ps utility. If the ps utility changes the way it shows the parameters, or the parameters themselves change in a future Hive release, this script can break.

⁶ Here I will describe start-hiveserver2.sh in more detail:

See Hive Metastore Note.

I use the func.sh internal script for findpid and killit. See the func.sh section for a detailed description.

  1. Look for the Hive process using the ps utility and the string “[j]ava[[:blank:]]-Xmx256m[[:blank:]]-Djava.net.preferIPv4Stack=true” as the grep regular expression.
  2. If such a process is found, we first of all stop it.
  3. If start-hiveserver2.sh is run without a parameter, we start up the Hive Server with ‘javax.jdo.option.ConnectionURL=jdbc:derby:metastore_db’ (the existing Hive Metastore).
  4. If start-hiveserver2.sh is run with a parameter (it is intended to be ‘javax.jdo.option.ConnectionURL=jdbc:derby:metastore_db;create=true’, but technically it can be anything), it is passed through to hiveserver2.

Note:

The [j] bracket expression in the pattern is used to omit the ‘grep java’ process itself from the match.

Note:

  • This script depends on how the start-up parameters of the Hive server are reflected in the ps utility. If the ps utility changes the way it shows the parameters, or the parameters themselves change in a future Hive release, this script can break.
