Setting up a Thrift Server

Run SQL queries against your data

Sai Peddy
6 min read · Feb 20, 2018
Photo by Rob Lambert on Unsplash

This blog post is the final installment of Setting up a Big Data SQL Querying System. It will walk you through setting up a Thrift server so that you can query the Parquet data you previously generated. Whether or not you have been following the series of blog posts, this post will still walk you through setting up a Spark/Hive Thrift server.

Before we get any further, you might be asking: what exactly is Thrift? This blog helped me tremendously when I was experimenting with the Thrift server to improve its performance. It also has an excellent description of the Thrift server and its history. Here is just a piece of that information:

Spark (SQL) Thrift Server is an excellent tool built on the HiveServer2 for allowing multiple remote clients to access Spark. It provides a generic JDBC endpoint that lets any client including BI tools connect and access the power of Spark. Let’s talk about how it came to be and why you should use it. — Russell Spitzer

Two notes before starting:

  1. This tutorial assumes you have Spark downloaded (I used Spark 2.2.1)
  2. This is not a production deployment; everything was done locally.

Starting ThriftServer

Starting the Thrift server is fairly simple:

#Spark_Path = the directory where you extracted your Spark download
cd ${Spark_Path}/sbin/
./start-thriftserver.sh
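
If the script runs cleanly, it prints the path of a log file that you can tail to confirm the server came up. The exact file name below is an assumption based on Spark's default log location for the sbin scripts:

#Sanity check: the log file name/location may differ on your machine
tail -f ${Spark_Path}/logs/*HiveThriftServer2*.out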

Your Thrift server should now have started. However, running it like this gets you a fairly bare-bones system for big data queries, though it can work. Expanding on this, you have a few options when configuring Thrift, but we will cover just the following two:

  1. Limit the resources that the Thrift server application consumes
  2. Connect to a bigger Spark cluster to increase the resources available to it

Configuring thrift:

This section is not mandatory but has useful configuration information.

The following configurations can be added to the ./start-thriftserver.sh command.

Specify the port to listen on: --hiveconf hive.server2.thrift.port=9999
Limit the executor memory: --executor-memory 1g — This gives the thrift server 1GB of memory per executor that it runs on.
Limit the total core count: --conf spark.cores.max=100 (this caps the total number of cores the application can use across the cluster)

So, you just witnessed three different ways to set configs for the Thrift server. This might be confusing if it is your first time with Thrift. If you visited the blog post above, you know that the Thrift server is essentially a combination of Spark and Hive, which leads to these config styles. Typically --hiveconf sets Hive configurations, while --conf sets Spark configurations. Everything else is a built-in option that the start script accepts, which includes specifying the Spark master to connect to.

Connecting to masters: --master spark://10.111.111.111:7077,10.111.111.112:7077
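
Putting these options together, a full start command might look like the following sketch (the addresses, port, and resource values are just the placeholder examples from above):

cd ${Spark_Path}/sbin/
./start-thriftserver.sh \
  --master spark://10.111.111.111:7077,10.111.111.112:7077 \
  --hiveconf hive.server2.thrift.port=9999 \
  --executor-memory 1g \
  --conf spark.cores.max=100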

When starting up the Thrift server, if you specify a master (or multiple) for the server to connect to, then the server is able to use the resources available on that cluster. If you navigate to your Spark cluster's UI, you should see the Thrift server listed as a running application.

Clicking on that application takes you to the application detail page, where you can see how the Thrift server is using the cluster.

The UI can definitely be useful, but it also has a few quirks that are worth knowing about. The best way to learn is to start experimenting with queries and seeing what happens (I’ll show you one way to start experimenting in the next section). However, one useful thing to know is that a SQL query does not necessarily correlate to just one job — a query can sometimes create multiple jobs.

Perfect time to show a nice diagram about how the overall system connects and interacts.

Connecting Via Beeline:

Beeline should already be available with the Spark package you downloaded. Using the following commands, you can connect to your Thrift server with Beeline.

#Assumption: You are running thrift on the default port
cd ${Spark_Path}/bin/
./beeline
!connect jdbc:hive2://localhost:10000
#You will be prompted for username & password - press enter if not set
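
Alternatively, Beeline accepts the JDBC URL (and optionally a username) directly on the command line; the username value here is just a placeholder:

#Hypothetical one-liner: -u passes the JDBC URL, -n a username
./beeline -u jdbc:hive2://localhost:10000 -n anonymous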

If you are following the series of blog posts, you should already have Parquet data in the following folder structure:

| testbucket
---|parquetFiles
------|year=2017
---------|month=12
------------|day=25
---------------|part0000
---------------|part0001
------|year=2018
---------|month=02
------------|day=17
---------------|part0000
---------------|part0001

with the data having a format similar to this:

{
  "dest_ip": "20.123.123.12",
  "dest_port": 8000,
  "isBoolean": true,
  "timestamp": "2018-07-08T17:48:40Z",
  "@difficultField1": "test",
  "difficult-field2": "test",
  "nested_field": {
    "field_1": "fieldVal",
    "field_2": "fieldVal"
  }
}

A few new fields were added to show some special steps that are required when creating tables on top of this data. Adjust your table creation according to your data. Specifically, the fields @difficultField1, difficult-field2, and nested_field require slightly different syntax.

One more note prior to creating your table — it is entirely fine to have your data saved locally and not in S3; just update the LOCATION value in the statement below to point to the local path.

Creating Your Table:

CREATE EXTERNAL TABLE parquetData(
  dest_ip string,
  dest_port int,
  isBoolean boolean,
  `timestamp` TIMESTAMP,
  `@difficultField1` string,
  `difficult-field2` string,
  nested_field STRUCT<field_1:STRING, field_2:STRING>)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3a://testbucket/parquetFiles/'
TBLPROPERTIES ("parquet.compress"="SNAPPY");

MSCK REPAIR TABLE parquetData;

ALTER TABLE parquetData RECOVER PARTITIONS;

Let’s go over the three commands above and what they are doing. The CREATE EXTERNAL TABLE command creates an external table named parquetData with the specified schema. Two important notes about the schema:
1. Fields with special characters are surrounded by backticks (`)
2. Nested fields require that you specify the nested schema (see the example query below)
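
For reference, here is a hypothetical query against those fields once the table exists: the special-character columns are quoted with backticks, and nested fields are reached with dot notation.

-- Hypothetical example using the columns defined above
SELECT `@difficultField1`, `difficult-field2`, nested_field.field_1
FROM parquetData
LIMIT 10;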

Moving on — you can also specify what partitioning exists on top of the data. If you filter on the partition columns at query time, your queries should get a performance boost, since only the matching partitions need to be read (see the example below). There are a few other pieces of the command that specify the formats of the data, such as the SERDE and the input and output formats, which essentially tell the table that the underlying data is Parquet and how to handle it.
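
For example, a query like the following (the values are placeholders matching the folder structure above) only needs to scan the files under the matching year/month/day partition:

-- Only reads files under year=2018/month=02/day=17
SELECT dest_ip, dest_port
FROM parquetData
WHERE year = 2018 AND month = 2 AND day = 17;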

The last two commands go together. If you were to just create your external table and then try querying it, you would get 0 results. The reason is that when your data is in S3, the Thrift server does not automatically pick up all the partitions that already exist there. You have to repair the table and recover the partitions. Once you do, your external table will pick up all of the latest partitions available in S3. However, any new partitions will not be found until you run the commands again.

When you create tables in Hive, where partitioned data already exists in S3 or HDFS, you need to run a command to update the Hive Metastore with the table’s partition structure. — Stack Overflow User
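
A quick way to sanity-check that the repair worked is to ask the metastore which partitions it now knows about, assuming your Hive/Spark version supports the standard SHOW PARTITIONS statement:

SHOW PARTITIONS parquetData;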

You should now be able to start querying your data with SQL commands! Try a simple query:

SELECT * FROM parquetData;
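
Once that works, a slightly more interesting (still hypothetical) query is to count events per partition, which also exercises the partition columns:

SELECT year, month, day, COUNT(*) AS events
FROM parquetData
GROUP BY year, month, day
ORDER BY year, month, day;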

Enjoy the power ~

Depending on your scale, you may run into performance issues. The blog I referred to at the start of the post should be pretty useful; it covers a lot of options for performance improvements.
