Spark Thrift Server Deep Dive

somanath sankaran
Published in Analytics Vidhya
3 min read · Dec 14, 2019

This is one of my stories in the Spark deep dive series.

Photo by Jez Timms on Unsplash

One of the underrated and interesting Spark services is the Spark Thrift Server. Let us see the uses of the Thrift Server in detail.

  1. Spark Thrift Server
  2. Uses of Spark Thrift Server
  3. Starting the Thrift Server and how it works
  4. Connecting Thrift Server with PyHive and pandas
  5. Exploring the Thrift Server UI

Spark Thrift Server:

It is the service that provides a server-client (JDBC/ODBC) facility with Spark.

A server-client facility means we don’t need Spark to be installed on our machine. Instead, we act as a client: we are given a server URL to which

we can connect and use the data from our application. For example, in our use case we will use the PyHive client to connect to a Spark ecosystem started on some server machine.

Uses of Spark Thrift Server

  1. Connect with BI tools like Tableau, Superset, etc.
  2. Connect to Spark tables and run queries from apps written in Java, Python, etc. without starting a Spark application

Starting the Thrift Server

We can start the Thrift Server with the start-thriftserver.sh script under $SPARK_HOME/sbin.

On starting, the Thrift Server prints the path of the file to which it is logging.

On inspecting that file, I found that it internally calls a Spark class. So the advantage is that we can specify Spark properties along with start-thriftserver.sh using --conf.

How it works

It internally calls the Hive Thrift Server and by default exposes port localhost:10000, to which we can send SQL queries and fetch results. The Spark Thrift Server will use the executor options specified to run the queries.

So we have to increase the number of executors with the --num-executors parameter (or --conf spark.executor.instances=N) to get improved latency.
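Putting the above together, a start command with tuning flags might look like this (the master, memory, and executor values are illustrative assumptions, not recommendations):

```shell
# Start the Spark Thrift Server from the Spark distribution.
# start-thriftserver.sh accepts the same options as spark-submit,
# plus --hiveconf for Hive server settings such as the listen port.
$SPARK_HOME/sbin/start-thriftserver.sh \
  --master yarn \
  --num-executors 4 \
  --conf spark.executor.memory=4g \
  --hiveconf hive.server2.thrift.port=10000
```

The script prints the path of its log file on startup; tailing that file is the quickest way to confirm the server came up and bound to the expected port.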

We can verify that the Thrift Server has started from the Spark web UI as well, where Spark will show that it is running as a Thrift server.

Connecting Thrift Server with pandas

We will use PyHive to connect to Spark and execute Spark SQL queries.

We have to install the following packages:

pip install pyhive

pip install thrift

pip install thrift_sasl

Creating a hive connection object with pyhive

Passing the connection object and hive query to pandas.read_sql

Selecting a hive table
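The three steps above can be sketched as follows. The host, port, username, and table name are assumptions for illustration (the port matches the Thrift Server default of 10000):

```python
# Sketch: connect to a running Spark Thrift Server with PyHive and
# read the result of a Spark SQL query into a pandas DataFrame.

def limit_query(table: str, n: int = 10) -> str:
    """Build a simple SELECT that samples n rows from a table."""
    return f"SELECT * FROM {table} LIMIT {n}"

def read_spark_table(table: str, host: str = "localhost", port: int = 10000):
    # Third-party imports kept local to the function:
    # pip install pyhive thrift thrift_sasl pandas
    import pandas as pd
    from pyhive import hive

    # Step 1: create a Hive connection object pointed at the Thrift Server
    conn = hive.Connection(host=host, port=port, username="spark")

    # Steps 2-3: pass the connection object and the query to pandas.read_sql,
    # which returns the selected rows as a DataFrame
    return pd.read_sql(limit_query(table), conn)

if __name__ == "__main__":
    # "default.my_table" is a hypothetical table name
    df = read_spark_table("default.my_table")
    print(df.head())
```

Because the Thrift Server speaks the HiveServer2 protocol, the same connection object also works with other HiveServer2-compatible clients.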

Exploring the Thrift Server UI

In the Spark UI, under the Thrift Server tab, we can see the queries executed, the IP from which each query came, and more details.

That’s all for the day !! :)

Github Link: https://github.com/SomanathSankaran/spark_medium/tree/master/spark_csv

Please suggest topics in Spark that I should cover, and send me suggestions for improving my writing :)

Learn and let others Learn!!
