How Apache Zeppelin runs a paragraph

Jongyoul Lee
Sep 30, 2016 · 4 min read

Apache Zeppelin is one of the most popular open source projects. It helps users create their own notebooks easily and share reports simply. Most users appreciate Apache Zeppelin’s functionality and extensibility. Many contributors and administrators, however, have reported difficulties debugging Apache Zeppelin because of its complicated structure. This post describes how Apache Zeppelin handles a user’s request to run a paragraph, from the server module to the interpreters.

Before diving into details, I’ll clarify some terms that will help you understand this article. The first term is paragraph, which is the minimum unit of execution. The second is note, which is a set of paragraphs and also a member of a notebook. One Zeppelin instance has only one notebook, which holds many notes. You can see the notebook on the home page.

We also need to know what an interpreter is. An interpreter in Apache Zeppelin is the gateway that connects to a specific framework to run the actual code. For instance, SparkInterpreter is a gateway to run Apache Spark, and JDBCInterpreter handles JDBC drivers. Apache Zeppelin has 19 interpreters on the master branch at the time of writing.

The server module of Apache Zeppelin consists of three parts: handling REST/WebSocket requests, storing and loading data, and managing interpreters. This post focuses on the last part, managing interpreters, but it will also help you understand the whole path for running a paragraph. There are two entry points for running paragraphs.

Both entry points call note.run(id) at the end. That method finds the actual paragraph by its id and submits it to the scheduler of the interpreter, which is resolved from the paragraph and the note. This is the flow on the front side.
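To make this flow concrete, here is a minimal sketch of what note.run(id) does conceptually. Every class and method name in it (SketchNote, SketchParagraph, SketchScheduler, interpreterNameOf) is a hypothetical stand-in, not Zeppelin's real API, and keeping one scheduler per interpreter name is a simplifying assumption.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical stand-ins: Zeppelin's real Note, Paragraph and Scheduler are much richer.
class SketchParagraph {
    final String id;
    final String text;

    SketchParagraph(String id, String text) {
        this.id = id;
        this.text = text;
    }
}

class SketchScheduler {
    // Paragraphs waiting to be executed by one interpreter.
    final Queue<SketchParagraph> pending = new ArrayDeque<>();

    void submit(SketchParagraph p) {
        pending.add(p);
    }
}

class SketchNote {
    private final Map<String, SketchParagraph> paragraphs = new LinkedHashMap<>();
    // Assumption for this sketch: one scheduler per interpreter name.
    private final Map<String, SketchScheduler> schedulers = new LinkedHashMap<>();

    void addParagraph(SketchParagraph p) {
        paragraphs.put(p.id, p);
    }

    SketchScheduler schedulerFor(String interpreterName) {
        return schedulers.computeIfAbsent(interpreterName, k -> new SketchScheduler());
    }

    // Extracts "spark" from "%spark ..."; an empty string stands for the default interpreter.
    static String interpreterNameOf(String text) {
        if (!text.startsWith("%")) {
            return "";
        }
        int end = 1;
        while (end < text.length() && !Character.isWhitespace(text.charAt(end))) {
            end++;
        }
        return text.substring(1, end);
    }

    // Find the paragraph by id, then hand it to the scheduler of its interpreter.
    void run(String paragraphId) {
        SketchParagraph p = paragraphs.get(paragraphId);
        if (p == null) {
            throw new IllegalArgumentException("No paragraph with id " + paragraphId);
        }
        schedulerFor(interpreterNameOf(p.text)).submit(p);
    }
}
```

In this sketch, running a paragraph whose text starts with "%spark" puts it into the queue for "spark", while a paragraph without a "%" prefix goes to the queue of the default interpreter.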

From this code path, you can infer the relationship between a note and its interpreters at the code level. Every note has its own mapping to interpreters, which is stored in interpreterFactory; every interpreter has its own scheduler and runs paragraphs from that scheduler; and the status of a paragraph is managed by jobListenerFactory. I’ll write another post about jobListenerFactory and the lifecycle of a paragraph.

As a first step toward understanding interpreters, we should know how they are initialized when Apache Zeppelin starts up. InterpreterFactory manages the lifecycle of interpreters. When you start the server, InterpreterFactory initializes in two major steps. First, it reads the ${ZEPPELIN_HOME}/interpreter directory, whose subdirectories contain all the jars, including those of third-party frameworks. From these, InterpreterFactory builds the list of available interpreters with their default configuration, which is used whenever a new interpreter setting is created. Second, InterpreterFactory reads ${ZEPPELIN_HOME}/conf/interpreter.json, which stores the actual configuration of the interpreters and includes the mapping between notes and interpreters. This is the same information shown in the interpreter tab of the UI. With that, interpreterFactory has finished preparing to run paragraphs. Here is the link to the code: https://github.com/apache/zeppelin/blob/master/zeppelin-zengine/src/main/java/org/apache/zeppelin/interpreter/InterpreterFactory.java#L161

Before proceeding to the next step, you should know how Apache Zeppelin launches an interpreter. Apache Zeppelin supports three modes for managing interpreters, and the main purpose of supporting different modes is to manage memory usage and CPU load efficiently. No one wants to run a MarkdownInterpreter per note, but most users would like to run SparkInterpreter with their own instances. Shared mode is the traditional model, in which all resources are shared. If you use SparkInterpreter in this mode, all running paragraphs use one Spark instance. Scoped mode uses different class loaders within the same process, which lets each note own separate resources inside that process. With JDBCInterpreter, for example, every note has its own connection. Isolated mode means that notes run their paragraphs in different processes. Two main functions decide the mode.
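One way to picture the three modes is as two keys derived from the note id: one key selects the process and the other selects the interpreter instance inside that process. The sketch below is only an illustration of that idea; ModeSketch, ModeKeys, processKey and instanceKey are hypothetical names, and the real functions in Zeppelin may compute their keys differently.

```java
// Hypothetical names throughout; this only illustrates the sharing rules described above.
enum ModeSketch { SHARED, SCOPED, ISOLATED }

class ModeKeys {
    static final String SHARED_KEY = "shared";

    // Which process a note uses: only isolated mode gives each note its own process.
    static String processKey(ModeSketch mode, String noteId) {
        return mode == ModeSketch.ISOLATED ? noteId : SHARED_KEY;
    }

    // Which interpreter instance a note uses: only shared mode reuses one instance for all notes.
    static String instanceKey(ModeSketch mode, String noteId) {
        return mode == ModeSketch.SHARED ? SHARED_KEY : noteId;
    }
}
```

Two notes that get the same process key run in the same process, and two notes that get the same instance key share the same interpreter instance. In scoped mode the process key is equal for all notes but the instance key differs, which matches "separate resources within the same process".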

Now we will look into the getInterpreter method. Basically, it returns the interpreter that runs a paragraph. You will encounter a new term, replName, when you dig into the code. It is a sort of alias for a specific interpreter. getInterpreter handles three cases, depending on the form of replName. If it is null, getInterpreter returns the default interpreter. If it does not contain a dot, getInterpreter treats it as the name of an interpreter in the default interpreter group. For example, “%pyspark” means the same as “%spark.pyspark” when spark is the default group. Finally, if it has two words separated by a dot, getInterpreter handles it as “%{group_name}.{interpreter_name}” and returns that specific interpreter. Here is the link to the code: https://github.com/apache/zeppelin/blob/master/zeppelin-zengine/src/main/java/org/apache/zeppelin/interpreter/InterpreterFactory.java#L1206
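The three cases can be sketched as follows. This is a simplified illustration, not the code behind the link: ReplNameResolver, register and resolve are hypothetical names, the replName is passed without the leading "%", and the sketch assumes that the first registered group is the default group and that the first interpreter in a group is its default.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical resolver; it returns "group.name" strings instead of interpreter objects.
class ReplNameResolver {
    // Insertion order matters: the first group registered is treated as the default group.
    private final Map<String, List<String>> groups = new LinkedHashMap<>();

    void register(String group, String... interpreterNames) {
        if (interpreterNames.length == 0) {
            throw new IllegalArgumentException("A group needs at least one interpreter");
        }
        groups.put(group, Arrays.asList(interpreterNames));
    }

    String resolve(String replName) {
        if (groups.isEmpty()) {
            throw new IllegalStateException("No interpreter groups registered");
        }
        String defaultGroup = groups.keySet().iterator().next();

        // Case 1: no replName at all, so use the default interpreter.
        if (replName == null || replName.isEmpty()) {
            return defaultGroup + "." + groups.get(defaultGroup).get(0);
        }

        // Case 2: no dot, so look the name up in the default group.
        // Case 3: "group.name" selects a specific interpreter.
        int dot = replName.indexOf('.');
        String group = dot < 0 ? defaultGroup : replName.substring(0, dot);
        String name = dot < 0 ? replName : replName.substring(dot + 1);

        List<String> names = groups.get(group);
        if (names == null || !names.contains(name)) {
            throw new IllegalArgumentException("Unknown interpreter: " + replName);
        }
        return group + "." + name;
    }
}
```

With a "spark" group holding spark, pyspark and sql registered first, resolving "pyspark" and resolving "spark.pyspark" lead to the same interpreter in this sketch.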

Another job of getInterpreter is to create a RemoteInterpreter. Apache Zeppelin launches separate processes for different interpreters and manages them via Apache Thrift, in order to avoid conflicts among the dependencies of different interpreters. If the resulting interpreter has never been requested before, getInterpreter creates a RemoteInterpreter for it. RemoteInterpreter is a wrapper that holds the Thrift client interface and acts as a connector between the server process and the interpreter processes.
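The "create it on first use" behaviour can be sketched like this. The names InterpreterSketch, RemoteWrapperSketch and LazyInterpreterRegistry are hypothetical, and a plain Function stands in for the Thrift client, so no real process is launched here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical interface; Zeppelin's real Interpreter class has more methods.
interface InterpreterSketch {
    String interpret(String script);
}

// Stand-in for RemoteInterpreter: it only forwards the call.
// The Function simulates the Thrift call to another process (an assumption of this sketch).
class RemoteWrapperSketch implements InterpreterSketch {
    private final Function<String, String> remoteCall;

    RemoteWrapperSketch(Function<String, String> remoteCall) {
        this.remoteCall = remoteCall;
    }

    @Override
    public String interpret(String script) {
        return remoteCall.apply(script);
    }
}

class LazyInterpreterRegistry {
    private final Map<String, InterpreterSketch> created = new HashMap<>();
    int creations = 0; // counts how many wrappers were actually built

    // Builds the wrapper the first time a name is requested and reuses it afterwards.
    InterpreterSketch get(String name) {
        return created.computeIfAbsent(name, n -> {
            creations++;
            return new RemoteWrapperSketch(script -> n + " ran: " + script);
        });
    }
}
```

Asking the registry twice for the same name returns the same wrapper object, so the comparatively expensive setup happens only once per interpreter.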

It’s time to find where a paragraph is executed. Let’s go back to note.run(id). That method calls intp.getScheduler().submit(p) at the end. Paragraph implements the Job interface, and the scheduler executes Jobs one by one. When paragraphs are submitted to the scheduler of an interpreter, the scheduler runs Paragraph.jobRun() for each of them.
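The "one by one" behaviour can be sketched with a single worker thread that runs jobs in submission order. JobSketch and FifoSchedulerSketch are hypothetical names, and using a single-thread executor is my choice for this illustration rather than a description of Zeppelin's scheduler implementation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical counterpart of the Job interface: one unit of work with a jobRun() method.
interface JobSketch {
    void jobRun();
}

class FifoSchedulerSketch {
    // A single worker thread runs submitted jobs sequentially, in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    void submit(JobSketch job) {
        worker.execute(job::jobRun);
    }

    // Stops accepting new jobs and waits for the queued ones to finish.
    void shutdownAndWait() {
        worker.shutdown();
        try {
            worker.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because there is only one worker, a long-running job delays every job behind it, which is the trade-off of executing jobs strictly in order.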

Paragraph.jobRun() looks complicated, but we will focus only on repl.interpret(script, context). Paragraph gets repl by calling getInterpreter and runs its interpret method. RemoteInterpreter then passes the script and context to the Interpreter in a different process and obtains the result from that process.

This is a very basic flow for running a paragraph, and this article skipped some steps to keep it easy to follow. Through this article, I tried to describe how Apache Zeppelin selects the correct interpreter and how an interpreter gets a script and executes it. There are also many new concepts that I didn’t explain; I’ll cover them in another article. Apache Zeppelin is an emerging project and changes fast. I hope, however, that this article helps you understand Apache Zeppelin better and contribute to it more easily.

Apache Zeppelin Stories

All things Apache Zeppelin written or created by the Apache Zeppelin community — blogs, videos, manuals, etc. Let us know if you would like to be added as a writer to this publication.

Thanks to Khalid Huseynov
