Zeppelin 0.8.0 New Features
Zeppelin 0.8.0 is another major release after 0.7. It delivers many new features, and this article goes through some of them.
- Yarn Cluster mode for Spark Interpreter
- IPython Interpreter
- Interpreter Lifecycle Manager
- Hadoop NotebookRepo
- Hadoop Config Storage
- Interpreter Recovery
- Generic ConfInterpreter
- New SparkInterpreter Implementation
Yarn Cluster Mode for Spark Interpreter
Before 0.8.0, Zeppelin only supported yarn client mode for the Spark Interpreter, which means the driver runs on the same host as the Zeppelin Server. This puts high memory pressure on the Zeppelin Server host, especially when you run the Spark Interpreter in isolated mode.
Zeppelin 0.8.0 adds yarn cluster mode for the Spark Interpreter: the driver runs inside the yarn cluster, which relieves the memory pressure on the Zeppelin host. The only thing you need to do is set master to yarn-cluster in the Spark interpreter setting. Although you can set SPARK_HOME in zeppelin-env.sh, it is highly recommended to define SPARK_HOME in the interpreter setting. Here's a screenshot of a Spark interpreter setting that supports yarn-cluster mode.
If you want to run the Spark Interpreter in a kerberized cluster, you need to specify the keytab and principal in zeppelin-site.xml. This keytab and principal are shared within the Zeppelin server, which also uses them for storing notes or configuration on the HDFS of a kerberized cluster (covered later).
Impersonation is also supported for yarn cluster mode. You just need to enable it in the interpreter setting. Make sure you select Isolated Per User first, otherwise you won't see the User Impersonation option.
IPython Interpreter
IPython Interpreter is a new interpreter in 0.8.0, intended to replace Zeppelin's old python interpreter. It provides a user experience comparable to Jupyter Notebook. For details, you can refer to this link.
Interpreter Lifecycle Manager
Prior to 0.8.0, users had to restart the interpreter from the interpreter setting page or the note page to kill the interpreter process. Resources were wasted when users left work with the interpreter still alive. In 0.8.0, Zeppelin introduces an interpreter lifecycle manager, which manages the lifecycle of interpreters, in particular when to terminate them. For now there is only one implementation, TimeoutLifecycleManager, which terminates an interpreter once it has been idle longer than a threshold, one hour by default.
The following 3 properties let you customize the interpreter lifecycle manager.
<description>LifecycleManager class for managing the lifecycle of interpreters, by default interpreter will
be closed after timeout</description>
<description>milliseconds of the interval to checking whether interpreter is time out</description>
<description>milliseconds of the interpreter timeout threshold, by default it is 1 hour</description>
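To make the roles of the check interval and the timeout threshold concrete, here is a minimal Python sketch of an idle-timeout check. It is not Zeppelin's implementation: the IdleTracker class and both function names are hypothetical, and only the one-hour default (3,600,000 ms) comes from the text above.

```python
import time
from dataclasses import dataclass


@dataclass
class IdleTracker:
    """Hypothetical stand-in for one interpreter process (not a Zeppelin class)."""
    name: str
    last_used_ms: int
    alive: bool = True


def terminate_idle(interpreters, now_ms, threshold_ms=3_600_000):
    """Mark every interpreter idle for longer than threshold_ms as terminated.

    Returns the names of the interpreters terminated in this pass.
    """
    terminated = []
    for interp in interpreters:
        if interp.alive and now_ms - interp.last_used_ms > threshold_ms:
            interp.alive = False
            terminated.append(interp.name)
    return terminated


def run_checker(interpreters, check_interval_ms, threshold_ms, passes):
    """Run a fixed number of checks, sleeping check_interval_ms between them."""
    for _ in range(passes):
        now_ms = int(time.time() * 1000)
        terminate_idle(interpreters, now_ms, threshold_ms)
        time.sleep(check_interval_ms / 1000)
```

The check interval only decides how often the idle test runs; the threshold decides how long an interpreter may sit unused before it is terminated.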
Hadoop NotebookRepo
In 0.8.0, Zeppelin adds HDFS as another NotebookRepo option. Storing notes on HDFS gives you more reliability thanks to HDFS replication, and more security thanks to Hadoop security.
To use Hadoop NotebookRepo, you need to make the following settings.
- Add org.apache.zeppelin.notebook.repo.FileSystemNotebookRepo to zeppelin-site.xml
<description>hadoop compatible file system notebook persistence layer implementation</description>
- Specify the Hadoop configuration directory in zeppelin-env.sh so that Zeppelin can find the right Hadoop configuration files.
- Set the notebook directory in zeppelin-site.xml to a path on HDFS.
Hadoop Config Storage
Zeppelin has a lot of configuration, which prior to 0.8.0 was stored in local files:
- interpreter.json (This file contains all the interpreter setting info)
- notebook-authorization.json (This file contains all the note authorization info)
- credential.json (This file contains the credential info)
In 0.8.0, Zeppelin provides unified storage for all these configuration files and supports storing them on HDFS. To store configuration on HDFS, make the following configuration changes:
- Set zeppelin.config.fs.dir to an HDFS path.
- Also specify the Hadoop configuration directory in zeppelin-env.sh so that Zeppelin can find the right Hadoop configuration files.
Interpreter Recovery
Without recovery, all interpreter processes are shut down when the Zeppelin Server is terminated. This is inconvenient during upgrades or maintenance. It would be better to keep the running interpreter processes alive and reconnect to them when the Zeppelin Server restarts. This is also a prerequisite for Zeppelin HA, when falling back to a standby Zeppelin Server.
In 0.8.0, Zeppelin supports interpreter recovery by storing information about running interpreter processes in persistent storage, so that the Zeppelin Server can reconnect to these processes by reading that information back.
By default, recovery is not enabled, which matches the behavior prior to 0.8.0. You can set the following configuration to enable recovery.
<description>Location where recovery metadata is stored</description>
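The idea behind recovery (persist the connection info of running interpreter processes, then read it back after a restart) can be sketched in a few lines of Python. This is only an illustration under assumptions: the JSON layout, the file name running_interpreters.json, and the function names are all hypothetical and do not reflect Zeppelin's actual storage format.

```python
import json
import os


# Hypothetical file name; Zeppelin's real recovery metadata layout may differ.
RECOVERY_FILE = "running_interpreters.json"


def save_running_interpreters(recovery_dir, processes):
    """Persist {interpreter id: {"host": ..., "port": ...}} under recovery_dir."""
    os.makedirs(recovery_dir, exist_ok=True)
    path = os.path.join(recovery_dir, RECOVERY_FILE)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(processes, f)
    return path


def load_running_interpreters(recovery_dir):
    """Return the saved process info, or {} when nothing was recorded."""
    path = os.path.join(recovery_dir, RECOVERY_FILE)
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

On restart, a server following this scheme would call load_running_interpreters and try to reconnect to each recorded host and port instead of launching new processes.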
Generic ConfInterpreter
Zeppelin's interpreter setting is shared by all users and notes. If you want different settings, you have to create a new interpreter, e.g. you can create spark_1 with one configuration and spark_2 with configuration spark.jars=jar2, so that spark_1 and spark_2 can use different dependencies for different notes or users. But this approach is neither convenient nor manageable: as more users work with Zeppelin, the number of interpreters explodes.
Generic ConfInterpreter provides more fine-grained control over interpreter settings and more flexibility.
ConfInterpreter is a generic interpreter that can be used by any interpreter, including your custom interpreters. Its input should be in property file format, and it must run before the interpreter process is launched. When the interpreter process is launched is determined by the interpreter mode setting, so users need to understand Zeppelin's interpreter modes and be aware of when the process starts. For example, if the Spark interpreter is set to isolated per note, each note launches its own interpreter process. In this scenario, users need to put ConfInterpreter in the first paragraph of the note. Otherwise the custom settings cannot be applied (in fact, the paragraph will fail).
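The ordering constraint described above (property-style input that must be applied before the interpreter process starts) can be modeled with a small Python sketch. Everything here is hypothetical: InterpreterProcess and parse_properties are illustrative names, and the parser only handles plain key=value lines, which is a simplification of the full property file format.

```python
def parse_properties(text):
    """Parse 'key=value' lines; blank lines and '#' comments are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {line!r}")
        props[key.strip()] = value.strip()
    return props


class InterpreterProcess:
    """Hypothetical model of one interpreter process and its settings."""

    def __init__(self, defaults):
        self.properties = dict(defaults)
        self.launched = False

    def apply_conf(self, text):
        # Overrides are only accepted before launch, mirroring the rule that
        # the conf paragraph must run before the interpreter process starts.
        if self.launched:
            raise RuntimeError("interpreter process already launched")
        self.properties.update(parse_properties(text))

    def launch(self):
        self.launched = True
```

In this model, a conf paragraph that runs first overrides the shared defaults for that process only, while one that runs after launch fails, which matches the behavior described for a misplaced ConfInterpreter paragraph.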
New Spark Interpreter
A new Spark interpreter is added in 0.8.0. It refactors most of the Spark interpreter to make it more robust and easier to maintain and extend. You can enable it by setting the corresponding interpreter property to true. If you hit odd errors in the old Spark interpreter, give the new one a try.
Here are other new features and improvements that I did not cover (thanks to Maksim Belousov).
Main Page Improvements
- The home page became customizable.
- The note list was reworked. It is sorted now. Search by note name is very fast.
- Paragraphs run sequentially across the entire note. In previous releases, paragraphs for all interpreters started running simultaneously.
- Run paragraphs from top to current and from current to bottom.
- Compare git revisions.
- Global dynamic forms for whole note.
Code Editor Improvements
- TAB key can be used for auto-complete.
- Change the font size in paragraph.
- Search and replace code in note.
Result Display Improvements
- Angular UI Grid (http://ui-grid.info/) is now used to display tabular results. It is powerful, fast, and more functional than the previous table viewer.
- Support for streaming tables and builtin visualizations.
Helium is a plugin system that can extend Zeppelin considerably. You can add custom visualizations or applications to a note.
Asynchronous query for metadata in JDBC Interpreter: your queries are now pushed to the database immediately, so you don't need to wait for the metadata query. You can set the lifetime of autocomplete metadata with the parameter "default.completer.ttlInSeconds".
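As a rough illustration of what a lifetime (TTL) for autocomplete metadata means, here is a minimal Python sketch of a time-based cache. It is not the JDBC interpreter's code: the TtlCache class, its method names, and the injectable clock are assumptions made for the example.

```python
import time


class TtlCache:
    """Hypothetical cache that reloads a value once it is older than ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (time stored, value)

    def get(self, key, loader):
        """Return the cached value, calling loader() only when it is missing or expired."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = loader()
        self._entries[key] = (now, value)
        return value
```

With a TTL of 60 seconds, for example, repeated completions within a minute reuse the stored metadata, and the next request after that triggers a fresh metadata query.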
SAP BusinessObjects Interpreter
A new interpreter was added: %sap. This interpreter can connect to your SAP BusinessObjects Platform and create queries over universes. You can use prompts and very complex conditions in the "where" clause. Autocomplete helps you write the query.