Apache Superset Beginner Experience
Written By: Robert Sunderhaft
One of the essential tasks of a data scientist is to obtain useful results from large data sets. Preferably, these findings are showcased through a simple and sleek visualization. With the right graph, layout, and information, data can transform from numbers into a form of art. To accomplish this, though, a data scientist needs a platform to build upon.
There are many applications to start with, but my introduction to data science visualizations was through the up-and-coming Apache Superset. Since I am new to the field and haven’t used any other software, I can judge Superset without comparing it to other applications. So here they are, the pros and cons of Superset through the eyes of a novice data scientist:
Many Visualization Options:
When I started using Superset, the very first feature that stood out to me was the variety of visualizations one could make. Superset offers everything from simple graphs to complex charts. With that much variety, it is more likely than not that you will find the right visualization for your data.
Simple To Get Started:
Onboarding onto Superset and making my first visualization was a relatively simple process. The site is set up cleanly, and moving from query to visualization to showcasing your results is straightforward. With the help of experienced users, I was able to start making effective charts and contributing to the overall team within the first week.
Not only is the application simple to use, but since Superset is open source, it is free to the public. Together, the simplicity and low cost of Superset make it a great place to start making visualizations.
Limits The Amount Of Data Visualized:
The first and probably largest issue I had with Superset was its cap on the amount of data you can put into a single visualization. Anyone working with a large data set would most likely run into the same problem I did.
Let me break down my experience: Typically, when I finished my query for the specific problem I was working on, I would attempt to create a chart to showcase the results in the form of a histogram, bar chart, or pie chart. When customizing my chart, I always had to choose a limit on the number of data points presented in the graph, and the largest allowed limit was 50,000 rows. Since I was working with a large data set containing millions of data points, this problem became apparent very quickly and prevented me from visualizing the results of a few queries.
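To make the scale of this problem concrete, here is a minimal sketch in plain Python of one general way to fit under a row cap: aggregate the raw points into bins before charting. This is not Superset code and not something I ran inside Superset. The `bin_values` helper, the synthetic data, and the bin width of 5 are all assumptions made up for illustration; only the 50,000-row figure comes from my experience above.

```python
import random
from collections import Counter

ROW_LIMIT = 50_000  # the largest row limit I was offered when building a chart


def bin_values(values, bin_width):
    """Count how many values fall into each bin of width `bin_width`.

    Returns a sorted list of (bin_start, count) pairs, one row per
    non-empty bin. (Hypothetical helper, for illustration only.)
    """
    counts = Counter()
    for v in values:
        bin_start = (v // bin_width) * bin_width  # left edge of the bin
        counts[bin_start] += 1
    return sorted(counts.items())


# Synthetic stand-in for a large data set (assumed, not real data).
random.seed(0)
raw = [random.uniform(0, 100) for _ in range(1_000_000)]

rows = bin_values(raw, bin_width=5)
print(len(raw), "raw rows ->", len(rows), "aggregated rows")
print("fits under the row limit:", len(rows) <= ROW_LIMIT)
```

The raw list is far above the cap, while the binned result has only one row per bin. For a histogram or bar chart, doing this kind of grouping in the query itself may be enough to stay under the limit, though it will not help for charts that genuinely need every individual point.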
Rigid And Not Flexible:
What I mean by rigid is that many of the options in Superset offer limited choices that cannot be customized. I found this problem to be small but frequent across many features. For example, the dashboard, where one can showcase a group of visualizations in one place, limits the number of tabs you can have in a single window. Another example is the color options for graphs. Though Superset allows you to choose a color scheme, it won’t allow you to pick a specific color for a specific chart. These are just two examples, but many more features display this same type of rigidity across the platform.
No Organization System:
Organization of personal data and charts started to become a problem after I had saved 15 charts / queries. All charts are ordered by the date of last edit in a single column. As the number of queries starts to pile up, it gets harder and harder to find the query you want to view or edit, regardless of how well it was titled. Without any organization / folder system, it became increasingly difficult for me to revise past charts over time.
Visualization And Query Disconnect:
To make a visualization in Superset, you must first write an appropriate query against the original data set to obtain useful results. Once you finish your query, you can easily transform it into a chart or graph with the click of a button. This process is great until you want to change your original query to modify your chart or graph. Since Superset’s visualizations are independent of the queries, your modifications won’t transfer over. Thus, to change the query underlying a chart, you must create an entirely separate visualization. If you got as far as the dashboard and wanted to make a change to several graphs, the time required to fix them would be immense.
Overall, I think Superset was a great place to start my data visualization experience because of its simplicity. For a beginner like myself, it was very easy to onboard, learn the platform, and start creating useful graphs.
Though there were several things that I didn’t like about Superset, I would still say it is a great visualization tool. One possible reason for Superset’s complications is that it is still a very young platform. Since the application is open source, though, the problems described in this article could be fixed quickly. It is for this reason that I would recommend using Superset now and in the future.
With our multiple large data sets on the 99P Labs Developer Portal, we see this as a great tool to integrate. Our vision is to display our data set health checks on the developer portal with Superset. This will help us quickly show users the quality of our data, which is a top priority. If you want to learn more or check out our data sets, please visit our Developer Portal.