Deep Dive into Animation Plot with Real-time Data

Hayden Yan
SFU Professional Computer Science
15 min read · Feb 11, 2022

Authors: Jiahe Wang, Zhi Zheng, Huiyi Zou, Shilin Wang, Hayden Yan

This blog is written and maintained by students in the Master of Science in Professional Computer Science Program at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/mpcs.

Animation Graph with Static Data

Data visualization is undoubtedly one of the most popular and fascinating topics in data analysis. It is typically considered the end product of the whole data analysis process: from a good visualization, we can easily recognize patterns and derive insights.

Nowadays, visualization tools and libraries such as Tableau, Matplotlib, and Plotly have gained tremendous popularity thanks to the growth of data science. In this article, we are going to look at a specific type of visualization called animation visualization, a powerful technique for highlighting trends of change and visualizing time series data. Animated visualizations are widely used in a variety of industries, from the financial sector to public health, and clearly possess vast potential. They are an essential new skill that every data science enthusiast should learn.

This article first introduces the steps to create an animated line chart using Covid data from the BC Centre for Disease Control. We then use an animated bar chart race to show the GDP changes of 12 countries, demonstrating how to animate plots in Matplotlib with static time series data, before moving on to live streaming data with Kafka.

Animated Line Chart

Now, let us use Matplotlib to animate a real-life example. There have been many intriguing visualizations of COVID-19 since the pandemic began. In this article, we will use a dataset that contains the number of confirmed Covid cases every day from January 9th, 2020 up to the day this article was written (February 7th, 2022), and which is updated daily. First, let us take a look at the schema of our data:

The columns of interest are “Date” for our time series, “HA” for the different health authorities within BC, “HSDA” indicating the sub-region under the given Health Authority, and “Cases_Reported,” which denotes the newly confirmed cases of the day. There is already a convenient category “All” under “HSDA” that represents the newly confirmed cases of the corresponding Health Authority as a whole. Therefore, after loading the required libraries, we preprocess the data by selecting the columns we need and the rows where “HSDA” equals “All” using the “loc” function. It is worth noting that if there were no “All” category, we could alternatively aggregate the case counts of each date and health authority using the “groupby” function, as in the sketch below.
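A minimal sketch of this preprocessing, assuming a local copy of the BC CDC regional summary file and the column names shown above:

import pandas as pd

# Load the BC CDC regional summary file (file name assumed; adjust to your copy).
df = pd.read_csv('BCCDC_COVID19_Regional_Summary_Data.csv', parse_dates=['Date'])

# Keep only the columns of interest and the rows where HSDA is "All",
# which already aggregates cases per Health Authority.
df = df.loc[df['HSDA'] == 'All', ['Date', 'HA', 'Cases_Reported']]

# Alternative if no "All" category existed: aggregate by date and Health Authority.
# df = df.groupby(['Date', 'HA'], as_index=False)['Cases_Reported'].sum()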

Now that the data is ready, all we need to do is define the “animate()” function that updates each frame, which we will later pass into Matplotlib’s FuncAnimation to automate the animation for us.

First, dealing with DateTime data can be essential yet hard to get right. Suppose we want to plot the last 20 days of confirmed cases. We first need a helper function to select the needed rows. It is also worth noting that, because we are animating this plot, we need to start from 20 days ago as day 1 and progress up to today (day 20); the plot advances using the argument “index,” which denotes how many frames have been drawn so far.
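A sketch of such a helper, assuming the preprocessed dataframe from the previous step:

# Hypothetical helper: return the rows belonging to the animation window,
# where `index` is the number of frames drawn so far (1 .. days).
def get_window(df, index, days=20):
    last_day = df['Date'].max()
    start = last_day - pd.Timedelta(days=days)         # 20 days ago
    end = last_day - pd.Timedelta(days=days - index)   # advances with each frame
    return df[(df['Date'] > start) & (df['Date'] <= end)]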

After finishing the helper function that selects the data we want to plot, we can move on to the next step: defining the plot drawn in each frame.
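A sketch of the per-frame plotting function, reusing the get_window() helper above and drawing one line per Health Authority:

import matplotlib.pyplot as plt

def animate(index):
    data = get_window(df, index)
    plt.cla()                                  # clear the axes before redrawing
    for ha, group in data.groupby('HA'):
        plt.plot(group['Date'], group['Cases_Reported'], label=ha)
    plt.legend(loc='upper left')
    plt.xticks(rotation=45)
    plt.tight_layout()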

Then, we can put everything together by calling Matplotlib’s “FuncAnimation” function, and we should be able to see the animation by calling “plt.show()”.
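Putting it together might look like the following sketch; the figure size and frame interval are illustrative:

from matplotlib.animation import FuncAnimation

fig = plt.figure(figsize=(10, 6))
# Call animate() once per day in the 20-day window, waiting 500 ms between frames.
ani = FuncAnimation(fig, animate, frames=range(1, 21), interval=500, repeat=False)
plt.show()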

To save the animation, we can either use the default and recommended PillowWriter to save it as a GIF, or download FFmpeg and use “FFMpegWriter” to save it as a video in a format such as MP4 or AVI.

Save as GIF:
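A sketch using PillowWriter (the output file name is illustrative):

from matplotlib.animation import PillowWriter

ani.save('covid_cases.gif', writer=PillowWriter(fps=5))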

Save as MP4:
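A sketch using FFMpegWriter, which requires FFmpeg to be installed on the machine:

from matplotlib.animation import FFMpegWriter

ani.save('covid_cases.mp4', writer=FFMpegWriter(fps=5))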

Congratulations! We have now created our first animated plot using Matplotlib! With more practice, we will feel more comfortable using this new skill to tackle static time-series data. In the following sections, we will discuss using Kafka to live-stream data and Matplotlib to animate it.

Bar Race Chart

We may have seen YouTube videos such as “Top 15 Countries By GDP (1900–2019)” and wondered how these animated graphs are created. We must admit that this kind of graph is very eye-catching. Once we learn the technique, we can make our presentations more vivid. At the very least, we can make YouTube videos!

We use actual and projected national GDP data from 1980 to 2026 from the International Monetary Fund:

First, we rename the first column to “Countries” and replace “no data” with 0. After that, we swap the rows and columns with pandas.DataFrame.transpose to make the bar chart race easier to build. Note that the old index may get in the way after swapping rows and columns, so we need to drop it. If reset_index does not work, temporarily save the current dataframe as a CSV file with index=False and reread it; the annoying index disappears. There are more than 200 countries and regions in our data, but we only take the GDP of 12 countries as an example to keep the animation fast. A sketch of these steps follows.
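In the sketch below, the file name and the 12 example countries are placeholders, and setting “Countries” as the index before transposing is one way to avoid the leftover-index problem mentioned above:

import pandas as pd

gdp = pd.read_csv('imf_gdp_1980_2026.csv')                   # assumed file name
gdp = gdp.rename(columns={gdp.columns[0]: 'Countries'})      # rename first column
gdp = gdp.replace('no data', 0)

# Keep 12 countries as an example (placeholder list).
countries = ['United States', 'China', 'Japan', 'Germany', 'United Kingdom',
             'India', 'France', 'Italy', 'Canada', 'Korea', 'Russia', 'Brazil']
gdp = gdp[gdp['Countries'].isin(countries)]

# Swap rows and columns so that years become the index and countries the columns.
gdp = gdp.set_index('Countries').transpose().astype(float)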

Now we have the final dataframe, with the years as the index for the bar chart race.

After data preparation, we can build the bar chart race. We need the bar_chart_race package, which is built on top of Matplotlib, to create such a chart. The first step is to install the package; the example uses a Jupyter Notebook to install and import it. Then comes the magical moment:
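A sketch of this step; parameter values such as the output file name are illustrative:

# In a Jupyter Notebook cell:  !pip install bar_chart_race
import bar_chart_race as bcr

bcr.bar_chart_race(
    df=gdp,                     # years as the index, one column per country
    filename='gdp_race.mp4',    # or .gif; omit to render inline in the notebook
    n_bars=12,                  # number of bars shown at once
    steps_per_period=10,        # interpolated frames between two years
    period_length=500,          # milliseconds spent on each year
    title='GDP by Country, 1980-2026')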

We can modify the parameters to achieve the desired effect, including colour, figure size, number of bars, and so on.

The graph below shows the result. The bar chart race package makes it easy to create fantastic animated graphs; the main task is the data preparation, turning our data into a structure the package can use directly.

Getting Familiar with Kafka

One of the daily duties of many data engineers and scientists is dealing with enormous amounts of data. Data can come from anywhere. Besides static datasets, we frequently need to deal with real-time data streams. Confusion can quickly arise when consumers receive large amounts of data from different sources and in various data types.

This situation might be easy to handle since it immediately reminds us of the familiar publish-subscribe messaging pattern. Apache Kafka is an event streaming platform that acts as middleware between multiple apps and services. As a result, Kafka becomes the new data source, regardless of where the data originates. Data conversion is not required because the data format is standardized, making communication between programs easier.

According to Apache Kafka’s official documentation, Kafka combines three key capabilities for event streaming:

  • Publish and subscribe to streams of records.
  • Store streams of records in a fault-tolerant, persistent manner.
  • Process streams of records as they occur.

Kafka is a distributed system that runs on one or more servers, potentially across multiple data centres, which makes it scalable. It also offers high availability and high fault tolerance.

Kafka runs as a cluster composed of multiple nodes called brokers. Each broker can be a leader or a replica of the leader.

Topics and Logs. Source: Apache Kafka Official Documentation

Brokers are responsible for managing the partitions. Messages are placed into different partitions according to their key. Each partition is an immutable, ordered sequence of records appended to a structured commit log. The message retention period is configurable; however, Kafka is not a database, and these messages are only “temporarily” kept as a log.

Kafka stores streams of records in topics. A topic is simply a set of partitions; each topic has a unique name and is formed by partitions located on the same broker or spread across different brokers.

A Kafka cluster with 4 brokers, 1 topic and 2 partitions, each with 3 replicas. Source: Intra-cluster Replication in Apache Kafka

Replication is another critical feature of Kafka. As previously stated, a broker can be either a leader or a replica. Consider a cluster with four brokers: two are leaders, and the others are replicas of the leaders. A message generated by the producer is first saved in a partition on the leader, and the replicas then fetch the message. The partitions that make up a topic can be distributed among multiple brokers, regardless of whether they are leaders or replicas. Because of these replicas, Kafka ensures that each partition has backups, guaranteeing topic availability.

A single stream record is a key-value pair associated with a timestamp. Kafka stores all messages with the same key in the same partition. If the application does not provide a timestamp, the producer stamps the record with its current time. The timestamp can be configured as CreateTime or LogAppendTime: the former represents when the producer created the message; the latter represents when the broker received/wrote the message. The timestamp allows the consumer to know the time of each message and build a time series for analysis.

Kafka is capable of handling real-time data pipelines. A real-time application typically requires a constant flow of data that can be processed instantly or with minimal delay; zero latency is one of the requirements for hard real-time. An article discussing the real-time capability of Kafka indicates that Kafka is not hard real-time but rather near real-time. It nevertheless remains the de facto standard for reliable data processing at scale in near real-time, thanks to the brokers’ significant data processing capability and low latency from producer to consumer.

Producer, Consumer, Streams, Connect. Source: Apache Kafka Official Documentation

The relationship between producer and consumer is simple for those who already understand the publish-subscribe messaging pattern: the producer allows an application to publish records to one or more Kafka topics; the consumer allows applications to subscribe to those topics and process the stream of records. So, how does this relate to the topic of this article? Processing real-time data streams has become an essential part of data analysis. Visualizing real-time data such as stock prices and currency exchange rates with data streaming technology makes the analysis significantly more straightforward.

Real-Time Animation Graph with Kafka and Matplotlib

In the following example, we will demonstrate how to track real-time currency prices using Apache Kafka and pandas, and plot them on a real-time animated graph with Matplotlib.

Prerequisites:

1. Because Kafka is written in Java and Scala and runs on the JVM, Java 8+ must be installed in the local environment. With Java installed on the machine, one can verify it with the following command:

$ java -version

If Java is not installed, run the following command to install OpenJDK:

$ sudo apt-get install openjdk-8-jdk

2. To use Apache Kafka with Python, we will use the Python library kafka-python. Install it with the following command:

$ pip install kafka-python

Step 1: Obtain Kafka:

Download the latest Kafka release and extract it:

$ tar -zxf kafka_2.12-3.1.0.tgz
$ cd kafka_2.12-3.1.0

Step 2: Start Zookeeper to Manage the Kafka Cluster:

Zookeeper is top-level Apache software that acts as a centralized service. It maintains naming and configuration data and provides flexible, robust synchronization within distributed systems. In addition, Zookeeper tracks the status of the Kafka cluster nodes and keeps track of Kafka topics and partitions. A Kafka server can only start when Zookeeper is running, so the first task is to start a Zookeeper instance.

Inside the extracted kafka_2.12-3.1.0 directory, there are some handy files:

bin/zookeeper-server-start.sh: starts the Zookeeper server.

config/zookeeper.properties: provides the default configuration for running the Zookeeper server.

Start the Zookeeper server by running:

$ bin/zookeeper-server-start.sh config/zookeeper.properties

A confirmation should appear once the server has started:

Step 3: Begin the Kafka Brokers:

The Kafka broker, as the name implies, acts as a controller or server that performs a specific operation after receiving a request. As the core of the Kafka cluster, brokers serve as the connectors to the outside world, such as consumers, producers, and Confluent connectors.

Akin to the earlier step, there are some handy files inside the extracted kafka_2.12-3.1.0 directory:

bin/kafka-server-start.sh: starts the Kafka server.

config/server.properties: provides the default configuration for running the Kafka server.

The command to start the Kafka server is:

$ bin/kafka-server-start.sh config/server.properties

A confirmation should appear stating that the server has started:

Step 4: Create New Topics:

Kafka topics are the feed names used to organize messages. Each topic has a unique name across the whole Kafka cluster. Messages, which are key-value pairs, can be sent to and read from specific topics: producers write data to topics, and consumers read data from them.

To create topics manually, run kafka-topics.sh with the following options:

--bootstrap-server: a list of brokers that can serve as a starting point to link to the cluster.

--replication-factor: explains the desired number of copies of the data (should a broker become inoperable, the data remains on the others).

--partitions: the number of partitions the topic’s data should be split across.

--topic: the topic name that one desires to use.

To complete this action, run the command:

$ bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic Currency --replication-factor 1 --partitions 1

Step 5: Send Messages with the Producer Application

Producers are applications that publish data to Kafka topics of their choice. The producer can also select which partition within a topic a particular message is assigned to.

5.1 Import Dependencies and Initialize a New Kafka Producer
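A minimal sketch of this step with kafka-python might look like this:

from json import dumps
from kafka import KafkaProducer

# Initialize the producer that will publish currency records to Kafka.
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda x: dumps(x).encode('utf-8'))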

Please be aware of the following arguments:

bootstrap_servers=[‘localhost:9092’]: establishes the host and the port the producer should contact to bootstrap initial cluster metadata. Setting this here is not mandatory because the default is localhost:9092.

value_serializer=lambda x: dumps(x).encode(‘utf-8’): specifies how to serialize the data before sending it to the broker; here each record is converted to a JSON string and then encoded as UTF-8.

5.2 Send Data to a Particular Topic in Kafka

Now, we want to keep receiving updated real-time data via the API. This can be done with an infinite loop that fetches data from the Currency API and converts the response into Python objects; the value serializer defined above then encodes each record as JSON before it is sent.

Within the same loop, the data is sent to a broker by calling the send method on the producer and specifying the topic and the data, as in the sketch below.
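In this sketch, the API endpoint and the fetch_rate() helper are hypothetical placeholders for whichever Currency API is used:

from time import sleep
import requests

API_URL = 'https://example.com/currency-api'     # hypothetical endpoint

def fetch_rate():
    # Hypothetical helper: query the API and return its JSON payload as a dict.
    return requests.get(API_URL, params={'pair': 'USDJPY'}).json()

while True:
    data = fetch_rate()
    producer.send('Currency', value=data)        # topic created in Step 4
    sleep(1)                                     # throttle to respect API rate limits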

Allow the notebook to continue running so that the Kafka consumer can observe these messages.

Step 6: Receive Messages with the Consumer Application

By using a KafkaConsumer to subscribe to the particular topic on the Kafka server, the consumer application will receive, as its input, the messages subsequently published on that topic.

6.1 Import Dependencies and Initialize a New Kafka Consumer
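A minimal sketch mirroring the producer settings:

from json import loads
from kafka import KafkaConsumer

# Initialize the consumer that will read currency records from the topic.
consumer = KafkaConsumer(
    'Currency',
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='latest',
    consumer_timeout_ms=1000,
    value_deserializer=lambda x: loads(x.decode('utf-8')))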

Note the following arguments:

The primary argument is the topic: “Currency” in this case.

bootstrap_servers=[‘localhost:9092’]: the same as for our producer.

auto_offset_reset=‘latest’: applies when the consumer starts reading without a committed offset, for example after breaking down or being turned off. It can be set to either earliest or latest. With latest, the consumer begins reading at the end of the log; with earliest, it starts from the beginning of the partition. In our case, latest is what we want, since we only care about new prices.

consumer_timeout_ms=1000: the timeout value in milliseconds. In our case, the consumer stops listening if it does not observe any message for 1 second.

6.2 Receive Data on a Particular Topic in Kafka and Save Data into a CSV File

As in the producer, we employ a loop to extract messages from the consumer; the consumer keeps listening until the broker becomes unresponsive. From each message’s JSON value we can access the record, which is converted into a pandas dataframe. From the dataframe, the latest ask price of USD/JPY and the last refreshed timestamp are extracted and stored in Python lists. The lists are then converted into a pandas dataframe and saved in a CSV file, as in the sketch below.
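In this sketch, the field names inside each message ('ask_price', 'last_refreshed') are placeholders for whatever the Currency API actually returns:

import pandas as pd

prices, times = [], []
for message in consumer:
    record = message.value                       # already deserialized into a dict
    frame = pd.DataFrame([record])               # optional: inspect as a dataframe
    prices.append(record['ask_price'])           # latest USD/JPY ask price (assumed key)
    times.append(record['last_refreshed'])       # last refreshed timestamp (assumed key)
    pd.DataFrame({'time': times, 'price': prices}).to_csv('currency.csv', index=False)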

Allow this notebook to continue running.

Step 7: Generate Animated Plot via Matplotlib

7.1 Import Dependencies and Create Figure Object

The backend will receive the figure instance and allow the pyplot interface to manage the figure.
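A sketch of the imports and figure creation for graph.py:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.animation import FuncAnimation

# Create the figure instance that FuncAnimation will redraw.
fig = plt.figure(figsize=(10, 6))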

7.2 Define Function for Animation

Let us define the animate function, which FuncAnimation will call repeatedly to update the live rate plot. Since the consumer is constantly updating the CSV file, we read the CSV file and update the values of the plot inside this function.

By default, the first parameter of this function is the frame number; we will not use it here.

We call read_csv from pandas to load the CSV file, with parse_dates and date_parser enabled to parse the strings in the “time” column, whose data type should be DateTime. Since the range of the axes may change with every update, we use cla() to clear the active axes in the figure before redrawing. Then we generate a line plot with markers using Matplotlib.

However, the tick positions and labels sometimes do not display as expected. If not set manually, they only show the date, but we also want to show the exact time. Here, we use fig.gca() to get the current Axes instance and set_major_formatter to format the x-axis so that it displays the full DateTime with milliseconds. We also set the x-axis labels and rotate the x-ticks so that the labels correspond to the data and do not overlap when there are too many of them. A sketch of the full function follows.
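This sketch assumes the currency.csv layout produced by the consumer sketch above:

def animate(frame):                              # `frame` is supplied by FuncAnimation
    data = pd.read_csv('currency.csv', parse_dates=['time'])
    plt.cla()                                    # clear the active axes before redrawing
    plt.plot(data['time'], data['price'], marker='o')
    ax = fig.gca()
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d %H:%M:%S.%f'))
    plt.xlabel('Last refreshed')
    plt.ylabel('USD/JPY ask price')
    plt.xticks(rotation=45)
    plt.tight_layout()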

7.3 Create Animation Object

Finally, it is time to create an animation object by FuncAnimation.

ani = FuncAnimation(fig, animate, interval=1000)

We pass our figure object and the animation function to it. Here, we set the interval to 1000, which makes FuncAnimation wait 1000 milliseconds before each call to animate().

The figure will be continuously updated once the graph.py program runs alongside the producer and consumer.


This blog is co-owned by Jiahe Wang, Zhi Zheng, Huiyi Zou, Shilin Wang and Hayden Yan from Simon Fraser University