Step-by-step guide to your first interactive dash app — Part IV

Paolo Molignini, PhD
8 min read · Oct 27, 2022


This is the fourth part of my guide to writing and deploying a Dash app for data visualization. If you haven't done so yet, please read parts I to III before this one. The repository for the monkeypox-tracking app that we are using as an example can be found here.

In part III, we took a look at the app layout. By now you should know how to define and start a Dash server, what a Dash layout is, which components we can add to it, and what callbacks are and what they do (although we haven't yet seen how they work in detail; more on this in part VI). Today, we will look in depth at the modules that extract and fit the data from the .csv files: input.py, extractor.py, and fit.py, located in the modules folder of my repository. We will now go through all the commands in these modules (except the import statements, which should be clear by now) and see what they do.

Input

I believe that it’s useful to always place all your input variables in a separate module. By input variables I mean global variables that are shared among many different modules. That way, you can have a compact view of all the inputs you need to run your program and only one file to modify if you need to work with different inputs.

Let's go through all the global variables assigned in input.py and listed in the picture above (a minimal sketch of the module also follows the list below). These are all variables that need to be updated or checked whenever you add more data to your app.

  • colors: this is a Python dictionary containing the colors of layout components, such as the background and text colors. This dictionary will be passed to the layout.
  • cmap: this is a Python list of (float, string) pairs that defines the range of colors used in the world-map colorbar. Once passed to the choropleth component, the program will automatically interpolate between the four anchor points given. 0.0 corresponds to the smallest value in the color scale ('#EEEEEE' is a very light grey) and will indicate countries with no cases. The colors are then interpolated from grey to blue, then from blue to yellow, and finally from yellow to red. The country with the current maximal number of cases will always be bright red. Of course, you are free to change the appearance of the colorbar to include different colors, a preset colormap (I particularly like the "perceptually uniform sequential" colormaps in Python, such as "inferno" or "viridis"), or a different interval spacing. I chose this custom colormap because I wanted the value for zero cases to be of a completely different color than the rest. That way, I can immediately see whether a country has cases or not.
  • disabled_days: these are the days that will not be clickable in the DatePicker component because there is no data associated with them; they have to be hard-coded (i.e. if there are dates missing when you update the assets, you need to include them here). When I started collecting data about monkeypox, I was doing it daily. Since I was scouting the internet for sources, it was quite time-consuming, and as more countries recorded cases it became too tedious to update the data daily. I could have simply copied the same .csv files to the next day without updating them, but I didn't want to display inaccurate information. So I decided to simply remove the dates with no data from the picker, so they could not be displayed on the choropleth map.
  • country_colors: this is a list of strings naming the colors used for the curves in the time evolution of each country, more or less in hue progression.
  • today: the date of the latest available data, which in the best case scenario should be today. Note that it is a datetime object, like all the other date variables.
  • date_delta: datetime object corresponding to an interval of one day.
  • start_date, current_date, end_date: datetime objects that define the interval of available historical data that we go through when building the time evolution for each country.
  • current_date_str, current_date_tag: the current date converted into different string formats.
  • country_list: a list of countries to display as default in the plot of the time evolution.
  • initial_day_fit: default value for the number of days (after the initial case on 2022/5/7) after which we start performing the exponential fit.
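To make this more concrete, here is a minimal sketch of what such an input module could look like. All concrete values below (hex codes, dates, country names) are placeholders for illustration, not the actual contents of input.py in the repository:

# input.py: minimal sketch of a shared-input module (placeholder values).
from datetime import datetime, timedelta

# Colors of layout components, passed to the Dash layout.
colors = {"background": "#111111", "text": "#EEEEEE"}

# Anchor points of the choropleth colorbar; Plotly interpolates in between.
cmap = [
    [0.0, "#EEEEEE"],  # light grey: zero cases
    [0.1, "#4C78A8"],  # blue
    [0.5, "#EECA3B"],  # yellow
    [1.0, "#E45756"],  # bright red: current maximum
]

# Dates with no data, excluded from the DatePicker (hard-coded).
disabled_days = [datetime(2022, 7, 4), datetime(2022, 7, 11)]

# Curve colors for each country's time evolution (placeholder hues).
country_colors = ["#1f77b4", "#2ca02c", "#d62728"]

# Date bookkeeping (all datetime objects).
today = datetime(2022, 10, 27)       # latest available data
date_delta = timedelta(days=1)       # one-day interval
start_date = datetime(2022, 5, 7)    # first recorded case
end_date = today
current_date = end_date

# The current date in different string formats.
current_date_str = current_date.strftime("%Y/%m/%d")
current_date_tag = current_date.strftime("%Y%m%d")

# Defaults for the time-evolution plot and the exponential fit.
country_list = ["Switzerland", "United Kingdom", "United States"]
initial_day_fit = 30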

Extractor

The extractor module contains a single function, data_extractor, that generates the dataframes we need to pass to the Dash graphical objects displayed in the app. Given the start/end dates and the list of countries whose cases we want to display (defined in input.py above), the extractor opens the .csv file labelled by each (available) date, counts the number of cases for every country, and saves the result in a dataframe called cases_df. While reading the data, it also sums up the cases across all countries to get the time evolution of the cumulative total cases.

The key commands here are

country_cases = daily_df.loc[daily_df['COUNTRY'] == country, 'CASES'].iloc[0]

(line 32) which, for every date, selects the value in the 'CASES' column for the row whose 'COUNTRY' entry matches the current country in that date's dataframe, and

dict2 = {'DATE': [current_date], 'TOT CASES': [daily_df['CASES'].sum()]}

(line 35) which creates a dictionary listing the total cases (in the column 'TOT CASES') for every date (in the column 'DATE').

This dataframe is then concatenated (lines 37–38) with the dataframes containing the cases of each country to obtain a final dataframe that lists the number of cases in each country, and in total, for every available date (see the figure below for the corresponding .csv file shown in Numbers). The rest of the commands are just iterations over all dates and countries, plus reading/saving data as .csv files to be used later by other functions.

In the end, the extractor returns the dataframe for the time evolution of each country, and the dataframe containing the global cases for the current date to be displayed.
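Putting these pieces together, the core of data_extractor can be sketched as follows. This is a simplified reconstruction rather than the repository's actual code: in particular, the daily file-naming pattern is an assumption, and the handling of disabled days is omitted:

import pandas as pd

def data_extractor(start_date, end_date, date_delta, country_list):
    cases_df = pd.DataFrame()  # time evolution per country
    total_df = pd.DataFrame()  # cumulative total cases per date
    current_date = start_date
    while current_date <= end_date:
        # Load the daily .csv file (naming pattern assumed here).
        tag = current_date.strftime("%Y%m%d")
        daily_df = pd.read_csv(f"data/cases_{tag}.csv")

        # Number of cases for each country on this date (cf. line 32).
        dict1 = {"DATE": [current_date]}
        for country in country_list:
            dict1[country] = [
                daily_df.loc[daily_df["COUNTRY"] == country, "CASES"].iloc[0]
            ]

        # Total cases over all countries on this date (cf. line 35).
        dict2 = {"DATE": [current_date], "TOT CASES": [daily_df["CASES"].sum()]}

        # Concatenate with the running dataframes (cf. lines 37-38).
        cases_df = pd.concat([cases_df, pd.DataFrame(dict1)], ignore_index=True)
        total_df = pd.concat([total_df, pd.DataFrame(dict2)], ignore_index=True)

        current_date += date_delta

    return cases_df, total_df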

Fit

Besides plotting the raw number of cases as a function of the date, we also want to perform a simple fit to see whether the data is growing exponentially. Exponential growth is one of the hallmarks of an uncontrolled epidemic and a strong warning sign that swift action is needed to reduce transmission. In an exponential-growth scenario, if there are 100 cases today, there might be 1,000 in a week, 10,000 in two weeks, and so on. The closer the fit matches the raw data, the more accurately the growth can be described by an exponential.

The exponential fit is performed by the function data_fit in the module fit.py (shown in the screenshot below) and relies on the numpy function polyfit.

The exponential fit goes as follows. We first read the total number of cases, generated earlier with the extractor and saved in total_cases.csv, into a pandas dataframe (line 17). Then we loop over all countries whose data we want to display (contained in country_list). The independent variable x is the date, but we cannot use it in the string format we saved it in. So, in line 25 we recast it as an integer and divide by 10⁹. This gives us the number of seconds since the 1st of January 1970, known as the Unix epoch (the 10⁹ is needed because the datetime format carries information down to nanoseconds). Then we convert the seconds into days by dividing by a further 86,400 (line 26). Note that we also reset the starting value to the first available datapoint, so in the end the variable x0 gives us the number of days that have elapsed since the first case. Finally, we take x to start initial_day_fit days after the first recorded case. Recall that this is an additional input variable, which we added because we want to be able to perform the fit on later data, since the case progression might not look exponential at first.
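In code, the date conversion might look like the following sketch (the variable names mirror the description above; the exact lines in fit.py may differ slightly):

import pandas as pd

initial_day_fit = 30  # from input.py (placeholder value)

df = pd.read_csv("total_cases.csv")

# Strings -> datetime -> integer nanoseconds since the Unix epoch.
dates = pd.to_datetime(df["DATE"])
seconds = dates.astype("int64") / 10**9  # nanoseconds -> seconds (line 25)
days = seconds / 86400                   # seconds -> days (line 26)

# Reset day 0 to the first available datapoint.
x0 = days - days.iloc[0]

# Start the fit initial_day_fit days after the first recorded case.
mask = x0 >= initial_day_fit
x = x0[mask]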

Then, we need to extract the dependent variable y for each date. This is done in lines 34–36, where we add +1 to all data points: we will need to take the logarithm of the data when performing the fit, and since some data points might be zero, log(0) would be ill-defined. Note also that we shift the initial data point by initial_day_fit, as for the x variable (x and y must have the same length!).

Now that we have both x and y, we can finally perform the fit. In principle, we could call a more elaborate fitting library that can fit functions directly to something like A*exp(B*x) or even more complex. However, there is a neat trick that we can employ to use a simple polynomial regression like the one provided by polyfit. This is briefly explained in the comment in the code. We are looking for a fit of

y = A*exp(B*x).

However, if you take the logarithm of both sides of this equation and use the sum rule for logarithms, i.e.

log(a*b) = log(a) + log(b)

we get

log(y) = log(A) + B*x

Now we have a linear function of x that we can fit with a simple polynomial fit! We just need to fit log(y) against x. It should also now be clear why we needed to add +1 to our y-data. After the fit, the slope (the zeroth element of pars) gives B directly, while the intercept (the first element of pars) gives log(A), so we just need to exponentiate it to get A.

With these fit parameters applied to the original independent variable x, we can now create a new set of datapoints that represent the cases as if they followed the exponential function fitted from the raw data. This is all done in line 47, where we additionally subtract the +1 that we added initially to keep the log well-behaved, and thus recover the correct number of cases. Finally, in line 51, we add the newly fitted data to the dataframe under the column tagged "country_fit", e.g. "Switzerland_fit" for the fit of the Swiss cases.
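In code, the whole trick boils down to a few numpy calls. Here is a sketch continuing from the x and mask defined above, for a single country; the column layout of the dataframe is assumed:

import numpy as np

country = "Switzerland"  # one entry of country_list

# Dependent variable, shifted by +1 so that log(0) never occurs.
y = df[country][mask] + 1

# Degree-1 fit of log(y) against x: pars[0] = B (slope), pars[1] = log(A).
pars = np.polyfit(x, np.log(y), 1)
B = pars[0]
A = np.exp(pars[1])

# Reconstruct the fitted curve and remove the +1 offset (cf. line 47).
y_fit = A * np.exp(B * x) - 1

# Store the fitted data in a new column, e.g. "Switzerland_fit" (cf. line 51).
df.loc[mask, country + "_fit"] = y_fit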

In lines 54 to 59, we perform the same fit all over again, but this time for the total number of cases. In line 61, we save the newly updated dataframe containing the fit into total_cases.csv, shown below.

After all this extraction and manipulation, the data is finally ready to be displayed :) We will see the details of how this is done in part V, where we will discuss the module plots.py and how figures are generated therein. Then we will move on to a discussion of the callback in part VI. This will conclude our detailed analysis of the app, and you’ll be ready to deploy it on Heroku in part VII!

Thanks for reading!
