From GG to Riski — plotting in R has never been so simple

Dor Ap
Published in Riskified Tech · 5 min read · Jan 19, 2020

Diversity among researchers’ backgrounds is a pillar of a healthy and successful research department. Nonetheless, aligning and formalizing research output and the way that data is presented is important for increasing efficiency and maximizing the department’s impact within the organization.

Based on The Grammar of Graphics, the ggplot2 package by Hadley Wickham is an extremely useful & flexible package for visualization object creation. However, the package is so versatile that much of the time it requires intensive formatting and customization before the output can be presented in a research report.

Thus, the goal of developing Riskified’s visualization package, riskiplot, was to provide a quick, simple & consistent way to generate visualization objects, with the following aims:

  • Save researchers’ time
  • Align research output across researchers
  • Generate visualization objects according to the company’s visual brand language
  • Present results in a clear way, based on data visualization best practices

The riskiplot package is based on ggplot2, and was written to handle many annoying formatting tweaks while still taking advantage of ggplot’s flexibility. It tackles the most commonly used graph configurations, such as bar, point, line, histogram & density. Each riskiplot function outputs a ggplot object, so that the object can be further adjusted using ggplot2 functions. In addition, the package includes two helper functions that allow researchers to apply Riskified’s theme & unique brand colors to custom gg objects.

The arguments across riskiplot functions are standardized so that the user can quickly apply facets, flip coords, assign text labels, fill by another dimension, adjust the text base size, and add titles & subtitles in a consistent language.
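To make the idea of standardized arguments concrete, here is a minimal sketch of how such a wrapper could be built on top of ggplot2. It is not riskiplot's actual implementation: the function name sketch_bar, its defaults, the placeholder color and the tiny orders data frame are all assumptions made for illustration. Only the argument names (x_col_name, fill_by, facet_by, cflip) echo the ones used in this post.

```r
library(ggplot2)

# Hypothetical wrapper: name, defaults and color are assumptions, not riskiplot's code
sketch_bar <- function(data, x_col_name, fill_by = NULL, facet_by = NULL,
                       cflip = FALSE, title = NULL, subtitle = NULL,
                       base_size = 12) {
  # Column names arrive as strings, so look them up with the .data pronoun
  p <- ggplot(data, aes(x = .data[[x_col_name]]))
  if (is.null(fill_by)) {
    p <- p + geom_bar(fill = "#1F77B4")  # placeholder color
  } else {
    p <- p +
      geom_bar(aes(fill = .data[[fill_by]]), position = "dodge") +
      labs(fill = fill_by)
  }
  if (!is.null(facet_by)) p <- p + facet_wrap(facet_by)  # one panel per value
  if (cflip) p <- p + coord_flip()                       # horizontal bars
  p +
    labs(title = title, subtitle = subtitle, x = x_col_name, y = "count") +
    theme_minimal(base_size = base_size)
}

# Tiny made-up dataset, for illustration only
orders <- data.frame(
  status = c("approved", "approved", "declined", "approved", "cancelled", "declined"),
  us_ip  = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

p <- sketch_bar(orders, "status", fill_by = "us_ip", cflip = TRUE,
                title = "Orders by status")

# The result is an ordinary ggplot object, so it can be adjusted further
p <- p + labs(caption = "simulated data")
```

Because every option is a plain function argument, the same call shape can be reused across chart types, which is the kind of consistency the package is after.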

Examples

Below are examples of common graph configurations that an analyst would generate during an exploration phase. We use a mock dataset, “generic_data”, where each row represents an order that has been submitted to Riskified. It includes the following columns:

  • status: the order status, which can be approved / chargeback / declined / cancelled
  • model_score: a score between 0 & 100 given by Riskified’s classification model
  • class: the order’s label for training (should approve / should decline)
  • us_ip: an indication of whether the order’s IP location is in the US

This dataset will be used in the following bar chart configuration examples. At the end of the post I’ve included snapshots of other function outputs as well.

The following graphs are the output of the original geom_bar / geom_col ggplot functions, wrapped by riskiplot’s create_geom_bar. The wrapper ensures that the output matches Riskified’s brand colors & theme, formats the result in a clean and consistent manner and, as mentioned above, lets the user configure the desired output simply and quickly.

create_geom_bar presents a bar chart in which the height of each bar is proportional to the number of cases in each group. In its most basic configuration it receives a dataset & x_col_name and generates the following:

The same graph in ggplot would require the following, excruciatingly detailed, code chunk:
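A minimal sketch of what such plain ggplot2 code could look like is given here. It assumes a simulated stand-in for generic_data, and the color and theme settings are placeholders rather than Riskified's actual brand values. It also assumes ggplot2 3.3.0 or later for after_stat().

```r
library(ggplot2)

# Simulated stand-in for generic_data (values are made up for illustration)
set.seed(1)
n <- 500
generic_data <- data.frame(
  status = sample(c("approved", "chargeback", "declined", "cancelled"),
                  n, replace = TRUE, prob = c(0.70, 0.05, 0.20, 0.05)),
  model_score = round(runif(n, min = 0, max = 100)),
  class = sample(c("should approve", "should decline"), n, replace = TRUE),
  us_ip = sample(c(TRUE, FALSE), n, replace = TRUE)
)

p <- ggplot(generic_data, aes(x = status)) +
  geom_bar(fill = "#1F77B4") +                     # placeholder color
  geom_text(aes(label = after_stat(count)),        # count label above each bar
            stat = "count", vjust = -0.5) +
  labs(title = "Orders by status", x = "status", y = "count") +
  theme_minimal(base_size = 12) +                  # placeholder theme
  theme(panel.grid.major.x = element_blank(),
        plot.title = element_text(face = "bold"))

# print(p) renders the chart
```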

Flipping coords (cflip = T) & switching the labels to present proportion (bar_type = “prop”):

The following corresponding code chunk will generate the same graph:
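As a rough sketch, the plain ggplot2 version of a flipped bar chart with proportion labels could look like this. It again uses a simulated stand-in for generic_data and placeholder styling, and it is an approximation rather than the exact code behind the figure.

```r
library(ggplot2)

# Simulated stand-in for generic_data (values are made up for illustration)
set.seed(1)
n <- 500
generic_data <- data.frame(
  status = sample(c("approved", "chargeback", "declined", "cancelled"),
                  n, replace = TRUE, prob = c(0.70, 0.05, 0.20, 0.05)),
  us_ip = sample(c(TRUE, FALSE), n, replace = TRUE)
)

p <- ggplot(generic_data, aes(x = status)) +
  # Bar height is each status's share of all orders
  geom_bar(aes(y = after_stat(count / sum(count))), fill = "#1F77B4") +
  # Percentage label placed just past the end of each bar
  geom_text(aes(y = after_stat(count / sum(count)),
                label = after_stat(sprintf("%.1f%%", 100 * count / sum(count)))),
            stat = "count", hjust = -0.1) +
  scale_y_continuous(labels = function(x) paste0(round(100 * x), "%"),
                     expand = expansion(mult = c(0, 0.15))) +
  coord_flip() +                                   # horizontal bars
  labs(x = "status", y = "share of orders") +
  theme_minimal(base_size = 12)
```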

Fill by us_ip indication (fill_by = “us_ip”) & present results using the dodge option (position = “dodge”):
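For comparison, a plain ggplot2 sketch of the same idea might look like the following. The data is a simulated stand-in and the two fill colors are placeholders, not Riskified's brand palette.

```r
library(ggplot2)

# Simulated stand-in for generic_data (values are made up for illustration)
set.seed(1)
n <- 500
generic_data <- data.frame(
  status = sample(c("approved", "chargeback", "declined", "cancelled"),
                  n, replace = TRUE, prob = c(0.70, 0.05, 0.20, 0.05)),
  us_ip = sample(c(TRUE, FALSE), n, replace = TRUE)
)

p <- ggplot(generic_data, aes(x = status, fill = us_ip)) +
  geom_bar(position = "dodge") +                   # side-by-side bars per us_ip value
  geom_text(aes(label = after_stat(count)), stat = "count",
            position = position_dodge(width = 0.9), vjust = -0.5) +
  scale_fill_manual(values = c("TRUE" = "#1F77B4", "FALSE" = "#FF7F0E")) +
  labs(x = "status", y = "count", fill = "us_ip") +
  theme_minimal(base_size = 12)
```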

Using proportions per category (position = “stack”):

Getting class as an x value and facet by us_ip (facet_by = “us_ip”):
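In plain ggplot2, faceting is done with facet_wrap. Here is a minimal sketch on a simulated stand-in for generic_data, with placeholder styling:

```r
library(ggplot2)

# Simulated stand-in for generic_data (values are made up for illustration)
set.seed(1)
n <- 500
generic_data <- data.frame(
  class = sample(c("should approve", "should decline"), n, replace = TRUE),
  us_ip = sample(c(TRUE, FALSE), n, replace = TRUE)
)

p <- ggplot(generic_data, aes(x = class)) +
  geom_bar(fill = "#1F77B4") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) +
  # One panel per us_ip value; label_both prints "us_ip: TRUE" style strip titles
  facet_wrap(vars(us_ip), labeller = label_both) +
  labs(x = "class", y = "count") +
  theme_minimal(base_size = 12)
```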

An additional option within the bar_type argument is “user_fun”, which lets the user apply a summarizing function to one of the columns in the dataset, grouped by the chosen x_col_name. For example, to compute the median model_score value for each status (FUN = c(“median”, “model_score”) & bar_type = “user_fun”):
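Without a wrapper, the same result takes two steps: summarize first, then plot the summarized values with geom_col. A minimal sketch, assuming a simulated stand-in for generic_data and placeholder styling:

```r
library(ggplot2)

# Simulated stand-in for generic_data (values are made up for illustration)
set.seed(1)
n <- 500
generic_data <- data.frame(
  status = sample(c("approved", "chargeback", "declined", "cancelled"),
                  n, replace = TRUE, prob = c(0.70, 0.05, 0.20, 0.05)),
  model_score = round(runif(n, min = 0, max = 100))
)

# Step 1: median model_score per status
medians <- aggregate(model_score ~ status, data = generic_data, FUN = median)

# Step 2: plot the pre-computed values as they are
p <- ggplot(medians, aes(x = status, y = model_score)) +
  geom_col(fill = "#1F77B4") +
  geom_text(aes(label = model_score), vjust = -0.5) +
  labs(x = "status", y = "median model_score") +
  theme_minimal(base_size = 12)
```

The second step, plotting already-summarized data with geom_col, is also the plain ggplot2 counterpart of handing a wrapper data that has been processed in advance.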

On the other hand, users who prefer to do the pre-processing themselves can use the last bar_type option, “asis”, which plots the data exactly as provided after processing:

Day-to-day effect

Since riskiplot’s creation in 2018, we’ve looked back to quantify the amount of time saved. As a rough estimate, we produce 20 reports per month with 6 plots each on average, and each plot can be produced in 3 minutes instead of 12. Using the package has therefore saved about 18 hours a month on average (9 minutes * 6 plots * 20 reports = 1,080 minutes), while also ensuring consistency across our plots.
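The estimate is easy to reproduce, and to rerun with your own team's numbers. The variable names below are just illustrative:

```r
# Inputs taken from the rough estimate above
reports_per_month  <- 20
plots_per_report   <- 6
minutes_before     <- 12   # per plot, hand-tuned ggplot2
minutes_after      <- 3    # per plot, with the wrapper package

minutes_saved <- (minutes_before - minutes_after) * plots_per_report * reports_per_month
hours_saved   <- minutes_saved / 60

hours_saved  # 18
```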

Our team workflow benefited from riskiplot in two additional ways. First, once we had our visualization package up and running, there was no need to perform code review on each ggplot object that was created by researchers, so many hours of code review were avoided. Second, once we established and aligned how visualization outputs should look, significant time was saved around fixing existing plots that did not follow data presentation best practices.

Finally, as external departments are exposed to consistent reports and plots, communication around action items becomes clearer and faster. The impact? Greater research efficiency thanks to the reduction in overall time from the end of a research study to implementing research recommendations in production.

We are working on making the riskiplot code public soon, which will enable you to customize riskiplot to your own needs by quickly adding your company’s coloring scheme and desired theme. It will be that much faster to create consistent plots in line with your brand — stay tuned for more!

Here’s a sneak peek at some additional geoms that were created using the package:
