How Tokopedia Sharpens Its Data Team's Skills in Data Analysis Using R

Safrina K Imandani
Published in Tokopedia Data · 5 min read · Dec 19, 2018


Growth Mindset is one of the DNAs that every Nakama (a nickname for Tokopedia's employees) must own. It means that a Nakama must always keep developing and stay open to learning new knowledge. With that in mind, Tokopedia's Data Team held a training and sharing session that aimed to provide a place for Nakamas to exchange ideas and knowledge. The activity was held on 10th and 11th December 2018, and the topic was one of the Data Team's most-used tools: the R programming language.

Participants of R Training doing R Challenge

Members of the Data Team joined this training and sharing session by registering themselves. The training was limited to 10 people to allow a more intensive session. The speakers were four Nakamas who have more experience and deeper knowledge of R. The first day began with a pre-test on R to measure Nakamas' knowledge before the training.

The training and sharing session was divided into six modules delivered over two days.

1. First Module: Introduction to R

In this module, we were introduced to R and its creators, as well as the advantages and disadvantages of R compared to other data-processing tools; in this case, R was compared to the Python programming language. The session also covered RStudio features, such as the R console, code editor, environment, history, connections, plots, packages, help, and viewer, so that participants would be familiar with the interface before moving on to data processing.

2. Second Module: Introduction to Vectors & Data Frames

This module covered the definition of a vector, its syntax, and how vectors are operated on in R. We were also shown what a data frame is, how to import one from a CSV file or from BigQuery, and common data frame syntax, such as getting the first or last rows of a table, editing data, and inspecting the structure of a data frame (see the sketch below).

(Left image: a vector operation in R. Right image: result of the str() function, used to inspect the structure of the dataset.)
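As a rough illustration of those basics, here is a minimal sketch; the CSV file name and its columns are hypothetical, and importing from BigQuery (typically done through the bigrquery package) is not shown:

```r
# A numeric vector and an element-wise (vectorised) operation
temps_c <- c(21.5, 23.0, 19.8, 25.1)
temps_f <- temps_c * 9 / 5 + 32

# Import a data frame from a CSV file (file name is hypothetical)
sales <- read.csv("daily_sales.csv")

head(sales, 3)   # first three rows
tail(sales, 3)   # last three rows
str(sales)       # structure: column names, types, and a preview of values
```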

3. Third Module: Data Wrangling

Data wrangling is the process of transforming and cleaning "raw data" into a format suitable for analysis. This module explained how to select, filter, and reorder data frames, rename and group columns, add new columns, join tables, transpose (reshape) columns, summarise data to view descriptive figures, and use the pipe syntax to connect multiple operations in one run. It required participants to install the dplyr and tidyr packages; a short sketch of the pattern follows the caption below.

Example of pipe usage connecting mutate (to add a difftemp column), a rename of the cloudcover column to masscloud, and ordering avgtemp in descending order.
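A hedged sketch of such a pipeline with dplyr: the railtrail data frame is assumed (its avgtemp, hightemp, and cloudcover columns follow the captions in this post), and the definition of difftemp is a guess.

```r
library(dplyr)

# Chain several wrangling steps with the pipe (%>%).
# `railtrail` is an assumed data frame with avgtemp, hightemp, and cloudcover columns.
wrangled <- railtrail %>%
  mutate(difftemp = hightemp - avgtemp) %>%   # add a new column (assumed definition)
  rename(masscloud = cloudcover) %>%          # rename an existing column
  arrange(desc(avgtemp))                      # order avgtemp in descending order

head(wrangled)
```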

4. Fourth Module: Descriptive Statistics

This module explained that descriptive statistics summarise the information contained in a dataset, such as the mean and median of numerical data or the frequency of observations for nominal data. We were also taught how to show summary statistics in a plot.

Result of the summary() function, showing statistics such as the min, median, mean, and max values for each variable in the RailTrail dataset.
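A minimal sketch, assuming the RailTrail dataset from the mosaicData package (the training may have loaded the data differently):

```r
# install.packages("mosaicData")  # if the package is not yet installed
library(mosaicData)
data(RailTrail)

# Min, quartiles, median, mean, and max for numeric variables,
# and counts for categorical ones
summary(RailTrail)

# A quick graphical summary of a single variable
hist(RailTrail$avgtemp, main = "Average temperature", xlab = "avgtemp")
```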

5. Fifth Module: Data Visualization with ggplot

This module explained that ggplot (the ggplot2 package) is a library for making visualizations in R. Compared to the base plot() function, ggplot is more powerful for making graphics in the form of bars, points, lines, and text. The topic itself is divided into five parts. The first part is Setup, where we learnt to map columns of a table to the x- and y-axes of a graph. The second part is Layer, which adds a geom (geometric object) as the visual representation of the data; we could choose to display the visualization as bars, points, lines, or text. The third part is Label, where we set the title of the graph we created. The fourth part is Theme, which determines the colors, fonts, font sizes, and position of text on the graph. Finally, the fifth part is Facet, a function that breaks a graph apart into several graphs based on a category, so we do not have to create each graph manually from scratch.

An R visualization example, containing functions such as aes(), a geom layer, labs(title = ...), theme(), and facet_wrap().
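A hedged sketch of those five parts with ggplot2, again assuming the RailTrail data; the actual plot built in the training may have used different columns, geoms, and theme settings.

```r
library(ggplot2)
library(mosaicData)
data(RailTrail)

# Setup: map columns to the x and y axes
ggplot(RailTrail, aes(x = avgtemp, y = volume)) +
  # Layer: a geom as the visual representation of the data (points here)
  geom_point(color = "steelblue") +
  # Label: title and axis labels
  labs(title = "Trail volume vs. average temperature",
       x = "Average temperature", y = "Volume") +
  # Theme: colors, fonts, font size, and text position
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5)) +
  # Facet: break the graph into panels by a category
  facet_wrap(~ weekday)
```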

6. Sixth Module: Correlation and Regression Analysis

This module explained how to do correlation and regression analysis. Correlation is used to see the relationship between variables in the data; common correlation methods include Kendall, Spearman, and Pearson. Linear regression, meanwhile, models a continuous variable y as a mathematical function of one or more x variables, so that we can predict the value of y when only x is known.

Result of the cor() function, used to determine the correlation coefficient between avgtemp and hightemp, and the lm() function, used to predict the value of volume based on avgtemp.
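A minimal sketch of both analyses, again assuming the RailTrail data:

```r
library(mosaicData)
data(RailTrail)

# Correlation between average and high temperature
# (Pearson by default; method can also be "kendall" or "spearman")
cor(RailTrail$avgtemp, RailTrail$hightemp, method = "pearson")

# Simple linear regression: model volume as a function of avgtemp
fit <- lm(volume ~ avgtemp, data = RailTrail)
summary(fit)

# Predict volume for a hypothetical average temperature of 60 degrees Fahrenheit
predict(fit, newdata = data.frame(avgtemp = 60))
```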

On the second day, after all the modules had been delivered, the participants were given a challenge to practice all of the modules within 90 minutes. For this challenge, participants were grouped into pairs. Besides practicing, the trainees also had to present their results. After that, they took a post-test to measure how much their knowledge had improved after the training.

At the end of the training, the group with the best presentation, as well as the participants with the highest post-test scores, were selected as the winners and received free movie tickets as prizes. Yay!

Speakers and participants of the R training.
