My First Technical Assignment, Pt. 1

Kerry Benjamin
Published in The Data Logs · 6 min read · Jun 26, 2017

Going through a real-world challenge

#WOCinTech Chat featuring the dope Taitiana King

Hello everyone. Nice to see you again. Today's post is about the take-home technical assignment I was given by Bright. Like I mentioned previously, it was fun and tiring at the same time. I enjoyed knowing I was being asked real business questions and asked to propose solutions: basically the things data scientists and analysts have to do on a day-to-day basis. Because of this, this post will also double as assignment 3 on the Data Science Learning Club forums (check it out if you want data science project ideas). Let's get started.

I was given questions on user acquisition, user retention, and further investigation. The data set consists of 29,999 users who signed up during the month of April, the date they signed up, and the date they unsubscribed, if at all. Before we dive into any of these questions we have to do a bit of house cleaning. While the data set wasn't dirty per se, it can be tidied up to make things more convenient for you and your stakeholders. So we'll load the data and the necessary packages and start scrubbing.

#Package loading
library(readr) #Used for loading data
library(tidyverse) #Loads in multiple useful packages
library(lubridate) #Used for Dates and Time
#Loading Data
Bright_Users <-read_csv("C:/Users/Kerry/Desktop/data_analyst_intern_test_data.csv")
#Preview of Data set
head(Bright_Users)
id start_date end_date
1410821 2017-04-01 00:03:23 NULL
1410822 2017-04-01 00:04:11 NULL
1410823 2017-04-01 00:08:04 NULL
1410824 2017-04-01 00:08:22 NULL
1410825 2017-04-01 00:10:58 NULL
1410826 2017-04-01 00:12:53 NULL

It’s a start. You can’t really see anything too useful from looking at this so let’s fill in the blanks. We’ll start by using dplyr and tidyr to manipulate the Bright_Users data frame to give us the day of the week and separate the date and time.

Bright_Users <- separate(Bright_Users, start_date, c("Date", "Time"), sep = " ")
Bright_Users <- mutate(Bright_Users, Weekday = wday(Date, label = TRUE))

Now we’ll further clean this so that time resembles a normal 12 hour clock and the Date column turns into the Date data type. We’ll see why this is important later.

Bright_Users$Time <- format(strptime(Bright_Users$Time, format='%H:%M:%S'), '%r')
Bright_Users$Date <- ymd(Bright_Users$Date)
Bright_Users <- Bright_Users[, c(1, 2, 5, 3, 4)] # Reorder columns: id, Date, Weekday, Time, end_date
head(Bright_Users)
id Date Weekday Time end_date
1410821 2017-04-01 Sat 12:03:23 AM NULL
1410822 2017-04-01 Sat 12:04:11 AM NULL
1410823 2017-04-01 Sat 12:08:04 AM NULL
1410824 2017-04-01 Sat 12:08:22 AM NULL
1410825 2017-04-01 Sat 12:10:58 AM NULL
1410826 2017-04-01 Sat 12:12:53 AM NULL

Looks much better now. The user retention segment of our questions is going to focus on the people who unsubscribe from Bright. Let's create a new data frame so the analysis becomes easier.

# Creating unsubscribers data and making it easier to read
Unsubscribers <- filter(Bright_Users, end_date != "NULL")
Unsubscribers <- separate(Unsubscribers, end_date, c("Date_Left", "Time_Left"), sep = " ")
Unsubscribers$Time_Left <- format(strptime(Unsubscribers$Time_Left, format='%H:%M:%S'), '%r')
Unsubscribers$Date_Left<- ymd(Unsubscribers$Date_Left)
diff_days <- difftime(Unsubscribers$Date_Left, Unsubscribers$Date, units = "days")
Unsubscribers <- mutate(Unsubscribers, days_unsub = diff_days)
Unsubscribers <- select(Unsubscribers, id, Date, Weekday, Time, Date_Left, days_unsub)
head(Unsubscribers)
id Date Weekday Time Date_Left days_unsub
1410853 2017-04-01 Sat 12:37:01 AM 2017-04-15 14 days
1410996 2017-04-01 Sat 01:27:15 AM 2017-04-12 11 days
1410997 2017-04-01 Sat 01:27:17 AM 2017-04-01 0 days
1410998 2017-04-01 Sat 01:28:49 AM 2017-04-07 6 days
1411298 2017-04-01 Sat 03:39:14 AM 2017-04-05 4 days
1411433 2017-04-01 Sat 04:45:39 AM 2017-05-16 45 days

The format is very similar to creating the original data frame, but with a few important additions. I split end_date into the exact date and time a user unsubscribed. I also calculated how long it took each individual to unsubscribe from Bright in days_unsub. Now let the real fun begin!

User Acquisition

1. How many users signed up per week?

This one is easy. April breaks into four roughly week-long chunks (the data runs through April 27), so we need 4 data frames. We'll extract them from Bright_Users by filtering on specific date ranges.

Wk1 <- filter(Bright_Users, Date >= "2017-04-01" & Date <= "2017-04-07")
dim(Wk1)
Wk2 <- filter(Bright_Users, Date >= "2017-04-08" & Date <= "2017-04-14")
dim(Wk2)
Wk3 <- filter(Bright_Users, Date >= "2017-04-15" & Date <= "2017-04-21")
dim(Wk3)
Wk4 <- filter(Bright_Users, Date >= "2017-04-22" & Date <= "2017-04-27")
dim(Wk4)
[1] 7364 5 # The 1st number is the number of subscribers
[1] 6874 5 # The 2nd number is the number of variables (e.g., Date)
[1] 9805 5
[1] 5956 5

Weeks 1–4 have 7364, 6874, 9805, & 5956 signups respectively. Man they are crushing it!
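As an aside, the four hand-built filters can be collapsed into one grouped count with dplyr. A minimal sketch, assuming Bright_Users has already been cleaned as above (with Date as a Date column); the integer division buckets each signup into the same seven-day windows starting April 1:

```r
library(dplyr)

# Bucket each signup into 7-day windows counted from April 1,
# then tally signups per window in a single pass
Bright_Users %>%
  mutate(Week = as.integer(Date - as.Date("2017-04-01")) %/% 7 + 1) %>%
  count(Week)
```

This scales to any number of weeks without copy-pasting filter calls.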

2. What day are users most likely to sign up?

Hmm, going to have to break out the math cap for this one. Here's how I solved it: there are 7 days in a week, so there's a 1/7 chance of picking any given day, which is about 14%. Let's visualize all signups by day and see what we find.

# Bar chart of signups by weekday
ggplot(Bright_Users, aes(x = Weekday)) + geom_bar(fill = "violet")
# Counts and percentages of signups per weekday
Bright_Table <- table(Bright_Users$Weekday)
Bright_Table
prop <- prop.table(Bright_Table) * 100
prop

Would you look at that. Tuesday leads the pack on signups for the month of April with over 10,000; Monday has about half as many. Let's get some exact numbers.

  Sun   Mon  Tues   Wed Thurs   Fri   Sat
 1537  5019 10148  3906  3936  3542  1911

     Sun       Mon      Tues       Wed     Thurs       Fri       Sat
5.123504 16.730558 33.827794 13.020434 13.120437 11.807060  6.370212

So Tuesday has exactly 10,148 signups in all, about 34% of all signups during April. Compared with the 14% baseline probability for any given day of the week, I think it's safe to say people are most likely to sign up on a Tuesday.
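If you'd rather not eyeball the bar chart, the modal day and its share can be pulled straight from the counts. A quick base-R sketch using the weekday totals from the table above:

```r
# Weekday signup counts, copied from the table above
signups <- c(Sun = 1537, Mon = 5019, Tues = 10148, Wed = 3906,
             Thurs = 3936, Fri = 3542, Sat = 1911)

names(which.max(signups))                    # "Tues"
round(100 * max(signups) / sum(signups), 1)  # 33.8
```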

3. How can Bright act on this data?

Now, a question like this and the next steps that follow would usually be answered by multiple people: you plus a domain expert or team. In this case, to get even better at acquisition, you would probably work with the marketing department, especially those focused on content creation and curation. Since this was just me, I used what I remembered from marketing classes to come up with some ideas.

I think Bright should evaluate the top 3 marketing channels that new users come from and the content produced on those channels. If they aren't already, those channels should be tagged so they can be analyzed. See if you can repurpose what you have for them, or run a recurring theme for Wednesdays, since there is a significant drop-off in the middle of the work week. Perhaps a "hump day" special? Might be a good mood lifter. They may also want to look at the states or time zones current users are in. Perhaps Bright is more popular on the West Coast?

That’s it for now. In Part 2 we’ll wrap up the analysis. If you have any questions, feel free to ask. I hope you enjoyed seeing how data analysis can be used to evaluate and help a business.

If you enjoyed reading this and can’t wait for part 2, hit the recommend button.



I'm a Connector, Opportunity Seeker, Learning Data Science and Supporter of STEAM education.