Sales Data Analysis In R

Clayton

Clayton Kasiyandima
9 min readNov 3, 2022

2022–10–11

Sales Data Analysis In R including visualizations

This is an Analysis sample sales in R which i will utilize the “tidyverse”,“dplyr”,“ggplot2” and “tidyr” libraries to manipulate data and produce some visuals.( The observations do not reflect real world occurrences)

Step 1: Importing data data

The data is imported from an excel document into a data frame called foodsales.”all libraries already installed”

library(readxl)
foodsales <- read_excel("documents/sampledatafoodsales.xlsx")

Step 3: Loading Libraries

Here we load the libraries(all liraries already installed)

library(tidyverse)## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(ggplot2)
library(janitor)
library(dplyr)
library(tidyr)
library(lubridate)
options(dplyr.summarise.inform = FALSE)# this suppresses the summarise group warnings

Step 4: Checking the Data and columns

Here we check the data to view the first few rows, check the columns and a get a glimpse of the data

head(foodsales)#Views the first 6 rows## # A tibble: 6 × 8
## OrderDate Region City Categ…¹ Product Quant…² UnitP…³ Total…⁴
## <dttm> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2020-01-01 00:00:00 East Boston Bars Carrot 33 1.77 58.4
## 2 2020-01-04 00:00:00 East Boston Cracke… Whole … 87 3.49 304.
## 3 2020-01-07 00:00:00 West Los Angeles Cookies Chocol… 58 1.87 108.
## 4 2020-01-10 00:00:00 East New York Cookies Chocol… 82 1.87 153.
## 5 2020-01-13 00:00:00 East Boston Cookies Arrowr… 38 2.18 82.8
## 6 2020-01-16 00:00:00 East Boston Bars Carrot 54 1.77 95.6
## # … with abbreviated variable names ¹​Category, ²​Quantity, ³​UnitPrice,
## # ⁴​TotalPrice
slice(foodsales,1:10) #Views the first 10 rows## # A tibble: 10 × 8
## OrderDate Region City Categ…¹ Product Quant…² UnitP…³ Total…⁴
## <dttm> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2020-01-01 00:00:00 East Boston Bars Carrot 33 1.77 58.4
## 2 2020-01-04 00:00:00 East Boston Cracke… Whole … 87 3.49 304.
## 3 2020-01-07 00:00:00 West Los Angel… Cookies Chocol… 58 1.87 108.
## 4 2020-01-10 00:00:00 East New York Cookies Chocol… 82 1.87 153.
## 5 2020-01-13 00:00:00 East Boston Cookies Arrowr… 38 2.18 82.8
## 6 2020-01-16 00:00:00 East Boston Bars Carrot 54 1.77 95.6
## 7 2020-01-19 00:00:00 East Boston Cracke… Whole … 149 3.49 520.
## 8 2020-01-22 00:00:00 West Los Angel… Bars Carrot 51 1.77 90.3
## 9 2020-01-25 00:00:00 East New York Bars Carrot 100 1.77 177
## 10 2020-01-28 00:00:00 East New York Snacks Potato… 28 1.35 37.8
## # … with abbreviated variable names ¹​Category, ²​Quantity, ³​UnitPrice,
## # ⁴​TotalPrice
colnames(foodsales)# Checks the column name headers## [1] "OrderDate" "Region" "City" "Category" "Product"
## [6] "Quantity" "UnitPrice" "TotalPrice"
str(foodsales) #check also the data types for the data## tibble [244 × 8] (S3: tbl_df/tbl/data.frame)
## $ OrderDate : POSIXct[1:244], format: "2020-01-01" "2020-01-04" ...
## $ Region : chr [1:244] "East" "East" "West" "East" ...
## $ City : chr [1:244] "Boston" "Boston" "Los Angeles" "New York" ...
## $ Category : chr [1:244] "Bars" "Crackers" "Cookies" "Cookies" ...
## $ Product : chr [1:244] "Carrot" "Whole Wheat" "Chocolate Chip" "Chocolate Chip" ...
## $ Quantity : num [1:244] 33 87 58 82 38 54 149 51 100 28 ...
## $ UnitPrice : num [1:244] 1.77 3.49 1.87 1.87 2.18 1.77 3.49 1.77 1.77 1.35 ...
## $ TotalPrice: num [1:244] 58.4 303.6 108.5 153.3 82.8 ...

Step 5: clean the column names and check if all data types are correct

Before Analysis we clean the column names and check data types for the data

my_sales_clean_data<-clean_names(foodsales) # Clean the data
#colnames(clean) # view the new column Names
my_sales_clean_data <- na.omit(my_sales_clean_data)#remove Nullsslice(my_sales_clean_data,1:10) # View first 10 lines using the new columns## # A tibble: 10 × 8
## order_date region city categ…¹ product quant…² unit_…³ total…⁴
## <dttm> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2020-01-01 00:00:00 East Boston Bars Carrot 33 1.77 58.4
## 2 2020-01-04 00:00:00 East Boston Cracke… Whole … 87 3.49 304.
## 3 2020-01-07 00:00:00 West Los Angel… Cookies Chocol… 58 1.87 108.
## 4 2020-01-10 00:00:00 East New York Cookies Chocol… 82 1.87 153.
## 5 2020-01-13 00:00:00 East Boston Cookies Arrowr… 38 2.18 82.8
## 6 2020-01-16 00:00:00 East Boston Bars Carrot 54 1.77 95.6
## 7 2020-01-19 00:00:00 East Boston Cracke… Whole … 149 3.49 520.
## 8 2020-01-22 00:00:00 West Los Angel… Bars Carrot 51 1.77 90.3
## 9 2020-01-25 00:00:00 East New York Bars Carrot 100 1.77 177
## 10 2020-01-28 00:00:00 East New York Snacks Potato… 28 1.35 37.8
## # … with abbreviated variable names ¹​category, ²​quantity, ³​unit_price,
## # ⁴​total_price

Step 6: Analyze the data

a)View Sales By Category

Here we do some analysis to check which categories had the most sales

my_sales_clean_data %>% 
select(city, category, product,order_date, quantity, unit_price, total_price) %>%
group_by(category) %>%
summarise( total_sales_by_category = sum(total_price)) %>%
arrange(desc(total_sales_by_category))
## # A tibble: 4 × 2
## category total_sales_by_category
## <chr> <dbl>
## 1 Cookies 17212.
## 2 Bars 10536.
## 3 Crackers 3340.
## 4 Snacks 2238.

Here we note that Cookies and Bars is the category of snacks that had the most sales with three times the bottom two. Cookies tops the sales chart

b)View Sales By Product

Here we do some analysis to check which Products had the most sales

my_sales_clean_data %>% 
select(city, category, product,order_date, quantity, unit_price, total_price) %>%
group_by(product) %>%
summarise( total_sales_by_product = sum(total_price)) %>%
arrange(desc(total_sales_by_product))
## # A tibble: 9 × 2
## product total_sales_by_product
## <chr> <dbl>
## 1 Carrot 7411.
## 2 Oatmeal Raisin 7310.
## 3 Arrowroot 5330.
## 4 Chocolate Chip 4572.
## 5 Whole Wheat 3340.
## 6 Bran 2945.
## 7 Potato Chips 1652.
## 8 Pretzels 586.
## 9 Banana 179.

c)View Sales By Order date

Here we do some analysis to check which Products had the most sales in a particular period”the library(lubridate)” is used to get the year and month pat

my_sales_clean_data %>% 
mutate(month_name = month(order_date, label = TRUE)) %>%
mutate(saleyear = year(order_date)) %>%
group_by(saleyear,month_name) %>%
summarise( total_sales_by_date = sum(total_price)) %>%
arrange(saleyear,month_name)
## # A tibble: 24 × 3
## # Groups: saleyear [2]
## saleyear month_name total_sales_by_date
## <dbl> <ord> <dbl>
## 1 2020 Jan 1706.
## 2 2020 Feb 926.
## 3 2020 Mar 1648.
## 4 2020 Apr 1052.
## 5 2020 May 1393.
## 6 2020 Jun 2309.
## 7 2020 Jul 1263.
## 8 2020 Aug 1491.
## 9 2020 Sep 1524.
## 10 2020 Oct 1756.
## # … with 14 more rows

View Sales By Order date(2020 sales)

my_sales_clean_data %>% 
mutate(month_name = month(order_date, label = TRUE)) %>%
mutate(saleyear = year(order_date)) %>%
filter(saleyear == 2020) %>%
group_by(month_name) %>%
summarise( total_sales_by_date = sum(total_price)) %>%
arrange(desc(total_sales_by_date))
## # A tibble: 12 × 2
## month_name total_sales_by_date
## <ord> <dbl>
## 1 Jun 2309.
## 2 Oct 1756.
## 3 Jan 1706.
## 4 Mar 1648.
## 5 Dec 1610.
## 6 Sep 1524.
## 7 Aug 1491.
## 8 May 1393.
## 9 Nov 1311.
## 10 Jul 1263.
## 11 Apr 1052.
## 12 Feb 926.

View Sales By Order date(2021 sales)

my_sales_clean_data %>% 
mutate(month_name = month(order_date, label = TRUE)) %>%
mutate(saleyear = year(order_date)) %>%
filter(saleyear == 2021) %>%
group_by(month_name) %>%
summarise( total_sales_by_date = sum(total_price)) %>%
arrange(desc(total_sales_by_date))
## # A tibble: 12 × 2
## month_name total_sales_by_date
## <ord> <dbl>
## 1 Nov 1979.
## 2 Apr 1537.
## 3 Dec 1515.
## 4 Mar 1405.
## 5 Oct 1290.
## 6 May 1241.
## 7 Jan 1235.
## 8 Aug 1215.
## 9 Feb 1125.
## 10 Jun 1119.
## 11 Jul 849.
## 12 Sep 826.

Analysis By Visualization(Plots)

this section we analyze by visualization

Sales By product

ggplot(data = my_sales_clean_data ,aes(x= product, y= total_price, fill =product))+
geom_col() +
theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
labs(title ="Sales By Product",
subtitle = "Product sales",
caption = "Sample sales data downloaded for test purposes",
x="Product",
y="Total Sales")

Observations:

  1. Carrot and Oatmeal Raisin had the most product sales.
  2. Banana and Pretzels are the products struggling to get sales

Sales By City by category(wrap)

Here we get a view of all Sales done for each City

ggplot(data = my_sales_clean_data)+ geom_col(mapping = aes(x= category, y= total_price, fill =category))+
facet_wrap(~city)+
labs(title ="Sales By City by category",
# subtitle = "Sales By City by category",
caption = "Sample sales data downloaded for test purposes",
x="Category",
y="Total Sales")

Observations:

1)Carrot and Oatmeal Raisin had the most product sales across cities.

2)Banana and Pretzels are the products struggling to get sales.

3)Boston appears to have most sales across products.

4)In New York ArrowRoot has higher sales volumes than other products

Sales By City by category(grid)

ggplot(data = my_sales_clean_data)+ geom_col(mapping = aes(x= category, y= total_price, fill =product))+
facet_grid(~city)+
theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
labs(title ="Sales By City By Category",
subtitle = "Products by Category",
caption = "Sample sales data downloaded for test purposes",
x="Product Category",
y="Total Sales")

Sales By order Date

Total sales each month for all combined products each month

my_sales_clean_data %>% 
mutate(month_name = month(order_date, label = TRUE)) %>%
mutate(saleyear = year(order_date)) %>%
group_by(saleyear,month_name) %>%
summarise( total_sales_by_date = sum(total_price)) %>%
ggplot( mapping = aes(x= month_name, y= total_sales_by_date,color = factor(saleyear), group = factor(saleyear))) +
geom_line()+
theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
labs(title ="Monthly Sales(YOY)",
subtitle = "Sales by Order date",
caption = "Sample sales data downloaded for test purposes",
x="Product Category",
y="Total Sales")

Observations:

1)Sales are higher in early parts of the year and year ends(there is a dip from May to August). No enough data to explain the pattern

2)In 2020, June had an unusual jumb in sales

Month Sales By Product(multiple lines) (2020)

my_sales_clean_data %>% 
mutate(month_name = month(order_date, label = TRUE)) %>%
mutate(saleyear = year(order_date)) %>%
filter(saleyear== 2020) %>%
group_by(saleyear,month_name,product) %>%
summarise( total_sales_by_date = sum(total_price)) %>%
ggplot( mapping = aes(x= month_name, y= total_sales_by_date,color = product, group = factor(product))) +
geom_line()+
theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
labs(title ="Monthly Sales(YOY)",
subtitle = "Sales by Order date",
caption = "Sample sales data downloaded for test purposes",
x="Product Category",
y="Total Sales")

To get more insight into this behavior, we ran analysis by product as shown below.

Observations:

1)We Noted that Oatmeal Raisin and carrot recorded the most sales in this month

2)This pattern was not observed throughout the observed period. Not enough data to explain why this happened in 2020

Month Sales By Category(multiple lines) (2020)

my_sales_clean_data %>% 
mutate(month_name = month(order_date, label = TRUE)) %>%
mutate(saleyear = year(order_date)) %>%
filter(saleyear== 2020) %>%
group_by(month_name,category ) %>%
summarise( total_sales_by_date = sum(total_price)) %>%
ggplot( mapping = aes(x= month_name, y= total_sales_by_date,color = category , group = factor(category ))) +
geom_line()+
theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
labs(title ="Monthly Sales(YOY)",
subtitle = "Sales by Order date",
caption = "Sample sales data downloaded for test purposes",
x="Product Category",
y="Total Sales")

Observations:

1)The same trend was observed for the month of June which explains why Cookies and Bars had the most sames in that month

Month Sales By Category (Pie) (2020)

my_sales_clean_data %>%
# mutate(month_name = month(order_date, label = TRUE)) %>%
mutate(saleyear = year(order_date)) %>%
filter(saleyear== 2020) %>%
group_by(category) %>%
summarise( total_sales_by_cat = sum(total_price)) %>%
mutate(total_sales_by_cat_perc = `total_sales_by_cat` / sum(`total_sales_by_cat`)) %>%
mutate(labels = scales::percent(total_sales_by_cat_perc,accuracy = 1L)) %>%
ggplot( aes(x = "", y = total_sales_by_cat_perc, fill = category)) +
geom_col(color = "black") +
geom_label(aes(label = labels),
color = "white",
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
coord_polar(theta = "y")

Observations:

  1. Bars and Cookies make up more than 70 % of the sales. Snacks have the least sales across cities.

LinkedIN( www.linkedin.com/in/clayton-kasiyandima-95022861/)

Github(https://github.com/claykays)

The End

--

--

Clayton Kasiyandima

Passionate about Data and Analytics| Tableau desktop |Power BI| R | SQL | Excel |Python