Sales Data Analysis In R
Clayton
2022–10–11
Sales Data Analysis In R including visualizations
This is an Analysis sample sales in R which i will utilize the “tidyverse”,“dplyr”,“ggplot2” and “tidyr” libraries to manipulate data and produce some visuals.( The observations do not reflect real world occurrences)
Step 1: Importing data data
The data is imported from an excel document into a data frame called foodsales.”all libraries already installed”
library(readxl)
foodsales <- read_excel("documents/sampledatafoodsales.xlsx")
Step 3: Loading Libraries
Here we load the libraries(all liraries already installed)
library(tidyverse)## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()library(dplyr)
library(ggplot2)
library(janitor)library(dplyr)
library(tidyr)
library(lubridate)options(dplyr.summarise.inform = FALSE)# this suppresses the summarise group warnings
Step 4: Checking the Data and columns
Here we check the data to view the first few rows, check the columns and a get a glimpse of the data
head(foodsales)#Views the first 6 rows## # A tibble: 6 × 8
## OrderDate Region City Categ…¹ Product Quant…² UnitP…³ Total…⁴
## <dttm> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2020-01-01 00:00:00 East Boston Bars Carrot 33 1.77 58.4
## 2 2020-01-04 00:00:00 East Boston Cracke… Whole … 87 3.49 304.
## 3 2020-01-07 00:00:00 West Los Angeles Cookies Chocol… 58 1.87 108.
## 4 2020-01-10 00:00:00 East New York Cookies Chocol… 82 1.87 153.
## 5 2020-01-13 00:00:00 East Boston Cookies Arrowr… 38 2.18 82.8
## 6 2020-01-16 00:00:00 East Boston Bars Carrot 54 1.77 95.6
## # … with abbreviated variable names ¹Category, ²Quantity, ³UnitPrice,
## # ⁴TotalPriceslice(foodsales,1:10) #Views the first 10 rows## # A tibble: 10 × 8
## OrderDate Region City Categ…¹ Product Quant…² UnitP…³ Total…⁴
## <dttm> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2020-01-01 00:00:00 East Boston Bars Carrot 33 1.77 58.4
## 2 2020-01-04 00:00:00 East Boston Cracke… Whole … 87 3.49 304.
## 3 2020-01-07 00:00:00 West Los Angel… Cookies Chocol… 58 1.87 108.
## 4 2020-01-10 00:00:00 East New York Cookies Chocol… 82 1.87 153.
## 5 2020-01-13 00:00:00 East Boston Cookies Arrowr… 38 2.18 82.8
## 6 2020-01-16 00:00:00 East Boston Bars Carrot 54 1.77 95.6
## 7 2020-01-19 00:00:00 East Boston Cracke… Whole … 149 3.49 520.
## 8 2020-01-22 00:00:00 West Los Angel… Bars Carrot 51 1.77 90.3
## 9 2020-01-25 00:00:00 East New York Bars Carrot 100 1.77 177
## 10 2020-01-28 00:00:00 East New York Snacks Potato… 28 1.35 37.8
## # … with abbreviated variable names ¹Category, ²Quantity, ³UnitPrice,
## # ⁴TotalPricecolnames(foodsales)# Checks the column name headers## [1] "OrderDate" "Region" "City" "Category" "Product"
## [6] "Quantity" "UnitPrice" "TotalPrice"str(foodsales) #check also the data types for the data## tibble [244 × 8] (S3: tbl_df/tbl/data.frame)
## $ OrderDate : POSIXct[1:244], format: "2020-01-01" "2020-01-04" ...
## $ Region : chr [1:244] "East" "East" "West" "East" ...
## $ City : chr [1:244] "Boston" "Boston" "Los Angeles" "New York" ...
## $ Category : chr [1:244] "Bars" "Crackers" "Cookies" "Cookies" ...
## $ Product : chr [1:244] "Carrot" "Whole Wheat" "Chocolate Chip" "Chocolate Chip" ...
## $ Quantity : num [1:244] 33 87 58 82 38 54 149 51 100 28 ...
## $ UnitPrice : num [1:244] 1.77 3.49 1.87 1.87 2.18 1.77 3.49 1.77 1.77 1.35 ...
## $ TotalPrice: num [1:244] 58.4 303.6 108.5 153.3 82.8 ...
Step 5: clean the column names and check if all data types are correct
Before Analysis we clean the column names and check data types for the data
my_sales_clean_data<-clean_names(foodsales) # Clean the data
#colnames(clean) # view the new column Namesmy_sales_clean_data <- na.omit(my_sales_clean_data)#remove Nullsslice(my_sales_clean_data,1:10) # View first 10 lines using the new columns## # A tibble: 10 × 8
## order_date region city categ…¹ product quant…² unit_…³ total…⁴
## <dttm> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2020-01-01 00:00:00 East Boston Bars Carrot 33 1.77 58.4
## 2 2020-01-04 00:00:00 East Boston Cracke… Whole … 87 3.49 304.
## 3 2020-01-07 00:00:00 West Los Angel… Cookies Chocol… 58 1.87 108.
## 4 2020-01-10 00:00:00 East New York Cookies Chocol… 82 1.87 153.
## 5 2020-01-13 00:00:00 East Boston Cookies Arrowr… 38 2.18 82.8
## 6 2020-01-16 00:00:00 East Boston Bars Carrot 54 1.77 95.6
## 7 2020-01-19 00:00:00 East Boston Cracke… Whole … 149 3.49 520.
## 8 2020-01-22 00:00:00 West Los Angel… Bars Carrot 51 1.77 90.3
## 9 2020-01-25 00:00:00 East New York Bars Carrot 100 1.77 177
## 10 2020-01-28 00:00:00 East New York Snacks Potato… 28 1.35 37.8
## # … with abbreviated variable names ¹category, ²quantity, ³unit_price,
## # ⁴total_price
Step 6: Analyze the data
a)View Sales By Category
Here we do some analysis to check which categories had the most sales
my_sales_clean_data %>%
select(city, category, product,order_date, quantity, unit_price, total_price) %>%
group_by(category) %>%
summarise( total_sales_by_category = sum(total_price)) %>%
arrange(desc(total_sales_by_category))## # A tibble: 4 × 2
## category total_sales_by_category
## <chr> <dbl>
## 1 Cookies 17212.
## 2 Bars 10536.
## 3 Crackers 3340.
## 4 Snacks 2238.
Here we note that Cookies and Bars is the category of snacks that had the most sales with three times the bottom two. Cookies tops the sales chart
b)View Sales By Product
Here we do some analysis to check which Products had the most sales
my_sales_clean_data %>%
select(city, category, product,order_date, quantity, unit_price, total_price) %>%
group_by(product) %>%
summarise( total_sales_by_product = sum(total_price)) %>%
arrange(desc(total_sales_by_product))## # A tibble: 9 × 2
## product total_sales_by_product
## <chr> <dbl>
## 1 Carrot 7411.
## 2 Oatmeal Raisin 7310.
## 3 Arrowroot 5330.
## 4 Chocolate Chip 4572.
## 5 Whole Wheat 3340.
## 6 Bran 2945.
## 7 Potato Chips 1652.
## 8 Pretzels 586.
## 9 Banana 179.
c)View Sales By Order date
Here we do some analysis to check which Products had the most sales in a particular period”the library(lubridate)” is used to get the year and month pat
my_sales_clean_data %>%
mutate(month_name = month(order_date, label = TRUE)) %>%
mutate(saleyear = year(order_date)) %>%
group_by(saleyear,month_name) %>%
summarise( total_sales_by_date = sum(total_price)) %>%
arrange(saleyear,month_name)## # A tibble: 24 × 3
## # Groups: saleyear [2]
## saleyear month_name total_sales_by_date
## <dbl> <ord> <dbl>
## 1 2020 Jan 1706.
## 2 2020 Feb 926.
## 3 2020 Mar 1648.
## 4 2020 Apr 1052.
## 5 2020 May 1393.
## 6 2020 Jun 2309.
## 7 2020 Jul 1263.
## 8 2020 Aug 1491.
## 9 2020 Sep 1524.
## 10 2020 Oct 1756.
## # … with 14 more rows
View Sales By Order date(2020 sales)
my_sales_clean_data %>%
mutate(month_name = month(order_date, label = TRUE)) %>%
mutate(saleyear = year(order_date)) %>%
filter(saleyear == 2020) %>%
group_by(month_name) %>%
summarise( total_sales_by_date = sum(total_price)) %>%
arrange(desc(total_sales_by_date))## # A tibble: 12 × 2
## month_name total_sales_by_date
## <ord> <dbl>
## 1 Jun 2309.
## 2 Oct 1756.
## 3 Jan 1706.
## 4 Mar 1648.
## 5 Dec 1610.
## 6 Sep 1524.
## 7 Aug 1491.
## 8 May 1393.
## 9 Nov 1311.
## 10 Jul 1263.
## 11 Apr 1052.
## 12 Feb 926.
View Sales By Order date(2021 sales)
my_sales_clean_data %>%
mutate(month_name = month(order_date, label = TRUE)) %>%
mutate(saleyear = year(order_date)) %>%
filter(saleyear == 2021) %>%
group_by(month_name) %>%
summarise( total_sales_by_date = sum(total_price)) %>%
arrange(desc(total_sales_by_date))## # A tibble: 12 × 2
## month_name total_sales_by_date
## <ord> <dbl>
## 1 Nov 1979.
## 2 Apr 1537.
## 3 Dec 1515.
## 4 Mar 1405.
## 5 Oct 1290.
## 6 May 1241.
## 7 Jan 1235.
## 8 Aug 1215.
## 9 Feb 1125.
## 10 Jun 1119.
## 11 Jul 849.
## 12 Sep 826.
Analysis By Visualization(Plots)
this section we analyze by visualization
Sales By product
ggplot(data = my_sales_clean_data ,aes(x= product, y= total_price, fill =product))+
geom_col() +
theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
labs(title ="Sales By Product",
subtitle = "Product sales",
caption = "Sample sales data downloaded for test purposes",
x="Product",
y="Total Sales")
Observations:
- Carrot and Oatmeal Raisin had the most product sales.
- Banana and Pretzels are the products struggling to get sales
Sales By City by category(wrap)
Here we get a view of all Sales done for each City
ggplot(data = my_sales_clean_data)+ geom_col(mapping = aes(x= category, y= total_price, fill =category))+
facet_wrap(~city)+
labs(title ="Sales By City by category",
# subtitle = "Sales By City by category",
caption = "Sample sales data downloaded for test purposes",
x="Category",
y="Total Sales")
Observations:
1)Carrot and Oatmeal Raisin had the most product sales across cities.
2)Banana and Pretzels are the products struggling to get sales.
3)Boston appears to have most sales across products.
4)In New York ArrowRoot has higher sales volumes than other products
Sales By City by category(grid)
ggplot(data = my_sales_clean_data)+ geom_col(mapping = aes(x= category, y= total_price, fill =product))+
facet_grid(~city)+
theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
labs(title ="Sales By City By Category",
subtitle = "Products by Category",
caption = "Sample sales data downloaded for test purposes",
x="Product Category",
y="Total Sales")
Sales By order Date
Total sales each month for all combined products each month
my_sales_clean_data %>%
mutate(month_name = month(order_date, label = TRUE)) %>%
mutate(saleyear = year(order_date)) %>%
group_by(saleyear,month_name) %>%
summarise( total_sales_by_date = sum(total_price)) %>%
ggplot( mapping = aes(x= month_name, y= total_sales_by_date,color = factor(saleyear), group = factor(saleyear))) +
geom_line()+
theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
labs(title ="Monthly Sales(YOY)",
subtitle = "Sales by Order date",
caption = "Sample sales data downloaded for test purposes",
x="Product Category",
y="Total Sales")
Observations:
1)Sales are higher in early parts of the year and year ends(there is a dip from May to August). No enough data to explain the pattern
2)In 2020, June had an unusual jumb in sales
Month Sales By Product(multiple lines) (2020)
my_sales_clean_data %>%
mutate(month_name = month(order_date, label = TRUE)) %>%
mutate(saleyear = year(order_date)) %>%
filter(saleyear== 2020) %>%
group_by(saleyear,month_name,product) %>%
summarise( total_sales_by_date = sum(total_price)) %>%
ggplot( mapping = aes(x= month_name, y= total_sales_by_date,color = product, group = factor(product))) +
geom_line()+
theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
labs(title ="Monthly Sales(YOY)",
subtitle = "Sales by Order date",
caption = "Sample sales data downloaded for test purposes",
x="Product Category",
y="Total Sales")
To get more insight into this behavior, we ran analysis by product as shown below.
Observations:
1)We Noted that Oatmeal Raisin and carrot recorded the most sales in this month
2)This pattern was not observed throughout the observed period. Not enough data to explain why this happened in 2020
Month Sales By Category(multiple lines) (2020)
my_sales_clean_data %>%
mutate(month_name = month(order_date, label = TRUE)) %>%
mutate(saleyear = year(order_date)) %>%
filter(saleyear== 2020) %>%
group_by(month_name,category ) %>%
summarise( total_sales_by_date = sum(total_price)) %>%
ggplot( mapping = aes(x= month_name, y= total_sales_by_date,color = category , group = factor(category ))) +
geom_line()+
theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
labs(title ="Monthly Sales(YOY)",
subtitle = "Sales by Order date",
caption = "Sample sales data downloaded for test purposes",
x="Product Category",
y="Total Sales")
Observations:
1)The same trend was observed for the month of June which explains why Cookies and Bars had the most sames in that month
Month Sales By Category (Pie) (2020)
my_sales_clean_data %>%
# mutate(month_name = month(order_date, label = TRUE)) %>%
mutate(saleyear = year(order_date)) %>%
filter(saleyear== 2020) %>%
group_by(category) %>%
summarise( total_sales_by_cat = sum(total_price)) %>%
mutate(total_sales_by_cat_perc = `total_sales_by_cat` / sum(`total_sales_by_cat`)) %>%
mutate(labels = scales::percent(total_sales_by_cat_perc,accuracy = 1L)) %>%
ggplot( aes(x = "", y = total_sales_by_cat_perc, fill = category)) +
geom_col(color = "black") +
geom_label(aes(label = labels),
color = "white",
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
coord_polar(theta = "y")
Observations:
- Bars and Cookies make up more than 70 % of the sales. Snacks have the least sales across cities.
LinkedIN( www.linkedin.com/in/clayton-kasiyandima-95022861/)
Github(https://github.com/claykays)