Sales Data Analysis In R

Clayton

9 min readNov 3, 2022

2022–10–11

Sales Data Analysis In R including visualizations

This is an Analysis sample sales in R which i will utilize the “tidyverse”,“dplyr”,“ggplot2” and “tidyr” libraries to manipulate data and produce some visuals.( The observations do not reflect real world occurrences)

Step 1: Importing data data

The data is imported from an excel document into a data frame called foodsales.”all libraries already installed”

library(readxl)
foodsales <- read_excel("documents/sampledatafoodsales.xlsx")

Step 3: Loading Libraries

Here we load the libraries(all liraries already installed)

library(tidyverse)## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()library(dplyr)
library(ggplot2)
library(janitor)library(dplyr)
library(tidyr)
library(lubridate)options(dplyr.summarise.inform = FALSE)# this suppresses the summarise group warnings

Step 4: Checking the Data and columns

Here we check the data to view the first few rows, check the columns and a get a glimpse of the data

head(foodsales)#Views the first 6 rows## # A tibble: 6 × 8
##   OrderDate           Region City        Categ…¹ Product Quant…² UnitP…³ Total…⁴
##   <dttm>              <chr>  <chr>       <chr>   <chr>     <dbl>   <dbl>   <dbl>
## 1 2020-01-01 00:00:00 East   Boston      Bars    Carrot       33    1.77    58.4
## 2 2020-01-04 00:00:00 East   Boston      Cracke… Whole …      87    3.49   304. 
## 3 2020-01-07 00:00:00 West   Los Angeles Cookies Chocol…      58    1.87   108. 
## 4 2020-01-10 00:00:00 East   New York    Cookies Chocol…      82    1.87   153. 
## 5 2020-01-13 00:00:00 East   Boston      Cookies Arrowr…      38    2.18    82.8
## 6 2020-01-16 00:00:00 East   Boston      Bars    Carrot       54    1.77    95.6
## # … with abbreviated variable names ¹Category, ²Quantity, ³UnitPrice,
## #   ⁴TotalPriceslice(foodsales,1:10) #Views the first 10 rows## # A tibble: 10 × 8
##    OrderDate           Region City       Categ…¹ Product Quant…² UnitP…³ Total…⁴
##    <dttm>              <chr>  <chr>      <chr>   <chr>     <dbl>   <dbl>   <dbl>
##  1 2020-01-01 00:00:00 East   Boston     Bars    Carrot       33    1.77    58.4
##  2 2020-01-04 00:00:00 East   Boston     Cracke… Whole …      87    3.49   304. 
##  3 2020-01-07 00:00:00 West   Los Angel… Cookies Chocol…      58    1.87   108. 
##  4 2020-01-10 00:00:00 East   New York   Cookies Chocol…      82    1.87   153. 
##  5 2020-01-13 00:00:00 East   Boston     Cookies Arrowr…      38    2.18    82.8
##  6 2020-01-16 00:00:00 East   Boston     Bars    Carrot       54    1.77    95.6
##  7 2020-01-19 00:00:00 East   Boston     Cracke… Whole …     149    3.49   520. 
##  8 2020-01-22 00:00:00 West   Los Angel… Bars    Carrot       51    1.77    90.3
##  9 2020-01-25 00:00:00 East   New York   Bars    Carrot      100    1.77   177  
## 10 2020-01-28 00:00:00 East   New York   Snacks  Potato…      28    1.35    37.8
## # … with abbreviated variable names ¹Category, ²Quantity, ³UnitPrice,
## #   ⁴TotalPricecolnames(foodsales)# Checks the column name headers## [1] "OrderDate"  "Region"     "City"       "Category"   "Product"   
## [6] "Quantity"   "UnitPrice"  "TotalPrice"str(foodsales) #check also the data types for the data## tibble [244 × 8] (S3: tbl_df/tbl/data.frame)
##  $ OrderDate : POSIXct[1:244], format: "2020-01-01" "2020-01-04" ...
##  $ Region    : chr [1:244] "East" "East" "West" "East" ...
##  $ City      : chr [1:244] "Boston" "Boston" "Los Angeles" "New York" ...
##  $ Category  : chr [1:244] "Bars" "Crackers" "Cookies" "Cookies" ...
##  $ Product   : chr [1:244] "Carrot" "Whole Wheat" "Chocolate Chip" "Chocolate Chip" ...
##  $ Quantity  : num [1:244] 33 87 58 82 38 54 149 51 100 28 ...
##  $ UnitPrice : num [1:244] 1.77 3.49 1.87 1.87 2.18 1.77 3.49 1.77 1.77 1.35 ...
##  $ TotalPrice: num [1:244] 58.4 303.6 108.5 153.3 82.8 ...

Step 5: clean the column names and check if all data types are correct

Before Analysis we clean the column names and check data types for the data

my_sales_clean_data<-clean_names(foodsales) # Clean the data
#colnames(clean) # view the new column Namesmy_sales_clean_data <- na.omit(my_sales_clean_data)#remove Nullsslice(my_sales_clean_data,1:10) # View first 10 lines using the new columns## # A tibble: 10 × 8
##    order_date          region city       categ…¹ product quant…² unit_…³ total…⁴
##    <dttm>              <chr>  <chr>      <chr>   <chr>     <dbl>   <dbl>   <dbl>
##  1 2020-01-01 00:00:00 East   Boston     Bars    Carrot       33    1.77    58.4
##  2 2020-01-04 00:00:00 East   Boston     Cracke… Whole …      87    3.49   304. 
##  3 2020-01-07 00:00:00 West   Los Angel… Cookies Chocol…      58    1.87   108. 
##  4 2020-01-10 00:00:00 East   New York   Cookies Chocol…      82    1.87   153. 
##  5 2020-01-13 00:00:00 East   Boston     Cookies Arrowr…      38    2.18    82.8
##  6 2020-01-16 00:00:00 East   Boston     Bars    Carrot       54    1.77    95.6
##  7 2020-01-19 00:00:00 East   Boston     Cracke… Whole …     149    3.49   520. 
##  8 2020-01-22 00:00:00 West   Los Angel… Bars    Carrot       51    1.77    90.3
##  9 2020-01-25 00:00:00 East   New York   Bars    Carrot      100    1.77   177  
## 10 2020-01-28 00:00:00 East   New York   Snacks  Potato…      28    1.35    37.8
## # … with abbreviated variable names ¹category, ²quantity, ³unit_price,
## #   ⁴total_price

Step 6: Analyze the data

a)View Sales By Category

Here we do some analysis to check which categories had the most sales

my_sales_clean_data %>% 
  select(city, category, product,order_date, quantity, unit_price, total_price) %>% 
  group_by(category) %>%
summarise( total_sales_by_category = sum(total_price)) %>% 
arrange(desc(total_sales_by_category))## # A tibble: 4 × 2
##   category total_sales_by_category
##   <chr>                      <dbl>
## 1 Cookies                   17212.
## 2 Bars                      10536.
## 3 Crackers                   3340.
## 4 Snacks                     2238.

Here we note that Cookies and Bars is the category of snacks that had the most sales with three times the bottom two. Cookies tops the sales chart

b)View Sales By Product

Here we do some analysis to check which Products had the most sales

my_sales_clean_data %>% 
  select(city, category, product,order_date, quantity, unit_price, total_price) %>% 
  group_by(product) %>%
summarise( total_sales_by_product = sum(total_price)) %>% 
arrange(desc(total_sales_by_product))## # A tibble: 9 × 2
##   product        total_sales_by_product
##   <chr>                           <dbl>
## 1 Carrot                          7411.
## 2 Oatmeal Raisin                  7310.
## 3 Arrowroot                       5330.
## 4 Chocolate Chip                  4572.
## 5 Whole Wheat                     3340.
## 6 Bran                            2945.
## 7 Potato Chips                    1652.
## 8 Pretzels                         586.
## 9 Banana                           179.

c)View Sales By Order date

Here we do some analysis to check which Products had the most sales in a particular period”the library(lubridate)” is used to get the year and month pat

my_sales_clean_data %>% 
  mutate(month_name = month(order_date, label = TRUE)) %>% 
  mutate(saleyear = year(order_date)) %>%
 group_by(saleyear,month_name) %>% 
   summarise( total_sales_by_date = sum(total_price)) %>% 
   arrange(saleyear,month_name)## # A tibble: 24 × 3
## # Groups:   saleyear [2]
##    saleyear month_name total_sales_by_date
##       <dbl> <ord>                    <dbl>
##  1     2020 Jan                      1706.
##  2     2020 Feb                       926.
##  3     2020 Mar                      1648.
##  4     2020 Apr                      1052.
##  5     2020 May                      1393.
##  6     2020 Jun                      2309.
##  7     2020 Jul                      1263.
##  8     2020 Aug                      1491.
##  9     2020 Sep                      1524.
## 10     2020 Oct                      1756.
## # … with 14 more rows

View Sales By Order date(2020 sales)

my_sales_clean_data %>% 
  mutate(month_name = month(order_date, label = TRUE)) %>% 
  mutate(saleyear = year(order_date)) %>%
  filter(saleyear == 2020) %>% 
 group_by(month_name) %>% 
   summarise( total_sales_by_date = sum(total_price)) %>% 
    arrange(desc(total_sales_by_date))## # A tibble: 12 × 2
##    month_name total_sales_by_date
##    <ord>                    <dbl>
##  1 Jun                      2309.
##  2 Oct                      1756.
##  3 Jan                      1706.
##  4 Mar                      1648.
##  5 Dec                      1610.
##  6 Sep                      1524.
##  7 Aug                      1491.
##  8 May                      1393.
##  9 Nov                      1311.
## 10 Jul                      1263.
## 11 Apr                      1052.
## 12 Feb                       926.

View Sales By Order date(2021 sales)

my_sales_clean_data %>% 
  mutate(month_name = month(order_date, label = TRUE)) %>% 
  mutate(saleyear = year(order_date)) %>%
  filter(saleyear == 2021) %>% 
 group_by(month_name) %>% 
   summarise( total_sales_by_date = sum(total_price)) %>% 
   arrange(desc(total_sales_by_date))## # A tibble: 12 × 2
##    month_name total_sales_by_date
##    <ord>                    <dbl>
##  1 Nov                      1979.
##  2 Apr                      1537.
##  3 Dec                      1515.
##  4 Mar                      1405.
##  5 Oct                      1290.
##  6 May                      1241.
##  7 Jan                      1235.
##  8 Aug                      1215.
##  9 Feb                      1125.
## 10 Jun                      1119.
## 11 Jul                       849.
## 12 Sep                       826.

Analysis By Visualization(Plots)

this section we analyze by visualization

Sales By product

ggplot(data = my_sales_clean_data ,aes(x= product, y= total_price, fill =product))+
geom_col() +
   theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
   labs(title ="Sales By Product", 
        subtitle = "Product sales",
        caption = "Sample sales data downloaded for test purposes",
        x="Product",
        y="Total Sales")

Observations:

Carrot and Oatmeal Raisin had the most product sales.
Banana and Pretzels are the products struggling to get sales

Sales By City by category(wrap)

Here we get a view of all Sales done for each City

ggplot(data = my_sales_clean_data)+ geom_col(mapping = aes(x= category, y= total_price, fill =category))+
   facet_wrap(~city)+
  labs(title ="Sales By City by category", 
      #  subtitle = "Sales By City by category",
        caption = "Sample sales data downloaded for test purposes",
        x="Category",
        y="Total Sales")

Observations:

1)Carrot and Oatmeal Raisin had the most product sales across cities.

2)Banana and Pretzels are the products struggling to get sales.

3)Boston appears to have most sales across products.

4)In New York ArrowRoot has higher sales volumes than other products

Sales By City by category(grid)

ggplot(data = my_sales_clean_data)+ geom_col(mapping = aes(x= category, y= total_price, fill =product))+
   facet_grid(~city)+
   theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
   labs(title ="Sales By City By Category", 
        subtitle = "Products by Category",
        caption = "Sample sales data downloaded for test purposes",
 x="Product Category",
 y="Total Sales")

Sales By order Date

Total sales each month for all combined products each month

my_sales_clean_data %>% 
   mutate(month_name = month(order_date, label = TRUE)) %>% 
   mutate(saleyear = year(order_date)) %>%
   group_by(saleyear,month_name) %>% 
   summarise( total_sales_by_date = sum(total_price)) %>% 
 ggplot( mapping = aes(x= month_name, y= total_sales_by_date,color = factor(saleyear), group = factor(saleyear))) +
   geom_line()+
   theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
   labs(title ="Monthly Sales(YOY)", 
        subtitle = "Sales by Order date",
        caption = "Sample sales data downloaded for test purposes",
        x="Product Category",
        y="Total Sales")

Observations:

1)Sales are higher in early parts of the year and year ends(there is a dip from May to August). No enough data to explain the pattern

2)In 2020, June had an unusual jumb in sales

Month Sales By Product(multiple lines) (2020)

my_sales_clean_data %>% 
   mutate(month_name = month(order_date, label = TRUE)) %>% 
   mutate(saleyear = year(order_date)) %>%
    filter(saleyear== 2020) %>% 
   group_by(saleyear,month_name,product) %>% 
   summarise( total_sales_by_date = sum(total_price)) %>% 
 ggplot( mapping = aes(x= month_name, y= total_sales_by_date,color = product, group = factor(product))) +
   geom_line()+
   theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
   labs(title ="Monthly Sales(YOY)", 
        subtitle = "Sales by Order date",
        caption = "Sample sales data downloaded for test purposes",
        x="Product Category",
        y="Total Sales")

To get more insight into this behavior, we ran analysis by product as shown below.

Observations:

1)We Noted that Oatmeal Raisin and carrot recorded the most sales in this month

2)This pattern was not observed throughout the observed period. Not enough data to explain why this happened in 2020

Month Sales By Category(multiple lines) (2020)

my_sales_clean_data %>% 
   mutate(month_name = month(order_date, label = TRUE)) %>% 
   mutate(saleyear = year(order_date)) %>%
    filter(saleyear== 2020) %>% 
   group_by(month_name,category ) %>% 
   summarise( total_sales_by_date = sum(total_price)) %>% 
 ggplot( mapping = aes(x= month_name, y= total_sales_by_date,color = category , group = factor(category ))) +
   geom_line()+
   theme(axis.text.x = element_text(angle = 45,colour = "blue"))+
   labs(title ="Monthly Sales(YOY)", 
        subtitle = "Sales by Order date",
        caption = "Sample sales data downloaded for test purposes",
        x="Product Category",
        y="Total Sales")

Observations:

1)The same trend was observed for the month of June which explains why Cookies and Bars had the most sames in that month

Month Sales By Category (Pie) (2020)

my_sales_clean_data %>%
 # mutate(month_name = month(order_date, label = TRUE)) %>% 
   mutate(saleyear = year(order_date)) %>%
    filter(saleyear== 2020) %>%
   group_by(category) %>%
   summarise( total_sales_by_cat = sum(total_price)) %>% 
   mutate(total_sales_by_cat_perc = `total_sales_by_cat` / sum(`total_sales_by_cat`)) %>% 
   mutate(labels = scales::percent(total_sales_by_cat_perc,accuracy = 1L)) %>% 
   ggplot( aes(x = "", y = total_sales_by_cat_perc, fill = category)) +
   geom_col(color = "black") +
   geom_label(aes(label =  labels),
              color = "white",
              position = position_stack(vjust = 0.5),
              show.legend = FALSE) +
   coord_polar(theta = "y")