Datawizard: An introduction to the dependency-free R package for data wrangling

Simisani Ndaba
Published in ILLUMINATION
5 min read · Jan 15, 2024

Photo by emrecan arık on Unsplash

There are a few dependency-free R packages commonly used for data wrangling. One is poorman; another is Datawizard. Dependency-free R packages are packages that do not depend on the functionality of other R packages. I discovered Datawizard while researching my talk on R packages for data cleaning. At the time of writing, the latest version, Datawizard 0.9.1, was released on the 9th of September 2023.

What is Datawizard

Datawizard is used for data transformation and statistical operations and is part of the easystats collection. Although Datawizard is easily overshadowed by dplyr's popularity, it can be a good alternative to dplyr for data wrangling. The Datawizard package was developed by Indrajeet Patil, Etienne Bacher, Dominique Makowski, Daniel Lüdecke, Mattan S. Ben-Shachar and Brenton M. Wiernik.

Datawizard offers functions similar to dplyr's for merging, arranging, grouping and extracting unique values. The main difference is in the naming: many Datawizard functions begin with data, followed by an underscore and then the action the function performs, for example data_extract().
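To get a feel for this naming convention, one option is to list the exported functions whose names start with data_. This is a small sketch that relies on base R's getNamespaceExports(); the exact list depends on the Datawizard version you have installed.

```r
library(datawizard)

# all objects exported by the installed datawizard version
exports <- getNamespaceExports("datawizard")

# keep only those that follow the data_<action> naming pattern
data_functions <- sort(grep("^data_", exports, value = TRUE))
head(data_functions)
```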

This article provides a short tutorial on functions from the Datawizard package.

Let's look at a few of the Datawizard functions for data wrangling, using a dataset that shows how the functions work.

1. Installing the Datawizard package

Begin by installing and loading the Datawizard package.

#Installing and loading the Datawizard package
install.packages("datawizard")
library(datawizard)

2. Read the dataset using data_read()

The data_read() function imports data from various file types. It is a small wrapper around functions such as haven::read_stata(), readxl::read_excel() and data.table::fread().

#read the dataset using the data_read() function
house_price <- data_read("https://raw.githubusercontent.com/sndaba/RPackagesForDataCleaning/main/NYC_2022.csv")
View(house_price) #output dataset sample seen below
sample of the dataset. Photo by Author.

3. Peek at the values and type of variables using data_peek()

data_peek() creates a table-like data frame showing all column names, variable types and the first values of each column (as many as fit on the screen).

#data_peek() shows a summary of each variable's details
data_peek(house_price)
data frame summary showing the type of each variable and examples of values in a variable. Photo by Author.
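If you want to try data_peek() without downloading the housing data, here is a minimal sketch on the built-in iris dataset (my choice for illustration, not part of the original walkthrough). It assumes that the select argument accepts select-helpers, as it does for the other column functions.

```r
library(datawizard)

# iris ships with R: four numeric columns and one factor
peek <- data_peek(iris)
peek

# peek only at the columns whose names start with "Sepal"
data_peek(iris, select = starts_with("Sepal"))
```

The result is itself a data frame with one row per variable, so it can be stored and inspected like any other data frame.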

4. Statistical summary using data_codebook()

data_codebook() generates codebooks from data frames, i.e. overviews of all variables and some more information about each variable (like labels, values or value range, frequencies, amount of missing values).

#generate an overview of each variable: missing values, value counts and frequencies
(code <- data_codebook(house_price))
Output from data_codebook(). Photo by Author.
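The same idea can be tried on a built-in dataset. The sketch below uses iris (an assumption for illustration only) and also shows how the codebook might be limited to a few columns with a select-helper.

```r
library(datawizard)

# codebook for every column of the built-in iris data
codebook <- data_codebook(iris)
codebook

# restrict the codebook to columns whose names start with "Sepal"
data_codebook(iris, select = starts_with("Sepal"))
```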

5. Replacing missing values with convert_na_to()

Replace missing values in a variable or a data frame using convert_na_to().

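#replace missing values: 0 for numeric columns, "missing" for character columns
house_price <- convert_na_to(house_price, replace_num = 0, replace_char = "missing")
#keep a copy under a second name for the column-search examples
house_price_missing <- house_price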
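To see the effect on a small scale, here is a sketch with a hypothetical three-row data frame (the listings object and its values are made up for illustration). Numeric gaps become 0 and character gaps become "missing".

```r
library(datawizard)

# hypothetical toy data with a missing value in each column type
listings <- data.frame(
  price = c(100, NA, 250),
  room_type = c("Private room", NA, "Entire home"),
  stringsAsFactors = FALSE
)

# numeric NAs are replaced by 0, character NAs by "missing"
cleaned <- convert_na_to(listings, replace_num = 0, replace_char = "missing")
cleaned
```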

6. Searching for columns

find_columns() returns column names from a data set that match a certain search pattern, while get_columns() returns the found data.

#finding columns
find_columns(house_price_missing, starts_with("neighbourhood"))

#output: the matching column names
#[1] "neighbourhood_group" "neighbourhood"

#get_columns() returns the matching columns together with their values
get_columns(house_price_missing, starts_with("neighbourhood"))
get_columns() output shows values of the columns. Photo by Author.
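For a version-independent way to try the same pattern search, the sketch below uses data_select(), which to my understanding behaves like get_columns(), on the built-in iris data. Calling names() on the selection gives the kind of result find_columns() returns. Treat the equivalence as an assumption and check the documentation of your installed version.

```r
library(datawizard)

# select columns by a name pattern; the result keeps the data
sepal_data <- data_select(iris, starts_with("Sepal"))
head(sepal_data)

# names() on the selection gives just the matching column names
names(sepal_data)
```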

7. Look for columns based on pattern name with data_seek()

data_seek() looks for variables in a data frame, based on patterns that match either the variable name (column name), variable labels, value labels or factor levels. Matching variable and value labels only works for “labelled” data, i.e. when the variables have either a label attribute or a labels attribute.

#looks for columns even with a typo. "hot" is similar to "host" or "hood"
data_seek(house_price, "hot", fuzzy = TRUE)
List of columns that are close to the pattern “hot”. Photo by Author.
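Here is a small sketch of both search modes on the built-in iris data (chosen for illustration). The misspelled pattern "Lenth" is deliberate; how tolerant the fuzzy search is depends on the package's internal matching, so I show the call without promising a particular result.

```r
library(datawizard)

# exact pattern matching against the column names
exact_hits <- data_seek(iris, "Length")
exact_hits

# fuzzy matching is intended to tolerate small typos in the pattern
fuzzy_hits <- data_seek(iris, "Lenth", fuzzy = TRUE)
fuzzy_hits
```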

8. Remove columns with data_remove()

data_remove() removes columns from a data frame. Like the other column functions, it supports select-helpers that allow flexible specification of a search pattern to find the matching columns that should be reordered or removed.

#remove several columns at once by passing their names as one character vector
house_price <- datawizard::data_remove(house_price, c("latitude", "longitude"))

#remove a single column
house_price <- datawizard::data_remove(house_price, "id")
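One detail worth illustrating: the columns to drop go into a single select argument, so several names are passed together as one character vector. A minimal sketch on the built-in mtcars data (my choice of example data):

```r
library(datawizard)

# several columns are passed as ONE character vector
trimmed <- data_remove(mtcars, c("mpg", "cyl"))
names(trimmed)

# select-helpers work too: drop every column whose name starts with "d"
data_remove(mtcars, starts_with("d"))
```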

9. Column reordering with data_reorder()

data_reorder() moves selected columns to the beginning of a data frame. The other column-ordering function, data_relocate() (not covered in this article), moves columns to specific positions, indicated by before or after.

#move "host_id" and "name" to the front
house_price <- datawizard::data_reorder(house_price, c("host_id", "name"))

#move "host_name" and "name" to the front
house_price <- datawizard::data_reorder(house_price, c("host_name", "name"))

#finally move "host_id" and "host_name" to the front; the order is now host_id, host_name, name, ...
house_price <- datawizard::data_reorder(house_price, c("host_id", "host_name"))
columns reordered. Photo by Author.
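Because each call moves the selected columns to the front, repeated calls stack on top of each other. A short sketch on the built-in iris data (used here only for illustration) makes the effect easy to check:

```r
library(datawizard)

# move Species and Petal.Length to the front, keeping the rest in order
front <- data_reorder(iris, c("Species", "Petal.Length"))
names(front)

# a second call moves another column in front of the previous result
front_again <- data_reorder(front, "Petal.Width")
names(front_again)
```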

10. Rename some columns using data_rename()

#the column "price" will change to "house_price"
house_price <- datawizard::data_rename(house_price,"price","house_price")
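The call above passes the old name first and the new name second. The sketch below repeats this on the built-in iris data (my example, with made-up new names) and shows that several columns can be renamed in one call by passing vectors of equal length.

```r
library(datawizard)

# rename one column: old name first, new name second
renamed <- data_rename(iris, "Species", "species_name")
names(renamed)

# rename several columns at once with two vectors of equal length
renamed_many <- data_rename(
  iris,
  c("Sepal.Length", "Sepal.Width"),
  c("sepal_len", "sepal_wid")
)
names(renamed_many)
```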

11. Filtering and Matching with data_filter() and data_match()

Both functions return a filtered (or sliced) data frame, or the row indices of a data frame, that match a specific condition. data_filter() works like data_match(), but takes logical expressions or row indices to specify the matching conditions.

#match rows following variable conditions with data_match()
View(data_match(house_price, data.frame(neighbourhood_group = "Brooklyn")))

Data frame subset with rows where the neighbourhood_group column is “Brooklyn”. Photo by Author.

#filtering using logical expressions
View(data_filter(house_price, room_type == "Private room" & house_price > 120000))

Data frame subset with room_type set to “Private room” and house_price > 120000. Photo by Author.
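To compare the two functions side by side, here is a sketch on the built-in mtcars data (chosen for illustration) that expresses the same condition both ways. Under these assumptions, both calls should keep the same rows.

```r
library(datawizard)

# data_match(): conditions given as a data frame of column = value pairs
matched <- data_match(mtcars, data.frame(vs = 0, am = 1))

# data_filter(): the same conditions written as a logical expression
filtered <- data_filter(mtcars, vs == 0 & am == 1)

# compare the number of rows kept by each approach
nrow(matched)
nrow(filtered)
```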

In Summary
The Datawizard package is an all-purpose data science package that offers operations for data transformation, statistical summaries and data cleaning.

Continue reading in the Datawizard documentation and explore the code in the Datawizard repository.


Simisani Ndaba
Teaching Assistant in the Department of Computer Science at the University of Botswana. Interests are in Machine Learning and Data Science.