I work primarily with quantitative data, so recently when I had a few projects that involved merging on strings, I had to enter a new, messy world. My first challenge? Encountering bizarre character strings from a .csv file that I read into R. As many in the quantitative analysis field have been trained to do (see the From Excel section in this tutorial and the Answer to this StackOverflow question), I started by saving my .xlsx file as a .csv. This step resulted in bits of string that looked completely normal when viewing in Excel, as shown below:
And then in R, those same bits got messed up, as shown here:
Certainly, this threw more than a small monkey wrench into my task of merging data by string columns. If you Google this problem, you might think you need to specify the encoding in R (e.g., encoding = “utf-8”).
But the solution is actually much simpler: Read the data in directly from Excel instead!
Obviously, for this solution to work, you need to have your data saved as an .xlsx spreadsheet in the first place. The following code walks through the few steps needed:
# Load library needed to read directly from Excel
library(openxlsx)# Establish the file path for the working directory
wd_file_path <- “~/example_directory/”# Set the working directory to the file path established above
setwd(“wd_file_path”)# Read in the data!
matched_content <- read.xlsx(“example_data_with_strings.xlsx”)
How the same string looks in R now:
Research at Panorama Education is expanding! Check out this new Senior Data Scientist role.