The DataViz Process, Part 1: Finding and Cleaning Your Dataset

I want to write about a new tool I’ve found for data visualization.

But the process of creating a data visualization doesn’t begin with using the tool I’ve decided on. It actually begins a few steps before that — with finding and then cleaning a dataset.

So let’s talk about those steps before I introduce the new tool I’ve been playing with and share some tips for using it in Part 2.

Finding Your Data

The first part of the data visualization process is to identify a dataset that you can explore or use to answer your research question. Sometimes this is as easy as having a subject or dataset assigned to you by your professor or boss.

Screen capture of the Raynor Library’s A-Z Databases page.

But sometimes you need to do a little work to narrow down the right dataset for your visualization. If you have a topic area, you can use the list of curated resources provided by the subject librarians at the Research and Information department of the Raynor Library. I recommend checking out the Data & Statistics page in particular for a list of dataset resources.

But you might also look elsewhere: there are freely available datasets in places like Google Scholar, the Public Data Directory, Data.gov, local and institutional Open Data initiatives, and more. One thing you should always do with a dataset you’ve found online, however, is evaluate it critically, just as you would any other source of information you use in your work (see this LibGuide on Evaluating Reliable Sources for strategies on how to assess resources).

For my dataset, I decided to use the Metropolitan Museum of Art’s open access dataset. It’s freely available on their GitHub page and updated frequently as they continue to explore, assess, and expand their collection. So I downloaded the file, provided in a .txt format, into a new folder on my desktop so it wouldn’t get lost in my Downloads folder, making sure to add the date to the filename: MetObjects_20211011.txt. This is because the Met dataset, like many others, is constantly being updated. I want to be sure that if I ever come back to this file, I’ll know it represents the data as of a specific date, and that a newer, more accurate version might be available.
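If you like to script this step, a quick way to stamp the snapshot date into the filename (Python standard library only; the download itself still happens manually from the Met’s GitHub page) is:

```python
from datetime import date

# Build a date-stamped filename so the snapshot date is always recoverable.
# "MetObjects" and the .txt extension match the file the Met provides.
snapshot = date(2021, 10, 11)  # the date I downloaded the file
filename = f"MetObjects_{snapshot:%Y%m%d}.txt"
print(filename)  # MetObjects_20211011.txt
```

Using `date.today()` instead of a fixed date would stamp whatever day you run the script.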

Exploring Your Dataset

Your dataset might come in a very clean format, but it’s far likelier that you’ll have to do some amount of cleaning to get it into the right shape for your visualization.

First, you want to open your dataset up and just take a look at it. How many rows does it contain? How many columns? What are the columns, and do they have clear, plain names? Is there a reference guide (usually either a separate tab in the dataset or a separate document) that explains what each column represents? What kind of information is in your dataset, and do you need all of it to answer your research question or create your data viz?

Get familiar with your dataset: the better you understand what kind of data you have, the better you’ll be able to clean it and use it to create a visualization that answers the question you’re asking.
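If you prefer code to a spreadsheet, a quick first look with Python’s pandas library might go like this (the tiny DataFrame below is just a stand-in; with the real file you’d load it with pd.read_csv first):

```python
import pandas as pd

# A tiny stand-in for a real dataset; with the Met file you would instead use
# pd.read_csv("MetObjects_20211011.txt", sep="\t", low_memory=False).
df = pd.DataFrame({
    "Object ID": [1, 2, 3],
    "Department": ["Drawings", "Paintings", "Drawings"],
    "AccessionYear": [2001, 1999, 2015],
})

print(df.shape)          # how many (rows, columns)?
print(list(df.columns))  # what are the column names?
print(df.head())         # what do the first few rows look like?
```

`df.dtypes` and `df.isna().sum()` are two more one-liners worth running at this stage.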

A snapshot of the raw dataset in Excel.

My MetObjects dataset has almost 500,000 rows of data across 54 columns! That’s a ton of information, and I definitely don’t need all of it. What I really want to know is which department of the museum has grown the most (that is, acquired the most new art) since 2000. So now I have to clean my data and reduce it to a more manageable size that addresses that question.

Cleaning Your Data

What is data cleaning? Essentially, cleaning data is the process of putting your data into a consistent, uncorrupted form. Maybe the dates are written in a variety of formats: MM-DD-YYYY, DD-Month-YYYY, MM-DD-YY, etc. In a clean dataset, all of the dates would be reformatted into a single format of your choosing. What matters is that data formats and types are consistent across your dataset.
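As a sketch of this idea, here’s a small Python helper (standard library only) that normalizes the example formats above into a single ISO format. The list of formats is an assumption you’d adapt to your own data:

```python
from datetime import datetime

def normalize_date(raw):
    """Return an ISO (YYYY-MM-DD) date string, or None if no format matches."""
    # The formats below match the examples above: MM-DD-YY, MM-DD-YYYY,
    # and DD-Month-YYYY. Two-digit years are tried first so a value like
    # "03-07-21" isn't misread as the year 21.
    for fmt in ("%m-%d-%y", "%m-%d-%Y", "%d-%B-%Y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # leave unparseable values for manual review

print(normalize_date("03-07-2021"))     # 2021-03-07
print(normalize_date("07-March-2021"))  # 2021-03-07
print(normalize_date("03-07-21"))       # 2021-03-07
```

Returning None instead of guessing means the odd values surface for you to inspect rather than silently becoming wrong dates.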

Tip: One thing I cannot emphasize enough: keep a record of every action you take while cleaning your data. That way, if you ever need to, you can explain exactly how you narrowed a huge dataset down to just the parts you needed for your visualization. Even if it’s as simple as a .txt file listing the changes you’ve made (see my example below), it’s an important part of getting your dataset ready for visualization.

An example of the changelog I keep when cleaning a dataset.

There are a couple of broad steps to cleaning your data:

  1. Remove irrelevant data: If you’re using US Census data to visualize the ages of homeowners in your state, you’ll only need a small fraction of the data you might have downloaded. So you can delete all of the columns (also known as variables) that don’t pertain to the visualization you want to create, and all of the rows (or observations) for states you’re not interested in.
  2. Find and Remove Duplicates and Errors: Sometimes, if data is combined from different sources, some information will be included twice. Or a dataset file might have been corrupted and some information missing. There are tools (like Fuzzy Lookup, OpenRefine, or using the filtering features in programs like Excel or Google Sheets) that can help you identify if your data has duplicates or missing data.
  3. Define Data Types: If you’ve ever used a spreadsheet program, you might know that it sometimes tries to guess what kind of data you’re entering, like when you type a date into a cell and watch it transform into a number that makes absolutely no sense to you. Sometimes you have to force these programs to read your data in a specific way by explicitly telling them to interpret it as a date, number, text, etc.
I’ve never laughed harder than I laughed at this tweet. (Source: PleaseBeGneiss @ Twitter)

  4. Check for Consistency: You want to make sure your data has been entered consistently, and if it hasn’t, to put it into a consistent format before you begin working on your visualization. This means, for example, making sure all dates are entered the same way, all addresses are formatted identically, and all names appear as LASTNAME, Firstname (or however you decide they should appear). The point is to be consistent in how your data is entered and how it looks.
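The four steps above can be sketched in Python with pandas. The toy dataset and column names here are illustrative, not from any real file:

```python
import pandas as pd

# A toy dataset exhibiting the problems the four steps address.
df = pd.DataFrame({
    "Object ID": [1, 2, 2, 3],
    "State": ["WI", "WI", "WI", "IL"],
    "Owner Age": ["34", "51", "51", "29"],
    "Notes": ["a", "b", "b", "c"],
})

# 1. Remove irrelevant data: drop unneeded columns and out-of-scope rows.
df = df.drop(columns=["Notes"])
df = df[df["State"] == "WI"]

# 2. Find and remove duplicates.
df = df.drop_duplicates()

# 3. Define data types: force a column to be read as numbers.
df["Owner Age"] = pd.to_numeric(df["Owner Age"], errors="coerce")

# 4. Check for consistency (e.g., make sure all state codes are uppercase).
df["State"] = df["State"].str.upper()

print(df)
```

Each step is one line here, but in a real project you’d note each one in your changelog as you go.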

Cleaning My Met Data

Importing a dataset into Excel
Importing a dataset into Google Sheets.

The first step is to import the .txt file I downloaded from the Met’s GitHub site into a spreadsheet program. I first tried to import it into Google Sheets, but received an error message that my file was too big. So I imported it into Excel instead. One thing you want to make sure of when importing a dataset is to specify the delimiter if you can. The delimiter is the character used to represent the structure of a plain-text dataset (comma-separated, tab-separated, etc.): every comma (or tab) signals a new cell in the row, and every new line signals a new row. If you don’t know the delimiter, both Google Sheets and Excel are pretty good at detecting it automatically.

But if things look wrong after you’ve imported the data — all the content is in a single column, for example — just delete that file and import it again, choosing a different delimiter this time.
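In pandas, the equivalent of choosing a delimiter at import is the sep argument to read_csv. Here’s a minimal sketch using an in-memory tab-separated snippet (I’m assuming a tab-separated layout; swap in sep="," for a comma-separated file):

```python
import io
import pandas as pd

# A tab-separated snippet standing in for the downloaded .txt file;
# with the real file you would pass the path instead of a StringIO buffer.
raw = "Object ID\tDepartment\tAccessionYear\n1\tDrawings\t2001\n2\tPaintings\t1999\n"

# sep="\t" tells pandas the delimiter explicitly instead of letting it guess.
df = pd.read_csv(io.StringIO(raw), sep="\t")
print(df.shape)  # (2, 3)
```

If everything lands in a single column, the fix is the same as in a spreadsheet: re-import with a different sep value.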

Now that it’s imported, it’s time for me to explore it a bit. I can see that the column names are very descriptive and obvious, which is great because I won’t need a guide or reference to understand what each column of data represents (see the raw data image at the beginning of this post).

Looking over the dataset, I can identify some columns that are going to be the most important for me to use:

  • Object ID: This is the unique identifier for each piece of art contained in the dataset.
  • Department: This is one half of what I’m primarily interested in, the department each art object belongs to.
  • AccessionYear: This is the second half, the year each art object was acquired.

There’s a lot more data in this dataset though, so before I make any final decisions on what to filter out, I take a quick look through the other columns to see if there’s anything else I might want to include in the final visualization. A lot of the data is incomplete (“Artist Display Name,” “Culture,” “Artist Gender,” etc.), and some of it seems odd, as if it was imported incorrectly or includes multiple pieces of data in a single cell (a big sign that the data isn’t clean).

See the highlighted row below? The pipe character (|) seems to be used to separate multiple artists/contributors and their biographical data. If I were interested in the artist data, I might take steps to clean up those rows, separating the multiple artists into rows of their own. But that data doesn’t pertain to my question, and I’m going to delete those columns anyway.
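For readers who do want to clean rows like that, one possible pandas approach is to split on the pipe and then “explode” so each artist gets a row of their own (the sample data below is made up):

```python
import pandas as pd

# Hypothetical rows with multiple artists packed into one cell.
df = pd.DataFrame({
    "Object ID": [10, 11],
    "Artist Display Name": ["Artist A|Artist B", "Artist C"],
})

# Split each cell on the pipe, then give every artist their own row.
df["Artist Display Name"] = df["Artist Display Name"].str.split("|")
df = df.explode("Artist Display Name")
print(len(df))  # 3
```

Note that exploding duplicates the Object ID across the new rows, which is usually what you want for per-artist analysis but worth recording in your changelog.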

I did find some other interesting information in my perusal of the columns, however. Three variables focus on the date of each item: “Object Date,” “Object Begin Date,” and “Object End Date.” I’ve decided to keep the “Object End Date” column in addition to the ones above, because I think it might add another layer of interesting data to my visualization, letting my audience see the breakdown of how art from different periods was added to the collection. I chose the end date over the other two columns because “Object Date” includes estimated dates (“ca. 1880”) and spans (“1760–1800”), which might unnecessarily complicate my final visualization, and because I feel that “Object Begin Date” wouldn’t represent the final form of an object (as tastes, plans, etc., change over time).

Now my dataset contains just four columns instead of the original 54: Object ID, Department, AccessionYear, and Object End Date. And each column is directly related to the question I want to ask and how I want to visualize that final dataset. (Don’t forget to note the changes in your changelog!)

Next, I need to address the chronological element of my question. So I sort my data by the AccessionYear column and then use the filter function to deselect every year from 2000 to 2021, leaving visible only the rows I want to remove: the years before 2000 and any rows with no data in this column (blanks). I can then use CTRL+A and DEL to delete everything with an acquisition date before 2000. If I go back to the filter afterwards, only the options for 2000 to 2021 should remain, and only that data should be present.
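The same filter-and-delete routine can be expressed in pandas in a single line. This sketch assumes AccessionYear has been read as a number; between() treats blanks (NaN) as out of range, so they’re dropped too:

```python
import pandas as pd

# Toy rows standing in for the real dataset.
df = pd.DataFrame({
    "Object ID": [1, 2, 3, 4],
    "AccessionYear": [1999, 2001, None, 2021],
})

# Keep only rows acquired between 2000 and 2021, inclusive;
# rows with a blank AccessionYear are excluded as well.
df = df[df["AccessionYear"].between(2000, 2021)]
print(df["Object ID"].tolist())  # [2, 4]
```

Unlike the spreadsheet route, this keeps the rows you want rather than deleting the rows you don’t, but the result is the same.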

Now I can use the filter function to get a quick look at the remaining data in each column, and thankfully, it looks pretty clean. All the dates are in the same format, and each object is linked to a single department. But in the “Object End Date” column, some of the dates are entered as 2089, 2065, 2032, etc. These could be mistakes made when the data was entered, or there could be a really good reason, one I’m not aware of, for some pieces of art to have end dates in the future. But they look like bad data or unexplainable outliers to me, and there are only about 130 of them. So I’ve made the decision to delete them, since removing 130 observations from a dataset that now numbers around 70,000 rows shouldn’t skew my final results too much.
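In pandas terms, dropping those future end dates might look like this (toy data; the 2021 cutoff matches the snapshot date of the dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Object ID": [1, 2, 3],
    "Object End Date": [1880, 2089, 1999],
})

# Drop the unexplainable future dates rather than guessing at corrections.
df = df[df["Object End Date"] <= 2021]
print(len(df))  # 2
```

Deleting rather than "fixing" outliers is the safer call when you can't verify the correct value, and it's exactly the kind of decision to record in the changelog.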

So after all this, my 54-column, almost-500,000-row dataset has been reduced to 4 columns and just over 70,000 rows of the data I’m most interested in. That’s a much more manageable amount of data!

The final step is to save this copy of the dataset, indicating that it’s the clean version. I usually use the same file name and add _clean to the end so I can keep the different versions straight: MetObjects_20211011_clean.txt.
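If you’re scripting, Python’s pathlib makes it easy to build the _clean filename from the original (the commented to_csv line shows how you’d then write the file out with pandas):

```python
from pathlib import Path

original = Path("MetObjects_20211011.txt")

# Insert "_clean" before the extension to keep the versions straight.
clean_name = original.with_name(f"{original.stem}_clean{original.suffix}")
print(clean_name)  # MetObjects_20211011_clean.txt

# With pandas you would then write the cleaned frame, e.g.:
# df.to_csv(clean_name, sep="\t", index=False)
```

Keeping the date and the _clean suffix in one name means the file documents its own provenance.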

I’m ready to begin working with my new data visualization tool, and I think it’s one you’ll really find useful. So make sure to check out Part 2 next week!

Downloads

Use this link to download the files related to this post. Included in the zip are:

  • MetObjects_20211011.txt: the original dataset downloaded from the Met’s GitHub repository
  • MetObjects_20211011.csv: the dataset imported into Excel and saved as a comma-separated document
  • MetObjects_20211011_clean: the cleaned dataset, ready for use in visualization software
  • MetObjects_20211011_changelog: the record of all the changes made to the original dataset’s .csv file
