Metropolitan's Messy Art

Metropolitan Museum of Art, Courtesy of GetYourGuide.com

Based in New York City, the Metropolitan Museum of Art (“the Met”) is one of the world’s largest and finest museums, showcasing pieces and artifacts from all over the world. The museum also provides datasets describing more than 470,000 artworks in its collection for unrestricted commercial and non-commercial use. You can find those datasets here: https://github.com/metmuseum/openaccess

But what happens when this data is messy? Records for historical artifacts are often inaccurate or incomplete: where a piece actually came from, what period it belongs to, and so on. This dataset has missing values, inconsistent entries, patchy documentation, possible duplicates, and columns that mix text and numeric data. So how might one clean that up?

The first step is to load in the data and see what rows and columns you have.
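A minimal sketch of that first step with pandas, assuming the CSV from the repository above has been downloaded and saved locally as MetObjects.csv:

import pandas as pd

# Load the Met's open access data; low_memory=False avoids mixed-type warnings
# on the text-heavy columns.
df = pd.read_csv("MetObjects.csv", low_memory=False)

print(df.shape)              # (number of rows, number of columns)
print(df.columns.tolist())   # the 44 column names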

Like the source stated, the Met really does have data on over 470,000 pieces of artwork. To be exact, there are 491,263 pieces of artwork described by 44 different columns. Just to show what we might be working with, I’ve pulled out a single piece of artwork so we can understand what the columns are and where there’s actually data.
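One way to pull that record out (a sketch; the index 24 is simply the row that happens to be shown below, as the Series name at the end indicates):

# df loaded as in the previous snippet
import pandas as pd

pd.set_option("display.max_rows", None)   # show every field of the Series, not a truncated view
print(df.iloc[24])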

Object Number 16.74.34
Is Highlight False
Is Public Domain False
Object ID 25
Department American Decorative Arts
Object Name Centavos
Title Coin, 10 Centavos
Culture Mexican
Period NaN
Dynasty NaN
Reign NaN
Portfolio NaN
Artist Role NaN
Artist Prefix NaN
Artist Display Name NaN
Artist Display Bio NaN
Artist Suffix NaN
Artist Alpha Sort NaN
Artist Nationality NaN
Artist Begin Date NaN
Artist End Date NaN
Object Date 1860–70
Object Begin Date 1860
Object End Date 1870
Medium Nickel
Dimensions Diam. 3/4 in. (1.9 cm)
Credit Line Gift of Mrs. Russell Sage, 1916
Geography Type Made in
City NaN
State NaN
County NaN
Country Mexico
Region NaN
Subregion NaN
Locale NaN
Locus NaN
Excavation NaN
River NaN
Classification Metal
Rights and Reproduction NaN
Link Resource http://www.metmuseum.org/art/collection/search/25
Metadata Date 2/11/2019 8:00:01 AM
Repository Metropolitan Museum of Art, New York, NY
Tags NaN
Name: 24, dtype: object

You can see that there are tons of NaN cells in just one single row. It might be smart to run a value_counts() on a few of these columns to see if there’s any data in them at all. When you look into the data, you’ll see that columns like “Rights and Reproduction” and “Link Resource” hold the same kind of data for each piece of art, while others, such as “Tags,” have next to nothing in them, so it may be easier to simply drop the column rather than try to clean it up. This part of the data-cleaning process is important in that it narrows down the amount of data you have to sift through, which can save you loads of time in the future.
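A sketch of that kind of check; the columns dropped at the end are just the ones called out above, and which ones you actually drop will depend on what you find:

# How many non-missing and how many distinct values does each column have?
summary = pd.DataFrame({
    "non_null": df.notna().sum(),
    "unique": df.nunique(),
})
print(summary.sort_values("non_null"))

# Peek at a sparse column, counting NaN as its own bucket
print(df["Tags"].value_counts(dropna=False).head())

# Drop columns that turn out to add little
df = df.drop(columns=["Rights and Reproduction", "Tags"])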

The next part of the process involves looking for duplicates. One place to look for these is the ‘Object Number’ column. Removing these duplicates will be beneficial for future analysis and is a great way to shrink your data down a bit more.
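In pandas, that might look like this (a sketch that keeps the first occurrence of each object number):

# Rows sharing an Object Number are treated as duplicates; keep the first one
df = df.drop_duplicates(subset="Object Number", keep="first")
print(len(df))   # rows remaining after the drop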

Of the original 491,263 rows, dropping duplicates removed 480,864, leaving us with 10,399 rows of data.

You’ll notice that there are still quite a few NaN (blank/empty) values in this dataset. At this point, it’s worth asking whether even a single row of data is completely filled in.
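One quick way to answer that (a sketch):

# Rows with no missing values at all
complete_rows = df.dropna()
print(len(complete_rows))             # how many fully populated rows exist
print(df.notna().all(axis=1).any())   # True if at least one such row exists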

In order to analyze this data, it’s important to gain a better understanding of what’s in it and how values are written out. One option is to pull out specific columns and run a value_counts() to understand what’s in them.
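A sketch of that, using the ‘Dynasty’ column whose output appears below:

# Tally how often each distinct value appears in a single column
print(df["Dynasty"].value_counts())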

For example, pulling out ‘Dynasty’ presents you with this:

Dynasty 7–10 (?) 92
Dynasty 18 78
Dynasty 21 26
Dynasty 19–20 19
Dynasty 19 13
Dynasty 21–26 (?) 9
Dynasty 6 7
Dynasty 12–18 7
Dynasty 8–11 7
Dynasty 26–30 6
Dynasty 20 6
Dynasty 27–30 5
Dynasty 30 and later 5
Dynasty 21–25 5
Dynasty 21 (?) 5
Dynasty 12–17 4
Dynasty 26 4
Dynasty 30? 4
Dynasty 6, first half 4
Dynasty 25 4
Dynasty 9–12 3
Dynasty 4 3
Dynasty 25–26 3
Dynasty 7–10 2
Ptolemaic Dynasty 2
Dynasty 11 2
Dynasty 27 2
Dynasty 20 (?) 2
Dynasty 12 2
Dynasty 17 2
Dynasty 30 2
Dynasty 26–29 2
Dynasty 5 2
Dynasty 21? 1
Dynasty 12, mid 1
Dynasty 8–9 1
Dynasty 19, early 1
Dynasty 25? 1
Dynasty 12, late –13 1
Dynasty 6–8 1
Dynasty 11 (?) 1
Dynasty 26 or later 1
Dynasty 12, late–early 13 1
Dynasty 11–17 1
9 or later 1
Dynasty 18–20 1
Dynasty 26–Ptolemaic Period 1
Dynasty 21–24 1

Dynasty labels are useful for the many artifacts, especially Egyptian and Asian pieces, that are already catalogued by dynasty; however, you can see that this column is full of question marks and uneven, overlapping ranges, so it may not be the most useful column to focus on when you begin analyzing things.

You’ll notice that a lot of this data isn’t numerical, which can make it a bit tricky to plot. You may also notice that quite a few columns have cells that look like this:

American 1113
French 482
Japanese 402
Italian 331
French|French 246
Austrian 224
British 219
French|French|French 214
German 186
American|American 168
Italian|Italian 141
Dutch 81
British|British 77
Netherlandish 66
Italian|Italian|Italian 56
British|British|British 43
Chinese 42
Netherlandish|Netherlandish 37
American|American|American 34
French|Italian 30
American, born Germany 29
Spanish 28
Netherlandish|Netherlandish|Netherlandish 26
German|German 22
French|French|French|French 21
Netherlandish|Netherlandish|Netherlandish|Netherlandish 20
Italian|French 20
British, Scottish 18
Scottish 17
Hungarian 16
...
British|Irish|British 1
German|Swiss 1
French|Italian|British|Italian|Italian|Italian 1
Irish|American|American|American 1
Scottish|British, Scottish|British|British 1
British, Scottish|British|British|American|British, Scottish 1
British, born Sweden|British 1
British, born France|British, Scottish|British 1
French|French|Russian|Russian 1
Scottish|Dutch|German|German (?) 1
British, born France 1
American|Swiss 1
American, born Germany|American 1
Italian|British 1
German|Swiss|German|German 1
Mexican|Mexican|Mexican|Mexican|Mexican 1
French|British|British|British, Scottish|French|French|French|French|French|French|French|French 1
Italian|Italian|French 1
British|American|Scottish 1
American, born France|American 1
Danish|British|British?|British|British|British|British 1
French|Scottish 1
Danish|Danish 1
Scottish|British|German 1
Irish|German|British|British|British 1
Netherlandish|Netherlandish|Dutch 1
Central European 1
Italian|Italian|Italian|Italian|Italian|Italian|Italian 1
Scottish|British|Italian 1
German|German|Dutch 1
Name: Artist Nationality, Length: 397, dtype: int64

This is the output of the “Artist Nationality” column, where many values contain internal duplicates, question marks (“?”), or pipes (“|”). You might guess that this will also make the column difficult to plot or work with, and you’d be correct.
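One possible way to tidy that column, offered only as a sketch; the tidy_nationality helper is hypothetical and not part of the dataset or any library:

import re
# assumes pandas is imported as pd and df is loaded as in the earlier snippets

def tidy_nationality(value):
    # Split a pipe-delimited string, strip "?" and "()" noise, and drop internal repeats
    if pd.isna(value):
        return value
    parts = [re.sub(r"[?()]", "", part).strip() for part in str(value).split("|")]
    deduped = list(dict.fromkeys(part for part in parts if part))  # keep order, drop repeats
    return "|".join(deduped)

df["Artist Nationality"] = df["Artist Nationality"].apply(tidy_nationality)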

Now that you’ve cleaned the data up a bit, let’s play with the columns some more to determine which ones are the cleanest.
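A simple way to rank the columns by how complete they are (a sketch):

# Percentage of missing values per column; the "cleanest" columns float to the top
missing_pct = (df.isna().mean() * 100).sort_values()
print(missing_pct.head(10))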

You may find that not a single column is perfectly clean, and that many functions simply don’t work when there is so much missing data. Don’t fear! There are many neat ways to fill in or reshape missing data, but for now, let’s do something simple with the cleanest column, the credit line, and find out who donated the most pieces using a value_counts().
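For example, something like this (assuming the ‘Credit Line’ column shown earlier):

# The most frequent credit lines roughly correspond to the most prolific donors
print(df["Credit Line"].value_counts().head(10))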

Hopefully, this introduction to cleaning a very large and messy dataset will set you on your way with your own data. There were lots of roadblocks and a lot of frustration in discovering how many functions don’t work when you are missing so much data. If this has taught you anything, hopefully it’s that you really do love clean data. If you have any questions, comments, suggestions, or concerns, feel free to reach out.
