Metropolitan's Messy Art

Metropolitan Museum of Art, Courtesy of GetYourGuide.com

Based in New York City, the Metropolitan Museum of Art (“the Met”) is one of the world’s largest and finest museums, showcasing pieces and artifacts from all over the world. The museum also provides datasets describing more than 470,000 artworks in its collection for unrestricted commercial and non-commercial use. You can find those datasets here: https://github.com/metmuseum/openaccess

But what happens when this data is messy? Records for historical artifacts are often inaccurate or incomplete: where a piece actually came from, what period it belongs to, and so on. This dataset has missing values, inconsistent entries, patchy documentation, possible duplicates, and columns that mix text and numeric data. So how might one clean that up?

The first step is to load in the data and see what rows and columns you have.
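A minimal sketch of that first step with pandas, assuming the CSV from the repository above has been downloaded and saved locally as MetObjects.csv:

import pandas as pd

# Load the Met's open access data; low_memory=False avoids mixed-type warnings
# on the text-heavy columns.
df = pd.read_csv("MetObjects.csv", low_memory=False)

print(df.shape)              # (number of rows, number of columns)
print(df.columns.tolist())   # the 44 column names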

Like the source stated, the Met really does have data on over 470,000 pieces of artwork. To be exact, there are 491,263 pieces of artwork described by 44 different columns. Just to show what we might be working with, I’ve pulled out a single piece of artwork so we can understand what the columns are and where there’s actually data.
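One way to pull that record out (a sketch; the index 24 is simply the row that happens to be shown below, as the Series name at the end indicates):

# df loaded as in the previous snippet
import pandas as pd

pd.set_option("display.max_rows", None)   # show every field of the Series, not a truncated view
print(df.iloc[24])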

Object Number 16.74.34
Is Highlight False
Is Public Domain False
Object ID 25
Department American Decorative Arts
Object Name Centavos
Title Coin, 10 Centavos
Culture Mexican
Period NaN
Dynasty NaN
Reign NaN
Portfolio NaN
Artist Role NaN
Artist Prefix NaN
Artist Display Name NaN
Artist Display Bio NaN
Artist Suffix NaN
Artist Alpha Sort NaN
Artist Nationality NaN
Artist Begin Date NaN
Artist End Date NaN
Object Date 1860–70
Object Begin Date 1860
Object End Date 1870
Medium Nickel
Dimensions Diam. 3/4 in. (1.9 cm)
Credit Line Gift of Mrs. Russell Sage, 1916
Geography Type Made in
City NaN
State NaN
County NaN
Country Mexico
Region NaN
Subregion NaN
Locale NaN
Locus NaN
Excavation NaN
River NaN
Classification Metal
Rights and Reproduction NaN
Link Resource http://www.metmuseum.org/art/collection/search/25
Metadata Date 2/11/2019 8:00:01 AM
Repository Metropolitan Museum of Art, New York, NY
Tags NaN
Name: 24, dtype: object

You can see that there are tons of NaN cells in just one single row. It might be smart to run a value_counts() on a few of these columns to see if there’s any data in them at all. When you look into the data, you’ll see that columns like “Rights and Reproduction” and “Link Resource” hold the same kind of data for each piece of art, while others, such as “Tags,” have next to nothing in them, so it may be easier to simply drop the column rather than try to clean it up. This part of the data-cleaning process is important in that it narrows down the amount of data you have to sift through, which can save you loads of time in the future.
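A sketch of that kind of check; the columns dropped at the end are just the ones called out above, and which ones you actually drop will depend on what you find:

# How many non-missing and how many distinct values does each column have?
summary = pd.DataFrame({
    "non_null": df.notna().sum(),
    "unique": df.nunique(),
})
print(summary.sort_values("non_null"))

# Peek at a sparse column, counting NaN as its own bucket
print(df["Tags"].value_counts(dropna=False).head())

# Drop columns that turn out to add little
df = df.drop(columns=["Rights and Reproduction", "Tags"])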

The next part of the process involves looking for duplicates. One place to look for these is the ‘Object Number’ column. Removing these duplicates will be beneficial for future analysis and is a great way to shrink your data down a bit more.
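In pandas, that might look like this (a sketch that keeps the first occurrence of each object number):

# Rows sharing an Object Number are treated as duplicates; keep the first one
df = df.drop_duplicates(subset="Object Number", keep="first")
print(len(df))   # rows remaining after the drop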

Of the original 491,263 rows, dropping duplicates removed 480,864, leaving us with 10,399 rows of data.

You’ll notice that there are still quite a few NaN (blank/empty) values in this dataset. At this point, it’s worth asking whether even a single row of data is completely filled in.
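One quick way to answer that (a sketch):

# Rows with no missing values at all
complete_rows = df.dropna()
print(len(complete_rows))             # how many fully populated rows exist
print(df.notna().all(axis=1).any())   # True if at least one such row exists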

In order to analyze this data, it’s important to gain a better understanding of what’s in it and how values are written out. One option is to pull out specific columns and run a value_counts() to understand what’s in them.
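A sketch of that, using the ‘Dynasty’ column whose output appears below:

# Tally how often each distinct value appears in a single column
print(df["Dynasty"].value_counts())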

For example, pulling out ‘Dynasty’ presents you with this:

Dynasty 7–10 (?) 92
Dynasty 18 78
Dynasty 21 26
Dynasty 19–20 19
Dynasty 19 13
Dynasty 21–26 (?) 9
Dynasty 6 7
Dynasty 12–18 7
Dynasty 8–11 7
Dynasty 26–30 6
Dynasty 20 6
Dynasty 27–30 5
Dynasty 30 and later 5
Dynasty 21–25 5
Dynasty 21 (?) 5
Dynasty 12–17 4
Dynasty 26 4
Dynasty 30? 4
Dynasty 6, first half 4
Dynasty 25 4
Dynasty 9–12 3
Dynasty 4 3
Dynasty 25–26 3
Dynasty 7–10 2
Ptolemaic Dynasty 2
Dynasty 11 2
Dynasty 27 2
Dynasty 20 (?) 2
Dynasty 12 2
Dynasty 17 2
Dynasty 30 2
Dynasty 26–29 2
Dynasty 5 2
Dynasty 21? 1
Dynasty 12, mid 1
Dynasty 8–9 1
Dynasty 19, early 1
Dynasty 25? 1
Dynasty 12, late –13 1
Dynasty 6–8 1
Dynasty 11 (?) 1
Dynasty 26 or later 1
Dynasty 12, late–early 13 1
Dynasty 11–17 1
9 or later 1
Dynasty 18–20 1
Dynasty 26–Ptolemaic Period 1
Dynasty 21–24 1

Dynasty labels are useful for the many artifacts, especially Egyptian and Asian pieces, that are already catalogued by dynasty; however, you can see that this column is full of question marks and uneven, overlapping ranges, so it may not be the most useful column to focus on when you begin analyzing things.

You’ll notice that a lot of this data isn’t numerical, which can make it a bit tricky to plot. You may also notice that quite a few columns have cells that look like this:

American 1113
French 482
Japanese 402
Italian 331
French|French 246
Austrian 224
British 219
French|French|French 214
German 186
American|American 168
Italian|Italian 141
Dutch 81
British|British 77
Netherlandish 66
Italian|Italian|Italian 56
British|British|British 43
Chinese 42
Netherlandish|Netherlandish 37
American|American|American 34
French|Italian 30
American, born Germany 29
Spanish 28
Netherlandish|Netherlandish|Netherlandish 26
German|German 22
French|French|French|French 21
Netherlandish|Netherlandish|Netherlandish|Netherlandish 20
Italian|French 20
British, Scottish 18
Scottish 17
Hungarian 16
...
British|Irish|British 1
German|Swiss 1
French|Italian|British|Italian|Italian|Italian 1
Irish|American|American|American 1
Scottish|British, Scottish|British|British 1
British, Scottish|British|British|American|British, Scottish 1
British, born Sweden|British 1
British, born France|British, Scottish|British 1
French|French|Russian|Russian 1
Scottish|Dutch|German|German (?) 1
British, born France 1
American|Swiss 1
American, born Germany|American 1
Italian|British 1
German|Swiss|German|German 1
Mexican|Mexican|Mexican|Mexican|Mexican 1
French|British|British|British, Scottish|French|French|French|French|French|French|French|French 1
Italian|Italian|French 1
British|American|Scottish 1
American, born France|American 1
Danish|British|British?|British|British|British|British 1
French|Scottish 1
Danish|Danish 1
Scottish|British|German 1
Irish|German|British|British|British 1
Netherlandish|Netherlandish|Dutch 1
Central European 1
Italian|Italian|Italian|Italian|Italian|Italian|Italian 1
Scottish|British|Italian 1
German|German|Dutch 1
Name: Artist Nationality, Length: 397, dtype: int64

This is the output of the “Artist Nationality” column, where many values contain internal duplicates, question marks (“?”), or pipes (“|”). You might guess that this will also make the column difficult to plot or work with, and you’d be correct.
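One possible way to tidy that column, offered only as a sketch; the tidy_nationality helper is hypothetical and not part of the dataset or any library:

import re
# assumes pandas is imported as pd and df is loaded as in the earlier snippets

def tidy_nationality(value):
    # Split a pipe-delimited string, strip "?" and "()" noise, and drop internal repeats
    if pd.isna(value):
        return value
    parts = [re.sub(r"[?()]", "", part).strip() for part in str(value).split("|")]
    deduped = list(dict.fromkeys(part for part in parts if part))  # keep order, drop repeats
    return "|".join(deduped)

df["Artist Nationality"] = df["Artist Nationality"].apply(tidy_nationality)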

Now that you’ve cleaned the data up a bit, let’s play with the columns some more to determine which ones are the cleanest.
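A simple way to rank the columns by how complete they are (a sketch):

# Percentage of missing values per column; the "cleanest" columns float to the top
missing_pct = (df.isna().mean() * 100).sort_values()
print(missing_pct.head(10))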

You may find that not a single column is perfectly clean, and that many functions simply don’t work when there is so much missing data. Don’t fear! There are many neat ways to fill in or reshape missing data, but for now, let’s do something simple with the cleanest column, the credit line, and find out who donated the most pieces using a value_counts().
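For example, something like this (assuming the ‘Credit Line’ column shown earlier):

# The most frequent credit lines roughly correspond to the most prolific donors
print(df["Credit Line"].value_counts().head(10))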

Hopefully, this introduction to cleaning a very large and messy dataset will set you on your way with your own data. There were lots of roadblocks and a lot of frustration in discovering how many functions don’t work when you are missing so much data. If this has taught you anything, hopefully it’s that you really do love clean data. If you have any questions, comments, suggestions, or concerns, feel free to reach out.
