Assignment 4

Heather Kim
Intro to Machine Learning
2 min read · Sep 24, 2019

Link to dataset: https://www.kaggle.com/new-york-city/nyc-dog-names

For this assignment, I chose the NYC dog names dataset, which lists the names of dogs registered with the NYC Department of Health and counts how many dogs share each name. All dogs in NYC are required to have licenses, so I thought this would be a fairly accurate representation of the names of dogs in New York.

The data is in CSV format with 2 columns (name, and count of animals with that name) and 16,220 rows, excluding the header row.

Some values for name and count are missing, as indicated by “?”.
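As a rough sketch of how the file could be read with the “?” placeholders treated as missing values, here is a small Python example. The column headers `Name` and `Count` and the sample rows are made up for illustration; the real file's headers and values may differ.

```python
import csv
import io

# Made-up sample rows standing in for the real file.
# The headers "Name" and "Count" are assumptions, not the dataset's actual headers.
SAMPLE_CSV = """Name,Count
BELLA,120
?,35
MAX,?
BEAR,90
"""


def load_rows(text_stream):
    """Parse CSV rows, converting the "?" placeholder to None."""
    rows = []
    for record in csv.DictReader(text_stream):
        name = record["Name"].strip()
        count = record["Count"].strip()
        rows.append({
            "name": None if name == "?" else name,
            "count": None if count == "?" else int(count),
        })
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))

# Keep only rows where both the name and the count are present.
complete = [r for r in rows if r["name"] is not None and r["count"] is not None]
print(len(rows), "rows read,", len(complete), "complete")
```

To try this on the actual download, the `io.StringIO(SAMPLE_CSV)` argument could be swapped for an opened file, after adjusting the column names to match.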

If I had to guess why this data was collected, I would say it helps the city keep track of licensed dogs, including those that share a name. It could also be useful for checking whether a dog is licensed. For potential dog owners, it would help in choosing a unique name, which matters if the dog is ever lost. For instance, it would be difficult to find a dog named “Bear” (a name shared by 272 other dogs) by calling out its name. For research purposes, the data can be used to study why people give their dogs certain names and whether the dog's breed has any influence on that choice.

As for data underrepresentation, dogs that are owned temporarily or are in foster care would not be in this dataset, as their names can change when they get a new owner.

Using machine learning, a model trained on this data could take a dog name as input and report how many dogs share it. For example, a digital form could ask what you want to name your dog, and when you type in a name and hit “Enter”, it would tell you how many dogs share that name. Potentially, it could also be used to look up the most popular name for a certain breed of dog.
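The name-lookup form described above can be sketched without any trained model, as a plain dictionary lookup. This is only an illustration: the function names `build_index` and `dogs_sharing_name` are hypothetical, and the sample counts are made up rather than taken from the dataset.

```python
def build_index(pairs):
    """Map each case-folded name to its total count."""
    index = {}
    for name, count in pairs:
        key = name.strip().casefold()
        # Sum counts in case the same name appears with different casing.
        index[key] = index.get(key, 0) + count
    return index


def dogs_sharing_name(index, name):
    """Return how many registered dogs have this name (0 if unseen)."""
    return index.get(name.strip().casefold(), 0)


# Made-up (name, count) pairs for illustration only.
SAMPLE = [("BELLA", 120), ("MAX", 95), ("Bear", 90)]
index = build_index(SAMPLE)

print(dogs_sharing_name(index, "bear"))
print(dogs_sharing_name(index, "Zorro"))
```

Case-folding the input means “bear”, “Bear”, and “BEAR” all match the same entry, which seems like the behavior a name-search form would want.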
