Wikipedia API Dog Breed Comparison

For this assignment, I was interested in the differences between two very different but common dog breeds in America. The first breed I wanted to look into was the American Pit Bull Terrier, which has become one of the most controversial pets to own due to breed-specific legislation (BSL) that restricts or bans Pit Bull ownership in certain jurisdictions. On the other end of the spectrum, I wanted to compare the Pit Bull with a socially accepted and common household breed, the Labrador Retriever. My initial questions for this project are: which dog breed receives more attention online, as measured by revision history, and what kinds of trends can we find in the comparison?

To begin my Wikipedia API scrape for this EDA, we must import all the packages that allow us to collect and curate the data. We will use json, BeautifulSoup, pandas, and datetime, along with seaborn and matplotlib for the data visualizations.

After we are finished with our imports, we move into the scraping process by building a URL query that points to where the API server is, then creating an empty dictionary to hold the properties of the page: the page title and the revision history. We will grab 500 revisions for each of the two dog breeds to compare, and we will set the “rvdir” parameter to “newer” so that the results list the oldest revisions first and then move forward in time across the 500 revisions.

Next we will create “json_response”, which sends the request to the URL with our parameters and stores the JSON that comes back.
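The query and request steps described above can be sketched roughly as follows. This is a minimal illustration, not the original scraping code: it uses the standard-library urllib, and the function names build_query and fetch_revisions, the User-Agent string, and the exact list of “rvprop” fields are assumptions made for the example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"  # English Wikipedia API endpoint


def build_query(title, limit=500):
    """Return MediaWiki query parameters for one page's revision history."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",  # makes "pages" a list instead of an id-keyed dict
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|size|comment",  # assumed fields of interest
        "rvlimit": limit,
        "rvdir": "newer",  # oldest revisions first
    }


def fetch_revisions(title, limit=500):
    """Send the request and return the page's revisions as a list of dicts."""
    url = API_URL + "?" + urllib.parse.urlencode(build_query(title, limit))
    request = urllib.request.Request(
        url, headers={"User-Agent": "dog-breed-eda-example/0.1"}  # hypothetical
    )
    with urllib.request.urlopen(request) as response:
        json_response = json.load(response)
    return json_response["query"]["pages"][0].get("revisions", [])
```

Calling fetch_revisions with an article title (for example "Labrador Retriever"; the exact page titles used in the original are not shown, so this is an assumption) should return up to 500 revision records, oldest first.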

This gives us a quick look at what our data looks like for the Pit Bull, and we can see all the columns and categories of data that we could analyze from here.

From here we can check our dataset by looking at the new DataFrame built from “revisions” to get a visual sense of how our dataset is structured.
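As a rough sketch of this step, the list of revision records can be turned into a pandas DataFrame. The three records below are hand-written placeholder data, not real Wikipedia revisions, and the helper name revisions_to_frame and the extra size_change column are assumptions added for illustration.

```python
import pandas as pd

# Placeholder records standing in for the API's "revisions" list.
sample_revisions = [
    {"revid": 1, "timestamp": "2004-03-01T12:00:00Z", "user": "A", "size": 1200},
    {"revid": 2, "timestamp": "2004-03-05T08:30:00Z", "user": "B", "size": 1500},
    {"revid": 3, "timestamp": "2004-03-09T17:45:00Z", "user": "C", "size": 900},
]


def revisions_to_frame(revisions):
    """Build a DataFrame with parsed timestamps and per-revision size change."""
    df = pd.DataFrame(revisions)
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df = df.sort_values("timestamp").reset_index(drop=True)
    # Bytes added (positive) or removed (negative) relative to the prior revision;
    # the first row has no prior revision, so its value is NaN.
    df["size_change"] = df["size"].diff()
    return df


df = revisions_to_frame(sample_revisions)
print(df.head())
```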

We then repeat the same process to collect the data for the Labrador Retriever, using the same code and parameters as above so that both scrapes produce uniform, comparable data.

Scraping code sourced from: Brian C. Keegan, Ph.D., Assistant Professor, Department of Information Science, University of Colorado Boulder

The next step in the EDA process is to plot the revision history for both pages so we can compare them and potentially find trends across the two breeds. To do this, we will use a simple line graph of the revision history from the API data for each page, with the y-axis showing the page size in bytes so that we can see how much information has been added or removed.
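A line graph like the one described can be sketched as below. This is an illustrative version using plain matplotlib; the helper name plot_page_size, the axis labels, and the four placeholder data points are assumptions, and the original plotting code may have differed.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_page_size(df, title):
    """Draw page size in bytes against revision timestamp as a line graph."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(df["timestamp"], df["size"])
    ax.set_xlabel("Revision date")
    ax.set_ylabel("Page size (bytes)")
    ax.set_title(title)
    fig.autofmt_xdate()  # tilt the date labels so they do not overlap
    return ax


# Placeholder data, not real revision history.
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(["2005-01-01", "2006-01-01", "2007-01-01", "2008-01-01"]),
    "size": [2000, 5500, 4100, 7300],
})
ax = plot_page_size(demo, "Pit Bull (placeholder data)")
```

A drop in the line marks a revision that removed content, and a rise marks one that added it.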

Pit Bull:

Labrador Retriever:

When comparing the two plots, it’s evident that there are higher rates of information removal/deletion on the pit bull page. This makes me question whether the breed controversy has an impact, given that between the 1980s and 2017 there were a reported 389 deaths from pit bull attacks. The breed is less favored in traditional media and usually carries a negative stigma. The labrador retriever page seems to have a faster rate of revision growth than the pit bull page.

The next section of my EDA process led me to investigate additional data that might help explain this difference in Wikipedia revision activity. While researching on Kaggle, I found a dataset with over 9,000 animal bite reports across the United States, and I wanted to see where my two chosen dog breeds stack up. To begin, I loaded the .csv file, cleaned up the empty data, and printed all unique species recorded in the dataset.
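The loading and cleaning step might look something like the sketch below. Because the actual file and its column names are not shown here, the column names SpeciesIDDesc and BreedIDDesc and every row of the in-memory CSV are assumptions used only to make the example runnable.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for the Kaggle .csv file (made-up rows).
sample_csv = io.StringIO(
    "bite_date,SpeciesIDDesc,BreedIDDesc\n"
    "2015-01-02,DOG,PIT BULL\n"
    "2015-01-09,CAT,\n"
    "2015-02-11,,\n"
    "2015-03-20,DOG,LABRADOR RETRIEVER\n"
    "2015-04-01,BAT,\n"
)

bites = pd.read_csv(sample_csv)

# Drop reports where the species was never recorded.
bites = bites.dropna(subset=["SpeciesIDDesc"])

# List every distinct species that appears in the remaining reports.
species = bites["SpeciesIDDesc"].unique()
print(species)
```

With the real file, the StringIO object would be replaced by the path to the downloaded .csv.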

It’s interesting to see that nine different animal species have been reported biting people, although the large majority of bites in this dataset come from dogs. The next step was to compare dog breeds by their total number of reported bites, focusing on the top 10 breeds.
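One way to rank breeds by reported bites is to filter to dogs and count rows per breed. The sketch below uses a small made-up DataFrame, and the column names SpeciesIDDesc and BreedIDDesc are assumptions about the dataset rather than confirmed names.

```python
import pandas as pd

# Made-up bite reports for illustration only.
bites = pd.DataFrame({
    "SpeciesIDDesc": ["DOG", "DOG", "DOG", "CAT", "DOG", "DOG"],
    "BreedIDDesc": [
        "PIT BULL", "LABRADOR RETRIEVER", "PIT BULL",
        None, "GERMAN SHEPHERD", "PIT BULL",
    ],
})

# Keep only dog reports, then count how many reports each breed has.
dogs = bites[bites["SpeciesIDDesc"] == "DOG"]
top_breeds = dogs["BreedIDDesc"].value_counts().head(10)
print(top_breeds)
```

value_counts sorts from most to least frequent and skips missing breed labels, so head(10) gives the ten most reported breeds.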

It is no surprise that the pit bull is the most frequently reported dog breed, although I did find it fascinating that the labrador retriever ranked third on the list. My prior hypothesis that pit bull breed violence is a catalyst for the online revision and information trends shown is difficult to support with this chart, since the labrador is also one of the most frequently reported dog breeds. My next step in this comparison is to superimpose the Wikipedia revision trends for both dog breed pages to get a better visual representation of the different trends previously shown.

(Purple = Pit Bull) (Blue = Labrador Retriever)
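Superimposing the two revision histories can be sketched by drawing both lines on a single set of axes. The helper name overlay_page_sizes and the placeholder data are assumptions for illustration; only the purple and blue color assignment comes from the legend above.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def overlay_page_sizes(frames, colors):
    """Plot page size over time for several pages on one shared axis."""
    fig, ax = plt.subplots(figsize=(10, 4))
    for label, df in frames.items():
        ax.plot(df["timestamp"], df["size"], label=label, color=colors[label])
    ax.set_xlabel("Revision date")
    ax.set_ylabel("Page size (bytes)")
    ax.legend()
    return ax


# Placeholder data, not real revision history.
dates = pd.to_datetime(["2006-01-01", "2008-01-01", "2010-01-01", "2012-01-01"])
pit_bull = pd.DataFrame({"timestamp": dates, "size": [3000, 9000, 4000, 5000]})
labrador = pd.DataFrame({"timestamp": dates, "size": [2500, 4000, 6000, 8000]})

ax = overlay_page_sizes(
    {"Pit Bull": pit_bull, "Labrador Retriever": labrador},
    {"Pit Bull": "purple", "Labrador Retriever": "blue"},
)
```

Sharing one axis keeps both pages on the same byte scale, which makes differences in removal patterns easier to compare than two separate charts.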

From the final chart above, you can see that the pit bull Wikipedia page has a consistent pattern of content being removed and then re-added, whereas the labrador page initially exhibits this type of removal but then settles into a steady increase of information, without removals, after 2010. Something interesting from this chart that I did not notice at first is that shortly before 2010 there was a sharp spike in pit bull page information, followed quickly by the removal of a large section of data, after which the page never reaches the same level of information again. The page then exhibits similar patterns of data removal that seem to halt its overall progress in terms of information growth.

There are a couple of things that I would like to do with the project in the future, as this EDA has raised a few more questions that I think would be interesting to investigate. What characteristics of dog breeds generate negative connotations? How does the popularity of a breed grow across specific geographic and demographic areas? How heavily does the stigmatization of a dog breed impact the types of Wikipedia interactions online?

Overall, I believe that this type of Wikipedia API scrape has the potential to unlock a lot of really cool connections and online trends in revision history and the overall popularity of content online. There is still a lot that I would like to do with this project, as the subject became more and more interesting as I worked on it. I still have many questions that I want to answer, although I think that this is a pretty good foundation for future EDA work. It is always interesting to see your initial predictions fail to match the reality of the situation once it is uncovered, and I think that is why I am so curious to understand what about these two breeds generates the online traffic that we have seen.
