Scraping Wikipedia Tutorial - EDA: Nissan’s ‘Skyline’ GT-R vs. GT-R Popularity Match Up
This is a basic tutorial on how to gather and analyze data from Wikipedia. After covering the standard skills involved in scraping, I will perform exploratory data analysis (EDA) on the scraped data and explain each step. There are two main ways to gather data from a website: one requires more coding and parsing skill, and the other uses the site’s in-house API. Not every website has an API, so sometimes you are forced to scrape by inspecting the HTML source and then parsing that content into tabular data. This tutorial shows novice Python users how to access an API to gather data and perform exploratory data analysis on any Wikipedia page.
To start, identify a Wikipedia article or two that seem interesting to explore. You will then need to set up a request that calls Wikipedia’s API and has it return data. The setup involves pointing at the URL where the API is served, formatting the query parameters as a dictionary, naming the page whose content you want parsed, telling the API whether you want text, revisions, or counts, and asking for the data back in JSON format (or whichever format you prefer). Once you have specified all the required parameters in the call to the API, you can start to pick out the data you need from the parsed JSON.
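The request setup described above can be sketched as follows. This is only a minimal sketch: it assumes the English Wikipedia endpoint https://en.wikipedia.org/w/api.php and the MediaWiki "parse" action, and the helper names build_parse_request and fetch_json are made up for illustration rather than taken from any library.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint: the English Wikipedia MediaWiki API.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_request(page, prop="text"):
    """Return the full request URL for one page (hypothetical helper)."""
    params = {
        "action": "parse",  # ask the API to parse a page
        "page": page,       # article title
        "prop": prop,       # which piece of the page to return
        "format": "json",   # response format
    }
    return API_URL + "?" + urlencode(params)


def fetch_json(url):
    """Send the request and decode the JSON response (needs network access)."""
    request = Request(url, headers={"User-Agent": "wiki-eda-tutorial/0.1 (example)"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_parse_request("Nissan Skyline GT-R")
print(url)
# data = fetch_json(url)  # uncomment to actually call the API
```

Keeping the URL construction separate from the network call makes it easy to print and inspect the request before sending it.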
Since I was curious about the popularity of Nissan’s Skyline GT-R versus the new GT-R, I planned to use the two pages’ statistics to show how the popularity of the classic model has held up compared to the newest model. My hypothesis is that the classic Nissan Skyline GT-R page will show more Wikipedia user interest, measured by page views and revision content, than the page for the newer GT-R. I expected the classic Skyline to grow more popular over the years as the car’s history gets longer. Because the classic Skyline is reaching the age at which it can legally be imported, it can now reach US consumers, which keeps it a consistently popular topic despite its age.
For this tutorial and EDA task I used Wikipedia’s API to collect data from two pages: “Nissan Skyline GT-R”, a car from the late 90s, and the newer “Nissan GT-R”, released in 2007. The classic Skyline GT-R is an iconic Japanese vehicle that attracts car enthusiasts through its presence, its Hollywood appearances in films like Fast and Furious, and its overall legacy. The Skyline GT-R was only available overseas; the United States did not receive it because of smog and import regulations. In the late 2000s Nissan announced that the new GT-R model would come to the United States as well as overseas markets, which brought interest from car communities. A recent craze has emerged for importing early Skyline GT-Rs, and it is a very sought-after vehicle. With this background in mind, I wanted to explore how enthusiasts’ love for the classic Skyline GT-R has been retained over the years compared to interest in the more recent Nissan GT-R, using data gathered from Wikipedia’s API.
My process for EDA on these two similar but different models, the Nissan Skyline GT-R and the GT-R, was to first collect revision history and then page views. The revision data let me explore popularity trends in a data frame, using raw counts from each page’s revision history to evaluate overall interest in one model versus the other. The page view counts gave me a timeline of popularity for the two models based on users visiting each model’s page. After exploring the data frames and their contents I used plots to discover further trends and insights about the Skyline GT-R relative to the newer GT-R. Visualization was important in this EDA process because it supports the trends that show up in the raw counts.
I am performing this EDA because I wanted to know how the classic 90s Nissan Skyline GT-R has retained its popularity compared to the newest GT-R model, released in 2007. The first plot I made presents revision counts and how much data each revision added to the page. To get the revision history from both pages I switched my API ‘prop’ call from “text” to “revisions”. The plot clearly shows a positive trend, representing continuing interest and activity on the classic Nissan Skyline GT-R Wikipedia page. It was interesting to see the large spike in content between the 400th and 450th revisions. I then compared the two pages’ revision counts and data size in one visualization. From the revision history comparison I could see that the old Nissan Skyline GT-R page holds around 50,000 bytes of added data, very close to the amount on the new GT-R’s page. The comparison also showed that more data was added overall to the Nissan Skyline GT-R page, which suggests its knowledgeable editors contribute consistently. The new GT-R page has more revisions overall, but the amount of data on the two pages does not differ drastically. What I discovered was that the old Skyline GT-R page gained data at a faster rate per revision, which suggests it is just as popular as the new release, if not more so. This comparison plot of the revision timeline suggests my research question is worth pursuing.
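Switching from page text to revision history might look like the sketch below. It assumes the MediaWiki "query" action with prop set to "revisions", which may differ from the exact call used for the plots above. The helper names and the sample response are invented for illustration, and the sample values are not real Wikipedia data.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed English Wikipedia endpoint


def build_revisions_request(title, limit=50):
    """Build a query URL asking for revision metadata (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "revisions",              # revisions instead of page text
        "titles": title,
        "rvprop": "timestamp|user|size",  # fields returned for each revision
        "rvlimit": limit,
        "rvdir": "newer",                 # oldest revision first
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def extract_revisions(response):
    """Flatten the nested JSON into a list of revision dictionaries."""
    rows = []
    for page in response["query"]["pages"].values():
        for number, rev in enumerate(page.get("revisions", []), start=1):
            rows.append({
                "revision": number,
                "timestamp": rev["timestamp"],
                "user": rev.get("user"),
                "size": rev["size"],
            })
    return rows


# Made-up response in the shape the API returns; the values are not real data.
sample = {"query": {"pages": {"1": {"title": "Example", "revisions": [
    {"user": "A", "timestamp": "2003-01-24T00:00:00Z", "size": 1200},
    {"user": "B", "timestamp": "2003-02-01T00:00:00Z", "size": 1850},
]}}}}

rows = extract_revisions(sample)
growth = rows[-1]["size"] - rows[0]["size"]  # bytes gained between first and last revision
print(len(rows), growth)
```

Plotting the "size" column against the "revision" column gives the kind of growth curve discussed above. Long histories come back in batches, so collecting every revision means following the continuation values included in each response.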
Because of the immense amount of information on the classic Skyline GT-R page, I looked at its top 3 contributors. The top contributors were users “Willirennen” with 16 content additions, “Zunaid” with 14, and “Impreziv” with 12. Out of curiosity I also looked at how the page titles are displayed in each language Wikipedia offers. Since each page title is a car’s manufacturer and model name, it is the same in most languages. The spelling does not vary except in Japanese and Chinese (Zhongwen), where non-Latin characters are used. I continued to ponder: has the original Skyline GT-R retained user interest with unique updates over the years, or has the new GT-R taken the spotlight in revision and page view activity? The new Nissan GT-R page has more total revisions since the page was created. This could be because many users make small updates or simple formatting and grammatical fixes, which drastically increases the overall count. It may be that the Nissan Skyline GT-R page, with half the revisions of the newer page, received large updates from a few main users and relatively few small edits. Curious, I looked into the total number of editors. The counts of unique editors on the two pages are close, differing by only 488 users: the original Skyline GT-R page had 1,308 unique editors and the new GT-R page had 1,796. I was also able to gather simple metrics showing the first revision date of each page, which exposes the difference in time frames between the old Skyline GT-R and the new GT-R. The ‘Nissan Skyline GT-R’ article was first edited on 2003-01-24. The ‘Nissan GT-R’ article was first edited on 2007-02-19.
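Metrics like the top contributors, the number of unique editors, and the first edit date can be computed from a list of revision records. The sketch below assumes each record carries a "user" and an ISO 8601 "timestamp" field, and it ranks users by number of revisions, which may not match how the content additions above were counted. The function name and the sample records are placeholders, not real editors or dates.

```python
from collections import Counter


def summarize_revisions(revisions):
    """Return top contributors, unique editor count, and first edit date."""
    users = [rev["user"] for rev in revisions if rev.get("user")]
    counts = Counter(users)
    # ISO 8601 UTC timestamps in the same format sort chronologically as strings.
    first_edit = min(rev["timestamp"] for rev in revisions)
    return {
        "top_contributors": counts.most_common(3),
        "unique_editors": len(counts),
        "first_edit_date": first_edit[:10],  # keep only YYYY-MM-DD
    }


# Illustrative records only; user names and timestamps are placeholders.
revisions = [
    {"user": "EditorA", "timestamp": "2003-01-24T10:00:00Z"},
    {"user": "EditorB", "timestamp": "2004-05-02T08:30:00Z"},
    {"user": "EditorA", "timestamp": "2005-07-19T21:15:00Z"},
    {"user": "EditorC", "timestamp": "2006-03-11T12:00:00Z"},
    {"user": "EditorA", "timestamp": "2007-09-30T17:45:00Z"},
    {"user": "EditorB", "timestamp": "2008-01-05T09:00:00Z"},
]

summary = summarize_revisions(revisions)
print(summary)
```

Running the same function on each page’s revisions gives directly comparable numbers for the two models.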
The lineup of users who contributed content and revisions to these Wikipedia pages may not tell us which Nissan model is more popular or more interesting to the public. For a more structured analysis I collected daily activity data for both Wikipedia pages, looking at page views per day and then per month, which I used to further compare popularity. To get the page view data for both pages I used a “get_pageviews” call built on Python’s urllib. I used this additional data to strengthen my EDA and dig deeper into other metadata attributes to analyze which Nissan model retains the higher popularity among Wikipedia users. For the daily page view count I produced a new data frame using pandas that shows the total count of user views per page by day. What I discovered was how both pages have grown over the years and how popularity rises, falls off, or sways towards one Nissan model. I used the differences in views to understand the popularity trend of the classic Skyline GT-R versus the new GT-R. After plotting and analyzing the page views data frame I could confirm that during the anticipated release of the new Nissan GT-R, its Wikipedia page views outnumbered the classic’s. Interestingly, based on this data the classic Nissan Skyline GT-Rs have a large following in the car enthusiast scene. They are reaching their legal US import age, which I believe is the reason the Skyline GT-R Wikipedia page has retained interest for over 10 years and, more recently, from 2012 to 2019, has had more page views than the new GT-R page. This suggests that the older Skyline GT-Rs are becoming more sought after now that the older models are legal to import and the most popular Skyline models are a few years away from legal import.
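The “get_pageviews” helper itself is not shown here, so the following is only a sketch of how daily views could be requested and rolled up by month. It assumes the Wikimedia REST page views endpoint for per-article data; the function names pageviews_url and monthly_totals are hypothetical, and the sample items are made up rather than real view counts.

```python
from collections import defaultdict
from urllib.parse import quote

# Assumed endpoint: the Wikimedia REST API for per-article page views.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article, start, end, project="en.wikipedia"):
    """Build the daily page view URL; start and end are YYYYMMDD strings."""
    title = quote(article.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/all-access/user/{title}/daily/{start}/{end}"


def monthly_totals(items):
    """Sum daily views into {"YYYY-MM": total} from the response's "items" list."""
    totals = defaultdict(int)
    for item in items:
        stamp = item["timestamp"]  # e.g. "2019010100" (YYYYMMDDHH)
        month = f"{stamp[:4]}-{stamp[4:6]}"
        totals[month] += item["views"]
    return dict(totals)


# Made-up items in the shape the API returns; the view counts are not real data.
sample_items = [
    {"timestamp": "2019013000", "views": 100},
    {"timestamp": "2019013100", "views": 120},
    {"timestamp": "2019020100", "views": 90},
]

url = pageviews_url("Nissan Skyline GT-R", "20190101", "20190228")
totals = monthly_totals(sample_items)
print(url)
print(totals)
```

The monthly dictionary for each page can then be loaded into a pandas data frame, one column per model, to plot the two popularity timelines side by side.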
While the difference in page views per day may only be a few hundred, it is still impressive and shows the older Nissan Skyline GT-R is just as popular as the newest GT-R release, if not more popular or more monumental to car enthusiasts. It appears that people are either more interested in the history of the Skyline GT-R or anticipating importing their own classic Skyline GT-R, since that is an increasingly popular opportunity for US enthusiasts.
Throughout this web scraping tutorial and exploratory data analysis I have discussed many interesting discoveries alongside factors that support my hypothesis. To conclude my EDA I plotted monthly user page views to explain why spikes and drops happen over time. I believe the data is shaped this way because of a surge of anticipation for the new Nissan GT-R release, during which the GT-R Wikipedia page briefly outperformed the classic Nissan Skyline GT-R page in total page views per day. Around 2017, as importing classic Skyline GT-Rs became legal, you can see a drastic increase in user interest as page views rise above those of the more recently hyped new GT-R. What I found interesting during this EDA was that the classic Skyline GT-R seems to retain a constant Wikipedia user base, whether measured by page views or by new content revisions. More impressively, after the release of the new Nissan GT-R its page views and user interest dwindled drastically, while the classic Skyline increased in user interest and revisions. Because the classic Nissan Skyline GT-R has held its popularity over the years while the new Nissan GT-R only received interest for a short period of time, I believe the results support the claim that the original Nissan Skyline GT-R is more popular among Wikipedia’s car enthusiasts.