Web scraping and text analysis of the events taking place during the Edinburgh Fringe

In this post, we dive into the basics of scraping websites, cleaning text data, and Natural Language Processing (NLP). I’ve based the context of this project around the Edinburgh Fringe, the world’s largest arts festival, currently taking place between the 2nd and 26th of August.

Getting data into Python from web scraping

For my analysis, I wanted to acquire text about all of the events taking place during the Fringe festival. This text data was spread across several webpages, and extracting it manually would have taken a long time.

This is where the Python library requests helps us out, as it can be used to make an HTTP request to a particular webpage. The website's text data can then be read from the response. That response, however, is one long block of raw HTML, so the BeautifulSoup library is used to parse it and efficiently extract only the content we want. …
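
The fetch-then-parse pattern described above can be sketched roughly as follows. This is a minimal illustration rather than the code used in the post: the function names, the `event-title` class, and the sample markup are all assumptions made up for the example, and a real listings page would need its own selectors.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its raw HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_event_titles(html: str) -> list[str]:
    """Return the text of every <h3 class="event-title"> element.

    The tag and class name are hypothetical; inspect the real page
    to find the elements that hold the content you want.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [
        tag.get_text(strip=True)
        for tag in soup.find_all("h3", class_="event-title")
    ]


# Hypothetical markup standing in for a downloaded listings page.
sample_html = """
<html><body>
  <h3 class="event-title"> Comedy Hour </h3>
  <p>A description we do not want.</p>
  <h3 class="event-title">Late Night Jazz</h3>
  <h3 class="other">Not an event</h3>
</body></html>
"""

titles = extract_event_titles(sample_html)
print(titles)
```

For a live page, the string passed to `extract_event_titles` would come from `fetch_html(url)` instead of the hard-coded sample. Keeping the download and the parsing in separate functions makes the parsing step easy to check without touching the network.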


Image of the Victoria line presented as a hologram ©Christopher Doughty

Using d3.js and pepper’s ghost as a creative way to present data

In his book “Envisioning Information”, Edward Tufte talks about our visualisations being caught in the two-dimensional flatlands of screen and paper [1]. I wanted to explore an alternative way to visualise data, so I looked for a creative method to induce excitement in the viewer and escape the flatland of a computer screen. Techniques such as augmented reality achieve this by adding layers to what already exists; however, I opted for something far easier and cheaper. Using a sheet of plastic, I created a holographic illusion of a data visualisation.

The final visualisation can be viewed on the following page (requires a viewer): https://penguinstrikes.github.io/content/pepper_ghost/index.html


An exploration of how various data science techniques can be used to narrow down baby name ideas

Source: Christopher Doughty

Finding the perfect name for your baby can be a challenge. It is hard to find a name that both you and your partner like and that also fits with your surname. There is then the added complication of name popularity, i.e. whether you pick something common or unique. When trying to wade through all of these aspects, could data science make the name selection process easier?

To investigate the topic, I gathered some data on baby names. Data for first names and last names was compiled from the Office for National Statistics (ONS) and National Records of Scotland (NRScotland) [1]. My final dataset contained 31,697 first names and 6,453 last names. …


Building a DC-GAN to create Golden Retriever images and using a survey to test if people could distinguish real from fake images


The experiment was simple: could a machine learning (ML) model produce Golden Retriever images that people would mistake for real ones? The reason for choosing dogs… dogs are awesome!

In our current climate, we often hear the term ‘fake news’, and as ML models become more advanced, their ability to create convincing machine-generated content is only getting better. I therefore thought it was appropriate to give people the chance to test how good they were at telling ‘real’ images from ‘fake’ ones. …

About

Christopher Doughty

Senior Data Scientist at Skyscanner. BSc Behavioural Biology from the University of St. Andrews
