Which arrondissement in Paris gladdens people the most?

One day I asked a friend: where should we live in Paris to be over the moon? Let’s answer that question with data science!

Basile Goussard
Analytics Vidhya
4 min read · Sep 21, 2020


That day I had two options: the first was to search the internet, the second was to solve it myself. As you might expect, I went for the first one, but I did not find a clear answer. So let’s try to solve it ourselves!

As usual, let’s split the work:

0°) Dataset

1°) Starter Pack

2°) Preprocessing

3°) Processing

4°) Visualisation

0°) Dataset

At the beginning I had no data to answer the following question:

Where should we go in Paris to be over the moon?

Photo by Jakob Owens on Unsplash

By scrolling through social media I realized that we have a near-infinite amount of data available for free! People give their opinions on social media, and they describe them with hashtags. For this study I used Instagram, but it could easily be extended to Pinterest, Facebook, Twitter, …

Here is one example of my input data. I retrieved all the comments and hashtags for this specific picture, as well as its location. Based on this, I had to make the following assumptions for the rest of the study:

  • People who post on Instagram with a location tag the correct location.
  • If you are happy, then you write a happier post than if you are not.
  • If the majority say that a place is good, we will admit that it is a good place.

Now we have to ‘industrialize’ the process: a bigger amount of data!
I retrieved 8762 posts with #ParisStreet using a script (code available on github here); the JSON.xz files contain the metadata:
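To give an idea of what these metadata files contain, here is a simplified sketch of the nested structure we rely on later. The `node`/`location`/`address_json` nesting matches what the preprocessing code reads; the field names inside `address_json` and all the values are illustrative, not copied from a real post.

```python
import json

# Simplified sketch of one post's metadata (values are made up);
# the real json.xz files contain many more fields.
metadata = {
    "node": {
        "location": {
            "name": "Montmartre",
            # address_json is itself a JSON string nested inside the JSON metadata
            "address_json": json.dumps({
                "street_address": "1 Rue Foo",  # hypothetical address
                "city_name": "Paris",
                "country_code": "FR",
            }),
        }
    }
}

# The inner JSON string has to be parsed a second time
address = json.loads(metadata["node"]["location"]["address_json"])
print(address["city_name"])  # Paris
```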

1°) Starter Pack

As usual, I used Anaconda with three Jupyter notebooks based on three environments:

  • one environment to download the data from the social media (Instaloader [link], to download data from Instagram)
  • one environment to preprocess & visualise the results (a regular geo environment with pandas [link], geopandas [link], NumPy [link] & KeplerGl [link])
  • one environment to get insights from the data with pandas [link], NumPy [link] & Afinn [link]

2°) Preprocessing

First, we are going to keep only the posts that are geolocated. Let’s dive into the metadata (JSON format): inside it we can find another JSON object named ‘location’.

import json          # parse the JSON metadata
import pandas as pd  # do some data analysis
import lzma          # open the .xz files
import glob          # retrieve all the files from our directory

post_list = glob.glob('./#parisstreet' + '/*.xz')
address = dict()
for post in post_list:
    with lzma.open(post) as f:
        metadata = json.load(f)
    location = metadata['node']['location']
    if location and location.get('address_json'):
        address[post] = json.loads(location['address_json'])
df_address = pd.DataFrame.from_dict(address, orient='index')

Inside we can find the street number, the street name, the town, and the country. For each of those locations, we retrieve the latitude & longitude. I suggest reading the following article by Abdishakur, which provides a clear view of geocoding methods (street address to coordinates).

Now we have to filter our posts: we are going to keep only those with an exact location inside Paris. After this preprocessing, 1694 posts remain (compared to the 8762 at the beginning)!

Almost 81% of our posts are not useful for our study! Let’s save the other 19% in a new folder (preprocessing).
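One simple way to sketch this filtering step is a rough bounding-box check on the geocoded coordinates (the real filter can use the exact city polygon instead). The `lat`/`lng` column names, the sample rows, and the Paris bounds below are all illustrative:

```python
import pandas as pd

# Hypothetical sample of geocoded posts (lat/lng column names are assumed)
df_address = pd.DataFrame(
    {"lat": [48.8738, 48.8566, 45.7640], "lng": [2.2950, 2.3522, 4.8357]},
    index=["post_a", "post_b", "post_c"],
)

# Rough bounding box around Paris (approximate values, for illustration)
LAT_MIN, LAT_MAX = 48.815, 48.902
LNG_MIN, LNG_MAX = 2.225, 2.470

# Keep only the posts whose coordinates fall inside the box
in_paris = df_address[
    df_address["lat"].between(LAT_MIN, LAT_MAX)
    & df_address["lng"].between(LNG_MIN, LNG_MAX)
]
print(in_paris.index.tolist())  # ['post_a', 'post_b'] — the Lyon point is dropped
```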

3°) Processing

Let’s dive into the sentiment analysis part. I used AFINN, one of the simplest sentiment analysis approaches: a lexicon that assigns each word a valence score. Thanks to the afinn package we can retrieve the sentiment of each post from its description.
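To illustrate the idea, here is a minimal AFINN-style scorer with a tiny hand-made lexicon. The real AFINN list has a few thousand scored words, and the afinn package does essentially this lookup (plus proper tokenization); the words and scores below are just a toy subset.

```python
# Tiny AFINN-style lexicon: each word gets a valence between -5 and +5
# (hand-picked toy entries; the real AFINN list is much larger)
LEXICON = {"love": 3, "beautiful": 3, "happy": 3, "sad": -2, "awful": -3}

def afinn_style_score(text: str) -> int:
    """Sum the valence of every known word; unknown words count as 0."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

print(afinn_style_score("I love this beautiful street"))  # 6
print(afinn_style_score("awful weather today"))           # -3
```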

Now we can average all sentiments per arrondissement to retrieve the final score!
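Averaging per arrondissement is a one-line groupby, assuming each post row carries its arrondissement and AFINN score (the column names and scores below are made up for illustration):

```python
import pandas as pd

# Hypothetical post-level sentiment scores
df = pd.DataFrame({
    "arrondissement": ["17ème", "17ème", "14ème", "14ème", "11ème"],
    "sentiment": [4, 2, -1, 1, 2],
})

# Final score = mean sentiment of all posts in each arrondissement
score_per_arr = df.groupby("arrondissement")["sentiment"].mean()
print(score_per_arr.sort_values(ascending=False))
# 17ème first (3.0), then 11ème (2.0), then 14ème (0.0)
```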

4°) Visualisation

Here I represent the result based on the sentiment per arrondissement, bucketed by quantile. The idea is to compare arrondissements with each other, not to give an absolute sentiment value for the city.
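Bucketing the averaged scores into quantiles (so the map colours rank arrondissements relative to each other rather than on an absolute scale) can be sketched with pandas’ `qcut`. The scores below are invented for the example:

```python
import pandas as pd

# Hypothetical average sentiment per arrondissement
scores = pd.Series(
    [3.1, 2.4, 1.9, 1.2, 0.4, -0.2],
    index=["17ème", "2ème", "11ème", "5ème", "9ème", "14ème"],
)

# Split into three equal-sized buckets for the choropleth colour scale
buckets = pd.qcut(scores, q=3, labels=["low", "mid", "high"])
print(buckets["17ème"], buckets["14ème"])  # high low
```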

As we can see, the arrondissement with the best mark is the 17ème and the worst one is the 14ème … sorry Louis 😉!

🥇 : 17ème
🥈 : 2ème
🥉 : 11ème

Thanks for reading, and have fun with your geospatial data! (Code available on github)

Engineer exploring the potential of Earth Observation through AI & data fusion