Restaurants and cafés in Seoul: A simple data science project

A step-by-step guide to implementing CRISP-DM methodology in a data mining project

Thierry Laplanche
Analytics Vidhya

--

“Yourself and Yours”​, Hong Sang-soo, 2016.

Seoul is the sprawling capital of South Korea, home to 10 million people at a density of about 16,000/km². It is divided into 25 districts (called gu), each with a population ranging from 135,000 to 670,000, and 424 smaller administrative neighborhoods (dong).

The 25 districts of Seoul

The city is home to an incredible number of F&B businesses: more than 110,000 restaurants and 15,000 cafés, many of which come and go rather quickly. Since eating out and sipping a cup of coffee are a major part of Seoulites’ social life, we chose this theme to put some basic concepts of data science into practice. Among these concepts: clustering, or the art of grouping things that share common patterns. Would it be possible to cluster the districts and neighborhoods of Seoul based on what we can eat there?

To conduct this project, we will follow the CRISP-DM methodology, which provides a structured approach to planning a data mining project. Each phase can be described in one question.

A. Business understanding

Question: What is the problem we are trying to solve?

We want to find out if there is any pattern in the type of restaurants (including cafés) located in every district and neighborhood of Seoul.

  • Are some cuisines / types of restaurants more represented in certain areas than in others?
  • Is it possible to group districts or neighborhoods together based on the sole criterion of food?

B. Analytical approach

Question: What type of model can answer our problem?

Our problem is clearly a clustering problem. We will therefore rely on a clustering model to solve it. Clustering models are numerous, with the two most popular being K-means clustering and hierarchical clustering.

Fortunately, most clustering algorithms are already implemented in open source libraries for the language we will use (Python), therefore we won’t have to do much coding. The most critical and the most tedious part of this project, as with most data science projects, will be to collect and clean the data.

C. Data requirements

Question: What kind of data to look for?

In order to cluster geographical areas based on the restaurants they host, we obviously need data regarding:

  • The administrative divisions of Seoul: the list of districts (gu) and neighborhoods (dong) and their coordinates to visualize on a map.
  • The restaurants (as many as possible) and, more importantly, the type of cuisine they serve.

D. Data collection

Question: Where to look for the data?

D.1. Geographical data

“Open Data Plaza”, a website run by the Seoul Metropolitan Government, offers a large variety of data and statistics (demographics, land, employment, etc.) for download in structured data formats. Among them, the coordinates of each neighborhood will come in handy when searching for venues: http://data.seoul.go.kr/dataList/datasetView.do?infId=OA-13223&srvType=S&serviceKind=1&currentPageNo=null

Open Data Plaza

An easy way to plot areas on a map is to use GeoJSON data, which makes it possible to draw the precise shapes of geographical areas. Data for both districts and neighborhoods is available on this GitHub repository.

D.2. Venues data

Collecting data on restaurants is a bit trickier. Our first idea was to use one of the popular map services in Korea, KakaoMap or Naver Maps. Both offer an API to query a list of restaurants based on coordinates or an address, but the information they send back makes it hard to classify these places by cuisine: the category is either too general (‘Western Food’) or too precise (the name of the franchise, in the case of a restaurant chain).

So we turned to another service, Foursquare, the famous local search engine used all over the world, although not that popular in Korea. On Foursquare, restaurants are pretty well labeled.

Foursquare categories

We parsed the API results into a set object. In Python, a set is a collection in which every element is unique, which automatically eliminated any duplicates in case Foursquare returned the same places on different queries. We then fed this set to a Pandas DataFrame, an easy-to-use data structure. At this point we had retrieved a total of 15,576 restaurants and cafés.
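
A minimal sketch of this deduplication step, with hypothetical venue names standing in for the real Foursquare results:

```python
import pandas as pd

# Hypothetical API results; the third entry duplicates the first,
# as can happen when search radii overlap between queries
results = [
    ("Cafe Onion", "Café", 37.5446, 127.0557),
    ("Gwangjang Noodles", "Noodle House", 37.5704, 126.9997),
    ("Cafe Onion", "Café", 37.5446, 127.0557),
]

# Tuples are hashable, so adding them to a set silently drops duplicates
venues = set(results)

# Feed the deduplicated set to a DataFrame for further processing
df = pd.DataFrame(venues, columns=["name", "category", "lat", "lng"])
print(len(df))  # 2
```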

D.3. The address problem

As we were working on the next phase of the project, we noticed that something was wrong in the Foursquare data. Some places had their address missing, others showed only the street, or the country… We couldn’t do anything with such inconsistent data.

We had to go back to the data collection phase to gather more data about each venue. That is not a problem: the process flow of a data science project is iterative, which means a lot of back-and-forth testing new hypotheses, adding or discarding features, and fine-tuning various parameters.

Based on the hopefully reliable information we got from Foursquare (the latitude and longitude), we looked for a way to transform coordinates into a structured address. That is when Kakao’s API came to the rescue, thanks to its “coordinates to address conversion” function which does exactly what it says on the tin.
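
A sketch of how this lookup can be wired up, assuming a Kakao REST API key. The endpoint and response fields below follow Kakao’s Local API documentation; the key and the sample values are placeholders:

```python
def coord_to_address(lat, lon, api_key):
    """Call Kakao's coord2address endpoint for one coordinate pair."""
    import requests  # lazy import: only needed when actually querying the API
    resp = requests.get(
        "https://dapi.kakao.com/v2/local/geo/coord2address.json",
        params={"x": lon, "y": lat},  # Kakao expects x=longitude, y=latitude
        headers={"Authorization": f"KakaoAK {api_key}"},
    )
    resp.raise_for_status()
    return parse_region(resp.json())

def parse_region(payload):
    """Extract the district (gu) and neighborhood (dong) from the response."""
    address = payload["documents"][0]["address"]
    return address["region_2depth_name"], address["region_3depth_name"]

# Parsing a sample response (no network call needed)
sample = {"documents": [{"address": {
    "region_1depth_name": "Seoul",
    "region_2depth_name": "Gangnam-gu",
    "region_3depth_name": "Sinsa-dong"}}]}
gu, dong = parse_region(sample)
```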

Kakao API

We were able to complete our data frame with the district and neighborhood of each place, leaving us in good shape for the next phase.

E. Data understanding / cleaning

Question: How to clean and shape the data set so that it fits the model?

This is typically the most tedious and time-consuming part of a data science project, especially as the size of the data set grows. Fortunately, in this project we deal mainly with structured data, meaning it is already well organized in tables or dictionaries that are easy to parse with basic Python libraries.

We analyzed the collected data to determine what kind of ‘cleaning’ would be needed prior to running the clustering algorithm.

E.1. Address check

We wanted to make sure that all the venues collected are really located in Seoul. The Foursquare API returned venues based on neighborhood coordinates and a search radius around them, so it would not be surprising if some results fell outside the city limits. To discard any restaurant located outside the capital, we checked the list of unique districts contained in our data frame: any value that is not one of the 25 districts of Seoul is easy to spot.
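
This sanity check can be sketched as follows, with a toy data frame and a truncated district list (the real list has 25 entries):

```python
import pandas as pd

# Hypothetical sample: the last venue falls outside Seoul
df = pd.DataFrame({
    "name": ["Venue A", "Venue B", "Venue C"],
    "district": ["Gangnam-gu", "Jongno-gu", "Seongnam-si"],
})

seoul_gu = {"Gangnam-gu", "Jongno-gu", "Mapo-gu"}  # ... 25 districts in total

# Any district name not in the official list flags an out-of-city venue
outside = set(df["district"]) - seoul_gu
df = df[df["district"].isin(seoul_gu)]
```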

E.2. Marginal categories

A simple look at the list of categories allowed us to identify odd values: some totally unrelated to food (‘Design Studio’, ‘Event Space’), others that can be considered outliers because they are so few in number that they can hardly be representative of any neighborhood.

Plotting the number of places collected for each category shows that the vast majority of categories are close to insignificant.

We decided to remove all categories (and the venues belonging to them) for which we collected fewer than 3 venues. This operation reduced the number of categories in our data set from 150 to 99.
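
The filtering step can be sketched like this (toy data; the real data set had 150 categories):

```python
import pandas as pd

df = pd.DataFrame({
    "name": [f"venue{i}" for i in range(7)],
    "category": ["Café"] * 4 + ["BBQ Joint"] * 2 + ["Design Studio"],
})

# Keep only categories represented by at least 3 venues
counts = df["category"].value_counts()
frequent = counts[counts >= 3].index
df = df[df["category"].isin(frequent)]
```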

Number of venues retrieved for each category

E.3. Minimum number of venues

Just as a survey with a small sample size doesn’t say much, it wouldn’t make much sense to analyze neighborhoods for which we retrieved only a couple of places. A neighborhood with only a few restaurants won’t teach us anything and can only pollute our data. Let’s call the describe function to see statistics on the number of venues retrieved for each neighborhood.

Out of a total of 456 neighborhoods, the average number of venues is 34. Half of them count more than 24 places, while 25% host fewer than 10. The best-represented neighborhood in our data set counts 149 restaurants (Sinsa-dong, in the district of Gangnam-gu). With a single line of code and a threshold set to 10, we reduced the number of neighborhoods to 347 and the number of places to 14,996.
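
The describe call and the one-line threshold filter can be sketched as below (toy counts standing in for the real 456 neighborhoods):

```python
import pandas as pd

df = pd.DataFrame({
    "name": [f"venue{i}" for i in range(25)],
    "neighborhood": ["Sinsa-dong"] * 12 + ["Gye-dong"] * 10 + ["Small-dong"] * 3,
})

# Number of venues collected per neighborhood
counts = df.groupby("neighborhood").size()
print(counts.describe())  # count, mean, std, min, quartiles, max

# One line: keep only neighborhoods with at least 10 venues
df = df[df["neighborhood"].isin(counts[counts >= 10].index)]
```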

Let’s visualize the density of restaurants in our data set on a map, thanks to the Folium library, which matches the GeoJSON shapes of Seoul’s neighborhoods with the figures in our data set.

Density of restaurants in the dataset for each neighborhood

E.4. Feature engineering

The data preparation phase is also when feature engineering takes place: selecting and shaping the variables to be fed to the algorithm. The only feature in our project is the type of cuisine. This categorical variable (a variable that can take one of a limited set of possible values, such as ‘French Restaurant’ or ‘Fried Chicken Joint’) needs to be converted to numerical values to be understood by the machine. That is easily done with ‘one-hot encoding’, the process of turning the N possible values of a variable (in our case, 99 types of cuisine) into N binary variables (1 if the restaurant is of that type, 0 otherwise). In Python, the Pandas function get_dummies does the job in one line of code.
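
For instance, assuming a data frame with one row per venue (toy values):

```python
import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["Sinsa-dong", "Sinsa-dong", "Gye-dong"],
    "category": ["Café", "Korean Restaurant", "Café"],
})

# One-hot encoding: one binary column per category
onehot = pd.get_dummies(df["category"])
onehot.insert(0, "neighborhood", df["neighborhood"])
```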

Our features also need to be scaled so that they all fall in the same range, most commonly between 0 and 1. In our case, we will calculate the ratio of each type of cuisine for every district and neighborhood (e.g. 42% Korean restaurants, 38% cafés, 20% Chinese restaurants). More information on feature engineering: https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114.

After grouping restaurants by neighborhood and calculating the ratio of each cuisine, we can write a function that returns the most common cuisines in every neighborhood.
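
A minimal sketch of that grouping step, continuing from a one-hot encoded table (the values below are hypothetical):

```python
import pandas as pd

onehot = pd.DataFrame({
    "neighborhood": ["Sinsa-dong", "Sinsa-dong", "Sinsa-dong", "Gye-dong"],
    "Café": [1, 1, 0, 1],
    "Korean Restaurant": [0, 0, 1, 0],
})

# The mean of the binary columns is the ratio of each cuisine per neighborhood
ratios = onehot.groupby("neighborhood").mean()

def most_common_cuisines(row, n=2):
    """Return the n categories with the highest ratio in one neighborhood."""
    return list(row.sort_values(ascending=False).index[:n])

top = ratios.apply(most_common_cuisines, axis=1)
```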

We are now ready to feed the algorithm!

F. Data modeling

Question: How to fit the model so that it answers our problem?

It is time to run the algorithm that will hopefully solve our problem, feeding it the data we shaped in the previous step. Data modeling is also the phase in which the data set is split into training and test sets when using supervised learning techniques (such as decision trees or neural networks). For an unsupervised technique such as clustering, that is not needed, and all we have to do is find the optimal parameters for our algorithm.

F.1. K-Means clustering

The K-Means algorithm requires that we set the number of clusters (the parameter k) before fitting the model. Finding the optimal value of k is possible with the elbow method, which consists of running the model with different values of k, computing a score based on the sum of squared errors (SSE), and plotting the values on a graph. The optimal value of k is where the curve marks an angle. Below, we can see that the curve (in blue) is rather smooth, which suggests that our data is not easy to cluster. Still, the method points to an optimal value of 3 clusters for districts (on the left) and 4 clusters for neighborhoods (on the right).
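
The elbow method can be sketched as below, with random features standing in for our cuisine ratios (in scikit-learn, the SSE is exposed as the model’s inertia):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((25, 5))  # 25 districts x 5 cuisine-ratio features (hypothetical)

sse = []
ks = range(1, 10)
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(model.inertia_)  # sum of squared distances to cluster centers

# Plot ks against sse (e.g. with matplotlib) and look for the "elbow"
```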

F.2. Hierarchical clustering

Hierarchical clustering treats each sample (in our case, each district or neighborhood) as a single cluster, then successively merges pairs of clusters until all of them have been merged into one. The optimal value of k can be determined by looking at the following graph (called a dendrogram), which shows how the merges are performed. Here again, the optimal number of clusters is 3 for districts and 4 for neighborhoods.
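
With SciPy, the dendrogram and the final cut into flat clusters can be sketched as (again with random stand-in features):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = rng.random((25, 5))  # 25 districts x 5 cuisine-ratio features (hypothetical)

# Ward linkage: successively merges the pair of clusters that
# least increases the total within-cluster variance
Z = linkage(X, method="ward")

# dendrogram(Z) draws the merge tree (needs matplotlib to display);
# fcluster cuts the tree into at most 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
```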

G. Model evaluation

Question: Does the model answer the problem or does it need to be adjusted?

Let’s visualize the clusters computed by the algorithm. On a map, first, to see if districts and neighborhoods are geographically clustered, and then by looking at tables to try and find any pattern among clusters.

G.1. Districts

District clusters
  • Cluster 0 (green): Located in the periphery, far from the business centers, these districts show the highest ratio of Korean restaurants, as well as a mix of popular cuisines (BBQ, noodles, fried chicken, Chinese food).
  • Cluster 1 (pink): These “intermediate” districts show a somewhat lower ratio of Korean restaurants and a fair balance of cafés and meat restaurants.
  • Cluster 2 (yellow): With a high concentration of cafés (exceeding that of Korean restaurants if we aggregate cafés and coffee shops), BBQ joints and bakeries, this is where people work and hang out.

G.2. Neighborhoods

Neighborhood clusters
  • Cluster 1 (dark green): This is where to go for a good BBQ. Well, barbecue is everywhere in Seoul!
  • Cluster 2 (purple): In this cluster we find neighborhoods known for hosting a large number of cafés: Gahoe-dong, Mullae-dong 2-ga, Gye-dong and the hip place Ikseong-dong.
  • Cluster 3 (yellow): This cluster groups neighborhoods with a high ratio of general venues such as Korean restaurants and noodle joints. It includes popular neighborhoods such as Jegi-dong, Insa-dong, Jongro 4-ga and Jongro 5-ga.
  • Cluster 4 (light green): This cluster is less clear and appears like a mix of other clusters. We may assume that these neighborhoods don’t show any predominant type of cuisine.

H. A few remarks…

  • In the “Data modeling” part, we have seen that our data is not easy to cluster. It might be because we didn’t get enough data, or because the feature “type of cuisine” is not discriminatory enough to put a clear label on every neighborhood. We could find ways to add more data (such as pubs and bars) or more features in the balance.
  • Clustering has proven more effective and easier to analyze for districts than for neighborhoods. This could be explained by the difference in the amount of data collected on average for each area unit (43 venues per neighborhood vs. 620 per district).
  • Foursquare is seldom used in Korea, and its database is mostly populated by foreign visitors. This could create some bias in the data returned from the API. The analysis may have been more accurate using a local map service. But as we saw, it would have required more cleaning work to come up with distinct categories.
  • Even with Foursquare, more work could have been done to make the categories more coherent. For instance, we could have grouped the categories “Café” and “Coffee Shop”, or “Japanese Restaurant” with “Ramen Restaurant” and “Sushi Restaurant”.
