Optima . Blog
Published in

Optima . Blog

Outcome of this work: Polygons delineating Cairo’s neighborhoods superimposed on a satellite map.

Mapping Cairo’s Neighborhoods

Obtaining neighborhood definition polygons from the Wikimapia API and working with them in R

A short while back, we were working on a data analysis problem that involved investigating supply chain efficiency at hundreds of distribution outlets around the city of Cairo (the Capital of Egypt). For the analysis to be of value, it had to be conducted and interpreted at the level of Cairo’s neighborhoods (aka areas or regions). The catch was that the given dataset of distribution outlets had every outlet’s lat-long coordinate but didn’t offer any mapping between an outlet and the neighborhood it resides in.

Solving the problem required that we find answers to the following three questions: (1) From where can we get sufficiently reliable “polygons” that represent Cairo’s neighborhoods? (2) How can these polygons be represented and manipulated in R? (3) How to map a lat-long coordinate to its encompassing neighborhood polygon in R?

  1. Obtaining neighborhood boundaries for Cairo, Egypt

Had the pursuit at hand been about a city in the more developed world, finding administrative region boundary data wouldn’t be much of an issue. For instance, the NYC department of city planning releases regularly updated datasets of administrative and political boundaries of the city’s districts and neighborhoods.

As for Cairo, there are some scattered resources around the web for neighborhood boundary data. However, as far as we have seen, those resources are either out of date or cripplingly incomplete. We conjecture that the two key reasons for the poor availability of neighborhood mapping data for Cairo are: (1) Significant regions of the city grew and continue to grow spontaneously and rather haphazardly unchecked by any strict planning framework. (2) The country is yet to subscribe to the open government data paradigm.

Ultimately, our research led us to Wikimapia. Wikimapia is an open content, collaborative mapping project where real world geographical objects are described (and updated) by the “crowd”.

On close manual inspection, we found that Wikimapia’s definitions for Cairo’s neighborhood were more than adequate for our purpose. The good news didn’t stop there: Wikimapia offers an intuitive and rather generous API that allows users to pull data as needed.

The first step to acquiring region boundaries from Wikimapia is to generate a free API key from here. Note that if you wish to just run a small number of queries, you don’t really need to create an API key. You can just set the API Key in your API calls to “example” as we do below.

Now that we have an API key, we can go ahead and shoot an API call directly from within our R code as shown in the gist below.

R code to obtain region boundary data from Wikimapia

Let’s look at the key aspects of how the code above works.

  • Function inputs min.coord and max.coord are the bottom left and top right lat-long coordinates of the bounding box for the regions within which we want Wikimapia to return region boundaries.
  • The number of regions we want Wikimapia to return is defined by count. Note that Wikimapia will return regions sorted in descending order of their surface area. Also note that the max count per page is 100. However, you can obtain as many regions as you wish by incrementing the page parameter per API call.
  • func holds the name of the Wikimapia API call we wish to make. The right API call to use when one wishes to return all regions within a certain bounding box is place.getbyarea. For a complete list of all Wikimapia API calls, you may wish to look at the Wikimapia API documentation.
  • We use the URLencode() function to make sure that we have a valid URL (the function percent-encodes any non English alphanumeric characters most importantly spaces).
  • The getURL() function (from the RCurl package) makes the HTTP request and stores the returned JSON response in an R character vector.

If you take a look at the json.response object (character vector), you will find it rather difficult to follow (or work with in anyway for that matter). We will rectify that situation in the next section.

Before we move on, let’s re-state that Wikimapia’s crowdsourced region boundary data is immensely valuable especially when working on regions where data from formal channels is scarce (as is the case with Cairo). However, there are a few things to keep in mind:

  • Wikimapia’s neighborhood boundary definition data is, like most crowdsourced data, rather noisy (at least as far as Cairo is concerned) and especially for the relatively smaller regions.
  • There is no concept of hierarchy for Wikimapia’s regions. For instance you can’t request all Cairo regions down to two levels of depth. All you get is regions sorted by area so you can very well end up with areas of the map that aren’t covered by any regions at all (because all the regions defined there are rather small) and areas of the map where you have several levels of overlapping regions (because they all have relatively larger areas).

We had to go through a long process of data cleaning to arrive at the clean non-overlapping regions you see in the opening picture of this post. We will probably write another blog post describing the work we did to filter out most of the noise.

2. Representing and Manipulating Region Boundary Polygons in R

R package RJSONIO comes with the handy function fromJSON() which reads content in JSON format and converts it into R objects. The structure of the JSON is maintained and is represented by a hierarchy of R lists. For our Wikimapia JSON response (obtained by the code gist above), the structure of the output of fromJSON() is depicted in the figure below.

Structure of the JSON response of the place.getbyarea Wikimapia API call

The two pieces of information we would want to extract from the above structure are the coordinates of each polygon’s vertices and each polygon’s name. Here is one way to do that for the first polygon (first list item of the json.response$places list):

coerced.json.response = fromJSON(json.response)
region = coerced.json.response$places[[1]]
polygon.coords = data.frame(do.call(rbind, region$polygon)) polygon.name = ifelse(nzchar(region$title), region$title, "unnamed")

The most popular way to work with geospatial data in R is via the sp package which provides classes and methods to model most important types of geographical objects including: points, lines, rings and polygons. We will limit our discussion here to polygons.

The key classes offered by the sp package to adequately models polygons are: Polygon, Polygons, and SpatialPolygons. The interplay between these four classes is described in the figure below.

The three key polygon classes in R package sp

The Polygon object represents a simple polygon. Think about it as a wrapper around a data frame containing the coordinates of the polygon’s vertices. A Polygons object combines (more like melts together) multiple simple Polygon objects to make up a more complex polygon that may contain holes. A Polygons object must have a name (called an ID). Typically, a Polygons object would be used to model a city neighborhood. The SpatialPolygons object groups together one or more Polygons objects to model a city. In addition, a SpatialPolygons object is what is required by most spatial analysis functions in package sp (and geometry operations package rgeos which is often heavily used in conjunction with package sp for spatial data analysis). A key point to note here is: even if the analyst is working with a Polygons object representing a single city neighborhood, she would still need to encapsulate that single Polygons object within a SpatialPolygons object to use most of the functions in package sp and package rgeos.

p1 = Polygon(rbind(c(0,4), c(0,7), c(5,7), c(5,4), c(0,4)))
ps1 = Polygons(list(p1), ID = "ps1")
p2 = Polygon(rbind(c(0,0), c(0,3), c(5,3), c(5,0), c(0,0)))
p3 = Polygon(rbind(c(1,1), c(2,2), c(4,2), c(3,1), c(1,1)), hole = TRUE)
ps2 = Polygons(list(p2, p3), ID = "ps2")
sp = SpatialPolygons(list(ps1, ps2))
plot(sp, col = c("#53D099", "#FF813A"))

The following code gist shows one way to convert Wikimapia’s JSON response to sp polygons. Note that for brevity, the body of the createWikimapiaAreaURL() function is not shown in the code gist below (sp-2.R). Please refer to that in the code gist above (sp-1.R).

Let’s look at the key aspects of how the code above works.

  • The workhorse of the code above is custom function extractPolygon() which extracts a region’s vertex coordinates and name from the JSON response (i.e. from each of the json.response$places lists in the hierarchy of R lists obtained via fromJSON()) and returns an sp Polygons object.
  • The getRegionPolygons() function is mostly for code clarity and simply runs an lapply executing the extractPolygon() function over each place in the json.response$places list of the JSON response.
  • Function getRegionsByArea() creates the right API call URL, converts the returned character vector to a hierarchy of R lists and calls the getRegionPolygons() function to convert each polygon to an sp Polygons object. Note that if the requested number of polygons is larger than the maximum per page, the getRegionsByArea() function will make multiple paginated API calls and append the results. The return type is a list of Polygons objects.

The figure below shows the Cairo region polygons (with noise filtered). Note that to plot a polygon in R, you simply cast it to an sp SpatialPolygons object and pass it to R’s good old plot() function.

Cairo neighborhoods as obtained from Wikimapia (after noise removal)

Now that we have obtained region boundary data from Wikimapia and have modeled that data as sp Polygons in R, we can finally get to our ultimate objective of mapping a point (represented by its lat-long coordinates) to its encompassing region. Let’s look at how that can be done in the next section.

3. Mapping lat-long coordinates to their encompassing neighborhoods

The sp package offers a function called over() specifically for the purpose of assigning a given point to one of many given spatial polygons or returning NA if the given point lies outside all given spatial polygons. The over() function requires two inputs: (1) A set of points represented as a SpatialPoints object (which is another object from the sp package that wraps around a data.frame of lat-long coordinates) encompassing all the lat-long coordinates we wish to assign to their parent polygons and (2) a SpatialPolygons object encompassing all region polygons.

Typically the output of the over function looks like the snippet below where the index (the first row) is the point’s index in the SpatialPoints object and the value (the second row) is the index of the polygon to which the point belongs. NA denotes cases where the respective point doesn’t lie within any of the given polygons.

1    2   3   4    5   6   7   8   9   10  11  12  13  14  15 
101 NA NA NA 2 NA NA NA 12 NA 22 1 1 1 1

The following code gist shows the over function in action. For the sake of offering an example, a grid of points roughly over Cairo is created. The over function is then used to assign each grid point to its encompassing region and to filter out grid points that lie outside Cairo. The same approach can be applied to any dataset of lat-long coordinates.

A couple of things to note regarding the code gist above:

  • The createCairoGrid() function is not directly relevant to the discussion at hand. We simply use it to create a grid of equally spaced points to serve as an example.
  • In this article we do not show how we went from raw.cairo.regions which was obtained in sp-2.R in section 2 above to cairo.regions which we used here in sp-3.R. As mentioned earlier, the region boundary cleaning process is too involved to squeeze in here and might be elaborated upon in a later post.
  • The getRegionNames() function simply returns a character vector containing the name of each region. This is then indexed by the output of the over function to assign a region to each grid point.

On the map below, grid points that were filtered out are shown in orange whereas those that were assigned to their encompassing regions are shown in green. The cairo.regions polygons are shown in light gray below.

Grid points that lie within Cairo are in Green while those outside are in orange

The filtered.grid.points data.frame has an added column denoting the name of the region to which the point of the respective row belongs.

     lng       lat              region
18 31.31029 29.83986 Helwan
105 30.87918 29.91795 6th of October City
161 31.51148 29.94398 New Cairo
320 31.19533 30.07413 Imbaba
322 31.25281 30.07413 Shoubra
325 31.33903 30.07413 Heliopolis
326 31.36777 30.07413 Nasr City

If you have any questions or suggestions for improvement, please leave us a comment below. If you know someone who would benefit from this post in their work, please feel free to share the post with them.

--

--

Thoughts on data, technology, startups, and oftentimes, other things.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Yusuf Saber

Yusuf Saber

Passionate about entrepreneurship, data science, and product design.