Finding the most “innovative” square kilometer in Europe with spatial SQL
I enjoy finding obscure large datasets and enjoy doing spatial analytics that I haven’t seen done before. One of the best places to find interesting data in a somewhat passive manner is to subscribe to the Data is Plural newsletter. That is how I also came across this piece of data.
So this is what I found: 18.8 million geocoded patents 👀
Producing this data was a part of a research project and there was also a scientific article about the data. In the abstract of the article there is a good short explanation on what all of this is about:
The dataset provides geographic coordinates for inventor and applicant locations in 18.8 million patent documents spanning over more than 30 years. The geocoded data are further allocated to the corresponding countries, regions and cities. When the address information was missing in the original patent document, we imputed it by using information from subsequent filings in the patent family. The resulting database can be used to study patenting activity at a fine-grained geographic level without creating bias towards the traditional, established patent offices.
So there was a lot of high quality open spatial data, but only very minimal spatial analysis was done with it (so far), as it wasn’t in the scope of the original article. So this was like a tailored case for me to get my hands dirty and dive into the data. Time to crunch those 18 million points!
First I did some basic visualizations with the points and posted them on Twitter. I had some interesting conversations and idea exchange there on how to approach the analysis. But the analysis took an interesting turn when the original author of the article about the geocoding of the data, Gaétan de Rassenfosse, a professor at EPFL, sent me an email. He asked whether it would be possible to determine the most innovative square kilometers in Europe based on that data.
I thought this was a brilliant idea and immediately had a technical approach on how to determine that in my head, so I got right to it!
The technical nitty gritty
To analyze the data, the logical approach for me was PostgreSQL with PostGIS. If you are unfamiliar with those, you might want to check my older blog post that tries to cover the basics pretty broadly. Anyhow, here’s the short and simple description about this particular analysis process.
First I loaded the data to my local database. I like to load csv files usually through the QGIS GUI and I did that also this time. Took a while as one could expect.
Because just showing the number of points would end up being pretty much like a population map (a lot of people → a lot innovations) I needed to normalize the data. For this I used the GEOSTAT 2011 grid. Of course this could have been also done with the number of companies or something else, but population data was readily available in a nice square kilometer grid.
There are some patents in the data with several inventors, so in those cases one patent leads to multiple points in this data. When combining the population with the grid, I wanted to get a fractional number that would represent the total contribution of that point to a patent. In other words, if there would be a patent with five points (inventors) for a patent, each of those points should contribute 0.2 to the innovation score of the grid. That was achieved with the following query:
CREATE TABLE patent_share
SELECT appln_id, 1.0/COUNT(*) fract
GROUP BY appln_id;
Next up I joined this with the content of the original table and also filtered out all the patents that were filed before the year 2000. I am also using a window function in there to check if how many points have similar geometries. That was to later filter out grid cells that would have all the patents in a single point.
count(geom) OVER (PARTITION BY geom) as similar_points,
ps.appln_id, fract, patent_office, filing_date, lat, lng, ctry_code, name_0, name_1, name_2, name_3, name_4, name_5, city, coord_ source, source, type
FROM geoc_inv gi
LEFT JOIN patent_share ps
ON ps.appln_id = gi.appln_id
WHERE filing_date::date > ‘2000–01–01’;
Then it was the time to combine the data with the population grid downloaded earlier. That was achieved with PostGIS ST_Intersects function as follows:
CREATE TABLE innovation_grid
SELECT a.grd_id, tot_p, count(*) AS patent_count, sum(fract) AS patent_fractions, sum(fract)::real / tot_p::real AS patent_per_capita, max(similar_points) as highest_single, ctry_code, a.geom
FROM eu_pop_grid2011 a
JOIN innovation_analysis b ON
ST_INTERSECTS(a.geom, ST_TRANSFORM(b.geom, 3035))
GROUP BY a.grd_id, tot_p, a.geom, ctry_code;
Note how I also did a ST_TRANSFORM for the points to get them inside the grids. I used aggregate functions to calculate the following attributes:
- patent_count: the total number of patents
- patent_fractions: the total sum of patents
- patent_per_capita: the sum of the patent fractions divided by the total population of the grid
- highest_single: the number of patents that are all in a single point (i.e. same geometry)
Last piece of the SQL puzzle was a SELECT statement that used the table created above. The query selects the top of each group (country) using the DISTINCT ON by ordering the grid cells based on the patent_per_capita value.
SELECT DISTINCT ON (ctry_code)
ctry_code, patent_per_capita, patent_count, similar_max, geom
WHERE patent_count > 10 AND tot_p::int > 10 AND patent_count != highest_single
ORDER BY ctry_code, patent_per_capita DESC;
I tested this a few times and most interesting (but still sensible) results were achieved when I added the WHERE clause that left out grid cells that had a total number of less than 10 patents or a total population. After my first iterations I added a rule, where patent_count cannot be the same as highest_single. This meant that there has to be more than one patent location inside the cell. This was to avoid results being skewed by mistakes in the geocoding (i.e. a lot of points end up in the same “default” location).
Some SQL gurus out there might be shaking their heads with all these new tables and mumbling ‘I could’ve done that in one query’, but I did try a few different approaches and it seems that doing an UPDATE on +18 million row table isn’t nearly as straightforward as one might think. Any feedback is welcome!
Each query ran in 5 to 10 minutes on my Thinkpad with 16 GB RAM. Reasonable for this kind of big data (yes, I said it) analytics.
Results are in! Where are the innovation hotspots in different countries?
As you can also read from the scientific article, not all countries were included in the data in the first place. Next up there is a list of European countries that had their highest grid cell above a certain threshold (< 0.01 patents per capita) and I will go through them in an ascending order.
Highest patent per capita: 0.04
First on the list is Poland. The grid cell ends up quite close to the center of Wroclaw. Based on 30 second desk research there seems to be some university and a polytechnic around that area that might have some effect on this.
Highest patent per capita: 0.11
Next up Greece. The hotspot is around 10 kilometers from the center of Athens. Typical location for businesses I would guess.
Highest patent per capita: 0.16
Just like in Greece, in Hungary the “innovation hotspot” is located in the proximity of the capital. That area seems to be called Budaörs and there seems to be some companies located there.
Highest patent per capita: 0.17
My assumption is that Dublin is full of company headquarters, but maybe a bit surprisingly that is not the center of innovation. The square kilometer with the highest patent per capita is on the western coast outside of Galway.
Highest patent per capita: 0.24
Compared to the earlier results, Norway has their innovation hotspot quite close to the center of their capital, Oslo. Can’t explain the location here either any further, but I think the score is surprisingly low.
Highest patent per capita: 1.06
In Austria the “most innovative” square kilometer is in a somewhat unexpected location. Or so it seems at first. North from the city of Graz, in a place where there aren’t even any Google Street View photos available. But the place (town/suburb) is called Andritz. Google tells us that Andritz AG is an Austrian plant engineering group with headquarters in Graz and I have a strong feeling that these things just might be connected…
Highest patent per capita: 1.2
In Belgium the place with the highest score ended up outside the city of Gent, near a place called Wetteren. This time Wikipedia and Google didn’t give me any hint why this was the case.
Highest patent per capita: 1.32
Almost at the corner of three countries, the Swiss innovation hotspot ends up in Basel. The proximity of the neighboring countries is probably a fact that boosts innovations.
Highest patent per capita: 1.48
In Sweden the cell with the highest number ended up just outside of Göteborg. This is very close to the shore, so maybe this could have something to do with marine industry? Seems a bit strange to me.
Highest patent per capita: 2.55
Not surprisingly in Italy the most patented square kilometer per capita ends up in the northern part of the country, where most of the industry is also located. South-west from the city of Turin, When browsing the area with Google Street View, it seems like a typical outskirts of a bigger city with some industrial activities.
Highest patent per capita: 5.6
The Danish innovation hotspot lands south from the city of Randers. No idea what causes this, as Wikipedia says “The main economic and employment sectors today in Randers are, administration and service, crafts and industry, trade and transport; in that order.”
Highest patent per capita: 11.15
The Spanish innovation hotspot seems to be very close to Madrid. Again Google helped me to discover that around this area there are a few universities, which might be explaining some of the patents.
Highest patent per capita: 25.53
Not very surprisingly, in Great Britain the square kilometer with the most patents per capita lies just outside of Cambridge. One of the top universities in the world, combined with a vibrant entrepreneurial community, can be expected to produce patents.
Highest patent per capita: 25.53
Based on my initial research Germany was FULL of patents so it was interesting to see where the German patent hotspot would end up. Eventually it ended up south-west from Stuttgart, near Böblingen. According to Wikipedia, this region is “a center of both automobile and computer industries”.
Highest patent per capita: 25.66
Eindhoven, in the Netherlands, is one of the winners in this patent race. This particular piece of land has a whopping 25.7 patents per capita ratio. I was also enlightened on Twitter that Eindhoven is one of the cities in the world with the highest patent per capita ratio. The exact location is near a science park.
Highest patent per capita: 26.13
This one was easy for me to use to validate the data. In Finland the most innovative point ended up in a place with a lot of company headquarters (including Nokia and KONE) in Espoo, just outside of Helsinki
So we got a silver medal! Or maybe even gold if that next result is actually a mistake!
Highest patent per capita: 58.45
This was one of the weirdest results at least for me and at the same time it has the highest number in all of the explored countries. Patent hotspot ended up outside of the city of Quimper in Western France. Maybe someone on the Internet could explain this, because Wikipedia only enlightened me that “The town’s best-known product is Quimper faience, tin-glazed pottery.”
Might be a mistake in the data, but interesting results still!
EDIT: after publishing this article, I got some interesting feedback on Twitter regarding this. Might just be that there are some issues with the geocoding here… Check this thread for further details:
Disclaimers and conclusions
One thing that seems to be in common with most of these places: proximity of a major city and a motorway. Not very surprising, but still interesting to validate it this way. I think one square kilometer might be a bit small unit to examine, but that is easy to change for other purposes.
It is worth noting that I have chosen to use the location of inventors, as opposed to the location of the patenting firms. As such, these figures capture the location of human capital. Note also that very productive inventors, who produce multiple patents, will contribute to improving the score of their geographic location. The points do not represent the number of inventors but the (fractional) number of patents produced.
Finally, patents are an imperfect proxy of inventions: not all inventions are patented, and patents greatly differ in their technological quality and economic value. Besides, there are also many ways to count patents. The figures presented in this post are just one operationalization and other methods may very well lead to different innovation hotspots. But the data are available and I have shared most important scripts and my working methods, so feel free to produce your own estimates.
Please leave a comment or reach out on Twitter. If you would like to build something more serious from this with me, you can reach out via my website. I work with free and open source software for geospatial at a company called Gispo and do this kind of analytics (+ give PostGIS training courses) as my profession.
Huge thanks to Gaétan de Rassenfosse who helped out understanding the data and writing the blog!
Check out the following links for some further insights: