Building the ultimate map for e-cars
How we use data science to build a complete map with charging stations
Authors: Azada Henze, Isabell Konrad
1. Why do we need a complete map of charging stations
In 2010, the German government set a target of increasing the number of e-cars to 1 million by 2020 and making Germany a pacesetter in the field of e-mobility. It has not met that target. In fact, Germany is far behind: as of December 2018, it ranked only as the 6th-largest plug-in market in the world and the 4th-largest in Europe. The numbers might change soon, now that Elon Musk is planning to build a gigafactory in Berlin that is meant to deliver 100 000 e-cars a year.
Even though e-cars offer a quiet, clean, and more sustainable alternative to petrol- or diesel-based cars, many German drivers decide against them. One likely reason is that the charging infrastructure does not yet allow a trouble-free experience of driving an e-car. Realising the pitfalls of the previous policies, Angela Merkel made a new commitment on November 2, 2019: “For this purpose, we want to create a million charging points by the year 2030, and the industry will have to participate in this effort.” Fingers crossed.
Currently, stations are concentrated in densely populated cities or along highways. However, with an increasing number of e-cars, many providers of charging stations are starting to target rural areas, workplaces, shopping malls, and housing areas. With charging stations being built every day, e-car drivers need an up-to-date navigation map to find the best route to their destinations based on their preferences. This is exactly where we step in with an ultimate map for charging stations. The map serves two main goals: first, it contains every registered charging station in Germany, and second, it offers thorough usage information based on several openly available data sources. We compared information from three primary data sources, Open Street Map (OSM), Open Charge Map (OCM), and Bundesnetzagentur (BNA), and implemented algorithms that select only the most reliable information for every charging station in Germany. In this article, we will take you through the steps of working with geodata, unifying features from multiple sources in one dataset, and, finally, creating a map in Mapbox Studio.
2. Our data
As usual, before you can do any cool data science project, you need to collect data. One of the most amazing things about being a data scientist these days is that more and more data is becoming public, and anyone can use it for analysis. All of our data sources are openly available, and we will show you how to access them.
The easiest way to request OSM data is via https://overpass-turbo.eu/, where you can write OSM queries and download the data in GeoJSON format. For charging stations in Germany use the following query:
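A query along these lines could look as follows. This is a minimal sketch in Python that only builds the Overpass QL text, which you can paste into overpass-turbo. It assumes that charging stations are tagged `amenity=charging_station` and that Germany is selected through its `ISO3166-1` country code; the function name and parameters are hypothetical, and the query is not necessarily the exact one we used.

```python
def build_overpass_query(country_iso: str = "DE", timeout_s: int = 180) -> str:
    """Return Overpass QL text selecting all charging-station nodes in a country.

    Assumptions: the country is found via its ISO3166-1 code at admin_level 2,
    and stations carry the tag amenity=charging_station.
    """
    return (
        f"[out:json][timeout:{timeout_s}];\n"
        f'area["ISO3166-1"="{country_iso}"][admin_level=2]->.searchArea;\n'
        'node["amenity"="charging_station"](area.searchArea);\n'
        "out body;"
    )


if __name__ == "__main__":
    # Print the query so it can be pasted into the overpass-turbo editor.
    print(build_overpass_query())
```

In overpass-turbo, the result of such a query can then be exported as GeoJSON through the export dialog.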
You can download data from OCM with the following link:
and from BNA by downloading the file named Liste der Ladesäulen (Stand 16. Oktober 2019) (xlsx / 1 MB) from the official page of the Bundesnetzagentur:
Once we have our data, we can see which features our data sources have and how they are defined. As expected, each of the sources has different features, and the same features are named differently.
Identifying and processing the differences between the data sources was challenging, but it is the whole idea of our map: to plot charging stations from different sources and analyse the information provided by each of them. The table above shows an overview of the features defined by our sources. As soon as we had identified the features that the sources share, we extracted them and stored them in a dictionary, with the feature names as keys and the information provided as values. We also added a new key ‘data source’ with one of the values ‘OCM’, ‘OSM’, or ‘BNA’ to document the original data source. We will illustrate this with the example of OCM:
Each data source is processed the same way, but based on its individual data structure: for every charging station, we create a node, extract the existing features, set non-existing features to ‘Unknown’, and add the node to a common list with all nodes. The list of nodes is then stored in a GeoJSON file, a unified format for encoding geographic data. In GeoJSON, a collection of data points like ours is a FeatureCollection in which every node is a feature with its longitude and latitude defined under the key ‘geometry’. The geometry is what maps use as the geographic location when they place data points on the map.
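The extraction for one OCM entry can be sketched as follows. This is a minimal illustration rather than our full code: it assumes the JSON field names of the OCM export (`AddressInfo`, `OperatorInfo`, `UsageType`, `Connections`, `NumberOfPoints`), and the helper names `or_unknown` and `ocm_to_node`, the unified key names, and the choice of the highest amperage and voltage per station are assumptions made for this example.

```python
UNKNOWN = "Unknown"


def or_unknown(value):
    """Return the value, or 'Unknown' when it is missing or empty."""
    return value if value not in (None, "", []) else UNKNOWN


def ocm_to_node(entry: dict) -> dict:
    """Map one OCM entry onto the unified feature dictionary."""
    # Nested objects may be missing or null in the export, so fall back to {}.
    address = entry.get("AddressInfo") or {}
    operator = entry.get("OperatorInfo") or {}
    usage = entry.get("UsageType") or {}
    connections = entry.get("Connections") or []

    sockets = [(c.get("ConnectionType") or {}).get("Title") for c in connections]
    amps = [c.get("Amps") for c in connections if c.get("Amps")]
    volts = [c.get("Voltage") for c in connections if c.get("Voltage")]

    return {
        "data source": "OCM",  # documents where the node came from
        "latitude": or_unknown(address.get("Latitude")),
        "longitude": or_unknown(address.get("Longitude")),
        "address": or_unknown(address.get("AddressLine1")),
        "operator": or_unknown(operator.get("Title")),
        "payment": or_unknown(usage.get("Title")),
        "socket types": or_unknown([s for s in sockets if s]),
        # Assumption: keep the highest value over all connections.
        "amperage": or_unknown(max(amps, default=None)),
        "voltage": or_unknown(max(volts, default=None)),
        "capacity": or_unknown(entry.get("NumberOfPoints")),
    }
```

Calling `ocm_to_node` for every entry of the downloaded OCM file and appending the results to a list gives the OCM part of the common node list.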
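A minimal sketch of this conversion, using only the Python standard library. The node keys `latitude` and `longitude` and the function names are assumptions for this example; note that GeoJSON stores coordinates in the order longitude, latitude.

```python
import json


def node_to_feature(node: dict) -> dict:
    """Turn one unified node into a GeoJSON Feature with a Point geometry."""
    properties = {k: v for k, v in node.items() if k not in ("latitude", "longitude")}
    return {
        "type": "Feature",
        # GeoJSON positions are [longitude, latitude], not [latitude, longitude].
        "geometry": {
            "type": "Point",
            "coordinates": [node["longitude"], node["latitude"]],
        },
        "properties": properties,
    }


def nodes_to_feature_collection(nodes: list) -> dict:
    """Wrap all nodes in a single GeoJSON FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [node_to_feature(node) for node in nodes],
    }


def save_geojson(nodes: list, path: str) -> None:
    """Write the FeatureCollection to a file that Mapbox Studio can ingest."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(nodes_to_feature_collection(nodes), handle, ensure_ascii=False)
```

All attributes other than the coordinates end up under `properties`, which is where map tools look for the values used in labels and styling.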
3. Building a map with Mapbox Studio
At this point, we have a unified dataset with 29,358 charging stations derived from three different data sources. The unified dataset contains duplicates, because the same charging station can occur in more than one data source. Now, we can go ahead and plot them on a map.
Luckily, Mapbox Studio offers a variety of tools that make building complex, beautiful maps easy and fun. Creating a map on Mapbox Studio unfolds around two main steps: adding your GeoJSON data as a Tileset, and customising the map settings in Styles. A tileset is a set of vector tiles that are used to represent geographic data in a browser. You can create one by just uploading your GeoJSON file to Mapbox, which will then convert it into vector tiles.
The completed tileset is now ready for visualisation, which leads us to the creative step of defining the style of our map. Under Styles, you can choose any map template you want or define one from scratch. For now, we opted for a basic style with some additional colour settings for the background layer.
Our resulting map doesn’t have the charging stations yet. This is where the tileset with the data points comes in handy. One of the key features of Mapbox Studio is that visual representation of the geographical data is defined in layers. For instance, geographic objects like countries, cities, roads, or charging stations are represented as separate layers. You can then define or change the visual style of your objects in that layer. In our case, one layer represents data points from one source as charging points, and another layer adds text labels about the features of the charging stations. The layer with the charging points is then configured to different sizes and colours.
To configure the data sources individually, we added each of them as a layer with its own colour. In addition, we added the features of the charging stations as a separate layer with text labels and configured how much text should be visible at different zoom levels. To use this feature, just add zoom stops, and at each stop, set the string field to the features that should be displayed.
After configuring the features of the map, voilà, we ended up with a map that shows all of the charging stations in Germany with a unified collection of openly available information.
4. Creating a dataset with complete relevant attributes
Features of the charging stations
Usually, the main information provided by the charging station entries is their location (latitude, longitude). This makes it easy to display a charging station on a map or embed it into the routing algorithm of a navigation system for electric cars.
In addition, the entries state the following attributes. The address gives the street name and house number. Amperage gives the current strength, and voltage the electric potential. The operator is the electric company providing the current, the local owner of the charging station, or the charging station infrastructure provider. The attribute payment tells you whether a charging station is free, whether a membership is required, or which payment types are accepted, and authentication tells you whether a membership is required. Socket type lists the sockets available at the charging station, such as Type 2, Combo CCS, or Schuko, and how many of each are mounted. Finally, capacity gives the number of cars that can be charged simultaneously. If an attribute is not available, it is set to Unknown.
Depending on the data source, the information about the features of the charging stations is more or less complete and reliable. The data source BNA is the most reliable, since it is maintained by the Bundesnetzagentur, a federal agency under the German ministry of economics, but it does not provide the attributes payment, authentication, and voltage. The data source OCM is the second-best set. It collects its information from users and the community, who enter new charging stations into a form on its webpage. The OCM team then goes over the entries and makes necessary improvements. All attributes are stated in this dataset. The third dataset, from OSM, is the least reliable. Anybody can enter new charging stations, but giving any information about the attributes is optional. The only feature that every charging station has is its location.
Merging the duplicates
With these three sources at hand, the first step is to identify identical charging stations from different sources. The main indicator is the distance between charging stations. Starting with one charging station, we find all charging stations within a distance of 100 m and take a closer look at their other attributes. To find the nearby charging stations, we first apply a K-means algorithm to divide Germany into clusters. Then, we calculate the distance from one charging station to all others in the same cluster and pick the nearby ones. This runs much faster than comparing every station with every other station without the preceding clustering. In this code extract, we demonstrate how we find the nearby charging stations and save the ids of the charging stations in a list.
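A minimal sketch of the distance check within clusters. It assumes that the cluster label of every station has already been computed (for example by K-means on the coordinates), uses the haversine formula for the distance, and works on stations given as dictionaries with the hypothetical keys `id`, `lat`, and `lon`; our actual distance function and data layout may differ.

```python
import math
from collections import defaultdict
from itertools import combinations

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_nearby_pairs(stations, labels, max_dist_m=100.0):
    """Return id pairs of stations in the same cluster that lie within max_dist_m."""
    by_cluster = defaultdict(list)
    for station, label in zip(stations, labels):
        by_cluster[label].append(station)

    duplicates = []
    # Only stations that share a cluster are compared, which avoids
    # computing the distance for every pair in the whole dataset.
    for members in by_cluster.values():
        for a, b in combinations(members, 2):
            if haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_dist_m:
                duplicates.append((a["id"], b["id"]))
    return duplicates
```

One consequence of this design is that two stations lying close to each other but on opposite sides of a cluster boundary are not compared, so the speed-up comes at the cost of possibly missing a few pairs.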
Once we have the possible duplicates, we build groups of charging stations that lie close to each other. A group can contain more than two stations, since a station can lie close to several charging stations and hence appear several times in the list of duplicates.
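Building the groups from the list of duplicate pairs can be sketched with a small union-find structure. This is one possible way to do it and not necessarily how our code does it; the function name and the pair format are assumptions for this example.

```python
def group_duplicates(pairs):
    """Merge id pairs into groups of stations connected by at least one pair."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees flat
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for station_id in list(parent):
        groups.setdefault(find(station_id), set()).add(station_id)
    return list(groups.values())
```

Because the grouping is transitive, a station that is close to two others pulls all three into one group, even if the two outer stations are more than 100 m apart. This is why the groups are examined further before merging.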
Within these groups, we now examine the addresses and the operators, but only if the charging station comes from the data source BNA or OCM, because the source OSM often does not provide these attributes.
If only the address or only the operator differs, we assume that the deviation is due to different notations or perceptions, because it is unlikely that there are two distinct nearby charging stations from the same operator or at the same address. In that case, we treat the charging stations in the group as identical. If both the addresses and the operators of the charging stations within a group differ, the group is divided again.
We check the difference between the operators and addresses with the SequenceMatcher class from Python's difflib module, which calculates the similarity between two strings using the Ratcliff/Obershelp pattern recognition algorithm. Matching the operators and addresses of the stations was one of the more demanding tasks, so we will go over it in the next section.
String matching for operators
As mentioned earlier, after identifying the charging stations that lie close to each other, we have to check whether they have the same operator names. Here we ran into some difficulties. We have points with exactly the same operator name, like ‘drewag stadtwerke dresden’ in OSM and ‘drewag stadtwerke dresden’ in BNA. But we mainly have cases where the names of the operators are similar and just written differently. For example, ‘ewe vertrieb gmbh’ can be labelled as ‘swb/ewe’ or ‘swb vertrieb bremen gmbh’, and just as ‘ewe’ in OSM. Moreover, quite often a full operator name is shortened to an abbreviation; for instance, ‘ele’ in OSM refers to ‘emscher lippe energie’. Luckily, these are common problems in natural language processing and can be solved with string matching algorithms. We opted for the SequenceMatcher class from Python's difflib module, which takes two sequences of any type and finds the differences between them. Its ratio is computed as 2·M/T, where M is the number of matched characters and T is the total number of characters in both strings. At the heart of the class is the Gestalt Pattern Matching algorithm, which looks for the longest common substring in the two strings and then repeats the search recursively on the pieces to the left and to the right of that match. The resulting ratio is a similarity score between 0 and 1 that tells us how similar two strings are. We assume that operator names with a score higher than 0.42 refer to the same operator and save them in a text file.
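A minimal sketch of this comparison with `difflib.SequenceMatcher`. The threshold of 0.42 is the one stated above; the function names and the lower-casing of the inputs are assumptions for this example.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.42  # scores above this count as the same operator


def operator_similarity(name_a: str, name_b: str) -> float:
    """Return the SequenceMatcher ratio (0.0 to 1.0) of two operator names."""
    a = name_a.lower().strip()
    b = name_b.lower().strip()
    # The first argument is the optional junk filter; None disables it.
    return SequenceMatcher(None, a, b).ratio()


def same_operator(name_a: str, name_b: str, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Decide whether two operator names are treated as the same operator."""
    return operator_similarity(name_a, name_b) > threshold


if __name__ == "__main__":
    print(operator_similarity("drewag stadtwerke dresden", "drewag stadtwerke dresden"))
    print(same_operator("ewe vertrieb gmbh", "swb vertrieb bremen gmbh"))
```

A threshold this low accepts many loosely related names, which is acceptable here only because the comparison is restricted to stations that already lie within 100 m of each other.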
To identify abbreviations, we run the same matching algorithm, but instead of the full operator name we compare only the first letters of the words in the name. So, for example, to find out what ‘ele’ refers to, we compare it to the first letters of ‘emscher lippe energie’, which are ‘ele’, and get the desired similarity of 100%.
Let’s have another look at our data after comparing the operators and addresses of the stations. The good news: 4 489 charging stations have identical operator names across the sources and can be merged into one station. A further 2 485 stations have similar operator names, which implies that they have the same operator written differently, so we merge them as well. Stations with different operator names (6 149) or with the operator value Unknown (4 186) are regarded as probable false positives and should not be merged. As an additional check, we compare the addresses of these charging stations, and if the addresses differ as well, the stations are not merged but treated as separate charging stations.
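The abbreviation check can be sketched as follows; the helper names are assumptions for this example, and splitting the name on whitespace is a simplification.

```python
from difflib import SequenceMatcher


def initials(name: str) -> str:
    """Return the first letter of every whitespace-separated word, lower-cased."""
    return "".join(word[0] for word in name.lower().split())


def abbreviation_score(abbreviation: str, full_name: str) -> float:
    """Compare an abbreviation with the initials of a full operator name."""
    return SequenceMatcher(None, abbreviation.lower(), initials(full_name)).ratio()


if __name__ == "__main__":
    # 'ele' against the initials of 'emscher lippe energie'
    print(abbreviation_score("ele", "emscher lippe energie"))
```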
Merging infrastructure provider with current producer
In the dataset OCM, the operator field sometimes contains the infrastructure provider instead of the current producer. By infrastructure provider we mean an organisation that pools charging stations or current producers so that you can use one membership card for paying at a variety of locations, e.g. “ladenetz.de”, “Schwabencard”, and so on.
The dataset BNA mainly provides the actual current producer as operator.
This discrepancy, a pain when matching identical charging stations, can be used as a source of information for mapping infrastructure providers to current producers.
We simply match the operator entries in OCM with the ones in BNA. To filter out occasional errors, we only add a match if it appears more than two times. In the end, we have a dictionary with a list of current producers for each infrastructure provider. Here is an extract:
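The counting step behind this dictionary can be sketched as follows. It is a minimal illustration that assumes the matched stations are available as (OCM operator, BNA operator) pairs; the function name and the placeholder operator names in the usage example are hypothetical and not taken from our data.

```python
from collections import Counter, defaultdict


def map_providers_to_producers(matched_pairs, min_occurrences=3):
    """Build {infrastructure provider: [current producers]} from matched stations.

    matched_pairs holds one (OCM operator, BNA operator) tuple per merged
    station. A pairing is kept only if it occurs at least min_occurrences
    times, i.e. more than two times with the default value.
    """
    counts = Counter(matched_pairs)
    mapping = defaultdict(list)
    for (provider, producer), occurrences in counts.items():
        if occurrences >= min_occurrences:
            mapping[provider].append(producer)
    return dict(mapping)


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    pairs = [("provider x", "producer a")] * 3 + [("provider x", "producer b")] * 2
    print(map_providers_to_producers(pairs))
```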
Extract best information
From the charging stations within a group, we pick the best values for amperage, socket types, and capacity. First, we bring the different notations of the data sources into a uniform representation. For amperage and capacity, the data source then determines which value we continue with: preferably we use the value from BNA, next from OCM, and, if neither is available, from OSM. Additionally, we round the amperage values to the ones usually provided in the German electric grid, which are 8, 16, 32, 63, 125, or 200 A. To extract the socket types, we compare the sockets of the different charging stations within the same group. If a station lists an additional socket type that the others in the group do not provide, we add it to the socket types.
We proceed like this because we assume that nobody states a socket type that is not there. But, especially in the data source OSM, people might forget to state socket types that are not important to them. After collecting all this information, we keep just one charging station per group with the extracted information and store it in the GeoJSON file UnifiedDataNoDuplicates.geojson.
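The selection rules above can be sketched as follows. This is a minimal illustration, assuming that every station in a group is a dictionary with the keys `data source`, `amperage`, `capacity`, and `socket types` (a list of strings or Unknown), and that amperage values are already numeric; the function names are hypothetical.

```python
STANDARD_AMPERAGES = (8, 16, 32, 63, 125, 200)  # values named in the text, in A
SOURCE_PRIORITY = ("BNA", "OCM", "OSM")  # most reliable source first
UNKNOWN = "Unknown"


def round_amperage(value):
    """Round to the nearest standard amperage (ties go to the lower value)."""
    return min(STANDARD_AMPERAGES, key=lambda standard: abs(standard - value))


def pick_by_priority(group, key):
    """Return the first known value for key, walking the sources by priority."""
    for source in SOURCE_PRIORITY:
        for station in group:
            if station["data source"] == source and station.get(key, UNKNOWN) != UNKNOWN:
                return station[key]
    return UNKNOWN


def merge_sockets(group):
    """Union of all socket types stated by any station in the group."""
    sockets = set()
    for station in group:
        stated = station.get("socket types", UNKNOWN)
        if stated != UNKNOWN:
            sockets.update(stated)
    return sorted(sockets) if sockets else UNKNOWN


def merge_group(group):
    """Collapse a group of duplicate stations into one set of best values."""
    amperage = pick_by_priority(group, "amperage")
    return {
        "amperage": round_amperage(amperage) if amperage != UNKNOWN else UNKNOWN,
        "capacity": pick_by_priority(group, "capacity"),
        "socket types": merge_sockets(group),
    }
```

Taking the union of the socket types reflects the assumption that a stated socket exists, while the priority order for amperage and capacity reflects how much we trust each source.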
5. The final map without duplicates
Finally, the tedious pre-processing step is completed, and we can get back to being creative with Mapbox Studio. Building the final map follows the same steps as building the map with duplicates, with some minor adjustments. Because we now have a dataset with unique charging stations, we do not need to style them according to their data sources. Besides, if you zoom in, you can see thorough information about the usage and the available socket types.
As planned, we have built a map with unique charging stations and detailed usage information about the features of the station hand-picked from multiple data sources.
Our resulting map is a viable contribution to building an ultimate map that contains not only all currently existing charging stations in Germany but also essential, thorough, and reliable usage information for e-car drivers. This map is an initial step towards building complex navigation systems explicitly tailored to the charging demands of e-cars. We hope that our analysis can support research into building an optimal infrastructure that addresses the current challenges of e-mobility and, thus, promotes more sustainable transport.
This blogpost is published by Comsysto Reply GmbH