Developing a monitoring system for an EV charging network

--

Virta develops a cloud-based platform for managing electric vehicle (EV) charging networks. Our platform offers many things, such as mobile apps, payment systems and a CRM system, but at the core of everything are the charging stations that we manage.

In short, charging stations are IoT devices that communicate with a Charging Station Management System (CSMS) over the internet. In our case, the Virta platform is the CSMS.

It is quite common that charging stations lose connection to the CSMS for various reasons and go offline, or that they develop various faults. There might be a poor mobile network signal, a power failure at the local site, or some other technical failure. When stations go offline, we want to know where and why it happens.

We started building the Virta platform in 2014 as a small startup with a team of three people. The very first version of the platform had a simple interface for following the number of offline stations: when you opened the front page of the Admin Panel, you could see various statistics about the network's status at that moment.

However, we noticed quite quickly that we needed more information: the number of offline stations was just a snapshot from the moment you opened the Admin Panel. It didn't provide visibility into what happened five minutes ago, or last night, and in particular it didn't provide any information on what could be causing the problems.

We needed better visibility.

Our very first step was to start gathering history data of these snapshots, so that we could see what had happened during the last few hours. As the technology for this we chose the ELK stack, mainly Elasticsearch and Kibana. Our main platform saved the current offline status in a relational database and updated it constantly, but didn't store any history.

The first simple solution was to take a regular snapshot of the number of offline stations from the relational database and save that information with a timestamp to Elasticsearch. A simple cron job, run once a minute, did the work.
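A minimal sketch of what such a snapshot job could look like is below. The table and column names (stations, is_online), hosts and index name are made up for illustration; the idea is simply to read one count from the relational database and index it with a timestamp.

```python
# snapshot_offline.py — run once a minute from cron, e.g.:
# * * * * * /usr/bin/python3 /opt/monitoring/snapshot_offline.py
from datetime import datetime, timezone

import pymysql
from elasticsearch import Elasticsearch  # 8.x-style client assumed


def count_offline_stations() -> int:
    """Read the current number of offline stations from the relational DB."""
    conn = pymysql.connect(host="db.internal", user="monitor",
                           password="***", database="csms")
    try:
        with conn.cursor() as cur:
            # 'stations.is_online' is a hypothetical column holding the
            # current connection status maintained by the platform.
            cur.execute("SELECT COUNT(*) FROM stations WHERE is_online = 0")
            (offline,) = cur.fetchone()
            return offline
    finally:
        conn.close()


def main() -> None:
    es = Elasticsearch("http://elasticsearch.internal:9200")
    doc = {
        "time": datetime.now(timezone.utc).isoformat(),
        "offline": count_offline_stations(),
    }
    # One small document per minute keeps the index tiny.
    es.index(index="offline-stats", document=doc)


if __name__ == "__main__":
    main()
```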

We considered getting the status (online/offline) of each individual station every minute, but then realized that this would soon add up to a huge amount of data when you have tens of thousands of stations. So we ended up saving just the summary data, as described above.

As a result, using Kibana together with the new Elasticsearch data storage, we easily got nice trend graphs of the number of offline stations.
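Under the hood, a trend graph like that boils down to a date histogram aggregation over the snapshot index. As an illustration, roughly the same query can be run directly against Elasticsearch (assuming the hypothetical offline-stats index and field names from the sketch above, and the 8.x Python client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch.internal:9200")

# Average number of offline stations per 5-minute bucket over the last day —
# essentially what the Kibana trend graph visualizes.
resp = es.search(
    index="offline-stats",
    size=0,
    query={"range": {"time": {"gte": "now-1d"}}},
    aggs={
        "per_5m": {
            "date_histogram": {"field": "time", "fixed_interval": "5m"},
            "aggs": {"avg_offline": {"avg": {"field": "offline"}}},
        }
    },
)

for bucket in resp["aggregations"]["per_5m"]["buckets"]:
    print(bucket["key_as_string"], bucket["avg_offline"]["value"])
```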

We now had good visibility into the history and trends of how many charging stations were offline. Yet this didn't answer the question of why stations go offline. If we normally have, say, 10 offline stations and suddenly that number jumps to 50, the question is why. Is there a problem with our CSMS? With some VPN tunnel? With some telecom network? A power outage in some city? Or maybe maintenance work at some location? Why did the stations go offline?

As a solution to this we started adding different metadata and calculating statistics from different perspectives.

The first step was to start gathering statistics per protocol version. This gave a quick view of whether there were issues with the CSMS and how we handled a certain communication protocol; for example, it let us see quickly if a software update had caused problems with a particular protocol. So where we previously had data points like {"time": "2020-01-01T12:02:27Z", "offline": 27}, we added a bit more granularity: {"time": "2020-01-01T12:02:27Z", "offline_ocppj": 10, "offline_ocpps": 15, "offline_other": 2}

The next step was to get statistics per country. Again, this gave us useful new information: if stations went offline in a single country, that usually pointed to local telecom issues in that country; if all countries were affected, the problem was more likely in the CSMS. Technically this was done by calculating summaries with the data grouped by country: [{"time": "2020-01-01T12:02:27Z", "country": "FR", "offline": 12}, {"time": "2020-01-01T12:02:27Z", "country": "DE", "offline": 24}, …]
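The same pattern works for both the protocol and the country breakdowns: group the counts in the relational database and index one document per group. A rough sketch, again with hypothetical table, column and index names:

```python
from datetime import datetime, timezone

import pymysql
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


def snapshot_offline_by_country() -> None:
    conn = pymysql.connect(host="db.internal", user="monitor",
                           password="***", database="csms")
    es = Elasticsearch("http://elasticsearch.internal:9200")
    now = datetime.now(timezone.utc).isoformat()
    try:
        with conn.cursor() as cur:
            # 'country' is assumed to live on the station record.
            cur.execute(
                "SELECT country, COUNT(*) FROM stations "
                "WHERE is_online = 0 GROUP BY country"
            )
            actions = [
                {
                    "_index": "offline-stats-by-country",
                    "_source": {"time": now, "country": country, "offline": offline},
                }
                for country, offline in cur.fetchall()
            ]
    finally:
        conn.close()
    # Bulk-index one document per country per minute.
    bulk(es, actions)
```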

Since we knew that problems were often caused by telecom issues such as a weak signal, we also wanted to follow offline statistics grouped by telecom network. For this we started tracking the ICCID numbers the stations report when they connect to the CSMS; based on the ICCID we know which telecom network a station belongs to. From this we were able to see whether problems occur in a certain network: [{"time": "2020-01-01T12:02:27Z", "iccid_prefix": "890703", "offline": 9}, {"time": "2020-01-01T12:02:27Z", "iccid_prefix": "8943201", "offline": 17}, …]
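The mapping from ICCID to telecom network can be as simple as matching the longest known issuer prefix. A small illustration (the prefixes are the ones from the example above; the operator names are made up):

```python
# Hypothetical mapping from ICCID issuer prefixes to telecom operators;
# in practice such a table is maintained from the SIM providers' documentation.
ICCID_PREFIXES = {
    "890703": "Operator A (FI)",
    "8943201": "Operator B (DE)",
}


def operator_for_iccid(iccid: str) -> str:
    """Resolve the telecom operator from the ICCID a station reports."""
    # Try the longest prefixes first so more specific entries win.
    for prefix in sorted(ICCID_PREFIXES, key=len, reverse=True):
        if iccid.startswith(prefix):
            return ICCID_PREFIXES[prefix]
    return "unknown"


print(operator_for_iccid("8943201123456789012"))  # -> "Operator B (DE)"
```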

Example: a single telecom operator had a major outage in one country, which took several charging stations offline. From the monitoring we saw immediately that the problems were related to a single telecom operator, and we were then able to contact the operator to work on a fix.

We continued working on this, adding statistics grouped by more and more perspectives, and ended up with around 20 of the most common factors that explain why stations fall offline. And since we had good data in Elasticsearch, we were able to create dashboards that show immediately whether there are problems and, if so, what is causing them.

At this point we had good visibility into what was happening in the network, but we still faced the question of what a "normal" status is and when we should react to problems. Let's say you typically have on average 20 offline charging stations, and that number jumps to 27. Is that normal or not? Is it just some individual broken hardware, or are there bigger problems in the telecom networks or in the CSMS?

As an answer to the question "what is normal", we started utilizing machine learning. Since we now had quite a lot of history on how many stations are offline and why, we had good data for a machine-learning-based anomaly detection system.

With the data we had, we were able to teach the machine learning model fairly quickly what is considered normal or abnormal, and then alert automatically on abnormal events.
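The exact model is beyond the scope of this post, but the core idea can be illustrated with a much simpler baseline-and-deviation check on the per-minute offline counts: anything far outside the recent range gets flagged as abnormal. The sketch below is only an illustration of that idea, not our production model:

```python
import pandas as pd


def flag_anomalies(series: pd.Series, window: str = "7D",
                   z_threshold: float = 4.0) -> pd.Series:
    """Mark points that fall far outside the recent baseline.

    'series' holds the per-minute count of offline stations, indexed by timestamp.
    """
    # Baseline and spread are computed from the *previous* values only,
    # so a sudden spike does not inflate its own baseline.
    baseline = series.shift(1).rolling(window).mean()
    spread = series.shift(1).rolling(window).std()
    z = (series - baseline) / spread
    return z.abs() > z_threshold


# Toy data: a stable level of ~20 offline stations with one sudden spike.
counts = pd.Series(
    [20, 21, 19, 22, 20, 55, 21],
    index=pd.date_range("2020-01-01 12:00", periods=7, freq="1min"),
)
print(flag_anomalies(counts, window="5min", z_threshold=2.0))
```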

What did we learn?

With fairly little effort, we were able to build good visibility into the different problems and their causes, a sense of what is and isn't normal, and automatic alerts for the most serious issues.

As a technology stack, Kibana and Elasticsearch are perhaps not the perfect solution in the long run. A dedicated time series database, maybe combined with Grafana as the UI tool, would probably be a better choice as the amount of data grows. But so far the chosen technology has worked well enough, providing good visibility into the various interesting things happening in a big network of tens of thousands of charging stations in more than 30 countries.
