Evolution of an IoT architecture for EV charging

Virta is a company operating a digital electric vehicle charging platform in dozens of countries around the world. We started creating our technology from scratch in 2014 by managing a few charging stations more or less manually with simple technology. Over the next six years, we grew into one of the leading companies in the business and one of the fastest-growing companies in Europe. At the same time, our technology evolved from a simple system developed by one person into a state-of-the-art IoT platform with highly automated and scalable systems that manage tens of thousands of devices at a huge growth rate. This is the story of how our technology evolved during that journey.

About electric vehicle charging

Electric vehicle charging stations are sophisticated IoT devices. When you want to charge your electric vehicle (EV), you plug it into a charging station. Charging stations contain all kinds of internal logic to manage the flow of electricity to the vehicle. In addition, charging stations are connected over the internet to a charging station management system (CSMS). In our case, the CSMS is the Virta platform that our company develops and sells. The CSMS handles the complex logic of EV driver authentication, billing, fault management, and so on.

How it all started

In 2014 Virta was a small new organization. The company had three people: a CEO, a VP of customer relations and a CTO (that’s me). The company had been formed a few months earlier with the first investors, who were also our first customers. The first goal was to build a charging station network for these companies in Finland. We had a few months and a small budget to prove we could do this in order to continue with the business.

Due to the limited timeline and budget, our first technology choice was a small CRM and loyalty card system built by a small Finnish company. The system had nothing to do with EV charging, but it could manage RFID cards and customer information, both of which were needed for EV charging. So our choice was to take this existing CRM system as the basis of our project and add charging station management functionality on top of it.

We started working: a three-person team at our company, an existing small CRM system to be customized, one developer from the CRM company doing the customization, and four months to get the first live version into production.

Things go live, but not without issues

I think we did a pretty good job in the beginning. We started from zero with limited understanding and experience of EV charging technology. The charging station hardware was bought from various third-party companies and was more or less at the early beta stage; it had all kinds of problems all the time. We had minimal development resources and a tight schedule. Still, after four months, we had a live production system managing real charging stations, installed at real locations throughout Finland, with real people charging their EVs and paying with real money. Doing all that with small resources and in a few months was quite an achievement. And everything worked pretty well, at least most of the time.

The CRM system at the core of our technology was a stateful single-server setup. It ran on one physical server, which hosted both the stateful business logic and the database. A bit of a lousy architecture by today’s standards, but not that uncommon when it was developed.

The physical server itself worked well and was hosted by a professional hosting company, but the software crashed randomly quite often, for various reasons. Those crashes left our charging stations offline and our customers unable to charge their EVs, which was quite a bad situation.

From stateful to stateless

We figured out pretty soon that we could not continue with the architecture of the original stateful CRM system running on a single server. We needed to start moving our system to a cluster of stateless servers hosted in a cloud environment.

The transition from a stateful service to a stateless one sounds pretty straightforward, but it turned out to be quite difficult. Just locating all the stateful parts can be a long process: you need to consider logging, error handling, in-memory processing, caching, local files, and so on.
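
To give a flavor of the kind of change this meant in practice, here is a minimal sketch, assuming a shared store such as Redis; the class names, keys and host are made up for illustration and are not our actual code.

```python
import json
import redis

# Hypothetical shared store; the host and key names are illustrative.
cache = redis.Redis(host="cache.internal", port=6379)

# Before: state lived inside the process, so a session was tied to one server
# and was lost whenever that server crashed.
class InMemorySessions:
    def __init__(self):
        self._sessions = {}

    def save(self, station_id, data):
        self._sessions[station_id] = data

    def load(self, station_id):
        return self._sessions.get(station_id)

# After: state lives in an external store, so any stateless server behind the
# load balancer can pick up any request for any station.
class SharedSessions:
    def save(self, station_id, data):
        cache.set(f"session:{station_id}", json.dumps(data))

    def load(self, station_id):
        raw = cache.get(f"session:{station_id}")
        return json.loads(raw) if raw else None
```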

Moving all functionality from a single stateful physical server to a decentralized, stateless cloud environment required a lot of work. A process we expected to complete in one or two months ended up taking a year. Still, in the end, we had a great new stateless and scalable server architecture.

Scaling IoT device traffic

Charging stations communicate with CSMS systems using a protocol called OCPP. It has a few different versions: older versions are based on SOAP/XML messages, while newer versions use WebSockets and JSON.

The basic principle of the communication is simple: stations send requests to the CSMS, and the CSMS sends replies back to the stations.
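
As an illustration, a single exchange in the JSON-based protocol versions (OCPP-J) looks roughly like this. The frame format and field names come from OCPP 1.6, but the identifiers and values here are made up:

```python
import json

# Station -> CSMS: a CALL frame [2, message_id, action, payload]
request = json.dumps([
    2, "19223201", "BootNotification",
    {"chargePointVendor": "ExampleVendor", "chargePointModel": "ExampleModel"},
])

# CSMS -> station: a CALLRESULT frame [3, message_id, payload]
reply = json.dumps([
    3, "19223201",
    {"status": "Accepted", "currentTime": "2020-01-01T12:00:00Z", "interval": 300},
])
```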

Our first architecture was still based on the old customized CRM system, created with minimal development resources in four months. The system was pretty simple: a service listened for OCPP messages coming from charging stations, and when a message came in, the service processed it and sent a reply back. With the renewed architecture, there was a load balancer and several stateless servers to handle the load. All good and scalable, until you get high peak loads.
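
A simplified sketch of that original synchronous flow might look like the following; the function names are illustrative, not our actual code. The point is that parsing, business logic and the reply all happen inside one blocking request:

```python
import json

def process_action(action, payload):
    # Hypothetical business logic: authentication, billing, status bookkeeping, ...
    return {"status": "Accepted"}

def handle_ocpp_message(raw_frame):
    """Called by the web server for every frame a station sends."""
    message_type, message_id, action, payload = json.loads(raw_frame)
    result = process_action(action, payload)    # heavy work done inline
    return json.dumps([3, message_id, result])  # reply sent immediately
```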

Big traffic spikes create issues

Most charging stations have 2G/3G/4G modems and SIM cards for internet connectivity. It is not uncommon for mobile networks to have short breaks: if a telecom network has issues, charging stations can go offline for a minute and then reconnect. When that happens, you suddenly get a massive amount of traffic from all the devices in the same network at once. Even with load balancers and autoscaling, the servers simply cannot handle a spike that big with synchronous request-response processing.

As a result, we started seeing a fluctuation pattern: a massive number of devices connecting at once, some servers overloading and crashing, devices going offline again because of the crashes, then reconnecting to newly auto-scaled servers, those crashing again, and so on. This could go on for a few hours before the traffic was balanced nicely between the servers behind the load balancers.

Queuing system

Since our synchronous request-response processing could not scale to huge spikes of traffic, the next step was to implement queues. Instead of processing incoming messages from charging stations immediately, we added them to a queue. A separate queue processor then took messages from the queue at a controlled rate and processed them. This moved the big spike loads away from the actual message processing and into a queue, which kept the server infrastructure stable.
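
Here is a minimal sketch of the queued flow, using Python’s standard library queue purely for illustration (a real deployment would use a proper message broker), with a stub standing in for the business logic:

```python
import json
import queue
import threading
import time

incoming = queue.Queue()

def process_action(action, payload):
    # Same hypothetical business logic as in the earlier sketch.
    return {"status": "Accepted"}

def handle_ocpp_message(raw_frame):
    """Web-facing side: only enqueue, so spikes never hit the business logic."""
    incoming.put(raw_frame)

def queue_worker(messages_per_second=100):
    """Separate consumer that drains the queue at a controlled rate."""
    while True:
        raw_frame = incoming.get()
        message_type, message_id, action, payload = json.loads(raw_frame)
        process_action(action, payload)      # heavy work happens here
        time.sleep(1 / messages_per_second)  # crude rate limiting

threading.Thread(target=queue_worker, daemon=True).start()
```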

This worked pretty well for a while until some stations started filling the queues with invalid messages.

Queues with different priorities

Message queues usually work on the First In, First Out (FIFO) principle: the message that arrives first is processed first. However, messages have different priorities: processing an informative status message from a charging station is not as urgent as processing a message that lets a customer start or stop charging at a station.

In our case, it happened every now and then that a charging station malfunctioned and started spamming us with messages. Suddenly the message queue would fill up with tens of thousands of messages from a single station that was sending us an error message a hundred times a second. Even with normally functioning stations, if the telecom network briefly went down and came back up, you could get thousands of messages at the same time from stations saying “Hi, I’m online again”. If some poor customer tried to start charging at such a moment, their start message was stuck behind all those informative messages.

So as a solution, we implemented different queues for different types of messages to add robustness: critical messages such as charge start and stop are processed in a different place, and with a different priority, than informative messages such as heartbeats and status updates.
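
A minimal sketch of the routing idea, again with standard library queues purely for illustration; the action names are real OCPP 1.6 actions, but the exact split between queues here is simplified:

```python
import queue

critical = queue.Queue()     # consumed by dedicated, high-priority workers
informative = queue.Queue()  # heartbeats, status notifications, etc.

# Actions that directly affect a customer's charging session.
CRITICAL_ACTIONS = {"Authorize", "StartTransaction", "StopTransaction"}

def route_message(action, raw_frame):
    """A spamming station can fill the informative queue without blocking starts and stops."""
    if action in CRITICAL_ACTIONS:
        critical.put(raw_frame)
    else:
        informative.put(raw_frame)
```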

Development continues

After all this, we had a nice, scalable and reliable system to handle a huge number of messages from different charging stations. We should mention that there was still a lot of other work we did to secure reliable operations; everything described here just scratches the surface of our technology. But I hope it gives you some picture of where we started, how our technology evolved and what we found to be a good solution for our IoT architecture.

What we learned

As a result of our journey, we now have a good, scalable and reliable architecture to manage a huge number of charging stations and the messages they send. Furthermore, when the number of connected charging stations grows, which is happening at an exponential rate, we can simply buy more cloud capacity to meet the growth.

So what did we learn? What should you consider when implementing the same kind of systems?

1. You don’t need to implement a perfect system from day one. If we had started out building a perfect system capable of handling millions of devices, we wouldn’t be here today. We would have run out of money long before we had anything ready to sell. Often it is better to deploy something soon so that you have something to sell, even if it only covers the next 12 months of your growth. If you can stay in business, you will have time to improve your technology later. But if you can’t deliver technology that can be sold, you will run out of money.

2. Plan and test for what happens when things don’t work as expected: what if all your IoT devices go offline at the same time? What if they all come back online at the same time? What if one device starts spamming you with millions of messages? What if somebody attacks your systems and starts sending invalid messages on purpose? What if some component in your setup crashes, and does it take the whole system down with it? There are a lot of what-ifs to consider when building a robust system.

3. Test with large amounts of data: sooner or later, your systems will probably hit a bottleneck, either in processing messages from different devices or in the amount of data in the databases. Even if you can create message queues and scale up the number of stateless services processing them, how do your databases behave when the tables hold a million times more rows than before?

4. Plan for technical debt: if you need to compromise today, as we did by starting with the bad architecture of a non-scalable stateful server, create a concrete plan for when to improve it. Make a roadmap with dates: if we deploy a non-optimal technical solution today, write into your roadmap that 12 months from now we need a project to fix that technology, that the project will take four months, that it will require a certain amount of money, and that during this time there will be no capacity to develop other things, such as new features.
