Creating a Machine Learning Data Pipeline in AWS Lambda

Alec Hewitt
8 min read · Nov 24, 2017


At Solar Analytics our core business is analysing energy data from houses with solar panels. Using a device connected over 3G, we collect highly accurate data on a home’s energy consumption and solar production, which allows us to analyse how well its solar panels are performing. Recently, we developed a product for an Australian energy utility to monitor the houses it had with solar panels. However, the utility was unable to install the 3G device in those homes, so we were only able to get the net export and import energy data that the company used to bill customers.

Compared to circuit-level, high-fidelity data, using low-fidelity net data to identify whether a home’s solar panels are working is a difficult problem. The key challenge is that you don’t know how much energy the house is consuming. The exported energy could be low because the solar panels are not working, or because a lot of energy is being consumed within the house. In order to provide the analysis we spent time creating a machine learning algorithm that was trained on our rich data set of existing houses. The exact process we went through to implement the machine learning algorithm is a topic of its own, but after a number of weeks developing the algorithm we were confident we could predict solar panel performance based solely on the net metered data. We then needed to implement a data pipeline that could process the data, make predictions and send a status back to the energy utility.

The project:

Implement a data pipeline that ingests and processes energy data for each house every day, creates a prediction of whether the house’s solar panels are performing well or not, and then sends that prediction back to the energy retailer.

Iteration 1

We were receiving the data on a Microsoft Azure Queue that was managed by the energy utility. This suited us well — it meant that we were not at the mercy of any spikes in traffic, and the data ingestion could be smoothed out over the day. Having said this, it was still vital that the application could easily scale horizontally. The energy utility planned on starting off with a small sample of customers and then rapidly increasing the numbers over the coming months, so we needed to be able to deal with this steep ramp up of incoming data, without knowing the exact timeline for it. To accommodate this elasticity we decided to utilise AWS Lambda as the backbone of our application.

Lambda functions are event-based functions that trigger on a selection of AWS events. Our first problem was that we wanted to scale off a Microsoft Azure Queue, which obviously does not have a Lambda trigger. We tackled this by creating a scheduled CloudWatch event that triggered a Lambda function once a minute, whose sole purpose was to poll the Azure queue for the number of messages and then send that many notifications to an SNS topic. Our data ingestion would then be scaled off this SNS topic rather than the Azure queue itself.
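Conceptually, the polling function is just a small piece of glue. Here is a minimal sketch, assuming the current azure-storage-queue and boto3 SDKs and some hypothetical environment variables for configuration (our 2017 implementation used an older Azure SDK, so the exact calls differed):

```python
import os
import boto3
from azure.storage.queue import QueueClient  # assumes azure-storage-queue v12

# Hypothetical configuration, supplied as Lambda environment variables
AZURE_CONN_STR = os.environ["AZURE_QUEUE_CONNECTION_STRING"]
QUEUE_NAME = os.environ["AZURE_QUEUE_NAME"]
SNS_TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]

sns = boto3.client("sns")

def handler(event, context):
    """Triggered once a minute by a scheduled CloudWatch event."""
    queue = QueueClient.from_connection_string(AZURE_CONN_STR, QUEUE_NAME)
    # Azure only exposes an approximate message count, which is good
    # enough to decide how many ingestion invocations to fan out.
    count = queue.get_queue_properties().approximate_message_count
    for _ in range(count):
        # Each notification fans out one data persist invocation, which
        # pops a single message from the Azure queue.
        sns.publish(TopicArn=SNS_TOPIC_ARN, Message="process-one-message")
    return {"messages_notified": count}
```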

Our application was broken into three stages which mapped to three Lambda functions.

Iteration 1 architecture.

Function 1 — Poll Queue:

A singleton Lambda function that polls the Azure queue once a minute, determines the number of messages in the queue and then sends that many notifications to an SNS topic.

Function 2 — Data Persist:

Collate the incoming data and persist it to our DB. Trigger a prediction if there is enough data to do so.

Function 3 — Predict:

Predict the performance of solar panels, then send the performance status back to the energy utility company.

Iteration 2

After deploying the first iteration it became apparent we were facing issues with concurrently triggering multiple predictions for the same periods. Each data packet on the queue contained one day’s worth of data, and in order to calculate a prediction we had to have 25 days’ worth of data within a 30-day period. To trigger an invocation of the predict function, the data persist function would check for all the periods that contained the day of data it had just persisted and did not already have a prediction. However, if two data persist functions had just persisted days that pertained to the same 30-day period and that prediction had yet to run, they would both trigger a prediction for the same period at almost the same time.

To overcome this issue we implemented a locking system with Redis ElastiCache. When the data persist function deemed that a house was valid for a prediction it would try to obtain a lock. The key for the lock contained the house ID and the dates the prediction was for. If it did not succeed in obtaining the lock, another data persist function had already triggered a prediction that was yet to run, so there was no need to trigger another one. The lock was released by the predict function, which deleted the key in Redis upon successfully creating and persisting a prediction for the house and time period.
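As a rough illustration, the lock boils down to a single atomic SET NX call in Redis. The key format, connection details and the TTL safety net below are assumptions for the sketch rather than our exact implementation:

```python
import redis

# Hypothetical connection details; we used an ElastiCache endpoint.
r = redis.StrictRedis(host="my-elasticache-endpoint", port=6379)

def try_acquire_prediction_lock(house_id, period_start, period_end):
    """Attempt to claim the right to trigger a prediction for this house
    and 30-day period. Returns True only for the first caller."""
    key = f"prediction-lock:{house_id}:{period_start}:{period_end}"
    # SET with nx=True only succeeds if the key does not already exist,
    # so only one data persist invocation wins. The TTL is a safety net
    # (an assumption here) in case the predict function dies before
    # releasing the lock.
    return bool(r.set(key, "locked", nx=True, ex=3600))

def release_prediction_lock(house_id, period_start, period_end):
    """Called by the predict function after the prediction is persisted."""
    r.delete(f"prediction-lock:{house_id}:{period_start}:{period_end}")
```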

Iteration 2 architecture

Iteration 3

After the system had been in the wild for a few weeks, an issue was identified where some houses’ predicted status would keep changing with every new day: a house would be predicted as performing well one day and performing poorly the next. This was caused by houses whose performance was very close to the boundary between good and bad. To remedy the situation, the energy utility company wanted us to implement a rolling average on top of the prediction algorithm that would take into consideration the last 10 predictions as well as the previous status that we had sent them.
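As an illustration only (the exact weighting and thresholds are assumptions here, not the production values), a simple smoothing rule that combines the last 10 daily predictions with the previously reported status, using a hysteresis band so the status only flips on a clear change, could look like this:

```python
def smoothed_status(last_predictions, previous_status, threshold=0.5):
    """Combine recent daily predictions with the previously reported
    status to stop houses near the decision boundary from flip-flopping.

    last_predictions: up to 10 booleans (True = performing well), oldest
    first. The proportion threshold and hysteresis band are illustrative
    assumptions only.
    """
    if not last_predictions:
        return previous_status
    good_fraction = sum(last_predictions) / len(last_predictions)
    # Hysteresis: only change the reported status when the rolling
    # average moves clearly past the threshold.
    if previous_status == "OK" and good_fraction < threshold - 0.1:
        return "UNDERPERFORMING"
    if previous_status == "UNDERPERFORMING" and good_fraction > threshold + 0.1:
        return "OK"
    return previous_status
```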

A rolling average was relatively trivial to implement for existing houses, where data was always sent and persisted in order. New houses, however, posed some problems. When a new house was registered we would be sent up to a year’s worth of data for it, but we had no guarantee of the order in which the data would be sent, nor of which data persist function would complete first. This made it difficult to apply a rolling average without knowing whether more data would be coming that predated the current data.

To solve this problem we made an assumption that all of a house’s data would have been persisted and propagated through the queue within 45 minutes of seeing the last piece of data.

To implement this we used a Redis sorted set. A Redis sorted set is a non-repeating collection of strings in which each member of the set is associated with a score, and this score is used to order the set. For the member string we used the house ID, and for the score we used a Unix timestamp. Every time a new piece of data was persisted for a house we would update the score for that house to the Unix time at which the data was persisted. We then had another Lambda function polling the sorted set for all the houses whose score was more than 45 minutes old, giving us the houses that hadn’t had any data persisted for at least 45 minutes and were therefore ready to have a rolling average calculated.
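The bookkeeping is small in code. A sketch with redis-py (the key name and quiet period are illustrative, and the zadd call assumes the redis-py 3.x mapping signature):

```python
import time
import redis

# Hypothetical connection details and key name
r = redis.StrictRedis(host="my-elasticache-endpoint", port=6379)
SET_KEY = "houses-awaiting-rolling-average"
QUIET_PERIOD_SECONDS = 45 * 60

def mark_data_persisted(house_id):
    """Called by the data persist function after each day of data is
    written: reset the house's score to 'now'."""
    r.zadd(SET_KEY, {house_id: time.time()})

def houses_ready_for_rolling_average():
    """Called by a scheduled Lambda: return every house whose most recent
    data arrived more than 45 minutes ago, then drop them from the set so
    they are not processed twice."""
    cutoff = time.time() - QUIET_PERIOD_SECONDS
    ready = r.zrangebyscore(SET_KEY, 0, cutoff)
    if ready:
        r.zrem(SET_KEY, *ready)
    return [house_id.decode() for house_id in ready]
```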

Iteration 3 architecture.

Developing and managing deployments

Our code was all in Python, and to simplify and manage our deployments we used the Serverless framework. One problem the Serverless framework does not provide a solution to is managing third-party dependencies.

When packaging Python libraries for Lambda, if any of them rely on compiled C code or shared libraries, those libraries must be compiled on Amazon Linux, as this is the OS that Lambda runs on. One solution is to provision an EC2 instance with the Amazon Linux AMI and compile all of the libraries there. We decided instead to use a Docker image of Amazon Linux. Using Docker made the deployment pipeline more flexible, as we could compile on any platform and get the same result.

Problems Encountered:

Lambda limits each function’s deployment package to less than 50MB. We were using quite a few libraries, such as scikit-learn and NumPy, that came to more than 50MB when packaged together. To overcome this we linked the libraries that had the same dependencies so they shared those dependencies. The exact details of how to do this can be seen here.

AWS Lambda functions are supposed to be idempotent, meaning that a single event should be able to trigger a function more than once without unintended consequences. For example, one notification from SNS can trigger two invocations of a Lambda function rather than just one. For us this posed a problem: the duplicate invocations were using up valuable resources, both on our database and in Lambda execution time. To resolve this, we turned to Redis again. We took the unique identifier of the event trigger and set it as a key in Redis upon successfully completing execution of the function. Then, at the beginning of each function, we checked for the existence of this key in Redis. If it already existed we had already successfully processed this event once, so there was no need to execute again and the function returned immediately. An expiry was set on the key so that Redis didn’t fill up with these keys.
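A sketch of that guard, written here as a decorator around an SNS-triggered handler (the key prefix, TTL and connection details are illustrative assumptions):

```python
import functools
import redis

# Hypothetical connection details
r = redis.StrictRedis(host="my-elasticache-endpoint", port=6379)
KEY_TTL_SECONDS = 24 * 60 * 60  # expire keys so Redis doesn't fill up

def deduplicated(handler):
    """Wrap a Lambda handler so a repeated delivery of the same SNS
    message is acknowledged but not reprocessed."""
    @functools.wraps(handler)
    def wrapper(event, context):
        # SNS message IDs are unique per message, so they make a
        # convenient idempotency key.
        message_id = event["Records"][0]["Sns"]["MessageId"]
        key = f"processed:{message_id}"
        if r.exists(key):
            return {"status": "duplicate, skipped"}
        result = handler(event, context)
        # Only mark the event as processed after a successful run.
        r.set(key, 1, ex=KEY_TTL_SECONDS)
        return result
    return wrapper
```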

Overview:

Lambda is often billed as a magic bullet: you just write the logic of your program and don’t have to think about anything else. Compared with our alternative, setting up and maintaining an Auto Scaling group of EC2 instances, it certainly removed a lot of effort. However, we found that with Lambda you can’t just write the logic of single functions and expect everything to work. You still have to think of the entire system as a whole. Rather than just thinking about the isolated logic of an individual function, you have to think about how the system will behave with many instances of your function running simultaneously. We learnt this as we went along, evolving the system as our understanding of how Lambda worked and how the functions interacted grew.

Overall our team really enjoyed working with Lambda. The benefits it brought both in cost and low maintenance far outweighed the initial learning curve that was required to deploy a production application on it. Going forward we will be looking to implement other projects with Lambda.
