Drilling for 21st Century Oil: Creating a Data Pipeline in AWS

Ruben Sikkes
Published in Gyver
Jul 24, 2019 · 7 min read

Data has been termed the oil of the 21st century by many. An interesting topic of research is therefore the ingestion of new information, or, to stay with our metaphor, to keep drilling for more oil. In our vocabulary we call this a data pipeline: we move data from a source and transform it to our liking in a data storage environment of our choice. Ideally we keep it as raw as possible, so we don't lose any data we might want later!

Enhancing your private data with public data could add to the quality and accuracy of your machine learning models. If you would like to add such information in a live environment, it is inevitable that you collect, organize, and structure such a data flow. Historical weather data, for example, could be a strong predictor of ice cream sales. If we therefore aim to predict sales based on weather data, we need an accurate weather forecast.

In this post we will briefly show you how a weather API can be used to retrieve 7 days of forecasted weather data. We will walk you through the following steps:

  • How to use a specific API to retrieve forecasted weather data.
  • How to store the requested data in a cloud environment.
  • How to schedule this process to keep a live feed of forecasted data.

Some design choices we made during implementation, which are essential if you would like to code along:
- An AWS account: our preferred cloud solutions provider, which will handle the scheduling and storage of the data.
- A DarkSky developer account, which provides us with a freemium API for weather predictions.
Both choices are based on personal preference, and setting them up should not take long.

Note: everything in this tutorial should stay within the free tier of both services. We also have no affiliation with either of these companies!

Note: the weather is not always like this in the Netherlands.

We like DarkSky because they have good API documentation, their free tier supplies plenty of requests (1,000 per day), and it provides hourly predictions. Furthermore, we can gather the meteorological parameters we consider essential (visibility, predicted rain, temperature, and wind speed).

In order to request data through this API, we need a secret API key in combination with a location (latitude and longitude) for which we wish to make weather predictions. An API key can be requested by creating a developer account (1 minute of work!) here. The GET request looks as follows:

https://api.darksky.net/forecast/{api_key}/{latitude},{longitude}?extend=hourly&units=si

In this tutorial we’re only predicting the weather for a single location in the centre of the Netherlands, but this API could just as easily be used to predict the weather for multiple locations (e.g. a set of big cities), as sketched below.
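If you do want to cover several locations, a minimal sketch could look like the snippet below. This assumes you run it locally with the requests library installed; the coordinates (other than De Bilt) are purely illustrative.

import requests

api_key = "{secret_api_key}"

# Illustrative coordinates; replace with the locations you care about.
locations = {
    "de_bilt": ("52.11", "5.18056"),
    "amsterdam": ("52.37", "4.90"),
}

for name, (lat, lon) in locations.items():
    url = "https://api.darksky.net/forecast/{}/{},{}?extend=hourly&units=si".format(
        api_key, lat, lon)
    response = requests.get(url)
    print(name, response.status_code)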

There are various ways to schedule such a process, for example to pull in a 7-day forecast every morning. One of them is to use an online scheduler from a cloud solutions provider like AWS. The key benefit of an online scheduler is that the provider is responsible for running your code: a drained laptop battery will simply not happen. We like AWS simply because it is the provider we have the most experience with.

From AWS we will use the following services for this task:

  • AWS Lambda: this will run our code, requesting the data and writing it to storage.
  • AWS S3: which will store our data.
  • AWS CloudWatch: which will schedule and monitor when the Lambda function should be triggered.

The code for the Lambda function is quite brief and can be found below. We set some basic variables, like the location we want to predict the weather for and our secret API key from the DarkSky website. A new Lambda function can be created by going to services/lambda in the AWS Management Console. Remember, you will need to sign up for an AWS account to do so.

The Lambda handler function contains our main code. It sends a GET request to pull the forecasted data and checks whether the request returned a proper status code (200). If so, it encodes the response text as a UTF-8 byte string.

Once the Lambda function has retrieved the data, we write it to our S3 data storage location (data bucket), into our desired folder. For us this folder is raw/weather.

import time

import boto3
# 'requests' is bundled with botocore in the Lambda Python runtime used here.
from botocore.vendored import requests

api_key = '{secret_api_key}'
# Coordinates of De Bilt, the Dutch centre of meteorology.
latitude = '52.11'
longitude = '5.18056'
api_string = 'https://api.darksky.net/forecast/{}/{},{}?extend=hourly&units=si'.format(
    api_key, latitude, longitude)


def lambda_handler(event, context):
    try:
        # Pull the 7-day hourly forecast from the DarkSky API.
        response = requests.get(api_string)
        print("status code:", response.status_code)

        if response.status_code == 200:
            encoded_string = response.text.encode("utf-8")

            # Build a timestamped file name inside the raw/weather folder.
            cur_timestamp = int(time.time())
            bucket_name = "{bucket_name}"
            file_name = "{}-7d_hourly_forecast.json".format(cur_timestamp)
            folder_name = "raw/weather"
            s3_path = "{}/{}".format(folder_name, file_name)
            print("writing as:", s3_path)

            # Write the raw JSON response to the S3 bucket.
            s3 = boto3.resource("s3")
            s3.Bucket(bucket_name).put_object(Key=s3_path, Body=encoded_string)
    except Exception:
        print("Bad Request. Check your API key / API request")

In order to make this work, we need to create an S3 bucket. This can be done by going to services/s3 and creating a new bucket. Basic settings will suffice.
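If you prefer to do this from code instead of the console, a bucket can also be created with boto3. This is just a sketch: the bucket name is a placeholder and the region (eu-west-1) is an assumption, so pick whatever region you actually use.

import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

# Bucket names are globally unique; replace the placeholder with your own.
s3.create_bucket(
    Bucket="{bucket_name}",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)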

In addition, we need to set permissions for the Lambda function so it can write to S3 and be triggered by CloudWatch. If your Lambda function does not have the correct rights, you will not be able to write to S3.

Create a new role and assign it AmazonS3FullAccess and CloudWatchFullAccess. For the other options, basic settings will again suffice.

Once your role and permissions are created, attach the role to your function; afterwards, the Lambda function should be operational. Note that after creating the role, it should appear as an existing role, like our ‘lambda_s3_access’ one below.
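For reference, the same role can also be set up programmatically. The sketch below uses the ‘lambda_s3_access’ role name from our example; everything else is an assumption about your setup. It creates the role with a trust policy for the Lambda service and attaches the two managed policies mentioned above.

import json
import boto3

iam = boto3.client("iam")

# Trust policy that allows the Lambda service to assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="lambda_s3_access",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the two managed policies mentioned above.
for policy_arn in [
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/CloudWatchFullAccess",
]:
    iam.attach_role_policy(RoleName="lambda_s3_access", PolicyArn=policy_arn)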

You can check the Lambda function by creating a test event and running it. Through the log and output you can see where potential errors in your code lie.
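You can also trigger such a test run from code; the function name below is a hypothetical placeholder for whatever you named your Lambda.

import boto3

client = boto3.client("lambda")

# Invoke the function once with an empty test event.
response = client.invoke(
    FunctionName="weather_forecast_to_s3",  # placeholder name
    Payload=b"{}",
)
print(response["StatusCode"])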

If all goes well, a file should be created in your destination folder: a raw JSON file containing all the hourly forecast data for the upcoming week!

In our case this gives us, among other things, the predicted temperature, rain, and wind for the upcoming hour.
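To peek at that first hourly data point from code, something like the sketch below works. The bucket and object key are placeholders, and the field names follow the DarkSky response format, where hourly predictions live under hourly → data.

import json
import boto3

s3 = boto3.resource("s3")

# Replace with your own bucket and the key printed by the Lambda function.
obj = s3.Object("{bucket_name}", "raw/weather/{timestamp}-7d_hourly_forecast.json")
forecast = json.loads(obj.get()["Body"].read())

# The first entry in hourly -> data is the forecast for the upcoming hour.
first_hour = forecast["hourly"]["data"][0]
print(first_hour["time"], first_hour["temperature"], first_hour["precipProbability"])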

Scheduling this Lambda function is a matter of adding a CloudWatch event rule in AWS. For this example we schedule our Lambda to trigger every morning at 7 a.m., but this could be at any time or interval you like.

Note: schedule expressions are in UTC, so 5 UTC corresponds to 7 in the morning for us (as Amsterdam is UTC+2 in summer).
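As a sketch of what the schedule boils down to: the rule below fires at 05:00 UTC every day. The rule name and Lambda ARN are placeholders, and you would still need to allow CloudWatch Events to invoke the function (the console does this for you when you add the trigger there).

import boto3

events = boto3.client("events")

# CloudWatch cron expressions are in UTC: 05:00 UTC is 07:00 Amsterdam time (UTC+2).
events.put_rule(
    Name="daily_weather_forecast",
    ScheduleExpression="cron(0 5 * * ? *)",
)

# Point the rule at the Lambda function (replace the ARN with your own).
events.put_targets(
    Rule="daily_weather_forecast",
    Targets=[{
        "Id": "weather_lambda",
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:weather_forecast_to_s3",
    }],
)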

We will not go into analyzing or further transforming the data, as that is beyond the scope of this story. We leave the analysis up to you :)

All things combined, that’s how you can rapidly create a data pipeline that pulls a 7-day weather forecast every day and stores it in S3. Now we could use this forecasted weather data, in addition to our sales information, to predict our ice cream sales up to 7 days in advance more accurately! Assuming, of course, that the weather forecast itself is accurate…

At Gyver this is a story we come across quite frequently. Since data storage is cheap, more is usually better: it’s easier to have data and not use it than to find out later you’ve been missing out and need pricey licensing to gain access to it.

In this tutorial we show merely one example of how easy it can be to create a data pipeline. Going from raw (textual JSON) data to something that can be used in a model or dashboard is a topic for another day.

Hope you learned something; if you have any questions or simply would like to get in touch, find me on LinkedIn. At Gyver we continuously help clients get the most out of their data. You can also follow Gyver on its quest for knowledge and helping clients on LinkedIn and Medium.

P.S. Remember, all of this runs on the free tiers of the service providers listed at the beginning! Happy coding :)
