Evaluating our needs
The first step in our journey to building a performance monitoring tool was to evaluate our business requirements for the project:
- Segmentation — Involve other internal teams as little as possible (we don’t want to increase their workloads) and be able to isolate the data we need without third-party dependencies
- Usability — Gather the data we deem important
- Reporting — Provide visual representations of the gathered data for easy digestion
- Cost Efficiency — The solution must not break the bank in dollars or development hours
Why not use an already built service?
Before we actively considered designing our own tool, we had a trial contract with SpeedCurve, an online subscription service that monitors and gives continuous feedback on front-end performance. Though it had a beautiful dashboard that displayed our logged metrics, we had trouble customizing it to fit our needs within our existing system.
Another service we trialled was New Relic. We were already using it to monitor our servers’ performance, so we hoped to tap into that existing framework. With New Relic, we were able to log the metrics we wanted simply by making HTTPS requests with the data passed as a POST body. Though promising, we ended up not going with it because the visualizations offered at the time were not what we desired.
We continued looking into various other avenues. External solutions often checked off the reporting requirements but missed cost efficiency and/or usability. As a result, we settled on creating our own service that covered the points mentioned above.
Since the team was familiar with quite a few AWS product offerings, we decided to continue utilizing them.
As such, our performance monitoring tool comprises the following for data collection and storage:
- API Gateway for API management
- Lambda to process and upload the data
- S3 for data storage
- Glue as the ETL service to prepare and load data for analytics
- Athena for querying
For the visualization part, we use an analytics platform called Looker which hooks into Athena to query the data (the visualization process is explored in more detail in part II of this series).
Our choice of API Gateway and Lambda is rooted in the cost-efficiency and segmentation branches of our needs. By using these two products to create a simple API instead of building one into our existing server, we keep a separation of focus and avoid adding clutter to already complex server logic. We chose Lambda over a server-based resource because 1) it allows us to run code without provisioning or managing servers, and 2) cost is relative to the compute time consumed. Since the workload is light (we do some simple data processing), Lambda is the obvious choice over a virtual server-based resource like EC2 in terms of cost.
In the Lambda function, we set up logic to map the data to more human-friendly property names and do a quick check for specific data points (to confirm that certain data points exist in the object). If the data passes the check, we call S3’s PutObject API to add our file to our S3 storage.
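A minimal sketch of what such a Lambda handler might look like. The metric key names, friendly property names, and bucket name below are hypothetical (the article doesn’t show the real schema), and the actual S3 upload is left as a comment since it requires AWS credentials:

```python
import json

# Hypothetical mapping from raw metric keys sent by the front end
# to more human-friendly property names.
FIELD_MAP = {
    "dcl": "dom_content_loaded_ms",
    "fcp": "first_contentful_paint_ms",
    "ttfb": "time_to_first_byte_ms",
    "url": "page_url",
}

# Hypothetical set of data points that must exist in the object.
REQUIRED_FIELDS = {"page_url", "time_to_first_byte_ms"}

def transform(raw: dict) -> dict:
    """Rename raw properties to human-friendly names, dropping unknown keys."""
    return {FIELD_MAP[k]: v for k, v in raw.items() if k in FIELD_MAP}

def is_valid(record: dict) -> bool:
    """Quick check that the data points we care about exist in the object."""
    return REQUIRED_FIELDS.issubset(record)

def handler(event, context):
    """Entry point invoked by API Gateway with the metrics in the POST body."""
    record = transform(json.loads(event["body"]))
    if not is_valid(record):
        return {"statusCode": 400, "body": "missing required data points"}
    # In the real function, this is where the file is written to S3, e.g.:
    # boto3.client("s3").put_object(
    #     Bucket="perf-metrics",  # hypothetical bucket name
    #     Key=f"{context.aws_request_id}.json",
    #     Body=json.dumps(record),
    # )
    return {"statusCode": 200, "body": json.dumps(record)}
```

Keeping the mapping and validation as small pure functions makes the handler easy to unit test without mocking AWS services.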
An AWS Glue crawler is scheduled to run over the data stored in our specified S3 bucket. The crawler infers our data’s schema, generates a metadata table, and registers it in the AWS Glue Data Catalog (a persistent metadata store). This step is necessary because it enables Athena to run queries on the data. From this point on, we can either manually run queries in Athena with standard SQL or have Looker pull the data through Athena to display graphical representations.
One of the challenges we faced going this route was Athena and malformed data. In a recent case, a few instances of the data we logged were incorrectly typed, which broke the Athena query. For example, we expected a property value to be of type ‘int’ but instead found it to be of type ‘string’ or ‘bigint.’ To fix this issue, we added a data type check in the Lambda function for certain properties. If the data did not conform to the schema, we placed the file in a different S3 bucket for later review. This lets us keep the malformed data for later investigation while retaining the ability to query the rest with Athena.
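The type check and bucket routing described above could be sketched as follows. The expected types and both bucket names are hypothetical stand-ins for whatever the real schema defines:

```python
# Hypothetical schema: the Python type expected for each logged property.
SCHEMA = {
    "page_url": str,
    "time_to_first_byte_ms": int,
    "dom_content_loaded_ms": int,
}

def conforms(record: dict) -> bool:
    """True when every schema property is present with the expected type."""
    return all(
        key in record and isinstance(record[key], expected)
        for key, expected in SCHEMA.items()
    )

def target_bucket(record: dict) -> str:
    """Route conforming records to the main bucket and malformed ones
    to a quarantine bucket for later review (bucket names hypothetical)."""
    return "perf-metrics" if conforms(record) else "perf-metrics-review"
```

Quarantining rather than dropping the bad files preserves them for debugging, while the main bucket stays clean enough for Athena to query.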
Part II of this series discusses visualizing the results in more detail.