Building a Data Platform from Scratch

Mihalis Papakonstantinou
Plumbers Of Data Science
4 min read · May 27, 2020


It all started back in 2016, when Agroknow decided to start working on their SaaS product and suggested I join this new venture.

The team had an ambitious dream: to create a data-powered solution that collects, translates and enriches food safety data worldwide.

The endgame? To offer insights to food companies, enabling them to make data-informed decisions concerning their supply chain.

Visually? This is the current state of Agroknow’s data platform, as drawn on one of our office walls by my not-so-steady hand!

Our decoration on one of the production team’s walls

But enough with the history lesson! What lay ahead (tech-wise)?

We had to crawl and scrape various data sources announcing food recalls and border rejections worldwide. This data comes in a variety of formats; to give a quick overview: multilingual PDF files, custom-formatted Excel files, RSS feeds and, of course, HTML pages.

Time for our first piece of research and the choice that came with it! How should we crawl this data? At the time we were really fluent in Java, so we decided to go with Crawler4J, though we needed to customize it a bit to cover our needs. Thankfully it’s an open-source project, so that was easy. Currently all it needs is a YML configuration file and it takes care of the rest (a rough sketch of such a file follows below). More on this subject in a later post, when we dive deeper into this part of our platform.

A (Canadian) example of the YML configuration expected by our crawlers
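Since the actual configuration only appears as an image in the original post, here is a minimal sketch of what such a YML file might contain, loaded in Python purely for illustration; every field name (source, seed_urls and so on) is an assumption, not the real schema our crawlers expect.

```python
# A rough illustration only: the field names below are assumptions,
# not the actual configuration schema used by our crawlers.
import yaml

config = yaml.safe_load("""
source: canada-cfia-recalls          # hypothetical source identifier
seed_urls:
  - https://example.org/recalls      # placeholder seed page
politeness_delay_ms: 1000            # be nice to the source site
max_crawl_depth: 2
output_collection: raw_recalls       # where the raw documents end up
""")

print(config["seed_urls"][0])
```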

OK, we’ve got the data, but we need to store it! We have a wide variety of data, each type with its own properties: food recall data arrives at a rate of, at most, under a hundred records per day, while data coming from sensors deployed at field level arrives at thousands of records per day. No single framework would be able to meet both needs, so we decided to split the data based on its velocity. Data with lower velocity (e.g. raw HTML, XLS) is stored in a MongoDB instance, and data with higher velocity in an Apache Cassandra cluster. Again, more on this part of our endeavor will follow in a later post.
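To make the split concrete, here is a minimal sketch of how such routing could look, assuming the official Python clients (pymongo and cassandra-driver); the database, collection, keyspace and column names are assumptions, not our actual setup.

```python
# Minimal sketch of the velocity-based split described above. Database,
# collection, keyspace and column names are assumptions for illustration.
from pymongo import MongoClient
from cassandra.cluster import Cluster

mongo = MongoClient("mongodb://localhost:27017")       # low-velocity store
cassandra = Cluster(["127.0.0.1"]).connect("sensors")  # high-velocity store

def store(record: dict) -> None:
    """Route a record to MongoDB or Cassandra based on the velocity of its source."""
    if record["source_type"] == "recall":              # at most ~100 records/day
        mongo["platform"]["raw_recalls"].insert_one(record)
    else:                                              # sensor stream: thousands/day
        cassandra.execute(
            "INSERT INTO readings (sensor_id, ts, value) VALUES (%s, %s, %s)",
            (record["sensor_id"], record["ts"], record["value"]),
        )
```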

But data is useless if you cannot process it, so the next part of our stack is our Transformer, which maps each record to our internal schema based on the entity type being processed. To that end, we added Python and PHP scripts (the latter currently marked as deprecated, no judgment there) as well as a custom Java project to the stack, all of which take care of the harmonisation of the collected data.
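As an illustration of the harmonisation step, here is a hedged sketch of what one such transformer might do; the internal schema and the field names are made up for the example and are not the platform’s real schema.

```python
# A hedged sketch of one harmonisation step; the target schema below is an
# assumption made up for the example, not the platform's internal schema.
from datetime import datetime

def transform_recall(raw: dict) -> dict:
    """Map a scraped recall record onto a common internal representation."""
    return {
        "entity_type": "recall",
        "title": raw.get("title", "").strip(),
        "country": raw.get("country", "").upper(),
        "published": datetime.strptime(raw["date"], "%d/%m/%Y").date().isoformat(),
        "source_url": raw["url"],
    }

# Example:
# transform_recall({"title": " Product X recalled ", "country": "ca",
#                   "date": "27/05/2020", "url": "https://example.org/1"})
```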

Time for the cooler parts of our stack to take action. We need to identify important terms in the collected data, such as the ingredient that was recalled, the reason behind a recall or the company involved in it. This is where data mining, NLP, NER, ML and DL techniques are employed. A number of projects and their respective API endpoints are deployed to take care of these tasks. As far as technologies and frameworks are concerned, we have Spring {Boot, Data} projects and Flask endpoints taking advantage of scikit-learn and Keras classifiers, all communicating with Elasticsearch instances and internally trained models. Each produces an accuracy score which, if above a threshold, is accepted as a valid response. More on the subject will follow in a later post.
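To show the thresholding idea in isolation, here is a rough sketch of a Flask endpoint wrapping a scikit-learn classifier; the route, model file and threshold value are assumptions for illustration, not our production setup.

```python
# Rough sketch of the thresholding idea behind our enrichment endpoints; the
# route, model file and threshold value are assumptions for illustration only.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("hazard_classifier.joblib")  # hypothetical pipeline (vectoriser + classifier)

CONFIDENCE_THRESHOLD = 0.8                       # assumed cut-off, not our production value

@app.route("/classify/hazard", methods=["POST"])
def classify_hazard():
    text = request.get_json()["text"]
    proba = model.predict_proba([text])[0]
    label = model.classes_[proba.argmax()]
    score = float(proba.max())
    if score < CONFIDENCE_THRESHOLD:
        return jsonify({"accepted": False, "score": score}), 200
    return jsonify({"accepted": True, "label": label, "score": score}), 200
```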

Current state of our data platform entities as taken from our internal Kibana dashboard instance

Our collected data is now harmonised and enriched. Time for it to be stored in our internal CMS (Drupal was our choice here), where our internal team of food experts can easily review, correct and approve it for publishing.
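As a hedged sketch of what handing a record over to the CMS for review could look like, assuming Drupal’s JSON:API module is enabled and a hypothetical “recall” content type exists; the endpoint, credentials and field names here are illustrative, not our actual integration.

```python
# Hedged sketch: push a harmonised record into Drupal for expert review,
# assuming the JSON:API module and a hypothetical "recall" content type.
# URL, credentials and field names are placeholders.
import requests

DRUPAL_URL = "https://cms.example.org/jsonapi/node/recall"   # placeholder URL

def push_to_cms(record: dict) -> str:
    payload = {
        "data": {
            "type": "node--recall",
            "attributes": {
                "title": record["title"],
                "field_country": record["country"],  # hypothetical field
                "status": False,                     # unpublished until an expert approves it
            },
        }
    }
    resp = requests.post(
        DRUPAL_URL,
        json=payload,
        headers={"Content-Type": "application/vnd.api+json",
                 "Accept": "application/vnd.api+json"},
        auth=("editor", "secret"),                   # placeholder credentials
    )
    resp.raise_for_status()
    return resp.json()["data"]["id"]
```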

And our collected, automatically enriched and human-curated data is ready to be published to our production Elasticsearch instances, where it is queried by our custom-developed Smart Search API and visualized in our application layer.
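A minimal sketch of the publishing step, assuming the official Elasticsearch Python client (8.x-style calls); the index name, document shape and query are assumptions for illustration only.

```python
# Minimal sketch of publishing curated records to Elasticsearch; index name,
# document shape and the sample query are assumptions, not our production setup.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder for the production cluster

def publish(record: dict) -> None:
    """Index an approved, curated record so the Smart Search API can query it."""
    es.index(index="recalls", id=record["id"], document=record)

# ...and the kind of query a search API might run behind the scenes:
resp = es.search(index="recalls", query={"match": {"reason": "salmonella"}})
print(resp["hits"]["total"])
```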

But this part of our stack will be analyzed in the next post of this series, so stay tuned!

Just to give you some numbers as to the current state of our Data Platform, there are:

  • 131 different API endpoints wrapped on top of each component of the stack;
  • over the past year, 14,809,924 requests have been served by our endpoints;
  • with an average response time of 200ms;
  • 9 Elasticsearch instances;
  • 2 Apache Cassandra nodes;
  • 1 MongoDB instance;
  • 3 Graph Databases (2 Neo4j instances and 1 GraphDB).

All of the above stats were generated by the Elastic Stack, dedicated to monitoring our whole infrastructure.

Over the next posts we plan to complete the walkthrough of our Data Platform; a platform awarded by Elastic for its use of the Elastic Stack throughout our product. Dedicated posts on each of our components will also follow, so stay tuned!
