WebScraping and ETL — not just another pretty TLA

Let’s talk about our favorite topic of “Formatting of the extracted Data in various ways” and “Loading the data into a customer’s database”.

This post attempts to explain the boundaries between what we provide as part of our regular/default service and what we expect the requester of the data (you) to do or we provide as an add-on to our service on request.

ETL (another TLA) Dealing with data involves a set of operations that are commonly known as ETL — Extract, Transform and Load. http://en.wikipedia.org/wiki/Extract,_transform,_load operations.

Give me an “E”: We deal with only the “E” part of it — we will extract the data and give you the raw data in a structured format that matches how it comes from the source.

What about the “T” and “L”

We don’t deal with the “T” and “L” parts of it as part of our regular and default service offering. That would fall within the realm of your application and application/database programmers or as an add-on to our service. We provide the data from the source site broken down into some fields that lend themselves to be broken down — which may or may not match how you need it for your database.

That is where the “T” part comes in and you will need to transform what we provide, into the format you need and then “L” load it into your application or backend database.

For example if the source site has some date data and they have it in the format April 20 214 at some place and as 4/20/14 in others or in any of the other formats a date can be represented. We will provide you the date “as is” from the site, you will need to Transform it to fit the date field in your database.

Another example would be a field such as $35,012,25 — We get the data “as is”. So we do not remove the comma and surely we don’t convert Dollars to Euros or British Pounds.

Cannot get what isn’t there

We also cannot create data for you (ironically contrary to popular belief). If the data is not on the source site, we obviously cannot get it for you or “insert/add” it for you. Some of the popular requests have been around getting contact information e.g. “We really need that email address or phone number”.

If the site doesn’t have the email or phone number — we will not be able to get it for you.

Free Tools to the Rescue

Thankfully, the field of ETL has been around as long as Data has been and there are some excellent and Open source (Free) tools that can make life a lot easier for people who are just getting into this field.

Pentaho

One of the excellent and free tools out there is Pentaho, available at community.pentaho.com — The product used for ETL is the Kettle product. It can be downloaded and can take the JSON format we provide and transform into into wondrous and consistent data and load it into your database.

According to their site “Data Integration (or Kettle) delivers powerful Extraction, Transformation, and Loading (ETL) capabilities, using a groundbreaking, metadata-driven approach.

Talend Open Studio

Another Open source and free product in this space is Talend

According to their site “Free, open source integration software. Talend’s open source products and open architecture create unmatched flexibility so you can solve integration challenges your way. Talend reduces the learning curve and lowers the barrier to adoption for data integration, data profiling, big data, application integration, and more.

Hope this post clears up some mystery and draws some clear boundaries in what to expect of our default service and what would be something you would do or ask for as an add-on to our service.

Related


Originally published at learn.scrapehero.com on April 7, 2015.