Building a Data Pipeline, Step 2: Clean and Transform Raw Data

Dakai Zhou
3 min read · May 31, 2022


In the last blog, we discussed a scalable way to get raw data from data platforms. In this blog, I am going to share my perspective on raw data cleaning and transformation.

In my experience, data pipelines are usually built to get raw data from the platforms and store it in a self-hosted database or data warehouse. The following content is based on this scenario.

Structure of Raw Data

The structure of the raw data depends on the design of the platform and its API. As NoSQL gains popularity in modern software development, it strongly shapes the data flows within the software, as well as the data that flows out of the platform. Some raw data you get from a platform API is a combination of two or more data objects, which fits poorly into a relational database and complicates future data consumption. One example of such a raw data structure is the following application record:

{
  "id": 985314,
  "candidate_id": 978031,
  "prospect": false,
  "applied_at": "2016-03-26T20:11:39.000Z",
  "rejected_at": "2016-08-17T21:08:29.686Z",
  "last_activity_at": "2016-08-27T16:13:15.000Z",
  "location": {
    "address": "New York, New York, USA"
  },
  "source": {
    "id": 1871,
    "public_name": "Happy Hour"
  },
  "credited_to": {
    "id": 4080,
    "first_name": "Kate",
    "last_name": "Austen",
    "name": "Kate Austen",
    "employee_id": "12345"
  },
  "rejection_reason": {
    "id": 8,
    "name": "Lacking skill(s)/qualification(s)",
    "type": {
      "id": 1,
      "name": "We rejected them"
    }
  },
  "rejection_details": {
    "custom_fields": {
      "custom_rejection_question_field": "Not a good fit"
    },
    "keyed_custom_fields": {
      "custom_rejection_question_field": {
        "name": "Was this candidate a good fit?",
        "type": "short_text",
        "value": "Not a good fit."
      }
    }
  },
  "jobs": [
    {
      "id": 123,
      "name": "Accounting Manager"
    }
  ],
  "job_post_id": 123,
  "status": "rejected",
  "current_stage": {
    "id": 62828,
    "name": "Recruiter Phone Screen"
  },
  "answers": [
    {
      "question": "Why do you want to work for us?",
      "answer": "I heard you're awesome!"
    },
    {
      "question": "How did you hear about this job?",
      "answer": "From a former colleague."
    }
  ],
  "prospective_office": null,
  "prospective_department": null,
  "prospect_detail": {
    "prospect_pool": null,
    "prospect_stage": null,
    "prospect_owner": null
  },
  "custom_fields": {
    "bio": "This is a bio",
    "birthday": "1992-01-27"
  },
  "keyed_custom_fields": {
    "date_of_birth": {
      "name": "Birthday",
      "type": "date",
      "value": "1992-01-27"
    },
    "bio": {
      "name": "Bio",
      "type": "long_text",
      "value": "This is a bio"
    }
  },
  "attachments": [
    {
      "filename": "John_Locke_Offer_Packet_09_27_2017.pdf",
      "url": "https://prod-heroku.s3.amazonaws.com/...",
      "type": "offer_packet"
    }
  ]
}

Here, the application data object contains complete records from at least six other data objects (source, rejection_reason, and so on). It is denormalized and redundant. You would not want Tableau, Power BI, or another tool to dig into the rejection_reason column, parse the JSON value, and then read the value under the name key.

Compress the Data

Here, I would recommend that you compress the data. This makes your data more structured and reusable.

For this example, you keep only the ids of the embedded data objects, and import those objects separately if the API exposes them. If not, collect the embedded instances into separate data groups, create a table for each group, and store the data from these groups in the corresponding tables.
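To illustrate, here is a minimal sketch that splits one application record into a flat row keeping only the ids of the embedded objects, plus the objects themselves so they can be stored in their own tables. The function name and the set of nested keys are illustrative, assuming the payload shown above has already been parsed with json.loads():

# Minimal sketch: flatten one application record, keeping only the ids
# of its embedded objects. Key names follow the example payload above.
NESTED_OBJECTS = ("source", "credited_to", "rejection_reason", "current_stage")

def compress_application(raw):
    """Return a flat application row plus the extracted embedded objects."""
    row = dict(raw)
    extracted = {}
    for key in NESTED_OBJECTS:
        obj = row.pop(key, None)
        if obj is not None:
            extracted[key] = obj            # store in its own table
            row[f"{key}_id"] = obj["id"]    # keep only the foreign key
    return row, extracted

The extracted objects can then be deduplicated and inserted into their own tables, so the application table carries only foreign keys.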

If you are 100% sure that some values will never be used, you can also drop them before storing the data. This keeps your data more compact.

Data Type Transformation

Before you store your data in tables, apply the correct data type to each column. Here are some ways to change data types:

If you want to change the data type of a single value, the usual built-in and standard-library methods in Python will do:

import json
import datetime

int('1')                                      # string to integer
str(2)                                        # integer to string
json.dumps({"key": "value"})                  # dict to JSON string
datetime_obj = datetime.datetime.now()
datetime_obj.strftime('%Y-%m-%d %H:%M:%S')    # datetime to string
time_in_seconds = 1653984000
datetime.datetime.fromtimestamp(time_in_seconds)  # Unix timestamp to datetime
datetime.datetime.fromtimestamp(time_in_seconds).strftime('%Y-%m-%d %H:%M:%S')
datetime.datetime.strptime('2022-05-31', '%Y-%m-%d')  # string to datetime

If you want to change the data type of one column across all your data instances, I suggest creating a dataframe and using apply() to convert the values. Then use pandas to store the dataframe in the database (which will be discussed in the next blog). For example:

df["col_1"] = df["col_1"].apply(json.dumps)

This serializes every value in col_1 to a JSON string, which can then be stored in a JSON or text column.
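For a fuller picture, here is a minimal sketch of column-wise type conversion, assuming a dataframe with illustrative column names and values; astype() and pandas.to_datetime() cover the most common cases:

import json
import pandas as pd

# Illustrative dataframe; the column names and values are made up.
df = pd.DataFrame({
    "candidate_id": ["978031", "978032"],
    "applied_at": ["2016-03-26T20:11:39.000Z", "2016-04-02T10:05:00.000Z"],
    "custom_fields": [{"bio": "This is a bio"}, {"bio": "Another bio"}],
})

df["candidate_id"] = df["candidate_id"].astype(int)          # string -> integer
df["applied_at"] = pd.to_datetime(df["applied_at"])          # string -> timestamp
df["custom_fields"] = df["custom_fields"].apply(json.dumps)  # dict -> JSON string

After this, each column has a concrete dtype that maps cleanly onto a database column type when the dataframe is written out.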

Summary

Above, we discussed the most common cleaning and transformation steps for raw data. More detailed transformations vary with the purpose the data will serve, so you should understand the requirements before you start cleaning or transforming your data.

In the next blog, I am going to share different ways to insert data into the database.


Dakai Zhou

A Python developer and data engineer based in Germany.