Field Notes from the Land of Big Data

Feihan Lu
Published in Upstart Tech · 5 min read · Sep 28, 2020

Many of us have heard of the fashionable term “big data analysis”, and some of us have a day-to-day relationship with it. As a data scientist with many years of statistics training in a (relatively) smaller data world, I didn’t start to build my connection with big data until recently. Sometimes this new romance is a thrill; other times it can be fairly frustrating. If you are like me, then maybe some of the “fights” we’ve had can be inspiring (or at least relatable) to you.

1. What makes big data different from small data?

As the name suggests, big data is big. But how big does it have to be to count as big rather than small? If you google questions like that, you'll find tons of articles trying to define the term. But based on my own experience (remember, I was born in the small data world), here are five points that in my opinion define "big data". Most of them were born from the question, "Why is it so hard for me to finish this job?!":

  • The size of the data is at least tens of GB and can easily exceed a hundred GB;
  • The data is saved in partitions rather than a single file;
  • Any code that runs on this data is likely to get stuck somewhere, even on a Mac you spent $4,000 to configure. To get around this, you have to rely on clever code and/or powerful machines (e.g., a machine with hundreds of GB of memory, or a cluster of such machines);
  • It’s a pain to visualize the data;
  • It’s a bigger pain to debug your code.

If three out of the above five points are satisfied, then congratulations, you are in the big data world!

2. Schema is a major trigger

Big data can give you big headaches. So what can we do to make our lives easier? Most of the time, the problems I encounter during data analysis or model building come not from the analysis or modeling itself but from mishandled data. And most data mishandling is caused by a wrong schema specification. Don't underestimate this mistake! It can easily happen: when we modify our code to reproduce existing data, when we backfill new data with additional records or attributes, when we merge datasets that span time periods with different schemas, and so on. It is therefore always a good habit to check the schema first.

There are many types of schema errors, but the most common mistakes I’ve come across, no matter what language or platform, are the following:

  • Wrong specification of the `header` argument (e.g., the partitioned data contain headers but we forget to skip the first row);
  • Data type misspecification (e.g., our schema file thinks a column is integer-typed while it’s actually double);
  • Column misalignment (e.g., I appended metadata for a new attribute at the end of the schema file while the column was actually added in the middle of the data).
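One way to head off the first two mistakes (and to make the third easier to spot) is to declare the schema explicitly at read time instead of trusting inference. Below is a minimal PySpark sketch; the column names and the S3 path are made-up placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, IntegerType, DoubleType, StringType,
)

spark = SparkSession.builder.getOrCreate()

# Spell out every column and its type. Hypothetical names throughout.
schema = StructType([
    StructField("user_id", IntegerType(), nullable=False),
    StructField("loan_amount", DoubleType()),  # double, not integer
    StructField("state", StringType()),
])

df = (
    spark.read
    .option("header", "true")  # the partitions carry header rows: say so explicitly
    .schema(schema)            # explicit types beat silent inference
    .csv("s3://my-bucket/events/")
)
```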

If your program yells at you when importing data, check the three causes above first; in my experience they account for a large share of import errors. If your program stays silent, take a small subset of the data and summarize the distribution of the columns (especially new columns and their neighbors on either side); it is a convenient and effective sanity check. Automatic tools such as data type inferrers and schema generators can be good helpers, but don't rely on them blindly, since there can be data issues that they (or their default settings) cannot catch.
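For the silent case, the subset-and-summarize check can be as short as this (continuing with the hypothetical `df` from the sketch above; the 1% fraction is an arbitrary choice):

```python
# Pull a small, cheap sample and eyeball each column's distribution.
sample = df.sample(fraction=0.01, seed=42)

sample.describe().show()  # count / mean / stddev / min / max per column
sample.groupBy("state").count().orderBy("count", ascending=False).show(10)
```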

3. Pre-programmed defaults could be mysterious helpers (or saboteurs)

Because of the uniqueness of big data analysis, we often need to shift to new languages or platforms that are specialized for these tasks. Spark is one such platform. When I first started to learn it, I didn’t anticipate the scope of the pre-set defaults. Sometimes they can be surprising saviors for code that appears buggy, while other times they can be unwelcome saboteurs for code that looks correct.

One example of the former is the order of applying `select` and `filter` on a DataFrame. In my "buggy" code, I first selected a few columns from a DataFrame to create a new DataFrame, then filtered the new DataFrame on a column that existed in the original DataFrame but not in the new one. Spark ran the code without complaint. An example of the latter was generating predictions from a fitted model on a test dataset that wasn't persisted (persisting means caching data from disk into memory so it can be reused without being re-read or recomputed). Spark generated very different predictions when I ran the same piece of code at different times.
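Both behaviors are easy to reproduce on toy data. A sketch of what I mean, hedged because the exact behavior can vary across Spark versions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 10.0, "approved"), (2, 20.0, "declined")] * 500,
    ["id", "amount", "status"],
)

# 1) Filtering on a column we just dropped. You might expect an
#    AnalysisException, but Spark can resolve `status` against the
#    parent plan and run without complaint.
selected = df.select("id", "amount")
selected.filter(F.col("status") == "approved").show(3)

# 2) Non-determinism without persistence. The split below is lazy, so
#    every action that touches `test` may recompute it; if anything
#    upstream is non-deterministic, the recomputed rows differ, and so
#    do predictions scored on them.
train, test = df.randomSplit([0.8, 0.2])
test.cache()  # pin down one materialized copy before scoring
test.count()  # force the cache to fill
# predictions = fitted_model.transform(test)  # hypothetical fitted model
```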

These mysterious behaviors are not limited to Spark; they can occur with any new technology we learn, and it takes a long stretch of practice and experience to uncover them. But one golden rule is to always run sanity checks whenever a major step of the job is done. Don't skip them to save time; better safe than sorry.

4. A series of sanity checks within an organized workflow is the ultimate savior

As mentioned in the first section, big data is much harder to visualize and debug than small data. By “visualize”, I mean seeing the data table or summarizing the distribution effortlessly (like running one line of code). By “harder” I don’t mean impossible: we can still print a few rows or print the schema to peek at the data from a few angles. However, if the project is big (and thus has more scope for error), then we need to check our procedures carefully from as many angles as possible. In that case, an organized workflow which embeds a series of sanity checks at different stages of the job is the key to success. It may take a long time to build such a workflow, but once it’s built it will save us a ton of time in the long run.
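In PySpark, for instance, those quick peeks are one-liners (again assuming the hypothetical `df` from earlier):

```python
df.printSchema()                      # column names and types at a glance
df.show(5, truncate=False)            # the first few rows, untruncated
df.select("amount").summary().show()  # count/mean/stddev/quartiles/max
```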

To achieve this, we can either create our own workflow or learn an existing workflow management tool. A good workflow is made up of small tasks, each with its own sanity checkpoint. For example, we can break a model-building job into individual steps such as a data preparation pipeline, a model training pipeline, model evaluation and production pipelines, and so on. Whenever we add a new task to the workflow, we should add its sanity checkpoints accordingly. Also, keep in mind that our jobs evolve and tasks grow more complicated, so the workflow needs to adapt to new situations while staying reproducible for past results. It is worthwhile to spend some time on a good versioning plan at the beginning of building the workflow. Last but not least, it is highly cost-effective to include a "debug mode" that tests the functionality on a small subset of the data before the full run, as sketched below.
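To make the idea concrete, here is a deliberately tiny sketch of such a workflow in plain Python. Every name in it (the tasks, the checks, the sampling call) is hypothetical, and a real project would likely reach for an established workflow manager instead:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    name: str
    run: Callable    # transforms the data
    check: Callable  # raises if the output looks wrong

def run_workflow(data, tasks: List[Task], debug: bool = False):
    if debug:
        # "Debug mode": exercise every task on a small sample first.
        data = data.sample(fraction=0.01, seed=1)  # assuming a Spark-like API
    for task in tasks:
        data = task.run(data)
        task.check(data)  # sanity checkpoint: fail fast, right where it broke
        print(f"[ok] {task.name}")
    return data

# Hypothetical usage, with made-up helper functions:
# workflow = [
#     Task("prepare", run=prepare_features, check=assert_no_null_ids),
#     Task("train",   run=train_model,      check=assert_reasonable_auc),
# ]
# run_workflow(raw_df, workflow, debug=True)  # small-sample dry run
# run_workflow(raw_df, workflow)              # full run
```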
