Processing Large Files with Go (Golang)

snassr
4 min read · Oct 31, 2022

Using concurrency to speed up large file processing.

In this article, let's look at how we can process a large file in Go. Here are the steps we will follow:

  1. Process the CSV data file sequentially
  2. Process the CSV data file concurrently
  3. Benchmark Comparison

Our data file

We will use the "Campaign Finance Election Contributions by individuals" data file.

Below is a sample of our pipe-separated values data file.

C00580100|A|YE|P2020|201903139145682053|15|IND|BALTHASAR, SUSAN|INDIAN SHORES|FL|33785|RETIRED|RETIRED|11072018|100||SA17A.8965|1319152|||4031420191645043402
C00580100|A|YE|P2020|201903139145683412|15|IND|GAZZIER, JAY|FRIENDSWOOD|TX|77546|MERRILL LYNCH|FINANCIAL ADVISOR|11072018|100||SA17A.13989|1319152|||4031420191645048289
C00580100|A|YE|P2020|201903139145684137|15|IND|KENNEY, PAUL|ACTON|ME|04001|P&E SUPPLY|SELF-EMPLOYED|11072018|100||SA17A.16626|1319152|||4031420191645052637
C00580100|A|YE|P2020|201903139145685972|15|IND|STUBBLEFIELD, LANE|SIGNAL HILL|CA|90755|RETIRED|RETIRED|11072018|200||SA17A.23496|1319152|||4031420191645060122

Fields

  • Column 8 (row[7]): NAME in the form Lastname, FirstName
  • Column 14 (row[13]): TRANSACTION_DT in the form MMDDYYYY

Questions

  • How many rows are in the file? 21,729,970
  • What is the most common first name in the file? JOHN
  • How many times does the most common name occur in the file? 59,055
  • What are the frequencies of donations by year? …

Files (included w/ GitHub repo)

  • A small sample file for tests (40 rows)
  • A larger sample file for tests (4,000 rows)
  • The complete file (21,729,970 rows)

Processing Function

First, let's look at the core function of our file processing. The function below is simple and constructed to be somewhat time-consuming; it extracts the first name and month from a file row.

Core processing function
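The real implementation lives in the GitHub repo; below is a minimal sketch of what processRow might look like, assuming it only needs the first name (column 8) and the month of the transaction date (column 14). The return values and field checks are illustrative, and the imports listed once here cover all of the sketches in this article.

```go
package main

import (
	"bufio"
	"log"
	"os"
	"strings"
	"sync"
)

// processRow extracts the first name and month from one pipe-separated row.
// NAME is column 8 (row[7]) as "LASTNAME, FIRSTNAME" and TRANSACTION_DT is
// column 14 (row[13]) as MMDDYYYY.
func processRow(text string) (firstName, fullName, month string) {
	row := strings.Split(text, "|")
	if len(row) < 14 {
		return // skip malformed rows
	}

	// split "LASTNAME, FIRSTNAME" and keep the part after the comma
	fullName = strings.TrimSpace(row[7])
	if parts := strings.Split(fullName, ","); len(parts) > 1 {
		firstName = strings.TrimSpace(parts[1])
	}

	// the first two digits of MMDDYYYY are the month
	if date := strings.TrimSpace(row[13]); len(date) >= 2 {
		month = date[:2]
	}

	return firstName, fullName, month
}
```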

Process the CSV data file sequentially

First, let's process this file sequentially. As discussed in the article Concurrency In Go (Golang) — Part 1, sequential processing is a line-by-line approach. We expect this to be slow because we have to process n lines starting from the first line to the last one.

As we'll see later in the benchmark, this takes about ~20 seconds! Let's see if we can get that number down by concurrently processing parts of the file.

Function to process the file sequentially
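Here is a sketch of the sequential version. It assumes a small result struct (the exact shape in the repo may differ) that tracks the row count, first-name frequencies, and donation counts per month.

```go
// result accumulates the answers we care about: a row count,
// first-name frequencies, and donation counts per month.
type result struct {
	numRows   int
	names     map[string]int
	donations map[string]int
}

// sequential reads the file line by line and folds every row into one result.
func sequential(file string) result {
	res := result{names: map[string]int{}, donations: map[string]int{}}

	f, err := os.Open(file)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		firstName, _, month := processRow(scanner.Text())
		res.numRows++
		res.names[firstName]++
		res.donations[month]++
	}

	return res
}
```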

Process the CSV data file concurrently

OK, let's put some of those CPU cores into action. Since most computers today have multi-core processors, we know we can map-reduce (split-apply-combine) this process.

data file → reader → processor(s) → combiner → result

We will use channels to build a pipeline! This pipeline will allow us to split the process into multiple stages.

Our pipeline uses the following components:

  1. reader
  2. worker(s)
  3. combiner

Reader

The reader splits rows from the data file into batches and sends each batch out for a processor to pick up.

Pipeline Stage 1: reader
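A sketch of the reader stage, using the standard generator pattern: the goroutine owns the output channel and closes it when the file is exhausted. Details may differ from the repo.

```go
// reader (stage 1) scans the file and sends batches of batchSize rows
// out on a channel for the workers to pick up.
func reader(file string, batchSize int) <-chan []string {
	out := make(chan []string)

	go func() {
		defer close(out)

		f, err := os.Open(file)
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()

		scanner := bufio.NewScanner(f)
		batch := make([]string, 0, batchSize)
		for scanner.Scan() {
			batch = append(batch, scanner.Text())
			if len(batch) == batchSize {
				out <- batch
				batch = make([]string, 0, batchSize)
			}
		}
		// flush the final, possibly partial, batch
		if len(batch) > 0 {
			out <- batch
		}
	}()

	return out
}
```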

Notice the variable batchSize is configurable and is used to determine the size of row batches sent out.

Worker(s)

The workers pick up batches from the reader, process every row in each batch, and send out the processed data.

We designed this stage for parallelism since we are targeting a multi-core architecture. The more workers we initialize, the more parts of the file we can process in parallel.

Pipeline Stage 2: processor(s)
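A sketch of a single worker: it drains batches from the reader's channel, applies processRow to every row, and emits one partial result on its own output channel once the input channel closes.

```go
// worker (stage 2) consumes row batches, processes every row, and
// sends a single partial result downstream when its input is drained.
func worker(rows <-chan []string) <-chan result {
	out := make(chan result)

	go func() {
		defer close(out)

		res := result{names: map[string]int{}, donations: map[string]int{}}
		for batch := range rows {
			for _, row := range batch {
				firstName, _, month := processRow(row)
				res.numRows++
				res.names[firstName]++
				res.donations[month]++
			}
		}

		out <- res
	}()

	return out
}
```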

We'll look at how to use multiple workers in the pipeline section.

Combiner

The combiner merges the incoming processed data from the worker(s).

Pipeline Stage 3: combiner (multiplexer)
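A sketch of the combiner, a standard fan-in: it copies values from every worker channel onto one output channel and closes that channel once all workers are done, tracked with a sync.WaitGroup.

```go
// combiner (stage 3) multiplexes the workers' output channels into one.
func combiner(inputs ...<-chan result) <-chan result {
	out := make(chan result)

	var wg sync.WaitGroup
	wg.Add(len(inputs))
	for _, in := range inputs {
		go func(ch <-chan result) {
			defer wg.Done()
			for r := range ch {
				out <- r
			}
		}(in)
	}

	// close the combined channel once every worker channel is drained
	go func() {
		wg.Wait()
		close(out)
	}()

	return out
}
```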

We'll look at how we pass the worker(s) output channels into the combiner function in the pipeline section.

The Pipeline (reader → processor(s) → combiner)

First, let's look at the function signature for our enclosing concurrency processing function.

func concurrent_processing(file string, numWorkers, batchSize int) (res result) {...}

Notice the two parameters numWorkers and batchSize. They specify the number of workers and the number of rows each worker should process at a time.

It's time to see how we combine the three stages!

Combining reader, worker(s), and combiner to form a pipeline
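Below is a sketch of how the stages might be wired together using the reader, worker, and combiner sketches above; the merge step folds each worker's partial result into the final result.

```go
func concurrent_processing(file string, numWorkers, batchSize int) (res result) {
	res = result{names: map[string]int{}, donations: map[string]int{}}

	// stage 1: read the file into batches of rows
	rows := reader(file, batchSize)

	// stage 2: start the workers; each gets its own output channel
	workers := make([]<-chan result, numWorkers)
	for i := range workers {
		workers[i] = worker(rows)
	}

	// stage 3: fan in the partial results and merge them
	for partial := range combiner(workers...) {
		res.numRows += partial.numRows
		for name, n := range partial.names {
			res.names[name] += n
		}
		for month, n := range partial.donations {
			res.donations[month] += n
		}
	}

	return res
}
```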

Done! The complete code with tests and benchmarks can be found here.

Benchmark Comparison

Let's benchmark and compare the sequential and concurrent versions of our file processing.
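For reference, here is a rough sketch of how the two versions might be benchmarked with Go's standard testing package; the file path and the worker/batch values are illustrative, and the actual benchmarks vary the worker count and batch size as the table below shows.

```go
// in a _test.go file (which also imports "testing")
func BenchmarkSequential(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sequential("testdata/itcont_sample.txt") // illustrative path
	}
}

func BenchmarkConcurrent(b *testing.B) {
	for i := 0; i < b.N; i++ {
		concurrent_processing("testdata/itcont_sample.txt", 10, 1000)
	}
}
```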

The concurrent version's performance will vary based on the number of workers and batch size. Here's our benchmark output:

Benchmark results

Our goal is to reduce processing time; let's review the results:

| type       | numWorkers | batchSize | time |
|------------|------------|-----------|------|
| Sequential | n/a        | n/a       | ~19s |
| Concurrent | 1          | 1         | ~30s |
| Concurrent | 1          | 1,000     | ~18s |
| Concurrent | 10         | 1,000     | ~11s |
| Concurrent | 10         | 10,000    | ~10s |
| Concurrent | 10         | 100,000   | ~9s  |

Using 10 workers and a batch size of 100,000, we cut processing time by around 50% compared to the sequential version. Therefore, we have a program that runs in roughly half the time!

The function processRow is pretty simple right now; if we were to increase the complexity of that operation, then the concurrent process would be even more valuable.

The source code for this blog can be viewed here.

Feel free to leave a note. You can also find me on Twitter @snassr_.
