Parallel programming has historically been a relatively niche and complex aspect of software development, often not worth the headache. But writing parallelized applications has only gotten easier, and it's now not unusual for an application to leverage multiple cores on a device's CPU at once to maximize efficiency.
Nowadays, with the rise of Data Engineering as a professional field, parallel programming is more popular than ever. Apache Spark, a software library used to "extract, transform and load" (ETL) large datasets, is perhaps the most popular application of parallel programming today. While Apache's Spark, Hadoop and Airflow are common technologies in the Data Engineer's toolbelt, using them calls for proficiency not in C# but in Python, Scala or Java* (*albeit not ideal).
The "right" tool for an ETL job would be something like Spark or Hadoop. They are purpose-built for ETL and require less code on your part than C# or any other general-purpose language. If you're looking to move into the ever-growing field of Data Engineering, your best bet is to learn Python and Spark (among other technologies). That said, sometimes the right tool is the tool you already know how to use, and in fact I have found C# and .NET to be up to the task for parallelized ETL operations.
Over the years, C# has evolved to make parallel programming easier and easier. Because C# maintains 100% backwards compatibility with its previous versions, it can be difficult to know which of the many ways to run code in parallel is the best one. Indeed, there are several ways to spin up multiple threads in .NET, and what it really comes down to is how much you'd like .NET to handle "behind the scenes" for you versus what you would like to handle explicitly, yourself.
Generally, we do want .NET to handle as much as possible for us, especially when it comes to thread management, because running code in parallel is (as programming goes) very complex and very prone to unexpected runtime errors. In fact, Microsoft has a dedicated page covering the potential pitfalls of parallel programming (say that three times fast) here.
I suggest giving it a read, but as far as ETL and other “data processing” tasks are concerned, we really only need to worry about two things: whether parallelization will indeed be faster and if so, ensuring our code is thread-safe.
CPU-bound vs IO-bound
In determining whether "parallelizing" code will be worthwhile, it's important to understand which parts of the application are "CPU-bound" versus "IO-bound". As you may have guessed, parallelization improves the performance of CPU-bound code; it yields no improvement to IO bottlenecks and may even exacerbate them (asynchronous programming, by contrast, aims to mitigate IO-bound waiting).
CPU-bound code typically consists of operations performed on objects in your program, mathematical or otherwise. Parsing CSV files, mapping objects and calculating averages all depend on the CPU. When processing data, efficiency comes from splitting the data set up based on the number of CPUs available and essentially running the program on each CPU simultaneously, just with different subsets of the whole data set.
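To make this concrete, here is a minimal sketch of CPU-bound work, using made-up CSV-style lines rather than real files. PLINQ's AsParallel() partitions the input across the available cores for us:

```csharp
using System;
using System.Linq;

// Hypothetical CPU-bound work: parse raw CSV-style lines into
// numbers and average them. AsParallel() splits the sequence
// across the machine's CPU cores automatically.
var csvLines = Enumerable.Range(1, 1_000_000)
    .Select(i => $"{i},{i * 2}");

double average = csvLines
    .AsParallel()                                   // partition across cores
    .Select(line => int.Parse(line.Split(',')[1])) // CPU-bound parsing
    .Average();

Console.WriteLine(average);
```

Every line here is pure computation, so adding cores genuinely shortens the run time; there is no network or disk wait for the threads to stall on.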
Code is IO-bound when execution depends on data being sent or received via "the wire"; that is, an internet or intranet connection to another server. We see this most commonly with repository methods calling out to APIs or databases. Code can also be IO-bound if it is writing to or reading from persistent storage such as a hard drive or solid-state drive. There isn't a whole lot we can do if a third-party API takes 10 seconds to return data to our application; however, through asynchronous programming we can at least have our program continue to use the CPU to run other parts of the program while waiting on the API call, instead of just sitting idle.
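A small sketch of that idea, using Task.Delay as a stand-in for a slow API call (the call itself and its timings are assumptions for illustration). Because the awaits don't block a thread, two "calls" in flight together finish in roughly the time of one:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Task.Delay simulates network latency here; in real code this
// would be, e.g., HttpClient.GetStringAsync.
async Task<string> SlowApiCallAsync(string name)
{
    await Task.Delay(500);   // thread is released, not blocked, while "waiting"
    return $"{name} done";
}

var sw = Stopwatch.StartNew();
// Both "calls" are awaited concurrently; total wait is ~500 ms, not ~1000 ms.
var results = await Task.WhenAll(SlowApiCallAsync("call1"), SlowApiCallAsync("call2"));
sw.Stop();

Console.WriteLine($"{results.Length} calls in {sw.ElapsedMilliseconds} ms");
```

Note that this is asynchrony, not parallelism: no extra CPU work gets done faster, we simply stop wasting a thread on waiting.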
In a case where processing each data set relies on an IO-bound call, such as calling to a database to do an INSERT of the processed data, it can be important to balance the number of CPU threads we run the program on against the number of calls we make to an external source such as an API or a database. This is particularly important if the API has usage limits (e.g. 10 requests per second), or if the database server doesn't have enough threads to handle our program's, say, 20 threads all trying to INSERT to the same database at the same time.
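One common way to strike that balance is a SemaphoreSlim throttle: however many items the program is processing, only a fixed number of "INSERT" calls are in flight at once. This is a sketch with a simulated database round trip, not a real database client:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Allow at most 4 concurrent "database" calls, regardless of how
// many worker threads produce data.
var throttle = new SemaphoreSlim(4);
int inFlight = 0, maxInFlight = 0;
var gate = new object();

async Task InsertAsync(int row)
{
    await throttle.WaitAsync();   // blocks (asynchronously) once 4 calls are in flight
    try
    {
        lock (gate) { inFlight++; maxInFlight = Math.Max(maxInFlight, inFlight); }
        await Task.Delay(50);     // simulated database round trip
        lock (gate) { inFlight--; }
    }
    finally { throttle.Release(); }
}

// 20 rows to insert, but never more than 4 at a time.
await Task.WhenAll(Enumerable.Range(1, 20).Select(InsertAsync));
Console.WriteLine($"Peak concurrent inserts: {maxInFlight}");
```

The semaphore's initial count (4 here) is exactly the knob you tune against API rate limits or the database server's capacity.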
Maximizing efficiency and minimizing program run time without overloading IO-bound aspects of your program will likely require some trial and error-style tweaking of the number of threads your program uses at various steps of the ETL process.
Parallel.ForEach vs PLINQ
When it comes to .NET, everything you need to parallelize an application can now be achieved with a simple Parallel.ForEach loop or Parallel LINQ (PLINQ). These aren’t particularly new to .NET, but they are much easier to work with than their predecessors, which required creating and managing threads, and partitioning collections manually. Both Parallel.ForEach and PLINQ handle all of this for us, and in my experience there is not a noticeable performance difference between either. I suspect under the hood they are calling more or less the same code.
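Side by side, the two look like this; note that Parallel.ForEach's imperative style means shared results need a thread-safe collection, whereas PLINQ returns results as a query would:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

var numbers = Enumerable.Range(1, 100).ToList();

// Parallel.ForEach: imperative style. A plain List<int> is not
// safe to Add to from multiple threads, so we use a ConcurrentBag.
var squaresA = new ConcurrentBag<int>();
Parallel.ForEach(numbers, n => squaresA.Add(n * n));

// PLINQ: the same work expressed as a query.
var squaresB = numbers.AsParallel().Select(n => n * n).ToList();

Console.WriteLine($"{squaresA.Count} and {squaresB.Count} results");
```

Both partition the collection and manage the threads for you; which reads better mostly comes down to whether the surrounding code is loop-shaped or query-shaped.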
So why use one over the other? I suggest basing your decision on which you would use in a non-parallel situation: would you use a foreach loop or a LINQ query? That said, the big, non-obvious difference between the two is that Parallel.ForEach lets you specify a thread count only up to the maximum number of threads available on the computer or server it is running on (you can specify more, but it will only spin up as many threads as the CPU has). PLINQ, however, will use more than the number of CPU threads on the computer if you tell it to. Generally you would not want to spin up more threads than the CPU makes available to the OS, but in some cases doing so is more performant. For example, if you were writing a web scraper, it might make sense to spin up double the number of threads, since each thread presumably has to wait on the website(s) to load. Other such network-bound (a subset of IO-bound) tasks may likewise run faster with more threads in flight at once.
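In code, the two knobs are ParallelOptions.MaxDegreeOfParallelism for Parallel.ForEach and WithDegreeOfParallelism() for PLINQ. A sketch of both (the degree values are illustrative, not recommendations):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Parallel.ForEach: cap the loop at 4 concurrent workers.
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
Parallel.ForEach(Enumerable.Range(1, 100), options, n => { /* CPU-bound work here */ });

// PLINQ: explicitly ask for more workers than the machine has
// cores, as you might for the web-scraper scenario.
var results = Enumerable.Range(1, 100)
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount * 2)
    .Select(n => n * 2)
    .ToList();

Console.WriteLine(results.Count);
```

WithDegreeOfParallelism accepts values up to 512, which is what makes the "oversubscribe for network-bound work" trick possible with PLINQ.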
There are several other differences between Parallel.ForEach and PLINQ, which are discussed here; however, for ETL they are unlikely to apply except in very specific circumstances. For example, if the order of the data must be preserved, you should use PLINQ, which provides the AsOrdered() method for exactly that.
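Order preservation is a one-line opt-in:

```csharp
using System;
using System.Linq;

var input = Enumerable.Range(1, 1000);

// Without AsOrdered(), PLINQ may return results in whatever order
// the partitions finish. With it, output order matches source
// order, at some cost in throughput.
var ordered = input
    .AsParallel()
    .AsOrdered()
    .Select(n => n * 10)
    .ToList();

Console.WriteLine(ordered.First() + " ... " + ordered.Last());
```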
The Sample App
Our ETL app (repo here) will do four things:
- Read in CSV files.
- Map the fields in those CSV files to C# objects.
- Perform some transformative action on the data (the List of objects).
- Insert that data into a database.
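The four steps above can be sketched end to end as follows. Everything here is a stand-in: hard-coded lines replace the CSV files, tuples replace the mapped C# classes, and a dictionary replaces the database INSERT; the real app in the repo fills those in properly.

```csharp
using System;
using System.Globalization;
using System.Linq;

// 1. "Read" CSV lines (hard-coded in place of File.ReadLines).
var csvLines = new[] { "widget,9.99", "gadget,24.50", "widget,5.00" };

// 2. Map each line to a C# object (a named tuple here for brevity).
var records = csvLines
    .Select(line => line.Split(','))
    .Select(f => (Product: f[0], Amount: decimal.Parse(f[1], CultureInfo.InvariantCulture)))
    .ToList();

// 3. Transform: total sales per product, run in parallel.
var totals = records
    .AsParallel()
    .GroupBy(r => r.Product)
    .ToDictionary(g => g.Key, g => g.Sum(r => r.Amount));

// 4. "Insert" the results (a dictionary stands in for the database).
Console.WriteLine($"widget total: {totals["widget"]}");
```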
A second article covering this app is incoming.