The Chunk Graph Algorithm (week 26 - 29)

2017/09/25–2017/10/22

Tobias Koppers

Published in

webpack

7 min readOct 23, 2017

What happend in webpack 3?

We released a few new webpack 3 versions 3.7.0, 3.7.1, 3.8.0, 3.8.1.

Like always we merged a lot of bugfixes. Best read the change logs for details on these bugfixes.

Stats validation

The validation of the stats option was broken in version < 3.8.1. Now stats are validated correctly and webpack may scream at you when typos are hidden here.

On demand loading of initial chunk

You can use import() or require.ensure to load a chunk with specified chunk name. If you try to specify a name that is already used as entry point of commons chunk, webpack versions < 3.7.0 allowed that but behave weirdly. webpack 3 will now display a warning, webpack 4 will display an error.

So if you receive this warning after upgrade, you did something wrong with your imports.

Deep Children for CommonsChunkPlugin

The CCP got a new feature. This was requested long time ago and finally somebody created a working PR for it.

The CCP now allows to find commons in all indirect children instead of only direct children. This works in non-async and async mode. This allow to remove more duplicate modules in the chunk graph, but at cost of bigger initial/async commons chunk. So use it wisely. Smaller total size != faster app.

Chunk graph algorithm refactored

This algorithm was refactored for performance reasons. In general creating the chunk graph is pretty cheap, but in some edge cases (many circular chunk references) the performance was really bad, because of high algorithmic complexity.

The refactoring increased the performance in these cases by reducing the algorithmic complexity. Milliseconds instead of minutes.

I was asked to tell more about the chunk graph algorithms, so you’ll find more details at the second part of this post.

CI: separate unit and integration tests

Code coverage of unit tests and integration tests is now separated. 90% change coverage for PRs is only enforced on the integration tests. Unit test coverage is only displayed, but otherwise ignored.

Why? Unit tests give a false sense of coverage. They only test a component in isolation and often test the inner workings of the component. These tests doesn’t make sure that the complete system is working correctly. Example: When adding a bug fix that requires a component to return 42 in an specific edge case, a unit test which feeds the correct arguments into the component and test for 42 would receive 100% change coverage. Awesome. No, it’s doesn’t test if the complete system no longer have the concrete bug. An integration test would do this.

What happened for webpack 4?

New plugins system

The implementation of the new plugin system has finished in tapable. At least it’s in beta stadium. We probably find a few missing things when really using it in webpack.

enhanced-resolve, the resolving lib for webpack, was refactored to use the new plugins system.

I also started to refactor webpack core, but that’s a lot of work…

New filtering options for contexts

A PR by @ljcrapo was merged into the webpack 4 branch which adds new comments for import() to filter the items added to the context:

import(
  /* webpackInclude: /\.js$/, webpackExclude: /hidden/ */
  `./config/${filename}`
)

Bugfixes and merging

Somebody reported a bug with sideEffects: false (yes, it was renamed to camelCase notation like other fields in package.json), which was fixed.

And all master changes were merged into the webpack 4 branch.

Feel free to try webpack 4 on the next branch, but consider it as super-unstable and expect breaking changes until the final release.

Chunk graph algorithm

Ok let’s start by explaining some of the datastructures in webpack and how they are connected:

Step 1: The module graph

Let’s assume all modules have been loaded and Dependencies in these modules have been extracted (This happens by the Parser). All the blue entities and blue connections (Module, AsyncDepBlock, Dependency and all connections between them) are already there.

A Module contains a number of Dependencies like require or import. A Module can also contains a number of AsyncDep(endencies)Blocks (i. e. formed by require.ensure and import()), which contain a number of Dependencies itself. AsyncDepBlocks could also contain AsyncDepBlocks in a tree like structure (this happens with nested require.ensure, but can’t happen with import()).

Basically each Dependency in the Module can be considered as sync dependency, going into the same Chunk as the Module. Each Dependency in an AsyncDepBlock can be considered as async dependency. Each AsyncDepBlock will form a new Chunk, or reuse one by name.

Step 2: Entrypoints

In the next step Entrypoints and Chunks for the entry option are created. So we create a new Chunk for each entry and add the first Module, the entryModule.

Step 3: Chunk graph from Modules

The created chunks and entry modules from Step 2 are fed into the first step of the chunk graph algorithm. Here the algorithm walks through every Module, Dependency and AsyncDepBlock and connects Modules and AsyncDepBlocks with Chunks.

When visiting a Module: Adds the module to the current chunk. Visit all Dependencies and AsyncDepBlocks.

When visiting a Dependency: Visit the referenced Module.

When visiting a AsyncDepBlock: Create a new Chunk, or lookup Chunk by name, when chunk is not yet set. Set the current chunk to this chunk. Visit all Dependencies and AsyncDepBlocks.

The algorithm works iteratively by using a Queue. A recursive implementation caused Stack Overflows historically, so this was refactored.

While this algorithm visits Entities it also tracks which Chunk reference other Chunks. This information is fed into the next step.

Step 4: Parent-Child relationship for Chunks

This step adds parents and children for Chunks where they are necessary. Parents means: “When this Chunk is loaded at least one of the parents has already been loaded”. This information is important for optimization steps. Example: A module can be removed from a Chunk when it’s included in all parent.

So this step tries to avoid creating unnecessary relationships, because they would hurt optimization.

The algorithm tries to figure out all possible paths through the chunk graph (all possible ways a user can go through the app) and checks if a chunk connection is needed on this way, depending on modules available on the path.

Sounds expensive? In most cases the chunk graph is pretty simple and not heavily connected. Often also tree-like. But in cases where on-demand-loaded modules reference the module which loaded everything on demand: a super connected chunks graph is created. In this graph the number of paths through the graph is n! where n is the number of chunks. A trivial algorithm could yield bad performance here. This was the case before refactoring.

The new algorithm uses an interesting fact (Which took me 2 days to discover it): Assuming two sets of available modules at a chunk A and B, where A is a subset of B. In this case the required connections for B are a subset of the required connections for A.

Using this fact we can create all required connection when using the minimum of available modules at each chunk. So when walking the paths of the graph we store the current minimum of available modules at each chunk and stop visiting when stored minimum of available modules is a subset of the current available modules.

We can further improve the algorithm when only using the minimum of available modules when traversing into the children.

An additional improvement would be possible, by traversing smaller sets of available modules before bigger sets, but this would require implementing a priority queue. So I just used a breadth-first traversal. It’s much faster than a deep-first traversal. So I leave this open as opportunity for a Pull Request.

The final graph

Here is how the module and chunk graph look like for this example:

// index.js
import a from "./sync";
import(/* webpackChunkName: "myChunk" */ "./async");

(Assuming sync.js and async.js have no other dependencies)

(The arrow from Chunk main to Chunk myChunk is a child relationship, the arrow from Chunk myChunk to Chunk main is a parent relationship)

Discovered bug

Do you remember I talked about a new warning when referencing a initial chunk by name? I didn’t tested this case (because it doesn’t make sense), but a bug sneaked into this case in the refactored algorithm. Users reported broken bundles, but I didn’t know why… (they didn’t add a repro case to the issue)

Than @simonjoom reported an issue with the perfect hint: It happens when chunk name on import() points to an initial chunk.

yes i know it’s not really logical to use System.import to load a code from main […]
At least it’s should to display an error to help the developer. (i scrolled in all my 1000 lines of code to find the problem)

So it was added between 3.7.1 and 3.8.0 and original behavior was restored in this case (to not introduce a breaking change).

Thanks for investigating yourself before reporting issues. This really helps.