Stories by François Lagunas on Medium

Sparse Neural Networks (2/N): GPU Performance.

François Lagunas — Thu, 28 May 2020 21:23:50 GMT

Sparse Neural Networks (2/N): Understanding GPU Performance.

NVIDIA Ampere A100 introduces fine-grained structured sparsity

Welcome back to this series on Sparse Neural Networks. In case you have not read our first introductory episode, here it is.

I told you last time that sparsity would be a major topic in 2020, and it looks like it is indeed gaining steam: Nvidia has announced that, with the Ampere GPU generation, sparsity is baked directly into the GPU design.

It’s quite a bold move: if you consider the time it takes to design and produce a new GPU line, they made this decision at least 2 years ago, and it takes some foresight to see that it would be an important trend 2 years later.

André Ampère, 1825 (from Wikipedia)

So that’s the perfect pretext for a long digression on GPU architectures, and on why knowing them better may matter for your daily Machine Learning work.

To be honest, this will matter more to you if you are working on low-level code.

If you are using PyTorch or other libraries, and you are just using the extremely good tools they provide, you are probably fine.

But leaky abstractions come back at you faster than you’d think. Your model got a bit heavier? Want to train faster? OK, let’s wrap it in PyTorch’s DataParallel, and we’ll be fine on 8 GPUs. But wait, why is my GPU usage in the gutter? And why is it only 3 times as fast on 8 GPUs as on a single one?

It especially matters to me, as I told you last time that the performance of sparse matrix operations was not satisfactory. Today we’ll see why it can be hard to get good performance on GPUs, how it depends on your data structures and algorithms, and how you can overcome these issues, or at least mitigate some of them.

And of course, all this is a good pretext to read about some mind-blowing GFlops numbers and killer optimizations, nothing to sneeze at…

Some physics

You may wonder why your PC/Mac is not significantly faster than a few years ago. That’s because most of the apps you are using are mostly sequential: they are doing only one thing at a time, or almost, and sequential performance has been stagnating for some years.

That’s because sequential performance is mostly limited by operating frequency, which is itself limited by:

the size of the finest details that are drawn on the silicon, something that is getting harder and harder to improve,
the amount of heat produced by the chips, which is a function of voltage and frequency. First, a transistor emits heat each time it changes state, so heat grows in proportion to frequency. Second, the higher the frequency, the higher the voltage you need. So in the end the emitted heat grows faster than linearly with frequency, which is far from ideal.
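To put the second point in symbols, here is the usual first-order model of dynamic (switching) power in CMOS logic. It is a textbook approximation added for illustration, not something taken from the video; the activity factor α and the switched capacitance C are symbols introduced here.

```latex
P_{\text{dyn}} \approx \alpha \, C \, V^{2} f
\qquad\text{and, if } V \text{ has to grow roughly in proportion to } f:\qquad
P_{\text{dyn}} \propto f^{3}
```

Under that rough assumption, a 20% higher clock costs about 1.2³ ≈ 1.7 times the heat.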

From https://youtu.be/Knd-U-avG0c

So if you could efficiently and cheaply remove heat from the chips, you could get higher frequencies, but only marginally, and it quickly gets impractical (water-cooling, you know, is cool, but not when it leaks…).

The recent ARM takeover is not an accident. When you have worked for years on low-power, and therefore low-heat, chips, and everybody else hits the “heat wall”, you are in a good position to push performance higher, even if computers migrating to your pocket was the opportunity that made the difference.

Chip design

So people invented tricks to do more with the same number of cycles: executing almost any instruction in a single cycle, predicting what the next instruction will be, and so on. Very different architectures (RISC, CISC) were used to tackle the same issues. But the returns are diminishing, as always.

So what can you do to feed the hungry “Moore’s Law Beast”, and the marketing guys who keep asking why the numbers are flattening?

You look for problems where the same kind of task must be done a billion times, and where no task needs the result of another, so all tasks can be computed at the same time (the technical slang for this is “Embarrassingly Parallel”).

Fortunately, there are a lot of them. Linear Algebra, for example, is highly parallel by nature, and machine learning uses it a lot, as do physics simulation, computer graphics, and so on.

So instead of increasingly complex single-core processors, we see much simpler (and smaller on silicon) cores, grouped by the hundreds or thousands. This way, the ratio of computation to silicon area goes through the roof.

Great. That’s a simple idea. But of course, reality is more complex than that.

Bottlenecks

If you have a lot of computing power available, you have to feed it with data. Memory is getting faster with time, but improving it is harder than just duplicating cores, because memory buses are basically 1D while compute cores are laid out in 2D.

From https://unsplash.com/photos/VEVfbQtyB8s

You can think about it as a city (the computing cores), and the suburban workers coming into the city each morning (the data). The city is 2D, the highways are 1D, and of course, you get some heavy traffic jams. So you add some new lanes to the highways (the width of the memory bus), but it’s always the bottleneck.

If you wanted to get the most out of the highway, you would have to use it all day long, encouraging people to come to work and leave earlier or later.

That’s the same thing for the memory bus: you have to make sure that you are balancing computation and memory transfers so you don’t waste time waiting without using the memory bus or the compute cores. That’s why it’s hard to reach peak performance for every task.

Some tasks are even better off computing the same thing twice instead of transferring some data: compute is plentiful and memory bandwidth is scarce (and the gap is growing each year). In graphics, procedural texturing is used more and more for this exact reason: textures need bandwidth, so if you can generate the same result with few memory transfers but some additional compute, it’s a win.

GPU Architecture principles

A lot of the complexities of GPU architectures exist to overcome those bottlenecks.

Hierarchy

You don’t get the 1000s of cores in a GPU in a single bag: they are grouped at multiple levels. We’ll take the example of the new Ampere A100. Numbers change according to the generation, but the general principles are slowly evolving. (Numbers below come mostly from the Nvidia blog)

The GA100 streaming multiprocessor (SM)

At the lowest level you have a Streaming Processor (SP). It is part of a group of 16 SPs which execute the same sequence of instructions at the same time.

(To be more precise, you have 16 FP32 cores, 8 FP64 cores, 16 INT32 cores, 1 Tensor Core, and 1 texture unit per group. More on tensor cores later)

The first constraint is the following: the 16 SPs in the group cannot diverge from a single instruction sequence. This is called SIMD: Single Instruction, Multiple Data. That’s not exactly true: the instructions can contain “if” statements, but if different branches are taken within the group, some compute will be lost, because every processor will have to step through both branches and throw away the results that are not useful for its own work.
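The cost of divergence can be sketched with a toy model in plain Python. This is only an illustration of the idea above, not how the hardware is programmed: `run_group` is a hypothetical helper, and the two branches are assumed to cost the same.

```python
def run_group(values):
    """Simulate one lock-step group executing `y = 2*x if x >= 0 else -x`."""
    mask = [x >= 0 for x in values]           # lanes that take the "then" branch
    then_results = [x * 2 for x in values]    # every lane steps through "then"
    else_results = [-x for x in values]       # every lane steps through "else"
    # Each lane keeps only the result of the branch it actually needed.
    results = [t if m else e for m, t, e in zip(mask, then_results, else_results)]
    # If all lanes agree, the untaken branch can be skipped: one pass instead of two.
    passes = 1 if all(mask) or not any(mask) else 2
    useful_fraction = 1 / passes              # assumes both branches cost the same
    return results, passes, useful_fraction


print(run_group([1, 2, 3, 4]))     # all lanes agree
print(run_group([1, -2, 3, -4]))   # lanes diverge
```

In this model a single disagreeing lane is enough to halve the useful work of the whole group, which is why kernels try to keep the data handled by one group as homogeneous as possible.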

4 groups of 16 SPs form a Streaming Multiprocessor (SM). Each group executes the same kernel (=function), but not in a strictly synchronized way. Still, you’ll have at least 64 cores working on the same task, or you’ll lose some computing capacity.

Then, you group 2 SMs to form a “Texture Processing Cluster” (TPC), and you group 8 TPCs to form a GPC (GPU Processing Cluster). 8 GPCs and you have an A100 GPU. Phew!

To sum it up, there are 128 SMs in an A100, so 8192 FP32 cores, but as you can see, we are far from getting a flat set of “8192” cores! (those are maximum numbers, first processors won’t have the full set of cores).
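The arithmetic behind these totals, using only the per-level figures quoted above (the constant names are made up for this sketch):

```python
# Per-level counts quoted in the text for the full chip.
FP32_CORES_PER_GROUP = 16
GROUPS_PER_SM = 4
SMS_PER_TPC = 2
TPCS_PER_GPC = 8
GPCS_PER_GPU = 8

cores_per_sm = FP32_CORES_PER_GROUP * GROUPS_PER_SM
sms_per_gpu = SMS_PER_TPC * TPCS_PER_GPC * GPCS_PER_GPU
total_fp32_cores = cores_per_sm * sms_per_gpu

print(cores_per_sm, sms_per_gpu, total_fp32_cores)
```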

If you compare the A100 structure with the Volta V100, these structural numbers are almost the same, except for the processing cluster counts, and so for the grand total of course. The innards of the cores have of course changed too, but it looks like the communication structure of the V100 was considered quite good for the kind of job it’s usually given. The Tensor Cores seem to be the area where the most innovation is taking place (more on this later).

You can see in the comparison below that all those numbers varied significantly over time, in search of the best performance:

Why so many levels? Performance

The main reason is of course to improve real-life performance. And in real life, you don’t have a single task to be done.

First, there may be several processes using your GPU at the same time on your machine. It may not be the best way to get good performance, but it is of course very common.

In the new Ampere GPU, you can even partition your GPU to serve multiple Virtual Machines with strong guarantees on your data security: the new feature is called “Multi-Instance GPU”.

In a single process, if your network contains several layers, some linear, some non-linear, some embedding, each one will use one or several kernels to do its job.

You may think that they are executed one after the other. It’s true to some extent, but in order to keep your GPU busy, your CPU sends a stream of tasks to be done rather than one task at a time, and the GPU works through them without the CPU waiting for each one to complete.

The CPU will basically wait only once a full batch has been processed, after the forward and backward pass, because it has to update the full model before starting a new batch.

There are several reasons to have this “stream of task” model:

The first reason is that starting a task takes some time, so the GPU can prepare the next task before the previous one has finished: changing the active kernel on some part of the GPU takes time, and pipelining hides it.
Second, some tasks in the stream do not depend on each other, so they can be executed in parallel on the GPU: there is more work available at any moment, and less chance that some part of the GPU sits idle.
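The first reason can be sketched with a toy timing model. The function names and the numbers (100 kernels, 10 µs to launch, 50 µs to run) are purely hypothetical, and the model assumes that preparing kernel k+1 fully overlaps with running kernel k.

```python
def synchronous_time(n_kernels, launch_us, run_us):
    """CPU launches a kernel, waits for it to finish, then launches the next one."""
    return n_kernels * (launch_us + run_us)


def pipelined_time(n_kernels, launch_us, run_us):
    """Launch of the next kernel overlaps with the execution of the current one."""
    # The first launch is exposed, each later step is bound by the slower stage,
    # and the last kernel still has to run to completion.
    return launch_us + (n_kernels - 1) * max(launch_us, run_us) + run_us


print(synchronous_time(100, 10, 50))
print(pipelined_time(100, 10, 50))
```

The shorter the kernels are compared to their launch cost, the more the stream model matters.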

Some networks are very, very parallel to compute, like Transformers, and so their efficiency is very good:

there are only a few different layers, so few kernel changes and a lot of work for each kernel
there are only loose dependencies between computations (e.g. for each token), so the GPU has a lot of degrees of freedom when scheduling the different parts of the computation: if a kernel is waiting for some data, maybe another one can compute its result because it already has its own data available.

Why so many levels? Economics

Another reason is that it’s hard to get zero-defect silicon at this level of detail.

Ampere GPUs contain 54 billion transistors. A single defective transistor, and you may have to throw the GPU in the bin. The fraction of chips that pass the test is called the yield. Those chips are huge, and silicon real estate costs a lot, so each failed chip is a big loss, just for a small defect on a single transistor somewhere in the silicon.

So instead of throwing the chip in the bin, you test some sub-parts of the chip, and you just disable the failing ones. That means, for example, disabling a GPC (remember, there are 7 of them in an A100, instead of a theoretical 8). And you sell it in a lower-end card, with reduced specs. This process is called binning. If you are really good, and your chips are all perfect, you may even disable perfectly working parts of your chip, to segment your offer (and back in the day, some users were able to re-enable those disabled parts of silicon to get the bang without the buck…).

Developing for GPUs

So what are the consequences of the GPU architecture choices on development?

Kernels

First, you have to write some kernels, using the primitives you get. It’s a quite specific exercise, as you have to manually manage caches, registers, the synchronization of the different cores, etc. For simple stuff like matrix products, or activation layers, it’s quite straightforward, as they are completely parallel by nature.

But for some algorithms, like sorting, it can be a lot trickier to have something efficient, because you will have some issues using all the cores all the time.

Grids and performance

That’s because the kernel is only a small part of the problem. The other part is the way you distribute the work among cores, and the performance gains often come more from a good distribution than from an optimal kernel.

The way you distribute the work is usually done by partitioning your job into a 2D or 3D grid, then mapping each point of the grid to a thread, and finally mapping those threads to physical cores. Those dimensions will correspond for example to the dimensions of the output of a layer, plus the batch dimension.
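A minimal sketch of such a mapping, written in plain Python rather than in a real kernel language. The 16x16 tile size and the helper names are assumptions made for this example; real launch parameters depend on the GPU and on the kernel.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b


def launch_grid(rows, cols, batch, tile=16):
    """Number of blocks along (x, y, z) needed to cover a (batch, rows, cols) output."""
    return (ceil_div(cols, tile), ceil_div(rows, tile), batch)


def thread_to_output(block, thread, tile=16):
    """Output element (batch index, row, col) handled by one thread of one block."""
    bx, by, bz = block
    tx, ty = thread
    return (bz, by * tile + ty, bx * tile + tx)


grid = launch_grid(rows=100, cols=250, batch=8)
print(grid)
print(thread_to_output(block=(1, 2, 3), thread=(4, 5)))
```

When the output dimensions are not multiples of the tile size, the last blocks stick out of the matrix, so a real kernel has to mask out the threads whose coordinates are out of range, and those threads are wasted.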

As you have seen, in a GPU you get thousands of cores to work with, but with a really complex multi-layered structure. And this structure changes with the generation and model of the GPU. So it’s hard to find the right way to choose those mappings. You often have to run some benchmarks to find the right way to do a computation with given dimensions on a specific GPU, and that information is then used to choose the best strategy at runtime.

Memory

But the main and most difficult hurdle a developer faces when developing for GPU architectures is managing memory, and specifically memory transfers. The available memory bandwidth is huge, but the computing power is even larger. And just as you did not get a flat space of computing cores, you don’t get completely random access to the memory for free.

If you want to access a float stored in the main memory from a GPU core, you will wait for ages compared to the time it takes to compute a sum or a multiply. So you need to be able to start hundreds of computations at once, and when the data is finally available, you resume your kernel and execute a few local operations, until you need some more data from the main memory.

Some special ops like “prefetch” exist, to declare that you will need some data in a few instructions, and the role of the compiler is to reorder the instructions so you keep the memory controllers busy while keeping the core busy too. And at runtime, a large part of the GPU silicon is devoted to handling all those threads that are “in flight” and their current memory requests.

But there are some low-level constraints that may cost you a lot. Just as the base computation unit is 16 cores doing the same job, you only get peak memory performance if a group of threads (called a warp in CUDA lingo) loads memory in fairly large contiguous blocks, for example 16 floats = 64 bytes. This is called coalesced access. This is another reason, and often the main one, why choosing the right grid to dispatch your task on is important.
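To make coalescing concrete, here is a small sketch that counts how many aligned 64-byte segments one group of 16 threads touches. The 64-byte segment comes from the example above; the helper name and the stride of 128 floats are assumptions chosen for illustration.

```python
FLOAT_BYTES = 4
GROUP_SIZE = 16


def segments_touched(addresses, segment_bytes=64):
    """Count the distinct aligned memory segments hit by one group of threads."""
    return len({addr // segment_bytes for addr in addresses})


# Thread t reads float number t: all 16 reads fall into one 64-byte segment.
contiguous = [t * FLOAT_BYTES for t in range(GROUP_SIZE)]
# Thread t reads float number 128 * t: every read lands in a different segment.
strided = [t * 128 * FLOAT_BYTES for t in range(GROUP_SIZE)]

print(segments_touched(contiguous))
print(segments_touched(strided))
```

Both access patterns deliver 16 useful floats, but the strided one needs 16 memory transactions where the contiguous one needs a single one.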

So now, let’s unroll back to our initial issue, if you still remember it (I would forgive you, I barely can): why are sparse matrix ops slow?

If you look at the memory access pattern you need for a sparse matrix / matrix multiplication, you’ll see that by definition it’s hard to get those blocks of 16 floats when reading the matrix weights. And reading 16 contiguous floats is just a minimum: you’ll need to read more data at once to reach full performance.

That explains why a naive implementation can be at least an order of magnitude slower than the dense version.

Unless you make some compromise and use a block sparse matrix: each block, if large enough, will produce large contiguous accesses. 8x8 blocks are the minimum in OpenAI’s implementation, but you will get even better performance with 32x32 blocks.

But of course, you have to make sure that your model performs as well with block sparse matrices as with purely sparse ones. It can be the case if your matrices are large enough that the block size is small in comparison, but you have to check.
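The difference in contiguity can be checked on toy sparsity masks. This sketch is not OpenAI’s implementation: the mask generators, the 128x128 size and the 25% density are assumptions picked for the example.

```python
import random


def run_lengths(row):
    """Lengths of the runs of consecutive non-zero entries in one row."""
    runs, current = [], 0
    for v in row:
        if v:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def unstructured_mask(n, density, rng):
    """Each cell is kept independently with probability `density`."""
    return [[1 if rng.random() < density else 0 for _ in range(n)] for _ in range(n)]


def block_mask(n, block, density, rng):
    """Whole `block` x `block` tiles are kept with probability `density`."""
    nb = n // block
    keep = [[rng.random() < density for _ in range(nb)] for _ in range(nb)]
    return [[1 if keep[i // block][j // block] else 0 for j in range(n)] for i in range(n)]


def all_runs(mask):
    return [r for row in mask for r in run_lengths(row)]


rng = random.Random(0)
sparse_runs = all_runs(unstructured_mask(128, 0.25, rng))
block_runs = all_runs(block_mask(128, 32, 0.25, rng))
print(min(sparse_runs, default=0), min(block_runs, default=0))
```

With the block mask, every run of non-zero weights in a row is a multiple of the block width, so each read can cover at least one full block row. With the unstructured mask, nothing prevents runs of a single element.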

The other way is to convince an executive at Nvidia to add some hardware sparse support into their next-gen GPU, and now it’s done. More on this below!

Inter-GPU memory transfer

Memory bottlenecks exist within a GPU, but they get worse if you work with multiple GPUs sharing a single model: the bandwidth available between GPUs is way lower than between memory and cores inside one GPU.

PyTorch’s DataParallel is convenient, but it is not magic: after each batch, the GPUs must send their gradients to a single GPU, and the latter must then broadcast the updated model to each GPU. If your model is big enough, this transfer can take a very significant time, and the performance will suffer. Another point is that the transfers are synchronous: no GPU can work until it has received the new model.
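A back-of-the-envelope model shows how quickly this eats into the speedup. Everything here is hypothetical: the function is not a PyTorch API, the numbers are invented, and the model simply assumes that the root GPU spends a fixed time exchanging data with each of the other GPUs while nobody computes.

```python
def data_parallel_speedup(n_gpus, compute_s, comm_s_per_peer):
    """Throughput gain over a single GPU when every step ends with a synchronous exchange."""
    # Assumption: the root GPU talks to each of the (n_gpus - 1) peers in turn.
    step_s = compute_s + comm_s_per_peer * (n_gpus - 1)
    # n_gpus batches are processed per step, against one batch per compute_s on one GPU.
    return n_gpus * compute_s / step_s


for n in (1, 2, 4, 8):
    print(n, round(data_parallel_speedup(n, compute_s=1.0, comm_s_per_peer=0.24), 2))
```

With these made-up figures, 8 GPUs end up only about 3 times faster than one, the situation described at the start of this post.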

Another way to use multiple GPUs is to split a single model between the different GPUs, and then transfer only the “frontier” layers from a GPU to the next. Same thing for backpropagation. This may not be ideal either as the first layer will have to wait for the last to complete before the backpropagation can occur. The performance will depend heavily on the morphology of the network.

Ampere Highlights

Let’s finish where we started, with the latest Nvidia announcement.

Tensor Cores

With Volta, Nvidia introduced new “Tensor Core units”, and it looks like they are here to stay. Turing and now Ampere iterated on these new units.

You can see them as ultra-specialized units, with some significant dedicated silicon.

And this means a lot in terms of speed, especially for inference with quantized networks:

From https://youtu.be/yyR0ZoCeBO8?t=19

For training, it was a bit more difficult on Volta, as working with FP16 was possible but a bit tricky (the 8x gain in speed was indeed tempting).

But now with Ampere, Nvidia announces Tensor Core support for FP32 workloads (through the new TF32 format) and even for FP64. And it looks like this is 10 times faster than FP32 on Volta, and 20 times faster with sparsity. And this holds for training and inference, because it’s just big tensor ops, nothing special here.

It looks like we’ll be getting some nice toys to play with.

Sparsity

From the Nvidia Blog:

NVIDIA has developed a simple and universal recipe for sparsifying deep neural networks for inference using this 2:4 structured sparsity pattern.

If you have read the first part of this series, you should feel at home.

The idea is simple: maybe using a fully dense matrix is not useful. And what Nvidia is claiming is that it’s true: keeping only half the weights has a minimal impact on accuracy.

And so they propose a method to reduce the number of weights. But what is more interesting is that the A100 GPU has new instructions to process these sparse matrices efficiently, at twice the speed of dense ones (no magic here, only half of the multiplications occur, of course).

So anyone can try their own method to sparsify the matrices and use the new instructions to speed things up. The only constraint is that the sparsity pattern is fixed: in every group of 4 consecutive cells, at most 2 can be non-zero.
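As an illustration of the pattern, here is one simple way to obtain it: in every group of 4 weights, keep the 2 with the largest magnitude and zero the rest. Magnitude-based selection is an assumption made for this sketch, not necessarily Nvidia’s recipe, and `prune_2_of_4` is a hypothetical helper, not a library function.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each consecutive group of four."""
    if len(weights) % 4:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.1, -0.9, 0.5, 0.05, 1.0, 2.0, -3.0, 0.0]
print(prune_2_of_4(row))
```

After pruning, each group of 4 can be stored as 2 values plus their positions in the group, which is what makes a fixed-size compressed format possible.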

You can compare this to the way textures are compressed to save memory, but for floating-point computation and not just graphics.

I see it mostly for inference at first, but I am sure some clever people will come up with imaginative ways to use those new capabilities for training too, as it’s just some new compute ops.

What about “sparse block sparse matrices”, by combining soon to be released OpenAI “block sparse matrices” with this? We’ll see.

Conclusion

I hope you enjoyed this second part of our trip to sparse land, even if it may have been a bit harder to digest.

I also hope this will help you better appreciate the level of mastery shown by the developers of the PyTorch and Keras teams: they manage to hide all this complexity and make it easy for mere mortals to use these supercomputers-on-a-chip to their full power, in just a few lines of Python.

Next time we will get back to more usual depths: we’ll see some techniques we can use to train sparse networks, and how performance is impacted.

By the way, congrats to Victor Sanh, Thomas Wolf, and Alexander M. Rush for their latest paper “Movement Pruning: Adaptive Sparsity by Fine-Tuning”!

Sparse Neural Networks (2/N): GPU Performance. was originally published in HuggingFace on Medium, where people are continuing the conversation by highlighting and responding to this story.

How can we stop the coronavirus?

François Lagunas — Thu, 12 Mar 2020 16:18:12 GMT

How can we stop the coronavirus?

With a strict quarantine. There is no other way left.

Is it dangerous?

Yes, for the elderly. For them, it is considered to be 10x more dangerous than the flu. The mortality is:

3.6% for ages 60–69
8.0% for ages 70–79
22% for ages 80+

(source)

So yes, you are not taking precautions for yourself, but for your parents and your grandparents.

Is it more contagious than the flu?

It is very contagious. Much more than the flu.

The number of people currently dying in Italy doubles every 2.2 days. Yes, you read that correctly. In one week, the number of victims increases by a factor of 9. And the same again the following week. Check for yourself (all data in this post come from the ECDC):

Tuesday, Feb 25 :   6
Tuesday, Mar 03 :  52 (x 8.6)
Tuesday, Mar 10 : 464 (x 8.9)

The only explanation for this is that the coronavirus is spreading through the population at very high speed. It would be interesting to know why, and how, but we do not really need to know the details here. The numbers are there: the virus is horribly contagious.

Do not make any assumption about what is safe to do. Avoid all contact, whether through skin or through breathing. Wearing a mask seems to be a common-sense measure, especially when in contact with fragile people, but it is useful in general.

Public transportation and all activities that are not strictly necessary must be stopped immediately if they bring people into contact, even in small numbers.

You need a single graph to understand this.

The graph

Here is the total number of deaths for a few countries, on a slightly special chart that we will explain.

(the source code is here, it will be updated daily)

For those who are not familiar with maths, don’t worry, we will keep it very simple.

The vertical axis is said to be “logarithmic”. That’s a big word to say that each time you go up one unit, the values are multiplied by 10. Look on the left: 10¹ means 10, 10² means 100, 10³ means 1000, etc.

We use this scale because on such a chart an epidemic initially shows up as a straight line. After measures are taken, the line should bend downwards, reflecting a slower spread (see the China curve for an example).

The dotted line represents the growth we talked about at the beginning: the number of deaths doubling every 2.2 days.

What can we say about Italy?

The number of deaths in Italy follows the dotted line precisely.

No bending. And so no sign of improvement.

That being said, this is the number of deaths, and it will be the last thing to improve, as there is a delay between the moment people fall ill and the moment some of them die.

So it is possible that the situation is already improving without it showing on the graph. However, for the same reason, the number of deaths will keep rising for some time, that is, for one to three weeks. That potentially means 9x, 81x or more deaths than the current count. And we have already passed 800 deaths in Italy (as of March 11, 2020).

But Italy made mistakes at the start, this will not happen to my country.

That is unlikely. As you can see on the graph, exactly the same trend is occurring in France, and it is also the case for many other Western countries that are not shown. Worse, Spain seems to be on an even more serious trajectory.

The only countries that seem to be faring a bit better are:

Much better:

China and South Korea: they handled the problem differently, but it worked, and even so China’s toll was very heavy.
Hong Kong, Taiwan: they managed to avoid the epidemic and had only a few cases.

Only slightly better:

Germany, but the authorities do not seem to count cases with co-morbidities.
The United States, but on checking, their number of cases is indeed doubling along the same trend, which would tend to show that their number of deaths is underestimated.
Iran, but it is hard to know exactly what is really happening there.

But taking proportionate measures is already slowing down the spread!

Wrong. Common sense would suggest that it should have an impact. But look at the graph. After a bad start, Italy took serious measures, similar to those taken by France. Yet the curve is a straight line. No visible effect. And the same is true for France and Spain.

Why? Because the virus is too contagious. When you quarantine a single region, you assume that there are no contaminated people outside this region, or only a few. But you are wrong. There are already 10x more contaminated people, without any symptoms, OUTSIDE the quarantined region. Your regional/proportional quarantine has no measurable effect.

Look at the graph. Think about it. Do you see the slightest improvement after the local quarantines? Do you see the slightest improvement in France at any point?

The total quarantine in Italy is now in effect. But it will not show on this graph for a while, because of the delay we talked about. So the population will have to be patient: the situation will improve, but it will not be visible for a few weeks.

It is an illusion to tell only the people who are ill to stay home from work. You can be a healthy carrier of the virus without ever knowing it. All the more so as it is impossible to get tested without symptoms.

But my country has twice as many intensive care beds as Italy.

That’s good. Your country will hold out two more days. That is how an epidemic works:

2 more days, twice the infected, twice the patients in intensive care, and twice the deaths.

Only a strict quarantine can bring an improvement, and the famous “slowdown” that is sought so that health services can handle the load of new patients.

The other measures have proven ineffective, not because they were badly carried out (I have full confidence in epidemiological investigations, for example), but because the virus is faster than we are.

If we act too late, our hospitals will be overwhelmed, no matter how many beds we have. And it is already too late for a good number of Western countries to avoid reaching this maximum capacity.

The reason? The virus is too contagious. If we wait, the peak will naturally be far too high, and will be reached too quickly.

What can I do?

Ask your local authorities to act now, and to enforce a strict quarantine. Not a quarantine where people can go to work and merely avoid going to restaurants after 6 pm.

China quarantined Hubei when there were barely 50 deaths. Their final toll is 3000 deaths. And their quarantine was among the strictest.

Remember the images of a deserted city. Do you think your country will fare better with a half-respected quarantine? You are wrong. If your country has more than 50 deaths today, it will have more than 3000 deaths in the end, that is almost a certainty. And there are already 10–100x more invisible cases in your country than detected ones.

A strict quarantine. Everybody stays at home. For all contaminated countries. For one month. With financial help for people who cannot afford not to go to work. With help for elderly people at home, providing them with food and sparing them any contact. And the epidemic will be largely contained, or almost. This is our best chance.

The alternative is to take half-measures, or measures that are out of sync from one country to the next, with each country recontaminating the others, which would make the pandemic last for months.

That would not be good for our economy.

And it would kill a great many of our elders.

How can we contain the coronavirus?

François Lagunas — Wed, 11 Mar 2020 11:40:12 GMT

TLDR: strict quarantine. No other measure will have an effect.

Is it deadly?

Yes, for the elderly. You can consider it to be something like 10x deadlier than the flu for the elderly:

3.6% for 60–69
8.0% for 70–79
22% for 80+

(source)

So, yes, you are not taking precautions for yourself, but for your parents and for your grandparents.

Is it more contagious than the flu?

It’s very contagious. Much more than the flu.

The number of people dying in Italy is currently doubling every 2.2 days. Yes, you read that correctly. In one week, the death count is multiplied by roughly 9, and again the following week. Check out Italy for yourself (all data in this post are from the ECDC):

Tuesday, Feb 25 :   6
Tuesday, Mar 03 :  52 (x 8.6)
Tuesday, Mar 10 : 464 (x 8.9)
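The link between “doubling every 2.2 days” and “x9 per week” is a two-line computation. The helper names are made up for this sketch; the counts are the three data points listed above.

```python
import math


def weekly_factor(doubling_days):
    """Growth factor over 7 days for a count that doubles every `doubling_days`."""
    return 2 ** (7 / doubling_days)


def doubling_time(count_start, count_end, days):
    """Doubling time implied by two counts taken `days` apart, assuming exponential growth."""
    return days * math.log(2) / math.log(count_end / count_start)


print(round(weekly_factor(2.2), 2))
print(round(doubling_time(6, 52, 7), 2), round(doubling_time(52, 464, 7), 2))
```

Going the other way, both weekly jumps in the table give a doubling time close to 2.2 days.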

The only explanation for this is that the coronavirus is propagating at a very fast pace. It would be interesting to know why, and how, but you don’t really need to. The numbers speak for themselves: it’s terribly contagious.

Do not assume anything about what is safe. Avoid contact, period.

Public transportation and all non-necessary activities should be immediately stopped.

You need a single graph to understand this.

The graph

Here is the count of deaths for a few countries, in a somewhat special chart that we’ll explain below.

(source code here, it will be updated each day)

For those who are not familiar with maths, don’t worry, we’ll keep it simple.

The vertical axis is logarithmic. That’s a big word to say that each time you go up one unit, the value is multiplied by 10. Look on the left side: 10⁰ means 1, 10¹ means 10, 10² means 100, 10³ means 1000, etc.

We are using it because on such a chart, an epidemic should initially look like a straight line. After some precautions are taken, the line should curve downwards (see the China curve for an example).

This dotted line represents the rate we talked about: doubling every 2.2 days.
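Why a straight line? Taking the logarithm of a count that doubles at a fixed pace gives values that rise by the same amount every day. A quick check, assuming an idealized epidemic that starts at 6 deaths (the first data point above) and doubles every 2.2 days; `log10_series` is a name invented for this sketch:

```python
import math


def log10_series(initial, doubling_days, n_days):
    """log10 of an exponentially growing count, one value per day."""
    return [math.log10(initial * 2 ** (d / doubling_days)) for d in range(n_days)]


series = log10_series(initial=6, doubling_days=2.2, n_days=15)
daily_steps = [b - a for a, b in zip(series, series[1:])]

# Every daily step is log10(2) / 2.2, so the points lie on a straight line.
print(round(min(daily_steps), 6), round(max(daily_steps), 6))
```

Any bending of the real curve away from that constant slope means the doubling time itself is changing.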

What can we say about Italy?

The death count in Italy is exactly following the dotted line.

No bending. No sign of improvement.

That being said, it’s a death count, so it will be the last thing to improve, as there is some lag between getting ill and then dying, which is somewhere between one and three weeks.

So maybe things are improving right now. But for the same reason (lag) the current trend of deaths will probably go on for one to three weeks. That means 9x, 81x, or more deaths than now. And we have already surpassed 600 deaths in that single country.

But Italy made mistakes initially, it won’t happen to my country

Implausible. As you can see on the graph, the exact same trend is observed in France and Spain, with a delay of 9 days. The only countries that seem to fare better now are:

Much better:

China and South Korea: they did it differently, but it worked in the end
Hong Kong, Taiwan: they avoided it almost completely.

Only slightly better:

Germany, but Germany seems to not report cases with co-morbidities
USA, but when checking the “cases” I realized that cases are growing at the 2.2-day doubling rate, so it’s implausible that deaths are not growing at the same rate.
Iran (same remark)

But taking soft measures is already slowing down the spread.

Wrong. Look at the graph. Italy took some serious measures after an initial delay. The curve is still a straight line. No visible effect. And the same is true for France or Spain.

Why? Because it’s so contagious. When you are locking down a region, you think that there were contaminated people only in this region. You’re wrong. There are already 10x (or more) people contaminated, without any symptoms, OUTSIDE the locked-down region. So your regional/proportional lockdown has no measurable effect.

See the graph. Think about it. Do you see any improvement in Italy after regional lockdowns? Do you see any improvement in France at any point?

Right now the lockdown is in effect for the whole of Italy, but it won’t show on the graph for some time, as it takes some time to fall ill. People will have to be patient: improvement will come, but it won’t be visible for a few weeks.

But my country’s hospitals have 2x the number of ICU beds Italy has.

Fine. Your country will resist 2 days more. That’s the nature of an epidemic:

2 days more, twice the infected, twice the ICU patients, twice the deaths.
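The same arithmetic can be written for any capacity ratio. This is a rough sketch, assuming the number of ICU patients keeps doubling every 2.2 days; `extra_days` is an illustrative name.

```python
import math


def extra_days(capacity_ratio, doubling_time_days=2.2):
    """Days gained before saturation when capacity is `capacity_ratio` times larger."""
    return doubling_time_days * math.log2(capacity_ratio)


for ratio in (2, 4, 10):
    print(f"{ratio}x the beds buys about {extra_days(ratio):.1f} days")
```

Under this assumption, even ten times the beds buys only about a week.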

Only once the lockdown is in effect will you see an improvement on this.

If you act too late, your hospitals will be overwhelmed, no matter the number of beds you have. And it’s already too late for western countries to avoid reaching hospitals’ full capacity.

The reason? The virus is way too contagious. So the spike will naturally be ‘peaky’.

What can I do?

Ask your local authorities to act now and enforce a total lockdown. A total lockdown is not one where people still go to work and merely avoid going out to eat in the evening.

China locked down Hubei when there were only 50 deaths or so. They ended up with 3000 deaths. And their lockdown was militarily enforced. Recall the images of a deserted city. Do you think your country will fare better with a partially enforced lockdown? You are wrong. If your country has more than 50 deaths now, you will have more than 3000 deaths in the end. There are already 10–100x more cases that are just invisible in your country.

A FULL lockdown. Everybody staying at home. For all countries. For at least one month. Financially helping individuals who cannot afford not to work. Helping elderly people who are at home by providing them with food while avoiding all contact. With these measures this will soon be over, or almost. This is our best shot at stopping this.

The alternative is to take half-measures, or desynchronized measures across countries, so we re-contaminate each other, and let the pandemic carry on for months.

It would not be good for our economy.

And it would kill a lot of our elders.

Thanks to Meg Wilmore for the proofreading.

Is the future of Neural Networks Sparse? An Introduction (1/N)

François Lagunas — Tue, 04 Feb 2020 17:08:58 GMT

From principles to real-world library support.

TLDR: Yes

Hi, I am François Lagunas.

I am doing Machine Learning research, and I have been working for the last few months on using sparse matrices, especially in Transformers. The recent announcement that OpenAI is porting its block sparse toolbox to PyTorch is really big news:

“We are in the process of writing PyTorch bindings for our highly-optimized blocksparse kernels, and will open-source those bindings in upcoming months”

I was talking about it with the outstanding Hugging Face team (I am one of their early investors), and I wanted to share my excitement with you!

What is a Sparse Matrix?

A sparse matrix is just a matrix with some zeros. Usually, a lot of them. So every place you are using a dense matrix, in a linear layer, for example, you could be using a sparse one.

Matrices with increasing sparsity

The sparsity of a matrix is the fraction of its entries that are zero.

The pros? If you have a lot of zeros, you don’t have to compute some multiplications, and you don’t have to store them. So you may gain on size and speed, for training and inference (more on this today).
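As a minimal illustration of both ideas, the sketch below measures the sparsity of a small matrix and counts how many multiplications a matrix-vector product really needs when zeros are skipped. The matrix is stored densely here for readability; a real sparse format would store only the non-zeros. The function names are illustrative, not taken from any library.

```python
def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


def sparse_matvec(matrix, vector):
    """Matrix-vector product that skips zeros; also returns the multiplications done."""
    result, multiplications = [], 0
    for row in matrix:
        acc = 0.0
        for weight, x in zip(row, vector):
            if weight != 0:
                acc += weight * x
                multiplications += 1
        result.append(acc)
    return result, multiplications


W = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.5],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
x = [1.0, 2.0, 3.0, 4.0]

y, mults = sparse_matvec(W, x)
print(f"sparsity: {sparsity(W):.4f}, multiplications: {mults} instead of 16")
```

Here 13 of the 16 entries are zero, so only 3 multiplications are needed instead of 16.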

The cons? Of course, having all these zeros will probably have an impact on network accuracy/performance. But to what extent? You may be surprised.

Where are they from?

The first researchers/engineers to use sparse matrices were Finite Elements users.

A 2D mesh (roof of Omni Coliseum, Atlanta) and its finite element matrix (source).

When you have to deal with large physical simulations, you get a large graph of interconnected vertices.

Each vertex is a point of your system, and each edge connects two vertices. That means that these two points will have some influence on each other in the model. And so there is a non-zero value in the matrix that describes the graph.

This last sentence sums it up: you need non-zero values in the matrix when two dimensions are interacting in some way.
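A toy version of this, assuming a chain of six points instead of a real mesh: each edge produces a pair of non-zero entries, and everything else stays zero. The value 1.0 is a placeholder for whatever the physical model would compute, and `graph_to_sparse` is an illustrative name.

```python
def graph_to_sparse(num_vertices, edges):
    """Map (row, col) -> value, keeping only the interacting pairs."""
    entries = {}
    for i in range(num_vertices):
        entries[(i, i)] = 1.0  # each point interacts with itself
    for i, j in edges:
        entries[(i, j)] = 1.0  # placeholder value for the interaction
        entries[(j, i)] = 1.0  # the influence goes both ways
    return entries


# A chain of 6 points: 0-1-2-3-4-5
num_vertices = 6
edges = [(i, i + 1) for i in range(num_vertices - 1)]

entries = graph_to_sparse(num_vertices, edges)
density = len(entries) / (num_vertices * num_vertices)
print(f"{len(entries)} non-zeros out of {num_vertices ** 2} entries")
```

Points 0 and 5 are not connected, so there is no entry for the pair (0, 5). With thousands of vertices, almost every pair is in that situation.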

Now getting back to ML, you should ask yourself the same question: are all the dimensions of my input vector interacting with all the others? Usually not. So going sparse may be useful.

We actually have a very good, and famous, example of a successful trip to sparse-land: convolutional layers.

Learned convolutional filters. From http://cs231n.github.io/convolutional-networks/

Convolutional layers are a smart and efficient way to implement a sparse transformation on an input tensor.

When processing images, it comes down to two things:

Sparsity: the transformation is local → each output pixel should depend on a few neighboring input pixels.

Invariance: the transformation does not depend on the position in the image.

Then you just add the constraint that the transformation is linear: if you were to write this transformation as a single matrix, you would get a HUGE matrix with only a few non-zeros. But of course, the right way to do this is to multiply the input tensor by a small set of small matrices (each square in the image above).
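The equivalence can be checked on a one-dimensional toy case. This sketch assumes a "valid" sliding window with no padding and a 3-weight kernel. It builds the explicit matrix and compares it with the direct computation; all names are illustrative.

```python
def conv1d(signal, kernel):
    """Sliding-window product without padding (what ML calls a convolution)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def conv_as_matrix(n, kernel):
    """Explicit matrix of the same transformation: mostly zeros."""
    k = len(kernel)
    rows = []
    for i in range(n - k + 1):
        row = [0.0] * n
        row[i:i + k] = kernel  # same weights on every row, shifted by one
        rows.append(row)
    return rows


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [1.0, 2.0, -1.0]

M = conv_as_matrix(len(signal), kernel)
entries = len(M) * len(M[0])
nonzeros = sum(1 for row in M for w in row if w != 0)
print(f"matrix entries: {entries}, non-zeros: {nonzeros}, stored weights: {len(kernel)}")
```

Both sparsity (few non-zeros per row) and invariance (the same three weights repeated on every row) are visible in `M`. The convolution only ever stores the three weights.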

The importance of convolutions in today’s ML success is obvious. But you can see that finding a clever way to make things sparse sounds like a good recipe to save time and space.

Where are they useful?

Convolutions are already an efficient form of sparsity, so you could try to make them even more sparse, but some other networks contain much larger matrices that may benefit from sparsity: Transformers.

And those are getting bigger and bigger. We greatly exceeded 1 billion parameters in 2019, and it’s not stopping here. The cost to train and to use those networks is becoming impractical, so every method to reduce their size will be welcome.

From https://devblogs.nvidia.com/training-bert-with-gpus/

Why is the OpenAI announcement so important?

So, if everything is fine in sparse-land, we should all be trying sparse matrices, shouldn’t we?

Yes. But there is this stupid thing called implementation. It’s easy to see the theoretical improvements we could get with sparse compute. But the support in libraries is quite … sparse.

PyTorch developers, for example, have made a significant effort to support sparse compute. But there is still a big gap in performance between dense and sparse matrix operations, which defeats the whole purpose of using them. Even memory usage is quite large: sparsity has to be more than 80% for a sparse matrix to save any room (more on that in my next post). Even basic serialization was broken before version 1.4. The reason is that the underlying libraries (for example cuSPARSE) are not doing a great job, because the problem is ill-suited to the way GPUs work.
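One back-of-the-envelope accounting that lands on a figure of this order is a coordinate (COO) layout where each non-zero stores a 32-bit value plus two 64-bit indices. These sizes are assumptions made for this sketch; the post itself defers the explanation to the next episode.

```python
VALUE_BYTES = 4  # one float32 value (assumption)
INDEX_BYTES = 8  # one int64 index (assumption: 64-bit row and column indices)


def dense_bytes(rows, cols):
    """Memory of a dense matrix: every entry is stored."""
    return rows * cols * VALUE_BYTES


def coo_bytes(rows, cols, sparsity):
    """Memory of a COO matrix: (row index, column index, value) per non-zero."""
    nonzeros = round(rows * cols * (1.0 - sparsity))
    return nonzeros * (2 * INDEX_BYTES + VALUE_BYTES)


# Sparsity above which the COO layout becomes smaller than the dense one.
break_even = 1.0 - VALUE_BYTES / (2 * INDEX_BYTES + VALUE_BYTES)
print(f"break-even sparsity: {break_even:.0%}")

for s in (0.5, 0.8, 0.9, 0.99):
    ratio = coo_bytes(1000, 1000, s) / dense_bytes(1000, 1000)
    print(f"sparsity {s:.0%}: sparse storage is {ratio:.2f}x the dense size")
```

Under these assumptions each non-zero costs 20 bytes instead of 4, so the sparse layout only wins once fewer than one entry in five is non-zero.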

So the OpenAI announcement on their block sparse tools is very good news for those who want to use sparse ops without sacrificing training speed (and it looks like some people have been waiting for some time now). And we are not talking about a few percent.

“Our kernels typically performed one or two orders of magnitude faster in terms of GFLOPS.”

From OpenAI blocksparse paper

(The worst thing is that the paper concludes that cuBLAS is faster than cuSPARSE even with very sparse matrices. How sad.)

The magic keyword here is “block”. It’s hard to implement general sparse matrix computations on GPUs efficiently. But it gets much easier if you add a “reasonable” constraint on the form of the matrices: their non-zeros should be grouped in small fixed-size blocks, which makes GPU processing much easier to parallelize efficiently. Typical block sizes are 8x8, 16x16 or 32x32, with 16x16 already giving very good performance and 32x32 a slightly better one.

An 8-block-sparse matrix
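To make the block constraint concrete, here is a toy CPU sketch of a block-sparse matrix-vector product. It is not OpenAI's kernel or storage format: the dictionary layout, the function name and the tiny 2x2 block size are all assumptions chosen for readability.

```python
def block_sparse_matvec(blocks, block_size, num_block_rows, vector):
    """`blocks` maps (block_row, block_col) -> dense block, given as a list of rows.

    Blocks that are absent from the mapping are entirely zero and cost nothing.
    """
    result = [0.0] * (num_block_rows * block_size)
    for (block_row, block_col), block in blocks.items():
        row0 = block_row * block_size
        col0 = block_col * block_size
        # Inside a block the work is a small, regular, dense product.
        for r in range(block_size):
            acc = 0.0
            for c in range(block_size):
                acc += block[r][c] * vector[col0 + c]
            result[row0 + r] += acc
    return result


BLOCK = 2
# A 4x6 matrix seen as a 2x3 grid of blocks, with only 2 non-zero blocks.
blocks = {
    (0, 0): [[1.0, 2.0], [3.0, 4.0]],
    (1, 2): [[1.0, 0.0], [0.0, 1.0]],
}
x = [1.0, 1.0, 5.0, 5.0, 7.0, 9.0]

y = block_sparse_matvec(blocks, BLOCK, 2, x)
print(y)
```

Only one pair of indices is stored per block rather than per value, and each block is a small dense product with a fixed shape, which is the kind of regular work that is easier to map onto parallel hardware.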

Of course, the “block” constraint may be crippling some sparsification algorithms, or at least it would require some changes to take it into account.

But at least we can play with large, high-sparsity matrices, and the block constraint may not be a big issue: if you think about it, it means that there is some locality in the dimensions, and that sounds like quite a reasonable constraint. That’s the same reason band matrices have been useful in the past (finite difference, finite elements), and it was a much stronger constraint.

Band matrix

Conclusion

I hope I have convinced you that 2020 will be the sparse network year (it already has two zeros, that’s a sign).

Next time, for those who are curious about what happens when they use some CUDA-based PyTorch code, we’ll dig a bit deeper into GPU internals (and we will understand why block sparse code outruns general sparse code by a large margin).

This article series will continue with the different techniques that have been proposed to make networks sparse, and with their potential long-term benefits.
