Threading in Java.

Beka Kodirov
4 min read · Jun 27, 2016

--

Part II. Hardware.

— Tell us about the hierarchy of caches L1 / L2 / L3. What caused its appearance?

L1, L2, and L3 caches are different memory pools, similar to the RAM in a computer. They were built in to decrease the time it takes the processor to access data: in order to let the CPU continue without wait states (sitting idle, waiting for data to arrive), a small amount of low-latency memory is placed nearby to hold the most relevant data. The difference between the levels is where they are placed: L1 sits on the CPU chip, closest to the processor; L2 sits between the CPU and RAM; and L3, if present, is on the motherboard. The architecture they are built with also differs considerably. For example, the L1 cache is built using larger transistors and wider metal tracks, trading off space and power for speed, while the higher-level caches are more tightly packed and use smaller transistors. When the processor needs to read from or write to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor reads from or writes to the cache, which is much faster than going to main memory.

  • L1 cache is closest to the core (each core has its own), and typically there is one for data and one for instructions; sizes are 8–64 KB.
  • L2 can be shared between multiple cores and is in the 2–4 MB range.
  • L3 is on the die as well in some systems and can be 8–16 MB, although some processors have replaced it with L2 cache.

— What is a “Cache line”?

Data is transferred between memory and cache in blocks of fixed size, called cache lines. When a cache line is copied from memory into the cache, a cache entry is created. The cache is partitioned into lines (also called blocks), each holding 4–64 bytes. During data transfer between the CPU and RAM, a whole line is read or written. Each line has a tag that indicates the address in main memory from which the line was copied. In modern CPUs the cache line size is usually 64 bytes.

Aside: on a Mac you can get your CPU’s line size by running sysctl machdep.cpu.cache.linesize. On Linux you use getconf: getconf LEVEL1_DCACHE_LINESIZE
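If you prefer to check it from Java, here is a minimal sketch, assuming a Linux system that exposes cache information through sysfs (the path below is the usual one for cpu0’s L1 data cache, but it can differ between kernels and architectures):

import java.nio.file.Files;
import java.nio.file.Paths;

public class CacheLineSize {
    public static void main(String[] args) throws Exception {
        // L1 data cache line size of cpu0, as reported by the Linux kernel.
        String size = Files.readAllLines(Paths.get(
                "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size")).get(0);
        System.out.println("Cache line size: " + size.trim() + " bytes");
    }
}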

— What is “false sharing”? Is it bad? How do you deal with it?

Let’s take a look at this situation.

public final class X {
    public volatile int f1;
    public volatile int f2;
}
On Oracle JDK 1.8, instances of this class are laid out in memory as shown below
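If you want to inspect the layout yourself, the OpenJDK JOL (Java Object Layout) tool can print it. A minimal sketch, assuming the jol-core dependency is on the classpath and the class X above is visible:

import org.openjdk.jol.info.ClassLayout;

public class LayoutDemo {
    public static void main(String[] args) {
        // Prints the offsets and sizes of the object header, f1 and f2.
        System.out.println(ClassLayout.parseClass(X.class).toPrintable());
    }
}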

We have declared all fields as volatile indicating that these fields may be used by different threads and we want to ensure writes are visible to all of them. As a result, the runtime will emit code to ensure that writes to a field are visible across CPU cores. But how does this work?

In this example I assume a hypothetical CPU with one cache layer (L1 cache), where each cache belongs to one core. What happens if one thread, running on Core0, is constantly writing to X.f1 and another thread on Core1 is constantly reading from f2?

Core0 knows that it does not own the cache line holding X exclusively, so it has to broadcast an “Invalidate” message across the bus after changing the cache line, to notify other cores that their copies are stale. Core1 is listening on the bus and invalidates the corresponding cache line. Consequently, this produces a lot of unnecessary bus traffic even though the two cores operate on different fields. This phenomenon is known as false sharing. False sharing causes only a performance problem; it does not affect data correctness. How do you deal with it? Keep reading!
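Here is a rough sketch of the scenario described above, reusing the X class from the previous example (the class name and iteration count are illustrative). One thread hammers f1 while another hammers f2; with both fields on the same cache line, the run is noticeably slower than with the padded variants shown in the next section:

public class FalseSharingDemo {
    static final X shared = new X();
    static final long ITERATIONS = 100_000_000L;

    public static void main(String[] args) throws InterruptedException {
        // Each volatile write invalidates the other core's copy of the shared cache line.
        Thread t1 = new Thread(() -> {
            for (long i = 0; i < ITERATIONS; i++) shared.f1 = (int) i;
        });
        Thread t2 = new Thread(() -> {
            for (long i = 0; i < ITERATIONS; i++) shared.f2 = (int) i;
        });
        long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("Took " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}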

— What is memory padding?

In the previous example we had an object 32 bytes in size, while the cache line size is 64 bytes. Our class X has two members, f1 and f2, and both of them lie in the same cache line. Because that line is invalidated so frequently, read and write operations become more expensive. How can we solve this issue? We can separate f1 and f2 into different cache lines; then changing f1 does not affect operations on f2. The idea is to stuff enough fields between f1 and f2 so that they end up on different cache lines, as sketched below. Depending on the JVM implementation this may not work, because a JVM is free to lay out fields in memory as it sees fit.
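A sketch of that idea, assuming 64-byte cache lines (the class name and padding fields are illustrative; as noted above, the JVM may still reorder fields, although HotSpot tends to keep fields of the same size together in declaration order):

public final class PaddedX {
    public volatile int f1;
    // 15 ints = 60 bytes of filler, intended to push f2 past a 64-byte line boundary.
    public int p01, p02, p03, p04, p05, p06, p07, p08;
    public int p09, p10, p11, p12, p13, p14, p15;
    public volatile int f2;
}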

@Contended was introduced in Java 8 with JEP 142. With this annotation, fields can be declared as contended. The current OpenJDK implementation will then pad the field appropriately, inserting 128 bytes of padding after each annotated field. 128 bytes is twice the typical cache line size and was chosen to account for cache prefetching algorithms, specifically algorithms that fetch two adjacent cache lines.
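A sketch of the same class using the annotation instead of manual filler fields. In Java 8 the annotation lives in sun.misc (it moved to jdk.internal.vm.annotation in Java 9), and for application classes it only takes effect when the JVM is started with -XX:-RestrictContended:

public final class ContendedX {
    // Each annotated field is padded by the JVM so f1 and f2 land on different cache lines.
    @sun.misc.Contended
    public volatile int f1;
    @sun.misc.Contended
    public volatile int f2;
}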

— What is thread parallelism? Task parallelism?

As a simple example, if we are running code on a 2-processor system (CPUs “a” & “b”) in a parallel environment and we wish to do tasks “A” and “B”, it is possible to tell CPU “a” to do task “A” and CPU “b” to do task “B” simultaneously, thereby reducing the run time of the execution. The tasks can be assigned using conditional statements as described below.
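A minimal sketch of that idea in Java (the task bodies are placeholders): two workers are started on two threads, which the operating system scheduler can run on different cores, and a conditional inside each worker assigns it a different task.

public class TaskParallelismDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[2];
        for (int cpu = 0; cpu < workers.length; cpu++) {
            final int id = cpu;
            workers[cpu] = new Thread(() -> {
                // The conditional mirrors the "if CPU is a, do task A" idea from the text.
                if (id == 0) {
                    System.out.println("Task A on " + Thread.currentThread().getName());
                } else {
                    System.out.println("Task B on " + Thread.currentThread().getName());
                }
            });
            workers[cpu].start();
        }
        for (Thread w : workers) w.join();
    }
}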

Task parallelism emphasizes the distributed (parallelized) nature of the processing (i.e. threads), as opposed to the data (data parallelism). Most real programs fall somewhere on a continuum between task parallelism and data parallelism.

Thread-level parallelism (TLP) is the parallelism inherent in an application that runs multiple threads at once. This type of parallelism is found largely in applications written for commercial servers such as databases. By running many threads at once, these applications are able to tolerate the high amounts of I/O and memory system latency their workloads can incur — while one thread is delayed waiting for a memory or disk access, other threads can do useful work.
