Write barrier optimizations in regions

Maoni0
Oct 2, 2022

This is another blog entry about regions. As I mentioned in my last blog entry, regions on their own will naturally give us some perf benefits already but they are really meant to be a foundation for us to unlock optimizations that are difficult to do with segments. Given the time we had in .NET 7.0, we unlocked some of these optimizations and obviously will continue to unlock more in the upcoming releases.

One of the optimizations we did was to have a more precise write barrier. To talk about that, let's first talk about what purposes the write barrier serves in the .NET runtime. If you are already familiar with this, feel free to skip to the section that describes the 7.0 optimization.

What is a write barrier?

A write barrier is a piece of code that gets invoked when you do this -

obj_x.field0 = obj_y;

when obj_x.field0 is a reference field, in other words, obj_y is an object on the GC heap. We refer to obj_x.field0 as "location" from here on, as it's the location we are assigning to, and obj_y as the child object. And I’ll use the WB acronym in place of Write Barrier.

When .NET was started, the WB’s only purpose was to facilitate the GC being generational. A generational GC means we don't collect the full heap all the time - in fact we do that pretty rarely. When we are only collecting part of the heap, we need to know if the rest of the heap is pointing into the portion that we are collecting. For example, if we are doing a gen1 GC, it means we are only collecting gen1 and its younger generations (which in this case is just one generation - gen0) so the rest of the heap - gen2, LOH and POH (POH did not exist till .NET 5) - is not collected. If objects there are pointing to any gen1 or gen0 objects, that needs to be discovered in some way so GC can keep these gen0 or gen1 objects alive if they are not already considered live by other means such as being pointed to by a stack root. In the WB we indicate this by setting a byte in a data structure called the card table. The reason we set a byte instead of a bit, even though a card is a single bit, is that a byte can be written with a plain store, whereas setting a single bit safely needs an interlocked read-modify-write.
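To make the byte-per-store idea concrete, here is a minimal single-threaded sketch of a card table that is marked one byte at a time. It is not the runtime's code: the type `ByteCardTable`, the constant `kSketchCardByteShift` and the 2048-byte granularity are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical granularity: one card byte covers 2^11 = 2048 bytes of heap.
constexpr unsigned kSketchCardByteShift = 11;

struct ByteCardTable {
    std::uintptr_t heap_base;          // lowest heap address the table covers
    std::vector<std::uint8_t> bytes;   // one byte per 2048-byte chunk of heap

    ByteCardTable(std::uintptr_t base, std::size_t heap_size)
        : heap_base(base), bytes((heap_size >> kSketchCardByteShift) + 1, 0) {}

    std::size_t index_of(std::uintptr_t location) const {
        assert(location >= heap_base);
        return (location - heap_base) >> kSketchCardByteShift;
    }

    // Mark the byte covering `location`. Storing a whole byte does not have to
    // read and preserve neighbouring bits, and the check skips the store when
    // the byte is already marked.
    void mark(std::uintptr_t location) {
        std::uint8_t& card_byte = bytes[index_of(location)];
        if (card_byte != 0xFF) card_byte = 0xFF;
    }

    bool is_marked(std::uintptr_t location) const {
        return bytes[index_of(location)] != 0;
    }
};
```

The cost of this simplicity is precision: one store marks every card that shares the byte, so a GC scanning the table has to look at more memory than the single location that actually changed.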

When we needed to make the runtime run on Linux, we added another piece of functionality to the WB: recording heap modifications in our own data structures instead of relying on the OS's write watch, which we use on Windows but which did not exist on Linux.

The most common function that gets called is JIT_WriteBarrier. When this is called, it knows that the location is on the GC heap. When the JIT can't determine that this is the case, it generates calls to one of the 2 other versions - JIT_CheckedWriteBarrier and JIT_ByRefWriteBarrier. These need to check if the location is on the GC heap first and exit right away if that's not the case. JIT_ByRefWriteBarrier is a special case to handle copying a range of reference fields in a struct (which is why rsi/rdi need to be incremented by a ptr size).
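The difference between the unchecked and checked flavors can be sketched as follows. This is only an illustration of the control flow described above: `HeapRange`, `unchecked_write_barrier`, `checked_write_barrier` and the `card_updates` counter are hypothetical stand-ins, and the real helpers are hand-written assembly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of the address range occupied by the GC heap.
struct HeapRange {
    std::uintptr_t lowest;
    std::uintptr_t highest;   // exclusive
};

// Stand-in for the unchecked barrier: the caller guarantees that `location`
// is on the GC heap, so it stores and then does the card bookkeeping.
inline void unchecked_write_barrier(void** location, void* child, int& card_updates) {
    assert(location != nullptr);
    *location = child;
    ++card_updates;   // placeholder for the real card-marking logic
}

// Stand-in for the checked barrier: the location might be on the stack or in
// native memory, in which case only the store is needed.
inline void checked_write_barrier(void** location, void* child,
                                  const HeapRange& heap, int& card_updates) {
    assert(location != nullptr);
    const auto addr = reinterpret_cast<std::uintptr_t>(location);
    if (addr < heap.lowest || addr >= heap.highest) {
        *location = child;   // not on the GC heap: store and exit right away
        return;
    }
    unchecked_write_barrier(location, child, card_updates);
}
```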

For Workstation GC in the segments case, we know that there's a single ephemeral range because there's only one ephemeral segment. In fact, when there is only one segment, instead of a range, we just need the lower limit, i.e., anything higher than where generation 1 starts is an ephemeral object. So we filter the child object by checking if it's in the ephemeral range, which takes either one or two comparisons. Unfortunately this isn't the case for Server GC which has as many ephemeral ranges as there are heaps. So for Server GC we set the cards unconditionally (if they are not set yet).
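The segments-era filters described above amount to a couple of comparisons. A rough sketch, with hypothetical names (`EphemeralRange`, `wks_child_is_ephemeral`, `wks_child_is_ephemeral_single_segment`, `svr_must_set_card`) that do not correspond to runtime symbols:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bounds of the single ephemeral segment (high is exclusive).
struct EphemeralRange {
    std::uintptr_t low;
    std::uintptr_t high;
};

// Workstation GC, segments: two comparisons against the one ephemeral range.
inline bool wks_child_is_ephemeral(std::uintptr_t child, const EphemeralRange& eph) {
    assert(eph.low <= eph.high);
    return child >= eph.low && child < eph.high;
}

// Workstation GC with only one segment: anything at or above where gen1
// starts is ephemeral, so one comparison is enough.
inline bool wks_child_is_ephemeral_single_segment(std::uintptr_t child,
                                                  std::uintptr_t gen1_start) {
    return child >= gen1_start;
}

// Server GC, segments: one ephemeral range per heap means there is no cheap
// filter, so the card is always a candidate for being set.
inline bool svr_must_set_card(std::uintptr_t /*child*/) {
    return true;
}
```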

More precise WB in 7.0

In .NET 7.0 we made the WB much more precise for the regions case, especially for Server GC. Peter did the work. One of the major differences between regions and segments is a region can only belong to one generation, unlike segments where the ephemeral segment can contain gen0, gen1 and potentially gen2 objects. We maintain a compact generation info map for regions (so it's likely in the cache) and pass this info over to the WB side. Then in WB we can get the generation for any address that's on the GC heap. We do the following in the WB -

  1. Get the generation number for the location.
  2. If this is in gen0, nothing to do; else get the generation of the child object.
  3. If the generation of the child object is >= generation of location, nothing to do; else we need to set the card, if it hasn't been set.
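The three steps above can be sketched as a small decision function. Everything here is an assumption made for illustration: the one-byte-per-region map layout, the 4 MB region size, and the names `RegionGenMap` and `region_wb_needs_card`. The sketch also models only generation numbers 0 to 2 and does not treat LOH or POH separately.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical region size: 2^22 bytes (4 MB) per region.
constexpr unsigned kSketchRegionShift = 22;

// Compact map from region index to generation number.
struct RegionGenMap {
    std::uintptr_t heap_base;                  // address of region 0
    std::vector<std::uint8_t> gen_of_region;   // one small entry per region

    int generation_of(std::uintptr_t addr) const {
        assert(addr >= heap_base);
        const std::size_t idx = (addr - heap_base) >> kSketchRegionShift;
        assert(idx < gen_of_region.size());
        return gen_of_region[idx];
    }
};

// Returns true when the barrier would need to set a card for this store.
inline bool region_wb_needs_card(const RegionGenMap& map,
                                 std::uintptr_t location,
                                 std::uintptr_t child) {
    const int location_gen = map.generation_of(location);   // step 1
    if (location_gen == 0) return false;                     // step 2
    const int child_gen = map.generation_of(child);
    return child_gen < location_gen;                         // step 3
}
```

Only an older-to-younger reference returns true; a store between two freshly allocated gen0 objects exits at step 2 without looking up the child at all.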

By only setting cards when location is in an older generation than the child object, we cut down the number of false positively set cards by a whole lot. For example, it's very common that you'd allocate new objects and initialize their fields with another new object you just allocated. So no cards are set for that case anymore. This greatly reduces the number of cards GC has to look at during an ephemeral GC which makes the mark phase faster.

Initially my plan was to still set the byte in the card table like we used to for segments. But Peter experimented with setting just a bit and found that the perf was still acceptable (because we don't tend to interlock on the same bit when we update). And of course this reduced the number of false positive cards noticeably further, so we decided to set this as the default.
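For comparison with the byte approach, here is a rough sketch of setting a single card bit. Because a bit shares its word with other cards, the update has to be an interlocked OR; checking first whether the bit is already set keeps the interlocked operation off the common path. `BitCardTable` and the 32-cards-per-word layout are assumptions for illustration, not the runtime's layout.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BitCardTable {
    // Hypothetical layout: 32 cards per word, one bit per card.
    std::vector<std::atomic<std::uint32_t>> words;

    explicit BitCardTable(std::size_t card_count) : words((card_count + 31) / 32) {
        for (auto& w : words) w.store(0, std::memory_order_relaxed);
    }

    bool is_set(std::size_t card) const {
        assert(card / 32 < words.size());
        return (words[card / 32].load(std::memory_order_relaxed) >> (card % 32)) & 1u;
    }

    // Sets the card's bit. Returns true only if this call changed it.
    bool set(std::size_t card) {
        assert(card / 32 < words.size());
        const std::uint32_t mask = std::uint32_t{1} << (card % 32);
        std::atomic<std::uint32_t>& word = words[card / 32];
        if (word.load(std::memory_order_relaxed) & mask) {
            return false;   // already set: skip the interlocked operation
        }
        // Interlocked OR leaves the other 31 cards in the word untouched.
        return (word.fetch_or(mask, std::memory_order_relaxed) & mask) == 0;
    }
};
```

A byte's worth of cards covers eight times as much memory as one bit, which is why marking a single bit leaves fewer false positive cards for the GC to scan.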

We actually do still maintain an ephemeral range but this is no longer precise like in the segments case for Workstation GC. This is simply the beginning of the lowest ephemeral region to the end of the highest ephemeral region, and there may be gen2/LOH/POH regions in between. We do use this in the GC; however, we don't currently use this range on the WB side because updating the WB whenever this range is updated added what we considered an unacceptable amount of overhead. We plan on doing more optimizations in the future to make this range much more stable. So today, even though we leave the code that compares with this range in the WB, the range is just set to the GC heap range (so we don't need to update the WB code each time the ephemeral range changes).

Note that the data structure that allows us to get the generation number for an address is quite compact (it’s a lot smaller than the card table) so we expect that this is often a cache hit.

Here's a comparison of three 10-minute runs from a first-party environment with real-time forked production traffic. Note that the result for the Region Byte WB is doing fewer GCs because it happened to hit a tuning decision for budget adjustment differently from the other 2 runs (one of the things I'm hoping to fix in the upcoming releases to make this more robust). The last 2 columns are very important: they are what we use to measure GC's efficiency, i.e., how long does it take to promote the same number of bytes? For GC the number of promoted bytes is the indication of how much work GC needs to do when comparing the same workload. So higher is better. With the Byte WB version of regions, gen0 speed is 18% higher. And with the Bit WB version of regions, it's 25% higher than segments. Note that the improvement in regions does include more than just the WB changes but you can clearly see the difference between bit and byte WB.

Of course we do pay a price in the WB code because it now does more work -

For Bit WB regions, there were 25,556 samples in JIT_WriteBarrier which is 0.7% of the total CPU samples. So even though it does have 15% more samples than segments, it's still a small cost. I've rarely seen JIT_WriteBarrier take more than 1% of CPU.

I do plan to change the WB implementation in future .NET releases to move most of the work we have to do in the WB code path elsewhere, so it does not block the thread that executes the WB for as long.

How to know if I hit a regression due to this?

We did see regressions when running tiny microbenchmarks for this change. As I mentioned above, usually JIT_WriteBarrier takes very little CPU. But if you have a microbenchmark that spends a lot of time in the WB, you could certainly see a regression. Let's say you have a small test that just assigns an existing object to different locations; it can easily spend a significant percentage of CPU in the WB. But since this is not the typical behavior real workloads exhibit, we view it as an acceptable regression. However, we do provide a way to revert to the old Server GC behavior if you do run into the very rare cases where the WB cost is actually significant in your app. In src\coreclr\gc\gcconfig.h, we define the following values you can specify for the DOTNET_GCWriteBarrier environment variable -

enum WriteBarrierFlavor
{
    WRITE_BARRIER_DEFAULT = 0,
    WRITE_BARRIER_REGION_BIT = 1,
    WRITE_BARRIER_REGION_BYTE = 2,
    WRITE_BARRIER_SERVER = 3,
};

Default is the Bit WB so 0 and 1 are handled the same way, both using the Bit WB. Setting the config to 3 means to use the old Server GC WB which is to unconditionally set the card (byte). And if you want a middle ground, you can set it to 2 which means to use the Byte WB. There is no equivalent of the old Workstation GC WB. However we observed that in our Workstation GC testing, even though WB costs a bit more, the end result is the total execution time is usually noticeably less.
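The mapping from config value to behavior described above can be summarized in a few lines. This is a sketch of the described semantics, not the runtime's selection code: `BarrierBehavior` and `behavior_for` are hypothetical names, and the sketch only accepts the four listed values because the handling of anything else is not covered here.

```cpp
#include <cassert>

// Values as listed above (from gcconfig.h).
enum WriteBarrierFlavor
{
    WRITE_BARRIER_DEFAULT = 0,
    WRITE_BARRIER_REGION_BIT = 1,
    WRITE_BARRIER_REGION_BYTE = 2,
    WRITE_BARRIER_SERVER = 3,
};

// Hypothetical names for the three behaviors the text describes.
enum class BarrierBehavior
{
    RegionBit,                 // precise regions WB, sets a card bit
    RegionByte,                // precise regions WB, sets a card byte
    ServerUnconditionalByte,   // old Server GC WB, always sets the card byte
};

inline BarrierBehavior behavior_for(int configured)
{
    // Assumption: only the four documented values are passed in.
    assert(configured >= WRITE_BARRIER_DEFAULT && configured <= WRITE_BARRIER_SERVER);
    switch (configured)
    {
        case WRITE_BARRIER_REGION_BYTE: return BarrierBehavior::RegionByte;
        case WRITE_BARRIER_SERVER:      return BarrierBehavior::ServerUnconditionalByte;
        default:                        return BarrierBehavior::RegionBit;   // 0 and 1
    }
}
```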

In .NET 7 we also implemented Write XOR Execute (W^X). With it, the page that the WB code lives on is mapped twice and the address of the code is on the writable page that’s not part of coreclr.dll. Without W^X you can set a bp on coreclr!JIT_WriteBarrier and see it getting hit -

ModLoad: 00007fff`28420000 00007fff`2890a000   C:\runtime-segments\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\coreclr.dll

Breakpoint 1 hit
coreclr!JIT_WriteBarrier:
00007fff`2857e040 488911 mov qword ptr [rcx],rdx ds:00000240`80ad46c8=0000000000000000

00007fff2857e040 is in the coreclr.dll module.

But with W^X you will only hit coreclr!JIT_WriteBarrier_Callable which will then jump to the WB code -

ModLoad: 00007fff`26150000 00007fff`2663a000   C:\runtime-segments\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\coreclr.dll

Breakpoint 1 hit
coreclr!JIT_WriteBarrier_Callable:
00007fff`262ae240 ff25baed3600 jmp qword ptr [coreclr!JIT_WriteBarrier_Loc (00007fff`2661d000)] ds:00007fff`2661d000={CORINFO_HELP_ASSIGN_REF (00007ffe`c65d0010)}
0:017> t
CORINFO_HELP_ASSIGN_REF:
00007ffe`c65d0010 488911 mov qword ptr [rcx],rdx ds:0000029c`2c86b7e8=0000000000000000

00007ffec65d0010 is not part of coreclr.dll so when you look at this in a profiler like PerfView, it does not know how to map this address to a symbol. This can be made to work by firing a rundown event to help with the decoding. But for now, you can turn W^X off by setting the DOTNET_EnableWriteXorExecute env var to 0.
