Is the regression in GC or something else?

Maoni0
Dec 7, 2022

When you move your product from one version of .NET to another, you sometimes see a regression in memory. This could show up as an increase in heap size, in GC pauses (individual GC pauses and/or total GC pause time), or in GC CPU time.

I've very often observed folks immediately assume this is a GC problem. After all, the GC is in charge of managing memory for you, right? The GC is indeed in charge of managing memory for you, but if your workload changes, the GC behavior will of course change accordingly.

I've had many people say to me, "NO! My code did not change! It's exactly the same code. I just upgraded from .NET version X to .NET version Y!"

Your code does not run on just the GC (it'd be hard to do anything interesting if there was only the GC :-)). It relies on the GC, the rest of the .NET runtime (which includes the JIT, type system, interop, etc.), the BCL (Base Class Library, which is where the implementations of classes like string/StringBuilder/Stream and many, many others live), and higher-level libraries (ASP.NET, etc.). With each new version of .NET there are changes not just in the GC but in all those other components; each new release contains a huge number of changes.

Obviously, changes in the GC can change memory behavior. But changes in the library code you are calling can contribute to the changes in memory behavior just as easily.

With this many changes, debugging why you see a memory regression can range from trivial to very difficult. Trivial cases are like "this particular class we use a lot started to allocate a lot more objects/pinned handles, which causes a lot more GCs/fragmentation on the heap". But when there are a ton of changes, many of which can affect memory behavior, it can be a lot of work to figure out what contributes to a regression.

As someone who often gets asked to do this kind of memory regression analysis, I wanted to make the initial step of answering "is the regression in GC or something else?" much quicker. This mechanism is shipping in .NET 7; details are explained below.

Using clrgc.dll to determine if the regression is caused by changes in GC

Starting with .NET 7.0, we're shipping a standalone clrgc.dll that includes the GC implementation from .NET 6.0. It lives in the same directory as coreclr.dll. When you see a memory regression, please run your scenario again with clrgc.dll by setting this runtime config -

{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Name": "clrgc.dll"
    }
  }
}

or this environment variable -

set DOTNET_GCName=clrgc.dll
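
If you want to script the two runs, here is a minimal Python sketch that launches the same scenario with and without clrgc.dll by setting DOTNET_GCName only in the child process's environment, so your shell and other processes are unaffected. The helper names `gc_env` and `run_scenario` and any command you pass in are hypothetical placeholders; substitute however you normally launch your scenario.

```python
import os
import subprocess


def gc_env(use_clrgc: bool, base_env=None) -> dict:
    """Return a copy of the environment with DOTNET_GCName set only when use_clrgc is True."""
    env = dict(os.environ if base_env is None else base_env)
    if use_clrgc:
        env["DOTNET_GCName"] = "clrgc.dll"
    else:
        # Make sure an inherited setting doesn't leak into the baseline run.
        env.pop("DOTNET_GCName", None)
    return env


def run_scenario(cmd: list, use_clrgc: bool) -> int:
    """Run a (hypothetical) scenario command and return its exit code."""
    result = subprocess.run(cmd, env=gc_env(use_clrgc))
    return result.returncode
```

For example, `run_scenario(["MyApp.exe"], use_clrgc=True)` would launch a hypothetical MyApp.exe on .NET 7 with the .NET 6.0 GC; capture your traces around each run as usual.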

There are 3 possible outcomes -

1. If the regression is gone, it's clearly in the GC, so please file an issue on our GitHub repo (dotnet/runtime) with the following data from running on .NET 7, both with and without clrgc.dll -

Top-level GC traces

CPU sample traces

and the GC team will look at it promptly! We'll likely ask you for more data, and while we are investigating and producing a fix, please feel free to keep using clrgc.dll.

2. If the regression is the same, it's clearly not in the GC. The best starting point is again to capture traces with top-level GC metrics, this time running on .NET 7 and on the version you upgraded from. Open the traces in PerfView and compare the "Total Allocs" in the GCStats summary for your process. If the total allocation is simply higher on 7.0, follow the measuring allocations section to diagnose what's making the allocations. Otherwise, follow the section on diagnosing a high % pause time in GC and/or a large GC heap size.

3. If the regression is only partially gone, start with the same step described in 1.
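
The three outcomes can be written down as a small decision helper. This is a hypothetical sketch: it assumes you've already extracted one comparable metric where lower is better (for example, total GC pause time or peak heap size) for the version you upgraded from, .NET 7 with its built-in GC, and .NET 7 with clrgc.dll. The 10% noise tolerance is an arbitrary assumption, not guidance from the GC team, and both function names are illustrative.

```python
def classify_regression(baseline: float, net7_default: float, net7_clrgc: float,
                        tolerance: float = 0.10) -> str:
    """Map three measurements of one metric (lower is better) to an outcome.

    baseline     -- the version you upgraded from
    net7_default -- .NET 7 with its built-in GC
    net7_clrgc   -- .NET 7 with DOTNET_GCName=clrgc.dll
    tolerance    -- hypothetical noise margin, as a fraction of baseline
    """
    noise = tolerance * baseline
    regression = net7_default - baseline
    if regression <= noise:
        return "none"      # no regression worth investigating
    remaining = net7_clrgc - baseline
    if remaining <= noise:
        return "gc"        # outcome 1: gone with clrgc.dll, file an issue
    if remaining >= regression - noise:
        return "not-gc"    # outcome 2: unchanged, look outside the GC
    return "partial"       # outcome 3: partially gone, start with outcome 1's step


def next_step_if_not_gc(total_allocs_old: float, total_allocs_new: float,
                        tolerance: float = 0.10) -> str:
    """For outcome 2: compare GCStats "Total Allocs" (same units) across versions."""
    if total_allocs_new > total_allocs_old * (1 + tolerance):
        return "measure allocations"
    return "diagnose pause time / heap size"
```

Using a single metric keeps the sketch simple; in practice you'd look at heap size, pause time and GC CPU time separately, since a change in GC behavior can shift one without the others.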

History of clrgc.dll

In .NET Core 2.0 we started working on the LocalGC project, which allows you to load an alternate GC implementation, i.e., a "local implementation". By default, the GC for .NET Core is part of coreclr.dll; LocalGC enabled the runtime to load a GC implementation from a separate DLL instead. This separate DLL communicates with coreclr.dll via well-defined interfaces; the 2 main ones are -

IGCHeap, which contains the methods that the VM side (in this context, VM side really just means the code that lives in the coreclr\vm directory) uses to call into the GC. This is defined in coreclr\gc\gcinterface.h. Many of these methods actually exist to support calls made indirectly from the BCL side, such as GetMemoryInfo, which is used to implement the GC.GetGCMemoryInfo method. There are also methods that the allocation helpers call to actually get memory from the GC.

IGCToCLR, which contains the methods that the GC side uses to call into the VM, such as SuspendEE/RestartEE for suspension and GcScanRoots to scan stack roots. This is defined in coreclr\gc\gcinterface.ee.h.

If you want to plug in your own GC implementation, you will need to implement what's on the IGCHeap interface. If you're doing this just for research purposes or for fun, you don't need to implement every single method IGCHeap defines, just enough to get the runtime functional. Since many methods exist to support APIs defined on the GC class, they aren't required for the runtime to function.
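
To make the two-way contract concrete, here is a toy Python model. It is purely illustrative: the real interfaces are C++ classes in gcinterface.h and gcinterface.ee.h with many more methods and different signatures. The method names below (`allocate`, `collect`, `get_memory_info`, `suspend_ee`, `restart_ee`, `scan_roots`) are simplified, hypothetical stand-ins loosely modeled on the ones mentioned above, and `ToyGC` is a hypothetical "research" GC that implements only what the toy runtime needs while leaving a GC-class-API method unimplemented.

```python
from abc import ABC, abstractmethod


class IGCToCLRModel(ABC):
    """Toy stand-in for IGCToCLR: what the GC calls on the VM side."""

    @abstractmethod
    def suspend_ee(self) -> None: ...

    @abstractmethod
    def restart_ee(self) -> None: ...

    @abstractmethod
    def scan_roots(self) -> list: ...


class IGCHeapModel(ABC):
    """Toy stand-in for IGCHeap: what the VM side calls on the GC."""

    @abstractmethod
    def allocate(self, size: int) -> int: ...

    @abstractmethod
    def collect(self) -> None: ...

    def get_memory_info(self) -> dict:
        # Backs a GC-class API (think GC.GetGCMemoryInfo). The runtime can work
        # without it, so a research GC may leave it unimplemented.
        raise NotImplementedError


class ToyVM(IGCToCLRModel):
    def __init__(self):
        self.suspended = False
        self.roots = []

    def suspend_ee(self):
        self.suspended = True

    def restart_ee(self):
        self.suspended = False

    def scan_roots(self):
        return list(self.roots)


class ToyGC(IGCHeapModel):
    """Hypothetical minimal GC: bump allocation, keeps only rooted addresses on collect."""

    def __init__(self, to_clr: IGCToCLRModel):
        self.to_clr = to_clr
        self.live = set()
        self.next_addr = 0

    def allocate(self, size):
        addr = self.next_addr
        self.next_addr += size
        self.live.add(addr)
        return addr

    def collect(self):
        self.to_clr.suspend_ee()  # GC -> VM: stop managed threads
        try:
            self.live &= set(self.to_clr.scan_roots())  # GC -> VM: report roots
        finally:
            self.to_clr.restart_ee()  # GC -> VM: resume managed threads
```

Usage: `vm = ToyVM(); heap = ToyGC(vm)`, allocate a couple of objects, put one address in `vm.roots`, and `heap.collect()` leaves only that address in `heap.live`. The point is the direction of the calls: the VM only talks to the GC through the heap interface, and the GC only reaches back into the VM through the other one.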

We actually ported all this back to .NET Framework 4.8 because, aside from using the standalone GC in our own day-to-day development work, it has been really handy for verifying fixes or providing instrumentation in customers' production environments. For .NET Framework especially, changing clr.dll is a really big deal because you'd be changing it for everything running on that machine, but with a standalone GC you can have just the process of interest load clrgc.dll while all other processes keep using clr.dll. As part of the early testing for regions, I gave our partner teams two builds of clrgc.dll, one with the regions implementation and the other with segments. This way they didn't need to upgrade their .NET version at all; they just needed to load a clrgc.dll with their product running on .NET 6.0.