How the heap verification mode helps with debugging heap corruption

Maoni0
Dec 22, 2022


There’s a very handy tool called the heap verification mode that helps figure out heap corruption issues, i.e., those gnarly AVs (access violations) you hit when accessing an object that’s no longer valid. It’s off by default; to turn it on, set this env var:

set DOTNET_HeapVerify=1

When it’s on, it verifies that the GC heap looks sane on entry and on exit of each GC, and it reports an error when the heap doesn’t look sane. This is an example of such a check and the error it reports:

void gc_heap::verify_heap (BOOL begin_gc_p)
{
    // omitted
    prev_object = curr_object;
    // omitted
    curr_object = curr_object + Align(s, align_const);
    if (curr_object < prev_object)
    {
        dprintf (1, ("overflow because of a bad object size: %p size %zx", prev_object, s));
        FATAL_GC_ERROR();
    }
    // omitted
}

I’ve omitted all the other lines in the verify_heap method. This particular snippet says that if we add the aligned size of the current object to its address to obtain the address of the next object, and that address is smaller than the start of the object we just stepped past (prev_object), something is wrong, so we call FATAL_GC_ERROR(), which does the following:

inline void FATAL_GC_ERROR()
{
    GCToOSInterface::DebugBreak();
    _ASSERTE(!"Fatal Error in GC.");
    GCToEEInterface::HandleFatalError((unsigned int)COR_E_EXECUTIONENGINE);
}
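
To make the overflow check above concrete, here is a minimal standalone sketch of the same idea. It is not the GC's code: align_up and next_object_is_sane are hypothetical helpers, and the 8-byte alignment is an assumption chosen for illustration. It only shows why a bogus size makes the "next object" address wrap around below the current one.

```cpp
#include <cstdint>

// Hypothetical helper: round `size` up to a multiple of `alignment`
// (assumed to be a power of two). Not the GC's real Align().
inline std::uintptr_t align_up(std::uintptr_t size, std::uintptr_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

// Hypothetical check mirroring `curr_object < prev_object` in the snippet:
// step past an object of `size` bytes starting at `curr` and make sure the
// resulting address did not move backwards. Unsigned arithmetic wraps on
// overflow, so a huge corrupted size lands the "next" address below `curr`.
inline bool next_object_is_sane(std::uintptr_t curr, std::uintptr_t size,
                                std::uintptr_t alignment = 8) {
    std::uintptr_t next = curr + align_up(size, alignment);
    return next >= curr;
}
```

A plausible size such as 24 bytes passes, while a size close to the maximum pointer value wraps and fails, which is the situation the real check reports through FATAL_GC_ERROR.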

When folks observe a heap corruption, it often shows up somewhere in the GC code like mark_object_simple1. That’s unsurprising: since the GC looks at the heap the most extensively, it very likely gets to observe the corruption before something else looks at the corrupted object and observes a failure there. As the first step of figuring out if the corruption is likely caused by the GC or something else, I would ask them to turn this mode on and see if it breaks in FATAL_GC_ERROR.

If it does break in FATAL_GC_ERROR, what does that mean? First of all, it’s useful to know how to distinguish “on entry” and “on exit” –

· On entry of a GC means verify_heap is called directly by garbage_collect.

· On exit of a GC means it’s called directly by gc1.

Let’s say the heap looked sane on entry of a GC but not on exit of that GC. It means the heap looked fine when this GC was triggered, so the GC was handed a heap that heap verification did not detect any problems with. But after the GC was done, the heap no longer looked fine. So something went wrong during the GC. Note that I’m very careful here not to say that “something went wrong with the GC” because it can still mean something else went wrong, for example, if something reported roots incorrectly to the GC. But we can treat it as “it’s very likely something went wrong with the GC”. And vice versa, if the heap looks sane on exit of a GC but not on entry of the next GC, it’s very likely that something went wrong with something other than the GC. If the corruption repros with the heap verification mode on, the policy we generally use to decide whether a GC team member should debug it is whether it breaks on exit of a GC. And this should make sense now because that tells us “it’s very likely something went wrong with the GC”.
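
The entry/exit reasoning above can be written down as a tiny triage function. This is only a sketch of the rule of thumb described here: the Suspect enum and the triage function are hypothetical names, and the "on entry" case assumes the previous GC's exit verification passed.

```cpp
// Hypothetical labels for who should look at the corruption first.
enum class Suspect {
    None,         // both verifications passed
    LikelyGC,     // sane on entry, not sane on exit of the same GC
    LikelyNotGC   // not sane on entry (previous exit assumed to have passed)
};

// Hypothetical triage helper encoding the rule of thumb from the text.
// "Likely" matters: a failure on exit can still come from something else,
// for example roots being reported incorrectly to the GC.
inline Suspect triage(bool sane_on_entry, bool sane_on_exit) {
    if (!sane_on_entry) {
        return Suspect::LikelyNotGC;  // corrupted between two GCs
    }
    if (!sane_on_exit) {
        return Suspect::LikelyGC;     // corrupted during this GC
    }
    return Suspect::None;
}
```

In practice a failure on entry breaks in FATAL_GC_ERROR before the GC runs, so the exit result is never observed in that case; the second parameter is only meaningful when the entry check passed.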

Of course having a heap verification mode provides more value than just deciding who should look at the heap corruption issue first 😊. It’s extremely useful to look at the corrupted object(s) as early as possible after the corruption occurred. As I mentioned above, the reason you often see callstacks in GC code that indicate heap corruption is that the GC looks at objects very extensively. Think about how objects are normally used: you allocate an object, or get a hold of an object from something you call, then you access some fields. There could be a long time between when an object is corrupted and when it’s accessed in a way that causes an AV. To understand why it was corrupted and what corrupted it, it’s better to get as close to when it happened as possible. Even doing GCs normally may not cause an AV to show up when an object is corrupted, for reasons like the GC not looking at all objects when doing ephemeral GCs, or the GC not needing to look at an object again after it corrupts it, so no AV would show up during that GC. But when heap verification is on, we verify the whole heap, not just the generations that get collected. So if, say, something went wrong during a GC, it’s much better to be able to look at it at the end of that GC instead of a nondeterministic number of GCs later.

There is a ton of stuff that heap verification checks, which gives us clues about what to look at next. Below is the list of things that get verified during the verify_heap call (I’ll mention regions but segments work very similarly):

· The whole heap is walkable, meaning we can advance from the beginning of a region all the way to the last object on that region.

· For each object we walk that’s not a free object, we verify various aspects of it, e.g., if we find it in SOH its size should be less than the threshold for a large object. We verify its immediate children when we can (during some phases we may not be able to do this).

· Region lists themselves look fine. For example, we can go from the start region of a generation to the tail region, a region’s generation number is set correctly, etc.

· The GC bookkeeping data structures look fine, i.e., cards, bricks, mark bits and mark array bits look fine, e.g., when a card should be set it is indeed set, and when mark array bits should be cleared they are indeed cleared.

· Free lists look fine.

· It calls the verification code in components that GC interacts with, such as the handle table and finalize queue.

Each of these will give clues to what we want to look at next. For example, if it breaks because it hits FATAL_GC_ERROR when a card that should be set is not, we look at places where GC is supposed to set cards and go from there.
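
As an illustration of the "a card that should be set is not" case, here is a toy sketch of what such a check could look like. Everything in it is an assumption made for illustration: ToyObject, find_missing_card, and the one-card-per-N-objects layout are hypothetical and much simpler than the GC's real card table, which covers address ranges.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy object: a generation number and the indices of the
// objects it references. Not the real object layout.
struct ToyObject {
    int generation;                 // 0 is the youngest generation
    std::vector<std::size_t> refs;  // indices into the toy heap
};

// Toy check: every object that points from an older generation into a
// younger one must have its card set. Assumes one card per
// `objects_per_card` consecutive objects, that every index in `refs` is a
// valid heap index, and that `cards` is large enough to cover the heap.
// Returns the index of the first offending object, or heap.size() if none.
inline std::size_t find_missing_card(const std::vector<ToyObject>& heap,
                                     const std::vector<bool>& cards,
                                     std::size_t objects_per_card) {
    for (std::size_t i = 0; i < heap.size(); ++i) {
        for (std::size_t target : heap[i].refs) {
            bool old_to_young = heap[target].generation < heap[i].generation;
            if (old_to_young && !cards[i / objects_per_card]) {
                return i;  // this is where a verifier would raise the error
            }
        }
    }
    return heap.size();
}
```

Knowing which object tripped the check narrows the search to the places that were supposed to set the card covering it.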

Note that verify_heap is always called during STW (Stop The World), which means for a Background GC (BGC), we will do an additional STW phase. Normally the end of a BGC runs concurrently, but when heap verification mode is on, we call SuspendEE to suspend managed threads before calling verify_heap and RestartEE afterwards to resume them.
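
The suspend, verify, restart ordering can be sketched with a toy model. ToyEE and verify_under_stw are hypothetical stand-ins for illustration only; they do not reflect the signatures of the runtime's SuspendEE and RestartEE.

```cpp
#include <functional>

// Hypothetical stand-in for the execution engine's suspension state.
struct ToyEE {
    bool suspended = false;
    void suspend() { suspended = true; }   // stand-in for SuspendEE
    void restart() { suspended = false; }  // stand-in for RestartEE
};

// Runs `verify` only while managed threads are suspended. If the caller is
// not already in an STW phase (as at the end of a BGC in this toy model),
// an extra suspension is added around the verification and undone after.
inline bool verify_under_stw(ToyEE& ee,
                             const std::function<bool(const ToyEE&)>& verify) {
    bool was_suspended = ee.suspended;
    if (!was_suspended) {
        ee.suspend();
    }
    bool ok = verify(ee);
    if (!was_suspended) {
        ee.restart();
    }
    return ok;
}
```

The point of the ordering is that verification never observes a heap that managed threads are mutating at the same time.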

You may have noticed I said “if the corruption repros with the heap verification mode on” instead of “when”. Because verify_heap does so much, it will make your application run a lot slower (unless you barely do GCs of course), so it could disturb the timing enough that the bug no longer repros. I’ve had folks who worked on GC or GC related issues tell me that heap verification wasn’t useful to them when they couldn’t repro the bug with it on. I suggested to them that they could heavily cut down what the verification verifies by default. For example, you could start with just walking the heap, or just walking the heap and verifying cards, and only in the generations that you suspect might be of interest. That can reduce the work by a lot and often makes the bug repro.
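
One way to picture "cutting down" the verification is a set of switches that select which checks run and for which generations. The sketch below is purely illustrative: VerifyChecks, VerifyConfig, and should_run are hypothetical names, not existing GC settings, and in practice the trimming described here is done by editing the verification code itself.

```cpp
#include <cstdint>

// Hypothetical bit flags naming a few of the checks listed earlier.
enum VerifyChecks : std::uint32_t {
    kWalkHeap        = 1u << 0,
    kVerifyCards     = 1u << 1,
    kVerifyFreeLists = 1u << 2,
    kVerifyAll       = kWalkHeap | kVerifyCards | kVerifyFreeLists
};

// Hypothetical configuration: which checks to run and on which generations.
struct VerifyConfig {
    std::uint32_t checks = kVerifyAll;
    int min_generation = 0;
    int max_generation = 2;
};

// Returns true if `check` is enabled and `generation` is in the range
// of interest. A trimmed-down verifier would consult this before each check.
inline bool should_run(const VerifyConfig& cfg, VerifyChecks check,
                       int generation) {
    return (cfg.checks & check) != 0 &&
           generation >= cfg.min_generation &&
           generation <= cfg.max_generation;
}
```

For example, setting checks to kWalkHeap | kVerifyCards and restricting both generation bounds to 2 would correspond to "just walk the heap and verify cards, only in gen2".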

In GC development, we use the heap verification mode extensively and it’s normal to make changes to the verification code itself if we want to cut down the cost there. When customers report a heap corruption issue, I start with having them turn the mode on and see if they can repro it, and if they can, see if it reports an error on entry or exit of a GC and go from there. If the error is on exit of a GC, a GC team member would take a look. If it’s on entry of a GC, then it’s likely caused by something other than the GC.
