Sorry for this title, but that’s exactly the subject: this short stack gives me enough information to understand the issue, reproduce it, open a SR, talk with friends, find a workaround,…
A Friday afternoon story
Here is how this started, on a database just migrated from 11g:
Every hour a job is running, many sessions are blocked on library cache lock. As we are in RAC 19c, The Hang Manager detects this blocking situation and kills the culprit after a while. This was the first time I see it in action in real life. Killing is never good, but that’s better than letting all sessions blocked by one background job. However, it does not resolve the root cause… which, in this case, comes back every hour.
I was on day-off, but I can’t resist looking at those things. I look at the trace file. Not only the Hang Manager kills the blocking session but also dumps a lot of diagnostic information. Finding the needle in a haystack is not so difficult once you have identified where is the haystack: the dump trace contains the call stack which identifies where the issue occurs — which C function in Oracle software. And it also gives some clues about the context and how we got there.
I shared this call stack just in case some friends already encountered this issue:
Now let’s see how I came, within a few minutes, to the idea that it was related to online statistics gathering and global index maintenance. The best source of information about Oracle C function is Frits Hoogland www.orafun.info and you can even copy/paste the stack trace as-is in http://orafun.info/stack/
If the full function is not found here, you can guess from the names, you can search My Oracle Support for bugs related to this function,… and this is what I came with:
- the hang situation was detected in kglLock(): kernel generic library cache management library cache lock, which is already known as the wait was on “library cache pin”
- qospsis() is query optimizer statistics related to setting index statistics. That’s interesting. I can also see a call to the statement I’ve seen in the dump trace: dbms_stats.postprocess_indstats(). This is a new feature in 19c where online statistics gathering happens for indexes as well as tables. But, as far as I know, this should occur only on direct-path insert.
- qospPostProcessIStats() confirms this: setting the index statistics is part of post processing index statistics
- qerltFetch() is query execute rowsource load table fetch and qerstFetch() is query execute rowsource statistics row source fetch. That looks like an insert from select.
- insdlexe() is insert direct load execution. This links to the online statistics gathering: direct-path inserts counting inserted rows to set index statistics at the end.
- atbugi_update_global_indexes() makes the link between indexes and the statement causing this because atb is the prefix for ALTER TABLE functions.
- atbFMdrop() on google finds a bug about “ORA-600 Hit When Drop of Table Partition” — I’ve no idea why but “FM” looks like related to partitions and we are dropping one. Yes, I found an ALTER TABLE DROP PARTITION in the waiting session cursor for the job that kicks-in every hour.
Ok, that’s far from understanding everything but at least we have a consistent view about what happened: We drop a partition. This has to update global indexes. By some magic, this is done in direct-path. In 19c direct-path does update the index statistics. And probably a bug there leads to a deadlock situation. Note that I’ve no global indexes on this table, but the codepath is there.
1. Workaround to fix the production issue
Thanks to the function name where the problem occurs, I have my workaround 😀. I’ve just migrated to 19c. I don’t rely on the online statistics gathering. Let’s disable it. I did it first with “_optimizer_gather_stats_on_load”. It was the first coming out of my mind, but after a while, as suggested by Oracle Support, we did it more precisely by disabling only “_optimizer_gather_stats_on_load_index”.
Yes, being able to workaround a critical production issue on Friday afternoon is the best you can do to a DBA.
2. Testcase to open a SR
The full call stack helps to reproduce the situation. I can create a partitioned table, create global indexes (I finally realized that I don’t even need to have GIDXs), drop a partition, and observe the behavior. But… wait. In 19c the index maintenance is supposed to be asynchronous. It was not like this on my production issue because I can see this in the trace:
insert /*+ RELATIONAL("...") NO_PARALLEL APPEND NESTED_TABLE_SET_SETID NO_REF_CASCADE */ into "..."."..." partition ("P1") select /*+ RELATIONAL("...") NO_PARALLEL */ * from NO_CROSS_CONTAINER ( "..."."..." ) partition ("...") delete global indexes
All ingredients are there in this magic internal query: direct path insert to delete the global index entries related to the dropped partition. But looking for this “delete global indexes” in My Oracle Support shows that this is the old pre-12c behavior when index maintenance was immediate. I don’t know what prevents this new feature in my case but that also explains the highest probability for a new bug: the two features (online statistics gathering and asynchronous GIDX maintenance) are usually not found together as one was pre-12c and the other is 12c.
But now, I can simulate the same thing for my test case with the parameter that controls this async GIDX update: “_fast_index_maintenance”=false. Again, this is mentioned in the MOS notes and easy to guess from its name.
Here is the simple test case that reproduces this improbable situation:
create table DEMO (a primary key,b not null)
partition by range (a) (partition p1 values less than(5),partition p2 values less than (maxvalue))
as select rownum,rownum from xmltable('1 to 10')
create index DEMO_B on DEMO(B) global;
alter session set "_fast_index_maintenance"=false;
alter table DEMO drop partition for(1) update global indexes;
This self-deadlocks even in single-instance (solved differently but on the same cause) and in the latest release update. Now I have everything to open an SR.
The message from this post
I’ve written this to give some positive messages about things that may have a bad reputation.
The blocking issue in production was limited thanks to:
- Hang Manager, one component of this complex Grid Infrastructure software is there to detect the hanging situations and reduce their consequences by killing the blocker. And dump diagnostic information. In the old times, I’ve seen many people “solving” this with a “startup force” and even forgetting to dump a hang analysis.
I got a quick workaround thanks to:
- the comprehensive diagnostic capabilities of the Oracle Database Software (one dump with statements, wait events, call stack,…)
- the amazing community of people doing research and sharing their results (like Frits Hoogland documenting the internal functions, or Tanel Poder troubleshooting tools and methods… can’t name all of them)
- the My Oracle Support engineers who build an easy-to-search knowledge base. That’s the invisible part of commercial software support: maybe spend more time on one customer issue in order to gather enough information to help the many other customers who will be in the same case.
I got a reproducible test case thanks to:
- the free environment (OTN license, Free Tier Cloud,…) that Oracle provides to us. I can ssh to my lab VM from anywhere.
- the Oracle Software keeping symbols in the executable to make the call stack understandable. It is not Open Source. You cannot recompile with more debug info. But you don’t need to: they are all there.
- the lessons I learned from Tom Kyte or Jonathan Lewis in order to think about small test cases to show a problem.
I can get further in understanding thanks to
- the product managers who follow those issues in the shadow
- the Oracle User Groups and ACE program which put in contact the users, the PMs, the developers,…
And long term fixes thanks to
- the Oracle developers who have identified this improbable but in some regression testing and already fixed it for future versions
- support who will probably backport the fix in future Release Updates (no need for one-off when a workaround is possible)
We complain a lot about what doesn’t work and I thought it was a good idea to describe this. There’s no magic bug-free software. What makes the difference is how they are instrumented for troubleshooting, and how they build a community to help users.
A small management summary
Some managers think that the budget they put on buying software and support is sufficient to get their infrastructure running their business. But that’s not enough to avoid bugs and find fixes for critical situations. It is only a little additional effort to get to this point: send people to conferences, give them time to learn, to share, to read, to post, to publish and to discuss. This is what will make the difference one day. Forget this “no human labor” and “eliminate human errors” which is chanted by software vendor salespeople just because this is the music you want to hear. Unexpected situations always happen. One day, the software you use will have a problem. Your business will be blocked. Machines can help to diagnose, but it will be a human who will find, with other humans help, how to get the service available again.