案例记录 --- 主备切换后索引无效（Invalid Commit SCN in INDEX block）

问题描述

2019.10.26 1：00~6：00 灾备演练，做完主备切换，并且回切完成后，有几套数据库(11.2.0.4, PSU7) alert日志出现告警信息：

Sat Oct 26 07:45:58 2019
Errors in file /oracle/app/db/diag/rdbms/ebbctxn/ebbctxn2/trace/ebbctxn2_ora_10228.trc  (incident=235230):
ORA-00600: 内部错误代码, 参数: [ktbdchk1: bad dscn], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/db/diag/rdbms/ebbctxn/ebbctxn2/incident/incdir_235230/ebbctxn2_ora_10228_i235230.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sat Oct 26 07:45:59 2019
Dumping diagnostic data in directory=[cdmp_20191026074559], requested by (instance=2, osid=10228), summary=[incident=235230].
Sat Oct 26 07:46:01 2019
Sweep [inc][235230]: completed
Sweep [inc2][235230]: completed
Sat Oct 26 07:47:15 2019
Errors in file /oracle/app/db/diag/rdbms/ebbctxn/ebbctxn2/trace/ebbctxn2_ora_11697.trc  (incident=240350):
ORA-00600: 内部错误代码, 参数: [ktbdchk1: bad dscn], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/db/diag/rdbms/ebbctxn/ebbctxn2/incident/incdir_240350/ebbctxn2_ora_11697_i240350.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sat Oct 26 07:47:17 2019
Dumping diagnostic data in directory=[cdmp_20191026074717], requested by (instance=2, osid=11697), summary=[incident=240350].

Incident文件部分内容如下：

Dump continued from file: /oracle/app/db/diag/rdbms/ebbctxn/ebbctxn2/trace/ebbctxn2_ora_10268.trc
ORA-00600: 内部错误代码, 参数: [2663], [23], [473774127], [23], [1302647517], [], [], [], [], [], [], []

========= Dump for incident 235382 (ORA 600 [2663]) ========

*** 2019-10-26 05:40:21.198
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=4641s08h87q5q) -----
select ID, CHNL_ID, TRANS_CODE, TRANS_DATE, SYS_SEQ, TRANS_SUB_SEQ, USER_ID, PAY_ACCT_BANK_FLAG, PAY_ACCT_TYPE, PAY_ACCT_NO, PAY_NAME, PAY_BANK_NO, PAY_BANK_NAME, REV_ACCT_BANK_FLAG, REV_ACCT_TYPE, REV_ACCT_NO, REV_NAME, REV_BANK_NO, REV_BANK_NAME, CCY, TRANS_AMT, FEE_CODE, CUST_FEE, SHLD_FEE, ROUTE_CODE, PMT_USAGE, TRANS_STAT, TRANS_TIME, CHK_FLAG, ORG_NO, PAY_ORG_NO, REV_ORG_NO, HOST_STAT, HOST_DATE, HOST_SEQ, THIRD_STAT, THIRD_DATE, THIRD_SEQ, HOST_FLAG, GROUP_PAY_FLAG, SUBJECTNO, CLASSNO, STAT_UPDATE_TIME, TRANS_RES_INFO, IS_BAT_FLAG, BAT_NO, IS_BAT_FEE_FLOW, REMARK1, REMARK2, REMARK3, TRANS_AMT1, TRANS_TOT_AMT from T_PMT_FLOW_LIST WHERE SYS_SEQ= :1 

----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
skdstdst()+41        call     kgdsdst()            000000000 ? 000000000 ?
                                                   7FFFE9A7C920 ? 7FFFE9A7C9F8 ?
                                                   7FFFE9A814A0 ? 000000002 ?
ksedst1()+103        call     skdstdst()           000000000 ? 000000000 ?
                                                   7FFFE9A7C920 ? 7FFFE9A7C9F8 ?
                                                   7FFFE9A814A0 ? 000000002 ?
ksedst()+39          call     ksedst1()            000000000 ? 000000001 ?
                                                   7FFFE9A7C920 ? 7FFFE9A7C9F8 ?
                                                   7FFFE9A814A0 ? 000000002 ?
dbkedDefDump()+2746  call     ksedst()             000000000 ? 000000001 ?
                                                   7FFFE9A7C920 ? 7FFFE9A7C9F8 ?
                                                   7FFFE9A814A0 ? 000000002 ?

问题分析

在Oracle MOS上，检索"ORA-00600: 内部错误代码, 参数: [ktbdchk1: bad dscn], [], []"，检索到文档 ID 22241601.8 和文档 ID 1608167.1，两篇文章指向了同一个 BUG：22241601。

根据bug描述，该问题的产生是因为索引块中无效的提交SCN号造成（Invalid Commit SCN in INDEX block）。

该问题主要发生在Dataguard做了Failover 或者 Switchover后的数据库上，但也可以发生在任何数据库上（包括没有配置DG的数据库）。This is issue has been mostly seen in Dataguard Standby Databases after a Failover or Switchover (either in PRIMARY or STANDBY) but may also occur in all kind of databases (with no dataguard configuration).

如果使用 DBVERIFY 工具对数据文件进行检验，可能会出现如下告警：
DBVERIFY (with fix of Bug 7517208) may report the next error:

   itl[<itl_id>] has higher commit scn(aaa.bbb) than block scn (xx.yy)
   Page <Block#> failed with check code 6056

该问题由索引块中错误的SCN号造成，但索引块没有数据损坏。
There is NO DATA CORRUPTION in the INDEX block; column values in the INDEX are correct.

问题解决

临时的解决办法就是对索引进行在线重建（在线并行，取消并行）。
根本性的解决办法是打补丁 Patch 22241601。

After installing Patch 22241601 Oracle tries to repair existent invalid ITL commit scn (healing). There is not need to set any parameter for it. _ktb_debug_flags=8 is now set by default thus the healing is enabled by default when Patch 22241601 is installed. For already existent invalid SCNs on disk, the SCN is repaired when the Index block is cleaned out (example: in a block update). While blocks are not touched, dbverify still reports the 6056 errors.

当块被修复的时候，Trace file 中会有类似如下修复信息:

Healing Corrupt DLC ITL objd:%d objn:%d tsn:%d rdba:<rdba> itl:%d
 option:%d xid:<xid> cmtscn:<scn> curscn:<scn>

Sometimes the fix may not repair the block for an already existent invalid SCN on disk; then repair this issue with:
1.重建索引
2.使用主库或者备库上好的数据文件替换有问题的数据文件