Occasionally our 1.5 TB database goes down with the error "SYSTEM ERROR: Wrong dbkey in block" during the backup.
After a year without problems, the error occurred again last night.
I ran dbtool, option 5 (Read or validate DB blocks), with option 0 = single-user, rowid = all, table = all, area = 11, verbose = 0, validation = 0.
The tool didn't find any errors. Now I'm running a dbanalys. Can I do more to find out if we really have a db corruption?
OE 10.2B SP8
Windows 2008 R2
Error message (11 times):
[2016/07/14@02:20:47.362+0200] P-7320 T-7264 I WDOG 20: (4232) Corrupt block detected when attempting to release a buffer.
[2016/07/14@02:20:47.362+0200] P-7320 T-7264 I WDOG 20: (10560) bmReleaseBuffer: Error occurred in area 11, block number: 22885265, extent: F:\mfgdata\prod\trtrprod_11.d1.
[2016/07/14@02:20:47.362+0200] P-7320 T-7264 I WDOG 20: (10561) Writing block 22885265 to log file. Please save and send the log file to Progress Software Corp. for investigation.
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) SYSTEM DEBUG: Database buffer block
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) pbktbl = 0x000000026846BDB0
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) pbktbl->qself = 0x000b0001280fbdb0
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) XBKBUF(pbktbl->qself) = 0x000000026846bdb0
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) pbktbl->bt_qbuf = 000f0001c27827c8
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) XBKBUF(pbktbl->bt_qbuf) = 0x0000002302AF27C8
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) qusrctl: 0x000b00000000b3a0
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) use count: 1, governing latch: 24, lru: 0, state: 4
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) changed: 0, chkpt: 0, writing: 0, fixed: 0
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) aged: 0, onlru: 0, cleaning: 0, apwq: 0
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) bt_qlrunxt: 0x000b000128100370, bt_qlruprv: 0x000b000006cab670
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) bt_qapwnxt: 0x0000000000000000, bt_qapwprv: 0x0000000000000000
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) bt_qfstuser: 0x0000000000000000, bt_qcuruser: 0x0000000000000000
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) pbkbuf = 0x0000002302AF27C8
[2016/07/14@02:20:47.368+0200] P-7320 T-7264 I WDOG 20: (-----) Block dbkey = 1464657024 Offset = 187476099072
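The dbkey and offset in this dump are consistent with each other. A minimal sketch of that arithmetic, assuming dbkey = block number × RPB with RPB = 64 (the value given later in this thread) and an 8 KB database block size (an assumption: the log does not state it, it is just the size that makes the logged figures line up). The helper names are hypothetical.

```python
# Sanity check of the figures in the SYSTEM DEBUG dump above.
RPB = 64           # records per block in area 11 (value quoted later in the thread)
BLOCK_SIZE = 8192  # ASSUMED 8 KB database block size; the log does not state it


def block_of(dbkey: int, rpb: int = RPB) -> int:
    """Block number encoded in a block dbkey, assuming dbkey = block * rpb."""
    if dbkey % rpb:
        raise ValueError(f"{dbkey} is not a multiple of RPB {rpb}")
    return dbkey // rpb


def offset_of(dbkey: int, rpb: int = RPB, block_size: int = BLOCK_SIZE) -> int:
    """Byte offset of that block, assuming offset = block number * block size."""
    return block_of(dbkey, rpb) * block_size


dbkey = 1464657024       # "Block dbkey" from the dump
print(block_of(dbkey))   # 22885266
print(offset_of(dbkey))  # 187476099072, the "Offset" value in the dump
```

Note that message 10560 prints block number 22885265, one less than dbkey / RPB; the sketch does not try to explain that difference.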
Other utilities you could try to find the possible cause(s):
- proutil -C dbrpr, options 1; 1+4
- proutil -C dbscan
- dbtool, option 3
Edwin,
Your database is NOT corrupted.
> Corrupt block detected when attempting to release a buffer.
Progress did not read the block from disk.
You did not post the text of error 1124:
SYSTEM ERROR: Wrong dbkey in block. Found dbkey1, should be dbkey2 in area num 11.
I guess the "found" dbkey is a real one: it is a multiple of the RPB (64 in your case) and below the HWM.
> Error message (11 times):
Who reported it first?
Did the watchdog report a dead user?
I bet a client session died (or was killed) while reading data from disk. The buffer header stored the dbkey of the block the session was about to read from disk (the "should be" dbkey), while the buffer itself still held some old block (the "found" dbkey).
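To make that sequence concrete, here is a toy model of it. It is NOT OpenEdge's real buffer manager: the `Buffer` class and the `start_read` / `finish_read` / `release_check` functions are hypothetical, and only the order of events follows the explanation above. The dbkeys are the ones from the 1124 message in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Buffer:
    header_dbkey: Optional[int]  # dbkey the buffer header claims ("should be")
    content_dbkey: int           # dbkey stamped inside the cached block image ("found")


def start_read(buf: Buffer, dbkey: int) -> None:
    # Step 1: the header is pointed at the block about to be read from disk.
    buf.header_dbkey = dbkey


def finish_read(buf: Buffer, disk_block_dbkey: int) -> None:
    # Step 2: the block image read from disk replaces the old contents.
    buf.content_dbkey = disk_block_dbkey


def release_check(buf: Buffer) -> Optional[str]:
    # Consistency check done when someone (e.g. the watchdog) releases the buffer.
    if buf.header_dbkey is not None and buf.header_dbkey != buf.content_dbkey:
        return (f"Wrong dbkey in block. Found {buf.content_dbkey}, "
                f"should be {buf.header_dbkey}")
    return None


buf = Buffer(header_dbkey=1464656896, content_dbkey=1464656896)  # old block cached
start_read(buf, 1464657024)
# ... the session is killed here, so finish_read() never runs ...
print(release_check(buf))
# Wrong dbkey in block. Found 1464656896, should be 1464657024
```

Neither value is wrong on its own; they simply describe two different steps of an interrupted read, which is why the on-disk data is unaffected.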
> Can I do more to find out if we really have a db corruption?
The quickest way is:
find any_table_in_the_area no-lock where recid(any_table_in_the_area) eq should_be_dbkey_in_1422.
Hi George,
The 1124 message is:
[2016/07/14@02:20:47.389+0200] P-7320 T-7264 F WDOG 20: (1124) SYSTEM ERROR: Wrong dbkey in block. Found 1464656896, should be 1464657024 in area 11.
The watchdog was reporting this.
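George's sanity checks can be applied mechanically to this message. A minimal sketch, assuming RPB = 64 as stated earlier in the thread; the regex is written against the 1124 text shown here and may need adjusting for other releases. The high-water-mark comparison is left out because the HWM of area 11 is not in the log.

```python
import re

# Pattern for the 1124 text as it appears in this thread's log.
MSG_1124 = re.compile(
    r"Wrong dbkey in block\. Found (\d+), should be (\d+) in area (\d+)"
)


def check_1124(line: str, rpb: int = 64) -> dict:
    """Extract the found / should-be dbkeys and run simple plausibility checks."""
    m = MSG_1124.search(line)
    if m is None:
        raise ValueError("not a 1124 'Wrong dbkey' message")
    found, should, area = (int(g) for g in m.groups())
    return {
        "area": area,
        "found_block": found // rpb,
        "should_block": should // rpb,
        "both_multiples_of_rpb": found % rpb == 0 and should % rpb == 0,
        "blocks_apart": (should - found) // rpb,
    }


line = ("[2016/07/14@02:20:47.389+0200] P-7320 T-7264 F WDOG 20: (1124) SYSTEM ERROR: "
        "Wrong dbkey in block. Found 1464656896, should be 1464657024 in area 11.")
print(check_1124(line))
# {'area': 11, 'found_block': 22885264, 'should_block': 22885266,
#  'both_multiples_of_rpb': True, 'blocks_apart': 2}
```

Both dbkeys are clean multiples of the RPB, two blocks apart, which fits the reading that each one is a genuine block address rather than garbage.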
So the only thing you need to find out is why the session died.
Indeed, it has something to do with the backup: we have had this problem 4 times in the last few years, and it was always during the backup.
Is this a virtualized or physical environment?
No, it is not virtual. But I suspect that some other software is scanning the memory.
We have restarted the database and it seems to be running normally.
Thanks ALL for replying. I appreciate it.
Lots of things to think about here: knowledgebase.progress.com/.../P5286
The article misses the scenario that happened in this case, which I have seen reported many times by our customers on Unix.
It's easy to reproduce if you have a fairly large database: kill -9 a session while it is reading data from disk. The chance of getting the "Wrong dbkey" error is higher than that of getting "Usr died holding latch". There is no corruption, neither on disk nor in memory. "Writing block to log file" dumps block contents that are correct for the "found" dbkey; you can compare them with a dump of the same block from the database on disk. The "should be" dbkey is correct as well.
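Once you have both images of the "found" block (the one written to the log file and the one dumped from disk), comparing them is a plain byte-by-byte diff. A minimal sketch, assuming both images have already been converted to raw bytes of equal length; the layout of the logged hex dump is not shown in this thread, so parsing it is left out, and the 8 KB sample data below is synthetic.

```python
from typing import List


def diff_blocks(a: bytes, b: bytes) -> List[int]:
    """Return the byte offsets at which two block images differ."""
    if len(a) != len(b):
        raise ValueError(f"size mismatch: {len(a)} vs {len(b)}")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def summarize(a: bytes, b: bytes) -> str:
    diffs = diff_blocks(a, b)
    if not diffs:
        return "identical"
    return f"{len(diffs)} differing byte(s), first at offset {diffs[0]}"


# Synthetic example: two 8 KB images that differ in a single byte.
logged = bytes(8192)
on_disk = bytearray(logged)
on_disk[100] = 0xFF
print(summarize(logged, logged))          # identical
print(summarize(logged, bytes(on_disk)))  # 1 differing byte(s), first at offset 100
```

In the scenario described above, the expected result for the "found" dbkey is "identical".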
A "Wrong dbkey" error reported by the watchdog after a session died, where that session itself did NOT report the "wrong dbkey" error, is never a real corruption. This has been confirmed many times.
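That rule of thumb can be turned into a rough first-pass check over the database log. A minimal sketch, assuming the line layout seen in this thread (`[timestamp] P-pid T-tid severity who usernum: (msgnum) text`); real .lg lines may differ, and the function names are hypothetical. It only looks at who reported 1124; it does not verify that a session actually died, so you still need to read the watchdog's messages yourself.

```python
import re
from typing import Iterable

# Line layout as seen in this thread's log excerpts (an assumption for other versions).
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] P-(?P<pid>\d+) T-(?P<tid>\d+) (?P<sev>\w) "
    r"(?P<who>\S+)\s+(?P<usr>\d+): \((?P<msg>[-\d]+)\) (?P<text>.*)$"
)


def triage(lines: Iterable[str]) -> str:
    """Rough classification of 1124 errors by which process reported them."""
    reporters = []
    for raw in lines:
        m = LOG_LINE.match(raw)
        if m and m["msg"] == "1124":
            reporters.append(m["who"])
    if not reporters:
        return "no 1124 messages"
    if all(r == "WDOG" for r in reporters):
        return "only the watchdog reported 1124: likely the dead-session case"
    return "a session reported 1124 itself: treat as possible corruption"


wdog = ("[2016/07/14@02:20:47.389+0200] P-7320 T-7264 F WDOG 20: (1124) SYSTEM ERROR: "
        "Wrong dbkey in block. Found 1464656896, should be 1464657024 in area 11.")
print(triage([wdog]))  # only the watchdog reported 1124: likely the dead-session case
```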