DB corruption on SLES12 with XFS

Posted by ke@iap.de on 05-Apr-2017 06:45

Hello :)

A customer has been running a DB on a VM (VMware) for a few months and is getting corrupted data. Environment: OE 10.2B08, SUSE Linux Enterprise Server 12 SP2, XFS file system on the data partition.

We did a binary dump and load (D/L); before we could do that, we had to rebuild the indexes on the meta schema (_File, _Index...).

A short time later, new errors showed up in the DB log file. We set -DbCheck and -MemCheck, but we still got more bad blocks.
The Linux error logs showed nothing.

We see the following areas which may have a problem:
- XFS as the file system, although other end users run it as well
- SAN
- Memory/CPU (hardware is about 5 years old)

We plan to do a text D/L to make sure everything is fresh, and to go back to ext3, which was the file system this installation used before.

Questions:
- Any concerns about XFS?
- Has anyone ever heard of a damaged meta schema or a damaged empty DB?
- Is it possible for a binary dump to be defective? For example, a backup made with probkup may back up defective blocks.
- Any other ideas?

kind regards - Klaus

Note: The error messages from the log file will follow soon.

All Replies

Posted by ke@iap.de on 05-Apr-2017 09:23

I am sorry, the text is in German, but message numbers are given :)

[2017/03/27@14:22:06.929+0200] P-5038       T--147446016 I ABL    26: (1422)  SYSTEM ERROR: Index po-nummer in artkusta für recid 41520842 konnte nicht gelöscht werden. 

The following messages repeat for a few blocks (2 blocks: 107280832 and 48955840).
5 or 6 elements are affected (such as 184 in this example).

[2017/03/27@15:10:25.802+0200] P-2275 T--147462400 I ABL 17: (4430) SYSTEM ERROR: Index 49, Block 107280832, Element-Nr. 184: Falsche Informationsgröße in einem Leaf Block.
[2017/03/27@15:10:25.811+0200] P-2275 T--147462400 I ABL 17: (2816) vorherige Größe = 18, cs = 6, ks = 1, is = 191, Schlüsselanzahl = 184.
[2017/03/27@15:10:25.821+0200] P-2275 T--147462400 I ABL 17: (14037) Fehlerdaten der Blockvalidierung für Index 49: nment ist 455, nlength ist 4117, level ist 1, aktueller Schlüssel ist 184, Offset ist 1673, func ist cxDoInsert
[2017/03/27@15:10:25.821+0200] P-2275 T--147462400 I ABL 17: (14031) Ungültiger Indexblock gefunden
...
[2017/03/27@15:10:25.832+0200] P-2275       T--147462400 F ABL    17: (14036) SYSTEM ERROR: Ungültiger Indexblock FATAL 
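To see at a glance which blocks and elements the (4430) messages name, the log can be tallied with a few lines of script. This is only a minimal sketch that assumes the line layout shown above; the function name `summarize_4430` and the regular expressions are illustrative and not part of any OpenEdge tooling.

```python
import re
from collections import defaultdict

# Message number in parentheses, e.g. "(4430)".
MSG_NUM = re.compile(r"\((\d+)\)")
# "Index 49, Block 107280832, Element-Nr. 184" (German log text) or
# "Index 49, block 107280832, element no. 184" (English log text).
BLOCK_INFO = re.compile(
    r"Index (\d+), block (\d+), element(?:-nr\.| no\.) (\d+)", re.IGNORECASE
)


def summarize_4430(lines):
    """Return {(index, block): sorted element numbers} for (4430) lines."""
    found = defaultdict(set)
    for line in lines:
        msg = MSG_NUM.search(line)
        if not msg or msg.group(1) != "4430":
            continue  # only the "bad info size in a leaf block" message
        info = BLOCK_INFO.search(line)
        if info:
            index, block, element = map(int, info.groups())
            found[(index, block)].add(element)
    return {key: sorted(values) for key, values in found.items()}


sample = [
    "[2017/03/27@15:10:25.802+0200] P-2275 T--147462400 I ABL 17: (4430) "
    "SYSTEM ERROR: Index 49, Block 107280832, Element-Nr. 184: "
    "Falsche Informationsgröße in einem Leaf Block.",
    "[2017/03/27@15:10:25.821+0200] P-2275 T--147462400 I ABL 17: (14031) "
    "Ungültiger Indexblock gefunden",
]
print(summarize_4430(sample))
```

Feeding the whole .lg file through it (one line per list entry) gives one entry per index/block pair, which makes it easier to check whether the damage stays confined to the same few blocks.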

Posted by ChUIMonster on 05-Apr-2017 09:34

What sort of SAN?  Does the customer do things with snapshots?

Does the customer use VMotion on this VM?

Doing any of the above without having a quiet point properly enabled seems like the most likely source of corruption to me.
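As a rough sketch of what "properly enabled" means: the quiet point has to be enabled before the snapshot starts and disabled again afterwards, even if the snapshot step fails. The example below assumes the OpenEdge `proquiet` utility is on the PATH and is invoked as `proquiet <db> enable` / `proquiet <db> disable`; please check that syntax against the documentation for your release. `take_snapshot` and the wrapper function names are hypothetical placeholders, not part of any product.

```python
import subprocess
from contextlib import contextmanager


def run_checked(cmd):
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(cmd, check=True)


@contextmanager
def quiet_point(db_path, run=run_checked):
    """Hold a database quiet point for the duration of the with-block."""
    # Assumed syntax: proquiet <db> enable / proquiet <db> disable.
    run(["proquiet", db_path, "enable"])
    try:
        yield
    finally:
        # Always release the quiet point, even if the snapshot step failed.
        run(["proquiet", db_path, "disable"])


def snapshot_with_quiet_point(db_path, take_snapshot, run=run_checked):
    """take_snapshot is a hypothetical callable that triggers the SAN/VM snapshot."""
    with quiet_point(db_path, run=run):
        take_snapshot()
```

The `run` parameter exists only so the command runner can be swapped out in a dry run; in real use the default runner would call `proquiet` directly.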

Posted by George Potemkin on 05-Apr-2017 09:52

> text is in German, but message numbers are given :)

Translation:

[2017/03/27@14:22:06.929+0200] P-5038 T--147446016 I ABL 26: (1422) SYSTEM ERROR: Index po-nummer in artkusta for recid 41520842 could not be deleted.

[2017/03/27@15:10:25.802+0200] P-2275 T--147462400 I ABL 17: (4430) SYSTEM ERROR: Index 49, block 107280832, element no. 184: bad info size in a leaf block.
[2017/03/27@15:10:25.811+0200] P-2275 T--147462400 I ABL 17: (2816) prev size = 18, cs = 6, ks = 1, is = 191, key count = 184.
[2017/03/27@15:10:25.821+0200] P-2275 T--147462400 I ABL 17: (14037) Index 49 block validation error data: nment is 455, nlength is 4117, level is 1, current key is 184, offset is 1673, func is cxDoInsert
[2017/03/27@15:10:25.821+0200] P-2275 T--147462400 I ABL 17: (14031) Invalid Index Block Detected
...
[2017/03/27@15:10:25.832+0200] P-2275 T--147462400 F ABL 17: (14036) SYSTEM ERROR: Invalid Index Block FATAL

Can you dump the index block with dbkey 107280832?

Posted by gus bjorklund on 11-Apr-2017 21:29

also doing

- backups with third-party or system backup tools while the database is in use, or

- skipping crash recovery with the -F option

can cause these sorts of errors

Posted by James Palmer on 19-Jul-2017 03:09

Klaus, did you get to the bottom of this problem?

Posted by ke@iap.de on 19-Jul-2017 03:24

The customer changed the hardware on which the VM runs, but errors still occurred.

Then he instantiated the old hardware as a VM, with the old Linux version and OE version.

That worked.

So it must be a combination of the whole software stack...

Posted by ke@iap.de on 19-Jul-2017 03:27

The customer changed the hardware, but errors still occur.

He is not doing any of the bad things mentioned above :)

Then he made a copy of the old hardware as a VM, and this finally runs.

So it must be a combination of the software stack of the new server...

Thanks to all

Klaus

Posted by ke@iap.de on 19-Jul-2017 03:29

The customer changed the hardware, but errors still occur.

He is not doing any of the bad things mentioned above :)

Then he made a copy of the old hardware as a VM, and this finally runs.

So it must be a combination of the software stack of the new server.

Thanks to all

Klaus

Posted by James Palmer on 19-Jul-2017 03:35

Thanks Klaus. Appreciate the quick response.

This thread is closed