DB crash a short time after increasing -B to 200000

Posted by Stefan Marquardt on 15-Oct-2014 07:57

Hello,

i noticed again that the database crashes when i increased -B to this "high" value.

It's a 8k block db and 200000 is ~ 1.7GB, that's far away from 2GB.

[2014/10/15@14:23:16.377+0200] P-49244      T-69004 I Usr    17: (13980) Increase Params increasing buffer pools size (-B) from 50000 to 200000.

....

[2014/10/15@14:29:55.763+0200] P-74376      T-108524 I ABL    13: (9450)  bkioRead:Insufficient disk space during Read, fd 788, len 8192, offset 71984, file C:\ProgramData\Progress\eNoraAcc\winsureuk\sure_db\winsure_13.d1.
[2014/10/15@14:29:55.771+0200] P-74376      T-108524 I ABL    13: (9450)  bkioRead:Insufficient disk space during Read, fd 788, len 8192, offset 71984, file C:\ProgramData\Progress\eNoraAcc\winsureuk\sure_db\winsure_13.d1.
[2014/10/15@14:29:55.773+0200] P-74376      T-108524 I ABL    13: (9450)  bkioRead:Insufficient disk space during Read, fd 788, len 8192, offset 71984, file C:\ProgramData\Progress\eNoraAcc\winsureuk\sure_db\winsure_13.d1.

....

C:\ProgramData\Progress\eNoraAcc\winsureuk\sure_db\winsure_13.d1.
[2014/10/15@14:29:59.882+0200] P-21776      T-44116 I BROKER  0: (4232)  Corrupt block detected when attempting to release a buffer.
[2014/10/15@14:29:59.883+0200] P-55856      T-30668 I SRV     1: (2520)  Stopped.
[2014/10/15@14:29:59.884+0200] P-21776      T-44116 I BROKER  0: (10560) bmReleaseBuffer: Error occurred in area 13, block number: 71983, extent: C:\ProgramData\Progress\eNoraAcc\winsureuk\sure_db\winsure_13.d1.
[2014/10/15@14:29:59.889+0200] P-21776      T-44116 I BROKER  0: (10561) Writing block  71983 to log file. Please save and send the log file to Progress Software Corp. for investigation.
[2014/10/15@14:29:59.889+0200] P-21776      T-44116 I BROKER  0: (-----) SYSTEM DEBUG: Database buffer block
[2014/10/15@14:29:59.890+0200] P-21776      T-44116 I BROKER  0: (-----) pbktbl = 0x1F353998
[2014/10/15@14:29:59.890+0200] P-21776      T-44116 I BROKER  0: (-----) pbktbl->qself = 0x000c000001f93998
[2014/10/15@14:29:59.890+0200] P-21776      T-44116 I BROKER  0: (-----) XBKBUF(pbktbl->qself) = 0x0018fb841f353998
[2014/10/15@14:29:59.891+0200] P-21776      T-44116 I BROKER  0: (-----) pbktbl->bt_qbuf = 000e0000000001f8
[2014/10/15@14:29:59.891+0200] P-21776      T-44116 I BROKER  0: (-----) XBKBUF(pbktbl->bt_qbuf) = 0x5D3C01F8
[2014/10/15@14:29:59.891+0200] P-21776      T-44116 I BROKER  0: (-----) qusrctl: 0x000b000000007f20

.....

I am sure that there is no space problem.

Stefan

All Replies

Posted by Tim Kuehn on 15-Oct-2014 08:05

On Wed, Oct 15, 2014 at 8:58 AM, Stefan Marquardt
wrote:
> DB crash a short time after increasing -B to 200000
> Thread created by Stefan Marquardt

> I am sure that there is no space problem.
> Stefan

Check for things like ulimit which can show up as problems like this,
or an extent hitting the 2G limit for db's w/out having large files
enabled.

--
Tim Kuehn: Senior Consultant - TDK Consulting Services
Ontario PUG President
PUG Challenge Americas Executive, Program Committee Chair

Skype: timothy.kuehn
Ph: 519-576-8100
Cell: 519-781-0081

Posted by Richard Banville on 15-Oct-2014 08:10

What platform was this on?

What are your shm startup parameters if any? (-shmsegsize inparticular)

What are the SHM kernel parameters set to?

This is obviously a bug that should be filed with support  but it does not appear to be common one so getting the environment details (kernel settings, all OE startup params and permissions settings on executables and .db file) will really help in understanding why this is happeneing.

Posted by Stefan Marquardt on 15-Oct-2014 08:14

Windows 2008R2

Every area has only one open extent, there are other larger extents.
I redo the same test procedure again and again, today my fault was to set -B to 200000, 185000 works.

 

               Wed Oct 15 14:02:57 2014

[2014/10/15@14:02:57.149+0200] P-21776      T-44116 I BROKER  0: (333)   Multi-user session begin.

[2014/10/15@14:02:57.208+0200] P-21776      T-44116 I BROKER  0: (5326)  Begin Physical Redo Phase at 2048 .

[2014/10/15@14:02:59.967+0200] P-21776      T-44116 I BROKER  0: (7161)  Physical Redo Phase Completed at blk 2644 off 6668 upd 7626.

[2014/10/15@14:02:59.969+0200] P-21776      T-44116 I BROKER  0: (13547) At end of Physical redo, transaction table size is 256.

[2014/10/15@14:03:00.153+0200] P-21776      T-44116 I BROKER  0: (452)   Login by winsure on batch.

[2014/10/15@14:03:00.187+0200] P-21776      T-44116 I BROKER  0: (5644)  Started for 21600 using tcp IPV4 address 0.0.0.0, pid 21776.

[2014/10/15@14:03:00.188+0200] P-21776      T-44116 I BROKER  0: (4234)  Progress OpenEdge Release 10.2B build 1907 SP08 on WINNT .

[2014/10/15@14:03:00.189+0200] P-21776      T-44116 I BROKER  0: (4281)  Server started by winsure on batch.

[2014/10/15@14:03:00.189+0200] P-21776      T-44116 I BROKER  0: (-----) Started using pid: 21776.

[2014/10/15@14:03:00.190+0200] P-21776      T-44116 I BROKER  0: (4235)  Physical Database Name (-db): C:\ProgramData\Progress\eNoraAcc\winsureuk\sure_db\winsure.

[2014/10/15@14:03:00.192+0200] P-21776      T-44116 I BROKER  0: (4236)  Database Type (-dt): PROGRESS.

[2014/10/15@14:03:00.193+0200] P-21776      T-44116 I BROKER  0: (4237)  Force Access (-F): Not Enabled.

[2014/10/15@14:03:00.193+0200] P-21776      T-44116 I BROKER  0: (4238)  Direct I/O (-directio): Not Enabled.

[2014/10/15@14:03:00.194+0200] P-21776      T-44116 I BROKER  0: (-----) LRU mechanism enabled.

[2014/10/15@14:03:00.194+0200] P-21776      T-44116 I BROKER  0: (-----) Number of LRU force skips (-lruskips): 0

[2014/10/15@14:03:00.195+0200] P-21776      T-44116 I BROKER  0: (-----) Number of LRU2 force skips (-lru2skips): 0

[2014/10/15@14:03:00.195+0200] P-21776      T-44116 I BROKER  0: (4239)  Number of Database Buffers (-B): 50000.

[2014/10/15@14:03:00.196+0200] P-21776      T-44116 I BROKER  0: (-----) Number of Alternate Database Buffers (-B2): 0.

[2014/10/15@14:03:00.196+0200] P-21776      T-44116 I BROKER  0: (9422)  Maximum private buffers per user (-Bpmax): 64.

[2014/10/15@14:03:00.196+0200] P-21776      T-44116 I BROKER  0: (4240)  Excess Shared Memory Size (-Mxs): 35.

[2014/10/15@14:03:00.197+0200] P-21776      T-44116 I BROKER  0: (10014) The shared memory segment is not locked in memory.

[2014/10/15@14:03:00.198+0200] P-21776      T-44116 I BROKER  0: (4241)  Current Size of Lock Table (-L): 80000.

[2014/10/15@14:03:00.199+0200] P-21776      T-44116 I BROKER  0: (13953) Maximum Area Number (-maxArea): 32000.

[2014/10/15@14:03:00.199+0200] P-21776      T-44116 I BROKER  0: (4242)  Hash Table Entries (-hash): 13063.

[2014/10/15@14:03:00.200+0200] P-21776      T-44116 I BROKER  0: (4243)  Current Spin Lock Tries (-spin): 2000.

[2014/10/15@14:03:00.200+0200] P-21776      T-44116 I BROKER  0: (6526)  Number of Semaphore Sets (-semsets): 3.

[2014/10/15@14:03:00.200+0200] P-21776      T-44116 I BROKER  0: (13924) Maximum Shared Memory Segment Size (-shmsegsize) 512 Mb.

[2014/10/15@14:03:00.201+0200] P-21776      T-44116 I BROKER  0: (4244)  Crash Recovery (-i): Enabled.

[2014/10/15@14:03:00.201+0200] P-21776      T-44116 I BROKER  0: (6573)  Database Blocksize (-blocksize): 8192.

[2014/10/15@14:03:00.201+0200] P-21776      T-44116 I BROKER  0: (4245)  Delay of Before-Image Flush (-Mf): 3.

[2014/10/15@14:03:00.202+0200] P-21776      T-44116 I BROKER  0: (4247)  Before-Image File I/O (-r -R): Reliable.

[2014/10/15@14:03:00.274+0200] P-21776      T-44116 I BROKER  0: (4249)  Before-Image Truncate Interval (-G): 0.

[2014/10/15@14:03:00.275+0200] P-21776      T-44116 I BROKER  0: (4250)  Before-Image Cluster Size: 8388608.

[2014/10/15@14:03:00.275+0200] P-21776      T-44116 I BROKER  0: (4251)  Before-Image Block Size: 16384.

[2014/10/15@14:03:00.276+0200] P-21776      T-44116 I BROKER  0: (4252)  Number of Before-Image Buffers (-bibufs): 20.

[2014/10/15@14:03:00.276+0200] P-21776      T-44116 I BROKER  0: (-----) Record free chain search depth factor 5 (-recspacesearchdepth)

[2014/10/15@14:03:00.277+0200] P-21776      T-44116 I BROKER  0: (9238)  BI File Threshold size (-bithold): 0.0   Bytes.

[2014/10/15@14:03:00.278+0200] P-21776      T-44116 I BROKER  0: (6552)  BI File Threshold Stall (-bistall): Disabled.

[2014/10/15@14:03:00.278+0200] P-21776      T-44116 I BROKER  0: (4254)  After-Image Stall (-aistall): Not Enabled.

[2014/10/15@14:03:00.278+0200] P-21776      T-44116 I BROKER  0: (4255)  After-Image Block Size: 8192.

[2014/10/15@14:03:00.279+0200] P-21776      T-44116 I BROKER  0: (4256)  Number of After-Image Buffers (-aibufs): 20.

[2014/10/15@14:03:00.279+0200] P-21776      T-44116 I BROKER  0: (8527)  Storage object cache size (-omsize): 1024

[2014/10/15@14:03:00.279+0200] P-21776      T-44116 I BROKER  0: (4257)  Maximum Number of Clients Per Server (-Ma): 10.

[2014/10/15@14:03:00.280+0200] P-21776      T-44116 I BROKER  0: (4258)  Maximum Number of Servers (-Mn): 7.

[2014/10/15@14:03:00.280+0200] P-21776      T-44116 I BROKER  0: (4259)  Minimum Clients Per Server (-Mi): 10.

[2014/10/15@14:03:00.280+0200] P-21776      T-44116 I BROKER  0: (-----) Server network message wait time (-Nmsgwait): 2

[2014/10/15@14:03:00.281+0200] P-21776      T-44116 I BROKER  0: (-----) Use pollset mechanism for client/server (-pollset): Disabled

[2014/10/15@14:03:00.281+0200] P-21776      T-44116 I BROKER  0: (-----) Delay first prefetch message (-prefetchDelay): Disabled

[2014/10/15@14:03:00.281+0200] P-21776      T-44116 I BROKER  0: (-----) Prefetch message fill percentage (-prefetchFactor): 0

[2014/10/15@14:03:00.282+0200] P-21776      T-44116 I BROKER  0: (-----) Minimum records in prefetch ms (-prefetchNumRecs): 16

[2014/10/15@14:03:00.282+0200] P-21776      T-44116 I BROKER  0: (-----) Suspension queue poll priority (-prefetchPriority): 0

[2014/10/15@14:03:00.282+0200] P-21776      T-44116 I BROKER  0: (4260)  Maximum Number of Users (-n): 61.

[2014/10/15@14:03:00.283+0200] P-21776      T-44116 I BROKER  0: (4261)  Host Name (-H): DEXW4160.

[2014/10/15@14:03:00.283+0200] P-21776      T-44116 I BROKER  0: (4262)  Service Name (-S): 21600.

[2014/10/15@14:03:00.284+0200] P-21776      T-44116 I BROKER  0: (14268) TCP/IP Version (-ipver) : IPV4

[2014/10/15@14:03:00.285+0200] P-21776      T-44116 I BROKER  0: (4263)  Network Type (-N): tcp.

[2014/10/15@14:03:00.285+0200] P-21776      T-44116 I BROKER  0: (4264)  Character Set (-cpinternal): ISO8859-1.

[2014/10/15@14:03:00.285+0200] P-21776      T-44116 I BROKER  0: (4282)  Parameter File: \\dexw4160\C$\ProgramData\Progress\eNoraAcc\winsureuk\swinsureN.pf.

[2014/10/15@14:03:00.286+0200] P-21776      T-44116 I BROKER  0: (5648)  Minimum Port for Auto Servers (-minport): 3000.

[2014/10/15@14:03:00.286+0200] P-21776      T-44116 I BROKER  0: (5649)  Maximum Port for Auto Servers (-maxport): 5000.

[2014/10/15@14:03:00.286+0200] P-21776      T-44116 I BROKER  0: (8865)  This broker supports both 4GL and SQL server groups.

[2014/10/15@14:03:00.287+0200] P-21776      T-44116 I BROKER  0: (9426)  Large database file access has been enabled.

[2014/10/15@14:03:00.288+0200] P-21776      T-44116 I BROKER  0: (9336)  Created shared memory with segment_id: 39780352

[2014/10/15@14:03:00.289+0200] P-21776      T-44116 I BROKER  0: (12813) Allowed index cursors (-c): 244.

[2014/10/15@14:03:00.289+0200] P-21776      T-44116 I BROKER  0: (12814) Group delay (-groupdelay): 10.

[2014/10/15@14:03:00.290+0200] P-21776      T-44116 I BROKER  0: (12815) Lock table hash table size (-lkhash): 13063

[2014/10/15@14:03:00.290+0200] P-21776      T-44116 I BROKER  0: (12816) Maxport (-maxport): 5000

[2014/10/15@14:03:12.261+0200] P-21776      T-44116 I BROKER  0: (12817) Minport (-minport): 3000

[2014/10/15@14:03:12.262+0200] P-21776      T-44116 I BROKER  0: (12818) Message Buffer Size (-Mm): 1024

[2014/10/15@14:03:12.262+0200] P-21776      T-44116 I BROKER  0: (12821) Use muxlatches (-mux): 1

[2014/10/15@14:03:12.263+0200] P-21776      T-44116 I BROKER  0: (12823) Semaphore Sets (-semsets): 3

[2014/10/15@14:03:12.263+0200] P-21776      T-44116 I BROKER  0: (13870) Database Service Manager - IPC Queue Size (-pica) : 64.0  KBytes.

[2014/10/15@14:03:12.264+0200] P-21776      T-44116 I BROKER  0: (13896) TXE Commit lock skip limit (-TXESkipLimit): 10000.

[2014/10/15@14:03:12.264+0200] P-21776      T-44116 I BROKER  0: (15219) Encryption enabled: 0

[2014/10/15@14:03:12.265+0200] P-21776      T-44116 I BROKER  0: (15218) Encryption cache size (-ecsize): 1000

[2014/10/15@14:03:12.336+0200] P-97908      T-34728 I APPSRV  7: (452)   Login by winsure on batch.

[2014/10/15@14:03:12.361+0200] P-21776      T-44116 I BROKER  0: (10471) Database connections have been enabled.

Posted by Stefan Marquardt on 15-Oct-2014 08:19

Additional question: Why the first user can login before the connections have been enabled?

Posted by Libor Laubacher on 15-Oct-2014 08:22

Because of the other issue you are having.

Now back to the OP.

So *every* time you use increaseto and set -B to 200000 the DB goes down at any time this is performed.

There is no OS backup, no VM snapshot, no snap copy, no live antivirus scan etc running at the time of the crash ?

Posted by Stefan Marquardt on 15-Oct-2014 08:36

I avoided to use "every". It's now the second time I increased it to 200000 to improve the performance and it crashed.

Because it's a test procedure i am able to redo this again.

Since the first happening I avoided values > 185000 but now I tried it again.

The database has no database errors because the test procedue is:

load backup

run static 4gl programs

I did it ~ 5 times with the same backup.

Currently it's (still) running again with -B increased to 185000, the whole procedure needs ~ 1 hour.

What's happening outside of my scope - i don't know.

But then it happend at the first the same "secret" backgound things too, i can't believe it.

If the procedure finished, i can redo it again.

Posted by Frank Meulblok on 15-Oct-2014 11:06

Thing to keep in mind is the Windows memory allocator. That does a couple of weird things when it comes to shared memory, and will use up more address space than you think you're allocating. Effectively: If you're running any 32-bit process, your out-of-address space issues can start showing up anywhere from the 1.5 to 2.0 gb mark.

Posted by TheMadDBA on 15-Oct-2014 11:38

Like Frank says how Windows handles 32 bit processes is very likely the issue. For example if you have 4GB of memory on the box 2GB is reserved for use by the OS and the other 2GB is available for processes. That 2GB also needs to be contiguous for it to be available. Memory "holes" are caused by other processes grabbing different chunks of memory that don't allow Windows to allocate memory in proper chunks.

The real fix would be to upgrade to 64 bit Progress and 64 bit Windows. There are still a few memory allocation issues with Windows but most of this goes away.

But.... the performance improvement from 185K to 200K is likely to be very small and you would probably be better off looking into the code, indexes and storage area setup to get reasonable performance differences.

Posted by Stefan Marquardt on 15-Oct-2014 11:39

But then increaseto should not allow values that can crash the database, isn't it?

For 32bit "increaseto -B max" would be fine because today 2GB isn't a big deal even for a smartphone.

Then i don't need a calculator ;-)

If you are asking why we are not using 64bit version for the server, we still use 32bit dll on the db server side in local batch procedures or appserver calls. The main developer told me that there is no environment for 64bit tools development available. I know we could create a separate 64bit database server and a 32bit client "server" running the dlls, but this needs time for investigation. Currently it's a deadlock to use more RAM.

But i need to prove whether the error is 100% reproducable when the test environment is available again.

Currently no problem as all the times with the lower -B.

Posted by TheMadDBA on 15-Oct-2014 11:52

Well to be fair to PSC, some of these things are out of their control. They can check for free memory and see that X amount is available, but Windows still might not be able to actually allocate it to that process.They have to run within the operating system rules and sometimes those restrictions aren't known until you actually try and grab the memory.

32 bit .NET, Oracle and SQL Server all have the same issues that PSC does because it is a 32 bit OS issue and not a product issue.

Running 32 bit on non Windows operating systems also have limitations that are tied to how the OS handles memory.

Is he saying the 64 development environment doesn't exist in Progress or the tools they use to write the DLLs? OE 11.3 and above have 64 bit GUI clients.

Posted by Stefan Marquardt on 15-Oct-2014 12:03

As I asked dev why we can't switch easy to 64bit and I was asked whether I would have a 64bit delphi environment ...

I am going to be fair to PSC, if windows can't allocate it and throw an error during the try, that would be o.k., but not confusing error messages about file space and block errors with an abnormal shutdown, minutes or hours later after the increasing.

But i still have to prove it to be 100% sure.

Posted by TheMadDBA on 15-Oct-2014 12:30

Not sure what would be wrong with a 64 bit Delphi environment (it is available)... other than it being Delphi of course.

Agree completely about the error messages. It would be interesting to see how this works on modern versions of Progress compared to 10.2B.

Posted by gus on 15-Oct-2014 13:05

@stefan:

smartphones do not run windoze. (well, ok, i lied, there are two people using windoze phones)

in theory you are correct, we should not allow values that will cause a problem. unfortunately, in practice, this is not possible. On 32-bit Windoze a process address space is limited to 2gb.

that includes all code, stacks, dynamically allocated heap, shared libraries and a host of other things. this limits the effective address space that can be used for database connections to about 1.7 GB under /ideal conditions/.

There are no system calls that allow you to ask "how much memory can i attach" or related questions. even if there were, by the time you got the answer and attempted such a feat, the answer might have changed.

the fact is, 32-bit operating systems are no longer suitable and should be abandoned as soon as possible. in the meantime use smaller -B (actually -B +-B2) and don't connectto multiple databases.

try -shmsegsize 128.

Posted by gus on 15-Oct-2014 13:07

the errors that result can occur in /any/ process that has a self-serving database connection. not just the one that tried an increaseto.

maybe we can improve things a bit. i doubt it can be done quickly if at all.

Posted by Stefan Marquardt on 17-Oct-2014 09:15

"So *every* time you use increaseto and set -B to 200000 the DB goes down at any time this is performed."

Yes, i can reproduce it.

[2014/10/17@15:26:09.802+0200] P-43912      T-94968 I Usr    11: (13980) Increase Params increasing buffer pools size (-B) from 50000 to 200000.

[2014/10/17@15:26:10.584+0200] P-43912      T-94968 I Usr    11: (453)   Logout by Increase Params on CON:.

[2014/10/17@15:26:50.347+0200] P-17928      T-72648 I PROMON 11: (-----) Login by smq_adm.

[2014/10/17@15:35:12.167+0200] P-87512      T-55664 I ABL    10: (14676) bkioRead:Unknown O/S error during Read, errno 2, fd 788, len 8192, offset 225599, file C:\ProgramData\Progress\eNora\winsurede\sure_db\winsure_13.d1 database ..\sure_db\winsure. (9451)

[2014/10/17@15:35:12.176+0200] P-87512      T-55664 I ABL    10: (9450)  bkioRead:Insufficient disk space during Read, fd 788, len 8192, offset 225599, file C:\ProgramData\Progress\eNora\winsurede\sure_db\winsure_13.d1.

I know that there are 32bit limits but I can understand why

1.) increaseto work return without error

2.) the error happend later (depends on data-io!!), one time it was the next day when worktime began

3.) the error message is not "memory problems", it says: unkonw error accessng a special file

Every poor admin would think that he has a terrible OS filesystem error.

BTW: promon showed 1701 MB shared memory

Posted by Stefan Marquardt on 17-Oct-2014 10:02

@gus:

"try -shmsegsize 128.", i did and then i can use 200000 but higher values caused more problems.

BTW: I know 4 peoples using windoze phones, you can meet one in November in Stuttgart :-)

I will use 185000 but i asked Libor to examine the strange (wrong) error message,

Posted by Mike Fechner on 17-Oct-2014 10:04

Düsseldorf, i assume.

Von meinem Windows Phone gesendet

Von: Stefan Marquardt
Gesendet: ‎17.‎10.‎2014 17:03
An: TU.OE.RDBMS@community.progress.com
Betreff: RE: [Technical Users - OE RDBMS] DB crash a short time after increasing -B to 200000

Reply by Stefan Marquardt

@gus:

"try -shmsegsize 128.", i did and then i can use 200000 but higher values caused more problems.

BTW: I know 4 peoples using windoze phones, you can meet one in November in Stuttgart :-)

I will use 185000 but i asked Libor to examine the strange (wrong) error message,

Stop receiving emails on this subject.

Flag this post as spam/abuse.

Posted by gus on 17-Oct-2014 12:51

@stefan:

32-bit windoze will not allow you higher values. i agree you should get decent error messages. that does not solve the problem though.

but the following can happen:

0) you use increaseto to expand some data structure like the buffer pool.

1) this works fine for the primary broker and server processes and they successfully attach the new shared segment(s).

2) self-serving clients attach the new segment(s)

3) one or more cannot attach the segments because they have insufficient free address space of sufficient size.

at this point, it is too late and we (well, actually you) are hosed.

Posted by Thomas Mercer-Hursh on 17-Oct-2014 13:06

Which says something about trying to ride too close to limits ... especially for what has to be very minimal impact on performance that one can expect from an 8% boost in -B.

This thread is closed