Future changes of db log files

Posted by George Potemkin on 12-Jun-2015 03:54

If I understand correctly it was announced that Progress is planning to use two database log files: one for login/logout messages and another for the rest. I understand the reasons behind this idea but I'm not sure if it will make me happy. When I see an error in db log I got form a customer the first thing I'm trying to create is the list of /all/ messages issued by the same session - including the login/logout messages. I'm sure that having two log files for an incident a customer will send to TS the smallest (and, of course, the most important) file - the one without logins/logouts. Would it be enough for the most cases? I'm not sure.

What are the problems you have while parsing the current structure of db logs?
What are the changes in db logs would you wish to get in the future Progress releases?

Future Progress releases seem to be going to use more than 4 billions messages. ;- ) At least _DbParams-Msg-Num introduced in V11.5 is INT64. So we should get ready for such "verbose" releases (with the hundreds GBs promsgs file!). ;-) For information: the version 11.5.1 has only 17,919 messages (the promsgs' size is 1.6MB) and each new release adds approximately 300-400 messages.

The most annoying (IMHO) message is # 14658 (# 5512 in the past):

Previous message sent on behalf of user <number>, server pid <pid>, broker pid <pid>.

Instead of this message I would prefer to have a new column in the log that should report the number of user the message is talking /about/.

The current log structure has the column with the number of user who /writes/ the message to the log file. It's a server for the remote clients. But the "subject-object" relations also exist between the self-service processes. For example:

WDOG 249: (5028) SYSTEM ERROR: Releasing regular latch. latchId: 4
WDOG 249: (5028) SYSTEM ERROR: Releasing regular latch. latchId: 1
WDOG 43: (2522) User 249 died holding 2 shared memory locks.


In first two messages the watchdog "stole the identity" of dead session (instead of its own user number it used the user number of dead session). It's not a big issue when you read by the eyes a small fragment of db log but it might result in the missed messages when you need to parse the millions lines in the log files.

Transaction backout messages do not have any information about the user who owns transaction (and about TRID as well):

Begin transaction backout. (2252)
Transaction backout completed. (2253)


The session can be already dead at this moment but its number is still reserved in connection table.

User <num> died holding <num> shared memory locks. (2522)
User <num> died with <num> buffers locked. (2523)

vs

User <num> died holding <num> shared memory locks. (5026)
User <num> died with <num> buffers locked. (5027)


The only difference between these messages is /who/ writes them: watchdog or "dead" user itself who was in fact "a bit alive":

ABL 10: (5027) User 10 died with 1 buffers locked.

I'm not suggesting to change all existent messages (like Progress did for the messages with RECID in 11.4 to support table partitioning) but new column could be used at least instead of 14658's and by the future messages.

It would be nice if Progress will use the numbers for all messages without exceptions. Not like:
SRV 1: (-----) Received RECONNECT from WTB
Promsgs file is not the only source of Progress messages, some messages are hard coded into the executables - namely the ones introduced in the patches. Later (in new releases) Progress is moving these messages to the promsgs file but sometimes it forgets to change the executables. And the difference between the real message text in db log and its "official" version in promsgs file (and in Help) may give the wrong results while scanning log files by message text.

Progress uses the message types like "F" for "fatal" messages:

F ABL 10: (5027) User 10 died with 1 buffers locked. 
F Usr 97: (1124) SYSTEM ERROR: Wrong dbkey in block. Found 57689, should be 9787072 in area 19. 
F RFUTIL : (4524) SYSTEM ERROR: File /backup/xxxx too small 0, blocksize 16384 extend failed.


But some messages that are not fatal for /database/ are also marked as "F":

F ABL 2113: (14675) SYSTEM ERROR: Attempt to define too many indexes for area 6 database DBI16zaSWw. (40) 
F RFUTIL : (835) ** Cannot disable after-imaging, it is not enabled. 
F RFUTIL 46: (836) ** Cannot do AIMAGE NEW, after-imaging is not enabled.


The "W" type is rare:

W ABL 1228: (5408) WARNING: -l exceeded. Automatically increasing from 4482 to 4492. (5408)

The most messages are reported as type "I" and I did not see other message types. In other words the message types so far are not consistently used by Progress and we can't rely on the message types. For example, Progress could add, "C"(-onnect) type instead of moving the connection messages to a separate files. It would be easy to filter out the "unwanted" messages provided they are correspondingly tagged.

I would like to have the less frequently re-used user number - similar to PID in the OS: it's unique at any given moment, it can be re-used later but in fact even on very busy systems it will be re-used once per hour or so. The current limit (32K) is enough for user numbers. The user number can be still related to a slot in the user control table (a.k.a. the connection table) e.g. as modulo by table's size. But if a session crashed and the same end-user almost immediately started new session then it's very likely new session will get the number that was just released by previous session. We need to be careful not to mess the messages from these two sessions especially for remote sessions where we don't know their PIDs.

It would be great if db logs will report the PIDs for remote sessions. Now we have only the following information:

SRV 1: (742) Login usernum 947, userid root client type WTA , on xxxxx batch using TCP/IP IPV4 address 192.xxx.xxx.xx.
SRV 1: (708) Userid is now XXXXX.
SRV 1: (14658) Previous message sent on behalf of user 948, server pid 28017, broker pid 24771. (5512)


This information (userid on OS level, client's host, client's type ) does not allow us reliably identify all connections of the same session in the logs of all databases the session was connected to. Even the connection time is useless, for example, in case of the pre-started appserver's sessions.

And I'd wish that all Progress logs have /exactly/ the same format. So we can use the same tool to parse all of them.

Do you have something to add to the list?

Best regards,
George

All Replies

Posted by Dmitri Levin on 12-Jun-2015 10:46

George

I believe there are two ways to look at database log files. One is browsing it like a DBA would do and an other one is parsing log file with a tool. For parsing it with a tool PSC added time stamps, PID and other info.

Browsing log file on a system that has thousands of users by a human DBA is very difficult. That is why the suggestion came at the conference to make it easy for a rank-and-file DBA to see a system-wise issues with the database.

May be PSC could come up with something that works for both, a human DBA reading the log for database issue and a script parsing the log for a specific code issues.

For example dbanalys in 11.4 has two outputs, one for human reading and an other one for load into Excel and further processing. May be PSC could leave the current log as it is for script processing and create a new "short" log for critical database wise issues ( which couldl probably just exclude connect/disconnect ).

Posted by George Potemkin on 12-Jun-2015 12:27

Dmitri,

> One is browsing it like a DBA would do

My experience says that DBAs do not check the logs until the end-users tell them that they have the problems. And then DBAs often use a primitive tool like "grep error *.lg".

For example, there was a case when a crazy process was writing to a db log a few messages per sec and this lasted a few days. The daily db logs were huge but DBA did not notice this even when he uploaded the log on ftp for me to analysis the completely different issue.

A typical call from large customer:

Customer: Houston, we've got a problem. What data you need for analysis?

Me: I need db logs and recent promon data. (Emptied the contents of the glass in one gulp.) I told you that a thousand times before.

Customer: OK. Will be done in a few minutes.

I don't blame them, they follow the boss' instruction: "in case of problems first to contact TS".

> May be PSC could come up with something that works for both, a human DBA reading the log for database issue and a script parsing the log for a specific code issues.

PSC could reanimate the dead "key event" project - a daemon can parse db log for any "unsuppressed" messages and copy them to a separate file or send them to the specified email list. The "suppressed" messages is the set of any messages that we don't want to copy - like the connect/disconnect messages. Such feature would be a good "toy" for DBAs.

But as a first line techsupport I need as complete set of data as possible. The customers call me when an issue is happening "right now" but a thousand mile away from me - at the site I do not have a direct access to. The less data I will get the more "false trails" I should check and these checks would just waste my time when the time is critically important because the tens thousands end-users are unable to do their job. Will PSC provide a tool to analys the db log or not - it's not a problem for me. I have my own tools.

Regards,

George

Posted by TheMadDBA on 12-Jun-2015 13:14

George... I think you are correct that most DBAs don't check the log files unless there is a reported issue (and sometimes not even then).

I do like the idea of separate logs for different types of messages. Basically the ability to have multiple different log files with their own logging levels and logged events.  You could leave the main log intact (or turn off the noise messages like setuserid messages or connect/disconnect) and then have a different log file for errors only. I don't even think you would need another process for this. Just modify the current DB logging processes to honor new parameters.

It would certainly cut down on the number of home grown tools to parse all of those out. And when you have very large log files like some of us do it would certainly cut down on the IO parsing those logs.

Posted by George Potemkin on 12-Jun-2015 13:32

I would agree with any changes in db logs provided somewhere there will exist a "flight data recorder" and even the captain of the aircraft will be unable to change the completeness of the recorded data. ;-)

Regards,

George

This thread is closed