Online schema changes don't work (and never have)

Posted by dbeavon on 13-Mar-2018 15:15

I've started using client-server connections to the OE DBMS and have discovered issues that never impacted me with "shared memory" connections. 

One of the more annoying issues is that online schema changes don't work.  In version 10 Progress claimed to have added support for this... but whatever support they had added was incomplete to say the least.  There are a variety of obscure errors that ABL apps will receive if they are connected remotely via client-server connections to an OE DBMS server ("SERV".)

Here is a partial list of the types of interesting messages you will see (list taken from https://knowledgebase.progress.com/articles/Article/P126802)

  • BUFFER-FIELD <field-name> was not found in buffer <buffer-name>. (7351) 
  • Lead attributes in a chained-attribute expression (a:b:c) must be type HANDLE or a user-defined type and valid (not UNKNOWN). (10068) 
  • SYSTEM ERROR: Cannot read field from record, not enough fields. (450) 
  • SYSTEM ERROR: Cannot read field XXX from record, not enough fields
  • SYSTEM ERROR: Failed to extract field <field-num> from <file-name> record (table <table-num>) with recid <RECID>. (3191) 
  • Couldn't extract field '<field>' from source in a BUFFER-COPY statement. (5367)
  • https://community.progress.com/community_groups/openedge_development/f/19/t/33923 RAW-TRANSFER statement failed due to stale schema


The errors listed above are by no means exhaustive, according to that KB.

Other KB's elaborate on the problems with "online" schema changes:

By trial-and-error, I've discovered that I can dig into promon (R&D) and find "admin functions" -> 7. "terminate server" and then start killing off any servers of type "SERV".  Once they are all dead, the schema corruption/compatibility issues seem to go away for any app that is restarted. But killing off servers like this seems to be a very error-prone solution.  It certainly does not seem like the type of thing you want to be doing in a production environment.  (more explanation about this workaround can be found here: https://knowledgebase.progress.com/articles/Article/P127816 )

Is it common knowledge that "online" dictionary changes are this unreliable?  Are we doing something unusual?  Our ideal workflow would be to run a DDL like so (via the SQL92 interface) without worrying about killing SERV-type servers afterwards.

ALTER TABLE PUB."xxx" ADD COLUMN ( "abc" BIT DEFAULT 'false' PRO_ORDER 810 PRO_FORMAT 'true/false' PRO_LABEL 'abc' PRO_COL_LABEL 'abc'  ) ;

Once the schema is changed we'd like it if our ABL clients would pick up the change, especially if they are restarted.  Unfortunately it is not enough to stop the client-server ABL clients because the "SERV"-type servers seem to have corruption in them as well.

Any tips would be much appreciated.  I'd especially like to avoid the part where we have to use promon to kill servers.  In the very least it would be nice if the DBMS would start creating additional servers for all new client connections that are made after the schema changes.  The old servers that predated the schema change should not be used for new connections.

All Replies

Posted by George Potemkin on 14-Mar-2018 03:49

Progress version?

Posted by James Palmer on 14-Mar-2018 04:41

The big issue with online schema changes will be ensuring the CRC isn't broken. But then I'm sure you're aware of this and have this covered.

I've not dug into it yet, but I gather a lot of changes have been made in 11.7 to do with online changes. Something to do with client notify etc. Like I say, I haven't dug in, just picked up a few things by osmosis at EMEA PUG Challenge last year.

So, as George suggests, knowing your Progress version will be key to understanding what you can and can't do online.

Posted by dbeavon on 14-Mar-2018 07:50

We're running Release 11.6.3.025  where the database is served (on HP-UX / IA64).

Posted by George Potemkin on 14-Mar-2018 07:59

11.7 introduced the -usernotifytime parameter

"Evil side affect of -usernotifytime parameter" was fixed in 11.7.2 (under PSC00359361).

community.progress.com/.../115707

Posted by dbeavon on 14-Mar-2018 08:07

It might be a while before our databases are running on OE 11.7.  I would guess that we'll upgrade to that version around the same timeframe that the first version of OE 12 is released  (we are usually a version or two behind).

Insofar as "online" schema changes are concerned, it doesn't seem enough to restart all the remote ABL clients.  I was wondering if anyone had any practical advice from trying to perform "online" schema changes in production, and making those changes take effect for new ABL client instances, both for shared-memory and client-server connections.  While it seems acceptable to restart ABL client applications for them to pick up schema changes, it is extremely odd that we are responsible for performing admin operations on the database SERV-type processes themselves. The management of those seems like it should be the responsibility of the DBMS, and if they become internally corrupted and start generating errors - especially for new ABL client connections - then there's no way that anyone can claim the database allows "online" schema changes.

Given the way in which SERV-type servers need to be manually bounced to clear out the corruption/compatibility, then in the context of schema changes, they aren't more than glorified ABL client apps themselves.  The technology isn't really client-server at all... it's more like big-client-little-client.

Posted by dbeavon on 03-Apr-2019 19:14

OK we are now running OE 11.7.4 (both on the PASOE-client side of things and on the database side of things).

And we are still getting some pretty scary errors in our remote clients after a schema change.  The "remote servers" are very, very confused about the underlying database schema.

SYSTEM ERROR: Failed to extract field 187 from ord_hdr record (table 441) with recid 13827526. (3191)

I have one open ticket with Progress about the inability to perform online schema changes.  That ticket was for a slightly different message:

"RAW-TRANSFER STATEMENT FAILED (11425) DUE TO STALE SCHEMA"

https://knowledgebase.progress.com/articles/Article/raw-transfer-statement-failed-due-to-stale-schema-for-table-abc?q=000089700&l=en_US

The support case has been open for almost a year.  The original commitment was to fix that in 11.7.4 but weren't able to deliver.  I'm hoping the fix for "STALE SCHEMA" finally comes in 11.7.5.  If we do get the fix, then what are the chances that an additional failure message (failed to extract field) would go away as well?

I don't know why there aren't a lot more developers that run into these issues.  Don't any other Progress customers ever attempt to use the features for "online schema changes"?  When this stuff first became available, I was very skeptical that it could be pulled off properly.  What I'm now being told by Progress is that schema changes can be made in either ABL or SQL ... but if they are made on one half then the servers on the other half become stale/inconsistent/confused.  I wish the two teams that work on those technologies would get on the same page.  Sometimes it feels like the OE database has been given a frontal lobotomy - and the SQL and ABL technologies are not really inter-connected the way they should be.

Perhaps Progress should just go back to disallowing online schema changes until they can get their ducks in a row.  I'd prefer to get a message saying that the database couldn't be locked for modification, rather than a message that appears like the database has been permanently corrupted: SYSTEM ERROR: Failed to extract field...

Sorry for the complaining.  It is a bad day when you get a message saying you have a corrupted database. (At least that's the first conclusion based on the google hits and the Progress KB )

Posted by onnodehaan on 03-Apr-2019 19:37

Hi dbeavon

Sucks that you are running into these kind of problems.

You where asking if other vendors don't run into similair problems. We don't. But that's because we have strict release policy concerning database-changes. Once a year we release a major version that can contain new tables and//or fields.

Our customers are used to a yearly update that takes down the system for a friday evening or weekend day. And they don't need a 24x7 availability of the application.

But I agree, when PSC releases a feature, it should just work. Or be specific in when it doesn't.

Posted by dbeavon on 03-Apr-2019 21:06

Production releases aside, I would think that other people would encounter this stuff in their day-to-day software development activities.  When you do a database change in the development or testing environment, do you always restart the entire database as well?  

Or maybe you don't use client-server connections?  When working with the legacy "shared memory" connections, we don't normally have as many troubles doing "online schema changes".

Today we wasted a lot of time trying to figure out why our code was failing in one environment and not the other.  It had nothing to do with the custom code, or the custom schema.   Everything was deployed properly and successfully.  That deployment was done more than a day ago.  The difference came down to the fact that one database was restarted somewhere along the way and not the other.  Bouncing the database is not one of the first things you think to do when your programs aren't running properly.

These so-called "remote servers" seem picky about their schema.  And, as near as they are to the database itself, they don't wan't to retrieve fresh schema without a full database restart.  It doesn't make sense!

Posted by onnodehaan on 03-Apr-2019 21:23

Hi

We have a fully automated build environemtn. Every single file that is checked in starts a new "trunk" build.

When we commit a DF-file, we start an "update database"-job after office hours. That triggers a rebuild of our master "test" (trunk)-database.Takes about 15 minutes.

After the build is complete, we restart the Docker container for "trunk" and after a minute or so, all dev/test environments are up and running again and using the new database.

Weekly updates for our Minor-releases, are also generated  and published automatically, but never contain DF-files; since database-conversions are rather slow for large(r) customers. Would be great if we could use multi-treading for database-conversion programs btw. Spawing more than one session is troublesom, because of locking issues.

As mentioned before, only our Major releases (yearly) contain DF-files.

So, because we have automated everything, using AWS / Docker etc, the need for online-schema updates for Development is non-existing in fact.

Updating customers is always done via an installer scripts, which is as simple as this example:

- start script on Unix or start .CMD on Windows

- Type:  INSTALL,  press ENTER

- Type   2019.2.4  press ENTER

Version 2019.2.4 is downloaded, unpacked, installed, database is updated, converted, all automatically.

If the users uses clients, they need to be updated using the same procedure, but mostly customers use a few terminal servers or some way to use remote clients (Citrix image, etc)

Posted by dbeavon on 03-Apr-2019 22:13

We stopped using DF's and migrated to SQL deltas some time ago.  But the idea is basically the same.  Maybe the use of the SQL-style deltas is where we took the wrong turn.

Maybe we need to take a step back, and just talk about applying a change to your own development database (the first-ever place you are creating your DF's).  Lets say you modify the database which your are actively working with.  Let's say you add a column.  And you delete any previously created r-code.  Do you then have to bounce the entire database to avoid errors in your applications?  Or do you just restart client applications?  If you aren't bouncing the database, then are you working exclusively with "shared memory" clients?

That's all I'm really talking about. The "remote servers" are the things that seem to choke up on schema changes.  

The need to restart the database for the sake of "remote servers" is very odd.  It is defeating the whole purpose of the "online schema changes".

We can also stop and restart any of our databases in less than 15 minutes, but that shouldn't be necessary.

This thread is closed