Data Extraction...the hard way...

Posted by b0b0rama on 20-Oct-2011 15:44

I have some old (version 9 and older) Progress database files that I need to pull data out of... without the nice friendly help of Progress itself.

I no longer have access to a running version of Progress that supports these files. I've found whitepapers that discuss the organization of blocks in the database.

I'm used to working with binary data, so packed fields, integers, floating point, and offset pointers, block-id's and such don't scare me.

So far... I'm able to move through the database from block to block (1024-byte blocks), and I'm picking out the record blocks.

For each record block, I'm following the local offsets in the block header to each record within that block, and I'm reading what is almost always the record length and table identifier.  Then I'm able to follow the field lengths, move from field to field, and export data.  This has worked well for several tables... but some records seem to have "more complex" record headers.  Not just "four bytes".
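The scan loop itself is simple. Here's a minimal Python sketch of what I'm doing; note that BLOCK_TYPE_OFFSET and RECORD_BLOCK_TYPE below are hypothetical placeholders, to be filled in from whatever the block-layout whitepapers actually specify:

    # Sketch of the block scan described above. The block size matches these
    # old files; the type offset and type code are placeholders, not
    # known-correct values.
    BLOCK_SIZE = 1024
    BLOCK_TYPE_OFFSET = 0      # hypothetical: where the block-type byte lives
    RECORD_BLOCK_TYPE = 0x00   # hypothetical: type code for a record block

    def record_blocks(path):
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if len(block) < BLOCK_SIZE:
                    break
                if block[BLOCK_TYPE_OFFSET] == RECORD_BLOCK_TYPE:
                    yield block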

Does anyone have technical documentation on the structure of individual record headers?

Any guidance is welcome.

Thanks

All Replies

Posted by Tim Kuehn on 20-Oct-2011 16:28

I would expect it's a lot easier - and safer - to send the db to someone who has 9.* and ask them to dump it for you.

AFAIK, there's no published doc on the internal structure as it can change over time depending on what engineering decides to do with it.

Posted by b0b0rama on 20-Oct-2011 17:40

Yes, well... it might be much easier if I contact some outside company and send them the database files, which I suppose they would have to "load" using developer tools, and I would have to explain which tables I wanted, etc.  This "easier" way could end up costing me thousands of dollars with some sort of data recovery service.

I suspect the more recent Progress databases have increased in complexity. Exploring these old database files in a hex editor, it seems that these version 8 and 9 record blocks really just aren't that complicated.  For the most part I'm able to take these thousands of blocks and separate out the "like" records into groups based on their "table number".  Sometimes, however, I see records from "obviously" different tables being grouped together because there is something "special" about their record header.

I think the crux of my problem comes down to just knowing the organization of these tiny record headers, which are probably only 4 to 10(?) bytes in length for each record. I expect there may be block/record address pointers for any records that are divided into "fragments", e.g. any record larger than "x" bytes, or that no longer fits in its record block, will be divided into fragments, and the header of each fragment will contain the block id and relative address of the next fragment... or something.  Even if I can't get the "exact" specification for a record header, it'd help if I knew what sort of data to expect in a header (like forward-link block pointers or ?).  The block-level headers are well defined in the available documentation.

I'm not opposed to sending out the data to a third party... it's the great heaping mounds of cash that must accompany it that gives me pause. Besides... what fun would that be?   :-)
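If that guess is right, reassembly would be a simple chain walk. A sketch, with the caveat that the whole pointer layout here is speculation on my part (the "next" key is a hypothetical (block id, offset) pointer parsed out of each fragment header):

    def reassemble(fragments, first_key):
        # fragments: dict mapping (block_id, offset) -> parsed fragment, where
        # each parsed fragment carries its data bytes plus a guessed "next"
        # pointer to the following fragment, or None at the end of the chain
        data, key = b"", first_key
        while key is not None:
            frag = fragments[key]
            data += frag["data"]
            key = frag["next"]
        return data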

Posted by Tim Kuehn on 27-Oct-2011 08:27

How much data are we talking here? Because if it's a "few" GB, I'd be willing to help out with this for a reasonable fee.

Posted by b0b0rama on 27-Oct-2011 10:36

Thanks, I'll keep that in mind :-)    Actually... if I could purchase a small license for Progress Version 9, it'd probably be worth it to me. I think Version 9 comes with the utility that allows it to upgrade version 8 databases.  Unfortunately, to my knowledge, the current Version 10 can only "upgrade" version 9 databases.  I'm guessing you have a functional Version 9 system.

It'd be swell if they had a utility that could do an ASCII dump of tables from a db without the database engine running.

Posted by Tim Kuehn on 27-Oct-2011 15:34

I have a pile of versions, although I only have >= 10.2 installed on my current machine.

A Personal DB license -might- be able to do a dump & load (D&L), and the last time I got one it was in the $300 range. Failing that, a 4GL client would run around $1K.

There are also the trial-version development suites you can download, but they're for 10.2* systems. They might have the necessary binaries to convert a v8 db to v9 - I don't know for sure, though.

Posted by Thomas Mercer-Hursh on 29-Oct-2011 16:13

In a version 11.0 distribution, there are three directories under the bin directory in $DLC: 83dbutils, 91dbutils, and 101dbutils.  The 83dbutils directory includes 83conv78.bat to convert a version 7 DB to version 8.  So, with a current copy of Progress, presumably even the eval, you have the tools needed to migrate all the way from version 7 to the version you have.

Trying to do this by reading the blocks would be nuts.

Posted by b0b0rama on 31-Oct-2011 09:52

Thanks, Thomas, for the additional input. It really is appreciated.

I'm looking into whether I can get access to these utilities.  This database was originally running on an old SCO Unix box, which is dead / no longer available.  If I need to purchase a current version or use a trial version of Progress, I need to consider where I'm going to get a server to run it on, what platform it's going to be, and, if it's not the same platform, whether it's going to be able to load these database files.

I have a lot of experience working on internal file structures like various flavors of ISAM (cisam, kisam, tisam) and others.  I have nothing against the Progress database; seems like a fine database to me. :-)    However, I need to move data formerly hosted on a SCO Unix box with Progress to a new server/database --- and so far, I only have access to the raw .db and .d1 (.d"x") files.

If it's necessary that I perform a tree traversal starting at a root node and recursively moving through intermediate nodes until I arrive at a leaf node which points me to individual records, I can do that.  It won't be a picnic, but I've done this kind of thing before. It's always an "imperfect" science unless you have really good whitepapers that describe the function of status bits, etc. Sometimes, though, if you just want to do a raw dump of every record in the file, you can move sequentially through the blocks, tossing out everything but record blocks. If I consider a record "fragment" to actually be a valid "record" (has a rowid) in and of itself, I'm 90% of the way there already. Progress uses internal "field compression" that poses some additional challenge (I've seen this kind of thing before too).

Maybe you do have to be a little nuts to go this route. To me, it's just knowing the "process" behind the magic. If I had 100 numbers to add and my calculator application won't launch, I can go find, download, install, and license another calculator application, or find somebody else with a working calculator application and ask or pay them to do it for me. That might be "easier", but it doesn't mean I shouldn't know how to do addition. Does it make sense to write my own calculator application?  Well, no, but a typical calculator application does lots of stuff I don't need in order to add 100 numbers.   I don't need subtraction, or multiplication, or division. I don't need a fancy GUI.   I could in fact write a Perl script to read a text file of 100 numbers and sum them in a couple of minutes - problem solved.

I really do appreciate all the suggestions and additional info offered up by this group's members. I understand my questions fall well outside what a typical Progress database user would have any interest in.  Thanks again for the heads-up on the conversion programs. I will investigate this avenue as well.

Posted by Thomas Mercer-Hursh on 31-Oct-2011 11:16

If I were paying for this, I sure would be looking for a compatible box to borrow before I authorized spending the kind of time that you are going to have to take.

Posted by b0b0rama on 31-Oct-2011 11:59

Well, working at the problem from the other direction (trying to upgrade the files into Progress 10.2 with the supplied tools) is proving troublesome, perhaps because we lack the necessary expertise: the paths stored in the .db point to locations that don't exist.  The SCO box used a different directory structure than a current Windows installation, etc.

We are trying to make adjustments so that the config files and path environment variables support the now non-existent SCO paths.

Posted by Thomas Mercer-Hursh on 31-Oct-2011 12:07

You've tried the repair option?

You might get more response to some of these "move from a dead box using the utilities" questions by posting on the OpenEdge/RDBMS forum.  The DB experts may not see a question "way up here".

Posted by gus on 31-Oct-2011 13:07

You can create a new .db file by putting the data extents in whatever location you find convenient and creating a new structure definition file (.st) to describe where the extents are. Then you can use a "prostrct builddb foo foo.st" command to create a new .db file. Alternatively, you can use prostrct repair to change the extent locations in an existing .db file.
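For example, for a simple database the .st file can be as small as this (the paths here are made up; check the prostrct documentation for the exact syntax your version expects):

    # foo.st -- one line per extent: a type letter (b = before-image,
    # d = data), then the extent's path
    b /data/foo.b1
    d /data/foo.d1

Then "prostrct builddb foo foo.st" writes a fresh foo.db that points at those extents.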

Posted by b0b0rama on 31-Oct-2011 14:18

I continue to be amazed that anyone with the initials VP in their title actually has this much hands-on knowledge about the low-level underpinnings of their product. That's a rarity these days. Thank you again for your detailed explanation.

Posted by b0b0rama on 02-Nov-2011 17:42

First of all... I'd like to say that I just installed CentOS on my MacBook Pro, and I will continue to investigate getting Progress up and running.  I hope I will find some reasonable SQL-administrator-type client so I can browse tables, etc.   Meanwhile... "extracting data the hard way" is actually going fairly well, taking into account your tips on some of the underlying architecture. Several of my "guesses" have turned out correct, and some of the organization I discovered through pure observation has been confirmed with your help. Thank you.

All of the "smaller"  tables seem to export just fine (tables that have records with less than 16 fields).

I am still having some difficulty getting some records grouped into the correct table number. The problem centers around fragmentation and the skip table.

Take the following example record header (which the record offset directory indicates is a first fragment):

00 01 e9 62 00 d8 e7 00 0c 00 66 00 7c 00 8d 00 9d 00 da 00 ef fa 00 0c

00 01 e9 62  --> The first four octets contain the rowid of the fragment (I understand rowids and how to locate them).

00 d8  --> The length of this record fragment.

e7  --> This record has a skip table.

00 0c  --> The length of the skip table.

00 66 00 7c 00 8d 00 9d 00 da 00 ef  --> The skip table (in groups of two octets).

fa 00  --> This should be the table number, right?

0c  --> The length of the first data field, followed by additional fields of varying sizes and encodings.
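In code, my reading of that header is as follows (a sketch of my own interpretation above, not an official spec; the rowid and flag-byte meanings are observations):

    import struct

    def parse_fragment_header(buf):
        # Observed (not documented) layout:
        #   4 octets : rowid of the fragment
        #   2 octets : length of this record fragment
        #   1 octet  : flag byte (0xe7 observed when a skip table follows)
        #   2 octets : skip-table length in octets, if flagged
        #   N octets : skip table, big-endian two-octet entries
        rowid = struct.unpack_from(">I", buf, 0)[0]
        frag_len = struct.unpack_from(">H", buf, 4)[0]
        flag = buf[6]
        pos = 7
        skip_table = []
        if flag == 0xE7:
            n = struct.unpack_from(">H", buf, pos)[0]
            pos += 2
            skip_table = list(struct.unpack_from(">%dH" % (n // 2), buf, pos))
            pos += n
        return {"rowid": rowid, "length": frag_len,
                "skip_table": skip_table, "data_offset": pos}

    hdr = bytes.fromhex("0001e96200d8e7000c0066007c008d009d00da00effa000c")
    print(parse_fragment_header(hdr))
    # -> {'rowid': 125282, 'length': 216,
    #     'skip_table': [102, 124, 141, 157, 218, 239], 'data_offset': 21}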

Am I interpreting the table number correctly? Should the next fragment also contain this table number, or not? (Guessing no.) And when I move to the expected location of the next fragment (the final one, in this instance), can additional fragments have a skip table?

I apologize for giving you such a technical workout. Again, this truly is appreciated.

Robert Bednar

Posted by Thomas Mercer-Hursh on 03-Nov-2011 01:29

Have you tried the DB utilities from this CentOS partition (assuming there is a Linux version of the eval)?  You are getting extraordinary support from Gus here... far beyond what I would expect anyone at PSC to provide for free, especially someone at his level.

Posted by b0b0rama on 03-Nov-2011 09:07

I just got CentOS running under VMware, and I'm installing/setting up all the helpful Unix/Linux services that are turned off by default. I agree that the level of technical knowledge provided by Gus has been extraordinary. That said, it's the same kind of technical knowledge I provide others about underlying database structures that I am intimately familiar with.

Sadly, I have little to offer the Progress community. However, if Gus would like a copy of the completed program (the source code) that his knowledge helped me to create, I am happy to send him one. Because active Progress installations have superior tools, I don't know how much use he would have for it. It's also no doubt somewhat "version dependent".  It's a small, single-purpose program.  What it will do is take a single Progress .db or .dx file and "sort the records" into groups called file00001, file00002, file00003, ... file###### etc., where all the records in each file belong to the same table. With some effort, the program could replicate the "binary dump" process.  Gus could extend the program to increase its usefulness however he sees fit.  This program runs in a complete absence of the Progress runtime or any other Progress libraries or tools. I have no doubt that Gus could write it himself if he needed such a program.

The additional knowledge he has provided on this forum has far exceeded any expectations. The level of "support" I've received from him personally speaks very highly of what's available to the Progress community. Rest assured that I will continue to speak highly of both the OpenEdge product and the support available.
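The heart of the program is nothing fancy. Stripped to a sketch, the grouping step is just this (record extraction itself is the part built on the header parsing discussed above):

    def sort_records_by_table(records, prefix="file"):
        # records: iterable of (table_number, record_bytes) pairs produced by
        # walking the record blocks; writes one output file per table number
        handles = {}
        for tableno, rec in records:
            if tableno not in handles:
                handles[tableno] = open("%s%05d" % (prefix, tableno), "ab")
            handles[tableno].write(rec)
        for f in handles.values():
            f.close()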

Posted by gus on 03-Nov-2011 09:10

You have part of it right.

But fa00 does not seem reasonable. The table number will be a positive value for user-defined tables and a negative value for the tables that contain metadata, such as the table and index definitions, but most of them are not values as large as fa00. I must have forgotten something.

The table number is the first data value and is "column 0". It will not appear in subsequent fragments.

Subsequent fragments will not have a skip table. The skip table is for the entire row when the fragments are assembled.

Posted by gus on 03-Nov-2011 13:06

As I mentioned, the table number is either a single integer data value, column 0, OR it is the first element in an array of values. The first form was used in older databases and the latter in newer ones.

0xfa is the prefix for an array. The next two octets are the array length. The array ends with 0xff.

The array should have 8 data values, all integers. The first element is the table number.
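In code terms, bracketing that array looks roughly like this; how the eight integer values inside it are themselves encoded is a separate matter not covered here:

    import struct

    def bracket_column0_array(buf, pos):
        # 0xfa prefix marks the array form of column 0
        if buf[pos] != 0xFA:
            raise ValueError("not the array form of column 0")
        # the two-octet length after the prefix should agree with the
        # position of the 0xff terminator; keep both as a consistency check
        length = struct.unpack_from(">H", buf, pos + 1)[0]
        end = buf.index(0xFF, pos + 3)
        payload = buf[pos + 3:end]   # the 8 integer values, first = table number
        return {"length": length, "payload": payload, "next": end + 1}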

Posted by Thomas Mercer-Hursh on 03-Nov-2011 16:37

Gus, you must be snowed in to be giving this kind of detailed help to someone!

Posted by b0b0rama on 03-Nov-2011 17:15

Maybe Gus just recognizes there aren't many people willing to learn the nuts and bolts behind a process, and appreciates somebody... crazy enough to tackle a problem like this the old-fashioned way: by actually writing code.

...or maybe Gus is just a really nice guy and feels that "what goes around comes around"...

...or maybe Gus figured out I'm in Cincinnati, Ohio, has a craving for Skyline Chili or Graeter's Ice Cream, and knows I can find his office address.
