There is a strong push to accelerate the testing and deployment of databases to keep up with the agility of source code. However, it is not simple, because of the stateful nature of databases. For example, bugs in previous versions may have allowed bad data into the db that must be corrected; data may have to be migrated to fit new models; and care must be taken to ensure data is not lost when promoting a new schema (particularly with renames). Not to mention updating all code that accesses the db, including triggers.
It goes without saying that the database should be under the same rigorous version control as the source code. Since the incremental df files needed to migrate the production database to the new schema are more a deployment detail than anything, I initially assumed the state approach, i.e. dumping the full schema and version controlling that, would make the most sense. However, this runs into problems at deploy time because the context for the change has been lost: for example, a state-based diff cannot tell a column rename apart from dropping one column and adding another, which would silently lose the data.
On the other hand, version controlling an initial schema and an ever-increasing number of migration scripts (incremental df's) would (I think) cause performance problems in CI, since test db's must be regularly built and torn down and every build would have to replay the whole migration history. Also, we would still need the full schema file in order to have the pipeline analyze it for problems.
How do you handle this problem? Keep both the full schema and migration scripts (both for schema & data) in source control? Some kind of hybrid approach with full schema and rename files?
As Peter said, incremental DFs are more than just an implementation detail. They have to be applied in a specific order, and additional steps may have to be executed in between. All those steps have to be stored in the code repository, so that they can be applied again and again in any environment.
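To make that concrete, here is a minimal ABL sketch of applying one migration bundle in order (schema delta first, then the data-fix step). It assumes the target database is already connected and is the DICTDB alias, and it uses the standard dictionary utility prodict/load_df.p; the folder name delta002 and the program name fix-data.p are just hypothetical examples.

/* apply-delta.p - sketch only: apply one migration bundle in order.     */
/* Assumes the target database is connected (DICTDB alias); the bundle   */
/* folder and the fix-data.p name are hypothetical.                      */

DEFINE VARIABLE cBundle AS CHARACTER NO-UNDO INITIAL "db1/migrations/delta002".

/* Step 1: load the incremental schema definitions (.df) for this delta. */
RUN prodict/load_df.p (INPUT cBundle + "/delta002.df").

/* Step 2: run the data-migration program belonging to this delta, if any. */
IF SEARCH(cBundle + "/fix-data.p") <> ? THEN
    RUN VALUE(cBundle + "/fix-data.p").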
A few years ago, I started working on a DB update mechanism in Groovy. I gave a presentation on Groovy at the EU PUG Challenge ( www.pugchallenge.eu/.../groove-is-in-the-ant---querret.pdf ), arguing that this language is really well suited to this kind of task. The next step is to do a presentation on the full DB update mechanism!
Would you version control only the migration files (incremental df's, data migration procedures), to the exclusion of full df's, or would it be a good idea to include everything in the repository (even though that may be redundant)?
I see - so even though you could produce the current full df by applying all the migration bundles in order from start to finish, it is still reasonable to actually store the full df in parallel in that same repo for its own purposes such as spinning up new test environments and running analytics?
Repository hierarchy (could also be further organized by database name at the root level?):
/schema - contains full df's
/migrations - contains a folder for each delta, which in turn holds all the df & p files needed for that delta
/data - contains d files with sample data for the current iteration
For example:
/db1/schema/db1.df
/db1/migrations/delta001/<delta001 artifacts>
/db1/migrations/delta002/<delta002 artifacts>
/db1/data/<d file for each table in db1>
/db2/schema/db2.df
/db2/migrations/delta001/<delta001 artifacts>
/db2/data/<d file for each table in db2>
etc.
Or should all the d files be inside their corresponding deltas, so that all versions of the data are available concurrently without having to check out a previous version of the repository?
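For what it's worth, here is a rough sketch (not a definitive implementation) of how a fresh test database could be populated from that layout. It assumes the empty database has already been created and connected as DICTDB, uses the standard dictionary utilities prodict/load_df.p and prodict/load_d.p (check the exact parameter lists for your OpenEdge version), and assumes the .d files are named after their tables - all the paths are hypothetical.

/* build-test-db.p - sketch only: load the full schema, then sample data. */

DEFINE VARIABLE cTable AS CHARACTER NO-UNDO.

/* Step 1: load the complete schema from /schema. */
RUN prodict/load_df.p (INPUT "db1/schema/db1.df").

/* Step 2: load the sample .d file for each application table from /data.  */
/* (Your convention may name the files after _file._dump-name instead.)    */
FOR EACH DICTDB._file NO-LOCK WHERE NOT DICTDB._file._file-name BEGINS "_":
    cTable = DICTDB._file._file-name.
    IF SEARCH("db1/data/" + cTable + ".d") <> ? THEN
        RUN prodict/load_d.p (INPUT cTable, INPUT "db1/data/" + cTable + ".d").
END.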
If you want to set up an old version of your database, then you'd just check out the right version in the history of the repository. So to my mind, you just store the data for the latest version of your database in the repository, and keep history for that.
However, some cases will have different requirements!
Dynamic buffers and queries can be used to simplify the build process, removing the need for r-code specific to a database version.
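As a small illustration of that point (a sketch only - the "Customer" table and "CustName" field are hypothetical), a dynamic buffer and query resolve the schema at run time, so the same r-code can run against any database version that still has the table:

/* read-any.p - sketch only: version-independent data access.            */
/* No static schema reference is compiled in; names are resolved at run  */
/* time, so this r-code does not depend on a particular schema version.  */

DEFINE VARIABLE hBuffer AS HANDLE NO-UNDO.
DEFINE VARIABLE hQuery  AS HANDLE NO-UNDO.

CREATE BUFFER hBuffer FOR TABLE "Customer".
CREATE QUERY hQuery.
hQuery:SET-BUFFERS(hBuffer).
hQuery:QUERY-PREPARE("FOR EACH Customer NO-LOCK").
hQuery:QUERY-OPEN().

REPEAT:
    hQuery:GET-NEXT().
    IF hQuery:QUERY-OFF-END THEN LEAVE.
    /* Fields are looked up by name at run time, not at compile time. */
    MESSAGE hBuffer:BUFFER-FIELD("CustName"):BUFFER-VALUE VIEW-AS ALERT-BOX.
END.

hQuery:QUERY-CLOSE().
DELETE OBJECT hQuery.
DELETE OBJECT hBuffer.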
Two separate problems:

1) Update an existing database with the new schema changes.
For this you need one or more delta .df's.

2) Create the current version of the database from scratch.
For this, you need complete .df's, .st's, and a text change history that explains what schema changes were made and why. You should be able to create a complete database with a small set of artifacts; applying 50 delta .df's is not fun.

Depending on the application, you may also need to version some data files - for example if you have (mostly) static lookup tables or application configuration tables - and/or the .p's needed to populate them (a sketch follows below).

Don't forget the documentation needed to create or update the database(s).
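On the .p's for dumping and reloading that static data, the sketch below shows one possible shape. It uses the standard dictionary helpers prodict/dump_d.p and prodict/load_d.p; the "StatusCode" table, the paths, and the exact parameter lists are assumptions to verify against your OpenEdge release.

/* lookup-data.p - sketch only: dump and reload a hypothetical static    */
/* lookup table so its contents can be versioned alongside the schema.   */

/* Dump StatusCode as a .d file into /data (the last parameter is the    */
/* code page; blank means use the session default).                      */
RUN prodict/dump_d.p (INPUT "StatusCode", INPUT "db1/data/", INPUT "").

/* Reload it into a freshly built database. */
RUN prodict/load_d.p (INPUT "StatusCode", INPUT "db1/data/statuscode.d").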
Thanks for all the input - I think this gives us plenty of ideas to work with.