Pro2: Data in MS SQL takes up more space than in Progress

Posted by Valeriy Bashkatov on 09-Mar-2016 03:42

Hello,

When I loaded data from OpenEdge to MS SQL using the Pro2SQL tools, I found that the data in MS SQL takes up more space (approximately 3-4 times more) than in Progress OpenEdge.

Simple example:

OpenEdge table:

ADD TABLE "test1"
  AREA "Account Table"
  DUMP-NAME "test1"
 
ADD FIELD "int1" OF "test1" AS integer 
  FORMAT "->,>>>,>>9"
  INITIAL "0"
  POSITION 2
  MAX-WIDTH 4
  ORDER 10
 
ADD FIELD "dec1" OF "test1" AS decimal 
  FORMAT "->>,>>9.99"
  INITIAL "0"
  POSITION 3
  MAX-WIDTH 25
  DECIMALS 10
  ORDER 20
 
ADD FIELD "char1" OF "test1" AS character 
  FORMAT "x(8)"
  INITIAL ""
  POSITION 4
  MAX-WIDTH 150
  ORDER 30


Create four records with different sizes (lengths) for the char1 field:

1 22222 "12345678"
2 33333 "1234567891234567"
3 44444 "12345678901234567890123456789012345678901234567890123456789012345678901234567890"
5 21324 "123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"

MS SQL Table:

CREATE TABLE test1 (
  int1 integer null,
  dec1 decimal(25,10) null,
  char1 char(150) null
)
GO

Create four similar records:

INSERT INTO [dbo].[test1]
           ([int1]
           ,[dec1]
           ,[char1])
     VALUES
           (1
           ,22222
           ,'12345678')
GO
 
INSERT INTO [dbo].[test1]
           ([int1]
           ,[dec1]
           ,[char1])
     VALUES
           (2
           ,33333
           ,'1234567891234567')
GO

INSERT INTO [dbo].[test1]
           ([int1]
           ,[dec1]
           ,[char1])
     VALUES
           (3
           ,44444
           ,'12345678901234567890123456789012345678901234567890123456789012345678901234567890')
GO
 
INSERT INTO [dbo].[test1]
           ([int1]
           ,[dec1]
           ,[char1])
     VALUES
           (4
           ,21324
           ,'123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890')
GO

In Progress OpenEdge, the total size of these four records is 324.0B (dbanalys output):

Table                                    Records    Size   Min   Max  Mean                Count Factor  Factor
PUB.test1                                      4  324.0B    32   144    81                    4    1.0     1.0

but in MS SQL it is 8 KB:

EXEC sp_spaceused N'test1'
GO
 
name  rows  reserved    data  index_size  unused
test1 4     16 KB       8 KB  8 KB        0 KB
 

How can this be explained?

Regards,
Valeriy

All Replies

Posted by Keith Sudbury on 09-Mar-2016 09:09

Well, you have two issues... one is that you are using the CHAR data type (fixed length) instead of VARCHAR (variable length).

The other is that I am pretty sure that sp_spaceused is reporting the block usage and not the contents of the rows.

Having loaded terabytes of data into both SQL Server and Oracle, I have never seen this supposed extra space usage. That is, of course, using the proper data types and without any of the shadow columns that the dataserver requires.
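
For illustration, a minimal sketch of both points (test1_var is a hypothetical variable-length copy of the test1 table from the question):

-- Hypothetical variable-length copy of test1.
CREATE TABLE test1_var (
  int1 integer null,
  dec1 decimal(25,10) null,
  char1 varchar(150) null
)
GO

-- RTRIM is needed: values read from a CHAR column keep their trailing padding.
INSERT INTO test1_var (int1, dec1, char1)
SELECT int1, dec1, RTRIM(char1) FROM dbo.test1
GO

-- The rows themselves are now much smaller, but sp_spaceused still reports
-- 8 KB of data for these four rows: it counts whole pages, and one page is
-- the minimum allocation.
EXEC sp_spaceused N'test1_var'
GO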

Posted by Valeriy Bashkatov on 09-Mar-2016 09:32

[quote user="Keith Sudbury"]

Well you have two issues... one is that you are using the CHAR data type (fixed length) instead of VARCHAR(variable length).

The other is that I am pretty sure that sp_spaceused is reporting the block usage and not the contents of the rows.

[/quote]

I found another script to determine the size of the row:

declare @table nvarchar(128)
declare @idcol nvarchar(128)
declare @sql nvarchar(max)

-- initialize these two values: the table to measure and the column(s) to display
set @table = 'test4'
set @idcol = '*'

-- build a query that sums DATALENGTH() over every column of the table;
-- NULL values are counted as 1 byte via ISNULL
set @sql = 'select ' + @idcol + ' , (0'

select @sql = @sql + ' + isnull(datalength(' + name + '), 1)'
        from sys.columns where object_id = object_id(@table)
set @sql = @sql + ') as rowsize from ' + @table + ' order by rowsize desc'

PRINT @sql

exec (@sql)

and its result is closer to the truth.

[quote user="Keith Sudbury"]

Loading up terabytes of data into both SQL Server and Oracle I have never seen the supposed extra space usage required. This is of course using the proper data types and without any of the shadow columns that dataserver requires.

[/quote]

Have you worked with Pro2?

I use Pro2SQL, and its documentation says:

Please note, if this box is to contain the SQL database, the size of
the SQL database will be approximately two times the size of the
Progress database.

So I'm trying to understand why the size may be twice as large, and also why my SQL database is even bigger than that )

Posted by Keith Sudbury on 09-Mar-2016 09:53

That script is just adding up the lengths of the returned data and not what is actually stored in the DB. So still not exactly comparing apples to apples yet.

The first step is to get the data types set up correctly... that means VARCHAR or NVARCHAR depending on your codepage needs. Using CHAR is just asking for space to be wasted. There are some rare edge cases on very large tables where CHAR can make sense... but unless you have a table with a few hundred million records it isn't worth talking about.
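
For the conversion itself, a sketch assuming the test1 table from the question (note that values already stored as CHAR keep their trailing padding after the type change):

-- Switch the fixed-length column to variable length.
ALTER TABLE dbo.test1 ALTER COLUMN char1 varchar(150) NULL
GO

-- Trim the padding left over from the CHAR storage so the space is actually reclaimed.
UPDATE dbo.test1 SET char1 = RTRIM(char1)
GO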

Why the Pro2 team made this statement I have no idea. My utilities existed before Pro2 was commercially available but they are basically the same concept... use replication triggers to track changes and the dataserver to insert/update/delete the target database as needed.

Posted by gus on 09-Mar-2016 10:22

> On Mar 9, 2016, at 10:54 AM, Keith Sudbury wrote:

>

> Using CHAR is just asking for space to be wasted.

Not a requirement.

It is possible for CHAR to be stored variable-length in a database and then fixed up when the value is handed to the application. There are databases that compress character data too.
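
SQL Server can do this too, via row compression, which stores fixed-length character data in variable-length form. A minimal sketch, assuming SQL Server 2008 or later and the test1 table from the question:

-- Row compression physically trims the CHAR padding; the application still sees char(150).
ALTER TABLE dbo.test1 REBUILD WITH (DATA_COMPRESSION = ROW)
GO

EXEC sp_spaceused N'test1'
GO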

Posted by Keith Sudbury on 09-Mar-2016 10:31

As far as I know, in SQL Server CHAR is still the old-style fixed-length storage... CHAR(150) always uses 150 bytes of storage.

VARCHAR(150) uses the actual data length + 2 bytes.

NVARCHAR(150) uses twice the actual length + 2 bytes.
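
A quick way to see the per-value difference is DATALENGTH (a sketch; the 2-byte length entry stored with each variable-length value is not included in what DATALENGTH reports):

DECLARE @c char(150)     = '12345678'
DECLARE @v varchar(150)  = '12345678'
DECLARE @n nvarchar(150) = '12345678'

SELECT DATALENGTH(@c) AS char_bytes,     -- 150: padded to the declared length
       DATALENGTH(@v) AS varchar_bytes,  -- 8: actual data only
       DATALENGTH(@n) AS nvarchar_bytes  -- 16: 2 bytes per character
GO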

Posted by Valeriy Bashkatov on 10-Mar-2016 02:22

In the first example I used the CHAR type to match Progress - now I know that I was wrong. Thank you for the clarification, Keith!

But the Pro2SQL tool actually generates fields as varchar. An example from a real database:

CREATE TABLE [DBO].[qc_mstr] (
  [prrowid] varchar(36) not null PRIMARY KEY NONCLUSTERED,
  [qc_cmtindx] integer null,
  [qc_due_date] datetime null,
  [qc_insp_loc] varchar(80) null,
  [qc_inv_stat] varchar(80) null,
  [qc_lead_time] decimal(28,10) null,
  [qc_loc] varchar(80) null,
  [qc_lot] varchar(80) null,
  [qc_nbr] varchar(30) null,
  [qc_ord_date] datetime null,
  [qc_part] varchar(30) null,
  [qc_project] varchar(80) null,
  [qc_qty_comp] decimal(28,10) null,
  [qc_qty_ord] decimal(28,10) null,
  [qc_qty_rjct] decimal(28,10) null,
  [qc_ref] varchar(80) null,
  [qc_rel_date] datetime null,
  [qc_rmks] varchar(80) null,
  [qc_serial] varchar(50) null,
  [qc_site] varchar(80) null,
  [qc_status] varchar(30) null,
  [qc_teststep] varchar(30) null,
  [qc_type] varchar(30) null,
  [qc_user1] varchar(80) null,
  [qc_user2] varchar(80) null,
  [qc__chr01] varchar(80) null,
  [qc__chr02] varchar(80) null,
  [qc__chr03] varchar(80) null,
  [qc__chr04] varchar(80) null,
  [qc__chr05] varchar(80) null,
  [qc__dec01] decimal(28,10) null,
  [qc__dec02] decimal(28,10) null,
  [qc__dte01] datetime null,
  [qc__dte02] datetime null,
  [qc__log01] bit null,
  [qc_autoissue] bit null,
  [qc_job] varchar(80) null,
  [qc_domain] varchar(8) null,
  [oid_qc_mstr] decimal(28,10) null,
  [Pro2SrcPDB] varchar(12) null,
  [pro2created] datetime null,
  [pro2modified] datetime null,
)
GO

CREATE INDEX [qc_mstr##qc_due_part] ON [DBO].[qc_mstr] (
  [qc_domain] ASC,
  [qc_due_date] ASC,
  [qc_part] ASC,
  [qc_site] ASC
)
GO
CREATE INDEX [qc_mstr##qc_lot] ON [DBO].[qc_mstr] (
  [qc_domain] ASC,
  [qc_lot] ASC
)
GO
CREATE INDEX [qc_mstr##qc_nbr] ON [DBO].[qc_mstr] (
  [qc_domain] ASC,
  [qc_nbr] ASC,
  [qc_lot] ASC
)
GO
CREATE INDEX [qc_mstr##qc_part] ON [DBO].[qc_mstr] (
  [qc_domain] ASC,
  [qc_part] ASC,
  [qc_site] ASC,
  [qc_due_date] ASC
)
GO
CREATE INDEX [qc_mstr##qc_type_nbr] ON [DBO].[qc_mstr] (
  [qc_domain] ASC,
  [qc_type] ASC,
  [qc_nbr] ASC,
  [qc_lot] ASC
)
GO

In the Progress dbanalys output, the size of this table is:

Table                                    Records    Size   Min   Max  Mean                Count Factor  Factor
PUB.qc_mstr                               408863   57.6M    98   195   147               408863    1.0     1.0

I tried to calculate the size of the table in SQL based on this article: https://technet.microsoft.com/en-us/library/aa933068(v=sql.80).aspx

Using the formula provided in the article, I get about the same size: ~57.4 M.

As far as I understand now, the scripts I used earlier showed me either the real physical space or the wrong size (row length).

Physically (per sp_spaceused), this table takes 130568 KB in MS SQL, which is 130568 KB / 8 KB (page size) = 16321 pages. But according to the calculations it should take only 7301 pages. In other words, it physically uses twice as much space. And that confuses me even more [:S]
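
For reference, a sketch of that page estimate using the formula from the linked article (Rows_Per_Page = 8096 / (Row_Size + 2)); the average row size below is an assumption, to be replaced with the value computed from the real column definitions:

DECLARE @rows     bigint = 408863
DECLARE @row_size int    = 142   -- assumption: average full row size in bytes

-- 56 rows per page and ~7300 pages with these numbers.
SELECT FLOOR(8096.0 / (@row_size + 2))                  AS rows_per_page,
       CEILING(@rows / FLOOR(8096.0 / (@row_size + 2))) AS estimated_pages
GO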

Keith, what method do you use to calculate the size of the data in SQL?

Regards,
Valeriy Bashkatov

P.S.

I just need to formulate a comprehensive explanation for the customer of why the physical size of the SQL database (created by Pro2) is bigger than that of the Progress database.

Posted by Keith Sudbury on 10-Mar-2016 08:39

Keep in mind that the 57.6M reported by proutil is the total size of the rows and not how many blocks are required to store the data. The odds are pretty decent (99.9%) that more than 57.6M of space is being used to store that table in OpenEdge.

sp_spaceused will show the entire footprint for a table... including reserved pages for future use (think Type II Area cluster size).

If you posted the entire output from sp_spaceused I would have a better idea of where the space is being used.
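
One way to see where the pages go is to check how full they actually are, for the heap and each index (a sketch; requires SQL Server 2005 or later):

SELECT index_id,
       index_type_desc,
       page_count,
       avg_page_space_used_in_percent
FROM   sys.dm_db_index_physical_stats(
           DB_ID(), OBJECT_ID(N'dbo.qc_mstr'), NULL, NULL, 'DETAILED')
GO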

Posted by Valeriy Bashkatov on 10-Mar-2016 08:56

[quote user="Keith Sudbury"]

If you posted the entire output from sp_spaceused I would have a better idea of where the space is being used.

[/quote]
Thank you!
EXEC sp_spaceused N'qc_mstr' 
GO

name     rows      reserved     data        index_size   unused
qc_mstr  408863    144464 KB    130568 KB   13856 KB     40 KB

(Note that reserved = data + index_size + unused: 130568 + 13856 + 40 = 144464 KB.)
[quote user="Keith Sudbury"]

Keep in mind that the 57.6M reported by proutil is the total size of the rows and not how many blocks are required to store the data. The odds are pretty decent (99.9%) that more than 57.6M of space is being used to store that table in OpenEdge.

sp_spaceused will show the entire footprint for a table... including reserved pages for future use (think Type II Area cluster size).

[/quote]

I think you're right.

The physical size of the area (Type II) with a single table is almost the same:

ls -lah atm_28.d1
-rw-r--r-- 1 root root 112M Mar 10 17:33 atm_28.d1

Posted by Keith Sudbury on 10-Mar-2016 09:27

That all looks pretty reasonable to me... especially since you have 3 more columns in SQL than you do in OE.

If you wanted to try and reduce the physical storage you "might" be able to get some savings by setting up filegroups (storage areas) for the table and the indexes instead of just letting them exist in the same filegroup with other objects.

Unless this table is going to grow quite a bit or if you have much larger tables in mind it likely isn't worth it.
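
A sketch of that setup; the database name (Pro2DB), filegroup names, file names, and paths are all placeholders:

-- Dedicated filegroups for the table data and for its indexes.
ALTER DATABASE Pro2DB ADD FILEGROUP fg_qc_data
ALTER DATABASE Pro2DB ADD FILEGROUP fg_qc_index
ALTER DATABASE Pro2DB ADD FILE
      (NAME = qc_data1, FILENAME = N'D:\mssql\qc_data1.ndf', SIZE = 64MB)
      TO FILEGROUP fg_qc_data
ALTER DATABASE Pro2DB ADD FILE
      (NAME = qc_index1, FILENAME = N'D:\mssql\qc_index1.ndf', SIZE = 64MB)
      TO FILEGROUP fg_qc_index
GO

-- Moving one of the existing indexes onto the index filegroup:
CREATE INDEX [qc_mstr##qc_lot] ON [dbo].[qc_mstr] (
  [qc_domain] ASC,
  [qc_lot] ASC
) WITH (DROP_EXISTING = ON) ON fg_qc_index
GO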

Posted by Valeriy Bashkatov on 11-Mar-2016 02:31

Thank you, Keith!

I think I need to visit the bookstore, specifically the MS SQL tuning section [^o)]

This thread is closed