Reading in CSV without double quotes on every entry.

Posted by Joseph Kreifels on 30-Mar-2017 10:47

If I have a CSV file that doesn't use double quotation marks except for items with commas. 

Sample Data

name,age,description,state
Johnson,43,"He loves apple pie, but is allergic to apples",PA

I can't use the standard entry() method because it will split the description into 2 parts.

  1. "He loves apple pie"
  2. " but is allergic to apples"

 

I've noticed software like Excel knows that when a item is wrapped in double quotation marks, it should ignore delimiters.

 

Is there a way to parse a CSV file into a temp table following the CSV standards?

 

I am using Progress 10.2B on RHEL. If Progress can't parse the file, there may be a outside option through Bash, Java, or Python that can be invoked to make changes to the CSV during runtime

 

 

 

All Replies

Posted by Garry Hall on 30-Mar-2017 10:59

Use the DELIMITER option of IMPORT:

IMPORT DELIMITER "," c1 c2 c3 c4.

This assumes you have a consistent format within the file.

Posted by Brian K. Maher on 30-Mar-2017 11:00

What about checking num-entries() and if it returns the value indicating <normal  count> + 1 then split as normal for all fields but the two that should be combined then combine those two into a single value?

Posted by Joseph Kreifels on 30-Mar-2017 11:04

@

Doesn't that still cause the issue I stated above?

1. "He loves apple pie"

2. " but is allergic to apples"

 

See response below

Posted by James Palmer on 30-Mar-2017 11:11

How big is the CSV? If it's not too big you can use copy-lob to put the whole file in a LONGCHAR and then parse the contents using chr(10)/chr(13) as your line delimiter. It might be a bit clunky, but it's better than having to shell out to some other process IMO.

Posted by Joseph Kreifels on 30-Mar-2017 11:15

Garry,

I think that actually works. I tried your method against the sample data and it seems to have done the trick. I'll try to implement it using the actual CSV file and see how it goes.

Posted by Brian K. Maher on 30-Mar-2017 11:16

define variable vLine as character no-undo.

input from data.csv.

repeat:
    import unformatted vLine.
    if num-entries(vLine) = 4 then
        message entry(1,vLine) skip
                entry(2,vLine) skip
                replace(entry(3,vLine),'"','') skip
                entry(4,vLine) view-as alert-box.
    else
        if num-entries(vLine) = 5 then
            message entry(1,vLine) skip
                    entry(2,vLine) skip
                    replace(entry(3,vLine) + "," + entry(4,vLine),'"','') skip
                    entry(5,vLine) view-as alert-box.
        else
            message "Found line with wrong number of entries" view-as alert-box.            
end.

input close.
    

Posted by Garry Hall on 30-Mar-2017 11:17

I just test it in 11.7, and it worked for me.

DEFINE VARIABLE c1 AS CHARACTER   NO-UNDO.

DEFINE VARIABLE c2 AS CHARACTER   NO-UNDO.

DEFINE VARIABLE c3 AS CHARACTER   NO-UNDO.

DEFINE VARIABLE c4 AS CHARACTER   NO-UNDO.

INPUT FROM "csv.txt".

REPEAT:

IMPORT DELIMITER "," c1 c2 c3 c4.

DISPLAY c3 FORMAT "x(50)".

END.

INPUT CLOSE.

c3

--------------------------------------------------

description

He loves apple pie, but is allergic to apples

IMPORT is the statement used for dictionary dump and load. It has to handle delimiters in quoted character data, even if the delimiter is embedded in the quoted data. The default delimiter is " " (space), so IMPORT must be able to handle embedded spaces. As long as your quote starts after your delimiter, it should work.

Posted by Brian K. Maher on 30-Mar-2017 11:19

This is better:

define variable vLine as character no-undo extent 4.

input from data.csv.

repeat:
    import delimiter ',' vLine[1] vLine[2] vLine[3] vLine[4].
    message vLine[1] skip
            vLine[2] skip
            vLine[3] skip
            vLine[4] view-as alert-box.
end.

input close.
    

Posted by Joseph Kreifels on 30-Mar-2017 11:20

@James

The CSV has 23 columns and can have anywhere between 1 to a few hundred rows. There is at least 1 column in particular that allows commas in the data. Thankfully though this 1 column is wrapped in double quotes. I'm looking for a method that works with the CSV like it should without jury rigging a band aid method.

Posted by Ken Ward on 30-Mar-2017 11:26

The IMPORT statement is definitely the way to go, but I had to deal with a similar issue where the IMPORT statement wouldn't have been useful.

What I did was I checked the first character to see if it was a quote but the last character was NOT a quote. If so, then it went into a loop where it looked for a token that ended with a quote, appending commas and tokens along the way.

It might've been kludgy, but it got the job done.

Posted by scott_auge on 30-Mar-2017 14:12

This might help:

/* Given a string from import unformatted, return CSV entries in the */

/* 100 extent CSVEntries variable. */

/* TODO: Quotes within a quote is not done */

define input parameter CSVLine as character no-undo.

define output parameter CSVEntries as character extent 100 no-undo.

define variable CSVLength as integer no-undo.

define variable iterator as integer no-undo.

define variable extentiterator as integer no-undo.

define variable WholeValue as character no-undo.

define variable CharValue as character no-undo.

define variable LastCharValue as character no-undo.

define variable InQuotes as logical initial false no-undo.

assign CSVLength = length(CSVLine).

do iterator = 1 to CSVLength:

 assign

 LastCharValue = CharValue

 CharValue = substring(CSVLine, iterator, 1)

 .

 case CharValue:

   when "~"" then do:

     assign InQuotes = not InQuotes.  /* flip flop switch */

     next.

   end. /* when " */

   when "," then do:

     if not InQuotes then do:

       assign

       extentiterator = extentiterator + 1

       CSVEntries[extentiterator] = WholeValue

       WholeValue = ""

       .

     end. /* if not InQuotes */

     if InQuotes then assign WholeValue = WholeValue + CharValue.

   end. /* when , */

   otherwise assign WholeValue = WholeValue + CharValue.

 end. /* case */

end. /* do iterator */

assign CSVEntries[extentiterator + 1] = WholeValue.

/*************** Unit test code

define variable A as character no-undo.                            

define variable C as character extent 100 format "x(10)" no-undo.  

A = "345,the word,*,45,~"417,000~",3ed".                            

run /home/sauge/prg/CSVParser.p (input A, output C).                

/**/                                                                

display C.                                                          

/**/                                                                

/*                                                                  

display integer(C[5]).                                              

*/                                                                  

*********************************/

This thread is closed