Fast way to extract lines from a very large text file

Posted by tbergman on 25-Jun-2015 12:18

Windows, Progress 11.

We have a variety of files we send to other groups in the company. These are tab-delimited and some of them can be very large, containing 80 million lines or more.

Every so often, they'll report a problem with the data in, let's say, line number 20,000,000. These files are too large for any text editor I have, so I'm trying to write something that will extract a set of lines from the file. Using a .NET StreamReader, I can go through the file and write out only the lines of interest. This is not awful; it reads about 4 million lines per minute, but that's still a bit slow.

In C#, I can write a bit of code like

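        // Requires: using System.Collections.Generic; using System.IO; using System.Linq;
        public IEnumerable<string> ExtractLines(string f, int StartLine, int EndLine)
        {
            // File.ReadLines streams the file lazily; Skip/Take select
            // lines StartLine..EndLine (1-based, inclusive).
            return File.ReadLines(f).Skip(StartLine - 1).Take(EndLine - StartLine + 1);
        }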

This is extremely fast. I can't figure out how to translate this to Progress. The object returned by the File:ReadLines method in Progress does not have a Skip or Take method; the object in .NET does. It's the same type, System.Collections.Generic.IEnumerable<T>. A bit of research tells me this has something to do with Lync, but I've been unable to get this to work in Progress.

If anyone can help translate this, or if you have other suggestions for pulling a small set of lines from a very large file, it would be greatly appreciated.

Thanks,

Tom

All Replies

Posted by pedromarcerodriguez on 25-Jun-2015 12:34

Hi,

It doesn't look to me like a Progress task at all. Why not use sed instead?

For lines 100 to 200

 sed -n 100,200p filename > newfile

That will do it, and pretty quickly.
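
One refinement that may matter for a file of this size: as written, sed keeps reading to the end of the file after it has printed the range. Adding a q command at the last line of interest makes it stop there. A minimal sketch, where the helper name extract_lines and the file names are hypothetical:

```shell
# extract_lines FILE START END  (hypothetical helper name)
# Prints lines START..END (1-based, inclusive) and stops reading at END.
extract_lines() {
    # -n suppresses default output; "START,ENDp" prints the range;
    # "ENDq" quits as soon as line END has been processed.
    sed -n "${2},${3}p;${3}q" "$1"
}

# Example: lines 100 to 200 of filename, written to newfile
# extract_lines filename 100 200 > newfile
```

sed still has to scan from the top of the file to reach the first requested line, so the saving is only in the part of the file after the range.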

Regards,

Posted by tbergman on 25-Jun-2015 12:37

I meant Linq, not Lync

Posted by TheMadDBA on 25-Jun-2015 13:00

GNU sed for Windows will help, since sed isn't part of Windows by default.

gnuwin32.sourceforge.net/.../sed.htm

Or you could just make your C# program available to call from Progress. If your file had fixed-length records, you could use SEEK, but other than that... Progress isn't the place to be scanning through large files.
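
To illustrate the fixed-length idea: if every line is exactly the same number of bytes (including its line terminator), then line N starts at byte offset (N - 1) * record length, so a reader can jump to it instead of counting line breaks. A rough sketch using tail and head, assuming a head that supports -c (GNU and BSD versions do); the helper name extract_fixed and the record length are hypothetical:

```shell
# extract_fixed FILE RECLEN START END  (hypothetical helper name)
# Assumes every line is exactly RECLEN bytes, including its line terminator.
extract_fixed() {
    offset=$(( ($3 - 1) * $2 + 1 ))   # tail -c +K starts output at byte K (1-based)
    count=$(( ($4 - $3 + 1) * $2 ))   # total bytes in lines START..END
    tail -c "+${offset}" "$1" | head -c "${count}"
}

# Example: lines 100 to 200 of a file with 64-byte records
# extract_fixed filename 64 100 200 > newfile
```

This does not apply to the tab-delimited files described above unless their lines happen to be padded to a constant width.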

Maybe one of the class gurus can chime in on your original question though.

Posted by tbergman on 25-Jun-2015 13:07

Yes, a Windows version of sed is certainly one way to deal with this. But I would really like to understand why I can't use those methods in Progress.

Yes, the compiled C# program works from Progress but I'd rather avoid that if possible.

Posted by TheMadDBA on 25-Jun-2015 13:13

Understood completely... just offering some workarounds until the Windows OO experts show up :-)

Posted by Fernando Souza on 25-Jun-2015 14:45

Skip() is a generic method and we do not support calling generic methods in the ABL.
