
Re: OT: Help with a search and replace script in Perl for a big file with no line breaks




> On Jan 24, 2018, at 11:14 AM, rhkramer@gmail.com wrote:
> 
> This is OT, but I thought I'd start with this list as it is the list that I 
> deal with more than any other.  If no one here can help, suggestions for a 
> better list to try will be appreciated.
> 

I used to subscribe to Perl Beginners, but the administrator got draconian about discussing other languages, so I dropped the list, and now I appear to be banned:

    https://lists.perl.org/list/beginners.html


> I've never used Perl, but I'm hoping Perl can do the job for me.
> 

Modern Perl (version 5.10 and up) is a UTF-8-compliant, general-purpose language, but much of today's software is closed and baroque.  You want to use whatever language the database designers had in mind.
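
That said, if all you really need is the mechanical search-and-replace you describe below, the whole job fits in a few lines of Perl.  Here is a minimal, untested sketch -- the file name and the two substitutions are placeholders, not your real rules -- that slurps the dump into one string and runs ASCII-only replacements over it:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # "dump.txt" and the substitutions below are placeholders; put in
    # your real file name and cleanup rules.
    my $file = 'dump.txt';

    # Slurp the whole file as raw bytes.  The patterns are plain ASCII,
    # and ASCII bytes never occur inside a multi-byte UTF-8 sequence,
    # so any UTF-8 in the data passes through untouched.
    open my $in, '<:raw', $file or die "Cannot open $file: $!";
    my $text = do { local $/; <$in> };
    close $in;

    # Example cleanup passes (again, placeholders):
    $text =~ s/\|\|/\n/g;        # e.g. turn a "||" separator into a newline
    $text =~ s/[ \t]+\n/\n/g;    # strip whitespace left dangling before newlines

    open my $out, '>:raw', "$file.clean" or die "Cannot write $file.clean: $!";
    print {$out} $text;
    close $out;

A 5.4 MB string is nothing for Perl to hold in memory, and because the file is read and written through the :raw layer nothing re-encodes or mangles the bytes you don't touch.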



> What I need to do:
> 
> I have multiple large files (one example is 5.4 MB).  It is essentially a data 
> dump from a database--I have no control over the database or the format of the 
> dump.
> 
> The file is ugly, with lots of extraneous characters--I want to run a series of 
> regular expression search and replace commands over the file to clean it up.
> 
> Some of the things that may make it tough:
> 
>   * In essence, there are no line breaks (0Ah or 0Dh)--there is one long 
> 5.4 MB line (well, there are 4 line breaks for some short lines at the 
> beginning of the file, maybe somewhere between 32 and 80 characters on each 
> of those 4 lines).
> 
>   * The file can, and often will have UTF-8 characters in it (iiuc--the file 
> contains URLs, some of which, I'm sure, can include UTF-8 characters, or maybe 
> some other encoding??).  The search and replace doesn't particularly have to 
> handle the UTF-8 search terms (because the keywords and punctuation I will 
> search on will be plain ASCII), but any UTF-8 characters have to remain 
> "intact" after the search and replace.
> 
> I'm hoping that I can write a Perl script that may be something like this:
> 
> Code to open a file (which I will need to learn / find)
> 
> Multiple statements of the form "s/<search regular expression>/<replace 
> regular expression>/g"
> 
> (Aside: the replacement probably doesn't have to be a regular expression, but it 
> will need to include things like line break characters (\n).)
> 
> I did try to do this with one of the editors I use (I started with Kate), but 
> Kate breaks that 5.4 MB "line" into multiple lines of about 4096 bytes/characters 
> (at inconvenient places), and, although I got the job (almost) done, it required 
> a lot of manual intervention/correction, so I want to automate it with a tool 
> that can work on very long lines without inserting line breaks (other than those 
> I require).
> 
> If some simpler tool can do the job, I'll consider that as well (I have 
> occasionally used awk, and maybe sed, though I don't think sed ever proved 
> useful for me).
> 
> Any help appreciated.
> 

If you attack the files with raw Perl, you're going to be writing a lexer and parser to read the database dump into a data structure, and then doing your work against that (perhaps by dumping it to a common format and then writing tools against that).  If you don't have an EBNF grammar for the dump, you'll have to figure it out.  Getting the lexer/parser right, and verifying that you got it right, is going to be a *lot* of work.


Your best bet is to:

1.  Have the database administrator generate exports in a friendlier format, such as flat-file comma-separated values, tab-separated values, XML, etc. (a minimal CSV-reading sketch follows this list).

2.  Get a tool that understands the dump file (such as the original database engine), import the dumps, and then generate queries/reports/etc. as desired to meet your needs.
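
For option 1, once you have a flat CSV or TSV export, CPAN's Text::CSV (or the faster Text::CSV_XS) does the parsing for you.  A rough, untested sketch -- the file name "export.csv" and the column being printed are placeholders, not anything from your data:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV;    # CPAN module; Text::CSV_XS is the faster drop-in

    # Hypothetical example: read an export called "export.csv" and print
    # its first column, one value per line.
    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 })
        or die "Cannot construct Text::CSV: " . Text::CSV->error_diag;

    open my $fh, '<:encoding(UTF-8)', 'export.csv'
        or die "Cannot open export.csv: $!";

    while (my $row = $csv->getline($fh)) {
        print "$row->[0]\n";    # $row is an array ref, one element per field
    }
    close $fh;

From there it is ordinary row-by-row Perl, and the "no line breaks" problem goes away because the exporter, not you, defines the record boundaries.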


David

