Re: OT: Help with a search and replace script in Perl for a big file with no line breaks
> On Jan 24, 2018, at 11:14 AM, rhkramer@gmail.com wrote:
>
> This is OT, but I thought I'd start with this list as it is the list that I
> deal with more than any other. If no one here can help, suggestions for a
> better list to try will be appreciated.
>
The obvious list to try is Perl Beginners. I used to subscribe, but the administrator got draconian about discussing other languages, so I dropped off, and now I appear to be banned:
https://lists.perl.org/list/beginners.html
> I've never used Perl, but I'm hoping Perl can do the job for me.
>
Modern Perl (version 5.10 and up) is a UTF-8 compliant, general-purpose language, but much of today's software is closed and baroque. You want to use whatever language the database designers had in mind.
> What I need to do:
>
> I have multiple large files (one example is 5.4 MB). It is essentially a data
> dump from a database--I have no control over the database or the format of the
> dump.
>
> The file is ugly, with lots of extraneous characters--I want to run a series of
> regular expression search and replace commands over the file to clean it up.
>
> Some of the things that may make it tough:
>
> * In essence, there are no line breaks (0Ah) (or 0Dh)--in essence, there is
> one long 5.4 MB line (well, there are 4 line breaks for some short lines at
> the beginning of the file, maybe somewhere between 32 and 80 characters on each
> of those 4 lines).
>
> * The file can, and often will have UTF-8 characters in it (iiuc--the file
> contains URLs, some of which, I'm sure, can include UTF-8 characters, or maybe
> some other encoding??). The search and replace doesn't particularly have to
> handle the UTF-8 search terms (because the keywords and punctuation I will
> search on will be plain ASCII), but any UTF-8 characters have to remain
> "intact" after the search and replace.
>
> I'm hoping that I can write a Perl script that may be something like this:
>
> Code to open a file (which I will need to learn / find)
>
> Multiple statements of the form "s/<search regular expression>/<replace
> regular expression>/g"
>
> (Aside, the replace probably doesn't have to be a regular expression, it will
> need to include things like line break characters (\n).)
>
> I did try to do this with one of the editors I use (I started with Kate), but
> kate breaks that 5.4 MB "line" into multiple lines of about 4096 bytes /
> characters (at inconvenient places), and, although I got the job (almost)
> done, it required a lot of manual intervention / correction, so I want to
> automate it with a tool that can work on very long lines without inserting
> line breaks (other than those I require).
>
> If some simpler tool can do the job, I'll consider that as well (I have
> occasionally used awk, and maybe sed (I don't think sed ever proved useful for
> me)).
>
> Any help appreciated.
>
If you attack the files with raw Perl, you will end up writing a lexer and parser to read the database dump into a data structure, and then doing your work against that structure (perhaps by dumping it to a common format and writing tools against that format). If you don't have an EBNF grammar for the dump format, you'll have to work it out yourself. Getting the lexer/parser right, and verifying that you got it right, is going to be a *lot* of work.
Your best bet is to:
1. Have the database administrator generate exports in a friendlier format, such as flat-file comma-separated values, tab-separated values, XML, etc.
2. Get a tool that understands the dump file (such as the original database engine), import the dumps, and then generate queries, reports, etc. as desired to meet your needs.
David