OT: Help with a search and replace script in Perl for a big file with no line breaks
This is OT, but I thought I'd start with this list as it is the list that I
deal with more than any other. If no one here can help, suggestions for a
better list to try will be appreciated.
I've never used Perl, but I'm hoping Perl can do the job for me.
What I need to do:
I have multiple large files (one example is 5.4 MB). Each is essentially a data
dump from a database--I have no control over the database or the format of the
dump.
The file is ugly, with lots of extraneous characters--I want to run a series of
regular expression search and replace commands over the file to clean it up.
Some of the things that may make it tough:
* There are essentially no line breaks (0Ah or 0Dh): the file is one long
5.4 MB line. (Well, there are 4 line breaks for some short lines at the
beginning of the file, maybe somewhere between 32 and 80 characters on each
of those 4 lines.)
* The file can, and often will have UTF-8 characters in it (iiuc--the file
contains URLs, some of which, I'm sure, can include UTF-8 characters, or maybe
some other encoding??). The search and replace doesn't particularly have to
handle the UTF-8 search terms (because the keywords and punctuation I will
search on will be plain ASCII), but any UTF-8 characters have to remain
"intact" after the search and replace.
I'm hoping that I can write a Perl script that may be something like this:
Code to open a file (which I will need to learn / find)
Multiple statements of the form "s/<search regular expression>/<replace
regular expression>/g"
(Aside: the replace probably doesn't have to be a regular expression, but it
will need to include things like line break characters (\n).)
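For concreteness, a minimal sketch of that shape might look like the
following. It is only an illustration under stated assumptions, not a tested
solution for the real data: the helper names clean_text and clean_file and all
three patterns are placeholders invented here, the whole file is assumed to fit
in memory (reasonable at 5.4 MB), and the file is read and written as raw bytes
so that any UTF-8 sequences pass through unchanged while the ASCII patterns
match around them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Apply a series of ASCII-only substitutions to one string and return it.
# All three patterns are placeholders, not the real clean-up rules.
sub clean_text {
    my ($text) = @_;
    $text =~ s/\x00+//g;        # drop runs of NUL bytes
    $text =~ s/\|\|/\n/g;       # turn a hypothetical "||" separator into a line break
    $text =~ s/[ \t]+\n/\n/g;   # trim spaces and tabs before a line break
    return $text;
}

# Read the whole input file at once, clean it, and write the result.
# The ':raw' layer means bytes are neither decoded nor translated.
sub clean_file {
    my ($in, $out) = @_;
    open(my $ifh, '<:raw', $in) or die "Cannot open $in: $!";
    local $/;                   # no record separator: read everything in one go
    my $text = <$ifh>;
    close($ifh);

    open(my $ofh, '>:raw', $out) or die "Cannot open $out: $!";
    print {$ofh} clean_text($text);
    close($ofh) or die "Cannot close $out: $!";
}

# Only run when given exactly two arguments: input file, output file.
clean_file(@ARGV) if @ARGV == 2;
```

Saved under a name such as cleanup.pl (hypothetical), this would be run as
"perl cleanup.pl dump.txt dump-clean.txt". Because the input is read in one
piece rather than line by line, the length of the single long line should not
matter, and the only line breaks in the output are the ones the substitutions
insert.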
I did try to do this with one of the editors I use (I started with Kate), but
Kate breaks that 5.4 MB "line" into multiple lines of about 4096 bytes /
characters (at inconvenient places), and, although I got the job (almost)
done, it required a lot of manual intervention / correction, so I want to
automate it with a tool that can work on very long lines without inserting
line breaks (other than those I require).
If some simpler tool can do the job, I'll consider that as well. (I have
occasionally used awk, and maybe sed, though I don't think sed ever proved
useful for me.)
Any help appreciated.