Re: OT: Help with a search and replace script in Perl for a big file with no line breaks
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Wed, Jan 24, 2018 at 02:14:42PM -0500, rhkramer@gmail.com wrote:
> This is OT, but I thought I'd start with this list as it is the list that I
> deal with more than any other. If no one here can help, suggestions for a
> better list to try will be appreciated.
>
> I've never used Perl, but I'm hoping Perl can do the job for me.
>
> What I need to do:
>
> I have multiple large files (one example is 5.4 MB). It is essentially a data
> dump from a database--I have no control over the database or the format of
> the dump.
Perl won't have a problem with a 5.4MB long line.
[...]
> * The file can, and often will have UTF-8 characters in it (iiuc--the file
> contains URLs, some of which, I'm sure, can include UTF-8 characters, or maybe
> some other encoding??). The search and replace doesn't particularly have to
> handle the UTF-8 search terms (because the keywords and punctuation I will
> search on will be plain ASCII), but any UTF-8 characters have to remain
> "intact" after the search and replace.
Now that's one thing: does the file just contain some UTF-8 characters,
or is it valid UTF-8? This is important to know, because then you can
decide whether to treat it as UTF-8 (then regexps will be OK) or as a
byte stream (then you'll "see" each multi-byte UTF-8 character as several
separate bytes: there be dragons).
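You can check that with
iconv -f UTF-8 -t UTF-8 < your_file > /dev/null
or something similar (the explicit -t keeps iconv from converting to your
locale's charset and tripping over characters that charset lacks)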
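If you'd rather do that check from Perl itself, here is a minimal sketch using the core Encode module. The helper name is_valid_utf8 and the two sample strings are made up for illustration; for a real file you would read it in raw mode and pass the bytes in.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns 1 if the byte string is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops decode() from modifying $bytes.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

my $good = "caf\xC3\xA9";   # "cafe" with e-acute, encoded as UTF-8 bytes
my $bad  = "caf\xE9";       # the same word in Latin-1: not valid UTF-8

print is_valid_utf8($good) ? "valid\n" : "invalid\n";   # prints "valid"
print is_valid_utf8($bad)  ? "valid\n" : "invalid\n";   # prints "invalid"
```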
> I'm hoping that I can write a Perl script that may be something like this:
>
> Code to open a file (which I will need to learn / find)
open(my $fh, "<:encoding(UTF-8)", "your_file") or die "Cannot open your_file: $!";
(the whole kaboodle in "perldoc -f open").
> Multiple statements of the form "s/<search regular expression>/<replace
> regular expression>/g
If you set $/ (the input record separator) to undef, you can slurp the
whole file into one variable, like so:
$/ = undef;
my $data = <$fh>;
(that narrative is in 'man perlvar', for the special variables).
> (Aside, the replace probably doesn't have to be a regular expression, it will
> need to include things like line break characters (\n).)
The replace string isn't a regexp anyway (doesn't make sense :) -- it's just
a normal string, possibly with placeholders for parenthesized submatches from
the regexp (if that's mumbo jumbo for you, just ask). "\n" is just a normal
character, as is "\t", etc.
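To make the placeholder business concrete, a small sketch: the parentheses in the regexp capture a submatch, and $1 in the replacement puts it back. The sample data ("id=42;...") is invented; your dump will look different.

```perl
use strict;
use warnings;

# Invented sample: records separated by ";" with no line breaks.
my $data = 'id=42;id=7;';

# (\d+) captures the digits; $1 in the replacement refers to that capture.
# "\n" in the replacement is just an ordinary newline character.
$data =~ s/id=(\d+);/record $1\n/g;

print $data;
```

With several pairs of parentheses you get $1, $2, and so on, numbered by the position of the opening parenthesis.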
Since the whole ugly string will contain newlines, don't forget the /s
modifier, which tells the regexp machine to let "." match newlines like
every other character (without it, "." stops at a newline), like so:
$data =~ s/tom/jerry/gs;
(the whole story is in 'man perlre').
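A quick sketch of where /s actually matters: it only changes what "." matches, so a literal pattern like the one above behaves the same with or without it. The sample string is invented.

```perl
use strict;
use warnings;

# Invented sample with a newline between the two markers.
my $data = "BEGIN a\nb END";

# Without /s, "." does not match "\n", so the pattern cannot span the newline.
(my $without = $data) =~ s/BEGIN.*END/X/g;

# With /s, "." matches "\n" too, so the whole string is replaced.
(my $with = $data) =~ s/BEGIN.*END/X/gs;

print "without /s: ", ($without eq $data ? "unchanged" : "changed"), "\n";
print "with /s:    $with\n";
```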
At the end, you just print that:
open(my $outfh, ">:encoding(UTF-8)", "your_output_file") or die "Cannot open your_output_file: $!";
print $outfh $data;
(no comma between the filehandle $outfh and $data)
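Put together, the pieces might look roughly like this. It's only a sketch: the sub name rewrite_file, the two substitutions and the throwaway demo file are all made up so that the thing runs on its own; you'd call it with your real file names and your real regexps.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: read $in as UTF-8, apply the substitutions,
# write the result to $out as UTF-8, and return the new text.
sub rewrite_file {
    my ($in, $out) = @_;

    open(my $fh, "<:encoding(UTF-8)", $in) or die "Cannot open $in: $!";
    local $/ = undef;            # slurp mode, restored when the sub returns
    my $data = <$fh>;
    close($fh);

    # Placeholder substitutions; replace these with your own.
    $data =~ s/tom/jerry/g;
    $data =~ s/;/;\n/g;          # break the long line after each ";"

    open(my $outfh, ">:encoding(UTF-8)", $out) or die "Cannot open $out: $!";
    print $outfh $data;
    close($outfh) or die "Cannot close $out: $!";
    return $data;
}

# Demo on a throwaway input file containing two non-ASCII characters.
my $dir = tempdir(CLEANUP => 1);
my ($in, $out) = ("$dir/in.txt", "$dir/out.txt");
open(my $tmp, ">:encoding(UTF-8)", $in) or die "Cannot create $in: $!";
print $tmp "tom=caf\x{E9};tom=na\x{EF}ve;";
close($tmp) or die "Cannot close $in: $!";

my $result = rewrite_file($in, $out);
```

Using "local $/" inside the sub, rather than a plain assignment, keeps the slurp setting from leaking into the rest of the program.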
> I did try to do this with one of the editors I use (I started with Kate), but
> kate breaks that 5.4 MB "line" into multiple lines of about 4096 bytes /
> characters (at inconvenient places) [...]
Yikes. Neither vim nor Emacs will do that to you (although Emacs gets a bit
sluggish on MB-long lines). I'd put such an editor in the recycle bin (sorry).
[...]
> If some simpler tool can do the job, I'll consider that as well (I have
> occasionally used awk, and maybe sed (I don't think sed ever proved useful for
> me).
Sed is actually pretty nifty, but takes some getting used to.
> Any help appreciated.
I hope that gets you started. Just ask.
Cheers
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iEYEARECAAYFAlpo6r8ACgkQBcgs9XrR2kZ/oQCfdeDP0dugi4wFQZmjPc9FhIgz
ltEAn1Wonm+hhYQO1OMkl7X7p4jjBVBQ
=LOLL
-----END PGP SIGNATURE-----