Re: howto extract data from...
On 11 Jun 2003 Grzesiek Sedek <email@example.com> wrote:
> Anyone have an idea how to extract clear text from an inbox file (the actual
> file is from m$ Entourage on Mac, called Messages)? It got corrupted and
> the mail client does not read it. It's quite big, 500 MB, so I have to do it at
> least semi-automatically. The main problem is the attachments (I don't need
> them); they are quite big. The rest of the content is text.
You do not describe what the contents of the file look like, so I must
guess at what distinguishes attachments from message texts.
My guess then is that the 500 MB file is essentially a text file, and that
the attachments you want to get rid of are big solid blocks of characters:
long sequences of lines, all of the same length, without any spaces in them.
If that is true, a simple sed command will suffice:
    sed -e '/^[^ ][^ ]*$/d' Messages > Messages_attachments_stripped
This says: delete all lines that are not empty and do not contain spaces.
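Before running this on the real file, it may be worth checking what the pattern would remove. Here is a minimal sketch on a made-up sample (the file name sample_messages and its contents are hypothetical, standing in for Messages): it counts the matching lines and prints the first few without changing anything.

```shell
# Hypothetical sample standing in for the real Messages file.
printf '%s\n' \
  'Subject: hello there' \
  '' \
  'Some message text.' \
  'TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5' \
  'IGhpcyByZWFzb24sIGJ1dCBieSB0aGlzIHNpbmd1bGFy' \
  > sample_messages

# Count the lines the pattern would delete (non-empty, no spaces).
grep -c '^[^ ][^ ]*$' sample_messages

# Print the first few of those lines; -n plus p shows them instead of deleting.
sed -n '/^[^ ][^ ]*$/p' sample_messages | head -n 20
```

On the real file, substitute Messages for sample_messages and skip the printf step. If the preview shows lines you want to keep, the expression needs refining first.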
Be careful. You may want to refine the regular expression that selects
the lines to be deleted. As it stands, any unbroken run of characters
that someone may have used in a message text to make a line stand out
as a header will also be deleted, as well as lines delimiting parts
of messages.
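One possible refinement, assuming the attachments are base64-encoded (that is my assumption, as the original post does not say), is to delete only long lines made up entirely of base64 characters. The threshold of 60 characters and the sample file below are hypothetical choices for illustration.

```shell
# Hypothetical sample: a header, a ruler line, a part delimiter,
# a one-word line, and one 76-character base64-looking line.
printf '%s\n' \
  'Subject: report' \
  '==========' \
  '--boundary_0001' \
  'Short' \
  'TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlz' \
  > sample_refine

# Delete only lines of 60 or more characters drawn from A-Z a-z 0-9 + / =.
# The \#...# form picks # as the regex delimiter, so / needs no escaping.
sed -e '\#^[A-Za-z0-9+/=]\{60,\}$#d' sample_refine > sample_refine_stripped

cat sample_refine_stripped
```

With this expression the short ruler line, the delimiter line and the one-word line survive; only the long block line is removed. A ruler of 60 or more `=` characters would still match, and the last, shorter line of each attachment block would be left behind, so the result still needs a look by eye.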
Anjelierstraat 1, 2014 TC Haarlem, Netherlands
tel +31 23 5324909, firstname.lastname@example.org