
Re: OT - Tool for getting text body of email



on Fri, Dec 28, 2001 at 11:12:49AM -0800, Karsten M. Self (kmself@ix.netcom.com) wrote:
> on Sun, Dec 23, 2001 at 10:49:46AM -0900, Christopher S. Swingley (cswingle@iarc.uaf.edu) wrote:
> > I need to write a program that extracts the ASCII text portion
> > of email messages for insertion into a database.  I looked at the
> > libmailtools-perl package, but it doesn't look like it can deal with
> > the annoying variety of mail that I may need to parse (The silly +'s
> > at the end of lines, MIME-attached HTML, vcards, etc.).
> > 
> > What I want is a filter that I pass an email into, and out pops the
> > ASCII message, formatted to a 72-character line width.  All
> > attachments, HTML mail, vcards and strangeness are removed.
> 
> I'm looking for something vaguely similar.
> 
> I think what I'm looking for is a tool that will strictly decode
> quoted-printable mail, base64-encoded mail, and other representations
> that don't resolve as plaintext.  I _don't_ need to resolve HTML or
> other tagging formats.
> 
> The objective is to get the mail body into a form that can be scanned
> for website references.  I use this as part of my spam response system,
> with a script that extracts URLs, strips these to the host portion,
> resolves the IP, queries WHOIS, and parses this for response email
> addresses.
> 
> This isn't possible on messages which are quoted-printable (though
> converting the string "=2E" to "." appears to work around that), or
> otherwise encoded (the plaintext isn't available).
> 
> I've explored a number of options, including munpack, uudecode, and
> metamail, but none appears to do what I want reliably.  My current
> workaround is to pipe a message segment from the "view-attachments" menu
> within mutt.  I'd like to be able to run this from either the index
> mode, or against an mbox or maildir folder.

uudeview was suggested to me off list.  However, it doesn't seem to
support piping data in and/or out.  And when it does produce output,
you're liable to get a graphic image file dumped in a directory someplace.
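
For comparison, here is a rough sketch of the kind of stdin-to-stdout
filter I have in mind, using Python's standard email package.  The
function name extract_text is made up for illustration, and I'm assuming
each message has a charset Python recognizes; this is untested against
real-world spam, so treat it as a starting point rather than a finished
tool.

```python
import sys
from email import message_from_bytes, policy


def extract_text(raw: bytes) -> str:
    """Return the decoded text/plain parts of one RFC 822 message."""
    msg = message_from_bytes(raw, policy=policy.default)
    chunks = []
    for part in msg.walk():
        # Skip multipart containers and anything that is not plain text
        # (HTML, vcards, images, and so on).
        if part.is_multipart() or part.get_content_type() != "text/plain":
            continue
        # Skip plain-text files that were sent as attachments.
        if part.get_content_disposition() == "attachment":
            continue
        # get_content() undoes quoted-printable or base64 transfer
        # encoding and decodes the bytes using the declared charset.
        chunks.append(part.get_content())
    return "\n".join(chunks)


if __name__ == "__main__":
    # Read one message on stdin, write its plain text on stdout.
    sys.stdout.write(extract_text(sys.stdin.buffer.read()))
```

Saved under some name of your choosing, this could be used as the target
of a pipe from mutt's index.  It handles one message per invocation, so
an mbox or maildir folder would need a loop around it.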

Grumble.
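
Once the body is decoded, the URL-scanning step from my earlier message
(extract URLs, strip them to the host portion) might look roughly like
the following.  The regular expression and the name url_hosts are my own
illustrative choices, not what my current script does, and the pattern is
deliberately loose.  Resolving the IP and querying WHOIS are left out.

```python
import re
from urllib.parse import urlsplit

# Loose pattern: http:// or https:// up to whitespace or a closing delimiter.
URL_RE = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def url_hosts(text: str) -> list[str]:
    """Return unique URL host names in text, in order of first appearance."""
    hosts = []
    for match in URL_RE.finditer(text):
        try:
            # hostname drops any port or userinfo and lowercases the result.
            host = urlsplit(match.group(0)).hostname
        except ValueError:
            # Malformed URL (for example a broken IPv6 literal); skip it.
            continue
        if host and host not in hosts:
            hosts.append(host)
    return hosts
```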

-- 
Karsten M. Self <kmself@ix.netcom.com>        http://kmself.home.netcom.com/
 What part of "Gestalt" don't you understand?              Home of the brave
  http://gestalt-system.sourceforge.net/                    Land of the free
We freed Dmitry! Boycott Adobe! Repeal the DMCA! http://www.freesklyarov.org
Geek for Hire                      http://kmself.home.netcom.com/resume.html
