Re: awk or sed program to convert mbox files to HTML
On Wed, Nov 26, 2025 at 12:29:14 -0500, rhkramer@gmail.com wrote:
> Does anybody here know of an AWK or sed program to convert mbox files to HTML?
> [...]
> I know that maildir is the currently favored approach for mail storage, but I
> have well over 100 MB of emails (or pseudo emails) stored in mbox files, and
> want to convert them for easy viewing on the Internet (by anyone).
Why did you specifically ask for awk or sed? They don't seem like the
best choices of programming language for implementing this.
With an input that large, I would avoid bash. It'll be slow. Also,
it has no useful libraries.
You're processing a large amount of text, in a fairly well-defined format,
so any language that's good at text processing should do the job. Perl,
Python, or Tcl would be my picks, but that's probably my personal bias.
I'm guessing that what you want to end up with would be a directory
containing one file per message, plus some sort of index.html file that
links to all of them. If all the messages were plain pre-MIME "header and
body", you could probably write a program to do that in less than an hour.
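For what it's worth, here is a rough Python sketch of that simple case, using
the standard library's mailbox module. It assumes every message is plain
"header and body" text with no MIME structure. The function name mbox_to_html,
the msgNNNNNN.html file naming, and the page layout are all made up for
illustration.

```python
import html
import mailbox
from pathlib import Path

# Bare-bones page template; the layout is an arbitrary choice.
PAGE = "<html><head><meta charset='utf-8'><title>{title}</title></head><body>{body}</body></html>\n"


def mbox_to_html(mbox_path, out_dir):
    """Write one HTML page per message plus an index.html; return the message count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    links = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for n, msg in enumerate(box, start=1):
            # Escape everything that ends up inside HTML.
            subject = html.escape(str(msg.get("Subject", "(no subject)")))
            sender = html.escape(str(msg.get("From", "(unknown)")))
            date = html.escape(str(msg.get("Date", "")))
            # Assumption: non-MIME message. For multipart messages
            # get_payload(decode=True) returns None, so the body comes out empty.
            payload = msg.get_payload(decode=True) or b""
            text = html.escape(payload.decode("utf-8", errors="replace"))
            name = f"msg{n:06d}.html"
            body = f"<h1>{subject}</h1><p>From: {sender}<br>Date: {date}</p><pre>{text}</pre>"
            (out / name).write_text(PAGE.format(title=subject, body=body), encoding="utf-8")
            links.append(f'<li><a href="{name}">{subject}</a> ({sender}, {date})</li>')
    finally:
        box.close()
    index = "<ul>\n" + "\n".join(links) + "\n</ul>"
    (out / "index.html").write_text(PAGE.format(title="Index", body=index), encoding="utf-8")
    return len(links)
```

Calling mbox_to_html("archive.mbox", "site") would fill the site directory with
the pages and the index. Encoded header words and non-UTF-8 bodies are not
handled here.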
It's going to be tricky if you need to parse MIME attachments. At that
point, you'll probably need to break out whatever MIME libraries your
chosen language has. Even if it's just to discard the attachments, using
a MIME library is a better approach than scrubbing out the MIME metadata
lines with raw text manipulation. If you actually want to preserve and
link to the attachments, then the MIME libraries become indispensable.
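To illustrate the "discard the attachments" case, here is a small sketch using
Python's standard email package, which is one example of the kind of MIME
library I mean. The function name body_and_attachment_names is hypothetical, and
it assumes you only want the plain-text body plus a list of what was dropped.

```python
from email import message_from_bytes, policy


def body_and_attachment_names(raw):
    """Return (plain-text body, list of attachment filenames) for one raw message."""
    # policy.default gives the modern EmailMessage API (get_body, iter_attachments).
    msg = message_from_bytes(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    text = part.get_content() if part is not None else ""
    # Attachments are only listed here, not saved; the library finds the
    # part boundaries so no raw text scrubbing is needed.
    names = [att.get_filename() or "(unnamed)" for att in msg.iter_attachments()]
    return text, names
```

If you did want to keep the attachments, each item from iter_attachments() also
has get_content(), which returns the decoded data that could be written to a
file and linked from the message page.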
Finally, you need to think about what you want to do with multipart
messages. A whole bunch of email these days is written in either HTML
or some kind of "rich text", and then gets sent out as a multipart
message, with the original HTML (or rich text converted to HTML) as the
"preferred" part, and the same HTML or rich text converted to regular
text as a "fallback" part. Would you attempt to offer both parts
somehow? Or just offer the HTML part "as is" (probably with some of
the headers reattached above it)?
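Either choice is easy enough to express once a MIME library has split the
message. A sketch in Python with the standard email package follows; the
function names alternatives and render, and the prefer parameter, are
invented for illustration.

```python
import html
from email import message_from_bytes, policy


def alternatives(raw):
    """Return a dict holding whichever of the 'html' and 'plain' bodies exist."""
    msg = message_from_bytes(raw, policy=policy.default)
    found = {}
    for kind in ("html", "plain"):
        part = msg.get_body(preferencelist=(kind,))
        if part is not None:
            found[kind] = part.get_content()
    return found


def render(raw, prefer="html"):
    """Return an HTML fragment for the preferred part, falling back to the other."""
    found = alternatives(raw)
    if prefer == "html" and "html" in found:
        # Returned "as is" and unsanitized; an assumption that is probably
        # not safe for pages published to anyone.
        return found["html"]
    if "plain" in found:
        return "<pre>" + html.escape(found["plain"]) + "</pre>"
    return found.get("html", "")
```

Offering both parts would just mean writing out both entries from
alternatives() and linking one page to the other.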