Re: awk or sed program to convert mbox files to HTML



Thanks for the helpful reply -- some comments interspersed below:

 

On Wednesday, November 26, 2025 01:30:09 PM Greg Wooledge wrote:

> On Wed, Nov 26, 2025 at 12:29:14 -0500, rhkramer@gmail.com wrote:

> > Does anybody here know of an AWK or sed program to convert mbox files to

> > HTML? [...]

> > I know that maildir is the currently favored approach for mail storage,

> > but I have well over 100 MB of emails (or pseudo emails) stored in mbox

> > files, and want to convert them for easy viewing on the Internet (by

> > anyone).

>

> Why did you specifically ask for awk or sed? They don't seem like the

> best choices for programming languages to implement this.

 

I thought they would be languages I could reasonably "handle". I have little knowledge of or experience with Perl, C[++], Python, or Tcl, for example. (The last general-purpose languages I was reasonably fluent in were Algol and Pascal, though I might be forgetting some.)

 

If I found a reasonably well-written and well-documented program in some other language that already does most of what I need, I imagine that I could modify it as required.

 

> With that large of an input, I would avoid bash. It'll be slow. Also,

> it has no useful libraries.

>

> You're processing a large amount of text, in a fairly well-defined format,

> so any language that's good at text processing should do the job. Perl,

> Python, or Tcl would be my picks, but that's probably my personal bias.

>

> I'm guessing that what you want to end up with would be a directory

> containing one file per message, plus some sort of index.html file that

> links to all of them.

 

I hadn't thought that far ahead, but that seems like a good approach.
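
For reference, a minimal sketch of the "one file per message plus index.html" layout, written in Python because its standard library ships an mbox reader (the `mailbox` module). The function names `render_message` and `mbox_to_html` and the `msgNNNNN.html` file naming are made up for illustration, and the sketch assumes plain header-and-body messages; multipart messages get a placeholder instead of a body.

```python
import html
import mailbox
from pathlib import Path


def render_message(msg):
    """Return a minimal HTML page for one plain (non-MIME) message.

    Hypothetical helper: the page layout is an assumption, not a standard.
    """
    subject = str(msg.get("Subject", "(no subject)"))
    headers = "\n".join(
        f"{name}: {msg.get(name, '')}" for name in ("From", "Date", "Subject")
    )
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart messages return a list of parts; out of scope here.
        body = "(multipart message; body omitted)"
    return (
        f"<html><head><title>{html.escape(subject)}</title></head><body>\n"
        f"<pre>{html.escape(headers)}\n\n{html.escape(body)}</pre>\n"
        "</body></html>\n"
    )


def mbox_to_html(mbox_path, out_dir):
    """Write one HTML file per message plus an index.html linking to them.

    Returns the number of messages written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    links = []
    for n, msg in enumerate(mailbox.mbox(mbox_path), start=1):
        name = f"msg{n:05d}.html"  # hypothetical naming scheme
        (out / name).write_text(render_message(msg), encoding="utf-8")
        subject = html.escape(str(msg.get("Subject", "(no subject)")))
        links.append(f'<li><a href="{name}">{subject}</a></li>')
    index = "<html><body><ul>\n" + "\n".join(links) + "\n</ul></body></html>\n"
    (out / "index.html").write_text(index, encoding="utf-8")
    return len(links)
```

Calling `mbox_to_html("notes.mbox", "site")` would then fill a `site` directory (both names are placeholders). Escaping with `html.escape` matters here because note bodies may contain `<` or `&`.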

 

> If all the messages were plain pre-MIME "header and

> body", you could probably write a program to do that in less than an hour.

>

> It's going to be tricky if you need to parse MIME attachments. At that

> point, you'll probably need to break out whatever MIME libraries your

> chosen language has. Even if it's just to discard the attachments, using

> a MIME library is a better approach than scrubbing out the MIME metadata

> lines with raw text manipulation. If you actually want to preserve and

> link to the attachments, then the MIME libraries become indispensable.

 

Yeah, MIME. The "pseudo emails" I referred to are basically my own plain-text notes without attachments. But I do want to handle "real" emails as well, so I will have to deal with MIME eventually. That requires more thought, or I may defer it until some indefinite time in the future.
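
For later reference, a rough sketch of the "use a MIME library just to discard the attachments" idea, using Python's standard `email` package. The function name `text_without_attachments` is hypothetical, and keeping only `text/plain` parts is an assumption about what would be wanted.

```python
from email import message_from_bytes, policy


def text_without_attachments(raw_bytes):
    """Return the text/plain content of a message, skipping attachments.

    Hypothetical helper; raw_bytes is one complete message (headers + body).
    """
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold other parts, no text of their own
        if part.get_content_disposition() == "attachment":
            continue  # discard attachments, even text ones
        if part.get_content_type() == "text/plain":
            # get_content() undoes the transfer encoding and charset
            chunks.append(part.get_content())
    return "\n".join(chunks)
```

A plain non-MIME note passes through this unchanged, since a message without a Content-Type header is treated as `text/plain`, so the same code path could serve both the notes and the "real" emails.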

 

> Finally, you need to think about what you want to do with multipart

> messages. A whole bunch of email these days is written in either HTML

> or some kind of "rich text", and then gets sent out as a multipart

> message, with the original HTML (or rich text converted to HTML) as the

> "preferred" part, and the same HTML or rich text converted to regular

> text as a "fallback" part. Would you attempt to offer both parts

> somehow? Or just offer the HTML part "as is" (probably with some of

> the headers reattached above it)?

 

As with MIME, this does not affect my notes, which are not multipart, so I may defer it until some indefinite time in the future.
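
If it does come up, choosing between the "preferred" and "fallback" parts might look something like the sketch below, again with Python's standard `email` package. The function name `pick_body` and the default preference order (HTML first, plain text second) are assumptions for illustration.

```python
from email import message_from_bytes, policy


def pick_body(raw_bytes, prefer=("html", "plain")):
    """Return (subtype, text) for the preferred body part, or (None, "").

    Hypothetical helper; `prefer` lists text subtypes in order of preference.
    """
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    part = msg.get_body(preferencelist=prefer)
    if part is None:
        return None, ""
    return part.get_content_subtype(), part.get_content()
```

Passing `prefer=("plain",)` would select the fallback text instead, so offering both versions amounts to calling it twice.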

 

