
Re: A Bit of a Strange Situation



On Thu, Aug 25, 2011 at 15:45, Bob Proulx <bob@proulx.com> wrote:
> RiverWind wrote:
>> The idea was to concat a large html file and then convert it to
>> text. The pdf can be converted to text, and it so far seems like a
>> pretty viable translation.
>
> If I were going to do that for myself I would convert each individual
> html file to text first and then concatenate the individual text
> files.  The reason being that the individual html files are at that
> moment completely consistent.  Individually they should be able to
> convert to text cleanly with no problems.  And then the text can be
> concatenated.  But once you concatenate the html then you have created
> a Frankenstein html file that is almost certainly going to be
> problematic to convert to text.
>
> Also, my naive experience with this is that converting html to text is
> a lot easier than converting pdf to text.  With html it is already a
> text type.  The mime type is "text/html" after all.  But pdf has been
> less accessible for conversions for me.  The mime type is
> "application/pdf" and isn't a text type.  That leaves more room
> for errors to creep in.
>

yes, converting html to text is easier than converting pdf to text.
pdf is nice in its native format, but extracting content from it is a
pain. pdf is not text. you can break its elements into a dom-like
structure, but html's dom and pdf's "dom" aren't the same: a pdf
element carries the absolute x/y position where it is to be displayed,
and the element can be binary data (e.g. a picture).
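
for the html route, bob's convert-each-file-then-concatenate idea can be
sketched roughly like this in python, using only the standard library.
the names TextExtractor, html_to_text and convert_and_concat are made
up for illustration, and the text extraction is deliberately crude
(every text fragment ends up on its own line), so treat it as a
starting point rather than a finished converter:

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Convert one HTML document to plain text, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def convert_and_concat(html_paths, out_path):
    """Convert each file on its own, then join the resulting text."""
    texts = [html_to_text(Path(p).read_text(encoding="utf-8"))
             for p in html_paths]
    Path(out_path).write_text("\n\n".join(texts) + "\n", encoding="utf-8")
```

the point of the design is that each html file is parsed while it is
still a complete, consistent document, and only plain text is ever
concatenated. the utf-8 encoding is an assumption about the input files.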

that said, i don't think there will be any accessibility issues with
that pdf, and it might even convert cleanly (accessibility and clean
conversion have a lot to do with each other). so, i would just go with
the pdf and be done with it. however, if you are hell-bent on
converting it to something, i would use something that will keep some
formatting - latex or pod come to mind. maybe consider this:
http://cpan.uwinnipeg.ca/htdocs/Pod-HTML2Pod/Pod/HTML2Pod.html

the latex looks pretty simple too (though i have minimal experience with tex):
http://www.iwriteiam.nl/html2tex.html

as for parsing those html files to figure out the chapters, i'd
personally use perl: search each file for its chapter and section,
build up a hash of that info and the file that contains it, sort the
keys and go from there.
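
a minimal sketch of that chapter/section index, written in python
rather than perl (a dict plays the role of the hash). it assumes the
files mark themselves with text like "Chapter 3" or "Section 3.2";
those patterns and the names sort_key, order_files and order_directory
are guesses for illustration, so adjust them to whatever the real
files contain:

```python
import re
from pathlib import Path

# Assumed heading patterns; the real files may label chapters differently.
CHAPTER_RE = re.compile(r"chapter\s+(\d+)", re.IGNORECASE)
SECTION_RE = re.compile(r"section\s+(\d+)\.(\d+)", re.IGNORECASE)


def sort_key(html):
    """Return (chapter, section) for one file, or None if nothing matches.

    Section 0 stands for a chapter's opening file.
    """
    section = SECTION_RE.search(html)
    if section:
        return int(section.group(1)), int(section.group(2))
    chapter = CHAPTER_RE.search(html)
    if chapter:
        return int(chapter.group(1)), 0
    return None


def order_files(contents_by_name):
    """Map (chapter, section) -> file name, then list names in sorted order."""
    index = {}
    for name, html in contents_by_name.items():
        key = sort_key(html)
        if key is not None:
            index[key] = name  # a later file with the same key replaces an earlier one
    return [index[key] for key in sorted(index)]


def order_directory(directory):
    """Read every *.html file in a directory and return the names in order."""
    contents = {p.name: p.read_text(encoding="utf-8", errors="replace")
                for p in Path(directory).glob("*.html")}
    return order_files(contents)
```

the keys are integer tuples, so section 1.10 sorts after 1.2 instead of
before it, which a plain string sort would get wrong. files that match
neither pattern are left out, and a chapter opener that happens to
mention a section number (say, in a table of contents) would be
misfiled, so the result is worth checking by eye.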

as i suspected, there does not seem to be an easy way to go from
pdf -> latex.

