Re: A Bit of a Strange Situation
On Thu, Aug 25, 2011 at 15:45, Bob Proulx <bob@proulx.com> wrote:
> RiverWind wrote:
>> The idea was to concat a large html file and then convert it to
>> text. The pdf can be converted to text, and it so far seems like a
>> pretty viable translation.
>
> If I were going to do that for myself I would convert each individual
> html file to text first and then concatenate the individual text
> files. The reason being that the individual html files are at that
> moment completely consistent. Individually they should be able to
> convert to text cleanly with no problems. And then the text can be
> concatenated. But once you concatenate the html then you have created
> a Frankenstein html file that is almost certainly going to be
> problematic to convert to text.
>
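a rough sketch of that convert-first, concatenate-second order, in python and using only the standard library. TextExtractor, html_to_text and convert_then_concat are names made up for this example, and it takes html strings rather than reading files, so treat it as an illustration and not a finished converter:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Convert one complete html document to plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def convert_then_concat(pages):
    """Convert each html document on its own, then join the text results."""
    return "\n\n".join(html_to_text(page) for page in pages)
```

each page gets its own parser, so a broken tag in one file cannot leak into the next one, which is the whole point of not concatenating the html first.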
> Also, my naive experience with this is that converting html to text is
> a lot easier than converting pdf to text. With html it is already a
> text type. The mime type is "text/html" after all. But pdf has been
> less accessible for conversions for me. The mime type is
> "application/pdf" and isn't a text type. That leaves more room
> for errors to creep in.
>
yes, converting html to text is easier than converting pdf to text.
pdf is nice in its native format, but when you get into extracting
stuff, it's a pain. pdf is not text. you can break the elements into a
dom-like structure; however, html's dom and pdf's "dom" aren't the
same: pdf places each element at an absolute x/y position on the page,
and an element can be binary data (i.e. a picture).
that said, i don't think there will be any accessibility issues with
that pdf and it might even convert cleanly (one has a lot to do with
the other). so, i would just go with the pdf and be done with it.
however, if you are hell-bent on converting it to something, i would
use a format that keeps some of the formatting; latex or pod come to
mind. maybe consider this:
http://cpan.uwinnipeg.ca/htdocs/Pod-HTML2Pod/Pod/HTML2Pod.html
the latex route looks pretty simple too (though i have minimal experience with tex):
http://www.iwriteiam.nl/html2tex.html
as for parsing those html files to figure out the chapters, i'd
personally use perl: search each file for its chapter and section,
build up a hash mapping that info to the file that contains it, sort,
and go from there.
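here is roughly what i mean, sketched in python rather than perl. the "Chapter N" / "Section M" heading pattern is purely an assumption about how those files are marked up, and build_index and reading_order are hypothetical names, so adjust to whatever the real files contain:

```python
import re

# Assumed heading pattern: "Chapter 3", optionally followed later by "Section 2".
HEADING = re.compile(
    r"Chapter\s+(\d+)(?:.*?Section\s+(\d+))?",
    re.IGNORECASE | re.DOTALL,
)


def build_index(files):
    """Map (chapter, section) -> filename; `files` is {filename: html}."""
    index = {}
    for name, html in files.items():
        match = HEADING.search(html)
        if match is None:
            continue  # no recognisable chapter heading; left out of the index
        chapter = int(match.group(1))
        section = int(match.group(2) or 0)  # 0 when no section is found
        index[(chapter, section)] = name
    return index


def reading_order(files):
    """Return filenames sorted by chapter number, then section number."""
    index = build_index(files)
    return [index[key] for key in sorted(index)]
```

the keys are stored as integers so that chapter 10 sorts after chapter 2 instead of before it, which is what a plain string sort would do.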
it does not seem that there is an easy way to go from pdf -> latex (as
i suspected).