Re: A Bit of a Strange Situation

To: debian-user@lists.debian.org
Subject: Re: A Bit of a Strange Situation
From: shawn wilson <ag4ve.us@gmail.com>
Date: Thu, 25 Aug 2011 16:57:56 -0400
Message-id: <CAH_OBics+bU-go+i4oizOJbWD0owe__GZaq+L_7=LFuxF+PCLw@mail.gmail.com>
In-reply-to: <20110825194556.GA6768@hysteria.proulx.com>
References: <Pine.BSF.4.64.1108241933070.67865@server1.shellworld.net> <slrnj5c6fs.2lq.curty@einstein.electron.org> <alpine.BSF.2.00.1108250554310.85279@freire1.furyyjbeyq.arg> <slrnj5ck34.2ng.curty@einstein.electron.org> <Pine.BSF.4.64.1108251343200.2308@server1.shellworld.net> <20110825194556.GA6768@hysteria.proulx.com>

On Thu, Aug 25, 2011 at 15:45, Bob Proulx <bob@proulx.com> wrote:
> RiverWind wrote:
>> The idea was to concat a large html file and then convert it to
>> text. The pdf can be converted to text, and it so far seems like a
>> pretty viable translation.
>
> If I were going to do that for myself I would convert each individual
> html file to text first and then concatenate the individual text
> files.  The reason being that the individual html files are at that
> moment completely consistent.  Individually they should be able to
> convert to text cleanly with no problems.  And then the text can be
> concatenated.  But once you concatenate the html then you have created
> a Frankenstein html file that is almost certainly going to be
> problematic to convert to text.
>
> Also, my naive experience with this is that converting html to text is
> a lot easier than converting pdf to text.  With html it is already a
> text type.  The mime type is "text/html" after all.  But pdf has been
> less accessible for conversions for me.  The mime type is
> "application/pdf" and isn't a text type.  That leaves more room
> for error.
>

yes, converting html to text is easier than converting pdf to text.
pdf is nice in its native format, but extracting content from it is a
pain because pdf is not text. you can break the elements into a
dom-like structure, but html's dom and pdf's "dom" aren't the same:
pdf gives each element an absolute x/y position where it is to be
displayed, and an element can be binary data (e.g. a picture).
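
to illustrate why the html side is the easy one, here is a minimal
sketch of bob's suggestion above (convert each html file to text on
its own, then concatenate the text). it is python using only the
standard library's html.parser. the names TextExtractor, html_to_text
and convert_and_concat are made up for this example, and it only pulls
out text nodes, so it makes no attempt to preserve layout:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """collect text nodes, skipping anything inside script or style"""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # keep non-empty text that is not inside a skipped element
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """convert one html document to plain text, one text node per line"""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def convert_and_concat(pages):
    """convert each page separately, then join the text results"""
    return "\n\n".join(html_to_text(page) for page in pages)
```

each page is parsed while it is still a complete, consistent document,
so nothing depends on what a concatenated html file would look like.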

that said, i don't think there will be any accessibility issues with
that pdf and it might even convert cleanly (one has a lot to do with
the other). so, i would just go with the pdf and be done with it.
however, if you are hell-bent on converting it to something, i would
use a format that keeps some of the formatting. latex or pod come to
mind. maybe consider this:
http://cpan.uwinnipeg.ca/htdocs/Pod-HTML2Pod/Pod/HTML2Pod.html

going from html to latex looks pretty simple too (though i have minimal experience with tex):
http://www.iwriteiam.nl/html2tex.html

as for parsing those html files to figure out the chapters, i'd
personally use perl: search each file for its chapter and section,
build up a hash of that info and the file that contains it, sort, and
go from there.
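
a rough sketch of that idea, written in python rather than perl to
keep it short. it assumes the files contain headings that look like
"Chapter 3" or "Section 3.2". that pattern is a guess, since i haven't
looked at how the real files mark their chapters, and the function
names are made up for the example:

```python
import re

# assumed heading patterns; the real files may mark chapters differently
CHAPTER_RE = re.compile(r"chapter\s+(\d+)", re.IGNORECASE)
SECTION_RE = re.compile(r"section\s+(\d+)\.(\d+)", re.IGNORECASE)


def find_position(text):
    """return (chapter, section) for a file's text, or None if nothing matches"""
    sec = SECTION_RE.search(text)
    if sec:
        # a section heading carries both numbers, so prefer it when present
        return (int(sec.group(1)), int(sec.group(2)))
    chap = CHAPTER_RE.search(text)
    if chap:
        # a chapter heading with no section sorts before that chapter's sections
        return (int(chap.group(1)), 0)
    return None


def order_files(contents):
    """contents maps file name -> file text; return names in chapter/section order"""
    index = {}
    for name, text in contents.items():
        pos = find_position(text)
        if pos is not None:
            index[name] = pos
    # compare numerically so chapter 10 lands after chapter 2
    return sorted(index, key=lambda name: (index[name], name))
```

files where no chapter or section is found are left out of the result
rather than guessed at, so they can be placed by hand.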

it does not seem that there is an easy way to go from pdf -> latex (as
i suspected).

References:
- A Bit of a Strange Situation
  - From: RiverWind <riverwind@shellworld.net>
- Re: A Bit of a Strange Situation
  - From: Curt <curty@free.fr>
- Re: A Bit of a Strange Situation
  - From: Jude DaShiell <jdashiel@shellworld.net>
- Re: A Bit of a Strange Situation
  - From: Curt <curty@free.fr>
- Re: A Bit of a Strange Situation
  - From: RiverWind <riverwind@shellworld.net>
- Re: A Bit of a Strange Situation
  - From: Bob Proulx <bob@proulx.com>