
Re: A Bit of a Strange Situation

Hey There,

When you talk about converting all of the small files from html to
txt and then doing the concatenation, you are describing the most
desirable course of action, and actually the very first method I
thought about. However, the file naming protocol doesn't lend itself
at all to such a conversion, at least not in any way that I can come
up with.

The files have a somewhat unconventional extension, to wit
"cookbook3.html#SEC1", "cookbook14.html#SEC2", and so on. I believe
the number before the ".html" designates the part, and the number
following the "#SEC" indicates the section within that part of the
book. That makes the use of wildcards a bit ticklish. If I could
just figure out a way around this problem, however, I would be in
business, because
html2txt conversions would be easy, and the concatenation even
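
One possible way around the wildcard problem is to pull the two
numbers out of each name and sort on them as integers, so that part 3
comes before part 14 and SEC2 before SEC10. The following is only a
rough sketch: the pattern is an assumption based on the two names
quoted above, and NAME_RE and sort_key are hypothetical helper names.

```python
import re

# Assumed naming pattern, based only on the two examples in the message:
# "cookbook<part>.html", optionally followed by "#SEC<section>".
NAME_RE = re.compile(r"^cookbook(\d+)\.html(?:#SEC(\d+))?$")


def sort_key(name):
    """Return (part, section) as integers so names sort numerically."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected name: {name!r}")
    part = int(match.group(1))
    # Names without a "#SEC" suffix sort before any numbered section.
    section = int(match.group(2)) if match.group(2) else 0
    return (part, section)


names = [
    "cookbook14.html#SEC2",
    "cookbook3.html#SEC10",
    "cookbook3.html#SEC1",
    "cookbook3.html#SEC2",
]
ordered = sorted(names, key=sort_key)
print(ordered)
```

A plain alphabetical sort would put "cookbook14" ahead of "cookbook3",
which is why the key converts both numbers to integers first.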


Feel free to visit my website and my blog and learn more about me
and what I stand for.
My Website @ http://riverwind.shellworld.net
My Blog http://windraven13.livejournal.com/

On Thu, 25 Aug 2011, Bob Proulx wrote:

RiverWind wrote:
The idea was to concat a large html file and then convert it to
text. The pdf can be converted to text, and it so far seems like a
pretty viable translation.

If I were going to do that for myself I would convert each individual
html file to text first and then concatenate the individual text
files.  The reason being that the individual html files are at that
moment completely consistent.  Individually they should be able to
convert to text cleanly with no problems.  And then the text can be
concatenated.  But once you concatenate the html then you have created
a Frankenstein html file that is almost certainly going to be
problematic to convert to text.
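
The convert-first, concatenate-second order described above could be
sketched roughly as follows. This is only an illustration using
Python's standard html.parser module, not the html2txt tool mentioned
in the thread; TextExtractor, html_to_text, and
convert_and_concatenate are hypothetical names, and the list of
block-level tags is an assumption.

```python
from html.parser import HTMLParser

# Tags after which a line break is emitted (an assumed, partial list).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "title",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose contents are not readable text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect the text of one HTML document, skipping script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Convert a single, self-consistent HTML document to plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks)
    # Collapse whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def convert_and_concatenate(html_documents):
    """Convert each document on its own, then join the text results."""
    return "\n\n".join(html_to_text(doc) for doc in html_documents)
```

Each document is parsed in isolation, so a stray tag in one file cannot
affect the conversion of the next; only the finished text is joined.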

Also, my naive experience with this is that converting html to text is
a lot easier than converting pdf to text.  With html it is already a
text type.  The mime type is "text/html" after all.  But pdf has been
less accessible for conversions for me.  The mime type is
"application/pdf" and isn't a text type.  That leaves more room
for error.

