
Re: PS to HTML?



On 24-Jan-01 Dave Sherohman wrote:
> BTW, anyone know what's up with pstotext?  I ran a PS doc through it
> last night and there were a lot of extra spa ces in  the outpu t,
> including many in mid-word.  Is this preventable?

[Gurus: please read the speculative bit at the end]

No. And there's nothing going wrong with pstotext in this respect
either.

A typical reason would be that the software which created the
PostScript file did some "kerning", i.e. moving letters of a
word (usually) closer together (e.g. in "Wombat" the "ombat"
would be moved slightly left so that the "o" was slightly
over-hung by the "W").

When this happens, the sequence of characters in a word is
broken at that point, and a PostScript "motion" command is
interpolated, so that in the PS code it is no longer a
contiguous sequence of characters. Unless your PStoWhatever
is clever enough to reconstruct the intended word from the
fragments, it will do the dumb default thing of treating
separate sequences as separate words. And it would have to
be pretty clever, since the spacing between words (in "filled"
text) may be done by exactly the same mechanism as kerning.
The following, for instance, is from a PS file containing
the sentence "The Wombat is a small animal.":

  .318(The W)12.318 F .318 (ombat is a)-1.92 F 3.802
  (small animal.)72 244.8 R

Since the break-up is present in the PS file to start with,
it is not due to pstotext in the first place. Only if
pstotext was supposed to be capable of realising that "Wombat"
was the intended result of "W" followed by "ombat", while
"a" followed by "small" should be left alone, should you
suspect a flaw in pstotext.
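
To make the dilemma concrete, here is a minimal Python sketch that
pulls the string fragments out of the PS excerpt quoted above and
joins them in the two obvious "dumb" ways. The helper name ps_strings
and the regular expression are my own illustration (it ignores
escaped or nested parentheses); it is not how pstotext works.

```python
import re

# The PS excerpt quoted above, as a single Python string.
PS_EXCERPT = (
    ".318(The W)12.318 F .318 (ombat is a)-1.92 F 3.802\n"
    "(small animal.)72 244.8 R"
)

def ps_strings(ps):
    """Return the contents of each (...) string in a PS fragment.

    Hypothetical helper: handles only simple strings with no
    backslash escapes and no nested parentheses.
    """
    return re.findall(r"\(([^()\\]*)\)", ps)

fragments = ps_strings(PS_EXCERPT)

# Treat every fragment as a separate word: "Wombat" gets split.
spaced = " ".join(fragments)

# Treat every break as kerning: "a" and "small" get fused.
glued = "".join(fragments)

print(fragments)
print(spaced)
print(glued)
```

Neither blanket rule recovers the sentence: the first yields
"The W ombat is a small animal." and the second yields
"The Wombat is asmall animal.". Only the size of the motion between
fragments tells the two kinds of break apart.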

As you apparently realise, you cannot expect to do better than
a very crude extraction of textual content from a PS file;
a PS file is a computer program for placing marks on a page,
and the fact that some of these marks are represented by
characters is pretty incidental.

[For gurus] Nevertheless, I suspect that a relatively straightforward
algorithm could be created for this job, assuming (for present
purposes) that only the standard printable ASCII characters are
needed.

When a construct like "(The W)" is encountered, this is interpreted
as an instruction to render the string "The W" on the display
device. Each character (including the space) in the string is
in fact a pointer to a position in the font definition which
causes the PS interpreter to look up the primitive PS drawing
commands which will create the shape of the printed character.

It strikes me as eminently possible to construct a program
which would act like a PS interpreter in all respects _except_
that the drawing commands evoked by (e.g) the character "W"
would be replaced by simple emission of the ASCII code for "W"
to the standard output. Questions of motion between characters
could be handled by the following kind of thing (where "Motion"
means the displacement between where the PS file asks for a
character to be printed, and where it would have been printed
if it had immediately followed the previously printed character):

1. If the Motion is a small Motion (kerning) ignore it.
2. If the Motion is (approximately) a positive space, emit
   a space. Similarly for (approximately) 2 or more spaces.
3. If the Motion is (approximately) a negative space (overprinting)
   emit a "backspace".
4. If a Motion is (approximately) a positive or negative line-space,
   (superscript, subscript) emit the corresponding positive or
   negative line feed.
5. If a Motion is a combination of backspace & upwards (accent above)
   emit the appropriate thing.

Etc.
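
Rules 1 to 4 above could be sketched roughly as follows in Python.
Everything here is an assumption made for illustration: the function
names, the threshold of half a space width for "small", the
convention that dy > 0 means further down the page, the use of ESC-7
(the reverse line feed that col(1) accepts) for a negative line feed,
and the Motion values in the example, which are invented and not
computed from any real PS file. Rule 5 is left out.

```python
BACKSPACE = "\b"
LINE_FEED = "\n"
REVERSE_LINE_FEED = "\x1b7"  # assumption: ESC-7, as accepted by col(1)

def motion_to_text(dx, dy, space_width, line_height, kern_limit=0.5):
    """Translate a Motion (dx, dy) into the characters to emit.

    dx, dy are the displacement from where the next character would
    have fallen had it immediately followed the previous one, in the
    same units as space_width and line_height. dy > 0 is taken to
    mean "down the page" (an assumption of this sketch).
    """
    out = ""

    # Rule 4: vertical Motion of about n line-spaces.
    lines = round(dy / line_height)
    if lines > 0:
        out += LINE_FEED * lines
    elif lines < 0:
        out += REVERSE_LINE_FEED * -lines

    # Rule 1: a small horizontal Motion is kerning; ignore it.
    if abs(dx) <= kern_limit * space_width:
        return out

    # Rules 2 and 3: about n spaces forward, or n spaces back.
    spaces = round(dx / space_width)
    if spaces > 0:
        out += " " * spaces
    elif spaces < 0:
        out += BACKSPACE * -spaces
    return out

def extract(pieces, space_width=1.0, line_height=4.0):
    """Join (dx, dy, text) pieces, emitting text for each Motion."""
    return "".join(
        motion_to_text(dx, dy, space_width, line_height) + text
        for dx, dy, text in pieces
    )

# Invented Motions: a slight kern before "ombat", and roughly one
# space before "small".
pieces = [
    (0.0, 0.0, "The W"),
    (-0.2, 0.0, "ombat is a"),
    (1.1, 0.0, "small animal."),
]
result = extract(pieces)
print(result)
```

With these made-up Motions the kern is swallowed and the word gap
becomes one space. A real program would of course have to take the
space width and line spacing from the font in force at each point.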

Now: does anyone know a program which works like that?

(The advantage would be that the sort of thing that Dave Sherohman
wants to do would be straightforward and should come out right,
while many of the computer-program-like things which PostScript
can do -- like loops and conditional branching -- would also
be done as they should be; also definitions within the file
(which can be "macros" that print out as blocks of text)
would work too.)

Best wishes to all,
Ted.

--------------------------------------------------------------------
Topical Thought:  It is better to arrive, than to travel hopefully.
E-Mail: (Ted Harding) <Ted.Harding@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 284 7749
Date: 24-Jan-01                                       Time: 17:44:01
------------------------------ XFMail ------------------------------


