
RE: How to read a word 7 file?



On 23-Jun-98 Luiz Otavio L. Zorzella wrote:
> 
> "strings" would do a good job for me, but...
> 
>  > and then edit wordfile.txt to clean it up. Raw "strings" will skip
>  > sequences of fewer than 4 ASCII characters but these are unlikely to
>  > occur in a Word document. This method will suppress all formatting info
>  > except end-of-line, so you are likely to get long lines (= Word
>  > paragraphs). It will also fail to recognise any non-US-ASCII character
>  > codes (above 127) so accented characters and special symbols, etc, will
>  > be missed. But if you simply need to read the text content of a Word
>  > document containing plain English text, then this method works fine.
> 
> ... my text is in Portuguese, and does have non-US chars. Is there a
> way to tell "strings" to accept some non-US chars?

Unfortunately not, or not well ... "strings" works by extracting sequences of
US-ASCII characters (codes 32-126) of length (by default) at least 4. If you
KNEW that codes outside this range really did represent characters (such as
the accented characters in Portuguese) then you wouldn't need to use "strings"!
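
To make the mechanism concrete, here is a minimal sketch in Python of what
"strings" does as described above. It is not the real utility's source; the
function name ascii_strings and the sample bytes are made up for illustration.

```python
import re

def ascii_strings(data, min_len=4):
    """Return runs of US-ASCII bytes (codes 32-126) at least min_len long."""
    # Character class covers bytes 0x20 (space) through 0x7e (~).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# Hypothetical fragment of a binary file: control bytes around some text,
# including the Portuguese word "informação" in a Windows 8-bit encoding.
sample = b"\x00\x01Hello world\x00\x02ab\x03informa\xe7\xe3o\x00"
print(ascii_strings(sample))  # ['Hello world', 'informa']
```

The two accented bytes end the run early, so the word comes out truncated and
the trailing "o" is dropped for being shorter than 4 characters.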

However, in word-processor files (such as Word's or WordPerfect's) the
codes outside that range have various "binary" meanings as well as (in Word's
case) representing "special" characters. The only way to get at the special
characters would be to interpret the binary codes so as to locate stretches of text.
Otherwise, an approach as simple as the one used by "strings" would simply
have to output every byte in the file. Useless. Sorry. And apologies for blindly
assuming you were after plain ASCII!

Without going for programs (such as those suggested by others) which really can
at least partially interpret a Word file, the best you could do with "strings"
would be to edit-in the missing accented characters afterwards. Possible, but
tedious, and maybe error-prone.

However, the handiness of "strings" as a quick utility is such that it might be
worth re-coding it so that, as well as the ASCII codes, it also included the
codes for the accented characters of a particular language. This would
increase the amount of garbage in the output but, so long as one was selective,
perhaps by not too much. (If you're going to do this for Word, remember that
there are two different encodings for accented characters: "Win"-encoding and
"Mac"-encoding).
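
A rough sketch of such a re-coded "strings", in Python, might look like the
following. Everything here is an assumption for illustration: the function
name text_strings, the particular set of Portuguese letters, and the use of
Python's "cp1252" and "mac_roman" codecs to stand in for the "Win" and "Mac"
encodings. It also assumes the text is stored one byte per character.

```python
import re

# Accented letters to accept in addition to ASCII. This set is a guess at
# what Portuguese text needs; adjust it for the language at hand.
PT_ACCENTED = "áàâãéêíóôõúüçÁÀÂÃÉÊÍÓÔÕÚÜÇ"

def text_strings(data, extra=PT_ACCENTED, encoding="cp1252", min_len=4):
    """Like "strings", but also accept the bytes that encode `extra`."""
    extra_bytes = extra.encode(encoding)
    # ASCII 32-126 plus the selected accented bytes, nothing else.
    char_class = b"[\x20-\x7e" + re.escape(extra_bytes) + b"]"
    pattern = re.compile(char_class + b"{%d,}" % min_len)
    return [m.group().decode(encoding) for m in pattern.finditer(data)]

# Hypothetical "Win"-encoded fragment.
win_sample = b"\x00\x01informa\xe7\xe3o\x00\x02ab\x03"
print(text_strings(win_sample))  # ['informação']

# The same word as a "Mac"-encoded fragment needs the other codec.
mac_sample = b"\x00" + "informação".encode("mac_roman") + b"\x00"
print(text_strings(mac_sample, encoding="mac_roman"))  # ['informação']
```

Keeping the extra set small is what limits the garbage: every added byte value
is one more that binary data can match by accident.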

The best of luck with the other options!
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding@nessie.mcc.ac.uk>
Date: 23-Jun-98                                       Time: 21:45:54
--------------------------------------------------------------------

