
Re: Hebrew: General question about location of shortcut indicator "&" and reading direction



c.buhtz@posteo.jp writes:

 > How is this represented as bytes on the data disc?
 > 
 > As an simple example lets assume "ABC" is a word in Left-to-Right.
 > Making it a Hebrew word (e.g. via translation) it would be written
 > "CBA" because its read from Right-to-left, starting with "A", then
 > "B" and "C" at the end.

If "ABC" are three Hebrew letters *spoken* in that order, then in most
BIDI UIs:

1.  You *type* A B C.
2.  The *disk file* contains "ABC".
3.  You *see* "CBA".
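
To make step 2 concrete, here is a small Python sketch.  It assumes the
three letters are alef, bet, gimel (U+05D0 to U+05D2) and that the file
is UTF-8; both are my choices for illustration, not something stated in
this thread.

```python
# Three Hebrew letters in the order they are spoken and typed
# ("logical order").  Assumption: alef, bet, gimel stand in for
# the "A", "B", "C" of the example.
word = "\u05d0\u05d1\u05d2"

# Encoding never reorders anything: the bytes follow logical order,
# two bytes per letter in UTF-8.
data = word.encode("utf-8")
print(data.hex(" "))  # d7 90 d7 91 d7 92
```

The right-to-left appearance is produced only when the text is drawn.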

 > Am I right so far?

I'm not sure because "written" is ambiguous here! :-)

 > No lets add such a shortcut indicator to the first letter (the
 > "A").  Weblate and Qt seems to use the correct BIDI algorithm and
 > will display it correctly like this:
 > 
 >      "CB&A" (or an underlined "A" in a Qt GUI)

That doesn't make sense to me as the output of the BIDI algorithm
alone.  I assume that the indicator '&' is the literal character '&',
and in English text comes before the shortcut character, right?  Then
'&' would be a neutral character according to the algorithm, and the
string that is typed "& A B C" would appear in the file as "&ABC" and
on the display as "CBA&" because the neutral character "&" takes its
direction from the surrounding text.  So as I understand the
algorithm, to get the result "CB&A" you would need directional control
characters.
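
The direction property of '&' can be checked with Python's standard
unicodedata module, which exposes the bidirectional class from the
Unicode Character Database.  This is just a lookup sketch; the sample
characters are my own picks.

```python
import unicodedata

# Sample characters (chosen for illustration only).
samples = {
    "ampersand": "&",
    "space": " ",
    "latin a": "a",
    "hebrew alef": "\u05d0",
    "arabic alef": "\u0627",
    "hangul syllable": "\ud55c",
    "digit one": "1",
}

# Map each sample to its bidirectional class.
classes = {name: unicodedata.bidirectional(ch) for name, ch in samples.items()}
for name, cls in classes.items():
    print(f"{name:16} {cls}")
```

Here 'ON' ("other neutral") and 'WS' (whitespace) are neutral classes,
'L' is left-to-right, 'R' and 'AL' are right-to-left, and 'EN'
("European number") is one of the weak classes used for digits.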

 > So what is in the file?
 >
 >      &ABC
 >
 > or
 >
 >      &CBA
 >
 > I do guess it is the first (&ABC), right? It is coded into unicode
 > that the A the B and the C need to be read the "other way around"?

Yes, and yes.  There is a Unicode Character Database that assigns many
properties to each character (for example it might be a spacing
character, or a separating control character, or a letter, etc).  One
of the properties is the direction.  Letter-like characters (including
Korean Hangul syllables and Asian ideographs) are generally marked
L2R, while Hebrew, Arabic, and some other scripts' characters are
marked R2L.  "Arabic" digits (0-9) are only weakly directional, but
within a number they are displayed L2R.  Punctuation and spacing
characters are neutral.  So for the basic algorithm no control
characters are used.  If you type "abc ABC def DEF" what you get is

    |abc CBA def FED                 |

and if you type "ABC abc DEF def" you get

    |                 def FED abc CBA|

where the vertical bars "|" mark the edges of the window.  However,
the "first directional character establishes the initial direction"
rule can give perverse results.  For example, "cat IS HOW YOU SAY CAT
IN ENGLISH" would naturally come out

    |cat HSILGNE NI TAC YAS UOY WOH SI|

but I would assume a Hebrew speaker wants to see "cat" at the
beginning of the sentence, ie, the right hand side:

    |HSILGNE NI TAC YAS UOY WOH SI cat|

There are a number of ways this can be accomplished.  For example, a
control character can explicitly set the initial direction.  If the
initial character is the control character RLE, that just sets the
initial direction to R2L, so the display starts at the right hand
edge.  But then the "c" changes direction to L2R (but not the place!),
and you see

    |                                c|
    |                               ca|
    |                              cat|
    |                             cat |
    |                            I cat|
    |                           SI cat|

as you type "cat IS HOW YOU SAY CAT IN ENGLISH".
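
The control characters are ordinary code points that sit in the file
next to the text.  A minimal Python sketch, using the standard
unicodedata module to show their names and classes; the uppercase
stand-in sentence is the same convention as above, and how a given
toolkit aligns a line that starts with RLE is not something this
sketch checks.

```python
import unicodedata

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING (closes an embedding)
RLM = "\u200f"  # RIGHT-TO-LEFT MARK (invisible, behaves like an R2L letter)

# Name and bidirectional class of each control character.
info = {
    ch: (unicodedata.name(ch), unicodedata.bidirectional(ch))
    for ch in (RLE, PDF, RLM)
}
for ch, (name, cls) in info.items():
    print(f"U+{ord(ch):04X} {name} [{cls}]")

# Stand-in sentence: uppercase plays the role of Hebrew letters.
sentence = "cat IS HOW YOU SAY CAT IN ENGLISH"

# The stored text is still in typed (logical) order; the controls
# are simply two extra characters around it.
stored = RLE + sentence + PDF
```

Note that RLM has class 'R', so a "first directional character" rule
sees it as an R2L character even though nothing is drawn for it.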

As I am not a Hebrew or Arabic speaker myself, you should take all
this with a grain of salt, but there are three simple rules that give
you the basic idea:

1.  Typed input and data in the file are in the order you would speak
    them ("logical order").
2.  The first strongly directional character (normally a letter)
    determines the initial direction and whether you start at the left
    edge or the right edge.
3.  Divide the text into runs of characters which all have the same
    direction or are neutral, and "flip" the ones that have the
    opposite direction.
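
The three rules can be sketched in Python as follows.  This is a toy
model, not the real Unicode algorithm: it knows only three classes (L,
R, neutral), ignores digits, brackets, and control characters, and
uses the convention of this message that uppercase ASCII stands for
R2L letters.  The names toy_class and visual_order are made up for
this sketch.

```python
def toy_class(ch):
    """Hypothetical classifier: uppercase = R2L, lowercase = L2R."""
    if ch.isascii() and ch.isupper():
        return "R"
    if ch.isascii() and ch.islower():
        return "L"
    return "N"  # everything else is treated as neutral


def visual_order(text, base=None, classify=toy_class):
    """Return text (logical order) rearranged into display order."""
    classes = [classify(ch) for ch in text]
    n = len(classes)

    # Rule 2: the first strong character sets the initial direction,
    # unless the caller forces one.
    if base is None:
        base = next((c for c in classes if c in "LR"), "L")

    # Neutrals take the direction of their neighbours when both
    # sides agree, otherwise the initial direction.
    resolved = list(classes)
    i = 0
    while i < n:
        if classes[i] != "N":
            i += 1
            continue
        j = i
        while j < n and classes[j] == "N":
            j += 1
        before = classes[i - 1] if i > 0 else base
        after = classes[j] if j < n else base
        resolved[i:j] = [before if before == after else base] * (j - i)
        i = j

    # Nesting levels: even = L2R, odd = R2L.
    if base == "L":
        levels = [0 if c == "L" else 1 for c in resolved]
    else:
        levels = [1 if c == "R" else 2 for c in resolved]

    # Rule 3: flip runs, deepest level first.
    chars = list(text)
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < n and levels[j] >= level:
                j += 1
            chars[i:j] = chars[i:j][::-1]
            i = j
    return "".join(chars)


print(visual_order("abc ABC def DEF"))  # abc CBA def FED
print(visual_order("ABC abc DEF def"))  # def FED abc CBA
print(visual_order("&ABC"))             # CBA&
print(visual_order("cat IS HOW YOU SAY CAT IN ENGLISH", base="R"))
```

The last call forces the initial direction to R2L, which puts "cat" at
the right hand end.  The sketch returns only the character order; where
the line starts (left or right edge) is a separate alignment decision.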

After that there are unobvious issues like "what about line breaks?"
And of course there are the dozen or so control characters that allow
explicit control over the algorithm or even individual characters.

