Bug#910548: base-files - please consider adding /usr/share/common-licenses/Unicode-Data

To: 910548@bugs.debian.org
Cc: Josh Triplett <josh@joshtriplett.org>
Subject: Bug#910548: base-files - please consider adding /usr/share/common-licenses/Unicode-Data
From: Paul Hardy <unifoundry@gmail.com>
Date: Thu, 18 Oct 2018 22:54:31 -0700
Message-id: <CAJqvfD-Es4o55o-5jqrxQJtcJxoMHGMQL1BfqE6GWXFVsawmmA@mail.gmail.com>
Reply-to: Paul Hardy <unifoundry@gmail.com>, 910548@bugs.debian.org
References: <CAJqvfD85q+LwBGTMk=46DkLDtF=HkU_nY2viNm2DtNSKyBmWiw@mail.gmail.com>

Josh,

I understand your intention, but it's not that straightforward.  The
Debian packages I looked through embedded various pieces of property
data, taken from various Unicode Consortium files, in pre-built arrays
that also contained other data.  I did not look through every package
that uses Unicode data, by any means.

In my case, I used Unicode code point descriptions in the comment
fields of lex patterns (flex on Debian) in my beta2uni program (part
of my unibetacode package), which converts Beta Code to Unicode.  Here
are a few such lines of code:

\*\/[Aa] print_pattern (yytext, 0x0386);  /* GREEK CAPITAL LETTER ALPHA WITH TONOS    */
\*\/[Ee] print_pattern (yytext, 0x0388);  /* GREEK CAPITAL LETTER EPSILON WITH TONOS  */
\*\/[Hh] print_pattern (yytext, 0x0389);  /* GREEK CAPITAL LETTER ETA WITH TONOS      */
\*\/[Ii] print_pattern (yytext, 0x038A);  /* GREEK CAPITAL LETTER IOTA WITH TONOS     */
\*\/[Oo] print_pattern (yytext, 0x038C);  /* GREEK CAPITAL LETTER OMICRON WITH TONOS  */
\*\/[Uu] print_pattern (yytext, 0x038E);  /* GREEK CAPITAL LETTER UPSILON WITH TONOS  */
\*\/[Ww] print_pattern (yytext, 0x038F);  /* GREEK CAPITAL LETTER OMEGA WITH TONOS    */
etc.

I used the utf8gen program (another package that I wrote and then
debianized) to create those lines of code.  utf8gen did the monotonous
work of printing everything to the right of the pattern on each line,
using data that I had pre-extracted from a Unicode data file; I then
typed in the regular expressions myself by hand.
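
To illustrate the kind of generation step described above, here is a
minimal sketch in C.  It is not utf8gen's actual code or interface;
the function name format_rule_tail and the constant NAME_WIDTH are
hypothetical, and the width of 41 is only chosen so the output lines
up like the excerpt quoted earlier.  It formats the action and comment
of one flex rule from a code point and its Unicode name, leaving the
pattern itself to be typed by hand:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed column width for the name inside the comment, picked so the
 * closing comment markers align as in the quoted excerpt. */
#define NAME_WIDTH 41

/*
 * Hypothetical helper (not part of utf8gen): write the part of a flex
 * rule that follows the pattern, i.e. the print_pattern() action and a
 * comment holding the Unicode character name.
 * Returns the value of snprintf(): the length the full text needs.
 */
static int format_rule_tail(char *buf, size_t size,
                            unsigned codepoint, const char *name)
{
    assert(buf != NULL && name != NULL);

    /* %-*s left-justifies the name and pads it to NAME_WIDTH columns. */
    return snprintf(buf, size,
                    "print_pattern (yytext, 0x%04X);  /* %-*s*/",
                    codepoint, NAME_WIDTH, name);
}
```

Calling it with 0x0386 and "GREEK CAPITAL LETTER ALPHA WITH TONOS"
yields the text to the right of the first pattern shown earlier; the
regular expression on the left still has to be written by a person who
has that name in view.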

I had to have the Unicode names in front of me to type the correct
regular expression for each code point.  Having the names on the same
line will also help me or anyone else debug the program in the future.

Were I to attempt to pull such comment strings from another package on
the fly, I would have to write a program that knew which lines in my
source code needed those comment strings, fetched them from said
external package, and created a new source code file for lex/flex
before building the final program.  Apart from the most obvious
immediate inconveniences of doing that, two others come to mind:

1) I could not then produce the source file in final form without
running on a distro such as Debian that implemented a packaging
scheme, or providing complicated build instructions for an end user
(most likely a student of ancient Greek who would not have deep
knowledge of building software packages).  As implemented, my
unibetacode package builds and installs on many distros just the way
it is, including on non-GNU/Linux systems thanks to the modern miracle
of GNU Autotools.

2) I would have to perform such a partial build just to read the
comments that I intended for debugging (and I would have had to resort
to an external table while typing in the generating regular
expressions rather than having them conveniently on the same line of
code).
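
For a sense of what the "fetch" half of such a build-time program
would involve, here is a rough sketch in C.  It assumes the external
package ships records in the semicolon-separated layout of the
Unicode Consortium's UnicodeData.txt (hexadecimal code point first,
character name second); which file a real package would provide is an
assumption, and the function name parse_name_record is hypothetical:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: extract the code point and character name from
 * one record laid out like a UnicodeData.txt line, for example
 *   0386;GREEK CAPITAL LETTER ALPHA WITH TONOS;Lu;...
 * Returns 0 on success, -1 if the record is malformed or the name
 * does not fit in the caller's buffer.
 */
static int parse_name_record(const char *line, unsigned long *codepoint,
                             char *name, size_t name_size)
{
    char *end;
    const char *name_start, *name_end;
    size_t len;

    assert(line != NULL && codepoint != NULL && name != NULL);

    /* Field 0: hexadecimal code point, terminated by ';'. */
    *codepoint = strtoul(line, &end, 16);
    if (end == line || *end != ';')
        return -1;

    /* Field 1: character name, running up to the next ';'. */
    name_start = end + 1;
    name_end = strchr(name_start, ';');
    if (name_end == NULL)
        return -1;

    len = (size_t)(name_end - name_start);
    if (len == 0 || len >= name_size)
        return -1;

    memcpy(name, name_start, len);
    name[len] = '\0';
    return 0;
}
```

Even with this piece in hand, the build would still need to match
each name to the right source line and regenerate the lex/flex file,
which is the extra machinery that both points above object to.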

There would also be the impracticality of telling such groups as the
Linux kernel developers and other upstream teams that they must switch
to using the Unicode package that Debian provides for their future
builds.


OTOH, packaging the Unicode data files could be useful for other,
unrelated purposes.  Of course, such a package would be one more
instance of the need for the Unicode Consortium's license and
(lengthy) copyright information in yet one more package's
debian/copyright file. :-)

Yet that still doesn't answer the question of whether Debian would
find such a common file of Unicode license and copyright terms useful,
but the text is there if Debian decides to add it.  If not, at least I
took the time to make it available.

Thanks,


Paul Hardy
