
Re: DEP-5 and files with white spaces

On Fri, Feb 10, 2012 at 10:05:55AM -0800, Russ Allbery wrote:
> Jakub Wilk <jwilk@debian.org> writes:
> > * Russ Allbery <rra@debian.org>, 2012-02-09, 23:05:
> >> Note that another case that I don't think has been discussed, but which
> >> is probably more common than embedded quote marks, is a filename that's
> >> invalid UTF-8 (straight ISO 8859-1, for example). That's also not
> >> representable in our typical debian/copyright file,
> > The specification currently reads: “Only the wildcards * and ? apply; the
> > former matches any number of characters (including none), the latter a
> > single character.”
> > But characters of which encoding? If UTF-8, then for some filenames, no
> > wildcard exist that would match them.
> Indeed.  That's arguably a worse hole in the specification than whitespace
> handling, since it may not be possible to use wildcards to work around it.
> I'm not sure if we need to say something about that explicitly, or if it's
> rare enough that we don't have to care.

Dear all,

How about documenting these facts in the DEP and going ahead with the current
syntax?

+  <section id="limitations">
+    <title>Limitations</title>
+    <para>
+      The pattern syntax cannot distinguish files whose names differ only in
+      whitespace, nor files that have the same name but are in paths that
+      differ only in whitespace.
+    </para>
+    <para>
+      It is not possible to represent a file name or a path using an encoding
+      that is not compatible with Unicode.
+    </para>
+  </section>

Regarding whitespace, we have been saying for a year that we will not make
normative changes unless necessary, and the problems discussed are all
theoretical.  I think that extensions are welcome in future versions of the
format, but breaking existing files with a normative change is no less likely
than encountering a package where two files have different licenses and names
that differ only in whitespace, and where the upstream author would either
refuse or not be available to correct that.
Regarding the encoding, this problem is not limited to the machine-readable
format.  If the Debian copyright file is in an encoding A, and a file has a
name, or is in a directory that has a name, in an encoding B that cannot be
represented in A, and there is no way to work around this with wildcards, then
the file or directory cannot be described by its name, regardless of the syntax
followed by the copyright file.
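
As an illustration of the encoding case, the following Python sketch checks
whether the raw bytes of a file name can be written in a UTF-8 copyright
file.  The file name and the helper function are hypothetical examples, and
the sketch assumes that the copyright file itself is UTF-8.

```python
def representable_in_utf8(name):
    """Return True if the raw file name bytes are valid UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same visible name, stored on disk with two different encodings.
latin1_name = "café.txt".encode("iso-8859-1")   # contains the byte 0xE9
utf8_name = "café.txt".encode("utf-8")          # contains 0xC3 0xA9

latin1_ok = representable_in_utf8(latin1_name)
utf8_ok = representable_in_utf8(utf8_name)
```

Here `utf8_ok` is true and `latin1_ok` is false: the ISO 8859-1 name has no
exact spelling in a UTF-8 file, whatever syntax that file uses.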

It is good to care about these cases, and I propose to do so by documenting
them in version 1.0 and keeping the bugs open.  They may be solved in a future
version if there is a solution that satisfies both the developers who write the
files and the developers who write the parsers.

Have a nice weekend,

Charles Plessy
Tsurumi, Kanagawa, Japan
