Bug#401452: Info received (Bug#401452: Standardize syntax of the name in the Maintainer control field)
Hi Ian
Huge thanks for tackling this one... it's a seemingly-simple but
actually complicated field to describe as you have noted.
I've had a bit of a wander through the list of entries that are
currently in Maintainer and Uploaders to look at what the stated
approach would rule in/out. That raises some cases to consider -
I ask questions about a few examples below from the "are we sure
this is what we want to do" perspective rather than "we should not do this".
I'll use made-up examples in the discussion below rather than extracting
real people's names from Sources. I don't want to centre the discussion
on any individuals, and I am also conscious that this discussion needs
to not turn into something that has overtones of "you're spelling your
name wrong".
It makes for a very long reply - sorry. It's not because there are lots
of problems, just (corner) cases to understand.
cheers
Stuart
* The field is a comma-separated list of `name <email>` where `name`
can be quoted `"name"` (and may then contain Unicode), or be
unquoted but then has a restricted character set which excludes
Unicode and excludes `,`.
I'm pleased that we finally have a way to include , in the name part -
that fixes one of the current problems nicely. There is only one current
example of a comma in Maintainer/Uploaders and it is quoted in this way
already.
We have a few of the following constructs in the name that I *think* are
OK by these rules without quoting, but to confirm:
J Smith (js) [parens]
J (js) Smith [parens]
J O'Dear [single quote]
(I have a recollection of parens being special in email addresses;
single quotes often are special and there are lots of them in existing
entries — just double checking!)
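Something like the following is what I used to poke at those examples,
assuming (and this is my guess at the intent, not the proposed wording)
that the unquoted name part is roughly RFC 5322 atext plus space and dot;
under that reading the single quote is fine unquoted but the parentheses
are not, since they delimit comments in RFC 5322:

    import re

    # my guess at the unquoted-name character set: RFC 5322 atext plus
    # space and dot; '-' is kept last in the class so it is a literal
    unquoted_name = re.compile(r"^[A-Za-z0-9!#$%&'*+/=?^_`{|}~. -]+$")

    for name in ["J Smith (js)", "J (js) Smith", "J O'Dear"]:
        print(name, "->", bool(unquoted_name.match(name)))
    # J Smith (js) -> False
    # J (js) Smith -> False
    # J O'Dear -> True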
I would like to suggest that we find a way to permit non-ASCII unicode
letter characters in the name part without requiring quotes. I
understand that's an extension to RFC5322 but ...
- any use of these data will end up going via some sort of MUA, which
can fix up the representation before it becomes an issue
- other fields in d/control and Sources are allowed to contain non-ASCII
unicode letters without any restrictions or encoding.
- it would be a compatible upgrade to RFC5322 in that anyone who did
quote some non-ASCII characters in their name will not have done the
wrong thing
- it is appropriate to find ways of being less Anglocentric in our
format specifications and I have a feeling that it is possible to do
safely here
- there are many hundreds of existing entries in Sources where the names
contain non-ASCII letter characters from lots of different languages
- I doubt there is an appetite in Debian to make many thousands of
existing packages insta-buggy and then take the next decade to upload
fixes, and, until they are all fixed, have no settled format that
parsers can rely on.
Some examples
Julián Niño
J Lee (你好世界)
你好世界
(and we could, of course, imagine lots of other languages and scripts
being used here and there are several others in Sources)
* There is no `\`-escaping: names simply cannot contain `\` or `"`.
From the perspective of someone writing a parser I can see why this is
attractive... we do have a couple of counter-examples in the archive at
present
John "Fred" Smith
It's a big call to tell those people that they don't know how to spell
their name. Can we avoid imposing this restriction without causing too
much pain? (Undoubtedly _some_ don't care between " and ', but is that
the design principle we should work to?)
* The RFC5322 `domain` must be in lowercase.
This is an interesting requirement - is there any need for it? There are
counter-examples currently in the archive and uppercase domain names
work just fine in real mail systems.
The examples above probably explore the space enough, but the attached
script spits out 360 'interesting' Maintainer/Uploader entries to look
at if you are curious to see some real cases and check for other
variations that I've missed. The regex is deliberately stricter than
these rules, so it pulls out entries that are 'interesting' for 'are we
sure' discussions rather than only 'violations of the above rules'.
Note that the script looks at
unique entries in Sources, not people (plenty of repeated names with
different email addresses); it offers a count of unique (name, addr)
pairs and a count of affected source packages in main.
Some variations on the regex in the script let us consider possible
adjustments to these rules.
The rules as written above yield about 300 buggy entries across 5500 packages.
Of these:
- approx 290 are unicode letter characters in names - i.e. if we can
allow unicode letter characters in the name part without needing
quoting, we make huge strides in compatibility. (My test was \w, which in
Python 3 matches Unicode letters in the “Lm”, “Lt”, “Lu”, “Ll” and “Lo”
categories plus some digit/numeric forms that we wouldn't actually want
to permit, but those aren't in use in the data set so aren't an issue
here; a letters-only check is sketched below, after this list.)
- approx 10 entries are from domain names being in uppercase
- there's a handful of remaining items that might actually be OK but
that sit at the limits of my current understanding of RFC5322, such as
whether @ is allowed in the name part.
(and then there are 6 or so buggy entries already in Maintainers and
Uploaders, either missing commas or with stray commas)
I think these data make a strong case for permitting unicode letter
characters in the name part and uppercase domain names.
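The sort of letters-only check I have in mind looks roughly like this
(just a sketch; the ASCII part of the set is my guess based on what is
currently in the archive, not a proposed rule). It accepts letters from
any script while still rejecting the stray digit forms that \w would let
through:

    import unicodedata

    # characters already unremarkable in ASCII names, plus any character
    # whose Unicode general category is a letter (Lu, Ll, Lt, Lm, Lo)
    ALLOWED_ASCII = set(
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
        "0123456789/. ~_()'-"
    )

    def name_chars_ok(name):
        return all(
            c in ALLOWED_ASCII or unicodedata.category(c).startswith("L")
            for c in name
        )

    for name in ["Julián Niño", "J Lee (你好世界)", "你好世界", "J Smith ٣"]:
        print(name, "->", name_chars_ok(name))
    # the first three pass; the Arabic-Indic digit in the last one is
    # category Nd, not a letter, so it is rejected (unlike with \w)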
### Processing strategy
A system which doesn't need to understand the field can safely display
it as-is in its entirety.
A system which needs to understand an entity and email field could
proceed as follows:
Thanks for listing this out - it's useful to consider this at the same
time. I had a go at coding it (to eventually land in python-debian)
while working through it, but couldn't quite follow a couple of steps below.
* Unfold as if this were a "folded" field, collapsing each whitespace
sequence into a single space, so we have a single line.
* Match `"` quotes to identify quoted text. These quotes always
appear in pairs.
I'm not sure what 'Match' means in practical terms in the algorithm -
would you be storing the list of (start-quote, stop-quote) positions and
then, at the later splitting step, not split at character positions of
"," that are within those (start-quote, stop-quote) positions?
(In my playing, I ended up walking the length of the string, toggling
whether the current status was inside or outside a quoted section, and
only acting on commas that are found while outside; Python's 'yield'
keyword is convenient for that.)
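For concreteness, the core of what I ended up with is roughly the
following (a sketch only; the function name and behaviour are mine, not
anything proposed, and it assumes the no-\-escaping rule above):

    def split_on_unquoted_commas(field):
        # walk the unfolded field, toggling in/out of "..." sections,
        # and yield a fragment at each comma seen outside the quotes
        in_quotes = False
        current = []
        for c in field:
            if c == '"':
                in_quotes = not in_quotes
                current.append(c)
            elif c == "," and not in_quotes:
                yield "".join(current)
                current = []
            else:
                current.append(c)
        yield "".join(current)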
Check that quoted text contains no `\`.
Can you please unpack why this is needed? They are defined to not exist
;) Is the purpose to parse or to validate? There are lots of other
things that one should check if the purpose is to validate.
* Split the whole field on unquoted `,`.
I don't see a nice way of doing that at that point as I'm not seeing the
bigger picture for the algorithm. Perhaps this is a good DebCamp discussion?
It would be worth noting that we have many examples of trailing commas
in Uploaders and that they should be specifically allowed (partly so that
implementations don't assume that the last fragment after splitting is
non-empty).
(In looking at examples in the archive I also found 3 cases where
commas were missing in Uploaders; one fixed on salsa, one bug filed)
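With the splitter sketched above, tolerating the trailing comma just
means dropping empty fragments after splitting, e.g. (again only a
sketch, reusing the hypothetical function from earlier):

    field = 'A Developer <a@example.org>, "Smith, J" <js@example.org>,'
    entries = [e.strip() for e in split_on_unquoted_commas(field) if e.strip()]
    # -> ['A Developer <a@example.org>', '"Smith, J" <js@example.org>']
    # the empty fragment from the trailing comma is silently discarded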
If the field is a Maintainer field and this would result in any
fragments that do not end in `>`, skip this step. In the future,
this rule will be abolished, and only be relevant for old data.
The only maintainer fields containing "," are ones with a single entry
that ends with "," — they are already buggy and the parser would drop
the empty section anyway, so perhaps this wart can be omitted?
* Strip whitespace from the ends.
* Now each entry will end in `<....>`. That is the email address
part.
It has a restricted syntax: the allowable character set is ascii
alphanumerics plus any of the following punctuation:
! # $ % & ' * + - / = ? ^ _ ` { | } ~
also
@ .
;)
The rules above restrict that further to lowercase ASCII; does one feel
the need to actually check that it matches those things? If actually
validating, there's a lot more to do than a char check; if not
validating, then it's just the bit between < and >
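i.e. in my attempt the address extraction is nothing more than this
(sketch; error handling and any further validation left open, as above):

    def email_part(entry):
        # take the trailing <...> as the address part
        if not entry.endswith(">"):
            raise ValueError("entry does not end in <...>: %r" % entry)
        _name, sep, addr = entry[:-1].rpartition("<")
        if not sep:
            raise ValueError("no '<' found in entry: %r" % entry)
        return addr

    print(email_part('"Smith, J" <js@example.org>'))   # js@example.org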
* The remainder of the entry (with white space normalised to single
spaces) is the name part. Strip any `"`.
whitespace normalisation was already done in the first step so can be
avoided here; whitespace on the left end was dealt with 2 steps before.
Stripping the single whitespace from the right end would be needed
though. (Side question: is the whitespace between name part and < required?)
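So the name-part step in my attempt reduces to something like this
(sketch; it assumes unfolding already normalised internal whitespace and
that the entry has already been checked to contain a `<`):

    def name_part(entry):
        # everything before the final "<" is the name; drop the trailing
        # space and any '"' characters
        name = entry[: entry.rfind("<")]
        return name.rstrip().replace('"', "")

    print(name_part('"Smith, J" <js@example.org>'))   # Smith, J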
The name part may be used for human display and possibly ordering.
It should not be involved in equality comparisons, lookups, etc.
While true of course... we also do that in lots of places in Debian to
squash together the multiple emails that an individual has within
Sources. (e.g. in the UDD dashboard views)
--
Stuart Prescott http://www.nanonanonano.net/ stuart@nanonanonano.net
Debian Developer http://www.debian.org/ stuart@debian.org
GPG fingerprint 90E2 D2C1 AD14 6A1B 7EBB 891D BBC1 7EBB 1396 F2F7
#!/usr/bin/python3
import re
from debian.deb822 import Sources
sources = "/var/lib/apt/lists/deb.debian.org_debian_dists_sid_main_source_Sources"
everyone = []
# this regex doesn't quite match the rules being discussed, it's a
# quick-and-dirty one to highlight some interesting cases.
valid_maint = re.compile(r"^[a-zA-Z0-9/. ~_-]+ <[a-zA-Z0-9.+_-]+@[a-z0-9.-]+>$")
# permitting also ( ) ' and "
valid_maint = re.compile(r"^[a-zA-Z0-9/. ~_()'\"-]+ <[a-zA-Z0-9.+_-]+@[a-z0-9.-]+>$")
# widening to also permit unicode letter characters anywhere in the name
# in Python 3 \w is automatically unicode aware
# https://docs.python.org/3/library/re.html#re.UNICODE
# https://docs.python.org/3/library/stdtypes.html#str.isalnum
# https://docs.python.org/3/library/stdtypes.html#str.isalpha
# \w means:
# c.isalpha() or c.isdecimal() or c.isdigit() or c.isnumeric()
# where isalpha means
# Alphabetic characters are those characters defined in the Unicode character
# database as “Letter”, i.e., those with general category property being one
# of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the
# Alphabetic property defined in the section 4.10 ‘Letters, Alphabetic, and
# Ideographic’ of the Unicode Standard.
#
# The inclusion of various digit forms from isdigit and isnumeric in this is
# probably overly broad for the name part but does no harm here.
# https://docs.python.org/3/library/stdtypes.html#str.isdigit
# https://docs.python.org/3/library/stdtypes.html#str.isnumeric
valid_maint = re.compile(r"^[a-zA-Z0-9/. ~_()'\"\w-]+ <[a-zA-Z0-9.+_-]+@[a-z0-9.-]+>$")
# widening to also permit uppercase domain names
valid_maint = re.compile(r"^[a-zA-Z0-9/. ~_()'\"\w-]+ <[a-zA-Z0-9.+_-]+@[a-zA-Z0-9.-]+>$")
pkgcount = 0
with open(sources) as f:
    # iter_paragraphs wants the file contents (or a file object), not a path
    for src in Sources.iter_paragraphs(f):
        counted = False
        if not valid_maint.match(src["Maintainer"]):
            everyone.append(src["Maintainer"])
            pkgcount += 1
            counted = True
        for upl in src.get("Uploaders", "").split(","):
            # note: split(",") does not handle the one current example in Sources
            # with a name that contains , and is actually quoted in double quotes
            upl = upl.strip()
            if upl and not valid_maint.match(upl):
                everyone.append(upl)
                if not counted:
                    # count each affected source package only once
                    pkgcount += 1
                    counted = True
uniq = sorted(set(everyone))
print(*uniq, sep="\n")
print("Total entries", len(uniq))
print("Total packages", pkgcount)