
Bug#401452: Info received (Bug#401452: Standardize syntax of the name in the Maintainer control field)



Hi Ian

Huge thanks for tackling this one... it's a seemingly-simple but actually complicated field to describe as you have noted.

I've had a bit of a wander through the list of entries that are currently in Maintainer and Uploaders to look at what the stated approach would rule in/out. That raises some examples to consider - I ask questions about a few of them below from the "are we sure this is what we want to do" perspective rather than "we should not do this".

I'll use made-up examples in the discussion below rather than extracting real people's names from Sources. I don't want to centre the discussion on any individuals, and I am also conscious that this discussion needs to not turn into something that has overtones of "you're spelling your name wrong".

It makes for a very long reply - sorry. It's not because there are lots of problems, just (corner) cases to understand.

cheers
Stuart




  * The field is a comma-separated list of `name <email>` where `name`
    can be quoted `"name"` (and may then contain Unicode), or be
    unquoted but then has a restricted character set which excludes
    Unicode and excludes `,`.

I'm pleased that we finally have a way to include `,` in the name part - that fixes one of the current problems nicely. There is only one current example of a comma in Maintainer/Uploaders and it is already quoted in this way.

We have a few of the following constructs in the name that I *think* are OK by these rules without quoting, but to confirm:

	J Smith (js)			[parens]
	J (js) Smith			[parens]
	J O'Dear			[single quote]

(I have a recollection of parens being special in email addresses; single quotes often are special and there are lots of them in existing entries — just double checking!)

I would like to suggest that we find a way to permit non-ASCII unicode letter characters in the name part without requiring quotes. I understand that's an extension to RFC5322 but ...

- any use of these data will pass through some sort of MUA, which can
  fix the representation before it becomes an issue
- other fields in d/control and Sources are allowed to contain non-ASCII
  unicode letters without any restrictions or encoding.
- it would be a compatible upgrade to RFC5322 in that anyone who did
  quote some non-ASCII characters in their name will not have done the
  wrong thing
- it is appropriate to find ways of being less Anglocentric in our
  format specifications, and I have a feeling that it is possible to do
  so safely here
- there are many hundreds of existing entries in Sources where the names
  contain non-ASCII letter characters from lots of different languages
- I doubt there is an appetite in Debian to make many thousands of
  existing packages insta-buggy and then take the next decade to upload
  fixes, and until they are all fixed also have no set format that can
  be used by parsers.

Some examples

	Julián Niño
	J Lee (你好世界)
	你好世界

(and we could, of course, imagine lots of other languages and scripts being used here and there are several others in Sources)
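To show concretely what "unicode letter characters without quoting" could look like in a check (a sketch only - the exact character class here is my own invention, not a proposed spec), Python 3's `\w` is Unicode-aware by default and accepts all of the example names above:

```python
import re

# Hypothetical name pattern: \w (Unicode-aware in Python 3) plus a few
# punctuation characters already seen in existing Maintainer entries.
name_re = re.compile(r"^[\w. '()-]+$")

for name in ["Julián Niño", "J Lee (你好世界)", "你好世界"]:
    print(name, bool(name_re.match(name)))  # all three match
```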


  * There is no `\`-escaping: names simply cannot contain `\` or `"`.

From the perspective of someone writing a parser I can see why this is attractive... we do have a couple of counter-examples in the archive at present:

	John "Fred" Smith

It's a big call to tell those people that they don't know how to spell their name. Can we avoid imposing this restriction without causing too much pain? (Undoubtedly _some_ don't mind the difference between " and ', but is that the design principle we should work to?)


  * The RFC5322 `domain` must be in lowercase.

This is an interesting requirement - is there any need for it? There are counter-examples currently in the archive and uppercase domain names work just fine in real mail systems.



The examples above probably explore the space enough, but the attached script spits out 360 'interesting' Maintainer/Uploader entries to look at if you are curious to see some real cases and want to check for other variations that I've missed. The regex is stricter than the rules above so that it pulls out entries that are 'interesting' for 'are we sure' discussions, not just 'violations of the above rules'. Note that the script looks at unique entries in Sources, not people (there are plenty of repeated names with different email addresses); it reports a count of unique (name, addr) pairs and a count of affected source packages in main.


Some variations on the regex in the script let us consider some variations to these rules.

The rules as written above = about 300 buggy entries across 5500 packages.

Of these:

- approx 290 are unicode letter characters in names - i.e. if we can allow unicode letter characters in the name part without needing quoting, we make huge strides in compatibility. (my test was \w, which in Python 3 matches the Unicode letter categories “Lm”, “Lt”, “Lu”, “Ll” and “Lo”, plus some digit/numeric forms that we don't actually want to permit but which aren't in use in the data set so aren't an issue here)

- approx 10 entries are from domain names being in uppercase

- there's a handful of remaining items that might actually be OK that are the limits of my current understanding of RFC5322, such as allowing @ in the name part.

(and then there are 6 or so buggy entries already in Maintainers and Uploaders, either missing commas or with stray commas)


I think these data make a strong case for permitting unicode letter characters in the name part and uppercase domain names.




### Processing strategy

A system which doesn't need to understand the field can safely display
it as-is in its entirety.

A system which needs to understand an entity and email field could
proceed as follows:

Thanks for listing this out - it's useful to consider this at the same time. I had a go at coding it (to eventually land in python-debian) while working through it, but couldn't quite follow a couple of steps below.


  * Unfold as if this were a "folded" field, collapsing each whitespace
    sequence into a single space, so we have a single line.

  * Match `"` quotes to identify quoted text.  These quotes always
    appear in pairs.

I'm not sure what 'Match' means in practical terms in the algorithm - would you store the list of (start-quote, stop-quote) positions and then, at the later splitting step, not split at positions of "," that fall within those (start-quote, stop-quote) ranges?

(In my playing, I ended up walking the length of the string, toggling whether the current status was inside or outside a quoted section, and only acting on commas that are found while outside; Python's 'yield' keyword is convenient for that.)
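For concreteness, the walk-and-toggle approach looks roughly like this (a sketch; `split_unquoted` is a made-up helper name, not anything currently in python-debian):

```python
def split_unquoted(field):
    """Yield the comma-separated entries of `field`, treating commas
    inside double-quoted sections as ordinary characters."""
    current = []
    in_quotes = False
    for c in field:
        if c == '"':
            # toggle: are we inside a quoted section?
            in_quotes = not in_quotes
            current.append(c)
        elif c == "," and not in_quotes:
            # an unquoted comma ends the current entry
            yield "".join(current)
            current = []
        else:
            current.append(c)
    yield "".join(current)

entries = list(split_unquoted('"Smith, J" <js@example.org>, A B <ab@example.org>'))
# two entries; the comma inside the quotes is preserved
```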

Check that quoted text contains no `\`.

Can you please unpack why this is needed? they are defined to not exist ;) Is the purpose to parse or to validate? There are lots of other things that one should check if the purpose is to validate.


  * Split the whole field on unquoted `,`.

I don't see a nice way of doing that at that point as I'm not seeing the bigger picture for the algorithm. Perhaps this is a good DebCamp discussion?


It would be worth noting that we have many examples of trailing commas in Uploaders and that should be specifically allowed (partly so that implementations don't assume that the last fragment after splitting is non-empty).
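A sketch of what "specifically allowed" might mean for a parser, assuming the simple split-on-comma approach (the field value here is made up):

```python
# A trailing comma produces an empty final fragment, which the parser
# should silently drop rather than treat as a (buggy) empty entry.
raw = "A B <ab@example.org>, C D <cd@example.org>,"
entries = [e.strip() for e in raw.split(",") if e.strip()]
print(len(entries))  # 2
```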

(In looking at examples in the archive I also found 3 cases where commas were missing in Uploaders; one fixed on salsa, one bug filed)


    If the field is a Maintainer field and this would result in any
    fragments that do not end in `>`, skip this step.  In the future,
    this rule will be abolished, and only be relevant for old data.


The only maintainer fields containing "," are ones with a single entry that ends with "," — they are already buggy and the parser would drop the empty section anyway, so perhaps this wart can be omitted?


  * Strip whitespace from the ends.

  * Now each entry will end in `<....>`.  That is the email address
    part.

    It has a restricted syntax: the allowable character set is ascii
    alphanumerics plus any of the following punctuation:
       ! # $ % & ' * + - / = ?  ^ _ ` { | } ~

also
	@ .
;)

The rules above restrict that further to lowercase ASCII; does one feel the need to actually check that it matches those things? If actually validating, there's a lot more to do than a character check; if not, then it's just the bit between < and >.


  * The remainder of the entry (with white space normalised to single
    spaces) is the name part.  Strip any `"`.


Whitespace normalisation was already done in the first step so can be avoided here; whitespace on the left end was dealt with two steps before. Stripping the single whitespace from the right end would still be needed though. (Side question: is the whitespace between the name part and < required?)
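Putting those last two steps together, a minimal extraction sketch (assuming entries have already been split and unfolded; the variable names are mine, and this makes no attempt at validation):

```python
entry = '"Smith, J" <js@example.org>'
lt = entry.rindex("<")                 # the address starts at the final <
addr = entry[lt + 1 : -1]              # the text between < and the trailing >
name = entry[:lt].rstrip().strip('"')  # drop the separating space and any quotes
print(name, "/", addr)
```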



    The name part may be used for human display and possibly ordering.
    It should not be involved in equality comparisons, lookups, etc.

While true of course... we also do that in lots of places in Debian to squash together the multiple emails that an individual has within Sources. (e.g. in the UDD dashboard views)







--
Stuart Prescott   http://www.nanonanonano.net/ stuart@nanonanonano.net
Debian Developer  http://www.debian.org/       stuart@debian.org
GPG fingerprint   90E2 D2C1 AD14 6A1B 7EBB 891D BBC1 7EBB 1396 F2F7
#!/usr/bin/python3

import re
from debian.deb822 import Sources

sources = "/var/lib/apt/lists/deb.debian.org_debian_dists_sid_main_source_Sources"

everyone = []

# this regex doesn't quite match the rules being discussed, it's a
# quick-and-dirty one to highlight some interesting cases.
valid_maint = re.compile(r"^[a-zA-Z0-9/. ~_-]+ <[a-zA-Z0-9.+_-]+@[a-z0-9.-]+>$")

# permitting also ( ) ' and "
valid_maint = re.compile(r"^[a-zA-Z0-9/. ~_()'\"-]+ <[a-zA-Z0-9.+_-]+@[a-z0-9.-]+>$")

# widening to also permit unicode letter characters anywhere in the name
# in Python 3 \w is automatically unicode aware
#   https://docs.python.org/3/library/re.html#re.UNICODE
#   https://docs.python.org/3/library/stdtypes.html#str.isalnum
#   https://docs.python.org/3/library/stdtypes.html#str.isalpha
# \w means:
#  c.isalpha() or c.isdecimal() or c.isdigit() or c.isnumeric()
# where isalpha means
#  Alphabetic characters are those characters defined in the Unicode character
#  database as “Letter”, i.e., those with general category property being one
#  of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the
#  Alphabetic property defined in the section 4.10 ‘Letters, Alphabetic, and
#  Ideographic’ of the Unicode Standard.
#
# The inclusion of various digit forms from isdigit and isnumeric in this is
# probably overly broad for the name part but does no harm here.
#   https://docs.python.org/3/library/stdtypes.html#str.isdigit
#   https://docs.python.org/3/library/stdtypes.html#str.isnumeric
valid_maint = re.compile(r"^[a-zA-Z0-9/. ~_()'\"\w-]+ <[a-zA-Z0-9.+_-]+@[a-z0-9.-]+>$")

# widening to also permit uppercase domain names
valid_maint = re.compile(r"^[a-zA-Z0-9/. ~_()'\"\w-]+ <[a-zA-Z0-9.+_-]+@[a-zA-Z0-9.-]+>$")


pkgcount = 0

for src in Sources.iter_paragraphs(sources):
   counted = False
   if not valid_maint.match(src["Maintainer"]):
      everyone.append(src["Maintainer"])
      pkgcount += 1
      counted = True

   for upl in src.get("Uploaders", "").split(","):
      # note: split(",") does not handle the one current example in Sources
      # with a name that contains , and is actually quoted in double quotes
      upl = upl.strip()
      if upl and not valid_maint.match(upl):
         everyone.append(upl)
         if not counted:
            pkgcount += 1
            counted = True  # don't count this package again for further buggy uploaders

uniq = sorted(set(everyone))
print(*uniq, sep="\n")

print("Total entries", len(uniq))
print("Total packages", pkgcount)
