Bug#401452: Info received (Bug#401452: Standardize syntax of the name in the Maintainer control field)
Hi Ian
Huge thanks for tackling this one... it's a seemingly-simple but
actually complicated field to describe as you have noted.
I've had a bit of a wander through the list of entries that are
currently in Maintainer and Uploaders to look at what the stated
approach would rule in/out. That raises some cases to consider -
I ask questions about a few examples below from the "are we sure
this is what we want to do" perspective rather than "we should not do this".
I'll use made-up examples in the discussion below rather than extracting
real people's names from Sources. I don't want to centre the discussion
on any individuals, and I am also conscious that this discussion needs
to not turn into something that has overtones of "you're spelling your
name wrong".
It makes for a very long reply - sorry. It's not because there are lots
of problems, just (corner) cases to understand.
cheers
Stuart
* The field is a comma-separated list of `name <email>` where `name`
can be quoted `"name"` (and may then contain Unicode), or be
unquoted but then has a restricted character set which excludes
Unicode and excludes `,`.
I'm pleased that we finally have a way to include , in the name part -
that fixes one of the current problems nicely. There is only one current
example of a comma in Maintainer/Uploaders and it is quoted in this way
already.
We have a few of the following constructs in the name that I *think* are
OK by these rules without quoting, but to confirm:
J Smith (js) [parens]
J (js) Smith [parens]
J O'Dear [single quote]
(I have a recollection of parens being special in email addresses;
single quotes often are special and there are lots of them in existing
entries — just double checking!)
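Something like the following is what I used to poke at those examples,
assuming (and this is my guess at the intent, not the proposed wording)
that the unquoted name part is roughly RFC 5322 atext plus space and dot;
under that reading the single quote is fine unquoted but the parentheses
are not, since they delimit comments in RFC 5322:

    import re

    # my guess at the unquoted-name character set: RFC 5322 atext plus
    # space and dot; '-' is kept last in the class so it is a literal
    unquoted_name = re.compile(r"^[A-Za-z0-9!#$%&'*+/=?^_`{|}~. -]+$")

    for name in ["J Smith (js)", "J (js) Smith", "J O'Dear"]:
        print(name, "->", bool(unquoted_name.match(name)))
    # J Smith (js) -> False
    # J (js) Smith -> False
    # J O'Dear -> True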
I would like to suggest that we find a way to permit non-ASCII unicode
letter characters in the name part without requiring quotes. I
understand that's an extension to RFC5322 but ...
- any use of these data will end up going via some sort of MUA, which
can fix up the representation before it becomes an issue
- other fields in d/control and Sources are allowed to contain non-ASCII
unicode letters without any restrictions or encoding.
- it would be a compatible upgrade to RFC5322 in that anyone who did
quote some non-ASCII characters in their name will not have done the
wrong thing
- it is appropriate to find ways of being less Anglocentric in our
format specifications and I have a feeling that it is possible to do
safely here
- there are many hundreds of existing entries in Sources where the names
contain non-ASCII letter characters from lots of different languages
- I doubt there is an appetite in Debian to make many thousands of
existing packages insta-buggy and then take the next decade to upload
fixes, and, until they are all fixed, have no settled format that
parsers can rely on.
Some examples
Julián Niño
J Lee (你好世界)
你好世界
(and we could, of course, imagine lots of other languages and scripts
being used here and there are several others in Sources)
* There is no `\`-escaping: names simply cannot contain `\` or `"`.
From the perspective of someone writing a parser I can see why this is
attractive... we do have a couple of counter-examples in the archive at
present
John "Fred" Smith
It's a big call to tell those people that they don't know how to spell
their name. Can we avoid imposing this restriction without causing too
much pain? (Undoubtedly _some_ don't care between " and ', but is that
the design principle we should work to?)
* The RFC5322 `domain` must be in lowercase.
This is an interesting requirement - is there any need for it? There are
counter-examples currently in the archive and uppercase domain names
work just fine in real mail systems.
The examples above probably explore the space enough, but the attached
script spits out 360 'interesting' Maintainer/Uploader entries to look
at if you are curious to see some real cases and check for other
variations that I've missed. The regex is deliberately stricter than
these rules, so it pulls out entries that are 'interesting' for 'are we
sure' discussions rather than only 'violations of the above rules'.
Note that the script looks at
unique entries in Sources, not people (plenty of repeated names with
different email addresses); it offers a count of unique (name, addr)
pairs and a count of affected source packages in main.
Some variations on the regex in the script let us consider possible
adjustments to these rules.
The rules as written above yield about 300 buggy entries across 5500 packages.
Of these:
- approx 290 are unicode letter characters in names - i.e. if we can
allow unicode letter characters in the name part without needing
quoting, we make huge strides in compatibility. (My test was \w, which in
Python 3 matches Unicode letters in the “Lm”, “Lt”, “Lu”, “Ll” and “Lo”
categories plus some digit/numeric forms that we wouldn't actually want
to permit, but those aren't in use in the data set so aren't an issue
here; a letters-only check is sketched below, after this list.)
- approx 10 entries are from domain names being in uppercase
- there's a handful of remaining items that might actually be OK but
that sit at the limits of my current understanding of RFC5322, such as
whether @ is allowed in the name part.
(and then there are 6 or so buggy entries already in Maintainers and
Uploaders, either missing commas or with stray commas)
I think these data make a strong case for permitting unicode letter
characters in the name part and uppercase domain names.
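The sort of letters-only check I have in mind looks roughly like this
(just a sketch; the ASCII part of the set is my guess based on what is
currently in the archive, not a proposed rule). It accepts letters from
any script while still rejecting the stray digit forms that \w would let
through:

    import unicodedata

    # characters already unremarkable in ASCII names, plus any character
    # whose Unicode general category is a letter (Lu, Ll, Lt, Lm, Lo)
    ALLOWED_ASCII = set(
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
        "0123456789/. ~_()'-"
    )

    def name_chars_ok(name):
        return all(
            c in ALLOWED_ASCII or unicodedata.category(c).startswith("L")
            for c in name
        )

    for name in ["Julián Niño", "J Lee (你好世界)", "你好世界", "J Smith ٣"]:
        print(name, "->", name_chars_ok(name))
    # the first three pass; the Arabic-Indic digit in the last one is
    # category Nd, not a letter, so it is rejected (unlike with \w)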
### Processing strategy
A system which doesn't need to understand the field can safely display
it as-is in its entirety.
A system which needs to understand an entity and email field could
proceed as follows:
Thanks for listing this out - it's useful to consider this at the same
time. I had a go at coding it (to eventually land in python-debian)
while working through it, but couldn't quite follow a couple of steps below.
* Unfold as if this were a "folded" field, collapsing each whitespace
sequence into a single space, so we have a single line.
* Match `"` quotes to identify quoted text. These quotes always
appear in pairs.
I'm not sure what 'Match' means in practical terms in the algorithm -
would you be storing the list of (start-quote, stop-quote) positions and
then, at the later splitting step, not split at character positions of
"," that are within those (start-quote, stop-quote) positions?
(In my playing, I ended up walking the length of the string, toggling
whether the current status was inside or outside a quoted section, and
only acting on commas that are found while outside; Python's 'yield'
keyword is convenient for that.)
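For concreteness, the core of what I ended up with is roughly the
following (a sketch only; the function name and behaviour are mine, not
anything proposed, and it assumes the no-\-escaping rule above):

    def split_on_unquoted_commas(field):
        # walk the unfolded field, toggling in/out of "..." sections,
        # and yield a fragment at each comma seen outside the quotes
        in_quotes = False
        current = []
        for c in field:
            if c == '"':
                in_quotes = not in_quotes
                current.append(c)
            elif c == "," and not in_quotes:
                yield "".join(current)
                current = []
            else:
                current.append(c)
        yield "".join(current)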
Check that quoted text contains no `\`.
Can you please unpack why this is needed? They are defined to not exist
;) Is the purpose to parse or to validate? There are lots of other
things that one should check if the purpose is to validate.
* Split the whole field on unquoted `,`.
I don't see a nice way of doing that at that point as I'm not seeing the
bigger picture for the algorithm. Perhaps this is a good DebCamp discussion?
It would be worth noting that we have many examples of trailing commas
in Uploaders and that they should be specifically allowed (partly so that
implementations don't assume that the last fragment after splitting is
non-empty).
(In looking at examples in the archive I also found 3 cases where
commas were missing in Uploaders; one fixed on salsa, one bug filed)
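With the splitter sketched above, tolerating the trailing comma just
means dropping empty fragments after splitting, e.g. (again only a
sketch, reusing the hypothetical function from earlier):

    field = 'A Developer <a@example.org>, "Smith, J" <js@example.org>,'
    entries = [e.strip() for e in split_on_unquoted_commas(field) if e.strip()]
    # -> ['A Developer <a@example.org>', '"Smith, J" <js@example.org>']
    # the empty fragment from the trailing comma is silently discarded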
If the field is a Maintainer field and this would result in any
fragments that do not end in `>`, skip this step. In the future,
this rule will be abolished, and only be relevant for old data.
The only maintainer fields containing "," are ones with a single entry
that ends with "," — they are already buggy and the parser would drop
the empty section anyway, so perhaps this wart can be omitted?
* Strip whitespace from the ends.
* Now each entry will end in `<....>`. That is the email address
part.
It has a restricted syntax: the allowable character set is ascii
alphanumerics plus any of the following punctuation:
! # $ % & ' * + - / = ? ^ _ ` { | } ~
also
@ .
;)
The rules above restrict that further to lowercase ASCII; does one feel
the need to actually check that it matches those things? If actually
validating, there's a lot more to do than a char check; if not
validating, then it's just the bit between < and >
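i.e. in my attempt the address extraction is nothing more than this
(sketch; error handling and any further validation left open, as above):

    def email_part(entry):
        # take the trailing <...> as the address part
        if not entry.endswith(">"):
            raise ValueError("entry does not end in <...>: %r" % entry)
        _name, sep, addr = entry[:-1].rpartition("<")
        if not sep:
            raise ValueError("no '<' found in entry: %r" % entry)
        return addr

    print(email_part('"Smith, J" <js@example.org>'))   # js@example.org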
* The remainder of the entry (with white space normalised to single
spaces) is the name part. Strip any `"`.
whitespace normalisation was already done in the first step so can be
avoided here; whitespace on the left end was dealt with 2 steps before.
Stripping the single whitespace from the right end would be needed
though. (Side question: is the whitespace between name part and < required?)
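So the name-part step in my attempt reduces to something like this
(sketch; it assumes unfolding already normalised internal whitespace and
that the entry has already been checked to contain a `<`):

    def name_part(entry):
        # everything before the final "<" is the name; drop the trailing
        # space and any '"' characters
        name = entry[: entry.rfind("<")]
        return name.rstrip().replace('"', "")

    print(name_part('"Smith, J" <js@example.org>'))   # Smith, J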
The name part may be used for human display and possibly ordering.
It should not be involved in equality comparisons, lookups, etc.
While true of course... we also do that in lots of places in Debian to
squash together the multiple emails that an individual has within
Sources. (e.g. in the UDD dashboard views)
--
Stuart Prescott http://www.nanonanonano.net/ stuart@nanonanonano.net
Debian Developer http://www.debian.org/ stuart@debian.org
GPG fingerprint 90E2 D2C1 AD14 6A1B 7EBB 891D BBC1 7EBB 1396 F2F7
#!/usr/bin/python3
import re
from debian.deb822 import Sources
sources = "/var/lib/apt/lists/deb.debian.org_debian_dists_sid_main_source_Sources"
everyone = []
# this regex doesn't quite match the rules being discussed, it's a
# quick-and-dirty one to highlight some interesting cases.
valid_maint = re.compile(r"^[a-zA-Z0-9/. ~_-]+ <[a-zA-Z0-9.+_-]+@[a-z0-9.-]+>$")
# permitting also ( ) ' and "
valid_maint = re.compile(r"^[a-zA-Z0-9/. ~_()'\"-]+ <[a-zA-Z0-9.+_-]+@[a-z0-9.-]+>$")
# widening to also permit unicode letter characters anywhere in the name
# in Python 3 \w is automatically unicode aware
# https://docs.python.org/3/library/re.html#re.UNICODE
# https://docs.python.org/3/library/stdtypes.html#str.isalnum
# https://docs.python.org/3/library/stdtypes.html#str.isalpha
# \w means:
# c.isalpha() or c.isdecimal() or c.isdigit() or c.isnumeric()
# where isalpha means
# Alphabetic characters are those characters defined in the Unicode character
# database as “Letter”, i.e., those with general category property being one
# of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the
# Alphabetic property defined in the section 4.10 ‘Letters, Alphabetic, and
# Ideographic’ of the Unicode Standard.
#
# The inclusion of various digit forms from isdigit and isnumeric in this is
# probably overly broad for the name part but does no harm here.
# https://docs.python.org/3/library/stdtypes.html#str.isdigit
# https://docs.python.org/3/library/stdtypes.html#str.isnumeric
valid_maint = re.compile(r"^[a-zA-Z0-9/. ~_()'\"\w-]+ <[a-zA-Z0-9.+_-]+@[a-z0-9.-]+>$")
# widening to also permit uppercase domain names
valid_maint = re.compile(r"^[a-zA-Z0-9/. ~_()'\"\w-]+ <[a-zA-Z0-9.+_-]+@[a-zA-Z0-9.-]+>$")
pkgcount = 0
with open(sources) as f:
    # iter_paragraphs wants the file contents (or a file object), not a path
    for src in Sources.iter_paragraphs(f):
        counted = False
        if not valid_maint.match(src["Maintainer"]):
            everyone.append(src["Maintainer"])
            pkgcount += 1
            counted = True
        for upl in src.get("Uploaders", "").split(","):
            # note: split(",") does not handle the one current example in Sources
            # with a name that contains , and is actually quoted in double quotes
            upl = upl.strip()
            if upl and not valid_maint.match(upl):
                everyone.append(upl)
                if not counted:
                    # count each affected source package only once
                    pkgcount += 1
                    counted = True
uniq = sorted(set(everyone))
print(*uniq, sep="\n")
print("Total entries", len(uniq))
print("Total packages", pkgcount)