[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Bug#715216: qa.debian.org: collab-qa/upload-history: Software trusts "Date" headers which are sometimes set wrong



New proposed fix:

In collab-qa/upload-history/munge_ddc.py , if the Message-Date we were going to emit is not within the year of the envelope From, plus/minus one year, we drop Message-Date.

This permits the udd/upload_history_gatherer.py code to use its existing logic about dropping replacing "Message-Date" with "Date" when "Message-Date" is visible.

I've tested that on a subset of debian-devel-changes.201307 and confirmed the patched code has identical output, and that the crazy "The year is 2019" message has N/A as its Message-Date. So I believe the behavior is correct.

I've also tested the performance of this. I re-ran the patched + unpatched code until runtimes were close to consistent (in an attempt to remove disk or other caches as the difference) and then I got these numbers after one run of each:

Unpatched: 22.994sec

➜ upload-history time python munge_ddc.py /tmp/debian-devel-changes/debian-devel-changes.201307 > /tmp/debian-devel-changes.201307.out.without-patch

17.93s user 4.59s system 97% cpu 22.994 total

Patched: 22.983sec

➜ upload-history time python munge_ddc.py /tmp/debian-devel-changes/debian-devel-changes.201307 > /tmp/debian-devel-changes.201307.out.with-patch

18.00s user 4.52s system 97% cpu 22.983 total


I'm not realistically arguing that my code runs faster than the original code, but I believe there is less than 1% runtime difference between the two, and probably way less than 1%.

Patch attached for munge_ddc.py. Please review and merge, if you like!

-- Asheesh.
Index: munge_ddc.py
===================================================================
--- munge_ddc.py	(revision 2767)
+++ munge_ddc.py	(working copy)
@@ -105,6 +105,55 @@
 
     return ret
 
+def get_message_date_if_it_is_sane(msg, rc):
+    # See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=715216 for the reason
+    # this function exists at all.
+    raw_msg_date = msg['Date']
+    message_date = normalize_date(msg['Date'])
+
+    # If there is no usable message date, then we should bail out.
+    # This is preferable to emitting any other value as the message date;
+    # the other date data is also emitted (look for "Message-Date" vs. "Date"
+    # in munge_ddc()), so downstream consumers of this data can decide for
+    # themselves how to proceed.
+    if message_date is None:
+        return raw_msg_date
+
+    # If there is a message date, we want to check how that compares
+    # to the date embedded in the Envelope from, if there is one.
+    #
+    # In the past, there have been situations where the message date
+    # is far ahead of the actual date, probably due to a server error
+    # somewhere.
+    envelope_from = msg.get_unixfrom()
+
+    # If we have no envelope from, then we cannot do the sanity check. So we
+    # let the original data through.
+    if envelope_from is None:
+        return raw_msg_date
+
+    # If the we have an envelope from, try to find a date. If we don't have
+    # one, then we also have to bail out.
+    envelope_from_date = None
+    if envelope_from.count(' ') >= 2:
+        envelope_from_date = normalize_date(
+            envelope_from.split(' ', 2)[2])
+    if envelope_from_date is None:
+        return raw_msg_date
+
+    # Given that we have both, we use the following heuristic: if the
+    # Message-Date is more than one year different from the date in the
+    # envelope From, we assume the Message-Date is corrupt.
+    message_date_year = emailutils.parsedate_tz(message_date)[0]
+    envelope_from_year = emailutils.parsedate_tz(envelope_from_date)[0]
+
+    # If the message date year is within a sane range, we return it.
+    if ((envelope_from_year - 1) < message_date_year <
+        (envelope_from_year + 1)):
+        return raw_msg_date
+    else:
+        return 'N/A'
+
 def build_keyring_list(dir):
     """ Scan dir and return a list of every file ending with .gpg or .pgp """
     keyrings = []
@@ -141,7 +190,7 @@
         c['Signed-By'] = 'N/A'
 
         c['Message-Id'] = str(msg['Message-Id']).strip('\n')
-        c['Message-Date'] = str(msg['Date']).strip('\n')
+        c['Message-Date'] = get_message_date_if_it_is_sane(msg, rc)
 
 	if not set(['Source', 'Architecture', 'Version', 'Date', 'Changes']).issubset(rc.keys()):
             sys.stderr.write("Required fields not found, skipping.\n")

Reply to: