
Re: [UDD] Fixing (most) email addresses in upload_history table



Hi,

On Sat, Jan 22, 2011 at 03:24:49PM +0100, Lucas Nussbaum wrote:
> Yes, please fix this in the importer.

I think the attached patch will do the trick: with debug=1 set in
aux.py, the parsed strings look good.
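
For readers who want to try the fallback logic outside UDD, here is a minimal Python 3 sketch of the parse_email helper from the attached patch. The regular expressions are copied from the patch (only rewritten as raw strings); the sample inputs are made up for illustration and are not taken from the real upload history. Note that \w is Unicode-aware on Python 3 strings, which may differ slightly from the Python 2 behaviour of the patch.

```python
import re
from email.utils import parseaddr

# Patterns copied from the patch, written as raw strings.
EMAIL_RE = re.compile(r'^[^<]+[<(]([.\w]+@[.\w]+)[>)].*')
NAME_RE = re.compile(r'^[^\w]*([^<]+[.\w)\]]) *[<(][.\w]+@[.\w]+[>)].*')


def parse_email(field):
    """Return (name, email); try harder when parseaddr finds no address."""
    name, email = parseaddr(field)
    # parseaddr found no '@' although the raw field contains one:
    # fall back to the regular expressions.
    if '@' not in email and '@' in field:
        email = EMAIL_RE.sub(r'\1', field)
        name = NAME_RE.sub(r'\1', field)
    return name, email


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(parse_email("Jane Roe <jane@example.org>"))
    print(parse_email("John Doe (john@example.org)"))
```

One caveat of this approach: if neither pattern matches, re.sub returns the input unchanged, so the whole field ends up in both name and email.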

However, I had serious trouble importing the complete upload history
after setting

Index: config-org.yaml
===================================================================
--- config-org.yaml     (revision 1895)
+++ config-org.yaml     (working copy)
@@ -407,6 +407,7 @@
   update-command: if [ ! -e /org/udd.debian.org/tmp/upload-history/ ]; then mkdir /org/udd.debian.org/tmp/upload-history/; fi; lftp -c 'mirror -e -P http://master.debian.org/~lucas/ddc-parser/ /org/udd.debian.org/tmp/upload-history'
   schema: upload_history
   table: upload_history
+  only-recent: False
 
 hints:
   type: hints
 

I needed to catch psycopg2.IntegrityError, as you can see in the patch,
and this duplicate key error happened VERY frequently.  So something seems
to be really wrong: after importing everything, my tables were filled
to only a fraction of what is in the original UDD.  I admit I do not
understand this problem, and I wonder if somebody has an idea.  Please
note that this is not related to the patch itself (I stumbled upon this
duplicate key problem previously when I tried to create a UDD copy, but
finally ignored it because there was no real reason to recreate these
tables).
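
One thing that might contribute to the missing rows (an assumption, not something verified against the real importer): a plain rollback after the IntegrityError also discards every earlier insert made in the same transaction, unless those were committed beforehand. The following sketch shows the effect with the standard sqlite3 module instead of psycopg2, using a simplified, hypothetical two-table schema and made-up rows; a SAVEPOINT limits the rollback to the failing batch.

```python
import sqlite3

# Made-up sample data; the second CLOSES row repeats the key of the first.
UPLOADS = [("foo", "1.0-1"), ("bar", "2.0-1")]
CLOSES = [("foo", "1.0-1", 1234), ("foo", "1.0-1", 1234)]


def import_file(use_savepoint):
    """Insert uploads, then closes; return how many uploads survive."""
    # isolation_level=None: no implicit transactions, we issue BEGIN/COMMIT.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    cur = conn.cursor()
    cur.execute("CREATE TABLE uploads (source TEXT, version TEXT, "
                "PRIMARY KEY (source, version))")
    cur.execute("CREATE TABLE closes (source TEXT, version TEXT, bug INTEGER, "
                "PRIMARY KEY (source, version, bug))")
    cur.execute("BEGIN")
    cur.executemany("INSERT INTO uploads VALUES (?, ?)", UPLOADS)
    if use_savepoint:
        cur.execute("SAVEPOINT before_closes")
    try:
        cur.executemany("INSERT INTO closes VALUES (?, ?, ?)", CLOSES)
    except sqlite3.IntegrityError:
        if use_savepoint:
            # Undo only the failing batch; the uploads stay in the transaction.
            cur.execute("ROLLBACK TO SAVEPOINT before_closes")
        else:
            # Undo the whole transaction, including the uploads.
            cur.execute("ROLLBACK")
    if conn.in_transaction:
        cur.execute("COMMIT")
    kept = cur.execute("SELECT COUNT(*) FROM uploads").fetchone()[0]
    conn.close()
    return kept


if __name__ == "__main__":
    print("plain rollback:", import_file(False))
    print("with savepoint:", import_file(True))
```

SAVEPOINT and ROLLBACK TO SAVEPOINT are ordinary SQL statements in PostgreSQL as well, so the same idea could probably be tried through cursor.execute() in the gatherer.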

Could anybody try to create the upload history from scratch and find
out why it fails so badly?
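
As a possible starting point for the investigation, a small Python 3 sketch that counts repeated keys in a batch of parsed records before they are sent to the database. It assumes the records are dictionaries with the 'Source', 'Version' and 'closes' fields printed by the patch, and that these three fields form the key of the closes table; the real key in the UDD schema may differ. The function name and the sample records are hypothetical.

```python
from collections import Counter


def find_duplicate_keys(records, key_fields=("Source", "Version", "closes")):
    """Return {key_tuple: count} for every key that occurs more than once."""
    counts = Counter(tuple(rec[f] for f in key_fields) for rec in records)
    return {key: n for key, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Made-up records, for illustration only.
    sample = [
        {"Source": "foo", "Version": "1.0-1", "closes": 1234},
        {"Source": "foo", "Version": "1.0-1", "closes": 1234},
        {"Source": "bar", "Version": "2.0-1", "closes": 5678},
    ]
    print(find_duplicate_keys(sample))
```

Running this over uploads_closes just before the executemany call would show whether the duplicates are already present in the parsed input or only arise against rows that are in the table already.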

Kind regards

    Andreas.

-- 
http://fam-tille.de
Index: udd/aux.py
===================================================================
--- udd/aux.py	(revision 1895)
+++ udd/aux.py	(working copy)
@@ -5,6 +5,8 @@
 import psycopg2
 from os import path
 import fcntl
+import re
+from email.Utils import parseaddr
 
 # If debug is something that evaluates to True, then print_debug actually prints something
 debug = 0
@@ -89,3 +91,13 @@
   if debug:
     sys.stdout.write(*args)
     sys.stdout.write("\n")
+
+def parse_email(str):
+  """Use email.Utils to parse name and email.  Afterwards check whether it was successful and try harder to get a reasonable address"""
+  name, email = parseaddr(str)
+  # if no '@' is detected in email but string contains a '@' anyway try harder to get a reasonable Mail address
+  if email.find('@') == -1 and str.find('@') != -1:
+    email = re.sub('^[^<]+[<\(]([.\w]+@[.\w]+)[>\)].*',                  '\\1', str)
+    name  = re.sub('^[^\w]*([^<]+[.\w\)\]]) *[<\(][.\w]+@[.\w]+[>\)].*', '\\1', str)
+    print_debug("parse_email: %s ---> %s <%s>" % (str, name, email))
+  return name, email
Index: udd/upload_history_gatherer.py
===================================================================
--- udd/upload_history_gatherer.py	(revision 1895)
+++ udd/upload_history_gatherer.py	(working copy)
@@ -7,7 +7,6 @@
 import gzip
 import psycopg2
 import sys
-import email.Utils
 import os.path
 
 def get_gatherer(config, connection, source):
@@ -83,10 +82,10 @@
         line = line.lstrip()
         # Stupid multi-line maintainer fields *grml*
         if line == '':
-          current['Changed-By_name'], current['Changed-By_email'] = email.Utils.parseaddr(current['Changed-By'])
-          current['Maintainer_name'], current['Maintainer_email'] = email.Utils.parseaddr(current['Maintainer'])
+          current['Changed-By_name'], current['Changed-By_email'] = aux.parse_email(current['Changed-By'])
+          current['Maintainer_name'], current['Maintainer_email'] = aux.parse_email(current['Maintainer'])
           if current['Signed-By'].find('@') != -1:
-            current['Signed-By_name'], current['Signed-By_email'] = email.Utils.parseaddr(current['Signed-By'])
+            current['Signed-By_name'], current['Signed-By_email'] = aux.parse_email(current['Signed-By'])
           else:
             current['Signed-By_name'] = current['Signed-By']
             current['Signed-By_email'] = ''
@@ -132,7 +131,13 @@
 
       cursor.executemany(query, uploads)
       cursor.executemany(query_archs, uploads_archs)
-      cursor.executemany(query_closes, uploads_closes)
+      try:
+        cursor.executemany(query_closes, uploads_closes)
+      except psycopg2.IntegrityError, err: 
+        print "Skipping upload values from %s because of duplicate key error.\nThe following values caused the problem:" % (uploads_closes[0]['File'])
+        for ul in uploads_closes:
+          print ul['Source'], ul['Version'], ul['closes'], ul
+        self.connection.rollback()
       
     cursor.execute("DEALLOCATE uh_insert")
     cursor.execute("ANALYZE " + self.my_config['table'] + '_architecture')
