
Re: [UDD] Fixing (most) email addresses in upload_history table



Hi again,

An additional remark on the Duplicate Key problem:  If I run
  TRUNCATE upload_history_closes
before doing a full import, the import works smoothly and the exception is
not triggered.  So I committed the code without the try-except block from
the patch suggested in my previous mail.

If you want the new code to take full effect you need to set

  only-recent: False

in your config file, as described below.  As long as the Duplicate Key
problem exists you also need to TRUNCATE the upload_history_closes table,
as described above.

BTW, a

udd=# SELECT maintainer, maintainer_email, changed_by, changed_by_email, signed_by FROM upload_history WHERE (maintainer_email not like '%@%' or changed_by_email not like '%@%') and changed_by_email != 'N/A' and maintainer_email != 'N/A';

reveals some other cases where reasonable guesses about valid
e-mail addresses can be made if you compare maintainer, changed_by and
signed_by for the name and the username part of the email.  While my
application does not really need to be that picky, it could
help when aiming for real completeness.
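
To make the idea concrete, here is a minimal sketch of how such a guess
could work.  It is not part of the importer.  The function name
guess_email and its arguments are made up for illustration, and it
assumes that the name has already been separated from the broken address
and that the other columns hold strings of the form "Name <user@host>".

```python
from email.utils import parseaddr


def guess_email(name, broken, others):
    """Return a plausible address for (name, broken), or None.

    name   -- the person's name belonging to the broken field
    broken -- what ended up in the *_email column (no '@' in it)
    others -- raw strings from the other columns of the same row,
              e.g. changed_by and signed_by (hypothetical input layout)
    """
    wanted_name = name.strip().lower()
    wanted_user = broken.strip().lower()
    for raw in others:
        other_name, other_email = parseaddr(raw)
        if '@' not in other_email:
            continue  # this column is of no help either
        local_part = other_email.split('@', 1)[0].lower()
        same_name = bool(wanted_name) and other_name.strip().lower() == wanted_name
        same_user = bool(wanted_user) and local_part == wanted_user
        if same_name or same_user:
            return other_email
    return None
```

Whether a match on the name alone is good enough is a judgment call; a
stricter variant could require both the name and the username to agree.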

Any comments are welcome

    Andreas.

On Sat, Jan 22, 2011 at 08:48:51PM +0100, Andreas Tille wrote:
> Hi,
> 
> On Sat, Jan 22, 2011 at 03:24:49PM +0100, Lucas Nussbaum wrote:
> > Yes, please fix this in the importer.
> 
> I think the attached patch will do the trick and when setting debug=1
> in aux.py the parsed strings look good.
> 
> However, I had serious trouble importing the complete upload-history
> when setting
> 
> Index: config-org.yaml
> ===================================================================
> --- config-org.yaml     (Revision 1895)
> +++ config-org.yaml     (Arbeitskopie)
> @@ -407,6 +407,7 @@
>    update-command: if [ ! -e /org/udd.debian.org/tmp/upload-history/ ]; then mkdir /org/udd.debian.org/tmp/upload-history/; fi; lftp -c 'mirror -e -P http://master.debian.org/~lucas/ddc-parser/ /org/udd.debian.org/tmp/upload-history'
>    schema: upload_history
>    table: upload_history
> +  only-recent: False
>  
>  hints:
>    type: hints
>  
> 
> I needed to catch psycopg2.IntegrityError as you can see in the patch
> and this DuplicateKeyError happened VERY frequently.  So something seems
> to be really wrong and after importing everything my tables were filled
> only to a fraction of what is in the original UDD.  I admit I do not
> understand this problem and I wonder if somebody has an idea.  Please
> note that this is not related to the patch itself (I stumbled upon this
> Duplicate Key problem previously when I tried to create a UDD copy but
> ignored it in the end because there was no real reason to recreate these
> tables).
> 
> Could anybody try to create the upload-history from scratch and find
> out why it fails so badly?
> 
> Kind regards
> 
>     Andreas.
> 
> -- 
> http://fam-tille.de

> Index: udd/aux.py
> ===================================================================
> --- udd/aux.py	(Revision 1895)
> +++ udd/aux.py	(Arbeitskopie)
> @@ -5,6 +5,8 @@
>  import psycopg2
>  from os import path
>  import fcntl
> +import re
> +from email.Utils import parseaddr
>  
>  # If debug is something that evaluates to True, then print_debug actually prints something
>  debug = 0
> @@ -89,3 +91,13 @@
>    if debug:
>      sys.stdout.write(*args)
>      sys.stdout.write("\n")
> +
> +def parse_email(str):
> +  """Use email.Utils to parse name and email.  Afterwards check whether it was successful and try harder to get a reasonable address"""
> +  name, email = parseaddr(str)
> +  # if no '@' is detected in email but string contains a '@' anyway try harder to get a reasonable Mail address
> +  if email.find('@') == -1 and str.find('@') != -1:
> +    email = re.sub('^[^<]+[<\(]([.\w]+@[.\w]+)[>\)].*',                  '\\1', str)
> +    name  = re.sub('^[^\w]*([^<]+[.\w\)\]]) *[<\(][.\w]+@[.\w]+[>\)].*', '\\1', str)
> +    print_debug("parse_email: %s ---> %s <%s>" % (str, name, email))
> +  return name, email
> Index: udd/upload_history_gatherer.py
> ===================================================================
> --- udd/upload_history_gatherer.py	(Revision 1895)
> +++ udd/upload_history_gatherer.py	(Arbeitskopie)
> @@ -7,7 +7,6 @@
>  import gzip
>  import psycopg2
>  import sys
> -import email.Utils
>  import os.path
>  
>  def get_gatherer(config, connection, source):
> @@ -83,10 +82,10 @@
>          line = line.lstrip()
>          # Stupid multi-line maintainer fields *grml*
>          if line == '':
> -          current['Changed-By_name'], current['Changed-By_email'] = email.Utils.parseaddr(current['Changed-By'])
> -          current['Maintainer_name'], current['Maintainer_email'] = email.Utils.parseaddr(current['Maintainer'])
> +          current['Changed-By_name'], current['Changed-By_email'] = aux.parse_email(current['Changed-By'])
> +          current['Maintainer_name'], current['Maintainer_email'] = aux.parse_email(current['Maintainer'])
>            if current['Signed-By'].find('@') != -1:
> -            current['Signed-By_name'], current['Signed-By_email'] = email.Utils.parseaddr(current['Signed-By'])
> +            current['Signed-By_name'], current['Signed-By_email'] = aux.parse_email(current['Signed-By'])
>            else:
>              current['Signed-By_name'] = current['Signed-By']
>              current['Signed-By_email'] = ''
> @@ -132,7 +131,13 @@
>  
>        cursor.executemany(query, uploads)
>        cursor.executemany(query_archs, uploads_archs)
> -      cursor.executemany(query_closes, uploads_closes)
> +      try:
> +        cursor.executemany(query_closes, uploads_closes)
> +      except psycopg2.IntegrityError, err: 
> +        print "Skipping upload values from %s because of duplicate key error.\nThe following values caused the problem:" % (uploads_closes[0]['File'])
> +        for ul in uploads_closes:
> +          print ul['Source'], ul['Version'], ul['closes'], ul
> +        self.connection.rollback()
>        
>      cursor.execute("DEALLOCATE uh_insert")
>      cursor.execute("ANALYZE " + self.my_config['table'] + '_architecture')

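For anyone who wants to try the fallback regexes from the quoted aux.py
patch outside of UDD, a rough Python 3 transcription might look as
follows.  This is only a sketch: it assumes Python 3 (email.utils instead
of email.Utils), renames the argument to raw so it no longer shadows the
built-in str, and drops the print_debug call.  The regular expressions
are unchanged from the patch.

```python
import re
from email.utils import parseaddr

# Fallback patterns, copied unchanged from the patch
EMAIL_RE = re.compile(r'^[^<]+[<\(]([.\w]+@[.\w]+)[>\)].*')
NAME_RE = re.compile(r'^[^\w]*([^<]+[.\w\)\]]) *[<\(][.\w]+@[.\w]+[>\)].*')


def parse_email(raw):
    """Parse name and email; try harder if parseaddr finds no '@'."""
    name, email = parseaddr(raw)
    # No '@' in the parsed address although the raw string has one:
    # fall back to the regular expressions.
    if '@' not in email and '@' in raw:
        email = EMAIL_RE.sub(r'\1', raw)
        name = NAME_RE.sub(r'\1', raw)
    return name, email
```

Note that re.sub returns the input unchanged when a pattern does not
match, so a string the regexes cannot handle ends up as both name and
email.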

-- 
http://fam-tille.de

