Re: udd/blends_metadata_gathener.py hints
On Sat, Oct 19, 2013 at 12:18:27AM +0300, Emmanouil Kiagias wrote:
> Hello Andreas,
> I updated blends_metadata_gathener.py
> > From first intuition I would think it might make sense to add single
> > paragraphs to the configfile, like
> > blends-all
> > blend-med
> > blend-edu
> > blend-gis
> > blend-...
> I added the above paragraphs inside config-ullman.yaml.
> With blends-all the gatherer runs for each available Blend, otherwise it
> runs only for the selected Blend.
That's pretty cool, specifically the implementation with the checksum!
It speeds up the daily cron job drastically (I was about to write this
soon - it is so great that you have beaten me! ;-) )
> I created the single blend paragraphs using <<: *blends-conf in case we
> need to override any of the blends-all attributes.
Really cool - good to have some expert like you - I personally was not
aware of this very helpful option!
> Each Blend now has its own log file by the name :
OK, that's helpful as well.
> In case the gatherer fails before it updates any Blend it logs into a
> blends_metadata_gatherer-default.log file.
> For checking if a task file has changed I added a "hashkey" column in the
> blends_tasks. When a task is imported I save an md5 hash in the
> blends_tasks. Before I delete and add from scratch a taskfile I check
> whether its hashkey has changed. So if you run the new gatherer once in
> order to save some first hashkeys then it will only delete/add the changed
Yes, that's a great feature which reduces the load on alioth
drastically. The only problem I noticed is that it does not clean up
deleted / renamed tasks. It checks task file by task file - but if a
task file that was previously injected into the database goes missing,
its data would stay forever inside the database. I'd recommend to
simply store a list of all (successfully) parsed task files and remove
all those tasks from the database which are not in this list. Once this
is done I'll put the code immediately into effect on production UDD.
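
To illustrate the idea, here is a minimal sketch that combines the
checksum skip with the cleanup of vanished tasks. It is not the actual
gatherer code: md5_of, plan_sync and both dictionaries are hypothetical
names, and the real script would read the stored hashkeys from
blends_tasks instead of a dict.

```python
import hashlib


def md5_of(content):
    """Return the md5 hex digest of a task file's content."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def plan_sync(stored_hashes, parsed_files):
    """Decide which tasks to (re)import and which to remove.

    stored_hashes: task name -> hashkey currently stored in the database
    parsed_files:  task name -> content of every task file that was
                   parsed successfully in this run (hypothetical input)
    """
    # New tasks have no stored hashkey, changed tasks have a different one.
    to_import = sorted(
        task
        for task, content in parsed_files.items()
        if stored_hashes.get(task) != md5_of(content)
    )
    # Tasks known to the database but missing from this run were
    # deleted or renamed upstream and should be removed.
    to_remove = sorted(set(stored_hashes) - set(parsed_files))
    return to_import, to_remove
```

Only task files that parsed successfully should end up in parsed_files,
so in a real run one would probably skip the removal step entirely when
fetching the task files failed as a whole.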
> In the above case I could not delete and readd the Blend entry from
> blends_metadata table (because of the references in blends_tasks etc) so I
> check whether a Blend exists. If it exists I update the entry to save any
> changes else I use the blends_metadata_insert to create a new entry.
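
For readers following along, the update-or-insert logic described here
could look roughly like the sketch below. It uses sqlite3 only as a
self-contained stand-in for the PostgreSQL connection, and save_blend as
well as the two-column table layout are assumptions made for the
example, not the real blends_metadata schema.

```python
import sqlite3


def save_blend(conn, blend, title):
    """Update the Blend's row if it exists, otherwise insert a new one.

    The row is never deleted, so rows in other tables that reference
    it (for example in blends_tasks) stay valid.
    """
    exists = conn.execute(
        "SELECT 1 FROM blends_metadata WHERE blend = ?", (blend,)
    ).fetchone()
    if exists:
        conn.execute(
            "UPDATE blends_metadata SET title = ? WHERE blend = ?",
            (title, blend),
        )
        return "updated"
    conn.execute(
        "INSERT INTO blends_metadata (blend, title) VALUES (?, ?)",
        (blend, title),
    )
    return "inserted"
```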
> You can test the gatherer. Any feedback/comments are more than welcome :-).
In short: Great job with a minor missing bit (handling of deleted tasks).
> I will now check on the following (quoting from a previous mail of yours):
> c) try to make the insertion procedure itself more efficient by for
> - check, whether we could speed up the check for a package that
> just exists in UDD
> - inject all packages in one rush
Maybe this could enhance things even more. It would be good to put as
little stress on UDD as possible.
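
Just as a rough sketch of both points (one lookup for all packages of a
task, then one batched insert): sqlite3 again only stands in for the
real PostgreSQL connection, and the table names packages and deps, the
in_udd column and both function names are made up for this example.

```python
import sqlite3


def known_packages(conn, names):
    """Return the subset of names that exist in the packages table.

    One query for the whole list instead of one query per package.
    """
    names = list(names)
    if not names:
        return set()
    placeholders = ",".join("?" * len(names))
    rows = conn.execute(
        "SELECT package FROM packages WHERE package IN (%s)" % placeholders,
        names,
    )
    return {row[0] for row in rows}


def insert_dependencies(conn, blend, task, names):
    """Insert all dependencies of a task in a single batch."""
    names = list(names)
    known = known_packages(conn, names)
    rows = [(blend, task, name, name in known) for name in names]
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT INTO deps (blend, task, package, in_udd) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)
```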
Thanks for your work on this