Re: [GSoC] Profiling SQL query (Was: [GSoC] Reg Blends Web Sentinel)
Hi Akshita,
On Thu, Mar 12, 2015 at 06:44:03PM +0530, Akshita Jha wrote:
> I was working on optimizing the query - "query_bug_packages". I have a
> small doubt regarding the query:
s/small/severe/ :-)
I confirm that your doubt is perfectly rectified.
> In the sub-query:
>
> SELECT s.source, b.component, s.homepage, s.vcs_browser, s.maintainer,
> s.version, row_number() OVER (PARTITION BY s.source ORDER BY s.version DESC)
> FROM blends_dependencies b
> JOIN packages p ON p.package = b.package
> JOIN bugs bu ON bu.source = p.source
> JOIN sources s ON s.source = p.source
> WHERE blend = $1 AND b.distribution = 'debian'
> GROUP BY s.source, b.dependency, b.component, s.homepage,
> s.vcs_browser, s.maintainer, s.version
> ) sources -- check status of dependency relation because only
> suggested packages are less important for bugs sentinel
>
> we are already using PARTITION BY and selecting "row_number=1", then why
> exactly do we need GROUP BY ?
Since the code was never peer reviewed? I first went with the GROUP BY
which was sufficient for the first test queries. Later I tried to get
rid of some duplicates via PARTITION and simply forgot to remove the
GROUP BY ...
> I tried indexing, but udd already has very well-defined indexes. I also
> tried changing the join conditions but there was not much improvement. I
> maybe wrong in my analysis but if you could help me understand the query a
> little better, it would be of great help.
I think your hint was of great help! Droping the GROUP BY increased the
speed by an order of magnitude. I also droped the other similarly
redundant GROUP BY (see my last commit). This has caused a drastical
performance kick. if you compare the old result:
$ cat logs/*bugs_udd.out
...
Time: 281.581650019
Query: EXECUTE query_bug_packages('debian-games')
Rendering bugs from UDD for Blend 'debian-games' took 299 seconds
Time: 34.5106248856
Query: EXECUTE query_bug_packages('debian-gis')
Rendering bugs from UDD for Blend 'debian-gis' took 37 seconds
...
Time: 868.058537006
Query: EXECUTE query_bug_packages('debian-junior')
Rendering bugs from UDD for Blend 'debian-junior' took 892 seconds
...
Rendering bugs from UDD for Blend 'debian-med' took 27 seconds
Time: 193.018954039
Query: EXECUTE query_bug_packages('debian-multimedia')
Rendering bugs from UDD for Blend 'debian-multimedia' took 198 seconds
...
Time: 183.071342945
Query: EXECUTE query_bug_packages('debian-science')
Rendering bugs from UDD for Blend 'debian-science' took 194 seconds
cat logs/*bugs_udd.out
...
Rendering bugs from UDD for Blend 'debian-games' took 57 seconds
Rendering bugs from UDD for Blend 'debian-gis' took 9 seconds
Time: 70.1499569416
Query: EXECUTE query_bug_packages('debian-junior')
Rendering bugs from UDD for Blend 'debian-junior' took 86 seconds
...
Rendering bugs from UDD for Blend 'debian-med' took 28 seconds
Rendering bugs from UDD for Blend 'debian-multimedia' took 28 seconds
...
Rendering bugs from UDD for Blend 'debian-science' took 31 seconds
So if you compare the time from Debian Junior 892 versus 86 seconds
that's the amount of time which is perfectly acceptable.
Now I was also able to run debian-edu on my box in a sensible time
frame:
Time: 178.415522814
Query: EXECUTE query_bugs('debian-edu')
Time: 1252.45552516
Query: EXECUTE query_bug_packages('debian-edu')
Rendering bugs from UDD for Blend 'debian-edu' took 1437 seconds
I think this would be also acceptable to run on production UDD.
However, I wonder whether you could do some further review after this
issue should be clarified. Perhaps we get another factor 2 or 4 for
speeding up to make sure we will not uselessly stress test Debian
hardware.
In summary: Thanks a lot for your help to
a) Get bugs_udd working at all
b) Speed it up to some acceptable state.
That's really welcome since my time capacity is to hard occupied
by other things and I have obviously not given enough love to these
tools.
Another hint: In the dir misc/sql/ in the Git repository I have stored
some test queries. The script bugs.sh (or
bugs_including_metapackage_bugs.sh - I do not remember properly) was
used as template to test the queries quickly. If you want to use these
feel free to change these scripts - they have no other purpose than
testing. Make sure you check whether they are in sync with bugs_udd
first!
Kind regards
Andreas.
--
http://fam-tille.de
Reply to: