[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: UDD: statement_timeout for guest user set to 120s



Re: Lucas Nussbaum 2011-10-29 <20111029045114.GA25092@xanadu.blop.info>
> Clearly the problem here is the Seq scans on packages, that require reading the
> data from disk since the table usually doesn't stay in RAM. (the packages table
> takes 2 GB). Apparently, when the machine is idle, the runtime is reasonable (I
> could get a runtime of 14s). But if the query runs at the same time as an
> import of data, it can take more than 200s.
> 
> I tried to add an INDEX:
> CREATE INDEX packages_pkgverdescr_idx on packages(package, version,description);

Including description in the index doesn't really help as that's not
an column that is searched for, but that should be retrieved.
(packages_pkgverdescr_idx is still there and should be dropped again.)

>                                                                QUERY PLAN                                                                
> -----------------------------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=499026.31..499722.83 rows=69652 width=58) (actual time=14708.197..14751.840 rows=43791 loops=1)
>    ->  Hash Join  (cost=471046.39..497285.02 rows=348258 width=58) (actual time=12295.665..14135.913 rows=444577 loops=1)
>          Hash Cond: (public.packages.package = public.packages.package)
>          Join Filter: (public.packages.version = (max(public.packages.version)))
>          ->  HashAggregate  (cost=228999.76..229306.82 rows=24565 width=24) (actual time=6314.553..6382.598 rows=43787 loops=1)
>                ->  Seq Scan on packages  (cost=0.00..225517.17 rows=696517 width=24) (actual time=163.114..4332.862 rows=696049 loops=1)
>          ->  Hash  (cost=225517.17..225517.17 rows=696517 width=68) (actual time=5980.429..5980.429 rows=696049 loops=1)
>                Buckets: 65536  Batches: 2  Memory Usage: 30568kB
>                ->  Seq Scan on packages  (cost=0.00..225517.17 rows=696517 width=68) (actual time=167.867..4632.245 rows=696049 loops=1)
>  Total runtime: 14776.545 ms
> (10 rows)

The best I could get now was to use a window function to get rid of
the self-join:

EXPLAIN ANALYZE SELECT package, description, version FROM
  (SELECT packages.package, packages.description, packages.version,
  row_number() OVER (PARTITION BY packages.package ORDER BY version DESC) AS row
FROM packages) AS sub
WHERE row = 1;

                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on sub  (cost=433124.90..478921.49 rows=7046 width=96) (actual time=18954.689..22608.978 rows=44375 loops=1)
   Filter: (sub."row" = 1)
   ->  WindowAgg  (cost=433124.90..461307.42 rows=1409126 width=68) (actual time=18954.650..22210.341 rows=700467 loops=1)
         ->  Sort  (cost=433124.90..436647.71 rows=1409126 width=68) (actual time=18954.612..20714.929 rows=700467 loops=1)
               Sort Key: packages.package, packages.version
               Sort Method:  external merge  Disk: 53984kB
               ->  Seq Scan on packages  (cost=0.00..233817.26 rows=1409126 width=68) (actual time=0.376..6184.397 rows=700467 loops=1)
 Total runtime: 22832.212 ms
(8 Zeilen)

This is a better plan, but doesn't run faster than your version.
(Ideally, Postgres would realize it could use one of the (package,
version, ...) indexes instead of sorting the whole table.)

> I also tried to rewrite the query into:
> explain analyze select distinct packages.package, packages.description
> from packages where (package, version) in (select package, max(version) from packages_summary group by package);
> 
> (packages_summary is the table with a restricted set of field from packages)

That's still a pretty big table with several times more rows than
should be retrieved (because of multiple archs or something like
that).

In the end, there's probably no way to optimize the query because
effectively the query asks for (almost) all packages to be considered
in the computation.

Christoph
-- 
cb@df7cb.de | http://www.df7cb.de/


Reply to: