
Bug#675106: ITP: pgbulkload -- A high speed data loading utility for PostgreSQL



Ivan Shmakov <oneingray@gmail.com> writes:
> Alexander Kuznetsov <acca@cpan.org> writes:
[…]
>	(Some wording fixes and suggestions.)

Thanks a lot! For some reason the message fell off the thread; I
accidentally found it while searching for another one. Also,
lists.debian.org cannot find the original post, while GMANE shows it
perfectly fine. Is it supposed to be like that?


[...]
>> ignored during the loading. For example, you can skip integrity checks for
>> performance when you copy data from another database to PostgreSQL. On the
>> other hand, you can enable constraint checks when loading unclean data.
>
>	Are “constraint checks” different to “integrity checks” in the
>	above?  Unless they are, it should rather be, e. g.:

Integrity checks do include constraint checks, but in this case they
are kept separate. The authors emphasize the fact that you can perform
constraint checks with pg_bulkload on unclean data while keeping the
[expensive] database server integrity checks turned off.
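
For illustration, a control file along the following lines (a sketch
only; the table and file names are made up, and the keys are the ones
I recall from the upstream documentation) is what lets the loader
verify CHECK constraints itself on unclean input:

    # load.ctl -- hypothetical control file
    OUTPUT = public.measurements     # target table
    INPUT = /tmp/measurements.csv    # input data file
    TYPE = CSV
    DELIMITER = ","
    CHECK_CONSTRAINTS = YES          # verify CHECK constraints on the
                                     # (possibly unclean) input rows

    $ pg_bulkload -d mydb load.ctl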


>> PostgreSQL, but version 3.0 or later has some ETL features like input data
>> validation and data transformation with filter functions.
>
>   … but as of version 3.0 some ETL features… were added.
>
>	And what's ETL, BTW?

Extract-Transform-Load - a software development pattern which has
since evolved into an industry of its own. It used to be a nice girl
at a keyboard; nowadays it is implemented with network clusters.


>> In version 3.1, pg_bulkload can convert the load data into the binary file
>> which can be used as an input file of pg_bulkload. If you check whether
>
>	Perhaps:
>
>   As of version 3.1, pg_bulkload can dump the preprocessed data into a
>   binary file, allowing for…

This would not be entirely true. While pg_bulkload does allow
converting the data into a binary file, it requires the assistance of
the server-side components of the package, which one may consider not
to be the pg_bulkload utility itself; and it is certainly not a simple
dump of preprocessed data.
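
To give an idea (the database and file names below are placeholders):
even when the only goal is to produce the binary file, the run still
goes through a database where the server-side part of the package is
installed, something like

    $ pg_bulkload -d stagedb convert.ctl   # parsing is done with the help of
                                           # the server-side components, not
                                           # purely on the client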


>	(Here, the purpose should be mentioned.  Is this for improving
>	the performance of later multiple “bulkloads”, for instance?)

I would say the reverse. Multiple `bulkload' instances perform the
conversion on multiple [satellite] servers, which may populate
[network] storage. Later, a "main" server can pick up the preprocessed
data chunks and load them quickly.

To make use of the pg_bulkload 3.1+ ability to convert the data into
binary form, one currently has to create a rather specific setup. I
would withhold any promises of better performance, as people would
expect "dump the binary locally, then upload it to the server"
functionality. That is hardly feasible if, say, the server and the
client have different CPU types.
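
Roughly, the setup I have in mind looks like the sketch below (the
control-file keys for the binary writer are my reading of the upstream
3.1 documentation; the paths, table and chunk names are invented):

    ## On each [satellite] server: parse the raw data and write the
    ## preprocessed binary chunk onto the shared [network] storage.
    # convert.ctl
    INPUT = /data/raw/chunk-01.csv
    TYPE = CSV
    WRITER = BINARY                      # emit pg_bulkload's binary format
    OUTPUT = /mnt/shared/chunk-01.bin    # into a file instead of a table

    ## Later, on the "main" server: load the already-parsed chunks.
    # load.ctl  (the COL definitions required for TYPE = BINARY are omitted)
    INPUT = /mnt/shared/chunk-01.bin
    TYPE = BINARY
    WRITER = DIRECT
    OUTPUT = public.big_table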

A single server/single storage case is the worst one for the binary
conversion. The process will be constrained by the RAM/storage
bandwidth and slowed down by almost a factor of two.


>> the load time itself. Also in version 3.1, parallel loading works
>> more effectively than before.
>	s/effectively/efficiently/.  But the whole sentence makes little
>	sense, as the earlier versions weren't packaged for Debian.

Good point, thanks!

-- 
Sincerely yours, Alexander Kuznetsov


