Re: reiserfs & databases.
On Wed, 30 Aug 2000, Bulent Murtezaoglu wrote:
> RC> The idea is that the database vendor knows their data storage
> RC> better than the OS can guess it, and that knowledge allows
> RC> them to implement better caching algorithms than the OS can
> RC> use. The fact that benchmark results show that raw partition
> RC> access is slower indicates that the databases aren't written
> RC> as well as they are supposed to be.
>I am not convinced that this conclusion is warranted, though I admit I
>have not seen those benchmarks. The DB vendor's raw disk driver might
I have to admit that I have not seen the benchmarks either. However, one
reason I believe the results are likely to be correct is the problem of
choosing the cache size for the database. If the database does raw access
then it must manage its own cache, and for the sake of sanity it must
mlock() the cache memory (having a disk cache swapped out is stupid, and
doubly stupid when swap is slower than the database storage, as is often
the case). This means that the cache memory is not available to the OS.
If the machine does nothing but database access then this is probably
OK, but such dedicated database servers are quite rare.
If we assume that every database server also runs tasks other than the
database (if only cron jobs for backups, tripwire, reporting, etc.) then
you will be hit by two problems. One is an idle database mlock()ing all
your memory so that active programs run very slowly. The other is the
database being the only active program but being configured not to use
all the memory. If the OS does the caching then it will dynamically
allocate the system memory to whichever process needs it.
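
To make the mlock() point concrete, here is a minimal sketch of how a
database doing its own caching might pin its buffer. The names cache_alloc
and cache_free are made up for illustration, and Linux-style anonymous
mmap() is assumed. Once the lock succeeds, the kernel cannot reclaim those
pages for anything else, no matter how idle the database is.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: allocate a page-aligned cache and try to pin it
 * in RAM. *locked is set to 1 if mlock() succeeded, 0 otherwise. */
void *cache_alloc(size_t bytes, int *locked)
{
    assert(bytes > 0);
    void *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    /* mlock() can fail (ENOMEM, EPERM) if RLIMIT_MEMLOCK is too low;
     * in that case the cache still works but may be swapped out. */
    *locked = (mlock(buf, bytes) == 0);
    return buf;
}

/* Unmapping the region also drops any lock held on it. */
void cache_free(void *buf, size_t bytes)
{
    munmap(buf, bytes);
}
```

The size passed to cache_alloc() is the number the administrator has to
guess in advance, which is exactly the tuning problem described above.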
>be doing things like synchronous writes for maintaining its own
>invariants, while a [non-journalling] file system will care about fs
>meta-data consistency at best. While it is possible that the general
Journalling will make sure that the file system doesn't get trashed after
a crash. The database can call fdatasync() to make sure that its own data
is correctly synchronised. If only part of a file needs to be synced then
you can memory map it and use msync() to synchronise one page while
leaving other data in the write-back cache.
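
A rough sketch of the msync() approach, assuming a POSIX system. The
struct db_map and the db_map_* functions are hypothetical names, not taken
from any real database. The file is mapped once with MAP_SHARED, the
database modifies pages in memory, and only the page that matters is
flushed.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical handle for a data file mapped into memory. */
struct db_map {
    int    fd;
    char  *base;   /* start of the mapping */
    size_t page;   /* system page size */
    size_t pages;  /* number of pages in the file */
};

/* Create (or open) the file, size it to `pages` pages and map it. */
int db_map_open(const char *path, size_t pages, struct db_map *m)
{
    assert(pages > 0);
    m->page  = (size_t)sysconf(_SC_PAGESIZE);
    m->pages = pages;
    m->fd = open(path, O_RDWR | O_CREAT, 0600);
    if (m->fd < 0)
        return -1;
    if (ftruncate(m->fd, (off_t)(pages * m->page)) != 0) {
        close(m->fd);
        return -1;
    }
    m->base = mmap(NULL, pages * m->page, PROT_READ | PROT_WRITE,
                   MAP_SHARED, m->fd, 0);
    if (m->base == MAP_FAILED) {
        close(m->fd);
        return -1;
    }
    return 0;
}

/* Flush exactly one page; other dirty pages stay in the OS cache. */
int db_map_sync_page(const struct db_map *m, size_t page_no)
{
    if (page_no >= m->pages)
        return -1;
    return msync(m->base + page_no * m->page, m->page, MS_SYNC);
}

void db_map_close(struct db_map *m)
{
    munmap(m->base, m->pages * m->page);
    close(m->fd);
}
```

A caller would write into m.base + n * m.page and then call
db_map_sync_page(&m, n) when that page has to be durable.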
>purpose file system with more man-hours behind it is better written,
>the benchmarks might be omitting crucial criteria like crash
>protection and such. Do you guys have references to benchmarking
If the database calls fsync(), fdatasync(), and msync() at the appropriate
times, and the file system and OS implement these system calls correctly,
then the crash protection should be as good as it is going to get. Using
the file system should also reduce the code paths in the database. If the
database writes everything synchronously (as it will want to do with a raw
device) then it will have to use its own write-back cache, which will
involve lots of inter-process or inter-thread communication and other
overhead.
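
As an illustration of calling fdatasync() at the appropriate time, here is
a minimal sketch of a commit path on top of an ordinary file. log_open and
log_commit are hypothetical names, and a real database would do rather
more. Ordinary writes go through the OS write-back cache, and only the
commit waits for the disk.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open (or create) an append-only log file. */
int log_open(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
}

/* Append one record and return only once it has been handed to stable
 * storage. Returns 0 on success, -1 on error. */
int log_commit(int fd, const void *rec, size_t len)
{
    const char *p = rec;
    assert(len > 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* may be a short write */
        if (n < 0) {
            if (errno == EINTR)
                continue;                /* interrupted, retry */
            return -1;
        }
        p   += n;
        len -= (size_t)n;
    }
    /* Flush the data, plus whatever metadata is needed to read it back. */
    return fdatasync(fd);
}
```

Whether fdatasync() really reaches the platter still depends on the file
system and the drive, which is the "correctly implement these system
calls" condition above.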
My current location - X marks the spot.