
Re: Can we build a proper email cluster? (was: Re: Why is debian.org email so unreliable?)



On Wed, 13 Oct 2004 07:29, Henrique de Moraes Holschuh <hmh@debian.org> wrote:
> We have a lot of resources, why can't we invest some of them into a small
> three or four machine cluster to handle all debian email (MLs included),

A four-machine cluster can handle the entire email needs of a 500,000-user 
ISP.  I really doubt that we need that much hardware.

> and tune the entire thing from the ground up just for that? And use it
> *only* for that?  That would be enough for two MX, one ML expander and one
> extra machine for whatever else we need. Maybe more, but from two (master +
> murphy) to four optimized and exclusive-for-email machines should be a
> good start :)

I think that front-end MX machines are a bad idea in this environment.  They 
mean more work is required to correctly give 55x codes in response to 
non-existent recipients (vitally important for list servers, which receive 
huge volumes of mail to random-name@list-server and should not generate 
bounces for it).

We don't have the performance requirements that would justify front-end MX 
machines.
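
To make the point concrete, here is a minimal sketch (Python, with a 
hypothetical path for a synced recipient list -- not anything we currently 
run) of the check a front-end MX would have to make at RCPT time.  It only 
works if the full list of valid addresses is kept in sync from the back-end, 
and that syncing is exactly the extra work I mean:

# Sketch of the recipient check a front-end MX would need.  The file path
# is hypothetical; the point is that the full valid-recipient list must be
# synced out to the front-end before a 550 can be given at RCPT time.

VALID_RCPT_FILE = "/var/lib/mx/valid-recipients"   # hypothetical synced copy

def load_valid_recipients(path=VALID_RCPT_FILE):
    """One lowercase address per line."""
    with open(path) as f:
        return set(line.strip().lower() for line in f if line.strip())

def rcpt_response(address, valid):
    """SMTP response to give at RCPT TO time.

    Rejecting here (55x) leaves the problem with the sending relay;
    accepting and bouncing later turns forged spam into backscatter.
    """
    if address.strip().lower() in valid:
        return "250 OK"
    return "550 5.1.1 No such user here"

valid = load_valid_recipients()
print(rcpt_response("random-name@lists.debian.org", valid))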

> collaborative work needs the MLs in tip-top shape, or it suffers a LOT. Way,
> way too many developers use @debian.org as their primary Debian contact
> address (usually the ONLY well-advertised one), and get out of the loop
> every time master.d.o croaks.

OK, having a single dedicated mail server instead of a general machine like 
master makes sense.

> One of the obvious things that come to mind is that we should have MX
> machines with very high disk throughput, of the kinds we need RAID 0 on top
> of RAID 1 to get.  Proper HW RAID (defined as something as good as the
> Intel SCRU42X fully-fitted) would help, but even LVM+MD allied to proper
> SCSI U320 hardware would give us more than 120MB/s read throughput (I have
> done that).

U320 is not required.  I don't believe that you can demonstrate any 
performance difference between U160 and U320 for mail server use if you have 
fewer than 10 disks on a cable.  Having large numbers of disks on a cable 
brings other issues, so I recommend a scheme that has only a single disk per 
cable (S-ATA or Serial Attached SCSI).

RAID-0 on top of RAID-1 should not be required either.  Hardware RAID-5 with 
an NV-RAM log device should give all the performance that you require.

You will NEVER see 120MB/s of read throughput on a properly configured mail 
server that serves data for fewer than about 10,000,000 users!  When I was 
running the servers for 1,000,000 users there was a total of about 3MB/s 
(combined read and write) on each of the five back-end servers.  That is a 
total of 15MB/s while each server had 4 * U160-15K disks (20 * U160-15K 
disks in total).  The bottleneck was seeks; nothing else mattered.
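
A quick back-of-envelope calculation shows why.  The seek and rotational 
latency figures below are assumptions typical of 15K RPM disks, and the 4KB 
average I/O size is an assumption too, not a measurement from those servers:

# Rough estimate of what 20 seek-bound 15K RPM disks can deliver.

avg_seek_ms = 3.8        # assumed average seek time for a 15K RPM disk
avg_rotation_ms = 2.0    # half a revolution at 15,000 RPM
io_size_kb = 4           # assumed average random I/O size

iops_per_disk = 1000.0 / (avg_seek_ms + avg_rotation_ms)   # ~170
disks = 20
throughput_mb = disks * iops_per_disk * io_size_kb / 1024.0

print("%.0f random IOs/sec per disk" % iops_per_disk)
print("%.1f MB/s across %d disks" % (throughput_mb, disks))
# About 170 IOs/sec per disk and 13MB/s total -- the same ballpark as the
# 15MB/s above, and a tiny fraction of what those disks do sequentially.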

> Maybe *external* journals on the performance-critical filesystems would
> help (although data=journal makes that a *big* maybe for the spools, the
> logging on /var always benefits from an external journal). And in that case,
> we'd obviously need two IO-independent RAID arrays. That means at least 6
> disks, but all of them can be small disks.

http://www.umem.com/16GB_Battery_Backed_PCI_NVRAM.html

If you want to use external journals then use a umem device for them.  The 
above URL advertises NV-RAM devices with capacities up to 16GB which run at 
64-bit 66MHz PCI speed.  Such a device takes less space inside a PC than real 
disks, produces less noise, has no moving parts (good for reliability), and 
has ZERO seek time as well as massive throughput.

Put /var/spool on it, along with the external journal for the mail store, and 
your mail server should be decently fast!

> The other is to use a filesystem that copes very well with power failures,
> and tune it for spool work (IMHO a properly tuned ext3 would be best, as
> XFS has data integrity issues on crashes even if it is faster (and maybe
> the not-even-data=ordered XFS way of life IS the reason it is so fast). I
> don't know about ReiserFS 3, and ReiserFS 4 is too new to trust IMHO).

reiserfsck has a long history of not being able to fix all possible errors.  
A corrupted ReiserFS file system can cause a kernel oops, and the ReiserFS 
developers don't treat that as a serious issue.

ext3 is the safe bet for most Linux use.  It is popular enough that you can 
reasonably expect bugs to be found by someone else first, and the developers 
have a good attitude about what counts as a file system bug.

> The third is to not use LDAP for lookups, but rather cache them all in a
> local, extremely fast DB (I hope we are already doing that!).  That alone
> could get us a big speed increase on address resolution and rewriting,
> depending on how the MTA is configured.

I've run an ISP with more than 1,000,000 users with LDAP as the back-end 
directory.  The way it worked was that mail came to front-end servers, which 
did an LDAP lookup to determine which back-end server to deliver to.  The 
back-end servers did LDAP lookups to determine the directory to put the mail 
in.  When users checked mail via POP or IMAP, Perdition did an LDAP lookup to 
determine which back-end server to proxy the connection to, and then the 
back-end server had Courier POP or IMAP do another LDAP lookup.  It worked 
fine with about 5 LDAP servers for 1,000,000 users.
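
For anyone who hasn't seen such a setup, the front-end routing lookup is 
about this much work.  This is a sketch using python-ldap; the URI, base DN 
and attribute name are placeholders, not the schema that ISP (or debian.org) 
uses:

# Sketch of the front-end routing lookup described above, via python-ldap.
import ldap
from ldap.filter import escape_filter_chars

LDAP_URI = "ldap://ldap.example.org"          # placeholder
BASE_DN = "ou=people,dc=example,dc=org"       # placeholder

def backend_for(address):
    """Return the back-end server holding this user's mailbox, or None."""
    conn = ldap.initialize(LDAP_URI)
    conn.simple_bind_s()                      # anonymous bind
    results = conn.search_s(
        BASE_DN, ldap.SCOPE_SUBTREE,
        "(mail=%s)" % escape_filter_chars(address),
        ["mailHost"])
    if not results:
        return None                           # unknown user: give a 55x
    dn, attrs = results[0]
    return attrs["mailHost"][0].decode()

The back-end servers and Perdition do essentially the same query against 
different attributes, and the local cache suggested above would just be a 
matter of memoising backend_for().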

As we have far fewer users, we should be able to use a single LDAP server 
with no performance issues.  If there are LDAP performance issues then they 
shouldn't be difficult to solve; I can offer advice on this if I am given 
details of what's happening.

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page




