Hello all,

I might be doing something wrong, but I'm seeing really strange behavior in a program I'm writing. Here's the scenario: I've got a daemon which creates one thread at startup. The main daemon process listens on a Unix socket, reads data from the client, builds a request (malloc'ing storage for a structure), and puts it into a circular buffer. The circular buffer and all variables related to it are protected by a mutex. The thread created at startup blocks waiting for a condition variable, which the main program signals right after it puts a new request into the queue. When that happens, the thread wakes up, takes the request from the queue, and unlocks the mutex that protects it.

So far, so good - but the problem is that the data the thread takes from the queue is partially corrupted: the address of the circular buffer is correct, the address of the malloc'ed request structure is correct, but the data stored in the structure is apparently random. The interesting thing is that the actual request data stored in the buffer is _not_ corrupted as seen from the main process. Here's a fragment of a debug log from the daemon:

  [prog] Inserting req == 0x804f258; nto == 2; to == 0x804f300
  [prog] Putting into queue 0x804d0a0 slot 0
  [prog] Inserted req == 0x804f258; nto == 2; to == 0x804f300

At this point the condition is broadcast and the thread wakes up:

  [thread] Checking req == 0x804f258; nto == 2; to == (nil)

Note that the address of the request is correct but the data is corrupted.

  [thread] Getting from queue 0x804d0a0 slot 0

The queue address is also correct.

  [thread] Retrieved req == 0x804f258; nto == 2; to == 0x30613064

Random data appears in the req storage.

  [thread] Got req == 0x804f258; nto == 1869881403; to == 0x203b3835

Even more randomness. All of the above output from the thread is produced with the lock held; there is no way the data could be modified in the meantime.
Now we're back in the main program, after pthread_cond_broadcast returns and the mutex is unlocked:

  [prog] Checking(2) req == 0x804f258; nto == 2; to == 0x804f300

And the data is correct again. The code is not complex and I'm positive that there is no race condition when accessing the data. The variables and the routines that operate on the queue (always called with the mutex held) are as follows:

----- CUT -----
static pthread_mutex_t reqq_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  reqq_cond  = PTHREAD_COND_INITIALIZER;
static unsigned long reqq_size      = 0;
static unsigned long reqq_write_idx = 0;
static unsigned long reqq_read_idx  = 0;
static vda_request **reqq = NULL;

inline static int rq_empty()
{
    return reqq_write_idx == reqq_read_idx;
}

inline static int rq_full()
{
    return ((reqq_write_idx + 1) % reqq_size) == reqq_read_idx;
}

vda_request *rq_get()
{
    vda_request *ret;

    if (rq_empty())
        return NULL;
    logmsg("Getting from queue %p slot %lu", reqq, reqq_read_idx);
    ret = reqq[reqq_read_idx++];
    reqq_read_idx %= reqq_size;
    logmsg("Retrieved req == %p; nto == %d; to == %p",
           ret, ret->nto, ret->to);
    return ret;
}

int rq_insert(vda_request *req)
{
    if (rq_full())
        return 1;
    logmsg("Putting into queue %p slot %lu", reqq, reqq_write_idx);
    reqq[reqq_write_idx++] = req;
    reqq_write_idx %= reqq_size;
    logmsg("Inserted req == %p; nto == %d; to == %p",
           reqq[reqq_read_idx], reqq[reqq_read_idx]->nto,
           reqq[reqq_read_idx]->to);
    return 0;
}
----- CUT -----

The glibc is 2.3.1-10 from Sid, unmodified. I was trying to build one with NPTL and TLS support to check whether the same thing would happen, but NPTL 0.17 doesn't really want to compile against the latest glibc from CVS, so I've given up on that for now.

Is my understanding correct that a malloc'ed memory area can be shared between the threads of a process? That's what I've always known and thought, but I might be wrong. Or is this some kind of obscure bug in glibc 2.3.1?

TIA,
marek