Bug#695182: Write couple of 1GB files for OOM crash

To: jrnieder@gmail.com
Cc: 695182@bugs.debian.org, ben@decadent.org.uk
Subject: Bug#695182: Write couple of 1GB files for OOM crash
From: paul.szabo@sydney.edu.au
Date: Mon, 7 Jan 2013 14:03:17 +1100
Message-id: <[🔎] 201301070303.r0733Hs0026687@como.maths.usyd.edu.au>
Reply-to: paul.szabo@sydney.edu.au, 695182@bugs.debian.org
In-reply-to: <[🔎] 20130107012024.GC3823@elie.Belkin>

Dear Jonathan,

> Here's a quick review from a non-expert (i.e., me). ...

Thanks!

> ... When the patch
> seems ready to you, please send it inline in a message to
> linux-mm@kvack.org, cc-ing linux-kernel@vger.kernel.org and Ben and me
> (or this bug log) so we can track it.  The mm folks will probably have
> more feedback, and we can take it from there.

I was hoping that Ben would take that on; but OK, will do (first waiting
for maybe Ben's comments, and your reply to my questions below).

>> Also included are several minor fixes:
> Those should go in separate patches, but if you're just seeking
> comments then lumping them with the main change in a patch with
> [RFC/PATCH] in the subject line is fine.

In a sense, I am seeking comments on why lowmem is ever polluted with
caches and why slabs are not cleared up without dropping pagecache, as
I feel that the "real bug" is lurking there. - I admit that I do not
understand MM, it seems a mess to my un-educatedf eyes. - Some of those
comments are questions, some others I am sure are bugs.

OK, will use [RFC/PATCH] in the subject.

>> +/* Easy call: do "echo 3 > /proc/sys/vm/drop_caches" */
>
> I don't understand this comment.  What does "Easy call" mean?
>
>> +void easy_drop_caches(void)
>> +{
>> +	iterate_supers(drop_pagecache_sb, NULL);
>> +	drop_slab();
>> +}
>
> This should be declared in some appropriate header (e.g.,
> include/linux/mm.h).

I added one new call, so I could easily call it from elsewhere without
having to (cross?)-declare many things; in fact would have preferred to
call iterate_supers() and drop_slab() directly.

Yes, easy_drop_caches() might be declared in some include file, but
should that be <linux/mm.h> or <linux/fs.h>? Would add another file to
those changed, seemed more direct this way.

Should I change something?

>> --- mm/page-writeback.c.old	2012-10-17 13:50:15.000000000 +1100
>> +++ mm/page-writeback.c	2013-01-06 21:54:59.000000000 +1100
>> @@ -39,7 +39,8 @@
>>  /*
>>   * Sleep at most 200ms at a time in balance_dirty_pages().
>>   */
>> -#define MAX_PAUSE		max(HZ/5, 1)
>> +/* Might as well be max(HZ/5,4) to ensure max_pause/4>0 always */
>> +#define MAX_PAUSE		max(HZ/5, 4)
>
> This is one of the "while at it"s rather than the main point of
> the patch, right?

Yes, it was one of the unimportant oddities I noticed.

> [...]
>> @@ -343,11 +344,26 @@ static unsigned long highmem_dirtyable_m
>>  unsigned long determine_dirtyable_memory(void)
>>  {
>>  	unsigned long x;
>> +	int y = 0;
>> +	extern int min_free_kbytes;
>
> Probably should be declared in a header like mm/internal.h, but
> there's precedent for the ad-hoc declaration, so meh.

Yes, I noticed some precedents.

> [...]
>> +	/*
>> +	 * Seems that highmem_is_dirtyable is only used here, in the
>> +	 * calculation of limits and threshholds of dirtiness, not in deciding
>> +	 * where to put dirty things. Is that so? Is that as should be?
>> +	 * What is the recommended setting of highmem_is_dirtyable?
>> +	 */
>>  	if (!vm_highmem_is_dirtyable)
>>  		x -= highmem_dirtyable_memory(x);
>
> I dunno.  See 195cf453d2c3 (mm/page-writeback: highmem_is_dirtyable
> option, 2008-02-04) and the thread surrounding
> <http://thread.gmane.org/gmane.linux.alsa.devel/49438/focus=603793>.

That made my head spin... Thanks for the pointers, will go through them
sometime.

>> +	/* Subtract min_free_kbytes */
>> +	if (min_free_kbytes > 0)
>> +		y = min_free_kbytes >> (PAGE_SHIFT - 10);
>
> Why the "if"?

Sanity-check paranoia: because I do not trust the input values. Maybe
should have
  BUG_ON(min_free_kbytes < 0);
but that does not belong there.

> If I were doing it, I think I'd unconditionally set 'y' here, for
> clarity and to avoid bugs if some later patch reuses the var for
> something else.
>
> 	y = min_free_kbytes >> (PAGE_SHIFT - 10);
> 	x -= min(x, y);

Do you think I should change my patch? To me, it seemed clearer the way 
I had it.

> [...]
>> @@ -559,7 +578,7 @@ static unsigned long bdi_position_ratio(
>>  	 *     => fast response on large errors; small oscillation near setpoint
>>  	 */
>>  	setpoint = (freerun + limit) / 2;
>> -	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
>> +	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
>>  		    limit - setpoint + 1);
>
> Why are 'setpoint' and 'dirty' of type unsigned long instead of long
> in the first place?  What happens if one of their values overflows a
> long?

I guess because most other such things are also unsigned long; and I
guess they were so when they meant memory addresses. These are now page
counts, cannot overflow (with current limit of 64GB RAM).

>> @@ -995,6 +1014,13 @@ static unsigned long bdi_max_pause(struc
>>  	 * The pause time will be settled within range (max_pause/4, max_pause).
>>  	 * Apply a minimal value of 4 to get a non-zero max_pause/4.
>>  	 */
>> +	/*
>> +	 * On large machine it seems we always return 4,
>> +	 * on smaller desktop machine mostly return 5 (rarely 9 or 14).
>> +	 * Are those too small? Should we return something fixed e.g.
>> +	return (HZ/10);
>> +	 * instead of this wasted/useless calculation?
>> +	 */
>>  	return clamp_val(t, 4, MAX_PAUSE);
>
> Another "while at it", I guess.

Yes. On one hand the code sometimes pauses for HZ/10 or so, and on the
other hand we have this routine working so hard to return the minimum
possible value.

>> @@ -1109,6 +1135,11 @@ static void balance_dirty_pages(struct a
>>  		}
>>  		pause = HZ * pages_dirtied / task_ratelimit;
>>  		if (unlikely(pause <= 0)) {
>> +			/*
>> +			 * Not unlikely: often we get zero.
>> +			 * Seems we always get 0 on large machine.
>> +			 * Should not do a pause of 1 here?
>> +			 */
>>  			trace_balance_dirty_pages(bdi,
>
> "git log -S'if (unlikely(pause <= 0))' -- mm/page-writeback.c" tells
> me this is from 57fc978cfb61 (writeback: control dirty pause time,
> 2011-06-11), in case that helps.

Will try to look it up sometime. - I had printk() to tell me the value
of pause, and mostly I got zero. Wonder how others measured it.

> [...]
>> --- mm/vmscan.c.old	2012-10-17 13:50:15.000000000 +1100
>> +++ mm/vmscan.c	2013-01-06 09:50:49.000000000 +1100
> [...]
>> @@ -2726,9 +2731,87 @@ loop_again:
>>  				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
>>  				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
>>  				total_scanned += sc.nr_scanned;
>> +				if (unlikely(
>> +				    i == 1 &&
>> +				    nr_slab < 10 &&
>> +				    (reclaim_state->reclaimed_slab) < 10 &&
>> +				    zone_page_state(zone, NR_SLAB_RECLAIMABLE) > 10 &&
>> +				    !zone_watermark_ok_safe(zone, order,
>> +						high_wmark_pages(zone), end_zone, 0))) {
>> +					/*
>> +					 * We are stressed (desperate), better
>
> This is getting really deeply nested.  Would it be possible to split
> out a function so this code could be more easily contemplated in
> isolation?

Hmm... I would much prefer to leave it as is.

Thanks, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

Reply to:

Follow-Ups:
- Bug#695182: Write couple of 1GB files for OOM crash
  - From: Jonathan Nieder <jrnieder@gmail.com>
- Bug#695182: Write couple of 1GB files for OOM crash
  - From: Ben Hutchings <ben@decadent.org.uk>

References:
- Bug#695182: Write couple of 1GB files for OOM crash
  - From: Jonathan Nieder <jrnieder@gmail.com>

Prev by Date: Bug#624343: data corruption: md changes max_sector setting of running md devices
Next by Date: Bug#695182: Write couple of 1GB files for OOM crash
Previous by thread: Bug#695182: Write couple of 1GB files for OOM crash
Next by thread: Bug#695182: Write couple of 1GB files for OOM crash
Index(es):
- Date
- Thread