Bug#695182: Write couple of 1GB files for OOM crash

To: Paul Szabo <paul.szabo@sydney.edu.au>
Cc: 695182@bugs.debian.org, Ben Hutchings <ben@decadent.org.uk>
Subject: Bug#695182: Write couple of 1GB files for OOM crash
From: Jonathan Nieder <jrnieder@gmail.com>
Date: Sun, 6 Jan 2013 17:20:25 -0800
Message-id: <[🔎] 20130107012024.GC3823@elie.Belkin>
Reply-to: Jonathan Nieder <jrnieder@gmail.com>, 695182@bugs.debian.org
In-reply-to: <[🔎] 20130106233039.GA32071@maths.usyd.edu.au>
References: <[🔎] 20130101213303.GA19208@maths.usyd.edu.au> <[🔎] 1357083551.31299.54.camel@deadeye.wl.decadent.org.uk> <[🔎] 20130106233039.GA32071@maths.usyd.edu.au>

Hi Paul,

Paul Szabo wrote:

> I changed the proposed patch accordingly, scripts/checkpatch.pl produces
> just a few warnings. I had my patch in use for a while now, so I believe
> it is suitably tested.

Thanks much --- this is easier to read.

Here's a quick review from a non-expert (i.e., me).  When the patch
seems ready to you, please send it inline in a message to
linux-mm@kvack.org, cc-ing linux-kernel@vger.kernel.org and Ben and me
(or this bug log) so we can track it.  The mm folks will probably have
more feedback, and we can take it from there.

> Avoid OOM when filesystem caches fill lowmem and are not reclaimed,
> doing drop_caches at that point. The issue is easily reproducible on
> machines with over 32GB RAM. The patch correctly protects against OOM.
> The added call to drop_caches has been observed to trigger "needlessly"
> but on quite rare occasions only.

Sounds like a reasonable strategy.

> Also included are several minor fixes:

Those should go in separate patches, but if you're just seeking
comments then lumping them with the main change in a patch with
[RFC/PATCH] in the subject line is fine.

[...]
> Signed-off-by: Paul Szabo <psz@maths.usyd.edu.au>

Thanks.

> --- fs/drop_caches.c.old	2012-10-17 13:50:15.000000000 +1100
> +++ fs/drop_caches.c	2013-01-04 21:52:47.000000000 +1100
> @@ -65,3 +65,10 @@ int drop_caches_sysctl_handler(ctl_table
>  	}
>  	return 0;
>  }
> +
> +/* Easy call: do "echo 3 > /proc/sys/vm/drop_caches" */

I don't understand this comment.  What does "Easy call" mean?

> +void easy_drop_caches(void)
> +{
> +	iterate_supers(drop_pagecache_sb, NULL);
> +	drop_slab();
> +}

This should be declared in some appropriate header (e.g.,
include/linux/mm.h).

> --- mm/page-writeback.c.old	2012-10-17 13:50:15.000000000 +1100
> +++ mm/page-writeback.c	2013-01-06 21:54:59.000000000 +1100
> @@ -39,7 +39,8 @@
>  /*
>   * Sleep at most 200ms at a time in balance_dirty_pages().
>   */
> -#define MAX_PAUSE		max(HZ/5, 1)
> +/* Might as well be max(HZ/5,4) to ensure max_pause/4>0 always */
> +#define MAX_PAUSE		max(HZ/5, 4)

This is one of the "while at it"s rather than the main point of
the patch, right?

[...]
> @@ -343,11 +344,26 @@ static unsigned long highmem_dirtyable_m
>  unsigned long determine_dirtyable_memory(void)
>  {
>  	unsigned long x;
> +	int y = 0;
> +	extern int min_free_kbytes;

Probably should be declared in a header like mm/internal.h, but
there's precedent for the ad-hoc declaration, so meh.

[...]
> +	/*
> +	 * Seems that highmem_is_dirtyable is only used here, in the
> +	 * calculation of limits and threshholds of dirtiness, not in deciding
> +	 * where to put dirty things. Is that so? Is that as should be?
> +	 * What is the recommended setting of highmem_is_dirtyable?
> +	 */
>  	if (!vm_highmem_is_dirtyable)
>  		x -= highmem_dirtyable_memory(x);

I dunno.  See 195cf453d2c3 (mm/page-writeback: highmem_is_dirtyable
option, 2008-02-04) and the thread surrounding
<http://thread.gmane.org/gmane.linux.alsa.devel/49438/focus=603793>.

> +	/* Subtract min_free_kbytes */
> +	if (min_free_kbytes > 0)
> +		y = min_free_kbytes >> (PAGE_SHIFT - 10);

Why the "if"?

If I were doing it, I think I'd unconditionally set 'y' here, for
clarity and to avoid bugs if some later patch reuses the var for
something else.

	y = min_free_kbytes >> (PAGE_SHIFT - 10);
	x -= min(x, y);

[...]
> @@ -559,7 +578,7 @@ static unsigned long bdi_position_ratio(
>  	 *     => fast response on large errors; small oscillation near setpoint
>  	 */
>  	setpoint = (freerun + limit) / 2;
> -	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
>  		    limit - setpoint + 1);

Why are 'setpoint' and 'dirty' of type unsigned long instead of long
in the first place?  What happens if one of their values overflows a
long?

>  	pos_ratio = x;
>  	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> @@ -995,6 +1014,13 @@ static unsigned long bdi_max_pause(struc
>  	 * The pause time will be settled within range (max_pause/4, max_pause).
>  	 * Apply a minimal value of 4 to get a non-zero max_pause/4.
>  	 */
> +	/*
> +	 * On large machine it seems we always return 4,
> +	 * on smaller desktop machine mostly return 5 (rarely 9 or 14).
> +	 * Are those too small? Should we return something fixed e.g.
> +	return (HZ/10);
> +	 * instead of this wasted/useless calculation?
> +	 */
>  	return clamp_val(t, 4, MAX_PAUSE);

Another "while at it", I guess.

>  }
>  
> @@ -1109,6 +1135,11 @@ static void balance_dirty_pages(struct a
>  		}
>  		pause = HZ * pages_dirtied / task_ratelimit;
>  		if (unlikely(pause <= 0)) {
> +			/*
> +			 * Not unlikely: often we get zero.
> +			 * Seems we always get 0 on large machine.
> +			 * Should not do a pause of 1 here?
> +			 */
>  			trace_balance_dirty_pages(bdi,

"git log -S'if (unlikely(pause <= 0))' -- mm/page-writeback.c" tells
me this is from 57fc978cfb61 (writeback: control dirty pause time,
2011-06-11), in case that helps.

[...]
> --- mm/vmscan.c.old	2012-10-17 13:50:15.000000000 +1100
> +++ mm/vmscan.c	2013-01-06 09:50:49.000000000 +1100
[...]
> @@ -2726,9 +2731,87 @@ loop_again:
>  				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
>  				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
>  				total_scanned += sc.nr_scanned;
> +				if (unlikely(
> +				    i == 1 &&
> +				    nr_slab < 10 &&
> +				    (reclaim_state->reclaimed_slab) < 10 &&
> +				    zone_page_state(zone, NR_SLAB_RECLAIMABLE) > 10 &&
> +				    !zone_watermark_ok_safe(zone, order,
> +						high_wmark_pages(zone), end_zone, 0))) {
> +					/*
> +					 * We are stressed (desperate), better

This is getting really deeply nested.  Would it be possible to split
out a function so this code could be more easily contemplated in
isolation?

Thanks for taking this on and hope that helps.

Regards,
Jonathan

Reply to:

Follow-Ups:
- Bug#695182: Write couple of 1GB files for OOM crash
  - From: paul.szabo@sydney.edu.au

References:
- Bug#695182: linux-image-3.2.0-4-686-pae: Write couple of 1GB files for OOM crash
  - From: Paul Szabo <paul.szabo@sydney.edu.au>
- Bug#695182: linux-image-3.2.0-4-686-pae: Write couple of 1GB files for OOM crash
  - From: Ben Hutchings <ben@decadent.org.uk>
- Bug#695182: linux-image-3.2.0-4-686-pae: Write couple of 1GB files for OOM crash
  - From: Paul Szabo <paul.szabo@sydney.edu.au>

Prev by Date: Bug#695182: linux-image-3.2.0-4-686-pae: Write couple of 1GB files for OOM crash
Next by Date: Bug#624343: data corruption: md changes max_sector setting of running md devices
Previous by thread: Bug#695182: linux-image-3.2.0-4-686-pae: Write couple of 1GB files for OOM crash
Next by thread: Bug#695182: Write couple of 1GB files for OOM crash
Index(es):
- Date
- Thread