[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Cloud-init delay - sometimes



On Tue, 28 Sept 2021 at 19:10, Noah Meyerhans <noahm@debian.org> wrote:
>
> On Tue, Sep 28, 2021 at 09:48:15AM -0700, Ross Vandegrift wrote:
> > > I'm regularly reinstantiating many AWS instances that run Debian  (bullseye).
> > > I see some strange behavior, in that instance come up very
> > > consistently in about 6 seconds (according to
> > > /var/log/cloud-init-output.log).
> > > But every now and then it takes much longer, in the order of 40 seconds.
> > [snip]
> > >
> > > I guess this is not a problem with the Debian image, and I can
> > > probably configure our deployment to cope with this delay, but I'm
> > > curious as to why this happens.
> >
> > When it happens, take a look at /var/log/cloud-init.log - that'll
> > contain details of cloud-init's actions, and timestamps.  Hopefully the
> > 40s issue will stand out.

The issue I have is that the very first line of
/var/log/cloud-init.log already mentions the long Up time:

2021-09-28 16:13:50,130 - util.py[DEBUG]: Cloud-init v. 20.4.1 running
'init-local' at Tue, 28 Sep 2021 16:13:50 +0000. Up 42.21 seconds.

vs the first line of that file on another host that didn't have the delay:

2021-09-28 15:07:27,909 - util.py[DEBUG]: Cloud-init v. 20.4.1 running
'init-local' at Tue, 28 Sep 2021 15:07:27 +0000. Up 5.52 seconds.

So whatever happened, seems to have happened before cloud-init even started.

The first lines of some other files in /var/log start 4 seconds further:

root@ip-172-16-209-4:/var/log# head -1 *.log
==> auth.log <==
Sep 28 16:13:54 ip-172-16-209-4 useradd[641]: new group: name=admin, GID=1000

==> cloud-init-output.log <==
Cloud-init v. 20.4.1 running 'init-local' at Tue, 28 Sep 2021 16:13:50
+0000. Up 42.21 seconds.

==> cloud-init.log <==
2021-09-28 16:13:50,130 - util.py[DEBUG]: Cloud-init v. 20.4.1 running
'init-local' at Tue, 28 Sep 2021 16:13:50 +0000. Up 42.21 seconds.

==> daemon.log <==
Sep 28 16:13:54 ip-172-16-209-4 systemd[1]: modprobe@configfs.service:
Succeeded.

==> kern.log <==
Sep 28 16:13:54 ip-172-16-209-4 kernel: [    0.000000] Linux version
5.10.0-8-cloud-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian
10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1
SMP Debian 5.10.46-4 (2021-08-03)

==> user.log <==

Sep 28 16:13:55 ip-172-16-209-4 ec2:



> `cloud-init analyze blame` is also helpful for analyzing cloud-init's
> logs.

No smoking guns in there either (i think):

root@ip-172-16-209-4:~# cloud-init analyze blame
-- Boot Record 01 --
     00.54200s (init-network/config-ssh)
     00.32700s (init-local/search-Ec2Local)
     00.20400s (modules-config/config-grub-dpkg)
     00.16000s (init-network/check-cache)
     00.09300s (init-network/config-users-groups)
     00.08300s (modules-final/config-ssh-authkey-fingerprints)
     00.07200s (init-network/config-resizefs)
     00.06900s (modules-final/config-keys-to-console)
     00.04800s (modules-config/config-apt-configure)
     00.04200s (init-network/config-growpart)
     00.02400s (modules-config/config-locale)
     00.01300s (init-network/consume-user-data)
     00.01000s (init-network/config-update_etc_hosts)
     00.00400s (modules-final/config-final-message)
     00.00200s (init-network/config-update_hostname)
     00.00200s (init-network/config-mounts)
     00.00200s (init-network/config-ca-certs)
     00.00100s (modules-final/config-scripts-vendor)
     00.00100s (modules-final/config-rightscale_userdata)
     00.00100s (modules-final/config-phone-home)
     00.00100s (modules-final/config-package-update-upgrade-install)
     00.00100s (modules-final/config-chef)
     00.00100s (modules-config/config-timezone)
     00.00100s (modules-config/config-ssh-import-id)
     00.00100s (modules-config/config-set-passwords)
     00.00100s (modules-config/config-runcmd)
     00.00100s (modules-config/config-ntp)
     00.00100s (modules-config/config-byobu)
     00.00100s (modules-config/config-apt-pipelining)
     00.00100s (init-network/setup-datasource)
     00.00100s (init-network/config-write-files)
     00.00100s (init-network/config-set_hostname)
     00.00100s (init-network/config-seed_random)
     00.00100s (init-network/config-rsyslog)
     00.00100s (init-network/config-disk_setup)
     00.00100s (init-network/activate-datasource)
     00.00100s (init-local/check-cache)
     00.00000s (modules-final/config-scripts-user)
     00.00000s (modules-final/config-scripts-per-once)
     00.00000s (modules-final/config-scripts-per-instance)
     00.00000s (modules-final/config-scripts-per-boot)
     00.00000s (modules-final/config-salt-minion)
     00.00000s (modules-final/config-puppet)
     00.00000s (modules-final/config-power-state-change)
     00.00000s (modules-final/config-mcollective)
     00.00000s (modules-final/config-fan)
     00.00000s (modules-config/config-emit_upstart)
     00.00000s (modules-config/config-disable-ec2-metadata)
     00.00000s (init-network/consume-vendor-data)
     00.00000s (init-network/config-migrator)
     00.00000s (init-network/config-bootcmd)

1 boot records analyzed

> How often does this happen?  What instance types are you using when you
> experience this issue?

Very roughly one out of ten VMs do this.

I think I've found some clue, in the instance's system log:


[    2.845436] ena 0000:00:05.0: LLQ is not supported Fallback to host
mode policy.
[    2.856881] ena 0000:00:05.0: Elastic Network Adapter (ENA) found
at mem febf4000, mac addr 02:3b:60:14:f2:ea
[    2.866980] ena 0000:00:05.0 ens5: renamed from eth0
Begin: Loading essential drivers ... done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
Begin: Will now check root file system ... fsck from util-linux 2.36.1
[/sbin/fsck.ext4 (1) -- /dev/nvme0n1p1] fsck.ext4 -a -C0 /dev/nvme0n1p1
/dev/nvme0n1p1: clean, 35442/516096 files, 242371/2064379 blocks
done.
[    9.042449] EXT4-fs (nvme0n1p1): mounted filesystem with ordered
data mode. Opts: (null)
done.
Begin: Running /scripts/local-bottom ... [   28.809447]  nvme0n1: p1 p14 p15
GROWROOT: CHANGED: partition=1 start=262144 old:[   28.819359]
nvme0n1: p1 p14 p15
 size=16515039 end=16777183 new: size=41680863 end=41943007
[   40.228970] EXT4-fs (nvme0n1p1): mounted filesystem with ordered
data mode. Opts: (null)
done.
Begin: Running /scripts/init-bottom ... done.
[   40.258874] Not activating Mandatory Access Control as
/sbin/tomoyo-init does not exist.
[   40.359770] systemd[1]: Inserted module 'autofs4'


And this is from a non-problematic instance:


[    2.490628] ena 0000:00:05.0: LLQ is not supported Fallback to host
mode policy.
[    2.502639] ena 0000:00:05.0: Elastic Network Adapter (ENA) found
at mem febf4000, mac addr 02:43:23:8c:27:0e
[    2.513028] ena 0000:00:05.0 ens5: renamed from eth0
Begin: Loading essential drivers ... done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
Begin: Will now check root file system ... fsck from util-linux 2.36.1
[/sbin/fsck.ext4 (1) -- /dev/nvme0n1p1] fsck.ext4 -a -C0 /dev/nvme0n1p1
/dev/nvme0n1p1: clean, 35442/516096 files, 242371/2064379 blocks
done.
[    2.648924] EXT4-fs (nvme0n1p1): mounted filesystem with ordered
data mode. Opts: (null)
done.
Begin: Running /scripts/local-bottom ... [    2.685130] input: AT
Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/inp
ut0
[    3.137992]  nvme0n1: p1 p14 p15
GROWROOT: CHANGED: partition=1 start=262144 old: size=16515039
end=16777183 new: size=41680863 end=41943007
[    3.184491] EXT4-fs (nvme0n1p1): mounted filesystem with ordered
data mode. Opts: (null)
done.
Begin: Running /scripts/init-bottom ... done.
[    3.210828] Not activating Mandatory Access Control as
/sbin/tomoyo-init does not exist.
[    3.368117] systemd[1]: Inserted module 'autofs4'


These instances with the delay are t3a.medium or t3.medium. Their
filesystems are all configured to be 20G.


Dick


--
Dick Visser
Trust & Identity Service Operations Manager
GÉANT


Reply to: