
Re: Producing rescue images (at least for OpenStack, maybe others?)



On 6/9/20 3:46 PM, Bastian Blank wrote:
> On Tue, Jun 09, 2020 at 02:57:14PM +0200, Thomas Goirand wrote:
>> In OpenStack, there's the possibility to rescue instances with a special
>> image made for it.
> 
> You mean this?  https://docs.openstack.org/nova/latest/user/rescue.html

Yes. That's what I'm talking about.

> According to the documentation, the default behaviour is to use a fresh
> copy of the image already in use by the instance.  So using a
> special rescue image is kind of a special case.

Typically, you'd do:

openstack server rescue --image <RESCUE-IMAGE> --password <PASS> <SERVER>

Yes, you may use the image that was used to spawn the instance, but it
would make a lot of sense to have a special image with all the tooling
to rescue a broken instance. For example, the original image may not
have xfsprogs, but a user may have installed it on their instance after
booting, and may need to use it.

Yes, it's possible to apt-get install all the tools once the rescue
image is booted, but it's nicer to have them all available already if
possible. During disaster recovery, the service may be down, so saving
a bit of precious time is nice.

Plus there's the problem described on the link you pointed to from Red
Hat, which means it may be easier to use a different image.
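
A minimal sketch of the rescue workflow discussed above, wrapped in two
shell functions. The function names and the argument order are made up
for illustration; only the "openstack server rescue" and "openstack
server unrescue" subcommands come from the OpenStack client itself.

```shell
#!/bin/sh
# Sketch only: the server, image and password values are placeholders.

# Put a server into rescue mode, booting it from a dedicated rescue image.
rescue_server() {
    server="$1"; image="$2"; password="$3"
    openstack server rescue --image "$image" --password "$password" "$server"
}

# Return the server to its normal boot disk once the repair is done.
unrescue_server() {
    openstack server unrescue "$1"
}
```

Usage would be something like "rescue_server <SERVER> <RESCUE-IMAGE>
<PASS>", followed by "unrescue_server <SERVER>" once the instance is
fixed.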

>>                    The only thing that changes is the cloud-init
>> configuration, so that it allows:
>> - ssh as root
>> - ssh using a password set by cloud-init (which can be seen with
>> "openstack server show" once the VM is in rescue mode).
> 
> Where do those settings come from?  Is this some kind of convention?  If
> yes, please share them with us.

For the password thing, I'm still trying to get it to work. I believe
that I still don't have the right cloud.cfg. The password mechanism
itself is an OpenStack standard, yes (which I believe was invented for
Windows, where ssh isn't that useful).
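
As a rough sketch of the cloud-init side only: the two settings quoted
above (ssh as root, ssh with a password) map to the standard cloud-init
keys "disable_root" and "ssh_pwauth". The drop-in file name and the
helper function below are hypothetical, and this does not cover getting
the OpenStack-provided password applied, which is the part that isn't
working yet. sshd's own PermitRootLogin setting may also need changing
before a root password login is accepted.

```shell
#!/bin/sh
# Write a cloud-init drop-in enabling root login and ssh password auth.
# The target directory defaults to the usual /etc/cloud/cloud.cfg.d.
write_rescue_cloud_cfg() {
    dir="${1:-/etc/cloud/cloud.cfg.d}"
    cat > "$dir/99-rescue.cfg" <<'EOF'
# Hypothetical file name; both keys are standard cloud-init options.
disable_root: false
ssh_pwauth: true
EOF
}
```

Passing a scratch directory as the first argument lets you inspect the
generated file before putting it into an image.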

Using the ssh key that was originally used to boot the instance still
works, but the project admin trying to repair the broken instance may
not have that key. So a root ssh password set by OpenStack may be
useful in such a case (where the problem one is trying to solve may
precisely be not having access).

I've also set autologin in the image I use, so that users can simply
use the VNC console to get in, without any password.

>> it's *not* reasonable to expect that:
>> - cloud users would use a normal image for rescue
> 
> Using the normal image seems to be the default behaviour.

Well, you aren't *forced* to use the --image parameter, that's true.
But as far as I can tell, it's common for cloud providers to upload a
"rescue image" made just for that purpose, so it's available to
customers.

> Example of public documentation from cloud providers who propose to just
> use the default image:
> https://help.switch.ch/engines/documentation/rescue-vm/

Well, reading this page, they also explain that it's possible to use
alternatives (such as the system-rescue-cd). If they wrote that, it's
probably because they know it's not always the best option to use the
same image.

> Red Hat describes possible problems with that approach:
> https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html/instances_and_images_guide/ch-manage_instances#section-instance-rescue

Unfortunately, the tune2fs workaround may sometimes not be practical,
because in some cases the instance would simply not boot up to
interactive mode (because of that UUID issue, and the system booting on
the "wrong" disk). I have to admit though that I never experienced this
problem, as I have always used rescue mode with the --image option.
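
To illustrate the UUID issue: when the rescue disk and the original disk
come from the same image, their filesystems carry the same UUID, so
anything that mounts by UUID can pick the wrong one. A small sketch that
spots such duplicates follows. The helper name is made up, and the
tune2fs invocation in the comment is only my understanding of the
workaround, so check it against the Red Hat page.

```shell
#!/bin/sh
# Print every UUID that appears more than once on stdin (one UUID per
# line), i.e. the ones that are ambiguous when mounting by UUID.
duplicate_uuids() {
    sort | uniq -d
}

# On a booted rescue VM this could be fed from blkid (needs root):
#   blkid -s UUID -o value | duplicate_uuids
# Assumed workaround for an ext2/3/4 filesystem with a clashing UUID:
#   tune2fs -U random <DEVICE>
```

An empty output means no two filesystems share a UUID.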

> SWITCH for example describes the rescue mode to be used to fix the
> following problems:
> | - ssh key is lost → temporarily enable password login
> | - broken network configuration
> | - broken boot configuration
> | - interactive fsck needed
> None of those tasks require a special image, as the normal ones have
> everything on board to fix those problems.
> 
> Please elaborate which problems you see.

I don't think we should attempt to limit the scope of problems this type
of image would be trying to solve, as I'm quite certain we will miss
some use case.

The above are for sure the most common problems, but on top of those,
I'd add a few more that I have in mind:

- broken /etc/fstab leading to non-booting system
- broken getty leading to non-functional console
- broken iptables rules preventing ssh access
- no root password set (this is the default most of the time) and no way to
get in anymore
- broken boot process (rare, but I have seen it in real life)

And worse ... a combination of some of the above. :P

>> Does the team have any idea of what kind of tools (ie: package names) that
>> we should install in such image? I thought about at least parted, mbr,
>> kpartx, dosfstools, e2fsprogs, qemu-utils, scrub, testdisk, scalpel,
>> gpart, gddrescue, foremost, ddrutility.
>> Anything else?
> 
> Half of that list are recovery tools for hardware errors.  Why would a
> cloud user care about hardware?  Isn't that the provider's job?

Well, in some cases, recovering files from a broken hard drive may be
useful, for example if using Ironic (ie: the OpenStack bare metal
service, for those who don't know; I guess you, Bastian, know what
Ironic is). Also, if it's a private cloud that we're talking about, the
user and the provider can be the same person.
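
For reference, a sketch of how the package list quoted above could be
installed into an image or onto an already booted rescue system. The
variable and function names are made up, the "--yes" flag is my choice,
and I have not checked that every package exists in a given Debian
release.

```shell
#!/bin/sh
# Package names taken from the list quoted above; availability in a
# particular Debian release is not verified here.
RESCUE_PACKAGES="parted mbr kpartx dosfstools e2fsprogs qemu-utils scrub testdisk scalpel gpart gddrescue foremost ddrutility"

# APT can be overridden to preview the command, e.g. APT="echo apt-get".
install_rescue_tools() {
    # Word splitting of APT and RESCUE_PACKAGES is intentional here.
    ${APT:-apt-get} install --yes $RESCUE_PACKAGES
}
```

Running it with APT="echo apt-get" only prints the command, which is
handy for checking the list before touching the system.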

>> Therefore, IMO it'd be nice to also produce such image in our image-set.
> 
> It might make sense to build such an image.
> 
> But please make it into the form of a swiss army knife, so it can work
> off a thumb drive on a hardware machine as well.  Kinda like grml.  It
> would be more or less a hybrid of generic (includes cloud-init) and
> nocloud (can run without any infrastructure, but may require some
> fixes).

I would very much like to add more tools to this type of image. The
list isn't exhaustive, as I just got them off the top of my head and a
bit of apt search. So suggestions would be very welcome.

I like your idea that such a rescue image could have many purposes, such
as installing it on a thumb drive, so it could be useful outside of the
cloud scope too. So yeah, it'd make sense for the Generic and NoCloud
images.

Cheers,

Thomas Goirand (zigo)

