
Re: Systemd



On Wed, 2015-01-21 at 17:58 +0100, Matthias Urlichs wrote:
> You seem to have sent this email before you finished writing it.

Correct.  It was late, and the tone of the post was heading in the wrong
direction, so I gave up for the night and pressed "Save to Drafts".
This morning I discovered I had apparently missed the "Save to Drafts"
button.

The post wasn't going to continue in the same direction.  Regardless, I
stand by my assertion that you _will_ be running journald if you use
systemd.  And while other people might make different design decisions,
if I wanted logging in a resource-constrained environment (think battery
powered) where any unnecessary overhead is too much, using systemd would
leave me little choice about which logging system I drop in.

It is also true that when I first came across systemd, I was taken
aback by the idea that the designers of an init system would consider it
reasonable to force me to adopt their logging system, which was
binary(?!?!).  What on earth does a logging system have to do with init?

Only later did I manage to put 2+2 together and come up with 4.
Journald started life as a way to capture stdout and stderr from the
things systemd started.  And it just so happens that just about all
programs that run in the background are started by systemd.  That's how
it sees its job, after all - as a super-server that replaces init, cron,
inetd, and so on - so all background programs are started by it.  So far
from being unrelated, systemd and journald are addressing two facets of
the same problem.  Systemd starts the background programs, and journald
lets those same programs tell the sysadmin something.
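To make that concrete, here is a minimal sketch of a unit (the name and
command are made up) whose plain stdout ends up in the journal, with no
logging API involved:

```ini
# /etc/systemd/system/hello.service -- hypothetical example unit
[Unit]
Description=Demo: stdout is captured by journald

[Service]
# A plain echo to stdout; journald records it, tagged with this unit's name.
ExecStart=/bin/sh -c 'echo hello from hello.service'
# This is the default destination anyway; shown explicitly to make the point.
StandardOutput=journal
```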

And surprise, surprise, it turns out that the effects of design
decisions like that ripple out across the system.  Firstly, you notice
that simply writing stuff to stdout is far easier than using some
logging API - even shell scripts can do it.  Secondly, some metadata
becomes automagically available - like the path of the program writing
it, and the systemd unit that started it.  The users of the logging
data, sysadmins, then notice that finding the log entries they are
interested in becomes easier - because the thing they search on *is* the
name of the files they use to control the system (unit file names and
executable file names).  No more guessing between [CRON] / [cron], or
[ppp] / [pppd], or whatever that daemon's programmer decided on the day.
And no more figuring out which of the many files under /var/log it will
end up in.  It's nice when such minor irritants just vanish.
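In practice that searching looks something like this (the unit names
here are just examples):

```shell
# All entries from the cron unit - no guessing at [CRON] vs [cron]:
journalctl -u cron.service

# Or filter by the executable's path directly:
journalctl /usr/sbin/pppd

# Follow one unit's output live, like tail -f on the right file:
journalctl -u nginx.service -f
```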

I'm not sure why they chose a binary format, and I remain somewhat
suspicious of the decision.  Writing the entries to a text-only file (or
files), and adding a binary b-tree index that can be rebuilt at will,
seems more robust than a binary format that does get corrupted on
unplanned shutdowns.  On the other hand, making it binary means only
journalctl can look at it.  Since they control journalctl, this means
they can change the format to something more robust later, and none of
us will be the wiser.

The bottom line is I think it's fair to say you *are* locked into
systemd journald in a way you weren't locked into syslog.  Claims that
you can just write another one exporting the same API don't ring true,
because I suspect the API will be about as stable as an internal kernel
API.  Tracking an internal kernel API from outside of the tree is
really, really hard work.

That said, the lock in looks to be an outcome of good design decisions,
that have yielded a (mostly) better system than we had before.  And
where it isn't better, it is designed in such a way that fixing it looks
relatively easy.  All I can say is well done boys (and girls?).

So now onto another lock-in.  This one was driven home to me at this
year's LCA, when a Google sysadmin expressed his opinion of systemd over
dinner.  It was, literally, "we will be so fucked when it arrives".  On
inquiring, he claimed (I haven't checked) that cgroups were actually a
Google initiative that was accepted by the upstream kernel.  They
(Google) run everything in its own container (think GMail, G+,
Search, ...), and in an average week they spin up some 2 billion of
them.  Naturally they have written a large body of their own software to
manage them, but systemd insists on managing the cgroups via its own
API, which is incompatible with everyone else's.  I don't know enough
about cgroups to say whether "so fucked" is an overly dramatic way of
putting it, but I notice cgmanager exports its own cgroup API yet
manages to co-exist with systemd.
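For what it's worth, you can see the cgroup tree systemd insists on
arranging with its own tools (the exact output depends on the system):

```shell
# Show the cgroup hierarchy as systemd has laid it out:
systemd-cgls

# Live, top-style view of per-cgroup resource usage:
systemd-cgtop
```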

Just like journald, there are design reasons for the lock-in.  In this
case I think systemd's end goal is to put every service and user in its
own cgroup.  Some tools already exist in systemd to do this (see the
Private* settings in systemd.exec).
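A sketch of the sort of per-service isolation meant here, using the
Private* settings from systemd.exec (the daemon name is hypothetical):

```ini
[Service]
ExecStart=/usr/sbin/somedaemon
# The service gets its own private /tmp and /var/tmp:
PrivateTmp=yes
# It sees only a loopback network interface:
PrivateNetwork=yes
# It gets a minimal /dev containing only pseudo-devices:
PrivateDevices=yes
```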

There is no denying that putting each service and user in its own
container has more than enough security advantages to justify it on its
own.  The problem is this is all new, hot-off-the-press stuff - big
changes to the cgroup implementation happened in the very kernel jessie
is using - 3.16.  Systemd's way of using them is one of many early
explorations of this space.  (Many of its seemingly inexplicable add-ons
arise from it.  For example, you need to assign containers IP addresses,
thus systemd ships with a DHCP server and client.)  Currently, systemd
is one of the less inventive users of cgroups, doing far less with the
concept than competitors like Docker and Rocket, and OSes built around
it like OpenShift and CoreOS.  So its insistence that it be the one
managing control groups could well be problematic in the future.

An ironic outcome of the move to small, one-job containers is that the
init system becomes less important.  This is partially because the
containers themselves don't need an init system.  Replacing /sbin/init
with a simple shell script often suffices, and failing that, inittab
with its ability to restart things is just fine.  But the deeper reason
is that, by definition, an init system starts and stops things on one
box.  In a container world, the containers exist in a cloud of boxes,
and you use something like Kubernetes to manage them.  For Kubernetes
read: init system for the cloud.  And yes, it really is cloud based.  It
relies on etcd, which is a distributed key-value store.  People familiar
with Windows should recognise the concept immediately.  It's like the
registry, but redundantly distributed, so a box spinning up at some
random place in the cloud can access its configuration information.  (I
can't help but smile at the thought of people reading this, and it
dawning on them that we are replacing the text configuration files in
/etc with the Windows solution.)
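The registry analogy in command form - with the etcd client tools of
the time, configuration is just keys that any box in the cluster can
read (the key paths are made up):

```shell
# One node writes a setting...
etcdctl set /config/myapp/db_host db1.example.com

# ...and a container spinning up anywhere in the cluster reads it back:
etcdctl get /config/myapp/db_host
```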

I can't say I was surprised to discover kubernetes is an internal Google
project they released to the world.

Finally, I am convinced this is all highly relevant to the Debian
project.  The move to containerisation is going to affect what our
server users expect from us.  Currently packaging concerns are largely
ignored in the container world, but this surely has to be a passing
phase, born of the need to prove the central concept before embarking on
"side issues".  Currently security updates, fixing compatibility issues
between packages, repeatable builds, secure distribution - all things
Debian has solved - are ignored.  If we fixed these issues for the
container people without too much effort from them, we would be very,
very popular.  It doesn't seem like it would be too difficult.
