I'm a strong believer in backtracking during the process of planning. Choosing a route and stubbornly sticking to it will rarely lead to a good result. On the other hand, evaluating too many options in depth is a sure-fire way of wasting too much time. Instead, I like to focus on eliminating options as early as possible – knowing what deal-breaking limitations to look for is, in my opinion, critical. And, as weird as it may sound, intuition can also help in this scenario.
While I've previously said that I had decided on using LXD to isolate the Ghost platform that powers this blog, this process was by no means linear. OpenVZ seemed to be the more mature technology and I have certainly found instances where LXD simply felt like work in progress. Docker has been a very trendy technology for a while now and can be credited with popularising containers, but it did not feel like the right choice for this project. In the end, being able to easily and cleanly achieve what I wanted with LXD encouraged me to pursue and eventually stick to this path. For now.
The desired setup
The host machine is running Ubuntu Xenial (itself a KVM system). I wanted an independent Xenial system running in an unprivileged container, with nothing other than systemd and whatever the blog needs to run. Communication with the outside world would be restricted to a software bridge. For storage, a ZFS pool on top of an LVM logical volume: besides being the recommended choice for LXD, ZFS offers some pretty nifty features.
In LXC parlance, an unprivileged container is one that makes use of user namespaces – the user and group IDs in the container (UID and GID 0 included) are mapped to unused, unprivileged user and group IDs on the host. After all, the independence of a system running in a container is a bit of an illusion, so not sharing the IDs on which so many access controls are built sits very well with me. Furthermore, the contained UID 0 user is allowed privileged operations within the namespaces the container processes are running under, but not outside of them.
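To make the mapping concrete, here's a tiny sketch of the arithmetic involved. The base offset of 100000 is an assumption (a common default) – the real ranges live in `/etc/subuid` and `/etc/subgid` on the host:

```shell
# Sketch of the ID mapping arithmetic for an unprivileged container.
# Assumes a subordinate ID range starting at 100000 (a common default;
# check /etc/subuid and /etc/subgid on your host for the real values).
base=100000

map_uid() {
    # Container UID $1 shows up on the host as base + $1.
    echo $(( base + $1 ))
}

map_uid 0      # container root -> 100000
map_uid 1000   # a regular container user -> 101000
```

So even a root-owned file inside the container belongs, as far as the host is concerned, to an otherwise unused high UID.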
But user namespaces were just the final piece of the puzzle that allowed proper containerisation technologies to be built using features in the mainline kernel, so let's have a look at what other things get isolated via the six other namespaces available under Linux:
- Cgroup namespaces
- IPC namespaces
- Network namespaces
- Mount namespaces
- PID namespaces
- UTS namespaces
The names are meant to be self-explanatory, but for me the last term just didn't ring any bells; the manual page for uname(2) shed some light on that one. Most namespaces are meant to isolate the container from a security point of view, but the PID namespaces are an exception. They're simply meant to allow each container to have its very own PID 1, a useful Unix convention, as well as help migrate containers from one host to another. I won't dwell on these any more – the manual pages provide good descriptions.
As far as networking is concerned, the default configuration of LXD is to set up a virtual bridge on the host (called lxdbr0) and attach a virtual ethernet device to each container (called eth0 inside the container). LXD is even nice enough to set up forwarding, NAT, dnsmasq to provide IPv4 and IPv6 DHCP, RAs for IPv6, DNS resolution, DNS registration and a bunch of other features to make it really easy to connect the container to the outside world.
The only problem is that I actually do not want any of this. For one, I would like the container not to communicate with the Internet. That means I have no use for forwarding, NAT or IPv6 RAs. Simple enough, as all these features can be deactivated. But I had an idea which appealed to me even more: using only IPv6 link-local addresses on the bridge, as this would completely eliminate any chance that packets originating in the container would ever leave the bridge. You see, the IPv6 link-local prefix is identical on all links – fe80::/10 – which makes getting packets out of the bridge via forwarding and NAT that bit more complicated. As it turns out, this was no big deal on LXD. That leaves us with the DNS features – well, I opted out of those too.
But isn't all this lack of communication simply going to cause a maintenance nightmare: would I have to use the host for all software installation and for regular updates? Mounting the container's filesystem and using chroot would certainly be possible, but far from ideal. Let's take a step back: I need the virtual bridge in order to access the service running in the container – a reverse proxy would make the blog's web application available to the Internet. In the opposite direction, an HTTP proxy could offer all the connectivity to the rest of the world I would ever need to manage software from within the container (let's face it, nowadays HTTP is pretty much ubiquitous when it comes to software distribution).
At this point you would be well within your rights to wonder what I've actually achieved. I may have cut the container's IP connectivity to the wider world, but it can still make HTTP requests. However, you can think of the proxy as a layer-7 firewall: instead of having to decide what IP networks I allow my jailed applications to communicate with and over which layer-4 protocols, I can more easily define a list of HTTP sites it's allowed to access. As a bonus, I get user-friendly logs of all communication attempts. The downside? Let's just hope Squid is secure.
Wait, isn't this an application container?
It certainly is, and you're probably thinking about Docker. I may be isolating a single application (i.e. the blog) within a container, but I refuse to define an application as a set of related processes. Docker seems too entrenched in running the application as PID 1, instead of running an init system (like systemd) to provide an ecosystem for the application to run in.
In the end, as you're going to find out from part 3 of this series of posts, that ecosystem proved useful. I'm not saying it's not possible to run systemd within Docker, but it seems that the Docker development team don't consider this the way to use it – and this raised some concerns for me.
But that's enough jibber-jabber for one article – on to the practical stuff!
At the time of this writing, LXD is very much under active development, so, even though I'm using the latest Ubuntu LTS, I had to use the backports repository to get a decent enough version. An interesting feature of Ubuntu is that, by default, it pins the packages in the backports repository to a low priority (100 vs 500). While the `-t` flag of your favourite APT frontend is going to do the trick for the initial install, I'm not a big fan of relying on my memory when it comes to periodic system upgrades – I'd rather not have to remember to treat some packages specially. So APT preferences to the rescue! The following file, dropped under `/etc/apt/preferences.d/`, will make sure I don't have to:
```
Package: lxd*
Pin: release a=xenial-backports
Pin-Priority: 500
```
And I was ready to fire-up my favourite APT frontend:
```
# aptitude install lxd lxd-client zfsutils-linux
```
There's a catch to this, though. APT might then pull other packages as dependencies from the backports repository, but not upgrade them. Your best bet is to be aware of this and check what packages are installed from backports every now and then, with aptitude's seriously-powerful-but-impossible-for-mere-mortals-to-not-find-confusing search terms:
```
# # No, the ?and term will not do what you want here
# aptitude search '?narrow(?installed,?archive(xenial-backports))'
i   lxd        - Container hypervisor based on LXC - daemon
i   lxd-client - Container hypervisor based on LXC - client
```
My system uses LVM, so I decided to set up a logical volume as storage for a ZFS pool. This might be a weird setup, but without any physical disk drives to use exclusively for this, it had to do.
```
# lvcreate -L30g -n lxd-zfs system
```
I'm not a very big fan of wizards, or, to be more accurate, of some of the decisions they tend to make for me. However, LXD's turned out OK. Here's a breakdown of what my answers to its questions were:
```
# lxd init
Do you want to configure a new storage pool (yes/no) [default=yes]? yes
Name of the storage backend to use (dir or zfs) [default=zfs]: zfs
Create a new ZFS pool (yes/no) [default=yes]? yes
Name of the new ZFS pool or dataset [default=lxd]: lxd
Would you like to use an existing block device (yes/no) [default=no]? yes
Path to the existing block device: /dev/system/lxd-zfs
Would you like LXD to be available over the network (yes/no) [default=no]? no
Do you want to configure the LXD bridge (yes/no) [default=yes]? yes
Configure bridge? yes
  - name: lxdbr0
  - IPv4 subnet: no
  - IPv6 subnet: no
```
Not being very sure about the implications of some of the other network settings, I decided to set them manually using `lxc network edit lxdbr0` (this will use your preferred editor). I configured the following:
```
dns.mode: none
ipv4.address: none
ipv6.address: none
ipv6.dhcp: false
ipv6.firewall: false
ipv6.nat: false
ipv6.routing: false
```
And told LXD I was ready to play:
```
# lxd ready
```
At this stage, I started wondering about the persistence of the IPv6 link-local address assigned on the host to lxdbr0. This address is derived from the MAC address assigned to the interface, but with a virtual bridge I had no assurance that the MAC would stay the same after a reboot. So I decided to test my theory and, sure enough, the MAC address and the IPv6 address changed when the system came back up. Thankfully, systemd can be asked to assign a specific MAC address to an interface using link files. I saved the following as `/etc/systemd/network/10-lxdbr0.link` and tested again:
```
[Match]
OriginalName=lxdbr0
Driver=bridge

[Link]
MACAddressPolicy=none
MACAddress=XX:XX:XX:XX:XX:XX
```
As expected, this time the IPv6 address survived the reboot.
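As an aside, the derivation itself is simple enough to sketch in a few lines of shell. This assumes the kernel builds the link-local address using the modified EUI-64 rule (flip the universal/local bit of the first octet, insert ff:fe in the middle) rather than stable-privacy addresses:

```shell
# Derive an EUI-64 based IPv6 link-local address from a MAC address.
# A sketch; assumes the kernel uses modified EUI-64 interface IDs.
mac_to_ll() {
    IFS=: read -r a b c d e f <<EOF
$1
EOF
    # Flip the universal/local bit (0x02) in the first octet.
    a=$(printf '%02x' $(( 0x$a ^ 0x02 )))
    # Insert ff:fe between the third and fourth octets.
    printf 'fe80::%x:%xff:fe%s:%s%s\n' "0x$a$b" "0x$c" "$d" "$e$f"
}

mac_to_ll 00:16:3e:12:34:56   # -> fe80::216:3eff:fe12:3456
```

Pinning the MAC therefore pins the link-local address too, which is exactly what the link file above achieves.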
I should probably mention that LXD provides a nice set of abstractions to make managing containers easier. The networks, which I mentioned earlier, are one of them. Another useful abstraction is the container profile – it allows you to define a common set of settings inherited by containers. One is provided by default and it's called, surprise-surprise, `default`. I defined the `user.network_mode` setting with a value of `link-local` to stop cloud-init from delaying the container's boot by waiting for the DHCP client to get an address it never will. I also threw in `security.devlxd` set to `false` for good measure – why would the container want to communicate with the host? Naughty...
```
# lxc profile set default user.network_mode link-local
# lxc profile set default security.devlxd false
```
The only thing that's now left is getting the proxy – our very own layer-7 firewall – up and running. Squid is a bit like what Apache used to be a number of years ago: bloated with features and lacking any serious contender; but we all love it. After the usual `aptitude install squid3`, I set it up with the following:
```
acl ssl_ports port 443
acl safe_ports port 80
acl safe_ports port 443
acl CONNECT method CONNECT
acl localnet src fe80::/10
acl safe_sites dstdomain -n archive.ubuntu.com
acl safe_sites dstdomain -n security.ubuntu.com
http_access deny !safe_ports
http_access deny CONNECT !ssl_ports
http_access allow localnet safe_sites
http_access deny all
http_port 3128
cache deny all
```
In other words:
- It listens on port 3128;
- Only allows clients from fe80::/10 to use it;
- Only allows requests to ports 80 and 443;
- Only allows use of the CONNECT method to port 443;
- Only allows requests to archive.ubuntu.com and security.ubuntu.com;
- Doesn't cache anything.
If at this point you're wondering whether my ISP can use my Squid proxy over IPv6: I only allowed access to port 3128 on interface lxdbr0, using ip6tables – having to do so is an unfortunate side-effect of using link-local addresses.
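For reference, the rules amount to something along these lines (a sketch – the chain layout and default policies depend on your existing firewall setup):

```shell
# Allow the containers (on lxdbr0) to reach Squid, drop everybody else.
# A sketch; adjust to your existing chains and default policies.
ip6tables -A INPUT -i lxdbr0 -p tcp --dport 3128 -j ACCEPT
ip6tables -A INPUT -p tcp --dport 3128 -j DROP
```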
The finishing touch was to automatically tell the containers to use the proxy, by setting the `http_proxy` and `https_proxy` environment variables in the `default` profile. There's yet another catch here because of the IPv6 link-local addresses: the same feature that makes them harder to communicate with beyond a link (the fact that all links share the same prefix) also means that an IPv6 link-local address, by itself, does not provide enough information to send an IP packet – you need to couple it with an interface (well, technically it's called a zone). Thankfully, most Unix systems provide a friendly format using the percentage sign:
```
# lxc profile set default environment.http_proxy 'http://[fe80::XXXX:XXXX:XXXX:XXXX%eth0]:3128'
# lxc profile set default environment.https_proxy 'http://[fe80::XXXX:XXXX:XXXX:XXXX%eth0]:3128'
```
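Just to illustrate the format (this throwaway helper is purely for demonstration, not part of the setup): the address goes inside the brackets, followed by a percentage sign and the zone, i.e. the interface name as seen from the client:

```shell
# Build a proxy URL from a link-local address, a zone and a port.
make_proxy_url() {
    # $1 = link-local address, $2 = zone (interface), $3 = port
    printf 'http://[%s%%%s]:%s\n' "$1" "$2" "$3"
}

make_proxy_url fe80::1 eth0 3128   # -> http://[fe80::1%eth0]:3128
```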
If you've read this far, I can only assume you're either very bored or skipped to the end. In case of the former, stay tuned for the final article in the series – it's bound to be even longer!
P.S. The top picture depicts isolation.
1. Which one would like to think is based on experience.
2. As I'm writing this, I'm making a mental note to try it out.
3. Yes, I did actually achieve something: take a look at the number of packages in backports – the chances of pulling dependencies from there are next to nil.
4. If you're on Ubuntu, try `update-alternatives --config editor`.
5. Let's be honest, I did this after discovering it was a problem, and it took a bit of digging around to find the clean solution.