minas.morgul.net is the hub of much of my digital life. It also provides services for quite a few friends, ranging from backup DNS to mailing lists and IRC. It lives in a datacenter 3000 miles away from where I live, with conditioned power, climate control, etc. It’s got redundant power supplies, RAID disks, remote console, and most of the other stuff you’d expect from a machine that’s supposed to be up and running non-stop. There are a few things, though, that can’t really be made redundant. (At least not cheaply.) CPUs are one of those things…
Having established some background, the story picks up this past Thursday morning, when I awoke to find some strange log messages on minas. Some things that don’t usually crash had crashed overnight. Some web app stuff had failed in strange ways. My initial suspicion was that somebody was probing some web apps, possibly looking for security flaws to exploit. My fears only increased when some of the commands I was running during my investigation also started crashing. Had somebody broken in and modified the system to hide their presence, but done so in a sloppy way that left things unstable? w(1), a tool that reports some system information including the people who are logged in, crashed with a “bus error”. Not good.
After some time spent thinking about the best way to recover from a security compromise, which would have been difficult without physical access to the system, I started noticing additional puzzling behavior. The programs that I had seen crash didn’t actually seem to always crash. Sometimes they’d run just fine. df(1) would sometimes report meaningful disk usage numbers, but other times report wacky numbers that made no sense. The bus errors I’d been seeing typically happen when a program tries to access memory outside of its legal address space. Math errors when calculating pointer addresses. Math errors when calculating disk usage. Hmm. What does the math in a computer? The CPU. How many CPUs are in this system? Two. Could it be that one of the two CPUs is failing, and that any time the scheduler places a process on that CPU it is susceptible to crashing? How might I find out? Here’s where things start to get interesting.
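The intermittent nature of the failures could be seen just by running the same command in a loop (a sketch; the count is arbitrary, and whether a given run fails depends on which CPU the scheduler happens to pick):

```shell
# Run df repeatedly; on a flaky machine some runs crash or print
# garbage, while others are fine, depending on scheduler placement.
for i in $(seq 1 20); do
    df / > /dev/null 2>&1 || echo "run $i failed with exit status $?"
done
```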
I started looking into possible ways to manipulate the kernel’s scheduler to see if I might be able to control which CPU a given process runs on. I discovered the taskset(1) program, which can adjust a process’s “CPU affinity”. Using this tool, it’s possible to target a specific CPU when launching a process. It’s also possible to manipulate the CPU affinity of an already running process. Child processes inherit their parent’s CPU affinity. So, to start with, it should be pretty easy to determine whether or not a given CPU is bad:
$ taskset -c 0 uptime
Floating point exception
$ taskset -c 1 uptime
 14:39:25 up 4 days, 16:28,  6 users,  load average: 0.03, 0.22, 0.26
This was reliable and repeatable. CPU 0 is apparently bad. Time to move long-running processes off of it. To ensure that I got all of their children too, I restarted some daemons (cron, apache) with taskset. My shell processes all got migrated to CPU 1. I even set the CPU affinity of init. The server became, in essence, a uniprocessor box when it had previously been a dual-processor system. It has been running like this for two days, and seems reliable.

I have no idea what will happen if this host reboots. I’m actually not sure how it is that the system hasn’t crashed already. One thing whose CPU affinity can’t be adjusted is the kernel itself. I am not sure, but I believe that it would still be running on CPU 0, at least some of the time. Does CPU affinity affect system calls as well? That might explain why it’s still running now, but what will happen when it reboots? I’m familiar with the ‘maxcpus’ kernel parameter, but that doesn’t actually let me specify which CPU is used. I suspect that setting maxcpus=1 would put everything on CPU 0 and I’d be hosed. There’s also an isolcpus kernel parameter, which seems a bit more promising: it essentially lets you tell the kernel never to schedule a process on the given CPU(s). It’s normally used for realtime work, where you want to dedicate a specific CPU to your realtime process. But the kernel still needs to actually boot. Is it actually going to be able to do so? I don’t really want to find out, and I don’t want to depend on it being able to do so. Time to think about disaster recovery plans.
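For the record, moving an already running process is just another taskset invocation (a sketch; the PID is illustrative, and changing a process you don’t own requires root):

```shell
# Show the current affinity mask of the shell itself.
taskset -p $$

# Restrict a running process (PID 1234, illustrative) to CPU 1;
# children forked after this point inherit the new mask.
taskset -p -c 1 1234
```

Since affinity is inherited across fork and exec, pinning a daemon’s parent before it spawns workers covers the whole process tree, which is why restarting cron and apache under taskset was enough.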
In any case, this was an awfully interesting problem, and I think the solution was kind of neat as well. I’ve never had to deal with such a situation before.
On a related note, does anybody have an old Opteron 248 you’d be willing to part with?