TCAD station, part II

As a testbed for commercial TCAD software, we will use a standard desktop PC equipped with an i4790 CPU and 32 GB RAM, and an Nvidia 710 GT to connect the monitor via DVI. As a substitute for Redhat Enterprise Linux (RHEL) required by all of the packages we're about to evaluate, I'm going to install CentOS.

I create a bootable USB stick by issuing

dd bs=4M if=CentOS-7-x86_64-Minimal.iso of=/dev/sdc status=progress && sync

The first stick doesn't work, and it takes us perhaps half an hour to realize that it's the fault of the stick, not the system. A second one works right away. I manually partition the disk, as in my installation in a virtual machine, choosing btrfs as filesystem for both system and home partitions. I add a UEFI boot partition as well as a swap partition.

During the first boot from hard disk, the system throws an error message concerning nouveau (the open-source driver for the Nvidia graphics card) and subsequently hangs with a kernel soft lock signified by the repeated message that CPU #i stuck for 120s. After a hard reset, the system boots and behaves as expected, but when I boot again to activate the new kernel installed by yum, the system hangs again. Since the hangup is always preceded by the error message from the nouveau driver, we remove the Nvidia card and reboot. Now the systems hangs without any message. Great! A cold boot (with the Nvidia card installed again) seems to help at first, but later the system hangs repeatedly and finally refuses to boot to the login screen at all.

What I thought to be done within an hour has already taken most of this Friday morning. Since the Nvidia card does not seem to responsible for these problems, I start to suspect the btrfs filesystem, which Redhat still considers to be a technology preview. In the afternoon, I thus reinstall the system, this time choosing the default filesystem XFS. And indeed, while I still get the error message from nouveau, the system boots up without any of the previous symptoms. I go home with the conviction that I've solved the problem.

Saturday morning, curiosity gets the better of me and I decide to check if the system still runs. It does, but htop shows that the uptime is only 49 min instead of the expected 13 h. Weird! I check again Sunday morning, and again the uptime is less than an hour. At the same time, I notice that the filesystem usage seems to increase by roughly 1 GB per day. Aha! An 'll /var/crash' confirms my suspicion: the system crashes and dumps the kernel roughly every 2 or 3 hours.

Core dumps can be analyzed to determine their cause. Following the Redhat tutorial and issuing the command

crash /usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.x86_64/vmlinux /var/crash/127.0.0.1-2017-05-01-07\:35\:53/vmcore

I get the following crash report:

KERNEL: /usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.x86_64/vmlinux
        DUMPFILE: /var/crash/127.0.0.1-2017-05-01-07:35:53/vmcore  [PARTIAL DUMP]
                CPUS: 8
                DATE: Mon May  1 07:35:42 2017
          UPTIME: 08:27:43
LOAD AVERAGE: 0.05, 0.03, 0.05
           TASKS: 224
        NODENAME: testbed.de
         RELEASE: 3.10.0-514.16.1.el7.x86_64
         VERSION: #1 SMP Wed Apr 12 15:04:24 UTC 2017
         MACHINE: x86_64  (3600 Mhz)
          MEMORY: 31.9 GB
           PANIC: "Kernel panic - not syncing: Hard LOCKUP"
                 PID: 0
         COMMAND: "swapper/6"
                TASK: ffff880174a9edd0  (1 of 8)  [THREAD_INFO: ffff880174ab8000]
                 CPU: 6
           STATE: TASK_RUNNING (PANIC)

This diagnosis lets me find the cause of the core dumps very easily: it's an unresolved bug reported in September 2016 and given high priority by the developers. The bug has been found in version 7.2.1511 but has not been fixed even in the current version 7.3.1611. However, the bug reporter and others have identified the nouveau driver to be the culprit. And indeed: after removing the GT710 and rebooting, the system does not suffer from any further lockups and core dumps:

 ob@testbed:~$ uptime
19:00:08 up 17 days,  9:00,  2 users,  load average: 0.00, 0.01, 0.05

I have to admit that this experience was an unexpected surprise. CentOS, as a binary compatible clone of RHEL, has the reputation of being the very model of a conservative Linux distribution, and thus to be a paragon of stability and reliability. Consequently, I had expected outdated software, but not buggy implementations of core packages, and critical bugs that are open for 8 month without eliciting any response from a developer.

Thinking twice, however, I realize that I should not have been surprised at all. About ten years ago, we decided to switch our core servers from SUSE Enterprise Linux (SLES) to OpenSUSE since we were entirely frustrated with the support and bug fixing policy of SLES, despite the fact that we payed a handsome amount to Novell every single year. Personally, I'm not too fond of OpenSUSE either, but the core servers don't overly concern me. Our compute servers, for which I'm responsible, are running Debian Testing, and in view of the minimal administrative effort required over the past ten years, I congratulate myself for this decision.