Linux Setup Notes

Created May 7, 2013; updated May 25, 2013

OpenSuse 12.3 NFS Problems

Did OpenSuse 12.3 cause a problem? If you call hanging by the neck until dead a problem, then as Perry Mason would say: Yes. Yes it did. Update: Partial solution found

Summary

The Dell PowerEdge T410 uses a PERC H200 RAID controller. With hardware RAID, the controller's firmware takes care of all RAID functions like recovery and consistency checking transparently. The OS just sees one large disk. Someone at our location bought two of these babies, and despite our problems with OpenSuse 12.1 we decided to give OpenSuse another chance. We deleted the Microsoft Server partitions, added three Linux ones (root, /home, and a swap) and formatted them with ext4. OpenSuse installed without any major incident, but locked up solid after 300-400 GB of data were copied onto the hard drive.

Background

Since I am the closest thing my employer has to a Unix expert, they asked me to configure the two servers for one of their locations, a three- to four-hour drive (depending on traffic) from where I work. Here are the gory details of the problems we had. Some of them were caused by wonky hardware, but the rest were caused by changes in Linux that seem to have been made simply for the sake of change.

  1. Mexican Keyboard: Somebody found a great deal on some USB keyboards. Unfortunately, they turned out to be Mexican keyboards. Only a few of the punctuation keys actually matched what was printed on the keycaps. For example, to get a forward slash ('/'), you have to press the '_' key. To get a '+', you press the '¿' key. Now you know why there are so few good computer programmers in Mexico.

  2. Acer V183HV Monitor: This inexpensive and compact monitor seems to have only one screen mode: 1366×768. Linux Xorg doesn't handle this mode without major tweaking. But we couldn't even get as far as the tweaking step, because OpenSuse switched to graphics mode as soon as it hit GRUB. At that point, the screen became unreadable. Configuration was just not possible with this monitor attached. Changing the boot parameters in GRUB's config file to use text mode or other VGA modes didn't work; they were all ignored.

  3. Other Monitors: Even though there's a computer store two blocks away, no one at this location is allowed to purchase anything. Rules are rules. The administrative staff has to ensure their continued employment somehow. So for installation, we borrowed a CRT monitor from their old server. Then someone found an old 14-inch LCD monitor that was not being used, stuck behind the desk of a former employee. As soon as it was attached, Xorg decided that there were “No Valid Screen Modes” and unceremoniously deleted its own configuration file. As a result, no monitors would work. It turns out there is no more xorg.conf file. It's been replaced by a /etc/X11/xorg.conf.d directory with a bunch of separate files in it. But if you create an xorg.conf file, Xorg can use it. We finally found a file called /etc/X11/xorg.conf.install and copied it to /etc/X11/xorg.conf.

  4. X11 doesn't start for regular user: Edit /etc/permissions.local and uncomment the last line about Xorg being 4711. Then, as root, type chmod 4711 /usr/bin/Xorg.
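The fixes in items 3 and 4 boil down to one copy and one chmod. Here is a minimal sketch as a shell function. The name fix_xorg and its optional prefix argument are made up for illustration (the prefix lets you rehearse on a scratch directory first); the paths are the ones named above. It does not edit /etc/permissions.local, which still has to be done by hand.

```shell
# Sketch only: fix_xorg and its prefix argument are hypothetical helpers.
# Call with no argument (as root) to act on the real system.
fix_xorg() {
    xorg_root="${1:-}"

    # Item 3: give Xorg a plain xorg.conf again, copied from the installer's file.
    cp -p "$xorg_root/etc/X11/xorg.conf.install" "$xorg_root/etc/X11/xorg.conf" || return 1

    # Item 4: make the X server setuid root so a regular user can start it.
    chmod 4711 "$xorg_root/usr/bin/Xorg" || return 1
}
```

Run fix_xorg as root on the server, or fix_xorg /tmp/scratch to try it on a copy of the files.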

Installation Oddities

OpenSuse 12.3 has quite a few nice shiny new bugs. Or maybe we just never noticed them before.

  1. Automatic Login: On one computer we accidentally left this option checked. By default the system boots up into what used to be called runlevel 5, and with automatic login, rather than logging you in automatically as you might expect, it presents the xdm login prompt for that particular user. If you enter their password, it says “login successful” and then recycles endlessly back to their password prompt. So the “Auto Login” feature actually blocks that particular person from ever logging in. We couldn't find any way to fix this problem short of re-installing the entire operating system.

  2. inittab no longer works: The standard way of changing the default runlevel no longer works. Damn kids, always changing stuff.

  3. Systemd: Many of the major distros have switched to systemd, which means your old custom startup scripts don't work. Someday we might get the time to figure out whether there's a way to make systemd handle them. For now, we just put a note on the machine reminding people to run the scripts manually.

  4. Some config files work, and some don't: We were able to change from xdm to gdm by editing /etc/sysconfig/displaymanager. But you can't use this method to switch to using startx instead of xdm. Changing it to none just causes Xorg to hang. Go figure.

  5. IMAP: The Cyrus Imapd on the DVD didn't work. See linuxsetup117.html for details.

  6. Apache httpd: The Apache web server, for the first time ever, worked out of the box, or at least seemed to, putting up a bare web page that says "It works." But we couldn't find it in the process table, and it didn't work with our PHP files. So we compiled and installed a custom one.
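For items 2 and 3, here is roughly what the systemd way of doing things looks like. This is a sketch under assumptions, untested on the 12.3 servers described here: /usr/lib/systemd/system is assumed to be where 12.3 keeps its stock targets, and the function names, the unit name local-startup.service, and the script path in the usage note are all invented for illustration. Both functions take an optional prefix so they can be tried on a scratch directory.

```shell
# Hypothetical helpers; names and the optional prefix argument are illustrative.

# Replacement for editing "id:3:initdefault:" in /etc/inittab:
# point default.target at the target you want (multi-user.target is the
# old runlevel 3, graphical.target the old runlevel 5).
set_default_target() {
    sd_target="$1"
    sd_root="${2:-}"
    mkdir -p "$sd_root/etc/systemd/system" || return 1
    ln -sf "/usr/lib/systemd/system/$sd_target" \
        "$sd_root/etc/systemd/system/default.target"
}

# Wrap an old startup script in a oneshot unit so systemd runs it at boot.
write_startup_unit() {
    sd_script="$1"
    sd_root="${2:-}"
    mkdir -p "$sd_root/etc/systemd/system" || return 1
    cat > "$sd_root/etc/systemd/system/local-startup.service" <<EOF
[Unit]
Description=Site-specific startup script
After=network.target

[Service]
Type=oneshot
ExecStart=$sd_script
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
}
```

On the real machine, as root, that would be set_default_target multi-user.target to boot to a text console, then write_startup_unit /usr/local/sbin/site-startup.sh followed by systemctl enable local-startup.service so the script runs at the next boot.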

OpenSuse 12.3 NFS Hangs By The Neck Until It Is Dead

So we got past all these little annoyances. I guess the new guys have to make their mark on Linux, so useless changes like switching from sysvinit to systemd and eliminating /etc/inittab are inevitable. Maybe it even makes desktops boot faster. But a server will take 5-10 minutes before it even gets to GRUB, so the few seconds systemd saves mean nothing. All it does is make the machine harder to administer remotely. Used to be I could log in over my cell phone and fix problems. Ah, the good old days.

We also scraped all the dried salsa off that Mexican keyboard, and resigned ourselves to using a giant 20-year-old CRT monitor from the Ronald Reagan era. Then we found something wonky with the NFS in OpenSuse 12.3 that caused big problems.

After four hours of copying users' data files over NFS, it just locked up. I use cp -pRuv so I can watch the files coming in, and after copying 300-400 GB, they just stopped coming. Typing df caused the terminal to hang. In another terminal, we found that all the partitions were readable except '/', which caused Suse to hang whenever we tried to read it.

What. The. Fuque. By now it was 4 p.m., with an almost four-hour drive ahead of me, and OpenSuse had given me another flare-up of the old Multiple Continuous Spewing Expletive (MCSE) Syndrome. (Not to be confused with Microsoft Certified Systems Engineer Syndrome, which is very similar.)

The system load was near zero, and of course there were no error messages. So I rebooted into "rescue system" mode, and had no problems reading the files in the root directory ('/'). Everything looked intact. Nothing in log/messages. After reboot, I get the normal password prompt, then these symptoms:

We considered that maybe the system was doing LDAP for some idiotic reason, possibly looking on the network for authentication. We booted into rescue mode, edited nsswitch.conf, and wiped everything that we could find that had anything to do with NIS or LDAP. (Of course, it was not possible to run yast2.) No effect.

No problem, we had another server configured identically. So I started copying files onto that one, and the Exact Same Thing happened, except this time it happened immediately after the first reboot. So it was not a bad hard drive; anyway, a RAID consistency check (a great feature, except that it takes over an hour) showed no errors.

Partial solution: We finally discovered that the login problem was caused by having copied the shadow file from the old machine several hours earlier. Apparently the file format has been changed. The new file contains several extra fields.

Here is how to reproduce the problem:

Login will also hang if the number of lines in /etc/shadow is different from the number of lines in /etc/passwd. Again, logins work fine until you reboot, possibly months later, and you discover that you are suddenly unable to log in.

It seems that login processes the file differently after a reboot. A single misplaced character in any line of /etc/shadow could cause it to hang. This is tough to identify, because it has no effect for weeks or months until the next time you reboot. The only way to recover is to boot into rescue mode and delete the shadow file, then rebuild it user by user. A pain in the neck if you have thousands of them.
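Since the breakage only shows up at the next reboot, it seems worth checking the two files right after touching them. The sketch below assumes the standard layouts documented in passwd(5) and shadow(5), seven and nine colon-separated fields respectively; check_shadow is a hypothetical helper name, and there is no guarantee it catches every condition that makes login hang. The stock pwck -r command does a similar read-only check.

```shell
# Hypothetical helper: sanity-check passwd/shadow before the next reboot.
# Usage: check_shadow [passwd_file] [shadow_file]   (run as root for /etc/shadow)
check_shadow() {
    cs_passwd="${1:-/etc/passwd}"
    cs_shadow="${2:-/etc/shadow}"
    cs_status=0

    # passwd(5) lines have 7 colon-separated fields, shadow(5) lines have 9.
    awk -F: 'NF != 7 { printf "%s line %d: %d fields, expected 7\n", FILENAME, FNR, NF; bad = 1 }
             END { exit bad + 0 }' "$cs_passwd" || cs_status=1
    awk -F: 'NF != 9 { printf "%s line %d: %d fields, expected 9\n", FILENAME, FNR, NF; bad = 1 }
             END { exit bad + 0 }' "$cs_shadow" || cs_status=1

    # Every account name should appear in both files.
    awk -F: 'FILENAME == ARGV[1] { p[$1]++; next }
             { s[$1]++ }
             END {
                 for (u in p) if (!(u in s)) { print "no shadow entry for " u; bad = 1 }
                 for (u in s) if (!(u in p)) { print "no passwd entry for " u; bad = 1 }
                 exit bad + 0
             }' "$cs_passwd" "$cs_shadow" || cs_status=1

    return $cs_status
}
```

A non-zero exit status means something looks off; the messages on standard output say which file and line (or which account) to look at.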

Another NFS problem: Copying files over NFS from a different old server running Suse 9 with kernel 2.4.21-99 caused the new Suse 12.3 server to hang repeatedly. The mouse, keyboard, and network were all non-responsive. It didn't even respond to pings. The only way to recover was by yanking the power cable. The logs in the old server had multiple repetitions of rpc-serv/tcp: nfsd sent only -32 bytes of 32900 - shutting down socket. We fixed this easily by changing the number of NFS server processes on the old computer from 4 to 64 in /etc/sysconfig/nfs. Despite the errors, the old computer continued to run just fine, but the shiny new computer with Suse 12.3 and Linux 2.6.25.20-0.5-pae locked up solid. This, children, is what we call progress.
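For reference, the change on the old server is a one-line edit. On SUSE the thread count lives, as far as I know, in the USE_KERNEL_NFSD_NUMBER variable in /etc/sysconfig/nfs. The helper name set_nfsd_threads below is hypothetical, and the restart command in the usage note is an assumption about the old box's init scripts.

```shell
# Hypothetical helper: set the number of kernel nfsd threads in a sysconfig file.
# Usage: set_nfsd_threads 64 [/etc/sysconfig/nfs]
set_nfsd_threads() {
    nfs_threads="$1"
    nfs_conf="${2:-/etc/sysconfig/nfs}"

    # Keep a backup, then rewrite the original from it (no sed -i needed).
    cp -p "$nfs_conf" "$nfs_conf.bak" || return 1
    sed "s/^USE_KERNEL_NFSD_NUMBER=.*/USE_KERNEL_NFSD_NUMBER=\"$nfs_threads\"/" \
        "$nfs_conf.bak" > "$nfs_conf"
}
```

Run set_nfsd_threads 64 as root, then restart the NFS server (on SUSE of that vintage, probably rcnfsserver restart) so the new count takes effect.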
