Hardware diagnostics with open source tools
Like all pieces of electronic equipment, computers have a tendency to malfunction and break; if you have never experienced kernel core dumps or unexpected crashes, consider yourself lucky. Many common hardware problems are caused by bad RAM modules, overheated or broken CPUs, or bad sectors or clusters on hard disks. In this article we will introduce you to some open source tools you can use to trace these problems, and thus save time, money, and headaches.
A GNU/Linux live CD distribution can come in handy for hardware diagnostics. For this purpose, my favorite live CD distribution is GRML, which bundles the tools we're about to discuss, along with some other useful programs for both home users and veteran system administrators. Other distributions also include some or all of these tools.
Who's afraid of the big bad memory?
Bad memory can cause crashes that lead to system hard locks or even data corruption. Next time you try to compile a program and the compilation fails, check your memory before sending any bug reports to the program's authors. Memtest86+ is an excellent utility for testing RAM. It is based on memtest86, but supports most modern hardware, including the AMD64 architecture, whereas memtest86 is strictly x86-based. Memtest86+ is a boot image and thus is independent of an operating system.
To run the program, boot your system with the GRML CD and enter memtest at the boot prompt. The program is simple to use, since it starts testing memory by itself immediately. Pressing c shows the configuration menu, which you can use to select the test method, enter ECC mode (if your system uses that kind of RAM), restart the test, or refresh the screen; however, most people should be fine with the defaults.
Memory problems are usually tough to spot, so to be sure, leave memtest86+ running for a long period of time and let it complete at least 10 passes of the test. If you want to quit memtest86+ and restart your computer, just press Esc.
Burn, CPU, burn
Overheating CPUs can also cause system crashes. CPU problems tend to reveal themselves when you're running CPU-hungry applications such as code compilation or video encoding, and not during everyday tasks. You can check whether your CPU is the weakest link in your system by putting a heavy load on it with the cpuburn package, a collection of programs whose main purpose is to load processors as heavily as possible.
Cpuburn includes executable binaries optimized for specific CPU types, named burn[CPU_TYPE], where [CPU_TYPE] is one of P5, P6, K6, K7, MMX, and BX. Read the README file (installed in GRML as /usr/share/doc/cpuburn/README) to decide which one to use for your system.
You can also combine cpuburn with a thermal sensor probing program such as lm_sensors or ACPI (if you are testing a laptop) to get real-time information about your CPU's temperature; just run burn[CPU_TYPE] on one virtual terminal and sensors on another. If you are into overclocking or extreme cooling, this program will be your best friend.
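If you would rather not juggle two terminals, the same idea can be sketched as a small shell function that starts the load in the background and prints sensor readings at a fixed interval until a time limit passes. The function name burn_and_watch and its four arguments are made up for this illustration; burnP6 in the usage line is just one of the binaries listed above, and sensors is assumed to be already configured on your system.

```shell
#!/bin/sh
# Hypothetical helper (not part of cpuburn): run a load generator in the
# background and print sensor readings while it runs.
# Usage: burn_and_watch BURN_CMD SENSOR_CMD DURATION_SECONDS INTERVAL_SECONDS
burn_and_watch() {
    burn_cmd=$1
    sensor_cmd=$2
    duration=$3
    interval=$4

    # Left unquoted on purpose so a command with arguments can be passed in.
    $burn_cmd >/dev/null 2>&1 &
    burn_pid=$!

    elapsed=0
    # Keep polling until the time limit is reached or the load process dies.
    while [ "$elapsed" -lt "$duration" ] && kill -0 "$burn_pid" 2>/dev/null; do
        $sensor_cmd
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done

    # Stop the load generator if it is still running.
    kill "$burn_pid" 2>/dev/null
    wait "$burn_pid" 2>/dev/null
    return 0
}
```

For example, burn_and_watch burnP6 sensors 600 5 would load the CPU for about ten minutes and print the sensor readings every five seconds. If the burn program exits on its own before the time limit, the loop stops early, which is worth investigating.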
Hard disk problems
Storage media manufacturers always create smaller and faster hard disks with larger capacities, but all disks are prone to failure. Most hard disks integrate a monitoring system, called Self-Monitoring, Analysis and Reporting Technology (or SMART for short), which besides providing all sorts of information about the drive (model, serial number, operating temperature, etc.) offers a nice way to test the disk's integrity. To interact with that system you can use a program like smartmontools.
The smartmontools package contains two programs: smartctl, a command-line utility for performing SMART tasks, and smartd, a daemon that monitors the SMART system and can be used to take proactive measures against hard disk failure. Before using these programs make sure to read their man pages carefully.
Let's start by reading all SMART information off the drive by issuing smartctl -a /dev/HDD_DEVICE (replace HDD_DEVICE with your disk's device node, e.g. /dev/hda for the primary master IDE disk). If you use a SATA drive, add -d ata at the end of the previous command. If smartctl fails, complaining that SMART is not enabled, run smartctl -s on /dev/HDD_DEVICE and try again. Verify the drive's integrity by running a long SMART test with smartctl -t long /dev/HDD_DEVICE. Since the test runs in the background, we can check the results later by issuing smartctl -l selftest /dev/HDD_DEVICE.
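The sequence above can be collected into a short script. This is only a sketch under a few assumptions: the function names start_long_test and show_selftest_log are invented here, the SMARTCTL variable exists only so that a different binary path can be substituted, and a SATA drive would additionally need the -d ata option mentioned above.

```shell
#!/bin/sh
# Path to smartctl; overridable from the environment (an assumption of this sketch).
SMARTCTL=${SMARTCTL:-smartctl}

# Hypothetical helper: enable SMART, dump the drive information,
# then start a long self-test in the background.
start_long_test() {
    dev=$1
    "$SMARTCTL" -s on "$dev" || return 1
    # The exit status of -a is a bit mask of drive conditions,
    # so it is deliberately not treated as a failure here.
    "$SMARTCTL" -a "$dev"
    "$SMARTCTL" -t long "$dev"
}

# Hypothetical helper: print the self-test log once the test has had time to finish.
show_selftest_log() {
    "$SMARTCTL" -l selftest "$1"
}
```

You would call start_long_test /dev/hda as root, wait for the duration that smartctl reports for the test, and then call show_selftest_log /dev/hda to read the outcome.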
The smartd daemon can perform SMART tests periodically on a running system; its configuration file smartd.conf (usually installed in /etc) includes examples of how to accomplish that, and its man page provides details about the program's operation.
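To give an idea of what such an entry looks like, here is a small helper that prints one smartd.conf directive for a given device. The helper name smartd_schedule_line is hypothetical and the schedule is only an example: -a monitors the drive's attributes and logs, -o on and -S on enable automatic offline testing and attribute autosave, -s schedules a short self-test every day at 02:00 and a long one every Saturday at 03:00, and -m root mails warnings to root. Check the directives against the smartd.conf man page for your version before using them.

```shell
#!/bin/sh
# Hypothetical helper: print an example smartd.conf directive for one device.
# The schedule below is illustrative, not a recommendation.
smartd_schedule_line() {
    printf '%s -a -o on -S on -s (S/../.././02|L/../../6/03) -m root\n' "$1"
}
```

Running smartd_schedule_line /dev/hda prints a line you can review and then add to smartd.conf by hand; smartd has to be restarted or told to reload its configuration before the change takes effect.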
If for some reason you can't use SMART -- for example if your drive does not support it -- you can check your disk for problems with the badblocks program, which is part of the e2fsprogs package, installed by default in almost all GNU/Linux distributions. Run it as badblocks -n -v /dev/HDD_DEVICE for a non-destructive read-write test that will reveal all bad blocks on the disk.
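A full scan of a large disk takes a long time, so it helps to save the list of bad blocks to a file instead of watching the screen. The sketch below does that with the -o option of badblocks and then reports how many blocks were recorded. The function name scan_disk and the BADBLOCKS variable are assumptions made for this example; run it on an unmounted disk (for instance from the live CD), since badblocks refuses to run a non-destructive test on a mounted device.

```shell
#!/bin/sh
# Path to badblocks; overridable from the environment (an assumption of this sketch).
BADBLOCKS=${BADBLOCKS:-badblocks}

# Hypothetical helper: non-destructive read-write scan of a device,
# saving the bad block numbers to a file.
# Usage: scan_disk DEVICE OUTPUT_FILE
scan_disk() {
    dev=$1
    list=$2
    # -n: non-destructive read-write test, -v: verbose, -o: write bad blocks to a file
    "$BADBLOCKS" -n -v -o "$list" "$dev" || return 1
    # Each non-empty line of the output file is one bad block number.
    count=$(grep -c . "$list")
    echo "$count bad block(s) recorded in $list"
}
```

For example, scan_disk /dev/hda /tmp/hda.badblocks leaves the block numbers in /tmp/hda.badblocks for later reference.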
Conclusion
Occasionally I offer consulting and system administration services to small businesses, and in most cases these tools, along with a Phillips screwdriver, have saved the day. You can save a lot of time by identifying where a problem comes from and replacing just that component, instead of sending the whole system to the vendor's service department. Whether you're having problems with an existing system or building a new one and want to check it before putting it into production, these tools are priceless -- and free.