Dell PowerEdge 2400 - C:Amie (not) Com!

System Requirements:

Dell PowerEdge 2400

The Problem:

Dell PowerEdge 2400 with 2 Gigabytes of main RAM (4x512MB PC133 ECC, Registered) blue screens out sporadically with the following kernel error:

*** Hardware Malfunction ***
Call your hardware vendor for support
*** The system has halted ***
NMI: Parity Check / Memory Parity Error

More Information:

The PowerEdge 2400 is functionally quite a nice little server assembly. It was the last of the PIII Slot 1 generation systems released by Dell in a tower chassis form factor.

When I originally got my hands on the system, it was billed as being non-functional and was equipped with the following specification:

1x PIII 600EB
2GB Registered ECC PC133 (4x512MB) DIMM
3x8GB SCSI 160 Hard Drives
3Com high capacity server 10/100 NIC

The original stated issue with the server was a propensity towards unreliability and interrupt BSODs, so it had been written off by an IT service company also involved with this particular client, one that was more interested in tendering a new (and unnecessary) £24,000 hardware requisition than in trying to fix the problem.

The source of their problem was so spectacularly simple that I had solved it in less than 20 minutes after dusting it down. In attempting to be clever, they had installed the reference drivers for the integrated Intel motherboard NIC released by the Intel Corporation. The higher-version driver resource files had got stuck on the system and couldn’t be removed, preventing driver regression. Anyone who has ever had any experience with integrated NICs on Dell motherboards will know instantly not to use reference drivers on Dell modified hardware. Clearly someone here didn’t.

NMI Memory Parity Error

This is the point at which the NMI parity errors started showing through on the system. Initially they would only crop up during low load periods, approximately every 3 days. If the system was being kept busy it never seemed to crop up.
The problem was that the more I cleaned up the messy Windows 2000 install, the more frequent the problems became, until I got around to updating all of the system’s primary firmware.

BIOS A09
ESM / Backplane A51
Raid Controllers 2.8.1.6098

Interestingly, this “strongly recommended” set of updates brought the frequency of the NMI Parity errors to a head, with them now occurring approximately 20-30 seconds after the boot splash screen had finished animating.

Let us stop here and have a quick look at what an NMI error is.
Put as simply as possible: NMI, or Non-Maskable Interrupt, is an interrupt (interrupts are commonly referred to as IRQs and are best described as ‘attention signals’) which can be generated by the lower level hardware devices in a system. Standard interrupt ‘signals’ are used in a computer system to request that the processor pay attention to the initiating device as a priority operation over general data processing tasks. What distinguishes an NMI event is that, generally speaking, when an NMI is triggered by a hardware device, the processor (and fundamentally the operating system kernel) is not at liberty to ignore it.

It is because the Windows Kernel has been coded to comply with the inability to ignore the NMI that the Parity error on the PowerEdge doesn’t behave like your average BSOD – there is no memory dump, no automatic reboot, no ACPI responsiveness. It is the hardware demanding that Windows terminate, rather than Windows deciding that it needs to stop, as it does with the familiar NT BSOD or ***** STOP ***** error.
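To make the maskable/non-maskable distinction concrete, here is a minimal toy model in Python. It is only a sketch: the `ToyCPU` class and its method names are invented for illustration, and the halt-on-NMI behaviour simply mirrors what the PowerEdge does in this article rather than what every system must do.

```python
class ToyCPU:
    """Toy model: maskable interrupts can be deferred, an NMI cannot."""

    def __init__(self):
        self.interrupts_enabled = True  # the "mask" for ordinary IRQs
        self.pending = []               # maskable interrupts waiting for service
        self.log = []                   # what the CPU actually acted on
        self.halted = False

    def raise_irq(self, source):
        """Ordinary interrupt: serviced now, or queued if interrupts are masked."""
        if self.halted:
            return
        if self.interrupts_enabled:
            self.log.append(f"serviced IRQ from {source}")
        else:
            self.pending.append(source)

    def raise_nmi(self, reason):
        """Non-maskable interrupt: the mask is never consulted."""
        self.log.append(f"NMI: {reason}")
        self.halted = True  # modelled on the halt described in this article


cpu = ToyCPU()
cpu.interrupts_enabled = False           # the kernel is busy and masks interrupts
cpu.raise_irq("NIC")                     # deferred, not lost
cpu.raise_nmi("memory parity error")     # handled regardless of the mask
cpu.raise_irq("disk")                    # ignored, the system has halted

print(cpu.pending)
print(cpu.log)
print(cpu.halted)
```

The point of the sketch is the asymmetry: `raise_irq` checks the mask before doing anything, while `raise_nmi` has no such check to make.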

The Generic, catch all approach to fixing NMI Memory Parity errors

I’ve no doubt that if you’re suffering from this problem, and have got here through some sort of search engine, you’ve waded through numerous posts telling you exactly the same thing already. So without wanting to spend too much time on the generics, here’s a quick recap.

“It’s a Memory Parity Error, Stupid” – I’ve not seen a T-Shirt with that on yet, but I suspect I will someday. The blatantly obvious cause of the error is that you have actually experienced a Memory Parity Error. Hard to believe, I know! In the server system, you are using ECC Registered main RAM (registered memory is also known as buffered memory). The ECC part is a little sub-process performed on the clock which checks to see if there are any parity (mathematically formulated true/false checks on data) errors in the data stored in RAM. The check is designed to prevent errors being passed back into the processor and wasting additional processor cycles or, worse, causing the system to crash; they should get caught directly in RAM and be fixed through processes present at the hardware level. If you get the NMI Parity Error because you have had a genuine failure in RAM, the system bottles out because it can’t work out how to fix it, so instead of sending gibberish back to the system registers, it panics and suspends processing.
What causes it? Simple: bad RAM. This can be the result of a manufacturing error, badly inserted DIMMs, sub-standard contact between the DIMM pins and the DIMM slot connectors (due to bent pins, damaged motherboard slots, grit, grime and dust), using DIMMs of different quality standards together (always try and match DIMMs to the same manufacturer, model and batch) or it is even possible that you have the DIMMs inserted in the wrong order – always place the largest DIMMs towards the processor source/sink and the smallest at the end of the array – on the 2400, this is DIMM bank A (farthest from the CPU) followed by B, C and finally D.
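For anyone who has not met a parity check before, the idea can be sketched in a few lines of Python. This is a deliberately simplified illustration using a single even-parity bit over one small word; real ECC DIMMs use wider codes that can also correct single-bit errors, and the function names here are invented for the example.

```python
def parity_bit(word):
    """Even parity: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def store(word):
    """Pretend to write a word to RAM alongside its parity bit."""
    return word, parity_bit(word)


def check(word, stored_parity):
    """Pretend to read the word back: does it still match its parity bit?"""
    return parity_bit(word) == stored_parity


data, parity = store(0b10110010)

single_flip = data ^ (1 << 3)          # one bit goes bad in "RAM"
double_flip = single_flip ^ (1 << 5)   # a second bit goes bad as well

print(check(data, parity))         # intact word passes the check
print(check(single_flip, parity))  # a single flipped bit is caught
print(check(double_flip, parity))  # two flipped bits cancel out and slip through
```

The last line is why a lone parity bit is only a detection mechanism, and a limited one: it says that something is wrong, not which bit, and it misses any even number of flipped bits.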

Memory test the DIMMs using as many testing metrics as you can find, both environmentally (outside of the operating system) and natively (inside the operating system). Enable the full RAM test (POST) in the BIOS if you have this option available to you. Microsoft also have some free tools that you can obtain from Microsoft OCA to do this, and there are plenty of others, including the Win32 and bootable Dell Diagnostics & F10 management partitions on Dell configured disks. Whatever you do, don’t think that you can run one iteration and “it means it’s OK”. Be prepared to walk away for a day or two and come back to a green panel and THEN move onto the next test.
Start pulling out your DIMMs and sequence test them, see if the crashing only happens with a certain combination, or in a certain slot. If it does, you can bet with a fair level of certainty that you have a real Parity error from either a bad DIMM or a bad DIMM slot.
Update the system. I spend a lot of time reminding people of this. If you don’t have the latest Kernel enhancements provided through Service Packs and patches, how can you expect drivers and firmware written after their release to be fully interoperable with the older systems?
If you update your firmware, then update the drivers to match the firmware revision.
If at this point you are attempting to run NT 3.51 or NT 4.0 with 2GB of RAM in a modern hardware environment – go get a nice shiny, shrink wrapped version of 2003 Server. OK, so I threw that one in for no particular reason… someone has to keep Mr. Gates in the manner he’s accustomed to! Don’t they?
Check that you aren’t overloading your PSU (power supply). If you have a poor flow down your power rails you can upset the components inside a computer.
Update stroppy, meddling kernel-mode system drivers to the latest versions that you can find. I’m going to include anti-virus applications in this as well. I have heard of people experiencing NMI issues when using Symantec anti-virus server products (why? I have to ask) – update the fundamentals or try and test the system with the latest version of the product if you aren’t already rolled out, not forgetting to strip out its lower level driver routines!
Yank the RAID array, throw in a blank disk and reinstall Windows. I know it’s a rather unruly suggestion, however you’ve already spent 72 hours performing constant RAM diagnostics, the idea of spending 39 minutes reinstalling Windows 2003 in a non-destructive manner isn’t that big a deal in the greater scheme of things.
What is this designed to tell you? If the system still dies under this test with a clean install of Windows, then you are likely looking at a hardware problem rather than a genuine Windows malconfiguration. If it works, try installing your base application rollout, drivers and so on and continuity test the system.
Do NOT in-place upgrade/repair the operating system. It is utterly pointless. If it is a Windows error, bite the bullet and spend the time performing a clean install. As you’ll see below, in-place upgrading the operating system does no good whatsoever.
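The pull-and-sequence-test step in the list above is easier to stick to if you write the runs down before you start. Here is a minimal sketch in Python that generates such a plan. The DIMM labels are hypothetical, the slot names follow the 2400’s banks A to D, and whether the board will boot with a lone DIMM in a slot other than A is an assumption you should check against the system documentation.

```python
from itertools import combinations

DIMMS = ["dimm1", "dimm2", "dimm3", "dimm4"]  # hypothetical labels for the 4x512MB modules
SLOTS = ["A", "B", "C", "D"]                  # DIMM banks on the 2400


def single_dimm_runs(dimms, slots):
    """Each DIMM alone in each slot: helps separate a bad module from a bad slot."""
    return [{slot: dimm} for dimm in dimms for slot in slots]


def leave_one_out_runs(dimms, slots):
    """All but one DIMM fitted, filling the slots in order from A."""
    return [dict(zip(slots, subset))
            for subset in combinations(dimms, len(dimms) - 1)]


plan = single_dimm_runs(DIMMS, SLOTS) + leave_one_out_runs(DIMMS, SLOTS)

for number, layout in enumerate(plan, start=1):
    print(f"Run {number:2d}: {layout}")
```

Tick each run off only after the memory tests have been left to soak, and note which layouts crash. A failure that follows one module points at the DIMM; a failure that follows one bank points at the slot.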

The experiment(s)

Clearly all of the above have been performed on my victim… er test subject and the PowerEdge, running Windows 2000 Server is still experiencing the problems. So how to fix it? I have outlined all the additional steps that I took, and outlined briefly the outcome of each. I’ve listed them primarily to give you some ideas over what you can do if you are experiencing the same problem. Do remember that just because something impacted or didn’t impact my predicament, doesn’t for a moment mean the same will be true for you.

The first and most immediate solution became apparent quite quickly in the generic testing.

Don’t run it with 2GB RAM. My system has 4x512MB ECC DIMMs in it. This fills all four banks and maxes out the RAM capacity of the system. If I take out ANY of the four DIMMs from ANY of the slots, the problem vanishes. Completely. All the RAM passed all the testing I threw at it, but for some reason it seems that the system just didn’t want to address and remain stable with 2048MB in it. At least the system could be quickly made production worthy. However… I don’t like leaving things unfinished. It’s not a very professional way to approach the trials that life throws at us.
Put 4x128MB ECC PC133 DIMMs in the system. This was one of the last tests I did, and the system seemed to operate flawlessly with 512MB in it, which suggests that the issue is not a bus related one.
Try a non-Microsoft operating system. I have to confess that at this point I was plumb out of SCSI drives to use, so I elected to perform testing with Live CDs instead – it’s not perfect – but several renditions of Linux showed no observable problems and, perhaps more interestingly, no Live CD versions of Windows did either.
I tested using Windows PE for 2003 Server and Bart PE, and although both use the core of Windows XP, they exhibited no obvious problems and certainly no NMI BSODs, even with most of the system drivers loaded (the exception being the PERC 2/Si controller).
Dump the RAID configuration and start from scratch. Nothing happened except having to pull one of the 512s out to get it to go through setup without BSOD stalling.
Tape restore the system and in-place upgrade from Windows 2000 Server to Windows 2003 Standard Server. New Kernel, more robust operating system you think? Alas no. At this point 2003 Server became the test operating system.
Replace the PIII 600EB processor with a PIII 733EB processor. I got a nice speed boost but, alas, no fix to the problem.
Replace the uni-processor PIII 733EB with two PIII 1000 (1GHz) SL4BS’s (making the system dual-processor). I was quite hopeful on this one, but sadly no. Best £19 I ever spent though!
Pull everything from the PCI bus, disable all the integrated BIOS hardware (NIC, USB etc). No change.
Reset the EEPROM (Pull the BIOS battery) – no change
Alternate and re-sequence the use of the RAID controller positions on the backplane – no change.
Replace the RAID controller DIMM – no change

Ah!

All seems rather hopeless doesn’t it?

I pulled the server from the farm (again) and brought it somewhere a little more comfortable to work with (again). Cleaned the DIMM slots, cleaned the pins on the DIMMs (again). I have always felt that the NMI error in this case has been something of a misnomer.

I have been gravitating towards the backplane/RAID controller for some time in my other experiments, so primarily out of having no better ideas, I decided to completely disconnect the backplane assembly. I unlinked the ribbon cables and cleaned them, removed all the PCBs, straightened out the wiring, cleaned off the heavily dust laden drive connectors and cleaned the connector pins on both the daughter board and the motherboard. I then fully cleaned out the inside of the case around the drive bays, reseated the backplane and put the cabling back.

I fired the 2400 up and was immediately presented with a POST warning stating that the ESM firmware revision was out of date.

!!!!****Warning: Firmware is out of date, please update****!!!!

Particularly strange considering everything was long since flashed to the latest version before the system was last powered down. I can only surmise that the flashed firmware is not held in non-volatile storage and that, with the disconnection of the battery sources, it reverted to its original settings.

To my surprise, the system booted straight into Windows 2003 Server and hasn’t NMI’d out since!
Deciding to tempt fate, I downloaded and re-flashed the ESM controller to the latest version, and I am pleased to report that even with the re-flash it hasn’t (yet) fallen over.

I have made no driver changes to the installation; everything is running on Windows 2003 default drivers with the exception of the Tape stream driver from Windows Update. I have installed McAfee Enterprise 8.0i (Patch 14, latest DATs/Engine) on the system and set up IIS in its production configuration. The PowerEdge has mainly been idling since the reinstall (and it’s been lying on its side, as well as counting the system idle process!), but don’t worry. If this holds out for a little longer, I will put those two new SL4BS’s to good use.

All being well, you’re receiving this website from it.

Update 15th April 2007: Well, all has been well and you are seeing this website from the PowerEdge 2400. The server has offered impeccable performance since I wrote this article (shame I can’t say the same for Microsoft’s patching downtime requirements so that I could prove the uptime). The server also survived the cold winter without incident, unlike its 2600 counterpart which fell over numerous times. PowerEdge 2400, a workhorse and a graceful lady. Good job Dell.

Update 17th June 2007: All still seems to be good. I updated to McAfee 8.0.0i Patch 15, and a couple of days later had a complete system drop out (one of those black screen, no response to anything, just spinning the electricity meter moments). It wasn’t NMI related though; I suspect a PSU fluke in this case. Aside from this, the system continues to run flawlessly.

Update 31st December 2007: Everything seems to be running just fine with the server although I have concluded that it doesn’t want to run with 2.0 GB RAM in it as it seems to force the System Management Controller to reboot it around every 2 months. I have reduced it to 1.5 GB where it runs comfortably without any unexpected reboots by the SMC. I have also updated the tape firmware to A18.