Dell PowerEdge 2400

System Requirements:

  • Dell PowerEdge 2400

The Problem:

Dell PowerEdge 2400 with 2 Gigabytes of main RAM (4x512MB PC133 ECC, Registered) blue screens out sporadically with the following kernel error:

*** Hardware Malfunction ***
Call your hardware vendor for support
*** The system has halted ***
NMI: Parity Check / Memory Parity Error

More Information:

The PowerEdge 2400 is functionally quite a nice little server assembly; it was the last of the PIII Slot 1 generation systems released by Dell in a tower chassis form factor.

When I originally got my hands on the system, it was billed as being non-functional and was equipped with the following specification:

  • 1x PIII 600EB
  • 2GB Registered ECC PC133 (4x512MB) DIMM
  • 3x8GB SCSI 160 Hard Drives
  • 3Com high capacity server 10/100 NIC

The original stated issue with the server was a propensity towards unreliability and interrupt BSODs, so it had been written off by an IT service company also involved with this particular client, one that was more interested in tendering a new (and unnecessary) £24,000 hardware requisition than in trying to fix the problem.

The source of their problem was so spectacularly simple that I had solved it less than 20 minutes after dusting the server down. In attempting to be clever, they had installed the reference drivers for the integrated Intel motherboard NIC released by the Intel Corporation. The higher-versioned driver resource files had got stuck on the system and couldn’t be removed, preventing driver regression. Anyone who has ever had any experience with integrated NICs on Dell motherboards will know instantly not to use reference drivers on Dell modified hardware. Clearly someone here didn’t.

 

NMI Memory Parity Error

This is the point at which the NMI parity errors started showing through on the system. Initially they would only crop up during low load periods, approximately every 3 days. If the system was being kept busy it never seemed to crop up.
The problem was that the more I cleaned up the messy Windows 2000 install, the more frequent the problems became, until I got around to updating all of the system’s primary firmware.

  • BIOS A09
  • ESM / Backplane A51
  • Raid Controllers 2.8.1.6098

Interestingly, this “strongly recommended” set of updates brought the frequency of the NMI Parity errors to a head, with them now occurring approximately 20-30 seconds after the boot splash screen had finished animating.

Let us stop here and have a quick look at what an NMI error is.
Put as simply as possible: NMI, or Non-Maskable Interrupt, is an Interrupt (commonly referred to as IRQs, best described as an ‘attention signal’) which can be generated by the lower level hardware devices in a system. Standard Interrupt ‘signals’ are used in a computer system to request that the processor pay attention to the initiating device as a priority operation over general data processing tasks. What distinguishes an NMI event is that, generally speaking, when an NMI is triggered by a hardware device, the processor (and fundamentally the operating system kernel) is not at liberty to ignore it.

It is because the Windows Kernel has been coded to comply with the inability to ignore the NMI that the Parity error on the PowerEdge doesn’t behave like your average BSOD – there is no memory dump, no automatic reboot, no ACPI responsiveness. It is the hardware demanding that Windows terminate, rather than Windows deciding for itself that it needs to stop, as it does with the usual NT BSOD or ***** STOP ***** error.

 

The Generic, catch-all approach to fixing NMI Memory Parity errors

I’ve no doubt that if you’re suffering from this problem, and have got here through some sort of search engine, you’ve waded through numerous posts telling you exactly the same thing already. So without wanting to spend too much time on the generics, here’s a quick recap.

  1. “It’s a Memory Parity Error, Stupid” – I’ve not seen a T-Shirt with that on yet, but I suspect I will someday. The blatantly obvious cause of the error is that you have actually experienced a Memory Parity Error. Hard to believe, I know!
    In the server system, you are using ECC Registered main RAM (registered memory is also known as buffered memory). The ECC part is a little check performed in hardware on every memory access, which looks for parity errors (mathematically formulated true/false checks on data) in the data held in RAM. The check is designed to prevent errors being passed back into the processor and wasting additional processor cycles or, worse, causing the system to crash – they should get caught directly in RAM and be fixed through processes present at the hardware level. If you get the NMI Parity Error because you have had a genuine failure in RAM, the system bottles out because it can’t work out how to fix it, so instead of sending gibberish back to the system registers, it panics and suspends processing.
    What causes it? Simple: bad RAM. This can be the result of a manufacturing error, badly inserted DIMMs, sub-standard contact between the DIMM pins and the DIMM slot connectors (due to bent pins, damaged motherboard slots, grit, grime and dust), using DIMMs of different quality standards together (always try and match DIMMs to the same manufacturer, model and batch) or it is even possible that you have the DIMMs inserted in the wrong order – always place the largest DIMMs towards the processor source/sink and the smallest at the end of the array – on the 2400, this is DIMM bank A (farthest from the CPU) followed by B, C and finally D.

    Memory test the DIMMs using as many testing metrics as you can find, both environmentally (outside of the operating system) and natively (inside the operating system). Enable the full RAM test (POST) in the BIOS if you have this option available to you. Microsoft also have some free tools that you can obtain from Microsoft OCA to do this, and there are plenty of others – including the Win32 and bootable Dell Diagnostics & F10 management partitions on Dell configured disks. Whatever you do, don’t think that you can run one iteration and “it means it’s OK”. Be prepared to walk away for a day or two and come back to a green panel and THEN move on to the next test.

  2. Start pulling out your DIMMs and sequence test them, see if the crashing only happens with a certain combination, or in a certain slot. If it does, you can bet with a fair level of certainty that you have a real Parity error from either a bad DIMM or a bad DIMM slot.
  3. Update the system. I spend a lot of time reminding people of this. If you don’t have the latest Kernel enhancements provided through Service Packs and patches, how can you expect drivers and firmware written after their release to be fully interoperable with the older system?
    If you update your firmware, then update the drivers to match the firmware revision.
    If at this point you are attempting to run NT 3.51 or NT 4.0 with 2GB of RAM in a modern hardware environment – go get a nice shiny, shrink wrapped version of 2003 Server. OK, so I threw that one in for no particular reason… someone has to keep Mr. Gates in the manner he’s accustomed to! Don’t they?
  4. Check that you aren’t overloading your PSU (power supply). If you have a poor flow down your power rails you can upset the components inside a computer.
  5. Update stroppy, meddling kernel-mode system drivers to the latest versions that you can find. I’m going to include anti-virus applications in this as well; I have heard of people experiencing NMI issues when using Symantec anti-virus server products (why? I have to ask). Update the fundamentals, or test the system with the latest version of the product if you haven’t already rolled it out, not forgetting to strip out its lower level driver routines!
  6. Yank the RAID array, throw in a blank disk and reinstall Windows. I know it’s a rather unruly suggestion, however you’ve already spent 72 hours performing constant RAM diagnostics, the idea of spending 39 minutes reinstalling Windows 2003 in a non-destructive manner isn’t that big a deal in the greater scheme of things.
    What is this designed to tell you? If the system still dies under this test with a clean install of Windows, then you are likely looking at a hardware problem rather than a genuine Windows malconfiguration. If it works, try installing your base application rollout, drivers and so on and continuity test the system.
  7. Do NOT in-place upgrade/repair the operating system. It is utterly pointless. If it is a Windows error, bite the bullet and spend the time performing a clean install. As you’ll see below, upgrading the in-place operating system does no good whatsoever.
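
For anyone who has never looked at what a parity check actually does (item 1 above), here is a toy sketch in Python. It shows a single even-parity bit stored alongside one byte. This is only an illustration under simplifying assumptions: real ECC memory uses multi-bit codes that can also correct single-bit faults, and the function names here are made up for the example.

```python
from typing import Tuple


def parity_bit(byte: int) -> int:
    """Return the even-parity bit for an 8-bit value (1 if an odd number of bits are set)."""
    return bin(byte & 0xFF).count("1") % 2


def store(byte: int) -> Tuple[int, int]:
    """Simulate writing a byte to memory alongside its parity bit."""
    return byte & 0xFF, parity_bit(byte)


def check(stored: Tuple[int, int]) -> bool:
    """Return True if the stored byte still agrees with its parity bit."""
    byte, parity = stored
    return parity_bit(byte) == parity


word = store(0b10110010)                   # four bits set, so the parity bit is 0
flipped = (word[0] ^ 0b00000100, word[1])  # one bit flips while "in RAM"

print(check(word))     # True: data and parity agree
print(check(flipped))  # False: the mismatch the hardware reports as a parity error
```

A single parity bit can only say that something changed, not which bit changed, which is why a fault that cannot be corrected ends in a halt rather than a quiet repair.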

 

The experiment(s)

Clearly all of the above have been performed on my victim… er, test subject, and the PowerEdge, running Windows 2000 Server, is still experiencing the problems. So how to fix it? I have listed all the additional steps that I took, and outlined briefly the outcome of each, primarily to give you some ideas about what you can do if you are experiencing the same problem. Do remember that just because something impacted or didn’t impact my predicament, that doesn’t for a moment mean the same will be true for you.

The first and most immediate solution became apparent quite quickly in the generic testing.

  1. Don’t run it with 2GB RAM. My system has 4x512MB ECC DIMMs in it. This fills all four banks and maxes out the RAM capacity of the system. If I take out ANY of the four DIMMs from ANY of the slots, the problem vanishes. Completely. All the RAM passed all the testing I threw at it, but for some reason it seems that the system just didn’t want to address and remain stable with 2048MB in it. At least the system could be quickly made production worthy. However… I don’t like leaving things unfinished. It’s not a very professional way to approach the trials that life throws at us.
  2. Put 4x128MB ECC PC133 DIMMs in the system. This was one of the last tests I did, and the system seemed to operate flawlessly with 512MB in it, which suggests that the issue is not a bus related one.
  3. Try a non-Microsoft operating system. I have to confess that at this point I was plumb out of SCSI drives to use, so I elected to perform testing with Live CDs instead – it’s not perfect – but several renditions of Linux showed no observable problems and, perhaps more interestingly, nor did any Live CD versions of Windows.
    I tested using Windows PE for 2003 Server and Bart PE, and although both use the core of Windows XP, they exhibited no obvious problems and certainly no NMI BSODs, even with most of the system drivers loaded (all except the one for the PERC 2/Si controller).
  4. Dump the RAID configuration and start from scratch. Nothing happened, except that I had to pull one of the 512s out to get it to go through setup without BSOD stalling.
  5. Tape restore the system and in-place upgrade from Windows 2000 Server to Windows 2003 Standard Server. New Kernel, more robust operating system you think? Alas no. At this point 2003 Server became the test operating system.
  6. Replace the PIII 600EB processor with a PIII 733EB processor. I got a nice speed boost but, alas, no fix to the problem.
  7. Replace the uni-processor PIII 733EB with two PIII 1000 (1GHz) SL4BS’s (making the system dual-processor). I was quite hopeful on this one, but sadly no. Best £19 I ever spent though!
  8. Pull everything from the PCI bus, disable all the integrated BIOS hardware (NIC, USB etc). No change.
  9. Reset the EEPROM (Pull the BIOS battery) – no change
  10. Alternate and re-sequence the use of the RAID controller positions on the backplane – no change.
  11. Replace the RAID controller DIMM – no change
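
If you want to be methodical about the DIMM pulling in experiment 1 (and step 2 of the generic list), it helps to write the combinations down before you start. A rough Python sketch that prints such a checklist follows; the slot names A to D are the 2400's banks as named above, while the helper names are purely illustrative.

```python
from itertools import combinations

SLOTS = ["A", "B", "C", "D"]  # DIMM banks on the 2400, as named above


def populated_layouts(slots, count):
    """Yield every way of populating `count` of the given slots."""
    for chosen in combinations(slots, count):
        yield chosen


def checklist(slots):
    """Build a test checklist: every one-DIMM-removed layout first, then the full set."""
    runs = list(populated_layouts(slots, len(slots) - 1))
    runs.append(tuple(slots))
    return runs


for run in checklist(SLOTS):
    empty = [s for s in SLOTS if s not in run]
    print("populate", ", ".join(run), "| leave empty:", ", ".join(empty) or "none")
```

Tick each line off only after a long soak test, not a single pass.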

 

Ah!

All seems rather hopeless doesn’t it?

I pulled the server from the farm (again) and brought it somewhere a little more comfortable to work with (again). Cleaned the DIMM slots, cleaned the pins on the DIMMs (again). I have always felt that the NMI error in this case has been something of a misnomer.

I have been gravitating towards the backplane/RAID controller for some time in my other experiments, so primarily out of having no better ideas, I decided to completely disconnect the backplane assembly. I unlinked the ribbon cables and cleaned them, removed all the PCB’s, straightened out the wiring, cleaned off the heavily dust laden drive connectors and cleaned the connector pins on both the daughter board and the motherboard. I then fully cleaned out the inside of the case around the drive bays, reseated the backplane and put the cabling back.

I fired the 2400 up and was immediately presented with a POST warning stating that the ESM firmware revision was out of date.

!!!!****Warning: Firmware is out of date, please update****!!!!

Particularly strange, considering everything had long since been flashed to the latest version before the system was last powered down. I can only surmise that the storage is volatile and that, with the disconnection of the battery sources, the firmware reverted to its original settings.

To my surprise, the system booted straight into Windows 2003 Server and hasn’t NMI’d out since!
Deciding to tempt fate, I downloaded and re-flashed the ESM controller to the latest version, and I am pleased to report that even with the re-flash it hasn’t (yet) fallen over.

I have made no driver changes to the installation; everything is running on Windows 2003 default drivers with the exception of the Tape stream driver from Windows Update. I have installed McAfee Enterprise 8.0i (Patch 14, latest DATs/Engine) on the system and set up IIS in its production configuration. The PowerEdge has mainly been idling since the reinstall (and it’s been lying on its side, as well as counting the system idle process!), but don’t worry. If this holds out for a little longer, I will put those two new SL4BS’s to good use.

All being well, you’re receiving this website from it.

 

Update 15th April 2007: Well, all has been well and you are seeing this website from the PowerEdge 2400. The server has offered impeccable performance since I wrote this article (shame I can’t say the same for Microsoft’s patching downtime requirements so that I could prove the uptime). The server also survived the cold winter without incident, unlike its 2600 counterpart which fell over numerous times. PowerEdge 2400, a workhorse and a graceful lady. Good job Dell.

 

Update 17th June 2007: All still seems to be good. I updated to McAfee 8.0.0i Patch 15, and a couple of days later had a complete system drop out (one of those black screen, no response to anything, just spinning the electricity meter moments). It wasn’t NMI related though, I suspect a PSU fluke in this case. Aside from this, the system continues to run flawlessly.

 

Update 31st December 2007: Everything seems to be running just fine with the server although I have concluded that it doesn’t want to run with 2.0 GB RAM in it as it seems to force the System Management Controller to reboot it around every 2 months. I have reduced it to 1.5 GB where it runs comfortably without any unexpected reboots by the SMC. I have also updated the tape firmware to A18.

Recycle Bin Woes: Chomping at the Bit

System Requirements:

  • Windows NT 3.51, 4.0, 2000, XP, 2003

The Problem:

Yes, I admit it, I have hard disks running in my current servers which have been with me almost as long as I have – that either says something about how old I am, or how tragic I am; probably the latter actually.

Ever since I got fed up with it all back in 1999 and had a 9x cull, everything – and I mean everything on my home network has run some form of NT or another, be it 3.51, 4.0, 2000, XP or 2003; even Longhorn (I’m still not calling it “that”).

Having suffered some major server related setbacks in the last week, the short version of a very long and depressing story is that I’ve needed to reinstall a large number of servers, including two here at home. One of them is a nice cost SCSI320 RAID array that has just been rebuilt, but server 3, with its numerous ATA disks, is a different story.

Thinking long and hard about it, I can conclude that at least two of the disks would have originally started life as FAT volumes, and have been slave drives under 98, Millennium, NT4, 2000 Professional, XP Professional, 2000 Server and now 2003 Server. In between there have been numerous file system changes, back and forward from one to the other through the native conversion tools, or by the modern marvel that is Partition Magic.

The point is, the volumes have had a lot of work done on them: many partition dimension changes, many file system changes and, last but not least, many operating system changes. So when it came down to a little impromptu data reorganisation when server 3 was retired from Intranet duties, imagine my surprise when Windows reported a completely empty 60GB NTFS partition as using well over 3GB of disk space.

All file systems will have some degree of overhead involved in storing their configuration information; NTFS partitions in particular will always have a certain overhead reserved for the on-disk MFT store. However, on a 60GB partition this should be around 95.7MB (100,396,572 bytes) depending upon cluster size. So where has the data gone?

The Fix:

Tragically, this little delight is down to the way in which Windows tracks the recycle bin index for any given volume and every given user account on the system. Every user on the computer is identified by a SID (security identifier), assigned by the SAM when the operating system is installed or when the user account is created. Such a SID looks like:

S-1-5-21-1377080574-1904069225-4148753390-500

Note the ending value of 500, which means this account is almost certainly the system Administrator account in this example.
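
The identifier above is what Windows calls a SID (security identifier), and its last component is the relative identifier (RID). As a small illustration, this Python sketch splits a textual SID into its parts and looks the RID up in a tiny table. The function name and the table are assumptions for the example; only the two built-in RIDs listed are included.

```python
def parse_sid(sid: str) -> dict:
    """Split a textual SID into revision, authority, sub-authorities and RID."""
    parts = sid.split("-")
    if len(parts) < 4 or parts[0] != "S":
        raise ValueError(f"not a SID: {sid!r}")
    numbers = [int(p) for p in parts[1:]]
    return {
        "revision": numbers[0],
        "authority": numbers[1],
        "sub_authorities": numbers[2:-1],
        "rid": numbers[-1],  # relative identifier: the last component
    }


# Only two well-known RIDs, enough for the example above.
WELL_KNOWN_RIDS = {500: "built-in Administrator", 501: "built-in Guest"}

info = parse_sid("S-1-5-21-1377080574-1904069225-4148753390-500")
print(info["rid"], WELL_KNOWN_RIDS.get(info["rid"], "ordinary account"))
# prints: 500 built-in Administrator
```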

For each user SID on the system (the S-1-5-21 style identifier shown above), a subfolder named after that SID is created under the super hidden folder <drive>:\RECYCLER, which contains the file system entries for everything that user holds in the recycle bin.
The meaning of this is clear: Windows maintains a per-user recycle bin schema, not a per-system one. Therefore each user account can populate the recycle bin with whatever information that user deletes, and it remains hidden on the disk.

The trouble is that when a volume gets old, the SIDs easily become stale as operating systems are reinstalled, Active Directory is added and removed, disks are moved between systems and the file systems themselves are altered.
In the case of this particular volume, well over 3GB of data was trapped in the recycle bins of a rather large number of long dead user accounts, which, unintelligently, Windows and its successive reinstallations had never picked up on.

There are two ways to deal with this, the first of which I am not going to go into much detail on.

  1. You can wipe out stray per-user entries from the RECYCLER folder by seizing ownership of the RECYCLER folder and literally deleting the unwanted subfolders. Remember though, the system will let you delete the information from any user account that is not currently loaded, so take care. You can check the identifiers off against the HKU hive in the registry to be certain whether they exist.
  2. The second way, and the way that I am advocating is to pull all the data off of the volume and perform a format using Disk Manager. The logic is very simple, not only does it wipe out everything from the disk, and perform a thorough bad sector check/retest & repair, it also will upgrade and reconstruct the NTFS / MFT on the volume.
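
Before choosing either route, it can be worth measuring how much space each per-user folder is actually holding. The following is a minimal Python sketch, assuming you run it with enough rights to read the RECYCLER folder; the drive letter and the set of "known" identifiers are hypothetical placeholders, and the script only reports, it deletes nothing.

```python
import os


def recycler_usage(recycler_root):
    """Return {subfolder_name: total_bytes} for each per-user folder under RECYCLER."""
    usage = {}
    for entry in os.scandir(recycler_root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    total += os.path.getsize(os.path.join(dirpath, name))
                except OSError:
                    pass  # unreadable file: skip it rather than abort the scan
        usage[entry.name] = total
    return usage


def stale_folders(usage, known_ids):
    """Return the folders whose name is not one of the identifiers known to this system."""
    return {name: size for name, size in usage.items() if name not in known_ids}


if __name__ == "__main__":
    # Hypothetical values: substitute your own drive letter and the
    # identifiers you have checked against the registry.
    root = r"D:\RECYCLER"
    known = {"S-1-5-21-1377080574-1904069225-4148753390-500"}
    if os.path.isdir(root):
        for name, size in sorted(stale_folders(recycler_usage(root), known).items()):
            print(f"{name}: {size / 2**20:.1f} MB")
```

Anything the report lists with a large size and no matching account is a candidate for clean-up.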

The volumes in question in my example were all NTFS from NT 4.0, whereas the standard has been updated through 2000, XP and 2003 with additional features such as Shadow Copy support and native OS level Indexing Service support. I am therefore at pains to imagine what sort of state the MFT on the volume was in with all the in-place bolt-ons added by successive Windows releases, not to mention MFT file fragmentation. Formatting the volume will therefore naturally clean the entire nightmare up quite elegantly.

URLScan 2.5 has a Broken Display Icon in the Add/Remove Programs list under Windows 2000 / XP

System Requirements:

  • Windows 2000
  • Windows XP
  • Internet Information Services 5.0
  • Internet Information Services 5.1

The Problem:

This is nothing more than a cosmetic faux pas by Microsoft which bugs me, so if like me you cannot stand disarray in your Add / Remove Programs, this is the (largely pointless) fix for you!

URLScan 2.5 is an updated version of the HTTP filter released to add something resembling some security control to Microsoft Internet Information Services 4.0, 5.0 and 5.1 – by version 6.0 they have supposedly got security down to a fine art…

When the installer was built someone at Microsoft made a mistake in the uninstall routine strings for Add / Remove programs and as a result, the URLScan 2.5 entry looks thus:

[Screenshot: Add / Remove Programs faux pas for URLScan 2.5]

The error technically exists under NT 4.0 and IIS 4.0, however as NT 4.0 doesn’t draw icons in Add / Remove Programs, the fix is not going to make any difference anyway.

The Fix:

… I told you it was trivial.

The fix is, however, as trivial as the problem. Pull up Regedit and travel to:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\IisUrlScan

Alter the value of the DisplayIcon string from:
C:\WINNT\system32\inetsrv\urlscan\urlscan.exe,1

to read:
C:\WINNT\system32\inetsrv\urlscan\urlscan.exe,0

Hit F5 in Add / Remove Programs and there you have it, a working Icon.
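
If you have more than a couple of servers to tidy up, the same edit can be scripted. Below is a hedged Python sketch using the standard library winreg module, which only exists on Windows. The key path and value name are the ones given above; the helper names are my own, and apply_fix() must be run with administrator rights. It assumes the DisplayIcon path itself contains no comma.

```python
KEY_PATH = r"SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\IisUrlScan"


def fix_icon_index(value: str, new_index: int = 0) -> str:
    """Replace the icon index after the last comma in a DisplayIcon string."""
    path, sep, _old_index = value.rpartition(",")
    if not sep:  # no index present: append one
        return f"{value},{new_index}"
    return f"{path},{new_index}"


def apply_fix() -> None:
    """Rewrite DisplayIcon under HKLM (Windows only; needs administrator rights)."""
    import winreg  # standard library, but only available on Windows

    access = winreg.KEY_READ | winreg.KEY_SET_VALUE
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, access) as key:
        current, value_type = winreg.QueryValueEx(key, "DisplayIcon")
        winreg.SetValueEx(key, "DisplayIcon", 0, value_type, fix_icon_index(current))


# Dry run of the string change only; call apply_fix() yourself on the server.
print(fix_icon_index(r"C:\WINNT\system32\inetsrv\urlscan\urlscan.exe,1"))
# prints: C:\WINNT\system32\inetsrv\urlscan\urlscan.exe,0
```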

Updated release of Microsoft IIS Lockdown 2.1 to include URLScan 2.5

System Requirements:

  • Windows NT 4.0 SP6a
  • Windows 2000
  • Windows XP
  • Internet Information Services 4.0
  • Internet Information Services 5.0
  • Internet Information Services 5.1

The Problem:

Microsoft released the very useful (and necessary) IIS Lockdown 1.0 in August 2001, and updated it to version 2.1 in the November of the same year, adding support for IIS 5.1 and bundling the URLScan 6.0.3547.0 filter ISAPI extension.

The URLScan component was updated to URLScan 2.5 (6.0.3615.0) in an unceremonious fashion in May 2003 to provide an updated ISAPI and, presumably, some IIS 6 support.

It’s not easy to even find URLScan 2.5 in the Microsoft download centre, but it is there, appearing in the search listings as “Setup.exe”.

If you want to bring a new Windows NT4/5.x Server box up to spec you will invariably use IIS Lockdown, which will install URLScan 6.0.3547.0 and restart the IIS Services (or restart the machine in the case of NT4). You then bring the system back up, uninstall the IIS Lockdown URLScan, repeat the service restart and finally install URLScan 2.5 and – you guessed it – restart the services a third time.

The Fix:

My theory is “why bother” with that, so I just reintegrated the new 2003 version of URLScan into the IIS Lockdown 2.1 installer and voila: one install, one service reset, give yourself a pat on the back and go make a cup of tea.

IIS Lockdown 2.1.1 C:Amie Edition : Download : 193KB

If you don’t believe it works, the server you are connecting to right now is using it! (assuming this is still a Windows 2000 Server when you read this).

What is new in URLScan 2.5?

Good question. There isn’t any documentation on the 2.5 release (the readme in my 2.1.1 redist is the original version’s), but the default configuration script has been expanded with some new variables.

You can now control how and where URLScan logs to – useful, as the original version dumps logs in the Inetsrv\URLscan folder in a rather untidy manner.

LogLongUrls=0                  ; If 1, then up to 128K per request can be logged.
                               ; If 0, then only 1k is allowed.
; LoggingDirectory can be used to specify the directory where the
; log file will be created. This value should be the absolute path
; (ie. c:\some\path). If not specified, then UrlScan will create
; the log in the same directory where the UrlScan.dll file is located.
LoggingDirectory=C:\WINNT\system32\inetsrv\urlscan\logs

Query string control and a maximum request size limit are also new, allowing you to prevent people from sucking the life out of a persistent HTTP connection.

[RequestLimits]
The entries in this section impose limits on the length of allowed parts of requests reaching the server.
It is possible to impose a limit on the length of the value of a specific request header by prepending “Max-” to the name of the header. For example, the following entry would impose a limit of 100 bytes to the value of the ‘Content-Type’ header: Max-Content-Type=100
To list a header and not specify a maximum value, use 0 (ie. ‘Max-User-Agent=0’). Also, any headers not listed in this section will not be checked for length limits.
There are 3 special case limits:
– MaxAllowedContentLength specifies the maximum allowed numeric value of the Content-Length request header. For example, setting this to 1000 would cause any request with a content length that exceeds 1000 to be rejected. The default is 30000000.
– MaxUrl specifies the maximum length of the request URL, not including the query string. The default is 260 (which is equivalent to MAX_PATH).
– MaxQueryString specifies the maximum length of the query string. The default is 4096.
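
To make the semantics of those limits a little more concrete, here is a small Python sketch that applies them to a request. It only illustrates the rules as described in the configuration text above and is not how UrlScan itself is implemented; the function name is invented, header names are matched case-sensitively and lengths are counted in characters for simplicity.

```python
from urllib.parse import urlsplit

# Defaults quoted in the configuration text above.
DEFAULT_LIMITS = {
    "MaxAllowedContentLength": 30000000,
    "MaxUrl": 260,
    "MaxQueryString": 4096,
}


def check_request(url, headers, limits=DEFAULT_LIMITS):
    """Return a list of limit names the request would violate (empty if none)."""
    parts = urlsplit(url)
    violations = []
    if len(parts.path) > limits["MaxUrl"]:
        violations.append("MaxUrl")
    if len(parts.query) > limits["MaxQueryString"]:
        violations.append("MaxQueryString")
    if int(headers.get("Content-Length", 0)) > limits["MaxAllowedContentLength"]:
        violations.append("MaxAllowedContentLength")
    # Per-header limits are written "Max-<Header>=N"; 0 means listed but not limited.
    for name, limit in limits.items():
        if name.startswith("Max-") and limit > 0:
            if len(headers.get(name[4:], "")) > limit:
                violations.append(name)
    return violations


limits = dict(DEFAULT_LIMITS, **{"Max-Content-Type": 100})
print(check_request("/index.htm?" + "a" * 5000, {"Content-Type": "text/html"}, limits))
# prints: ['MaxQueryString']
```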