Error: “Error during write buffer commit. Please check all cables and connections. Also verify that proper drive termination is used” while attempting to upgrade the firmware on Dell PowerVault 100T DDS4

System Requirements:

  • Dell PowerVault 100T DDS4

The Problem:

While attempting to upgrade the firmware on a PowerVault 100T DDS4, you receive the following error message from the Dell Windows updater.

Error during write buffer commit. Please check all cables and connections. Also verify that proper drive termination is used.

The firmware updater exits from the session without upgrading the firmware.

More Information:

I have two Dell PowerVault 100T DDS4 drives, one in a PowerEdge 2600 and the other in the legend that is my PowerEdge 2400. Both systems run Windows Server 2003 and are pretty much vanilla Microsoft setups. The 2600 quite happily takes the firmware updates for the 100T DDS4, while the 2400 always drops out of the update procedure with the error message listed above.

Luckily the drive suffers no ill effects from encountering the error so far as I can see.

 

Obviously, do pay attention to the error message. Check your termination and cabling if necessary. However, there is a simpler solution: you’re flashing the wrong firmware.

There are two different versions of the 100T DDS4, and in their infinite wisdom Dell didn’t think to add “v2” to the hardware name. What it boils down to is that there are two firmware kernels, the v8000 and the v9000. If you have an older drive then you have a v8000… and the Windows firmware updater is designed only for the v9000 drive.

 

How do I know which drive I have?

You can either scoot off and look in the PnP IDs for the system, or you could reboot the system and watch the POST… but that means downtime!
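If you fancy the PnP route, the IDs can be listed from a command prompt on Windows XP/2003 without any downtime. This is only a sketch: it assumes your build exposes the drive through WMI’s Win32_TapeDrive class and that the firmware revision is reported as part of the PnP device ID, which isn’t guaranteed on every system.

:: List tape drives and their Plug and Play IDs (wmic ships with XP/2003)
wmic path Win32_TapeDrive get Caption, DeviceID, PNPDeviceID

:: If the revision string is embedded in the ID, one beginning with an 8
:: points to the older v8000 drive, and a 9 to the newer v9000.

Still, there is an easier way.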

The lazy way to do it is to re-run the Windows updater package for the v9000. Before the flash begins you will be prompted with the following dialogue:

DDS4 Version Check

If the Installed Version string begins with an 8 (825B in this case), you have the older v8000 drive. If you have a v9000 drive and are still seeing the error at this juncture, you really do need to check your termination and cabling.

 

Updating the firmware

Once you are secure in the knowledge that you have a v8000, the procedure is quite simply to download the hard drive install package and follow the instructions to initiate the flash through the (provided) terminal application.

I have checked from A05 through to the latest A17 (at the time of writing) and, while the highest FWID firmware revision seems to be labeled as 825B, the date/time stamps on the images are being updated, which makes me suspect that Dell simply aren’t updating the build number, if indeed they are changing anything at all. In contrast, the v9000 series is receiving incremental version numbering. Nonetheless, grab yourself the latest and greatest and get flashing!

 

Update 31st December 2007: There is now an A18, and while the version number is identical the date stamps have again changed.

Dell PowerEdge 2400

System Requirements:

  • Dell PowerEdge 2400

The Problem:

Dell PowerEdge 2400 with 2 Gigabytes of main RAM (4x512MB PC133 ECC, Registered) blue-screens sporadically with the following kernel error:

*** Hardware Malfunction ***
Call your hardware vendor for support
NMI: Parity Check / Memory Parity Error
*** The system has halted ***

More Information:

The PowerEdge 2400 is functionally quite a nice little server assembly; it was the last of the PIII Slot 1 generation systems released by Dell in a tower chassis form factor.

When I originally got my hands on the system, it was billed as non-functional and was equipped with the following specification:

  • 1x PIII 600EB
  • 2GB Registered ECC PC133 (4x512MB) DIMM
  • 3x8GB SCSI 160 Hard Drives
  • 3Com high capacity server 10/100 NIC

The original stated issue with the server was a propensity towards unreliability and interrupt BSODs, so it had been written off by an IT service company (also involved with this particular client) that was more interested in tendering a new and unnecessary £24,000 hardware requisition than in trying to fix the problem.

The source of their problem was so spectacularly simple that I had solved it in less than 20 minutes after dusting it down. In attempting to be clever, they had installed Intel’s reference drivers for the integrated Intel motherboard NIC. The newer driver resource files had become stuck on the system and couldn’t be removed, preventing a driver rollback. Anyone who has ever had any experience with integrated NICs on Dell motherboards will know instantly not to use reference drivers on Dell-modified hardware. Clearly someone here didn’t.

 

NMI Memory Parity Error

This is the point at which the NMI parity errors started showing through on the system. Initially they would only crop up during low-load periods, approximately every 3 days. If the system was being kept busy, the error never seemed to crop up.
The problem was that the more I cleaned up the messy Windows 2000 install, the more frequent the problems became, until I got around to updating all of the system’s primary firmware.

  • BIOS A09
  • ESM / Backplane A51
  • RAID controllers 2.8.1.6098

Interestingly, this “strongly recommended” set of updates brought the frequency of the NMI Parity errors to a head, with them now occurring approximately 20-30 seconds after the boot splash screen has finished animating.

Let us stop here and have a quick look at what an NMI error is.
Put as simply as possible: an NMI, or Non-Maskable Interrupt, is an interrupt (interrupts are commonly referred to as IRQs, and are best described as ‘attention signals’) which can be generated by the lower-level hardware devices in a system. Standard interrupt ‘signals’ are used in a computer system to request that the processor pay attention to the initiating device as a priority operation over general data processing tasks. What distinguishes an NMI event is that, generally speaking, when an NMI is triggered by a hardware device, the processor (and fundamentally the operating system kernel) is not at liberty to ignore it.

It is because the Windows kernel has been coded to comply with this inability to ignore the NMI that the parity error on the PowerEdge doesn’t behave like your average BSOD – there is no memory dump, no automatic reboot, no ACPI responsiveness. It is the hardware demanding that Windows terminate, rather than Windows deciding that it needs to stop, as it does with the more familiar NT BSOD or ***** STOP ***** error.

 

The generic, catch-all approach to fixing NMI Memory Parity errors

I’ve no doubt that if you’re suffering from this problem and have got here through some sort of search engine, you’ve waded through numerous posts telling you exactly the same thing already. So, without wanting to spend too much time on the generics, here’s a quick recap.

  1. “It’s a Memory Parity Error, Stupid” – I’ve not seen a T-shirt with that on yet, but I suspect I will someday. The blatantly obvious cause of the error is that you have actually experienced a memory parity error. Hard to believe, I know! In the server system you are using registered (buffered) ECC main RAM. Parity checking is a little sub-process performed on each transfer which looks for parity errors (mathematically formulated true/false checks on the data) in the data being fed into RAM. The parity check is designed to prevent errors being passed back to the processor, wasting additional processor cycles or, worse, causing the system to crash – they should get caught directly in RAM and be fixed through processes present at the hardware level. If you get the NMI Parity Error because you have had a genuine failure in RAM, the system bottles out because it can’t work out how to fix it, so instead of sending gibberish back to the system registers, it panics and suspends processing.
    What causes it? Simple: bad RAM. This can be the result of a manufacturing error, badly inserted DIMMs, sub-standard contact between the DIMM pins and the slot connectors (due to bent pins, damaged motherboard slots, grit, grime and dust), using DIMMs of different quality standards together (always try to match DIMMs to the same manufacturer, model and batch), or even having the DIMMs inserted in the wrong order – always place the largest DIMMs towards the processor source/sink and the smallest at the end of the array; on the 2400 this means DIMM bank A (farthest from the CPU) followed by B, C and finally D.

    Memory test the DIMMs using as many testing methods as you can find, both environmentally (outside of the operating system) and natively (inside the operating system). Enable the full RAM test (POST) in the BIOS if you have this option available to you. Microsoft also have some free tools that you can obtain from Microsoft OCA to do this, and there are plenty of others – including the Win32 and bootable Dell Diagnostics & F10 management partitions on Dell-configured disks. Whatever you do, don’t think that you can run one iteration and “it means it’s OK”. Be prepared to walk away for a day or two, come back to a green panel and THEN move on to the next test.

  2. Start pulling out your DIMMs and sequence-test them; see if the crashing only happens with a certain combination, or in a certain slot. If it does, you can bet with a fair level of certainty that you have a real parity error from either a bad DIMM or a bad DIMM slot.
  3. Update the system. I spend a lot of time reminding people of this. If you don’t have the latest kernel enhancements provided through Service Packs and patches, how can you expect drivers/firmware written after their release to be fully interoperable with the older systems?
    If you update your firmware, then update the drivers to match the firmware revision.
    If at this point you are attempting to run NT 3.51 or NT 4.0 with 2GB of RAM in a modern hardware environment – go get a nice shiny, shrink-wrapped version of 2003 Server. OK, so I threw that one in for no particular reason… someone has to keep Mr. Gates in the manner he’s accustomed to! Don’t they?
  4. Check that you aren’t overloading your PSU (power supply). If you have a poor flow down your power rails you can upset the components inside a computer.
  5. Update stroppy, meddling kernel-mode system drivers to the latest versions that you can find. I’m going to include anti-virus applications in this as well; I have heard of people experiencing NMI issues when using Symantec anti-virus server products (why? I have to ask) – update the fundamentals, or try testing the system with the latest version of the product if you aren’t already rolled out, not forgetting to strip out its lower-level driver routines!
  6. Yank the RAID array, throw in a blank disk and reinstall Windows. I know it’s a rather unruly suggestion; however, you’ve already spent 72 hours performing constant RAM diagnostics, so the idea of spending 39 minutes reinstalling Windows 2003 in a non-destructive manner isn’t that big a deal in the greater scheme of things.
    What is this designed to tell you? If the system still dies under this test with a clean install of Windows, then you are likely looking at a hardware problem rather than a genuine Windows misconfiguration. If it works, try installing your base application rollout, drivers and so on, and continuity-test the system.
  7. Do NOT in-place upgrade/repair the operating system. It is utterly pointless. If it is a Windows error, bite the bullet and spend the time performing a clean install. As you’ll see below, upgrading the operating system in place does no good whatsoever.

 

The experiment(s)

Clearly all of the above have been performed on my victim… er, test subject, and the PowerEdge, running Windows 2000 Server, is still experiencing the problems. So how to fix it? I have listed all the additional steps that I took and briefly outlined the outcome of each, primarily to give you some ideas of what you can do if you are experiencing the same problem. Do remember that just because something did or didn’t impact my predicament, it doesn’t for a moment mean the same will be true for you.

The first and most immediate solution became apparent quite quickly in the generic testing.

  1. Don’t run it with 2GB RAM. My system has 4x512MB ECC DIMMs in it. This fills all four banks and maxes out the RAM capacity of the system. If I take out ANY of the four DIMMs from ANY of the slots, the problem vanishes. Completely. All the RAM passed all the testing I threw at it, but for some reason the system just didn’t want to address, and remain stable with, 2048MB in it. At least the system could quickly be made production worthy. However… I don’t like leaving things unfinished. It’s not a very professional way to approach the trials that life throws at us.
  2. Put 4x128MB ECC PC133 DIMMs in the system. This was one of the last tests I did, and the system seemed to operate flawlessly with 512MB in it, proving that the issue is not a bus-related one.
  3. Try a non-Microsoft operating system. I have to confess that at this point I was plumb out of SCSI drives to use, so I elected to perform testing with Live CDs instead – it’s not perfect – but several renditions of Linux showed no observable problems and, perhaps more interestingly, neither did any Live CD versions of Windows.
    I tested using Windows PE for 2003 Server and Bart PE, and although both use the core of Windows XP, they exhibited no obvious problems and certainly no NMI BSODs, even with most of the system drivers loaded – except the PERC 2/Si controller.
  4. Dump the RAID configuration and start from scratch. Nothing happened, except having to pull one of the 512s out to get it to go through setup without BSOD stalling.
  5. Tape-restore the system and in-place upgrade from Windows 2000 Server to Windows 2003 Standard Server. New kernel, more robust operating system, you think? Alas, no. From this point on, 2003 Server became the test operating system.
  6. Replace the PIII 600EB processor with a PIII 733EB processor. I got a nice speed boost but, alas, no fix to the problem.
  7. Replace the uni-processor PIII 733EB with two PIII 1000 (1GHz) SL4BSs (making the system dual-processor). I was quite hopeful on this one, but sadly no. Best £19 I ever spent though!
  8. Pull everything from the PCI bus, disable all the integrated BIOS hardware (NIC, USB etc). No change.
  9. Reset the EEPROM (Pull the BIOS battery) – no change
  10. Alternate and re-sequence the use of the RAID controller positions on the backplane – no change.
  11. Replace the RAID controller DIMM – no change

 

Ah!

All seems rather hopeless doesn’t it?

I pulled the server from the farm (again) and brought it somewhere a little more comfortable to work with (again). Cleaned the DIMM slots, cleaned the pins on the DIMMs (again). I have always felt that the NMI error in this case has been something of a misnomer.

I have been gravitating towards the backplane/RAID controller for some time in my other experiments, so, primarily out of having no better ideas, I decided to completely disconnect the backplane assembly. I unlinked the ribbon cables and cleaned them, removed all the PCBs, straightened out the wiring, cleaned off the heavily dust-laden drive connectors and cleaned the connector pins on both the daughterboard and the motherboard. I then fully cleaned out the inside of the case around the drive bays, reseated the backplane and put the cabling back.

I fired the 2400 up and was immediately presented with a POST warning stating that the ESM firmware revision was out of date.

!!!!****Warning: Firmware is out of date, please update****!!!!

Particularly strange, considering everything had long since been flashed to the latest version before the system was last powered down. I can only surmise that part of the process is not truly non-volatile and that, with the disconnection of the battery sources, the firmware reverted to its original settings.

To my surprise, the system booted straight into Windows 2003 Server and hasn’t NMI’d out since!
Deciding to tempt fate, I downloaded and re-flashed the ESM controller to the latest version, and I am pleased to report that even with the re-flash it hasn’t (yet) fallen over.

I have made no driver changes to the installation; everything is running on Windows 2003 default drivers, with the exception of the tape streamer driver from Windows Update. I have installed McAfee Enterprise 8.0i (Patch 14, latest DATs/engine) on the system and set up IIS in its production configuration. The PowerEdge has mainly been idling since the reinstall (and it’s been lying on its side, as well as counting the system idle process!), but don’t worry. If this holds out for a little longer, I will put those two new SL4BSs to good use.

All being well, you’re receiving this website from it.

 

Update 15th April 2007: Well, all has been well and you are seeing this website from the PowerEdge 2400. The server has offered impeccable performance since I wrote this article (shame I can’t say the same for Microsoft’s patching downtime requirements, otherwise I could prove the uptime). The server also survived the cold winter without incident, unlike its 2600 counterpart, which fell over numerous times. PowerEdge 2400, a workhorse and a graceful lady. Good job, Dell.

 

Update 17th June 2007: All still seems to be good. I updated to McAfee 8.0.0i Patch 15, and a couple of days later had a complete system drop-out (one of those black screen, no response to anything, just spinning the electricity meter moments). It wasn’t NMI related, though; I suspect a PSU fluke in this case. Aside from this, the system continues to run flawlessly.

 

Update 31st December 2007: Everything seems to be running just fine with the server, although I have concluded that it doesn’t want to run with 2.0GB of RAM in it, as that seems to force the System Management Controller to reboot it around every two months. I have reduced it to 1.5GB, where it runs comfortably without any unexpected reboots by the SMC. I have also updated the tape firmware to A18.

Mesh 2200T (Clevo 2200T) only operates in PIO4 disk access mode

System Requirements:

  • Windows 2000
  • Windows XP
  • SIS 630 / 720 Chipset

The Problem:

Yep, it is another one of those Clevo / Kapok issues! This one was a client’s problem from September 2004 that they were rather desperate to have resolved, and yet no one could come up with an answer.

Essentially the problem boils down to sub-standard performance from the system, which is painfully noticeable at boot time when the system has a lot of disk activity. The reason this happens is that Windows gets confused by the BIOS/chipset firmware disk access mode instructions, cannot safely assume it can use Direct Memory Access (DMA) for data transfer, and so gets locked into the processor-intensive PIO (mode 4) to do anything.

As usual, this is the result of a badly written set of BIOS firmware.

The Fix:

The 2200T uses an archaic Phoenix core (1.00.03, 10/22/97) with the usual set of extensions to make the modern hardware interconnects work correctly. Before continuing, you need to ensure that the firmware build for the BIOS (this isn’t the core 1997 date) is at the highest possible level; this I believe to be 1.17, built on 28/10/2002.

I am also aware that there is a 1.00.04 core version available in some quarters. I am not able to state the impact of any of these changes on this core. I provide no support or warranty for using the BIOS files below. Use them at your own risk.

Version history (date and changelog):

  • 07/09/2001
  • 27/09/2001
  • 20/11/2001
    1. Recognize the PIII 1.13GHz CPU (Tualatin core).
    2. Support system memory up to 1GB.
  • 30/01/2002
  • 21/03/2002
    1. Solve the error message under the Windows XP event viewer.
  • 17/05/2002
    1. Support the “Fn+F6” function key for some non-DDC external display devices.
  • 28/10/2002
    1. Fix the CD-ROM boot sometimes failing with the TEAC DW-28E and Samsung SN-608.

Windows Registry Fix:

If you reinstall your system at this point you may have an XP-compatible BIOS; however, it will still come up in PIO mode 4. It is SiS who actually came up with a solution for this problem, which is a known issue under Windows 2000 on the SiS 530/620/630/540 chipsets.

The fix dupes Windows 2000/XP into using UDMA on the hard disk channels. It isn’t what I would call an ideal solution; however, it does at least get the computer moving again.

You can download the SiS DMA fix here: dmapatch.zip (67KB)
All the program does is make registry changes on the system. Once you have run the .exe you need to restart the computer before it will come up using DMA. A sketch of the equivalent manual registry change is given below.
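For reference, utilities of this sort generally work by relaxing the timing-mode restrictions stored under the IDE channel class keys and clearing the cached identify-data checksums, so that Windows renegotiates DMA at the next boot. I haven’t pulled dmapatch.exe apart, so the exact values it writes are an assumption on my part; the commands below are only a sketch of the generic form of that change as used on Microsoft’s standard IDE channels (the \0001 and \0002 sub-keys are normally the primary and secondary channels; reg.exe is included with XP/2003, while Windows 2000 users will need the Support Tools or regedit). Export the key before you touch it.

:: Allow every timing mode on the primary IDE channel (assumed generic fix,
:: not necessarily what dmapatch.exe itself writes)
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E96A-E325-11CE-BFC1-08002BE10318}\0001" /v MasterDeviceTimingModeAllowed /t REG_DWORD /d 0xFFFFFFFF /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E96A-E325-11CE-BFC1-08002BE10318}\0001" /v SlaveDeviceTimingModeAllowed /t REG_DWORD /d 0xFFFFFFFF /f

:: Clear the cached identify data so the transfer mode is renegotiated at boot
reg delete "HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E96A-E325-11CE-BFC1-08002BE10318}\0001" /v MasterIdDataChecksum /f
reg delete "HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E96A-E325-11CE-BFC1-08002BE10318}\0001" /v SlaveIdDataChecksum /f

:: Repeat for the secondary channel (\0002), then restart the computer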

 

Flash Utility OEM String Syntax:

The information below has been placed here for reference purposes and is not directly related to the Fix.

FP /N=OEMName[,HotlineNo] BIOSFileName
or
FP /O=OEMName[,HotlineNo] BIOSFileName

Note :

  1. The maximum length of OEMName is 16.
  2. The OEMName is case sensitive.
  3. The system manufacturer name read from the DMI BIOS interface is OEMName padded with blanks (ASCII 20h), and the length of OEMName plus blanks is 16.
  4. If you need a space character in OEMName or HotlineNo, replace it with the special character “^”.
  5. After flashing the BIOS, turn the power off and back on; you can then see the OEMName by using the DEBUG command as follows:
    DEBUG
    -D F000:1C00
  6. Some valid examples:
    a. The hotline number for the “Test Computer” company is
    123 4567 890. Then the usage for FP is as follows:

    FP /N=Test^Computer,123^4567^890 BIOS.BIN

    b. The hotline number for the “MyComputer” company is
    123-4567-890. Then the usage for FP is as follows:

    FP /N=MyComputer,123-4567-890 BIOS.BIN

    c. To disable OEMName and HotlineNo (FP v1.41 or later):

    FP /N=KAPOK,KAPOK BIOS.BIN

  7. The /O option will disable the OEMName string display during BIOS POST.

Using the NetGear PS101 Mini Print Server without the NetGear software utilities

System Requirements:

  • NetGear PS101
  • Windows 2000, XP, 2003

The Problem:

NetGear’s hardware can be small and functional; however, as with most hardware companies (in my experience), they cannot write software to save their life – and in a lot of cases where the OS has the capabilities built in, why on earth do these companies feel the need to duplicate functionality, making their hardware difficult to bring forward to the next version of Windows once their application ceases working? Let us face it, it just extends their support burden and forces consumers to upgrade to a new product… oh…

The Fix:

When all is said and done, irrespective of what NetGear say, there are print server standards, and going off to write your own protocol would just be silly. All their PS101 interface application does is provide an incredibly unintuitive, rather messy system for installing printer drivers against the print server.

You can configure the device just as easily manually, and chances are you’ll be able to follow the principles of this guide under Linux, Unix, Windows Vista, Windows NT 7, NT 8, NT 9… well, you get the idea.

What you need to know:

Unlike the NetGear application, Windows won’t go and probe your network for the PS101; you have to find it and configure it yourself. To do that you need a couple of pieces of information.

  • The IP address of your PS101 – ask your router, network admin, etc.
  • The Device Name of your PS101 – this is on a sticker on the base plate of the PS101, beneath the Serial Number and above the MAC address, or, once you know the IP address, it is available from the web configuration in your web browser at http://ip.add.re.ss/.
    Note: If you have manually changed the device name, you will NEED to get it from the web configuration program.

Before you head off to begin installing your print server, I recommend that you ensure your device is running the latest firmware version; this can be obtained from the NetGear support site.

Once you have those two pieces of information and the correct firmware, you are ready to install the PS101. The following steps are written around Windows XP; the process and procedures are similar, if not the same, under Windows 2000 and 2003 (a command-line equivalent is sketched after the steps). Other OSes and Windows versions may vary. Please be aware that you cannot do this under Windows 9x without third-party utilities.

  1. Open Printers and Faxes
  2. Double click Add Printer
  3. Select Local printer and deselect the Automatically detect and install my Plug and Play printer option. Click Next
  4. Highlight Create a new port and from the drop-down box select Standard TCP/IP Port
  5. Click Next to begin the Port wizard
  6. In the Printer Name or IP Address field type the full IP address of your PS101
    e.g. 192.168.0.200
  7. In the Port Name box type the device name of your PS101 suffixed by _P1
    For example, if your device name is PS380460, your Port Name would be PS380460_P1
  8. Click Next, highlight the Custom radio button and click Settings…
  9. The protocol should be set to RAW and the port to 9100. SNMP should remain disabled. Click OK and finish the wizard
  10. After a slight processing pause, Windows will display the Printer driver selection screen. From this point on, simply install your printer as normal.
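If you have more than one machine to point at the PS101, the same port and printer can also be created from the command line using the printing scripts that ship with Windows XP/2003. This is a rough sketch only: the device name (PS380460), IP address and driver name below are placeholders carried over from the example above, so substitute your own, and check the exact switches with cscript prnport.vbs -? on your system.

:: Create the RAW port on 9100 with SNMP disabled (mirrors steps 6-9 above)
cscript %windir%\system32\prnport.vbs -a -r PS380460_P1 -h 192.168.0.200 -o raw -n 9100 -md

:: Bind a printer to the new port using a driver that is already installed
cscript %windir%\system32\prnmngr.vbs -a -p "Printer on PS101" -m "HP LaserJet 4" -r PS380460_P1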