Scanning and repairing drive 9% complete – the curse of chkdsk

This article looks at a computer that gets stuck at boot with the message “Scanning and repairing drive 9% complete”, where chkdsk hangs at 9% and never progresses.

The hypervisor was 12 months overdue for a BIOS update. Updating the UEFI should be simple enough; however, SuperMicro have a nasty habit of clearing the CMOS during BIOS updates. Why most other OEMs are able to carry settings over and SuperMicro insists on not doing so is one of only a few gripes that I have ever had with the firm. It is a persistent one, though, going back to 1998.

The Fault

After the successful update, I reset the BIOS to its previous values as best I could recall. Unfortunately, I also enabled the firmware watchdog timer.

SuperMicro’s firmware-level watchdog timer does not operate as you might expect. It requires a daemon or service to be present within the running operating system that polls the watchdog interrupt periodically. If the interrupt isn’t polled, the firmware forces a soft reboot. SuperMicro do not provide a driver to do this for Windows, although their IPMI implementation can do so.
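To make the behaviour concrete, here is a toy model of a watchdog countdown. It is only a sketch: the 300-second timeout is taken from the 5-minute resets described in this article, and the `WatchdogTimer` class, `first_reset_time` helper and 60-second keep-alive interval are all hypothetical names and values, not anything from SuperMicro’s firmware.

```python
from dataclasses import dataclass


@dataclass
class WatchdogTimer:
    """Toy model of a firmware watchdog (not SuperMicro's actual implementation)."""

    timeout_s: float          # assumed countdown length, e.g. 300 s
    last_kick_s: float = 0.0  # time of the most recent keep-alive

    def kick(self, now_s: float) -> None:
        """Called by an OS-side daemon or service to restart the countdown."""
        self.last_kick_s = now_s

    def expired(self, now_s: float) -> bool:
        """True once the countdown has run out and the firmware would reset."""
        return now_s - self.last_kick_s >= self.timeout_s


def first_reset_time(timeout_s, kick_interval_s, horizon_s):
    """Return the second at which the watchdog fires, or None if it never does.

    kick_interval_s=None models an OS with no watchdog service installed.
    """
    wd = WatchdogTimer(timeout_s)
    for now in range(1, int(horizon_s) + 1):
        if kick_interval_s is not None and now % kick_interval_s == 0:
            wd.kick(now)
        if wd.expired(now):
            return now
    return None


# No service polling the watchdog: the reset arrives when the timeout elapses.
print(first_reset_time(300, None, 3600))
# A hypothetical service polling every 60 s: the countdown never runs out.
print(first_reset_time(300, 60, 3600))
```

With nothing in Windows restarting the countdown, the model resets at exactly the timeout on every boot, which matches the regular, stopwatch-friendly resets described next.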

Five minutes after POST, the hypervisor performed an ungraceful, uninitiated reset. After the first occurrence I assumed it was completing Windows Update. After the second, I started looking for a problem, and after the third (and a carefully placed stopwatch) I suspected that I must have turned on the UEFI watchdog.

I was correct and, after disabling it, the issue was resolved.

This particular hypervisor has SSD block storage for VMs internally and large block storage for backup via an external USB 3.1 enclosure – a lot of it. Without giving it any thought, I told the system to

chkdsk <mountPoint> /F

Note that this does not include the /R switch to perform a 5-step surface scan. I told chkdsk not to dismount the volume, but to bundle all of the scans together during the reboot required to scan C:. Doing it this way meant that I could walk away from the system. In theory, when chkdsk finished, the server would rejoin the Hyper-V cluster on its own and become available to receive workloads.

… and restarted.

 

Scanning and repairing drive 9% complete

chkdsk skipped the SSD storage as it is all formatted as ReFS. ReFS volumes are not checked by chkdsk; the file system verifies and repairs its own metadata online to preserve data integrity. Unfortunately, the external backup enclosure volume was NTFS. It would be scanned, and it was also quite full.

The system rebooted and sat at the intermediate chkdsk stage of the NT boot process. It zipped through the SSD NTFS boot volume in a few seconds before hitting the external enclosure. Within around 5 minutes it had arrived at the magic “9% complete” threshold.

1 hour, 2 hours, 4 hours… 8 hours. That turned into 24 hours, and the message was still the same.

Scanning and repairing drive (F:): 9% complete.

Crashing the chkdsk

The insanity of waiting over 24 hours had to come to an end, and I used IPMI to forcibly shut down the server.

After a minute or two, I powered it back on, only to be met with a black screen of death from Windows after the POST.

The message reported that c:\pagefile.sys was corrupt and unreadable, and offered a choice: perform a system recovery or press Enter to load the boot menu. On pressing Enter, the single option to boot Windows Server 2019 was present and, after a few moments, Windows deleted the corrupt pagefile.sys itself, recreated it and booted, much to my relief.

I then ran

chkdsk c: /f

and rebooted. The scan completed within a few seconds and marked the volume as clean, with no reported anomalies.

The Windows System Event Log contained no errors (in fact, as you might expect, no data) for the 24 hour period that the server had been ‘down’. There were no ‘after the event’ errors added to the System log or any of the Hardware or Disk logs either. For all intents and purposes, the system reported as fine.

 

Trying chkdsk for a second time

I decided to brave running chkdsk on the external enclosure again, initially in read-only mode

chkdsk F:

Note the absence of the /F switch here.

It zipped through the process in a few seconds stating

Windows has scanned the file system and found no problems.
No further action is required.
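If you would rather script this read-only check from a live Windows session than read the console output, the sketch below shows one way to do it. It relies on the exit codes that Microsoft documents for chkdsk (0 to 3); the function names and the default F: volume are my own illustrative choices, and the check itself must be run from an elevated prompt on Windows.

```python
import subprocess

# Exit codes as documented by Microsoft for chkdsk.
CHKDSK_EXIT_CODES = {
    0: "No errors were found.",
    1: "Errors were found and fixed.",
    2: "Disk cleanup was performed, or was skipped because /f was not specified.",
    3: "The disk could not be checked, or errors were not fixed "
       "(for example because /f was not specified).",
}


def describe_exit_code(code: int) -> str:
    """Translate a chkdsk exit code into a human-readable description."""
    return CHKDSK_EXIT_CODES.get(code, f"Unrecognised exit code {code}")


def read_only_check(volume: str = "F:"):
    """Run chkdsk without /F (read-only) and return (exit_code, description).

    Windows only, and needs an elevated prompt. The volume is not modified.
    """
    result = subprocess.run(["chkdsk", volume], capture_output=True, text=True)
    return result.returncode, describe_exit_code(result.returncode)
```

Calling `read_only_check("F:")` on a healthy volume should come back with exit code 0, the scripted equivalent of the message above.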

Next I ran a full 3-phase scan

chkdsk F: /F

Again, it passed the scan in a few seconds without reporting any errors. So much for the last 24 hours!

 

Analysis

The corruption in the page file indicates that Windows was doing something. The disk array was certainly very active: the activity LEDs, the noise from the drives and the data from the server’s power monitor all confirmed that “something” was happening. Forcibly shutting down the system killed the page file during a write. Had it been a 5-step chkdsk F: /f /r scan, I could have understood the length of time that it was taking.

With chkdsk /f /r, assuming a drive with 512-byte sectors, the system has to test 1,953,125,000 sectors for each terabyte of disk space. Depending on the drive speed, CPU speed and RAM involved, it isn’t uncommon to hear of systems taking 5 hours per terabyte to scan. This scan was not a 5-step scan, just a 3-step one, and a live Windows environment could complete the same scan in a few seconds.
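The arithmetic behind that sector count is easy to check. The sketch below assumes a decimal terabyte (10^12 bytes, as drive vendors count it) and takes the anecdotal 5 hours per terabyte at face value. The 8 TB capacity in the example is purely hypothetical, as the size of the enclosure isn’t stated here.

```python
SECTOR_BYTES = 512         # sector size assumed in the text
TERABYTE_BYTES = 10 ** 12  # decimal terabyte, as drive vendors count it


def sectors_per_terabyte(sector_bytes: int = SECTOR_BYTES) -> int:
    """Number of sectors a surface scan must test per terabyte."""
    return TERABYTE_BYTES // sector_bytes


def estimated_scan_hours(capacity_tb: float, hours_per_tb: float = 5.0) -> float:
    """Rough /f /r duration using the anecdotal hours-per-terabyte figure."""
    return capacity_tb * hours_per_tb


print(f"{sectors_per_terabyte():,} sectors per TB")          # 1,953,125,000
print(f"{estimated_scan_hours(8):.0f} hours for a hypothetical 8 TB volume")  # 40
```

Even on that pessimistic estimate a full surface scan would show steady progress, which makes a 3-step scan sitting at 9% for 24 hours all the harder to explain.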

Resources were not an issue in this system. Being a hypervisor, it had 128GB of RAM and processors manufactured in 2018.

My suspicion is that the problem stems from a bad interaction between the boot-level USB driver and the USB enclosure. My assumption is that Windows fell into either a race condition or a deadlocked loop. During this fault, chkdsk was genuinely scanning the disk and diagnostic data was being held in virtual memory (i.e. in the page file), but it was never able to exit successfully.

The lesson that I will take away from this experience is to avoid using a boot-cycle chkdsk to scan a USB disk enclosure unless there is no alternative.