Sniffing the Parent Partition's Network Traffic in a Hyper-V Virtual Machine

This article discusses a situation whereby you want to monitor/mirror/sniff network port traffic on a Hyper-V Parent Partition inside one of its own child VMs.

Why would you need to do this?

Under a traditional architecture you have the flexibility to tell your switch to mirror all traffic into or out of Port 6 onto Port 21. You then connect a laptop to Port 21 and promiscuously monitor the traffic coming into that port. Under a modern Converged/Software Defined Network architecture, this will not work.

In a modern Converged Fabric design, physical NICs are teamed. The Parent Partition on the hypervisor no longer uses the physical NICs directly, but logically uses its own synthetic NICs for data transfers.

  1. Link Aggregation/LACP/EtherChannel will split the traffic at the switch
  2. Teaming/LBFO will split the traffic at the hypervisor
  3. Data security will raise a red flag, as you will be capturing far more unrelated traffic than you need
  4. If you combine them, you will overload the monitoring port with aggregated traffic, causing performance issues and packet loss
  5. You may impact the performance of tenant VM’s and mission critical services

Fortunately, the Parent Partition's own virtual NICs are identical to the vNICs in any Hyper-V virtual machine. Consequently, you can use the same Hyper-V functionality on the Parent Partition as you would on any VM.
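As a quick illustration of this – assuming the Hyper-V PowerShell module is installed, and with 'SomeVm' as a placeholder VM name – the same cmdlet enumerates both:

# List the host (Parent Partition) vNICs; -ManagementOS targets the parent rather than a VM
Get-VMNetworkAdapter -ManagementOS

# The equivalent call for any guest VM
Get-VMNetworkAdapter -VMName 'SomeVm'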


Requirements

In order to sniff traffic on the Parent Partition you must ensure the following:

  1. The Parent Partition and the VM must be connected to the same Virtual Switch
  2. The “Microsoft NDIS Capture” extension must be enabled on the Virtual Switch (this is enabled by default)
  3. The monitoring VM should have two vNICs. The vNIC used to monitor traffic should be configured onto the same VLAN as the vNIC on the Parent Partition, and it should have all of its service and protocol bindings disabled to ensure that only port-mirrored traffic appears in the Wireshark logs (a PowerShell sketch follows this list)
  4. Wireshark, Microsoft Network Monitor or another promiscuous network traffic monitor
  5. If you are in a corporate environment, ensure that you have approvals from your Information Security team. In some jurisdictions port sniffing can be considered an offence
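As a sketch of the binding changes in point 3 – run inside the monitoring VM, and assuming its capture adapter is named 'Ethernet 2' (substitute your own) – the bindings can be reviewed and then stripped:

# Review the current service and protocol bindings on the capture NIC
Get-NetAdapterBinding -Name 'Ethernet 2'

# Disable every binding; the NDIS capture driver sits below these, so Wireshark still sees the mirrored traffic
Disable-NetAdapterBinding -Name 'Ethernet 2' -AllBindings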


Enabling Port Sniffing

You cannot enable port sniffing on the Parent Partition using the Hyper-V Manager GUI. Open a PowerShell session on (or remotely to) the Parent Partition.
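If you are working from another machine, a remote session will do (the hostname below is a placeholder):

Enter-PSSession -ComputerName 'HyperVHost01'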

Execute Get-NetAdapter

Identify the name of the vNIC that you will sniff traffic to/from, e.g. vEthernet (Management)

Taking only the value inside the parentheses, "Management", enter the following command:

Get-VMNetworkAdapter -ManagementOS -Name 'Management' | Set-VMNetworkAdapter -PortMirroring Source

Now, substituting 'WireSharkVm' with the name of your own monitoring VM, execute Get-VMNetworkAdapter 'WireSharkVm'

Identify the MAC address of the vNIC that will receive the port mirror from the Hyper-V host, and enable it as the recipient for the mirror:

Get-VMNetworkAdapter 'WireSharkVm' | ?{$_.MacAddress -eq '001512AB34CD'} | Set-VMNetworkAdapter -PortMirroring Destination

If the Parent Partition and VM vNICs are in the same VLAN, you should now be able to sniff traffic inbound to and outbound from the Parent Partition.
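If you want to confirm the roles before starting a capture, the PortMirroringMode property should read 'Source' on the host vNIC and 'Destination' on the monitoring vNIC (the names and MAC address below are the examples used in this article):

# Expect PortMirroringMode = Source
Get-VMNetworkAdapter -ManagementOS -Name 'Management' | Select-Object Name, PortMirroringMode

# Expect PortMirroringMode = Destination on the 001512AB34CD adapter
Get-VMNetworkAdapter 'WireSharkVm' | Select-Object Name, MacAddress, PortMirroringMode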


Disabling Port Sniffing

When using port mirroring, remember that it consumes CPU time and network resources on the hypervisor. To disable the port mirror, repeat the above commands, substituting 'None' as the value for the -PortMirroring parameter, e.g.

Get-VMNetworkAdapter -ManagementOS -Name 'Management' | Set-VMNetworkAdapter -PortMirroring None
Get-VMNetworkAdapter 'WireSharkVm' | ?{$_.MacAddress -eq '001512AB34CD'} | Set-VMNetworkAdapter -PortMirroring None
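As a final sanity check – the -All switch on Get-VMNetworkAdapter returns host and guest adapters alike – nothing should remain in a mirroring state:

# Lists any adapter still configured as a mirror Source or Destination
Get-VMNetworkAdapter -All | Where-Object { $_.PortMirroringMode -ne 'None' }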

Scanning and repairing drive 9% complete – the curse of chkdsk

This article discusses an issue where a computer gets stuck at boot on the message "Scanning and repairing drive 9% complete", with chkdsk hanging at 9%.

The hypervisor was 12 months overdue for a BIOS update. Updating the UEFI should be simple enough; however, SuperMicro have a nasty habit of clearing the CMOS during BIOS updates. Why most other OEMs are able to transfer settings while SuperMicro insists on not doing so is one of only a few gripes that I have ever had with the firm – yet it is a persistent one, going back to 1998.

The Fault

After the successful update, I reset the BIOS to the previous values as best I could recall them. Unfortunately, I also enabled the firmware watchdog timer.

SuperMicro's firmware-level watchdog timer does not operate as you might expect. It requires a daemon or service within the running operating system that polls the watchdog interrupt periodically. If the interrupt isn't polled, the firmware forces a soft reboot. SuperMicro do not provide a driver to do this for Windows, although their IPMI implementation can do so.
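For reference – assuming ipmitool and network access to the BMC, with the address and credentials below as placeholders – the IPMI watchdog can be inspected and stopped out-of-band:

ipmitool -I lanplus -H <bmcAddress> -U <user> -P <password> mc watchdog get
ipmitool -I lanplus -H <bmcAddress> -U <user> -P <password> mc watchdog off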

Five minutes after POST, the hypervisor performed an ungraceful, uninitiated reset. Following the first occurrence I assumed it was completing Windows Update. After the second, I was looking for a problem, and after the third (and a carefully placed stopwatch) I suspected that I must have turned on the UEFI watchdog.

I was correct and, after disabling it, the issue was resolved.

This particular hypervisor has SSD block storage for VMs internally and large block storage for backup via an external USB 3.1 enclosure – a lot of it. Without giving it any thought, I told the system to

chkdsk <mountPoint> /F

Note that this does not include the /R switch, which performs a five-step surface scan. I told chkdsk not to dismount the volume immediately, but to bundle all of the scans together into the reboot that was already required to scan C:. Doing it this way meant that I could walk away from the system; in theory, when chkdsk finished, the server would rejoin the Hyper-V cluster on its own and become available to receive workloads.

… and restarted.
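As an aside, you can preview what the boot-time pass will do before restarting: chkntfs reports whether a volume check is scheduled for the next boot, and fsutil reports whether the volume's dirty bit is set.

chkntfs C:
fsutil dirty query C: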


Scanning and repairing drive 9% complete

chkdsk skipped the SSD storage, as it is all formatted as ReFS. Under ReFS a disk check is not required, as the file system performs its integrity and repair activities in the background to preserve data integrity. Unfortunately, the external backup enclosure volume was NTFS. It would be scanned – and it was also quite full.
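A quick way to see which volumes a boot-time chkdsk will actually touch is to list the file system type of each volume:

# NTFS volumes are candidates for chkdsk; ReFS volumes are skipped
Get-Volume | Select-Object DriveLetter, FileSystemLabel, FileSystemType, Size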

The system rebooted and sat at the intermediate chkdsk stage of the NT boot process. It zipped through the SSD NTFS boot volume in a few seconds before hitting the external enclosure. Within around 5 minutes it had arrived at the magic "9% complete" threshold.

1 hour, 2 hours, 4 hours… 8 hours. That turned into 24 hours, and the message was still the same.

Scanning and repairing drive (F:): 9% complete.

Crashing the chkdsk

The insanity of waiting over 24 hours had to come to an end, and I used IPMI to forcefully shut down the server.
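For completeness, this kind of forced power control is the out-of-band equivalent of the following (same placeholders as before):

ipmitool -I lanplus -H <bmcAddress> -U <user> -P <password> chassis power off
ipmitool -I lanplus -H <bmcAddress> -U <user> -P <password> chassis power on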

After a minute or two, I powered it back on, only to be met with a black screen of death from Windows after the POST.

The message stated that C:\pagefile.sys was corrupt and unreadable: perform a system recovery, or press Enter to load the boot menu. On pressing Enter, the single option to boot Windows Server 2019 was present and, after a few moments, Windows self-deleted the corrupt pagefile.sys, recreated it and booted – to much relief.

I then ran

chkdsk c: /f

and rebooted. The boot-time scan completed within a few seconds and marked the volume as clean, with no reported anomalies.

The Windows System Event Log contained no errors (in fact, as you might expect, no data at all) for the 24-hour period that the server had been 'down'. There were no 'after the event' errors added to the System log, or to any of the Hardware or Disk logs either. For all intents and purposes, the system reported as fine.
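The check itself is a one-liner; adjust the time window to match your own outage (Level 1 and 2 are Critical and Error respectively):

# Pull any Critical/Error System events from the last 48 hours
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Level = 1,2; StartTime = (Get-Date).AddHours(-48) }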


Trying chkdsk for a second time

I decided to brave running chkdsk on the external enclosure again, initially in read-only mode:

chkdsk F:

Note the absence of the /F switch here.

It zipped through the process in a few seconds, stating:

Windows has scanned the file system and found no problems.
No further action is required.

Next, I ran a full 3-phase scan:

chkdsk F: /F

Again, it passed the scan in a few seconds without reporting any errors. So much for the last 24 hours!


Analysis

The corruption in the page file indicates that Windows was doing something. The disk array was certainly very active: disk activity was visible via LED, audible acoustically, and confirmed by data from the power monitor on the server. "Something" was happening, and forcibly shutting down the system killed the page file during a write. Had it been a 5-step chkdsk F: /f /r scan, I could understand the length of time that it was taking.

With chkdsk /f /r – assuming a drive with 512-byte sectors – the system has to test 1,953,125,000 sectors for each terabyte of disk space. Depending on the drive speed, CPU speed and RAM involved, it isn't uncommon to hear of systems taking 5 hours per terabyte to scan. This scan was not a 5-step scan, just a 3-step one, and a live Windows environment could scan the same disk correctly in a few seconds.
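The sector arithmetic is easy to sanity-check in PowerShell (note that PowerShell's own 1TB constant is binary, hence the explicit power of ten for a decimal terabyte):

# 10^12 bytes per decimal terabyte / 512 bytes per sector = 1,953,125,000
[math]::Pow(10,12) / 512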

Resources were not an issue in this system: being a hypervisor, it had 128GB of RAM and was running processors manufactured in 2018.

My suspicion is that the problem was caused by a bad interaction between the boot-level USB driver and the USB enclosure. My assumption is that Windows fell into either a race condition or a deadlocked loop: during the fault, chkdsk was genuinely scanning the disk, and diagnostic data was being tested in virtual memory (i.e. in the page file), but the scan was never able to successfully exit.

The lesson that I will take away from this experience is to avoid using a boot-cycle chkdsk to perform a scan on a USB disk enclosure.