Redesigning the Hardware for the Virtual TV Streaming Server

This article discusses a hardware design change to the Virtual TV Streaming Server discussed in Creating a Virtual TV Streaming Server.

If you are not familiar with the previous setup. The design revolved around an array of TV tuners connected to a 7-port USB 3.0 hub. In turn, this connected to a USB 3.0 controller which was passed through Discrete Device Assignment (DDA) through to a Windows 10 Virtual Machine. This run DVBLogic TV Mosaic, the IP TV streaming software.

 

Virtual TV Streaming Server Meltdown

The solution has run extremely well. There have been no crashes from TV Mosaic, the VM or the Hypervisor. Until last week.

The system missed last Saturdays recording schedules and on Sunday afternoon, wouldn’t initiate playback. On inspection of the VM, one of the Tuners was showing as “unknown” on the TV Mosaic console. The others were all fine. Once this phantom tuner was removed from the console, everything started working again.

Initially thinking that it was related to a coincidental BIOS update on the server, it turned out that the tuner was simply dead. I RMA’d it with DVBLogic – who didn’t challenge my diagnostic or offer any resistance – but I did have to ship it Internationally at my own expense.

A week later, I came to use the system again and, once more, it was dead. A trip to the attic later and the was dead. A multimeter confirmed that the power supply had died, and I begun an RMA process with StarTech this morning.

 

Analysis

If the power supply on the StarTech was defective, it could potentially have caused the fault with the TV Butler tuner. Although this is speculative and unprovable. My main suspicion is that the problems were caused by heat. The attic roof space is uninsulated, and the UK is in the summer period. With temperature in the attic space certainly to have ranged into the 40c’s.

Unlike with the physical TV server that this setup replaced – which had fans. This setup doesn’t. PCIe TV Tuners are intrinsically designed to withstand higher thermal variances than USB ones. The StarTech and TV Butler products are quite simply basic consumer devices. It is possible that this factor led to both of their demises.

There was a power outage mid-week last week, and the StarTech itself was not sitting on the server UPS – but it was on a surge protector. It is my belief that this did not contribute to the issue.

 

Hardware Redesign

The brief for the redesign is simple

  1. Remove essential electrical components from the attic
  2. Minimise space use
  3. Minimise electrical consumption (as everything will now be powered through the UPS)
  4. Do not clutter up the backplane of the server with dongles

 

Power

To accommodate #1, #2 and #3 the USB Hub is going to be eliminated from the design. The TV Tuners will now connect directly to the DDA USB controller. In order to do this, the dual port controller will need to be replaced.

After deliberating on whether to get an externally powered or bus powered 4-port controller, I chose a , bus powered card. A risk, given my previous experience here. The DG-PCIE-04B reviewed better than a similarly priced externally powered one. The decider was that it uses a NEC chipset and not a RealTek/SiS (i.e. cheap) chip. Finally, the fact that each of the ports had its own voltage management and fuse circuit is a valuable quality safeguard.

 

Patch Panel TV

To satisfy design brief #4, the USB TV Tuners will need to be mounted away from the server. To achieve this, I am going to mount the Tuners in the patch panel.

Using a set of keystone jacks. A USB lead will run between the USB controller and the Patch panel; simply mounting to the TV Tuners held in the patch panel.
TNP USB 3.0 Keystone Jack Image

The patch panel happens to be near the ceiling, directly above the TV aerial distributor for the house. Using 4m coaxial cable, the aerial feed can route through the existing ceiling cable run and clip neatly into the TV Tuners.

The Amazon order consisted of

  • 1x
  • 1x Pack of 5
  • 4x Rankie USB 3.0 Type A Male to Male Data Cable, 3m (Server – Patch Panel)
  • 3x Ex-Pro White Coax F Plug Type – to – Male M Coax plug Connection Cable Lead – 4m (Aerial distributor – TV Tuners)

 

Installation

The installation was extremely simple.

  1. Replace the existing 2 port USB controller with the 4 port one
  2. Clip the USB 3.0 keystones into the patch panel
  3. Run cables between the USB controller and the front profile (base) of the USB keystones
  4. Passing the USB controller through to Hyper-V
    1. Shutdown the Virtual Streaming TV Server VM
    2. Get the Device Instance Path from the Details tab > Device instance path section in Device Manager e.g.
      PCI\VEN_1912&DEV_0014&SUBSYS_00000000&REV_03\4&1B96500D&0&0010
    3. Use PowerShell to dismount the USB Controller from the Hypervisor and attach it to the VM
$vmName = 'TvServer'
$pnpdevs = Get-PnpDevice -PresentOnly | Where-Object {$_.InstanceId -eq 'PCI\VEN_1912&DEV_0014&SUBSYS_00000000&REV_03\4&1B96500D&0&0010'}
$instanceId = $pnpdev.InstanceId
$locationPath = ($pnpdevs[0] | get-pnpdeviceproperty DEVPKEY_Device_LocationPaths).data[0]
Write-Host "    Instance ID: $instanceId"
Write-Host "    Location Path: $locationpath"

# Disable the Device on the Host Hypervisor
Disable-PnpDevice -InstanceId $instanceId -Confirm:$false

# Wait for the dismount to complete
Start-Sleep -s 15

# Dismount the Device from the Host Hypervisor
Dismount-VmHostAssignableDevice -locationpath $locationPath -Force

# Attach the PCIe Device to the Virtual Machine
Add-VMAssignableDevice -LocationPath $locationpath -VMName $vmName

# Note: You may need to reboot the Hypervisor hosts at this point.
# If the VM's device manager informs you that it can see the controller, but is  unable to initialise
# the controllers USB Root Hub. A reboot should fix it.
  1. Clip the DVBLogic TV Butler TV Tuners into the patch panel USB keystone jacks using the inside (top) port on the keystones
  2. Start the TV Server VM
Photograph of USB Tuners mounted in patch panel
The patch panel now has three USB ports – the left-most TV Butler is missing as the RMA replacement has not yet arrived.

Photograph of USB Tuners mounted in patch panel Photograph of USB Tuners mounted in patch panel

The Virtualised Windows 10 Streaming TV Server came back online and there hasn’t been any instability caused by the bus-powered USB controller. The TV Butler’s are warm to the touch, have plenty of air-flow and the ambient temperature can be monitored via existing sensors in the room.

With any luck, I will not need to revisit this project for quite some time!

Scanning and repairing drive 9% complete – the curse of chkdsk

This article discusses an issue of a computer getting stuck at boot with the message “Scanning and repairing drive 9% complete” with chkdsk hanging at 9%.

The hypervisor was 12 months over-due for a BIOS update. Updating the UEFI should be simple enough, however SuperMicro have a nasty habit of clearing the CMOS during BIOS updates. Why most other OEM’s are able to transfer settings and SuperMicro insists on not is one of only a few gripes that I have ever had with the firm. Yet it is a persistent one that I’ve had with them going back to 1998.

The Fault

After the successful update, I reset the BIOS to the previous values as best I could recall. Unfortunately I also enabled the firmware watchdog timer.

SuperMicro’s firmware level watchdog timer does not operate as you might expect. It requires a daemon or service to be present within the running operating system that polls the watchdog interrupt periodically. If the interrupt isn’t polled, the firmware forces a soft reboot. Supermicro do not provide a driver to do this for Windows, although their IPMI implementation can do so.

After 5 minutes from the POST the hypervisor performed an ungraceful, uninitaited reset. Following the first occurrence I assumed it was completing Windows Update. Subsequent to the second, I was looking for a problem and after the third (and a carefully placed stopwatch) I had a suspicion that I must have turned on the UEFI watchdog.

I was correct and, after disabling it, the issue was resolved.

This particular hypervisor has SSD block storage for VMs internally and large block storage for backup via an external USB 3.1 enclosure – a lot of it. Without giving it any thought, I told the system to

chkdsk <mountPoint> /F

Note that this does not include the /R switch to perform a 5 step surface scan. I told chkdsk not to dismount the volume, but to bundle all of the scans together during the required reboot to scan the C:. Doing it this way meant that I could walk away from the system. In theory this would mean that when chkdsk finished, it would rejoin the Hyper-V cluster on its own and become available to receive workloads.

… and restarted.

 

Scanning and repairing drive 9% complete

chkdsk skipped the SSD storage as it is all configured as ReFS. Under ReFS, disk checking is not required as it performs journaling activities in the background to preserve data integrity. Unfortunately, the external backup enclosure volume was NTFS. It would be scanned – and it was also quite full.

The system rebooted, and sitting at the intermedia chkdsk stage of the NT boot process. It zipped through the SSD NTFS boot volume in a few seconds, before hitting the external enclosure. Within around 5 minutes it had arrived at the magic “9% complete” threshold.

1 hour, 2 hours, 4 hours… 8 hours. That turned into 24 hours later and the message was still the same.

Windows Boot Scanning and repairing drive (F:): 9% complete

Scanning and repairing drive (F:): 9% complete.

Crashing the chkdsk

The insanity of waiting over 24 hours had to come to an end and I used IPMI to forcefully shutdown the server.

After a minute or two, we powered back on. To be met with a black screen of death from Windows after the POST.

The c:\pagefile.sys was corrupt and unreadable. Perform a system recovery or press enter to load the boot menu. On pressing enter, the single option to boot Windows Server 2019 was present, and, after a few moments. Windows self-deleted the corrupt pagefile.sys, recreated it and booted -to much relief.

I then ran

chkdsk c: /f

and rebooted, which completed within a few seconds and marked the volume as clean, with no reported anomalies.

The Windows System Event Log contained no errors (in fact as you might expect, no data) for the 24 hour period that the server had been ‘down’. The were no ‘after the event’ errors added to the System log or any of the Hardware or Disk logs either. for all intents and purposes, the system reported as fine.

 

Trying chkdsk for a second time

I decided to brave running chkdsk on the external enclosure again. Initially in read-only mode

chkdsk F:

Note the absence of the /F switch here.

It zipped through the process in a few seconds stating

Windows has scanned the file system and found no problems.
No further action is required.

Next I ran a full 3-phase scan

chkdsk F: /F

Again, it passed the scan in a few seconds without reporting any errors. So much for the last 24 hours!

 

Analysis

The corruption in the page file indicates that Windows was doing something. The disk array was certainly very active, with disk activity visible (via LED), acoustically and via data from the power monitor on the server all confirming that “something” was happening. Forcibly shutting down the system killed the page file during a write. Had been a 5-step chkdsk F: /f /r scan I could understand the length of time that it was taking.

With chkdsk /f /r – assuming a 512 byte hard drive – the system has to test 1,953,125,000 sectors for each terabyte of disk space. Depending on the drive speed, CPU speed and RAM involved it isn’t uncommon to hear of systems taking 5 hours per-terabyte to scan. This scan was not a 5-step scan, just a 3-step. A live Windows environment could scan the disk correctly in a few seconds.

Resources were not an issue in this system. Being a hypervisor, it had 128GB of RAM and was running with 2018 manufactured processors.

My suspicion is that the problem exists because of a bad interaction between the boot level USB driver and the USB enclosure. The assumption is that Windows fell into either a race condition or a deadlocked loop. During this fault, chkdsk was genuinely scanning the disk and diagnostic data was being tested in virtual memory (i.e. in the page file) but it was never able to successfully exit.

The lesson that I will take away from this experience is that unless it to avoid using a boot cycle chkdsk to perform a scan on a USB disk enclosure.

Performance impact of 512byte vs 4K sector sizes

When you are designing your storage subsystem. On modern hardware, you will often be asked to choose between formatting using 512 byte or 4K (4096 byte) sectors. This article discusses whether there is any statistically observable performance difference between the two in a 512 vs. 4K performance test.

NB: Do not get confused between the EXT4 INODE size and the LUN sector size. The INODE size places a mathematical cap on the number of files that a file system can store, and by consequence how large the volume can be. The sector size relates to how the file system interacts with the physical underlying hardware.

QNAP Sector Size selection
Sector Size selection on QNAP QTS 4.3.6

Method

  • A QNAP TS-1277XU-RP with 8x WD Red Pro 7200 RPM WD6003FFBX-68MU3N0 drives running firmware 83.00A83 were installed with 8 drives in bays 5 – 12
  • Storage shelf firmware was updated to QTS version 4.3.6.0923, providing the latest platform enhancements
  • A Storage Pool comprising all 8 disks in RAID 6 was configured, ensuring redundancy
  • A 4GB volume was added allowing QNAP app installation so that the systme could finish installing
  • The disk shelf was rebooted after it had completed its own setup tasks
  • RAID sync was allowed to fully complete over the next 12 hours
  • Two identical 4096 GB iSCSI targets were created with identical configurations apart from one having 512 byte and the other 4k sector sizes
  • SSD caching was disabled on the storage shelf
  • 2x10Gbps Ethernet, dedicated iSCSI connections were made available through two Dell PowerConnect SAN switches. Each NIC on its own VLAN. 9k jumbo frames were enabled accross the fabric
  • A Windows Server 2016 hypervisor was connected to the iSCSI target and mounted the storage volume. iSCSI MPIO was enabled in Round Robin mode. Representing a typical hypervisor configuration
  • The two storage LUNs were formatted with 64K NTFS partitions (recommended for dedicated VHDX volumes)
  • A Windows 10 VM was migrated onto each of the targets and the test performed using Anvil’s Storage Utilities 1.1.0.20140101. The VM had no live network connections. The Super Fetch and Windows Update services were disabled, preventing undesirable disk I/O. The VM was not rebooted between tests, had no other running tasks and had been idling for 6 hours prior to the test
  • No other tasks, load or data were present on the storage array

 

512 vs. 4K Performace Results

The results of the two tests are shown below.

Anvil Storage Utilities Screenshot with 512bytes results
Anvil Storage Utilities Screenshot with 512bytes results

"Anvil

IOPS
512 byte 4K 4K Diff 4K Diff % +/-
Read Seq 4MB 417.45 403.3 -14.15 -3.51
4K 3001.56 3164.56 163 5.15
4K QD4 6021.45 6006.7 -14.75 -0.25
4K QD16 24228.16 24062.61 -165.55 -0.69
32K 2742.39 2807.47 65.08 2.32
128K 2628.86 2620.8 -8.06 -0.31
Write Seq 4MB 233.2 230.79 -2.41 -1.04
4K 2090.79 2165.45 74.66 3.45
4K QD4 5976.18 5983.65 7.47 0.12
4K QD16 8254.84 7874.67 -380.17 -4.83

 

Analysis and Recommendations

The results show that there is little difference between the two. Repeating the tests multiple times showed that the figures for both the 512 byte and 4K LUNs are within the margin of error of each other. A bias towards 512 byte was consistently present, but was not statistically significant.

The drives in the test disk array are 512e drives. 512e is an industry transition technology between pure 512 byte and pure 4K drives. 512e drives use physical 4K sectors on the platter, but that the firmware uses 512 byte logic. A firmware emulation layer converts between the two. This creates a performance penalty during write operations due to the computation and delay of the re-mapping operation. Neither sector size will prevent this from occurring.

My recommendations are

  • If all of your drives are legacy 512 byte drives, only use 512
  • Should you intend to mount the LUN with an operating system that does not support 4K sectors. Only use 512
  • In situations where you have 512e drives, you can use either. Unless you intend to clone the LUN onto 4K drives in the future, stick with 512 for maximum compatibility
  • Never create an array that mixes 512 and 4K disks. Ensure that you create storage pools and volumes accordingly
  • Where all of your drives are 4K, only use 4K

 

WorkFolders Folder shows Sync Error even though its contents are fully synchronised

WorkFolders allows you to perform policy based file HTTPS synchronisation between corporate servers and BYOD devices or teleworker devices. This article discusses a workaround to a problem where an anonymous sync error appears on a directory despite all of its contents synchronising successfully.

Outline of the Problem

Assume the following directory structure

C:\Users\CompanyUser\WorkFolders\Documents\WorkFolders Test

The following files/folders are present within WorkFolders Test:

WorkFolders Test\Problem Folder
WorkFolders Test\Problem Folder\File in Problem Folder.docx
WorkFolders Test\This file is OK.txt

Windows Explorer will display a Green circle with a tick for file/folder object that is synchronised and a Red circle with a cross for a faulted file/folder. After synchronising, the sync results will display as follows:

WorkFolders Test\Problem Folder [CROSS]
WorkFolders Test\Problem Folder\File in Problem Folder.docx [TICK]
WorkFolders Test\This file is OK.txt [TICK]

No errors are displayed in the Control Panel WorkFolders applet. There are no related errors in the clients WorkFolders Management/Operational Event Viewer logs. No relevant errors are present in the file servers SyncShare Operational/Reporting logs.

WorkFolders Error Screenshot - outer folder
The parent folder shows that its sub-folder has a sync error
WorkFolders Error Screenshot - inner folder
The contents of the error’d folder are however correctly synchronised.

Analysis

Although WorkFolders is indicating that the issue is being caused by the “Problem Folder” directory. The issue is being caused by the “File in Problem Folder.docx”.

The following symptoms will be true:

  1. Renaming “Problem Folder” will not fix the issue
  2. Altering the filename of “File in Problem Folder” will not fix the issue
  3. Changing the “File in Problem Folder.docx” file extension (e.g. to .txt) will not fix the issue
  4. Opening and saving the “File in Problem Folder.docx” will not solve the issue
  5. Moving “File in Problem Folder.docx” out of “Problem Folder” will clear the sync error, but the error will immediately migrate to the new location
  6. Rebooting the client computer will not help
  7. Restarting the server will not help
  8. The file does not have any connected temp files or lock files associated with it in the client file system

 

There is nothing wrong with the file itself. It is not corrupt, pay-loaded with a virus or violating any policy. It is my (unproven) belief that the record for the file in the WorkFolders synchronisation database is corrupt. Performing any of the above steps will not alter the record in the WorkFolders client database, thus the problem cannot be ameliorated.

 

Fixing the problem

One you have identified the problem file(s). You can use one of the methods below to correct the error.

Save the file as a completely new file

  1. Open the file in its associated editor (e.g. Microsoft Word for docx files)
  2. File > Save as…
  3. Save the file in its original location, but with a different file name (do not overwrite the original)
  4. Delete the original file
  5. Allow WorkFolders to re-sync
  6. Rename the new file as required

This approach is easy for an end-user to perform, but can be very time consuming if you are troubleshooting a large number of such issues. It requires you to know which file is causing the problem in the first place.

 

Compression

  1. Compress the file using Windows Compressed folders (Right click > Send to… > Compressed (zipped) file)
  2. Delete the original file
  3. Wait for the folder to re-sync and clear the error
  4. Extract the original file from the zip back into the desired location

This method will create a new record in the WorkFolders synchronisation database and the error will not reappear. You can use the technique to fix an entire folder structure without having to first identify the problem file. It is also easy for an end-user to perform.

 

Move the file

  1. Move the file outside of the WorkFolders monitored file system. For example, move the file to C:\ or into the Recycle Bin
  2. Allow the original folder to re-sync and clear the error
  3. Return the original file to its original location and allow it to re-sync

Again, this method will create a new synchronisation record. In a managed environment this may be harder for an end-user to perform due to permissions. It is however easier for an administrator to perform as you can cut/paste the entire file structure out and then back into the WorkFolders sync root.

If you use this method, remember to move it to a location within the same drive letter. If you do, the move will preserve permissions, file dates and will not physically copy the underlying data to the new location (just update the MFT).