Performance impact of 512byte vs 4K sector sizes

When designing a storage subsystem on modern hardware, you will often be asked to choose between formatting with 512 byte or 4K (4096 byte) sectors. This article tests whether there is any statistically observable performance difference between the two.

NB: Do not get confused between the EXT4 INODE size and the LUN sector size. The INODE size places a mathematical cap on the number of files that a file system can store, and by consequence how large the volume can be. The sector size relates to how the file system interacts with the physical underlying hardware.

Sector Size selection on QNAP QTS 4.3.6

Method

  • A QNAP TS-1277XU-RP was fitted with 8x WD Red Pro 7200 RPM WD6003FFBX-68MU3N0 drives running firmware 83.00A83, installed in bays 5 to 12
  • Storage shelf firmware was updated to QTS version 4.3.6.0923, providing the latest platform enhancements
  • A Storage Pool comprising all 8 disks in RAID 6 was configured, ensuring redundancy
  • A 4GB volume was added to allow QNAP app installation so that the system could finish installing
  • The disk shelf was rebooted after it had completed its own setup tasks
  • RAID sync was allowed to fully complete over the next 12 hours
  • Two 4096 GB iSCSI targets were created with identical configurations, apart from one using 512 byte sectors and the other 4K sectors
  • SSD caching was disabled on the storage shelf
  • 2x 10Gbps dedicated iSCSI Ethernet connections were made available through two Dell PowerConnect SAN switches, with each NIC on its own VLAN. 9K jumbo frames were enabled across the fabric
  • A Windows Server 2016 hypervisor was connected to the iSCSI targets and mounted the storage volumes. iSCSI MPIO was enabled in Round Robin mode, representing a typical hypervisor configuration
  • The two storage LUNs were formatted as NTFS partitions with a 64K allocation unit size (recommended for dedicated VHDX volumes)
  • A Windows 10 VM was migrated onto each of the targets and the test performed using Anvil’s Storage Utilities 1.1.0.20140101. The VM had no live network connections. The Super Fetch and Windows Update services were disabled, preventing undesirable disk I/O. The VM was not rebooted between tests, had no other running tasks and had been idling for 6 hours prior to the test
  • No other tasks, load or data were present on the storage array

 

512 vs. 4K Performance Results

The results of the two tests are shown below.

Anvil Storage Utilities Screenshot with 512bytes results

IOPS

Test            | 512 byte | 4K       | 4K Diff | 4K Diff % +/-
Read Seq 4MB    | 417.45   | 403.3    | -14.15  | -3.51
Read 4K         | 3001.56  | 3164.56  | 163     | 5.15
Read 4K QD4     | 6021.45  | 6006.7   | -14.75  | -0.25
Read 4K QD16    | 24228.16 | 24062.61 | -165.55 | -0.69
Read 32K        | 2742.39  | 2807.47  | 65.08   | 2.32
Read 128K       | 2628.86  | 2620.8   | -8.06   | -0.31
Write Seq 4MB   | 233.2    | 230.79   | -2.41   | -1.04
Write 4K        | 2090.79  | 2165.45  | 74.66   | 3.45
Write 4K QD4    | 5976.18  | 5983.65  | 7.47    | 0.12
Write 4K QD16   | 8254.84  | 7874.67  | -380.17 | -4.83
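For anyone wanting to re-run the comparison with their own figures, the two difference columns can be reproduced with a few lines of Python. This is only a sketch: the function name is mine, and the choice of the 4K figure as the percentage base is an assumption inferred from the published numbers (it is the base that reproduces every row above), not something stated by the benchmark tool.

```python
# IOPS figures copied from the table above as (512 byte, 4K) pairs.
RESULTS = {
    "Read Seq 4MB": (417.45, 403.30),
    "Read 4K": (3001.56, 3164.56),
    "Read 4K QD4": (6021.45, 6006.70),
    "Read 4K QD16": (24228.16, 24062.61),
    "Read 32K": (2742.39, 2807.47),
    "Read 128K": (2628.86, 2620.80),
    "Write Seq 4MB": (233.20, 230.79),
    "Write 4K": (2090.79, 2165.45),
    "Write 4K QD4": (5976.18, 5983.65),
    "Write 4K QD16": (8254.84, 7874.67),
}


def diff_and_percent(iops_512, iops_4k):
    """Return (4K Diff, 4K Diff %) for one row.

    Assumption: the percentage is expressed relative to the 4K figure,
    because that is what matches the published table.
    """
    diff = round(iops_4k - iops_512, 2)
    percent = round(diff / iops_4k * 100, 2)
    return diff, percent


if __name__ == "__main__":
    for test, (iops_512, iops_4k) in RESULTS.items():
        diff, percent = diff_and_percent(iops_512, iops_4k)
        print(f"{test:<14} {diff:>9.2f} {percent:>7.2f}%")
```

Every row lands within roughly 5% either way, which is consistent with the margin-of-error reading given in the analysis.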

 

Analysis and Recommendations

The results show that there is little difference between the two. Repeating the tests multiple times showed that the figures for both the 512 byte and 4K LUNs are within the margin of error of each other. A bias towards 512 byte was consistently present, but was not statistically significant.

The drives in the test disk array are 512e drives. 512e is an industry transition technology between pure 512 byte and pure 4K drives. 512e drives use physical 4K sectors on the platter, but the firmware presents 512 byte logical sectors to the host. A firmware emulation layer converts between the two. This creates a performance penalty during write operations due to the computation and delay of the re-mapping operation. Neither LUN sector size will prevent this from occurring.
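To make the emulation cost a little more concrete, here is a simplified model of how a write expressed in 512 byte logical sectors maps onto 4K physical sectors. It is a sketch of the general 512e idea only: the function names are hypothetical, it was not measured in this test, and real firmware is considerably more sophisticated. A write that does not cover whole physical sectors forces the drive to read the existing 4K sector, merge in the new data and write it back.

```python
PHYSICAL_SECTOR = 4096  # bytes on the platter (512e and 4K native drives)
LOGICAL_SECTOR = 512    # bytes presented to the host by a 512e drive


def needs_read_modify_write(offset_bytes, length_bytes, physical=PHYSICAL_SECTOR):
    """True when a write does not start and end on physical sector boundaries."""
    return offset_bytes % physical != 0 or length_bytes % physical != 0


def physical_sectors_touched(offset_bytes, length_bytes, physical=PHYSICAL_SECTOR):
    """Number of physical sectors that a write overlaps."""
    first = offset_bytes // physical
    last = (offset_bytes + length_bytes - 1) // physical
    return last - first + 1


if __name__ == "__main__":
    # (offset, length) pairs expressed in bytes
    for offset, length in [(0, 4096), (0, 512), (512, 4096)]:
        print(
            f"offset={offset:<5} length={length:<5} "
            f"sectors={physical_sectors_touched(offset, length)} "
            f"rmw={needs_read_modify_write(offset, length)}"
        )
```

An aligned 4096 byte write touches one physical sector with no merge, while the same write shifted by one logical sector (512 bytes) straddles two physical sectors and needs a merge on both.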

My recommendations are

  • If all of your drives are legacy 512 byte drives, only use 512
  • Should you intend to mount the LUN with an operating system that does not support 4K sectors, only use 512
  • In situations where you have 512e drives, you can use either. Unless you intend to clone the LUN onto 4K drives in the future, stick with 512 for maximum compatibility
  • Never create an array that mixes 512 and 4K disks. Ensure that you create storage pools and volumes accordingly
  • Where all of your drives are 4K, only use 4K
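The recommendations above can be written down as a small decision helper. This is a sketch rather than a tool: the function name and the "512n" / "512e" / "4kn" labels are my own shorthand for legacy 512 byte, 512e and pure 4K drives, and rejecting any mixture of formats is a deliberately conservative reading of the "never mix" rule.

```python
def recommend_sector_size(drive_formats, os_supports_4k=True, plan_clone_to_4kn=False):
    """Return the suggested LUN sector size in bytes (512 or 4096).

    drive_formats: iterable of "512n", "512e" or "4kn" labels (hypothetical
    shorthand, one entry per drive in the pool).
    """
    formats = set(drive_formats)
    if not formats:
        raise ValueError("no drives given")
    unknown = formats - {"512n", "512e", "4kn"}
    if unknown:
        raise ValueError(f"unknown drive format(s): {sorted(unknown)}")
    if len(formats) > 1:
        # Conservative assumption: treat any mixture as a design error.
        raise ValueError("mixed drive formats: build separate pools instead")

    drive_format = formats.pop()
    if drive_format == "4kn":
        if not os_supports_4k:
            raise ValueError("4K-only drives, but the OS cannot mount 4K sectors")
        return 4096
    if drive_format == "512n" or not os_supports_4k:
        return 512
    # 512e: either works, so default to 512 for compatibility unless a
    # future clone onto 4K drives is planned.
    return 4096 if plan_clone_to_4kn else 512


if __name__ == "__main__":
    print(recommend_sector_size(["512e"] * 8))
    print(recommend_sector_size(["512e"] * 8, plan_clone_to_4kn=True))
```

For the 8x 512e array used in this test, the helper returns 512 unless a future clone onto 4K drives is planned.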

 

iSCSI MPIO Recommendations & Best Practice on Windows Server

System Requirements:

  • Windows Server 2008 Storage Server
  • Windows Server 2008 R2 Storage Server
  • Windows Server 2012, 2012 R2, 2016

The Problem:

I needed to outline some of the general thinking on how a practitioner should logically and physically understand MPIO. However, most of the discourse on the subject skips a fair number of the obvious questions that people starting out with the technology may be asking (or trying to answer). I therefore present some thinking on understanding MPIO optimisation and best practice for iSCSI.

The information presented in this document is intended for those who are new to the concept of iSCSI and MPIO and is not intended to be product specific.

More Info

Multi-Path Input/Output, or MPIO, is a server technology that usually sits on the storage side of load balancing, failover and aggregation technologies. If you are getting into SAS, iSCSI or enterprise RAID solutions, where it is most commonly encountered, then this may (or may not) help you to understand what MPIO is and why it (possibly) isn’t what you think it is!

The document is written from the perspective of an iSCSI user where it can be conceptually a little harder for new users to understand the best way to approach MPIO.

Logically understanding what MPIO is all about

So you have 2x1Gbps ports in a MPIO team, that means you’ve got a 2Gbps link right? Wrong. That isn’t what is going on with MPIO.

MPIO (and in fact pretty much the majority of balancing and aggregation technologies) doesn’t double the speed, but it does roughly double the bandwidth available to the system. Confused? Think of it like this:

You own a car. The car has a top speed of 70mph and not one mph more. You get on a one-way, single track road in a country where there are no speed limits. You are now happily driving along at 70mph. Some bright spark at your local council decides that you should be able to drive at 140mph, so they cut down the trees on one side of the road and add a second one-way carriageway, going in the same direction as the first.

Can your car now drive at 140mph because of the new lane? No. The public official is wrong. Your engine can only offer you 70mph. The extra lane doesn’t help you, but it does help the guy in the car next to you also driving at 70mph arrive at the other end of the road at the same time. It also means that when you encounter a tractor ambling along in your lane, you have somewhere else to go without slowing down.

This is fundamentally what MPIO is doing. So why isn’t it a 2Gbps link? Basically, because networking technology is a serial communications medium. By adding a second lane and calling it a faster way to get data to the end of it, you get into the different world of parallel communications. Under parallel communications you have to split (fragment) information into smaller pieces and push it down each of the wires to the destination. This in turn implies the need for more complicated buffer/caching designs that can cope with each section of the data arriving at a different time, arriving all at the same time, arriving in a different order than intended or, of course, not arriving at all. This timing problem is known as clock skew.

To fix this, you need to introduce overhead, either by synchronising delivery to make it reliable (thus slowing it down and reducing error tolerance) or by adding overhead mechanisms designed to deconstruct, sequence, wait for or re-request missing or corrupt data sections and track timing. This is all something that you really don’t want in an iSCSI or SAS environment where response time (latency) is king. Consequently, there is a diminishing return on how much of this parallel working you can derive a benefit from in any system, including an MPIO system. iSCSI MPIO, if correctly configured, will offer something at around the boundary between worthwhile and not bothering in the first place. It is important to understand that it will not be a 100% increase in performance, nor is it likely to be a 50% increase; more realistically it will be something around the 30-40% mark.

Performance is only one of the design considerations for MPIO, and it is not the primary one. The primary consideration is fault tolerance and reliability.

In a correctly designed iSCSI system, independent NICs are connected to more than one switch, usually to more than one controller on the storage side and to more than one server on the host side. If one of these fails in a correctly implemented system, your production service probably won’t even notice. You can even be so bold as to perform live switch re-wiring on iSCSI systems without impacting the client services involved, although it should be stressed that this is for bragging rights and in practice should not be attempted.

To summarise, MPIO allows you to get twice (+) as much data down to the end of the link, but you cannot get it there any faster. In general, if you can avoid using fragmented streams, you will reap the maximum benefit. The obvious approach here is that each “lane” should be using unrelated data: instead of carving up a single video file and pushing little bits of it down each lane one bit at a time (MPIO can do this), one lane is used for the video and the second lane is used for literally anything and everything else. This is a simplification of what MPIO generally does; however, in practice it offers a good way to get your head around it.
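The car analogy can be put into numbers with a deliberately idealised model: each stream is pinned to one lane, lanes are shared equally when there are more streams than lanes, and the MPIO overhead discussed above is ignored entirely. The function name and the model itself are illustrative assumptions, not a description of how any particular MPIO driver schedules traffic.

```python
def lane_model(link_gbps, links, flows):
    """Return (per-flow Gbps, aggregate Gbps) for an idealised multi-lane road.

    Assumptions: no protocol overhead, each flow uses one link at a time,
    and flows share the links evenly when flows outnumber links.
    """
    if links < 1 or flows < 1:
        raise ValueError("links and flows must both be at least 1")
    per_flow = min(link_gbps, link_gbps * links / flows)
    return per_flow, per_flow * flows


if __name__ == "__main__":
    for flows in (1, 2, 4):
        per_flow, aggregate = lane_model(link_gbps=1.0, links=2, flows=flows)
        print(f"{flows} flow(s): {per_flow:.2f} Gbps each, {aggregate:.2f} Gbps in total")
```

With two 1Gbps lanes, a single stream still tops out at 1Gbps; the second lane only pays off once there is a second stream to fill it.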

Techniques

So how does MPIO carve up the traffic?

There are broadly speaking 4 different paradigms for carving up MPIO traffic

Failover/Redundant: In this mode, one link is active, while the other is passive, i.e. up, but not doing anything. If the first link fails, the second path takes over and all existing traffic streams continue to receive the same bandwidth (% of the total available pie) on the same terms as before. This would give us a completely separate road that can only be used in emergencies. It may not be as fast or robust, or it may be identically spec’d and just as capable. A failover design may or may not return traffic to the first channel once it becomes available again.
Round Robin: This mode alternates traffic between channel 1 and channel 2, then goes back to channel 1, channel 2 and so on. Both links are active and both receive traffic in a slight skew as the data is de-queued at the sender. This offers the 2 lane analogy used above, with each 70mph car getting to the end at roughly the same time.
Least Queue Depth: This puts the traffic into the channel that has the least amount on it (or, to be more accurate, the least amount about to go onto it). If one channel is busier than the other (e.g. the large video file) then it will put other traffic down the second channel, allowing the video to transfer without needing to slow down to allow new traffic to join, which would delay its delivery. There are many different algorithms for achieving this, including varieties that use hashing to offer clients consistent paths based upon Layer 2 or Layer 3 addresses.
Path Weighting: Weighted paths and least blocking methods assess the state/capabilities of the channels. This is more useful if there are lots of hops between source and destination, multiple routes to a destination, or channels with different capabilities. For example, if you have iSCSI running through a routed network, then there could be multiple ways for it to get there. One route may go through 5 routers and another through 18 routers. Generally, the 5 router path might be preferable, provided the lower hop route genuinely gets the data there faster. Equally, the weighting could be based upon the speed of the path through to the recipient. Finally, if channel 1 is 10Gbps and channel 2 is only 1Gbps, then you might prefer the 10Gbps path to be used with a higher preference. Usually, a lower weighting number means a higher preference. This would be the equivalent of a 70mph road with a backup road with a max speed of 50mph. You know that it will get you to the destination, but you can guarantee that if you have to use it, it will take longer.
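The four paradigms can be sketched as toy path selectors. To be clear about what this is: the class and function names are invented for illustration, the logic is a minimal model of each idea, and it is not how the Microsoft DSM or any vendor DSM is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    weight: int = 1        # lower number = higher preference
    queue_depth: int = 0   # outstanding I/O requests on this path
    healthy: bool = True


def healthy_paths(paths):
    candidates = [p for p in paths if p.healthy]
    if not candidates:
        raise RuntimeError("no healthy paths left")
    return candidates


def failover(paths):
    """Always use the first healthy path in list order; the rest stand by."""
    return healthy_paths(paths)[0]


class RoundRobin:
    """Alternate between healthy paths: 1, 2, 1, 2 and so on."""

    def __init__(self):
        self._next = 0

    def pick(self, paths):
        candidates = healthy_paths(paths)
        path = candidates[self._next % len(candidates)]
        self._next += 1
        return path


def least_queue_depth(paths):
    """Send the next request down the path with the least work queued."""
    return min(healthy_paths(paths), key=lambda p: p.queue_depth)


def weighted(paths):
    """Prefer the healthy path with the lowest weighting number."""
    return min(healthy_paths(paths), key=lambda p: p.weight)


if __name__ == "__main__":
    fast = Path("10GbE", weight=1, queue_depth=3)
    slow = Path("1GbE", weight=5, queue_depth=1)
    rr = RoundRobin()
    print([rr.pick([fast, slow]).name for _ in range(4)])
    print(least_queue_depth([fast, slow]).name)
    print(weighted([fast, slow]).name)
    fast.healthy = False
    print(failover([fast, slow]).name)
```

The example at the bottom mirrors the 10Gbps/1Gbps case from the text: weighting prefers the 10GbE path, least queue depth picks whichever is currently quieter, and failover only moves to the 1GbE path once the preferred one is marked unhealthy.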

So, more lanes equal more stuff then?

Sounds simple doesn’t it? Just keep throwing lanes into the road and then everyone gets to travel smoothly at 70mph.

In principle, it is a nice idea, but in practice it doesn’t actually work in most iSCSI implementations.

For starters, server grade network cards (which you should be using for MPIO, not client adapters) are expensive and server backplanes can only accept a finite number of them. Server NICs also consume power, and power costs money! Keep that in mind if you do decide to throw extra ports at an iSCSI solution.

The reality is that if you have an MPIO solution that will allow you to experiment with more than 2 NIC adapters in a MPIO group, you will likely see the performance gain rapidly tail off. In turn it will actually wind up presenting you with steadily worsening performance, not the increase that you are expecting.

Attempting to MPIO iSCSI traffic across 4x 1Gbps NIC’s actually offers worse read and write speeds for a virtual machine than 2x 1Gbps under a Hyper-V environment (see tests E and F below). The system starts to waste so much time trying to break apart and put back together each lane’s worth of traffic that it just doesn’t help the hypervisor.

Where a 4 NIC configuration is beneficial is actually in providing you with a “RAID 6” MPIO solution. Here you can have 2 active and 2 passive adapters. Remember, in an idealised scenario they could be 2x 10GbE and 2x 1GbE with a hard-coded preference for the 10GbE and a method of failing traffic back to the 10GbE. Just be aware that you can only use the 10GbE set OR the 1GbE set at any one time, not one port from each. The exception to this rule is hashing based channel assignment, as this offers more paths to “permanently” assign data into, without the overhead of path swapping or de-fragmentation of traffic.

Some DSMs (a Device Specific Module is effectively an OEM-specific MPIO driver under Windows, such as Dell Host Integration Tools [HIT] or NetApp Host Utilities) logically limit an MPIO group to two active NICs if the storage controller is only exposing 2 usable NICs back to the HIT instance. Dell EqualLogic Host Integration Tools (the EqualLogic DSM) will grab the first two paths it finds and shut down any others into a passive state, no matter how hard you try to start them up.

What should a MPIO network “look” like?

Ultimately this is down to what you want to get out of the MPIO solution and within the bounds of what your hardware vendor will support.

There are effectively three schools of thought here (I won’t comment on which is right because as you’ll see, it isn’t that simple)

MPIO is about Meshing

If you see MPIO as a mesh, then 2 NICs in a server connecting to 2 NICs in a storage appliance equals a mesh where each NIC has a path to the other. This is more aligned with how you probably already think about Ethernet networks.

MPIO is about Pathing

If you see MPIO in this model, it is simply about more than one line being drawn between two different end points, with no line crossing or adding any complexity, complication and confusion. This is more aligned with how you likely currently think about SAS, Fibre Channel and hard drive wiring.

MPIO is about Redundancy

This is the purest of the three views. It sees the complexity and overheads associated with MPIO as being a problem: there will always be some sort of increase in latency, or a drop in some aspect of performance, from trying to squeeze more bandwidth out of MPIO. This view attempts to keep the design simple and run everything at an unimpeded wire speed, but maintain the failover functionality afforded by MPIO.

The three schools of thought are outlined in the diagram below.

Why not Meshing?

When you start out with MPIO, you may be tempted towards implementing option 1. After all, your Server NICs (circles) are likely connected to a switch, as is your storage array (squares). The switch allows you to design to this topology and, if you allow the MPIO system to have knowledge of all possible permutations of connectivity, the system will be highly redundant, making it very robust.

Yes and no! Yes, it is very robust, but at this point in your implementation, how do you know which path traffic is taking? How do you know that it is optimised? What is stopping Server NIC1 and Server NIC2 from both talking to Storage NIC1 at the same time? If they do that, then they have to share 1Gbps of bandwidth between them while Storage NIC2 is left idle. Suddenly all of your services will have intermittent bursts of speed and infuriating drops in performance. The more server NICs that you add, the faster the decrease in performance will be. With 4 Server NICs, there is nothing to stop the MPIO load balancer from intermittently pushing the data from all 4 Server NICs towards a single Storage NIC.

In a Round Robin setup, in a full Mesh design (as shown in #1) it will likely order the RR protocol in the order that you gave the system access to the paths. Given the following IP Addresses

Server: 192.168.0.1, 192.168.0.2
Storage: 192.168.0.11, 192.168.0.12

The RR table could look like this

  1. 192.168.0.1 -> 192.168.0.11
  2. 192.168.0.2 -> 192.168.0.11
  3. 192.168.0.1 -> 192.168.0.12
  4. 192.168.0.2 -> 192.168.0.12

Or it could look like this

  1. 192.168.0.1 -> 192.168.0.11
  2. 192.168.0.1 -> 192.168.0.12
  3. 192.168.0.2 -> 192.168.0.11
  4. 192.168.0.2 -> 192.168.0.12

In both examples you either have two different sets of traffic being sent from the same Server NIC concurrently or received by the same Storage NIC concurrently. This is going to undermine performance, not improve it (this is outlined in Mbps terms in the tests shown later in this document).
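A short sketch makes the contention easy to count. It enumerates the paths for a full mesh and for a 1:1 point to point layout, using the example addresses above, and reports how many paths end up sharing the busiest NIC at either end. The function names are hypothetical and the model only counts paths; it says nothing about how a real load balancer times its traffic.

```python
from collections import Counter
from itertools import product

SERVER_NICS = ["192.168.0.1", "192.168.0.2"]
STORAGE_NICS = ["192.168.0.11", "192.168.0.12"]


def full_mesh(server_nics, storage_nics):
    """Every server NIC gets a path to every storage NIC."""
    return list(product(server_nics, storage_nics))


def point_to_point(server_nics, storage_nics):
    """Each server NIC is paired with exactly one storage NIC."""
    return list(zip(server_nics, storage_nics))


def max_sharing(paths):
    """Largest number of paths that share a single NIC at either end."""
    by_server = Counter(server for server, _ in paths)
    by_storage = Counter(storage for _, storage in paths)
    return max(max(by_server.values()), max(by_storage.values()))


if __name__ == "__main__":
    for label, paths in (
        ("full mesh", full_mesh(SERVER_NICS, STORAGE_NICS)),
        ("point to point", point_to_point(SERVER_NICS, STORAGE_NICS)),
    ):
        print(f"{label}: {len(paths)} paths, up to {max_sharing(paths)} per NIC")
        for server, storage in paths:
            print(f"  {server} -> {storage}")
```

Passing four server NICs and two storage NICs to `full_mesh` yields 8 paths with up to four of them converging on one storage NIC, which is the worst case described above.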

In a failure situation, the performance issue is exacerbated

  • If #3 fails, then nothing changes in performance or bandwidth.
  • If #2 fails then the total bandwidth available to the system halves and all services contend using the first link.
  • If #1 fails then as with #2, all services suffer with contended bandwidth, however the system also has the overhead of MPIO to further reduce performance.

What benefit is there to MPIO operating in scenario #1? In this failed state, should one of the Storage NICs also fail, the system will continue to operate. In #2, if the working Storage NIC fails, the entire system will fail despite the fact that the Storage NIC on the second path is actually working. It is up to you and your design as to whether the performance hit that you will experience is worth this extra safeguard. In a highly secure, mission critical or safety system it may be worth the extra overhead.

There are, however, some middleware layers that can manage this for you. Dell Host Integration Tools (HIT) does, for example, attempt to undertake some management of these types of situations, optimising the mesh by putting the links that will cause overhead into a failover only state, while maintaining the optimal number of active mesh links. In my experience though, the HIT solution is not able to manage the risk perfectly. It does not give any consideration to redundant NIC controllers. For example, if you have 2 physical Dual Port NICs in your Server with the intention of one port from each NIC making up the active “pair”, Dell HIT is not able to detect, or be programmed to ensure, that the active paths are spread across both controllers. In my experience, it will tend to bunch them together onto the same physical NIC controller, leaving the second controller idle.

Fixing this problem requires an additional layer of complex, expensive and usually proprietary middleware logic, further impacting performance and increasing cost. Therefore, industry best practice is to avoid thinking of iSCSI MPIO as being a Full or even a Partial Mesh, but instead think of it as offering independent channels akin to those shown in #2. It is for this reason that virtually all iSCSI MPIO vendors insist that each Server -> Storage NIC pair exist on its own logical IP subnet as this completely negates the possibility of interweaving the MPIO paths while also ensuring that any subnet-local issue (such as a broadcast or unicast storm) is only likely to take down one of the subnets, not both.
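The one-subnet-per-pair rule is easy to sanity check before you cable anything. The sketch below uses Python's standard `ipaddress` module; the function name, the /24 prefix and the 10.0.x.x addresses are illustrative assumptions, so substitute whatever your storage vendor specifies.

```python
import ipaddress


def validate_pairs(pairs, prefix=24):
    """Check a list of (server_ip, storage_ip) pairs against the 1:1 subnet rule.

    Returns a list of human-readable problems; an empty list means each pair
    sits in its own subnet. The prefix length is an assumption of this sketch.
    """
    problems = []
    seen = set()
    for server_ip, storage_ip in pairs:
        server_net = ipaddress.ip_interface(f"{server_ip}/{prefix}").network
        storage_net = ipaddress.ip_interface(f"{storage_ip}/{prefix}").network
        if server_net != storage_net:
            problems.append(f"{server_ip} and {storage_ip} are not in the same subnet")
            continue
        if server_net in seen:
            problems.append(f"{server_net} is shared by more than one path")
        seen.add(server_net)
    return problems


if __name__ == "__main__":
    isolated = [("10.0.1.10", "10.0.1.20"), ("10.0.2.10", "10.0.2.20")]
    shared = [("192.168.0.1", "192.168.0.11"), ("192.168.0.2", "192.168.0.12")]
    print(validate_pairs(isolated))
    print(validate_pairs(shared))
```

The second example reuses the single 192.168.0.0/24 network from the mesh discussion, so it is flagged because both paths share one subnet.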

iSCSI as part of a Virtual Network Adapter, Converged Fabric LBFO Team

Since the release of Windows Server 2012, Microsoft has hinted at the idea of using iSCSI through Converged Fabric* Load Balancing and Failover (LBFO) teams, as long as the iSCSI NICs are virtual and they connect through a Hyper-V VM Switch which itself backs onto a Windows Server LBFO team. Even the venerable Aidan Finn has hinted at it. I have, however, never seen a discussion of it being attempted online, nor have I ever seen it benchmarked.

To be clear over what we are talking about when I say a Virtualised, Converged Fabric, LBFO Team:

  1. 4x 1Gbps Ethernet physical adapters
  2. Grouped into a Windows Server 2016 LBFO Team, appearing to Windows as a single logical network adapter called “ConvergedNIC”
  3. “ConvergedNIC” is connected to an External Virtual Switch called “ConvergedSwitch”
  4. A Virtual Machine Network Adapter is created on the Hypervisor’s Parent Partition (ManagementOS) and this is assigned to the correct VLAN, given an IP address and hooked up to the iSCSI Target
  5. 4 physical NICs, no MPIO, 1 logical NIC

So, does it work?

Yes! It does work and it appears to be stable and even usable; but with some sacrifice in performance (keep reading for some benchmark numbers as “test A” below). I have however had test VMs running under this design for nearly a year without any perceivable issues in either VM or hypervisor stability.

* If you are not familiar with the Concept of a Converged Fabric: A Converged Fabric is a data centre architecture model in which the concept of 1 NIC = 1 Network/Subnet/VLAN/Traffic Type is abandoned. Instead, NICs are usually pooled together into Teams with multiple traffic types, Networks, Subnets and VLANs being allowed to use any of the available bandwidth within the team. Quality of Service (QoS) algorithms are used to ensure that priority traffic types are defined (such as iSCSI in this example), ensuring that the iSCSI system is never starved for bandwidth by someone performing a large file transfer across the team. A Converged Fabric architecture is considered to be more efficient, lower cost and offer better failover reliability than traditional methods in which entire 1GbE or 10GbE NICs could be left idle, waiting for traffic that while high bandwidth, may be infrequent. A Converged Fabric architecture allows other users/systems to benefit from the available bandwidth when not needed by its primary application. It can also offer the primary application additional bandwidth in some situations.

If you have an 8 NIC Hypervisor setup with 2 physical iSCSI NICs, 2 physical production network NICs, 1 physical heartbeat NIC, 1 physical live migration NIC, 1 management network NIC and 1 out-of-band management NIC, then you are paying to power NICs 4-8 but not deriving much benefit from them due to how infrequently they are used. If this sounds familiar to you, then you should consider migrating to a Converged Fabric design.

Quantifying Best Practice

So far, this article has discussed MPIO, meshing, pathing and redundancy as well as a quick detour into using converged fabric LBFO for iSCSI connections. So let’s look at some numbers that underpin these approaches.

Tests were undertaken using the following hardware configuration:

  • Dell EqualLogic PS4110x running firmware 9.1.1 R436216, with 2 active 1GbE NICs on a single controller
  • Dell PowerEdge P630 with 8x1GbE adapters (4x Broadcom NetXtreme and 4x Intel I350 adapters) with 9K Jumbo Frames correctly enabled
  • Windows Server 2016
  • Switching on Cat6a cabling via 2x Cisco Catalyst 2960-48’s
  • The 64K block, GPT formatted, 3TB target LUN was set up as a CSV and the nodes were in a Cluster, with a second identical node idling as a second cluster member (CSV-FS has a natural performance hit compared to NTFS)

7 tests were performed as outlined in the following table

Test | Description | Physical Paths (Active) | Physical Paths (Passive) | Active NICs (Intel) | Active NICs (Broadcom) | LBFO Team | Dell HIT | MPIO Mode
A | 4 NIC in LBFO Team, No MPIO | 4 | 0 | 0 | 4 | Y | N | n/a
B | 4 NICs, fully meshed, RR | 8 | 0 | 2 | 2 | N | N | Round Robin
C | 2 NICs, no mesh (point to point) | 2 | 2 | 2 | 0 | N | N | Round Robin
D | 1 NIC only (control test) | 1 | 1 | 1 | 0 | N | N | n/a
E | 4 NICs, fully meshed, LQD | 8 | 0 | 2 | 2 | N | N | Least Queue Depth
F | 4 NICs, partial mesh, RR | 4 | 0 | 2 | 2 | N | N | Round Robin
G | 2 NICs, no mesh (point to point) with EqualLogic Host Integration Tools | 1 | 1 | 2 | 0 | N | Y | Least Queue Depth

If you are more visual, the following diagram summarises the above in a graphical format

The Results

The following table summarises the read/write performance of each test on Sequential 4MB reads and writes, as reported by “Anvil’s Storage Utilities” version 1.1.0 (build 1st January 2014). All tests were performed on the same Windows 10 Enterprise VM without rebooting in between each test and without performing any other activities on the VM disk.

The results below are ordered by test, from the test offering the best performance to the test offering the worst performance, using the Read MB/s column as the sort index.

Sequential 4MB (Read)

Test | Response (ms) | MB read | IOPS  | MB/s   | Control Deviance (%)
C    | 30.4791       | 1052    | 32.81 | 131.24 | 32.17
F    | 39.801        | 804     | 25.13 | 100.50 | 1.21
D    | 40.2814       | 796     | 24.83 | 99.30  | 0
A    | 51.3782       | 624     | 19.46 | 77.85  | -21.60
G    | 60.7197       | 528     | 16.47 | 65.88  | -33.66
E    | 273.9667      | 120     | 3.65  | 14.60  | -85.30
B    | 404.65        | 80      | 2.47  | 9.89   | -90.04

Sequential 4MB (Write)

Test | Response (ms) | MB written | IOPS  | MB/s   | Control Deviance (%)
C    | 21.7266       | 1024       | 46.03 | 184.11 | 70.25
F    | 468.9896      | 772        | 2.13  | 8.53   | -92.11
D    | 36.9883       | 1024       | 27.04 | 108.14 | 0
A    | 89.5977       | 1024       | 11.16 | 44.64  | -58.72
G    | 23.8047       | 1024       | 42.01 | 168.03 | 55.38
E    | 1010.7556     | 360        | 0.99  | 3.96   | -96.34
B    | 964.766       | 376        | 1.04  | 4.15   | -96.16

Response (ms) = Lower is better
MB read/written = Higher is better
IOPS = Higher is better

Control Deviance (%) = the positive or negative impact in MB/s performance compared to the single NIC, no MPIO control test (test D).
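If you want to repeat the exercise on your own hardware, the Control Deviance column is straightforward to recompute. The sketch below uses the MB/s figures from the results above with test D as the control; the function and variable names are my own.

```python
# (read MB/s, write MB/s) per test, copied from the results above.
RESULTS_MB_S = {
    "C": (131.24, 184.11),
    "F": (100.50, 8.53),
    "D": (99.30, 108.14),  # single NIC, no MPIO control test
    "A": (77.85, 44.64),
    "G": (65.88, 168.03),
    "E": (14.60, 3.96),
    "B": (9.89, 4.15),
}


def control_deviance(value_mb_s, control_mb_s):
    """Percentage change in MB/s relative to the control test."""
    return round((value_mb_s - control_mb_s) / control_mb_s * 100, 2)


if __name__ == "__main__":
    control_read, control_write = RESULTS_MB_S["D"]
    for test, (read, write) in RESULTS_MB_S.items():
        print(
            f"{test}: read {control_deviance(read, control_read):+.2f}%  "
            f"write {control_deviance(write, control_write):+.2f}%"
        )
```

A positive figure means the configuration beat the single NIC control; only test C manages that for both reads and writes.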

Test A | Converged Fabric LBFO

The Microsoft dream of virtualising everything does hold up – at not being completely terrible. Sitting in the middle of the table, using a fully converged fabric, virtualised setup across 4 NICs resulted in a 22% reduction in read speed compared to a single NIC and a 59% reduction in write speed.

There may be some improvements to be made by creating multiple Virtual iSCSI interfaces connected to the virtual switch; however, these were not tried. Based upon the current view of the technology, while it works and offers a data centre design simplification, that simplification factor is not worth the performance sacrifice.

Test B | Round Robin, Full Mesh

This test proves that viewing an iSCSI setup as a full mesh and throwing NICs at the proverbial problem is going to do nothing to help you: it resulted in a 90% and 96% reduction in read and write performance respectively! Your iSCSI should be configured in a 1:1 “path” setup between initiator and target. Any additional NICs should be put into “Round Robin with subset”, i.e. made to be passive fail-over adapters.

Test C | Round Robin, 1:1 Paths

This test proves how you are supposed to use iSCSI. Two non-crossing paths allow for a full bandwidth connection down each path between the initiator and the target. This configuration provided an increase in performance over a single adapter and was the only test that provided improvements to both read and write metrics.

Test D | Control

This was the baseline control test for this experiment. 1 NIC talking to 1 controller port. Nothing complicated here.

Test E | Least Queue Depth, Full Mesh

This test repeated Test B, but changed the MPIO model from RR to LQD to see if it made any difference. Read performance was slightly better than under RR, but was still 85% worse than the control test.

Test F | Partial Active Mesh

This test looked to see whether having a partial active mesh made any difference. There was a very small 1% increase in read performance from this, but a significant write penalty. In practice, you cannot push/pull 2Gbps to/from a 1Gbps source, so the design is not conducive towards improved speed under a synthetic load.

Test G | Least Queue Depth, 1:1 Paths

Test G was a genuine surprise. I was expecting to see Dell EqualLogic Host Integration Tools (HIT) version 4.9 offer an increase in performance, not a decrease. However, repeating the test yielded the same results. This is not what I have usually seen in practice, where VMs feel more responsive with HIT installed than without. Experience suggests to me that something else was at play here, perhaps the HIT version being poorly optimised for Windows Server 2016, or the Dell stack getting grumbly about it using a retail Intel I350-T4 adapter instead of a Dell one. Dell HIT forces the use of pathing no matter what you try, and sets all other adapters into passive mode. It used LQD as the MPIO algorithm. Evidently this resulted in an increase in write performance but a reduction in read performance, albeit not as high as without HIT being installed.

Although not shown in the results above, HIT did help improve performance in some of the Anvil Tests. The long queue depth tests resulted in higher IOPS figures for both read and write values by a small margin. None of the other tests yielded such an improvement.

Conclusion

As you can see from these results, there is only one way that you should be conceptually thinking about your iSCSI environment: 1:1, point to point paths. Anything over and above this should be set to passive/failover/offline in order not to impact performance.

General Subnet Recommendations

Subnet recommendations go hand in hand with this, but note that they are generally made by the storage vendor, and you should follow their advice. I have encapsulated the general recommendations/requirements of a number of providers in the table below. The subnet count column is in essence a statement that for each NIC on the storage device, there should be a dedicated subnet (and ideally broadcast domain/VLAN) back to the iSCSI server.

Vendor                | Subnet Count | Source
Dell (Non-EqualLogic) | 2            | View
Dell EMC              | 2            | View
Dell EqualLogic       | 1            | View
Microsoft             | 2            | View
NetApp                | ?            | I couldn’t find any guidance from an official source. There is community evidence of both being used by end-users
NetGear               | 2            | View
QNAP                  | 2            | View
Synology              | 2            | View

As you can see, with the exception of Dell EqualLogic which provides a middleware solution known as the Host Integration Tools (HIT) to cope with this, most vendors are quite specific on the use of a “single path” logical topology for server/storage connectivity — aka one subnet per storage appliance NIC.

General Advice

I will end this piece with some general advice and tips for working with MPIO. It isn’t exhaustive, but these are some quick observations from experience of using the technology for many years. Some of them are obvious, some of them might help you avoid a head scratcher.

  1. If you are using an enterprise iSCSI solution, follow the vendor’s advice, forget anything you read on the Internet. Everyone is a know-it-all on the internet and there are plenty of “I’m a Linux user so I know best” screaming matches about how EqualLogic are wrong about the recommendations for EqualLogic’s own hardware. I’m pretty sure that EqualLogic… uh, tested their stuff before writing their user manual.
  2. If you are using an enterprise solution and the vendor offers a DSM (MPIO driver), use it. Dell HIT is noticeably faster than the generic Microsoft DSM for Windows Server, but only works with Dell SAN hardware (naturally). Also ensure that you keep your DSMs up to date.
  3. Follow your vendor’s guidelines with respect to subnets. If in doubt, drop them an email. You’ll usually find them quite accommodating.
  4. Unless your vendor has expressly told you to, do not MPIO back from the storage system – i.e. don’t team, MPIO, load balance etc on the storage side. Do it all on the server initiating the request.
  5. Stick to two port/1:1 path MPIO designs. If you need more, create multiple pairs and have each on different networks going to different storage systems so that the driver knows where to send traffic explicitly while maintaining isolation.
  6. If you want to think about your MPIO as a meshing design, it has to be meshed for redundancy, not active links (unless your system needs to keep living, breathing human beings alive and do so at all costs).
  7. With iSCSI and SAN MPIO, try and avoid network hops (routers).
  8. All ports in a group must be the same type, speed and duplex.
  9. Disable port negotiation and manually set the speed on the client and switch; this will make failover/failback processes faster for your redundant paths.
  10. Use VLANs as much as possible (try and avoid overlaying broadcast domains across a shared Layer 2 topology).
  11. Use Jumbo Frames as much as possible unless the iSCSI subnet involves client traffic.
    Hint: Your iSCSI subnet should not involve client traffic!
  12. Ensure that your NIC drivers and firmware are kept up-to-date.
  13. Disable all Windows NIC service bindings apart from vanilla IPv4 on your iSCSI networks. For example, Client for Microsoft Networks, QoS Packet Scheduler, File and Printer Sharing for Microsoft Networks etc. If you aren’t using it, disable IPv6 too on the iSCSI interfaces to prevent IPv6 node-chatter.
  14. In the driver config for your server grade NIC (because you are using server grade NICs, right?) max out the send and receive buffer sizes on the iSCSI port. If the server NIC has iSCSI features that are relevant (such as iSCSI offloading), enable them.
  15. When you are building a Windows Server, script the MPIO install, enable MPIO during the script and set the default policy as part of the build process, then patch and REBOOT the system before you even start configuration. If I had a £1 for every time I’d had to rescue someone from not doing that and then not REBOOTING…
  16. If you are using a SOHO/SME general purpose commodity NAS, if (and only if) you have a UPS, disable Journaling and/or Sync Writes on your iSCSI partitions/devices. There is a benefit, but remember if you are hosting SMB shares on a commodity appliance you actually do want Journaling running on those volumes.
  17. Keep your NAS/SAN firmware up to date.
  18. Keep your storage system and iSCSI block sizes, cluster and sector sizes optimised for the workload. Generally this means bigger is better for virtualisation storage and video. 256/64K, 128/64K or 64/64K depending on what your solution can offer.
  19. Keep volumes under 80% of capacity as much as possible.
  20. Use UPSs: Remember, iSCSI and SAS are hard drive/storage protocols. They are designed to get data onto a permanent storage medium just like RAID controllers. RAID controllers have backup batteries because you do not want to lose what is in process in the RAID controller cache when the power goes out. Similarly, you need to think of your iSCSI and External SAS sub-systems much the same as you would a RAID sub-system.
  21. If you have a robust UPS solution, enable write caching and write behind/write back cache features on your storage systems and iSCSI mounted services to gain extra performance benefits. Be mindful that there is risk in this if your power and shutdown solution isn’t bulletproof.
  22. Test it! Build a test VM and yank a cable out a few times. You’ll be glad you sacrificed a Windows install or two to ensure it is right when you actually pull an iSCSI cable out of a running server… Believe me I know what a relief that is.

Netgear ReadyNAS Duo v2 as a Windows Server Backup Target across SMB while allowing differencing in the backup type

System Requirements:

  • Netgear ReadyNAS Duo v2 or any SMB capable NAS
  • Windows Server 2008, 2008 R2, 2012, 2012 R2

The Problem:

One of the most frustrating “features” of Windows Server since the release of Windows Server 2008 has been the backup set. Windows Server Backup added support for backing up to SMB, however only if you perform a full, rather than incremental or differential backup of the host server.

The main problem with this is the time it takes to perform the backup. Depending on the size of the disk array involved, a normal backup job can take tens of hours, even days. If you want to run the backup job daily and the job is taking more than a day to complete while saturating the network, then it is not a very effective backup solution.

Yet the real power in using the network in the first place is the fact that it permits the distribution of the backup to a remote location without the need to go and physically disconnect a drive and carry it. The drives can also be a lot further away than with USB, eSATA or FireWire: in another building or even in another country.

Furthermore, the array is more expandable than a typical USB disk. The maximum supported size of this little ReadyNAS Duo v2 is 2x4TB in RAID 0, resulting in 8TB of storage. I could also run it in RAID 1 if I needed higher levels of data security. This is much better than a typical USB disk. With a 4 or 8 bay NAS, you can even grow the array by adding new drives and expand the VHDX file according to your needs (up to the limit of the native NAS file system or the NTFS volume limit). Devices with more bays also allow for additional RAID types and associated data security such as RAID 5, 6 or 10.

In many situations, you can use iSCSI for this purpose. Most high end and Enterprise NAS storage and SAN solutions are designed to provide thick or thin provisioned iSCSI targets which you can easily mount via the iSCSI initiator in Windows. An iSCSI mounted drive in Windows is – at least as far as Windows is concerned – presented as a local disk and therefore you can perform a differencing backup under the control of Windows Server Backup (WSB).

So what can you do if you have a consumer grade NAS appliance or an old model device that does not expose iSCSI services, such as the Netgear ReadyNAS Duo v2? While the v1 version has an unofficial iSCSI Target plugin, this does not work on the v2 model. With a very low power, ARM based NAS lying around with 6TB of disks in it, it seems a shame to relegate it to the dustbin.

More Info

Storage virtualisation is the answer.

Simply put, Windows Server Backup (WSB) cannot itself perform differential backups to a SMB share, however a SMB share (even a SMB 2.0 share) can host a virtualised storage disk… and Windows can mount a virtualised disk across SMB. Once mounted, WSB is agnostic to the underlying disk location or the fact that it is stored on a SMB share, as Windows presents the disk as being locally attached and abstracts the ‘what’ and ‘where’ entirely to the virtualisation layer.

The Test

If you are going to attempt this, I strongly recommend that you enable Jumbo Frames on the device as you may be able to squeeze an additional 10-20 Mbps of write speed out of it.

View: Netgear ReadyNAS Duo V2 and Jumbo Frames

  1. Ensure that your machine can connect to the ReadyNAS over SMB i.e. \\<ipAddress>\<shareName>
  2. Create a share on the NAS for the backup. Create a dedicated one so that you minimise SMB file system update requests to the share. As will be mentioned below, this causes a 10-20Mbps loss of performance even if nothing is actually happening in Windows Explorer.
  3. Disable as many services as you can on the NAS. The less work the CPU is doing and the more free RAM, the better this will be.
  4. Open PowerShell on Windows 8, 8.1, 10, 2012, 2012 R2 or 2016 and enter:
    New-VHD -Path "\\<ipAddress>\<shareName>\Backup.vhdx" -SizeBytes 4096GB

    Substitute the 4096GB (4TB) with the size that you require. This will create a dynamically expanding Hyper-V Virtual Hard Drive on the ReadyNAS.

  5. In PowerShell issue the following command to mount the VHDX file
    Mount-VHD -Path "\\<ipAddress>\<shareName>\Backup.vhdx"

    Now use DiskPart or Disk Management to initialise, partition and format the disk. Remember to format it using a 64K allocation unit (cluster) size as this will be important to preserve performance for the large files involved.

    Alternately you can execute the entire initialisation and mounting process in PowerShell using:

    Mount-VHD -Path "\\<ipAddress>\<shareName>\Backup.vhdx" -Passthru | Initialize-Disk -Passthru | New-Partition -DriveLetter B -UseMaximumSize
    Format-Volume -DriveLetter B -FileSystem NTFS -NewFileSystemLabel "Backup Disk" -AllocationUnitSize 65536 -Confirm:$false -Force

    This will create a 64K NTFS partition called “Backup Disk” and mount it on B:\ using the VHDX file found on the ReadyNAS

  6. Now, if you attempt to use Windows Server Backup you will be able to create a differencing disk backup set.

Does it work?

Windows Server Backup (WSB) certainly accepts the disk without any complaints and is dutifully able to create the first Normal copy after which it is able to easily perform the delta-backup as would be familiar for an incremental or differential backup type. So yes, it does work. It tricks Windows into accepting the SMB target.

Performance is however a sticking point.

To apply some context: According to a “Legit Reviews” review of the WD Red 5400rpm 3TB drives in the NAS, each drive should easily have been able to manage a write speed of 80MB/s or 640Mbps at a minimum – with something around 147MB/s or 1176Mbps being expected for sequential writes.

Source: WD Red 3TB NAS Hard Drive Review (Page 3)

 

Creating a Hyper-V VHDX file and writing linear zeros to it across the network results in a write speed variance of between 445Mbps and 495Mbps on the wire (55.6MB/s – 61.9MB/s). The highest that I saw it peak at was 537Mbps in burst.

Performing a backup onto the drive took 23 hours and 2 minutes with the MTU set to 1500 bytes with the average bit rate being approximately 420Mbps – 430Mbps for the backup. Particularly painful for the first normal backup. This is however comparable to the performance of a USB 2.0 drive.
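To put those wire speeds into context, here is a small Python sketch of the arithmetic. It assumes 8 bits per byte with no protocol overhead, decimal units, and that the job averaged 425Mbps (the mid-point of the 420Mbps – 430Mbps range) for the whole 23 hours and 2 minutes. Both function names are my own.

```python
def mbps_to_mb_per_s(mbps):
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8


def data_moved_tb(mbps, hours, minutes=0):
    """Approximate decimal terabytes moved at a constant average rate."""
    seconds = hours * 3600 + minutes * 60
    return mbps_to_mb_per_s(mbps) * seconds / 1_000_000  # MB -> TB


for rate in (445, 537):
    print(f"{rate} Mbps = {mbps_to_mb_per_s(rate):.1f} MB/s")

# Assumption: a constant 425 Mbps average for the full 23 h 2 min job.
print(f"about {data_moved_tb(425, 23, 2):.1f} TB moved")
```

Under those assumptions the job moved a little over 4TB, which gives a feel for why the first normal backup is so painful at these speeds.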

So we can safely conclude that the bottleneck is not the drives. The bottleneck is the ReadyNAS Duo v2. Other, newer devices with more CPU horsepower, more RAM, larger NIC buffers, native Jumbo Frame offload support and more NICs (as well as more drives) should be able to offer better performance.

As an interesting side observation, having a Windows Explorer session sitting open to a SMB share and doing nothing slowed the zeroing process by 10-20Mbps on its own. This highlights the impact on write performance of having the NAS CPU performing other actions.

Reality Check

There are however some problems here. Windows Server Backup is not aware that this is a virtual drive. It expects the drive to perform and present like a physical hard drive and it will treat it as such. Consequently:

  1. It is going to have poor support for and tolerance of power management (suspend and standby).
  2. It is going to have little to no tolerance for an unreliable network connection i.e. never try to do this over wireless or an unstable Internet connection.
  3. It is going to be extremely susceptible to power outages. You really should use a UPS on the NAS, switch(es) and the source machine to prevent data corruption during a power outage. Note that the important part here is that the source machine stops the backup and dismounts the VHDX in the time it spends on the UPS. After that it can all turn off quite happily.
  4. Windows is not going to automatically mount the VHDX. WSB will not do this for you. You will have to either ensure that it mounts at boot or schedule it to mount before the backup.
  5. Windows is going to need to ensure that it cleanly dismounts the VHDX during shutdown and power management operations. WSB is not going to do this for you either.
  6. Use write caching on the NAS and on the host operating system at your own risk i.e. definitely have a UPS if you want this performance benefit.
  7. If you need to perform a bare metal recovery of the server, the extra steps of getting the VHDX mounted in the boot recovery environment may prove frustrating.
  8. While VHDX in Server 2012 R2 can technically be used as a shared medium, you should probably avoid even contemplating trying to share one VHDX between multiple WSB hosts. Create one VHDX for each server.
  9. The current maximum size of a VHDX is 64TB. If this is an issue 1) why are you using a consumer grade NAS? 2) you need a SAN 3) you shouldn’t be using WSB.

It should be noted that all of the above potential disadvantages also apply to some degree to the use of iSCSI. The advantage of this approach is that you get the data virtualisation advantage, whereas with iSCSI your NAS would have to expose this i.e. you can literally just pick up the VHDX and move it to a new HDD, Array, NAS or SAN and WSB isn’t going to care or even notice.

So what can you do? My suggestion is this: do not use the WSB UI to schedule the backup. Use Task Scheduler to run a PowerShell script that calls the WSB command line tool WBAdmin.exe to perform the backup. Something like the following:

Mount-VHD -Path "\\192.168.0.100\Backup\Backup.vhdx"

Start-Sleep -s 60       # Wait 60 seconds for the disk to come online

C:\Windows\System32\wbadmin.exe start backup -backupTarget:B: -allCritical -include:C: -systemState -vssFull -quiet

Start-Sleep -s 120      # Wait 120 seconds for the disk to go offline

DisMount-VHD -Path "\\192.168.0.100\Backup\Backup.vhdx"

When Task Scheduler fires the script it will mount the VHDX, wait 60 seconds to allow the file system to mount, perform the backup, wait 120 seconds for the backup sub-system to shut down and then cleanly dismount the VHDX.

Optimisation and issues

The 256MB RAM, ARM based ReadyNAS Duo v2 was never intended for these kinds of workloads and that does show. Most of the issues encountered with it are simply as a result of the low power, low resource hardware specification.

I have already covered the need to:

  • Use Jumbo Frames on the NIC
  • Avoid wireless connections when mounting the VHDX
  • Use a 64K allocation unit size on the NTFS volume
  • Optionally use the write cache setting on the ReadyNAS
  • Optionally enable write caching and prevent buffer flushing on the volume as exposed via the host operating system

To this I will add the following:

Do not use VHD files, only use VHDX. VHDX files are far, far safer to use over SMB compared to VHD as they have error correction and handling built-in. Consequently, there is a reasonable chance that the file will actually survive a disconnect of the network cable or loss of power at the source or destination. This does however restrict you to using Windows 8/Server 2012 or higher at the expense of Windows Server 2008/2008 R2.

Only use 1Gbps or 10Gbps networks. Do not use 802.11 wireless and do not use 10/100 Fast Ethernet.

Use server grade NICs in your devices if you can.

Use MPIO and multiple switches if you can spare/afford the hardware.

Keep your VHDX defragged just like any other NTFS formatted hard drive.

If you have managed switches, consider preventing broadcast and multicast traffic from reaching the NAS. This will reduce CPU load a little although it will prevent NetBIOS discovery and may impact other services.

Do not use the NAS for anything else, especially small SMB file storage. Client access will degrade the write performance and consume CPU time. In particular do not leave the NAS SMB mount point mounted as a network drive as this also holds a SMB session open with the Linux Samba service.

Do not use dynamically expanding VHDX files. Using a dynamically expanding VHDX file was in reality fine (if you accept the limitations of the device). It took nearly 4TB of data without incident; however, the use of dynamically expanding disks is itself inefficient. Dynamically expanding disks have a performance penalty associated with them as the disk is constantly being told to zero the trailing 12MB of the VHDX file to permit future growth of the VHDX. There are also associated writes to the metadata of the VHDX to update the file boundary markers. In trying to squeeze every last bit of performance out of the ReadyNAS Duo v2, I wanted to use a fixed size VHDX file to see if it performed any better.

One of the first issues encountered was the length of time it takes to allocate and deallocate space in the Linux disk journal. Allocation is proportionately faster than deallocation; however, on attempting to allocate 5.4TB of disk space to a single VHDX file, the system would take so long to process the request that the VHDX creation process on Windows would time out, causing the VHDX to be corrupted. At this point the VHDX would be deleted by Windows. The resulting storage deallocation could take upwards of 20 minutes to appear as released in the ReadyNAS web UI.

Looking at ‘top’ in the SSH session, it was clear that the CPU was the culprit, capping out at 100% throughout the entire operation before dropping down to <1% once the journal had been updated.

After some trial and error, I found that with the web UI closed, only necessary services running, SSH logged out and no active Windows Explorer sessions open, I could allocate 2TB at a time without it causing a timeout.

The following script can thus be used to create the VHDX at 2TB, expand it to 4TB and then expand it to the desired 5.3TB (the maximum size of the ReadyNAS volume I was using was 5.4TB).

New-VHD -Path "\\192.168.0.100\Backup\Backup.vhdx" -Fixed -SizeBytes 2TB
Resize-VHD -Path "\\192.168.0.100\Backup\Backup.vhdx" -SizeBytes 4TB
Resize-VHD -Path "\\192.168.0.100\Backup\Backup.vhdx" -SizeBytes 5.3TB

Remember, this script is creating a Fixed Size VHDX file. Consequently it is going to pre-zero each sector on the disk instead of performing a constant 12MB zeroing chase at the end of the file. This means that it will take an extremely long time to complete (especially at only ~470Mbps) i.e. over 24 hours! So I suggest that you copy and paste all three lines at once into the PowerShell buffer and walk away. Once it has finished chewing over all three lines, mount the disk, partition it and format it as outlined earlier in the article.
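If you want a rough feel for how long to walk away for, the zeroing time can be estimated from the file size and the wire speed. This Python sketch assumes a constant 470Mbps and ignores the journal allocation time described above, so treat the result as a lower bound. The `zeroing_hours` name is my own.

```python
TIB = 2**40  # PowerShell's "TB" suffix is binary: 1TB == 2^40 bytes


def zeroing_hours(size_tb, rate_mbps):
    """Hours needed to write size_tb (PowerShell TB) at rate_mbps on the wire."""
    size_bytes = size_tb * TIB
    bytes_per_second = rate_mbps * 1_000_000 / 8
    return size_bytes / bytes_per_second / 3600


# Assumption: the full 5.3TB is zeroed at a steady 470 Mbps.
print(f"{zeroing_hours(5.3, 470):.1f} hours")
```

For the 5.3TB file used here that comes out at a little over 27 hours, which is consistent with the "over 24 hours" warning.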

Note: There are a couple of utilities out on the Internet that can create a fixed size VHDX from free space without performing the zeroing operation. You can save yourself a lot of time using such tools, however you should NEVER use them in a production or shared environment for reasons of data safety, privacy and security.

After the allocation of the space in the Journal and during the zeroing process the CPU use remains high, running constantly at 100% with about 20MB of RAM showing as free out of the 256MB total. This proves that the sub 500Mbps cap on the transfer speed is being caused by the CPU and not the disks. You must thus be realistic about the capability of the appliance or pay for more robust, more capable hardware.

You can technically also disable journalling on the volume using SSH, however you must ensure that you have a UPS wired into the NAS and the UPS can perform a controlled shutdown of the NAS if you try and use it. I elected not to do this.

tune4fs -O ^has_journal /dev/sda0
e4fsck -f /dev/sda0
sudo reboot

Final Results

If you have read this and the Jumbo Frames article on the ReadyNAS Duo v2, I am sure that you might be interested to hear what the cumulative impact of all of the performance tweaks and optimisations was.

With write caching enabled on both the NAS and the Windows Server and buffer flushing disabled on Windows Server, plus all of the other tweaks listed, backup throughput rose to a fairly consistent 560Mbps – 590Mbps with bursts up to 638Mbps. That equates to 70MB/s – 73.75MB/s and 79.75MB/s at burst. While nowhere near the capability of the drives themselves, it is at least now tantalisingly close to the benchmark value for the drives’ random write performance test, and network write performance is nearly 200Mbps faster.

Performing the backup job (which without any optimisations on a Dynamic VHDX took 23 hours and 2 minutes) with all optimisations enabled – and actually a significantly larger workload due to the addition of VM state backups in the job – took some 16 hours and 47 minutes. A considerable improvement! That works out at around 200GB per hour.
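As a quick worked comparison of the two runs, the following Python sketch takes the optimised run as 16 hours and 47 minutes and uses the approximate 200GB per hour figure quoted above. The variable names are my own and the data volume is only as accurate as that hourly figure.

```python
def minutes(hours, mins):
    """Total minutes in a duration given as hours and minutes."""
    return hours * 60 + mins


before = minutes(23, 2)   # unoptimised run on the dynamic VHDX
after = minutes(16, 47)   # optimised run on the fixed VHDX
saved = before - after

print(f"saved {saved // 60} h {saved % 60} min ({saved / before:.0%} shorter)")

# Assumption: the quoted ~200 GB per hour held for the whole optimised run.
approx_tb = 200 * after / 60 / 1000
print(f"roughly {approx_tb:.1f} TB backed up")
```

That is a saving of a little over six hours on the first normal backup, despite the larger workload.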

Most importantly, when the job ran again the next evening, it took less than 30 minutes thanks to it only having to back up file differences.

So why is this? It is predominantly related to the fixed size VHDX file. The higher throughput is being achieved because the ReadyNAS CPU is sitting at around 5% – 30% idle during the ~600Mbps copy. The Linux file system sees the write process as constituting changes that are internal to the VHDX file and the file itself isn’t growing, therefore the file system driver on the NAS has significantly less work to do. It is instead NTFS on the backup server that is processing the MFT updates into the file allocation table of the VHDX completely transparently to the NAS. This means that the CPU work has been transferred to the backup server, resulting in a performance increase (and a slightly cooler, less power consuming NAS).

0xefff0003 New-IscsiTargetPortal : Connection Failed on iSCSI Initiator client when connecting to a newly created iSCSI Target hosted on a Windows Server 2012 file server after changing server NIC configuration

System Requirements:

  • Windows Hyper-V Server 2012
  • Windows Hyper-V Server 2012 R2
  • Windows Server 2012

The Problem:

About 3 weeks ago, I completed the physical hardware installation of redundant NICs in a Hyper-V cluster that was backed onto a Windows Server 2012 server iSCSI SAN. The additional physical NICs were installed on the clients and communication between nodes worked as expected. The ports on the new NIC were placed into a new private address range of 192.168.100.0/24. Some were also removed from an existing multi-port NIC in the 192.168.254.0/24 range.

A couple of weeks later it came time to change the iSCSI SAN targets on the clients to use the new adapters, moving from 192.168.254.0/24 to 192.168.100.0/24.

New-IscsiTargetPortal -TargetPortalAddress 192.168.100.1

With the correct firewall and CHAP settings, it should have connected. Instead it returned:

New-IscsiTargetPortal : Connection Failed.
At line:1 char:1
+ New-IscsiTargetPortal -TargetPortalAddress 192.168.100.1
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (MSFT_iSCSITargetPortal:ROOT/Micro
soft/...CSITargetPortal) [New-IscsiTargetPortal], CimException
+ FullyQualifiedErrorId : HRESULT 0xefff0003,New-IscsiTargetPortal

The firewalls were OK, ping was OK. The DNS Connection Suffix and DNS Server (or lack of) were OK and NetBIOS over TCP/IP was disabled.

I could remove and reconnect the server using the original address without any problems.

More Info

Without wishing to be verbose on this one, the simple answer is that it appears to be a bug / limitation / “feature” of the iSCSI Target component of Server 2012. It was not a client issue.

The problem was that Windows had not been rebooted since standing up the new multi-port NIC (some 3 weeks prior). Yes, it was rebooted to put the hardware in, but once the NIC heads on the adapter had been configured it had not been rebooted subsequently.

It would appear that Storage Manager in Server 2012 does not force the iSCSI Target driver subsystem to re-parse the available adapter list.

I went into Server Manager > File and Storage Services > (right click the Storage server offering the iSCSI LUN) > iSCSI Target Settings.

The list contained a number of network addresses that were REMOVED 3 weeks ago, but none of the NEW IPv4 or IPv6 addresses assigned to the new NIC were available.

Closing and re-opening Storage Manager made no difference.

The Fix

I sighed and, being forced into unexpected maintenance on the cluster storage back end, shut down the cluster, updated drivers and firmware, cleared out Windows Update and rebooted.

After the reboot, all of the new addresses were available in Storage Manager and the redundant ones had disappeared.

So quite simply, reboot (or fully restart all iSCSI services).
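A successful ping only proves ICMP reachability. It says nothing about whether the iSCSI Target service is actually listening on the new address. As a hedged diagnostic, the Python sketch below attempts a plain TCP connection to the default iSCSI port (3260). The `portal_listening` helper is my own, and a refused or timed out connection here is only a hint that the target has not bound to the address, not proof of the cause.

```python
import socket


def portal_listening(address, port=3260, timeout=3.0):
    """Return True if a TCP connection to address:port can be opened."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


# Example, using the portal address from this article:
#   portal_listening("192.168.100.1")
```

If this returns False for the new address while the old address still returns True, the problem is on the target side rather than the initiator, which matches what I found.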