Overview of MPIO Recommendations & Best Practice on Windows Server
- Windows Server 2008 Storage Server
- Windows Server 2008 R2 Storage Server
- Windows Server 2012, 2012 R2, 2016
I needed to outline some of the general thinking relating to exactly how a practitioner should logically and physically understand MPIO. However, most of the discourse on the subject skips a fair amount of the obvious questions that people starting out with the technology may be asking (or trying to answer). I therefore present some thinking on the subject of understanding MPIO.
Multi-Path Input/Output, or MPIO, is a server technology that usually sits on the storage side of load balancing, failover and aggregation technologies. If you are getting into SAS, iSCSI or enterprise RAID solutions, where it is most commonly encountered, then this may (or may not) help you with understanding what MPIO is and why it (possibly) isn't what you think it is!
Logically understand what MPIO is all about
So you’ve got 2x1Gbps ports in a MPIO team, so you’ve got a 2Gbps link right? Wrong. That isn’t what is going on with MPIO.
MPIO (and in fact pretty much the majority of balancing and aggregation technologies) doesn’t double the speed, it roughly doubles the bandwidth. Think of it like this:
You own a car. The car has a top speed of 70mph and not one mph more. You get on a one way, single track road in a country where there are no speed limits. You are now happily driving along at 70mph. Some bright spark at your local council decides that you should be able to drive at 140mph, so they cut down the trees on one side of the road and add a second one-way carriage way, going in the same direction as the first.
Can your car now drive at 140mph because of the new lane? No. The public official is wrong. Your engine can only offer you 70mph. The extra lane doesn’t help you, but it does help the guy in the car next to you also driving at 70mph arrive at the other end of the road at the same time. It also means that when you encounter a tractor ambling along in your lane, you have somewhere else to go without slowing down.
This is fundamentally what MPIO is doing. So why isn’t it a 2Gbps link? Basically, because networking technology is a serial communications medium, and by adding a second lane and calling it a faster way to get data to the end of it, you get into parallel communications. Under parallel communications you have to split (fragment) and buffer information, and be able to cope with each section of the data arriving at a different time, all at the same time, in a different order than intended or, of course, not arriving at all.
To fix this, you need to introduce overhead in order to re-sequence, wait for or re-request data sections, and overhead is something that you really don’t want in an iSCSI or SAS environment where response time (latency) is king. Consequently, the return on this parallel working diminishes rapidly. MPIO, if correctly configured, sits at around the point where the boundary between worthwhile and not bothering in the first place actually falls.
To summarise, MPIO allows you to get twice as much data down to the end of the link, but you cannot get it there any faster. In general, if you can avoid using fragmented streams, you will reap the maximum benefit. The obvious approach here is that each “lane” should be using unrelated data. So instead of carving up a single video file and pushing little pieces of it down each lane in turn (MPIO can do this), one lane is used for the video and the second lane is used for literally anything and everything else. This is a simplification of what MPIO generally does; however, in practice it offers a good way to get your head around it.
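The car analogy can be put into numbers with a minimal sketch. It assumes each path runs at a fixed 1Gbps, that every stream is pinned whole to one path (no fragmenting), and that streams are placed greedily on the least loaded path. The function name and the sizes are hypothetical and only serve the illustration; this is not how any particular DSM schedules I/O.

```python
LINK_GBPS = 1.0  # assumed line rate of every path (hypothetical figure)


def completion_time(stream_sizes_gbit, paths):
    """Seconds until all streams have finished.

    Each stream is pinned to exactly one path, so a single stream can never
    go faster than one link. Streams are placed, largest first, on whichever
    path currently has the least data queued.
    """
    queued = [0.0] * paths
    for size in sorted(stream_sizes_gbit, reverse=True):
        idx = queued.index(min(queued))  # least loaded path so far
        queued[idx] += size
    return max(queued) / LINK_GBPS


# One 8 Gbit stream: the second lane does not make the car any faster.
print(completion_time([8.0], paths=1))       # 8.0
print(completion_time([8.0], paths=2))       # 8.0

# Two 8 Gbit streams: the second lane lets both arrive at the same time.
print(completion_time([8.0, 8.0], paths=1))  # 16.0
print(completion_time([8.0, 8.0], paths=2))  # 8.0
```

The single stream takes the same time with one path or two, while the pair of streams finishes in half the time once the second path exists: more bandwidth, not more speed.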
So how does MPIO carve up the traffic?
There are broadly speaking 4 different paradigms for carving up MPIO traffic
Failover
In this mode, one link is active, while the other is passive, i.e. up, but not doing anything. If the first link fails, the second path takes over and all existing traffic streams continue to receive the same bandwidth (% of the available pie) on the same terms as before. This would give us a completely separate road that can only be used in emergencies. It may not be as fast or robust, or it may be identically spec’d and just as capable. It may or may not return traffic to the first channel once that channel becomes available again.
Round Robin
This mode alternates traffic between channel 1 and channel 2, then goes back to channel 1, channel 2 and so on. Both links are active and both receive traffic, slightly skewed as the data is de-queued at the sender. This offers the 2 lane analogy used above, with each 70mph car getting to the end at roughly the same time.
Least Queue Depth
This puts the traffic into the channel that has the least amount on it (or, to be more accurate, about to go onto it). If one channel is busier than the other (e.g. with the large video file), then it will put other traffic down the second channel, allowing the video to transfer without needing to slow down to allow new traffic to join, which would delay its delivery.
Weighted Paths and Least Blocks
Weighted paths and least blocking methods assess the state/capabilities of the channels. This is more useful if there are lots of hops between source and destination, multiple routes to a destination, or different channels with different capabilities. For example, if you have iSCSI running through a routed network, then there could be multiple ways for it to get there. One route may go through 5 routers and another through 18 routers. Generally, the 5 router path might be preferable, provided the lower hop count genuinely gets the data there faster. Equally, the weighting could be based upon the speed of the path through to the recipient. Finally, if channel 1 is 10Gbps and channel 2 is only 1Gbps, then you might prefer the 10Gbps path to be used with a higher preference. Usually, a lower weighting number means a higher preference. This would be the equivalent of a 70mph road with a backup road with a max speed of 50mph. You know that it will get you to the destination, but you can guarantee that if you have to use it, it will take longer.
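The four paradigms above can be compared side by side in a short sketch. It is a toy model, not the logic of the Microsoft DSM or any vendor DSM: the `Path` class, its fields and the function names are all hypothetical, and "queued" simply stands in for the number of outstanding I/Os on a path.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    up: bool = True   # is the path healthy?
    queued: int = 0   # outstanding I/Os (hypothetical counter)
    weight: int = 0   # lower number = higher preference


def _live(paths):
    """Return the healthy paths, or fail if none are left."""
    live = [p for p in paths if p.up]
    if not live:
        raise RuntimeError("no healthy path available")
    return live


def failover(paths):
    """Always use the first healthy path in configured order."""
    return _live(paths)[0]


class RoundRobin:
    """Alternate between the healthy paths, one I/O at a time."""

    def __init__(self):
        self._next = 0

    def pick(self, paths):
        live = _live(paths)
        chosen = live[self._next % len(live)]
        self._next += 1
        return chosen


def least_queue_depth(paths):
    """Use the healthy path with the fewest outstanding I/Os."""
    return min(_live(paths), key=lambda p: p.queued)


def weighted(paths):
    """Use the healthy path with the lowest weight."""
    return min(_live(paths), key=lambda p: p.weight)


paths = [Path("A", queued=5, weight=1), Path("B", queued=1, weight=10)]
rr = RoundRobin()
print(failover(paths).name)                       # A
print([rr.pick(paths).name for _ in range(3)])    # ['A', 'B', 'A']
print(least_queue_depth(paths).name)              # B
print(weighted(paths).name)                       # A
```

Marking path A as down (`paths[0].up = False`) makes both `failover` and `weighted` fall back to path B, which is the backup-road behaviour described above.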
So, more lanes equal more stuff then?
Sounds simple doesn’t it? Just keep throwing lanes into the road and then everyone gets to travel smoothly at 70mph.
In principle, it is a nice idea, but in practice it doesn’t actually work.
For starters, server grade network cards (which you should be using for MPIO, not client adapters) are expensive and server backplanes can only accept a finite number of them. Server NICs also consume power and power consumes money! Keep that in mind.
The reality is that if you have an MPIO solution that will allow you to experiment with more than 2 NIC adapters in a MPIO group, you will likely see the performance gain rapidly tail off and then actually wind up presenting you with steadily worsening performance, not any increase.
Attempting to MPIO iSCSI traffic across 4x 1Gbps NIC’s actually offers worse read and write speeds for a virtual machine than 2x 1Gbps under a Hyper-V environment. The system starts to waste so much time trying to break apart and put back together each lane’s worth of traffic that it just doesn’t help the hypervisor.
Where a 4 NIC configuration is beneficial is actually in providing you with a “RAID 6” MPIO solution. Here you can have 2 active and 2 passive adapters. Remember that in an idealised scenario they could be 2x 10GbE and 2x 1GbE, with a hard-coded preference for the 10GbE and a method of failing traffic back to the 10GbE. Just be aware that you can only use the 10GbE set OR the 1GbE set at any one time, not one port from each.
Some DSMs (Device Specific Modules, effectively an OEM-specific MPIO driver under Windows) logically limit a MPIO group to two active NICs. Dell EqualLogic Host Integration Tools (the EqualLogic DSM) will grab the first two paths it finds and shut down any others into a passive state, no matter how hard you try to start them up.
What should a MPIO network “look” like?
Ultimately this is down to what you want to get out of the MPIO solution and within the bounds of what your hardware vendor will support.
There are effectively two schools of thought here (I won't comment on which is right because as you'll see, it isn't that simple)
MPIO is about meshing
If you see MPIO as a mesh, then 2 NICs in a server connecting to 2 NICs in a storage appliance equals a mesh where each NIC has a path to each NIC on the other side.
MPIO is about pathing
If you see MPIO in this model, it is simply about more than one line being drawn between two different end points, with no lines crossing and no added complexity, complication or confusion.
Is there an advantage to either?
In a 4x 1GbE NIC design where there are 2 NICs in a server and 2 NICs in a storage device, if one of the NICs (let’s say on the storage side) fails, both server NICs will still have access to the storage array through the mesh. In this scenario however, each server NIC will no longer be able to achieve 1Gbps. With an even split, each NIC will only send to the storage at 0.5Gbps. Alternatively, one could get 0.8Gbps while the other gets 0.2Gbps, but the total can never go over 1Gbps in this scenario.
If there was no meshing and we viewed the topology as simple paths, then if one of the NICs failed, it is true to say that the second NIC in the server is now effectively wasted; however, the traffic on the first NIC will remain at 1Gbps. There is no “benefit” from a bandwidth point of view. With a mesh design, however, if one of the NICs in the server now failed as well, we would still be able to use the solution at 1Gbps. If the active server NIC failed in the simple path view, we would no longer have any working paths whatsoever and so our solution would have just failed.
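The difference in failure behaviour between the two views can be counted directly. The sketch below is only an illustration: the NIC names are hypothetical, a "path" is modelled as a (server NIC, storage NIC) pair, and a path survives only if both of its ends are still working.

```python
from itertools import product


def mesh_paths(server_nics, storage_nics):
    """Mesh view: every server NIC has a path to every storage NIC."""
    return set(product(server_nics, storage_nics))


def simple_paths(server_nics, storage_nics):
    """Path view: server NIC n is wired only to storage NIC n."""
    return set(zip(server_nics, storage_nics))


def surviving(paths, failed):
    """Paths whose server end and storage end are both still working."""
    return {(s, t) for s, t in paths if s not in failed and t not in failed}


server = ["srv-nic1", "srv-nic2"]    # hypothetical names
storage = ["sto-nic1", "sto-nic2"]

mesh = mesh_paths(server, storage)      # 4 paths
simple = simple_paths(server, storage)  # 2 paths

# One storage NIC fails: the mesh keeps both server NICs in play
# (sharing one storage port), the simple view keeps one path.
print(len(surviving(mesh, {"sto-nic2"})))    # 2
print(len(surviving(simple, {"sto-nic2"})))  # 1

# The remaining active server NIC then fails as well.
print(len(surviving(mesh, {"sto-nic2", "srv-nic1"})))    # 1
print(len(surviving(simple, {"sto-nic2", "srv-nic1"})))  # 0
```

The mesh still has one working path after the double failure, whereas the simple path design has none, which is exactly the trade-off weighed above.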
Why is thinking about MPIO as a mesh discouraged?
The Internet is full of people stating that this is a bad idea, although they do not use this language. To entertain this discussion elsewhere, people usually talk in terms of IP subnets (especially in relation to iSCSI).
The reason why most of the discourse suggests that meshing is a bad idea is that a meshed MPIO solution by definition uses either a single subnet or it uses routers. Neither of which is “generally” considered conducive towards good iSCSI design.
Understanding why routing is considered bad is fairly easy: iSCSI isn’t designed to be a distributed, latency tolerant solution. It is a hard drive on a Cat6/fibre cable encapsulated in TCP/IP/Ethernet. Usually you would use an intermediate staging network to move data between iSCSI networks; you would not transfer between iSCSI networks directly. You can route it of course, but the limitations start to increase, the risks start to increase and so too do the cost and complexity of the solution when you look at it on a geo-diverse scale. A single hop within a routed SAN won’t hurt, however most designers would generally try to avoid it for small to medium size environments.
Single Subnets on the other hand are frowned upon for a different reason – loss of control of the environment. The issue stems from the multi-homing problem on IP networks.
If a computer is connected to the same network using more than one adapter, which adapter should it send from/to?
IP routing at Layer 3 will make a decision based upon a routing metric (weighting). If the routing metric for all candidates is the same it looks at the interface metrics, if the interface metrics are uneven, it will use the LOWEST value adapter, forcing all traffic out of a single adapter at the expense of the bandwidth available on the other. If the metrics are equal, the system will usually round-robin the traffic through each adapter in turn; but not always. It very much depends upon the solution being used.
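The decision order described above can be written down as a small sketch. It follows the description in this article only (route metric first, then interface metric, lowest value wins) and is a simplification, not the actual Windows routing implementation; the function name, the tuple layout and the metric values are hypothetical.

```python
def pick_adapters(candidates):
    """Return the adapter names that tie for the best route.

    candidates: list of (adapter_name, route_metric, interface_metric).
    The lowest route metric wins; ties are broken by the lowest interface
    metric. If more than one name is returned, the system has no basis to
    prefer one adapter and may spread traffic across them, or may not.
    """
    best_route = min(route for _, route, _ in candidates)
    tied = [c for c in candidates if c[1] == best_route]
    best_iface = min(iface for _, _, iface in tied)
    return [name for name, _, iface in tied if iface == best_iface]


# Uneven interface metrics: all traffic is forced out of one adapter.
print(pick_adapters([("NIC1", 1, 10), ("NIC2", 1, 20)]))  # ['NIC1']

# Equal metrics: both adapters are candidates and the outcome depends
# on the solution being used.
print(pick_adapters([("NIC1", 1, 10), ("NIC2", 1, 10)]))  # ['NIC1', 'NIC2']
```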
If you have a scenario with 2 server NICs and 2 storage NICs, how can you guarantee that Server NIC 1 is sending to Storage NIC 1 while also ensuring that, if there is a connectivity problem, it can still use the other path if it needs to? In turn, the same set of logic requirements also needs to be accommodated for server NIC 2 and storage NIC 2. Quite simply, you can’t really achieve this; not without a middleware solution.
The solution in this situation is to force traffic from Server NIC 1 to only send to the IP of Storage NIC 1 and then Server NIC 2 to send to Storage NIC 2. At which point you have added more middleware complexity/configuration steps and lost the one benefit of having the mesh in the first place.
The next logical step is to hard wire a path between each NIC on the server and each NIC on the storage device so that it fails down to the second route, but now you are back to the original problem – how can you ensure that the server (where the MPIO logic lives) is actually using the least used NIC on the storage device and isn’t accidentally forcing most of its traffic to a single, over-loaded storage NIC. Also, how do you define the preferred route?
The answer again is: it’s very difficult to, and with few exceptions, most do not even try.
The final issue here stems from understanding that most MPIO is one sided. The MPIO occurs at the server and not on the storage system. The storage system has no idea that it’s participating in a MPIO group; it’s just being asked for data X on port Y. If your storage is using Layer 3 to return traffic to the server, it literally will be unable to ascertain any difference between sending the data back out of NIC 1 or NIC 2, because both NICs will have to send the data up to a router and both of those have the same routing weight (1 hop). In essence, there will be no predictable discrimination in how the storage device chooses a NIC when replying, and it could easily block up a single NIC. If the solution is responding over Layer 2 (i.e. MAC address ARP lookup without any routing involved), there is some chance of more predictability here, up until both storage NICs have both server NIC MAC addresses in the ARP cache, and once again we are back to the operating system deciding what to do and how to do it.
General Subnet Recommendations
Subnet recommendations are thus generally made by the storage vendor. I have encapsulated the general recommendations/requirements of a number of providers in the table below. The subnet count column is in essence a statement that for each NIC on the storage device, there should be a dedicated subnet (and ideally broadcast domain/VLAN).
? = I couldn’t find any guidance from an official source. There is community evidence of both being used by end-users.
As you can see, with the exception of Dell EqualLogic which provides a middleware solution known as the Host Integration Tools (HIT), most vendors are quite specific on the use of a “single path” logical topology for server/storage connectivity aka one subnet per storage appliance NIC.
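A "one subnet per storage NIC" plan is easy to sanity check before any cabling is done. The sketch below uses Python's standard `ipaddress` module; the function names and every address in it are hypothetical examples, and it assumes each interface is written in CIDR form.

```python
import ipaddress


def subnets_are_dedicated(storage_ifaces):
    """True when no two storage NICs share or overlap a subnet."""
    nets = [ipaddress.ip_interface(addr).network for addr in storage_ifaces]
    return all(
        not a.overlaps(b)
        for i, a in enumerate(nets)
        for b in nets[i + 1:]
    )


def pair_initiators(server_ifaces, storage_ifaces):
    """Map each server NIC address to the storage NIC on the same subnet."""
    pairs = {}
    for srv in map(ipaddress.ip_interface, server_ifaces):
        for sto in map(ipaddress.ip_interface, storage_ifaces):
            if srv.network == sto.network:
                pairs[str(srv.ip)] = str(sto.ip)
    return pairs


storage = ["10.0.1.10/24", "10.0.2.10/24"]   # hypothetical storage NICs
server = ["10.0.1.20/24", "10.0.2.20/24"]    # hypothetical server NICs

print(subnets_are_dedicated(storage))
# True
print(pair_initiators(server, storage))
# {'10.0.1.20': '10.0.1.10', '10.0.2.20': '10.0.2.10'}

# Two storage NICs on the same subnet fail the check.
print(subnets_are_dedicated(["10.0.1.10/24", "10.0.1.11/24"]))
# False
```

With dedicated subnets, each server NIC has exactly one storage NIC it can reach without routing, which is the "single path" logical topology most vendors ask for.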
I will end this piece with some general advice and tips for working with MPIO. The list isn't exhaustive, but these are some quick observations from experience of using the technology for many years. Some of them are obvious, some of them might help you avoid a head scratcher.
- If you are using an enterprise iSCSI solution, follow the vendor’s advice, forget anything you read on the Internet. Everyone is a know-it-all on the internet and there are plenty of “I’m a Linux user so I know best” screaming matches about how EqualLogic are wrong about the recommendations for EqualLogic’s own hardware. I’m pretty sure that EqualLogic… uh, tested their stuff before writing their user manual.
- If you are using an enterprise solution and the vendor offers a DSM (MPIO driver), use it. Dell HIT is noticeably faster than the generic Microsoft DSM for Windows Server, but only works with Dell SAN hardware (naturally).
- Follow your vendor’s guidelines with respect to subnets. If in doubt, drop them an email. You'll usually find them quite accommodating.
- Unless your vendor has expressly told you to, you do not MPIO back from the storage system - i.e. don't team, MPIO, load balance etc on the storage side. Do it all on the server initiating the request.
- Stick to two port MPIO designs. If you need more, create multiple pairs and have each on different networks going to different storage systems, so that the driver knows where to send traffic explicitly while maintaining isolation.
- With iSCSI and SAN MPIO, try and avoid network hops (routers).
- All ports in a group must be the same type and speed.
- Disable port negotiation and manually set the speed on the client and switch. This will make failover/restore processes faster.
- Use VLANs as much as possible (try to avoid overlaying broadcast domains across a shared Layer 2 topology).
- Use Jumbo Frames as much as possible unless the iSCSI subnet involves client traffic.
- Disable all Windows NIC service bindings apart from vanilla IPv4 on your iSCSI networks (Client for Microsoft Networks, QoS Packet Scheduler, File and Printer Sharing for Microsoft Networks etc.). If you aren't using it, disable IPv6 too to prevent IPv6 node-chatter.
- In the driver config for your server grade NIC (because you are using server grade NIC's, right?) max out the send and receive buffer sizes on the iSCSI port. If the server NIC has iSCSI features that are relevant, enable them.
- When you are building a Windows Server, make installing MPIO, running the enable MPIO script and setting the default policy part of the build process, then REBOOT the system before you even start configuration. If I had a £1 for every time I’d had to rescue someone from not doing that and then not REBOOTING…
- If you are using a SOHO/SME general purpose commodity NAS, if (and only if) you have a UPS, disable Journaling and/or Sync Writes on your iSCSI partitions/devices. There is a benefit, but remember if you are hosting SMB shares on a commodity appliance you actually do want Journaling running on those volumes.
- Keep your NAS/SAN firmware up to date.
- Keep your storage system and iSCSI block sizes, cluster and sector sizes optimised for the workload. Generally this means bigger is better for virtualisation storage and video. 256/64K, 128/64K or 64/64K depending on what your solution can offer.
- Keep volumes under 80% of capacity as much as possible.
- Use UPS's: Remember, iSCSI and SAS are hard drive/storage protocols. They are designed to get data onto permanent storage medium just like RAID controllers. RAID controllers have backup batteries because you do not want to lose what is in process in the RAID controller cache when the power goes out. Similarly, you need to think of your iSCSI and External SAS sub-systems much the same as you would a RAID sub-system.
- If you have a robust UPS solution, enable write caching and write behind/write back cache features on your storage systems and iSCSI mounted services to gain extra performance benefits. Be mindful that there is risk in this if your power and shutdown solution isn't bullet proof.
- Test it! Build a test VM and yank a cable out a few times. You'll be glad you sacrificed a Windows install or two to ensure it is right when you actually pull an iSCSI cable out of a running server... Believe me I know what a relief that is.
Article Published: Tuesday, 11 April 2017
Article Revision Date: Tuesday, 11 April, 2017