OK, so I've been doing the iSCSI thing for a long time, but recently we needed to build a RHEL5 VM with access to a legacy Fiber Channel Apple Xserve RAID. Fortunately, our trusty Cisco MDS 9216 w/IP Services Blade can act as an iSCSI-to-Fiber Channel gateway; however, it does have limits. It only supports 3 iSCSI sessions per port, so with our 4-port blade we have only 12 iSCSI-to-FC sessions in total. We already use some sessions for iSCSI access to legacy FC tape libraries, so we didn't have enough left to connect all of our VMware ESX hosts to both halves of the array (the Xserve RAID is effectively two dumb FC-to-RAID controllers stuffed in a single chassis), and really, only this one host needed access. Oh yeah, we wanted good performance as well. Read on for how we did it.
So our first idea was simply to add two additional virtual NICs to the VM, give them both IP addresses, and configure each one to connect to a different iSCSI port on the MDS by specifying the interfaces and portals with iscsiadm. Basically the idea was this:
Linux Interface     iSCSI Interface   MDS iSCSI Port
---------------     ---------------   -------------------
eth1 10.10.1.10     iface0            iSCSI2/1 10.10.1.20
eth2 10.10.1.11     iface1            iSCSI2/2 10.10.1.21
Each of the iSCSI ports on the MDS would be zoned so that each iSCSI session could see both the upper and lower controllers in the Apple Xserve RAID. Combined with dm-multipath, that would give us two connections to each half of the Apple Xserve RAID, which should be load balanced.
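For reference, here is a rough sketch of what binding one iSCSI interface record to each NIC looks like with open-iscsi. It only builds the command lines and prints them; it does not run anything. The iface names, NIC names and portal IPs are the ones from the table above, while the port number 3260 (the standard iSCSI port) is an assumption about how the MDS is configured, and the exact commands we ran at the time may have differed.

```python
# Sketch: build (but do not execute) open-iscsi commands that bind one
# iSCSI iface record to each NIC and log in to one MDS portal per NIC.
# Port 3260 is assumed; names and IPs are taken from the table above.

BINDINGS = [
    # (iface record, Linux NIC, MDS portal IP)
    ("iface0", "eth1", "10.10.1.20"),
    ("iface1", "eth2", "10.10.1.21"),
]


def iscsiadm_commands(bindings, port=3260):
    """Return a list of argv lists, four commands per binding."""
    cmds = []
    for iface, nic, portal_ip in bindings:
        portal = f"{portal_ip}:{port}"
        # Create the iface record.
        cmds.append(["iscsiadm", "-m", "iface", "-I", iface, "--op=new"])
        # Tie the iface record to a specific network device.
        cmds.append(["iscsiadm", "-m", "iface", "-I", iface, "--op=update",
                     "-n", "iface.net_ifacename", "-v", nic])
        # Discover targets on this portal through that iface only.
        cmds.append(["iscsiadm", "-m", "discovery", "-t", "sendtargets",
                     "-p", portal, "-I", iface])
        # Log in to the discovered targets on this portal via that iface.
        cmds.append(["iscsiadm", "-m", "node", "-p", portal,
                     "-I", iface, "--login"])
    return cmds


if __name__ == "__main__":
    for cmd in iscsiadm_commands(BINDINGS):
        print(" ".join(cmd))
```

Since each session sees both controllers, two sessions should show up as four paths for dm-multipath to group.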
Well, that was the plan, but for some reason, this just didn't work. We actually managed to get it all set up, and it seemed OK: we could create a striped LVM volume (across both halves of the Xserve RAID), format it (ext4 w/journal) and run IOzone, but we kept getting strange iSCSI noop timeouts that were triggering path failures and causing poor performance. We were seeing about 130MB/sec writes and 60-70MB/sec reads. That's not too bad for writes on this older array, but not even close to the read performance we know we can get. We couldn't really figure out what was going on until we ran a packet trace and realized that Linux was sometimes sending packets with eth2's source IP address via eth1 and vice versa. I don't think there's anything technically wrong with this, but it seemed to be confusing the Cisco MDS. When the Cisco saw a packet from the "wrong" NIC, it appeared to try to send its response to that NIC rather than the actual NIC, and this caused all sorts of problems.
We tested with only a single NIC and it worked perfectly, but we weren't sure how to proceed from there. We thought we could probably do some magic with policy routing, but we decided the next best setup would be to use Linux bonding to bond the two virtual NICs to a single IP address. The Linux bonding feature is quite mature and flexible, so the thought was to bond the two NICs and make four connections, two to each MDS port, from the same IP and MAC address.
Linux Interface      iSCSI Interface   MDS iSCSI Port
------------------   ---------------   -------------------
bond0 (eth1, eth2)   default           iSCSI2/1 10.10.1.20
10.10.1.10           default           iSCSI2/2 10.10.1.21
We decided to use Linux bonding mode 6, balance-alb (adaptive load balancing), since it balances both transmit and receive traffic across the two NICs. To make this work reliably we had to allow "Promiscuous" mode on the virtual switch in VMware.
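As a rough idea of what the bonding configuration looks like on a RHEL 5-style system, the sketch below renders the three ifcfg files as text (nothing is written to disk). The /24 netmask and the miimon=100 link-monitoring interval are assumptions, not values from our setup. BONDING_OPTS in ifcfg-bond0 is honored by newer initscripts; on older RHEL 5 releases the bonding options go in /etc/modprobe.conf instead, alongside an "alias bond0 bonding" line.

```python
# Sketch: render RHEL 5-style ifcfg files for a balance-alb bond as strings.
# Assumptions (not from the post): 255.255.255.0 netmask, miimon=100.


def bond_configs(bond="bond0", slaves=("eth1", "eth2"),
                 ip="10.10.1.10", netmask="255.255.255.0",
                 opts="mode=balance-alb miimon=100"):
    """Return {filename: file contents} for the bond and its slave NICs."""
    files = {}
    # The bond master carries the single IP address.
    files[f"ifcfg-{bond}"] = "\n".join([
        f"DEVICE={bond}",
        f"IPADDR={ip}",
        f"NETMASK={netmask}",
        "BOOTPROTO=none",
        "ONBOOT=yes",
        f'BONDING_OPTS="{opts}"',
    ]) + "\n"
    # Each slave NIC has no IP of its own and points at the master.
    for nic in slaves:
        files[f"ifcfg-{nic}"] = "\n".join([
            f"DEVICE={nic}",
            f"MASTER={bond}",
            "SLAVE=yes",
            "BOOTPROTO=none",
            "ONBOOT=yes",
        ]) + "\n"
    return files


if __name__ == "__main__":
    for name, body in bond_configs().items():
        print(f"# /etc/sysconfig/network-scripts/{name}")
        print(body)
```

Once the bond is up, /proc/net/bonding/bond0 reports the active mode and the state of each slave.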
With the bond0 interface configured, we then used iscsiadm to create connections to both ports of our MDS, once again giving us a total of four connections to the Xserve RAID, one to the upper and one to the lower controller via each port, but now the MDS saw all of these connections coming from the same IP address. We kicked off our IOzone benchmarks and not only were our timeouts gone, but write performance jumped to 150-160MB/sec and read performance topped out around 210MB/sec. That's pretty close to maxing out dual gigabit links, and reasonably impressive for a virtual machine using software iSCSI to talk via an iSCSI-to-FC gateway to a 5+ year old array that still uses PATA drives. I was pretty happy with the result.
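To put "pretty close to maxing out dual gigabit links" in numbers, here is a back-of-the-envelope check. It assumes 1 Gb/s means 1000 Mb/s and ignores Ethernet, TCP and iSCSI protocol overhead, so the real ceiling is somewhat lower than the raw figure it computes.

```python
# Sketch: compare measured throughput with the raw capacity of two
# gigabit links. Protocol overhead is deliberately ignored (assumption).

LINKS = 2
LINK_GBPS = 1.0


def raw_capacity_mb_per_s(links=LINKS, gbps=LINK_GBPS):
    """Raw link capacity in MB/s: gigabits -> megabits -> megabytes."""
    return links * gbps * 1000 / 8


def utilization(measured_mb_per_s, capacity_mb_per_s):
    """Fraction of the raw capacity that the measured rate represents."""
    return measured_mb_per_s / capacity_mb_per_s


if __name__ == "__main__":
    cap = raw_capacity_mb_per_s()
    samples = [
        ("bonded reads", 210),
        ("bonded writes (low end)", 150),
        ("first attempt reads (high end)", 70),
    ]
    for label, rate in samples:
        print(f"{label}: {rate} MB/s is {utilization(rate, cap):.0%} "
              f"of {cap:.0f} MB/s raw")
```

Two gigabit links come to 250MB/sec raw, so the 210MB/sec reads are about 84% of that before any protocol overhead is counted.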
Thursday, January 7. 2010 at 16:21
Interesting. If your Apple controllers and eth{1,2} were on separate subnets, would that have alleviated the 'response from wrong NIC' anomaly? Also, I assume the bonded solution does not employ dm-multipath. If either one of the NICs or one of the controllers failed, what would be the failover result?
Wednesday, January 13. 2010 at 10:06
The bonding solution still uses dm-multipath. Basically I connected each fiber channel port on the Xserve RAID (it has two) to two of the iSCSI ports on the MDS. I then configured Open-iSCSI to initiate a connection to all four of these iSCSI ports, two 1Gb iSCSI connections for each 2Gb fiber channel port. Since the bonded ethernet link is connected to a vSwitch that is fully redundantly connected to the network, I have multiple paths to the storage array. No single switch or port failure can take down connectivity.
Of course, the Xserve RAID is not a fully redundant device, unless you use software RAID on the host to create mirrored volumes between both controllers. It's mainly for disk backups and other bulk storage that doesn't have 100% uptime requirements.