This post might be extremely long. I like to document things just in case I can spare other people the pain that I went through.
OK, so you just bought an iSCSI SAN, right? Then you should learn a bunch about iSCSI, I would think. If you are using iSCSI with VMware you NEED to read this.
If you are running 3.5, read this one: http://www.vmware.com/pdf/vi3_35/esx_3/r35/vi3_35_25_iscsi_san_cfg.pdf
So, did you read it? I mean go READ IT. Trust me! What I gleaned from those guides is this (copied from http://www.yellow-bricks.com/2008/07/21/queuedepth-and-whats-next/):
…. often overlooked is Disk.UseLunReset and/or Disk.UseDeviceReset. ESX defaults to Disk.UseLunReset=1 and Disk.UseDeviceReset=1. This means that when a SCSI bus is reset all SCSI reservations are cleared, not for a specific LUN but for the complete device. This is useful when one uses local storage, but within a VMware environment most companies utilize a SAN and you don’t want to disrupt the entire SAN when it’s not necessary. You can set this via the commandline, powershell and via VirtualCenter:
- VirtualCenter -> Configuration Tab -> Advanced Settings -> Disk -> Disk.UseLunReset=1 , Disk.UseDeviceReset=0
- PowerShell -> Get-VMHost | Set-VMHostAdvancedConfiguration -Name Disk.UseDeviceReset -Value 0
- Commandline -> esxcfg-advcfg -s 1 /Disk/UseLunReset
- Commandline -> esxcfg-advcfg -s 0 /Disk/UseDeviceReset
The next thing I learned from the guide is that you probably want to change the Disk.MaxLUN parameter. By default ESX scans LUN IDs 0 through 255 on every target, but unless you are planning on having that many LUNs, setting it to, say, 50 is more reasonable. It will make your ESX host boot quicker and rescan LUNs quicker.
In the 3.5 guide it mentions removing the vmfs-2 module (not so in the 4 manual) but it is easy to do as laid out in this post. http://www.yellow-bricks.com/2009/03/13/disabling-the-vmfs-2-module-exploring-the-next-generation-of-esx/
Basically, run this command: esxcfg-module -d vmfs2
It also mentions somewhere (I couldn’t find it again) that if you only made changes to an iSCSI LUN then you only need to rescan the iSCSI adapter, not everything (just right-click vmhba33 and rescan).
OK, on with the show
Get access to the Dell/EqualLogic support site. You will need it for firmware and software.
- Create an EqualLogic Support Account
- Click on “Login” (upper right hand side of page)
- Click “Request Account”
Our setup: three IBM 3850 M2 servers, each with 2 processors and 64 GB of RAM. Two dual-port NICs (four ports), EtherChannel bonded on the incoming Cisco like this. Two more dual-port NICs (four ports) for iSCSI traffic. One NIC for the service console and another gigabit NIC for VMotion. The NICs are Intel PRO/1000 dual-port controllers.
What we bought: two PS6000XV arrays and two Dell PowerConnect 6224 Ethernet switches (we probably should have gone with the 6248 for more ports, but this is sufficient).
Two Stacking Modules for our Ethernet Switches. Everywhere I read recommends stacking the switches as the way to go.
A lot of cat 6 network cable of two colors. We refer to one switch as our Red switch and another as our Orange switch. This way we have a visual check of whether every system has redundant paths.
Ok, so the hardware arrived. Rack and stack the arrays along with the switches. Connect the stacking cables between the switches. How should you do that? Well, look here on page 32:
Manual for Dell PowerConnect 6224
I don’t know who names things over at Dell, but they seriously need to revisit their taxonomy.
Ok, so now everything should be cabled up. You want redundant paths from end to end.
Ok, configure the switch. But how should we configure it? Yet again, how Dell names things leaves me less than amused. Of course the guide that you need to configure your Dell PowerConnect 6224 for use with EqualLogic is named none other than…. wait for it…. Dell EqualLogic Configuration Guide…. yup, that’s it.
Ok where is it? You can find it here http://sites.google.com/site/mellerbeck/Home/Dell_EqualLogic_Configuration_Guide.pdf?attredirects=0&d=1
So on page 22 (26) you can get the step-by-step commands to configure that switch. Can Dell make it any harder to find this stuff?
I got a hold of a couple of other manuals for this switch that are pretty useful.
I did everything but enable the Cut-Through Switch Forwarding.
And here is the manual if you need to configure LAGs: http://sites.google.com/site/mellerbeck/Home/Dell-PowerConnect-How_to_configure_LAG_LACP-1.0.pdf?attredirects=0&d=1
Here is a list of some manuals as well http://support.dell.com/support/edocs/network/pc62xx/en/index.htm
Double-check the firmware on your switch; 220.127.116.11 seems to work. PowerConnect 6224 firmware is hiding here http://ftp.dell.com/firmware/
So from the Dell EqualLogic Configuration Guide I gleaned that stacking is strongly recommended. One of the main requirements is enabling Flow Control. On the 6224, Flow Control is a global setting. I set it and did not see it go out to all ports, so I disabled and enabled it again. That time it went out to most ports (most would read active, a few inactive). After disabling and enabling it once more it seems to have finally stuck. The next important point is no STP functionality on switch ports that connect end nodes (end nodes are your ESX boxes or the SAN), which in my case is pretty much all of them! So I disabled STP on everything. (Why disable STP? It can cause delays that can cause your failovers to not work!) If you must use STP they recommend Rapid STP; how to turn that on is in the config guide. Then finally, enable Jumbo Frames.
This is what my switch looks like. If you know it should be different, please let me know! It’s hard to get straight answers with these switches for some reason.
Ok, so your switches are configured.
Next, turn on one of your SANs and run the setup wizard. (If you happen to have two, power them on one at a time. That way you can name each one as it comes up; otherwise you have to locate the serial number to tell them apart, which is impossible.) RAID 50 seems to be the de facto choice for most people. Add it to the group. Upgrade the firmware on your SAN (you can download it from the EqualLogic support site). Do it at the beginning and get it out of the way now.
Read the release notes for your firmware, here it is for the 4.2.1 http://sites.google.com/site/mellerbeck/Home/110-6024-EN-R1_RNotes_V4.pdf?attredirects=0&d=1
The supported configuration limits are something to pay attention to! See page 4 (8).
What I gleaned from the Release notes are a few Reg edits that you want to make to anything attaching to the SAN.
Increase the value of the TimeOutValue parameter (HKEY_LOCAL_MACHINE/SYSTEM/CurrentControlSet/Services/Disk/TimeOutValue) to at least 60 seconds (the default is 10 seconds). Make sure to set the type to DWORD and enter the value (60) in decimal. This will increase the timeout period for all disk I/Os for the disk class driver (in contrast to driver-specific Registry parameters that affect only your iSCSI initiator). You must reboot the server for the change to take effect.
If you install the VMware tools it will automatically make this change (I believe, it doesn’t hurt to check)
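If you would rather push the TimeOutValue change out as a .reg file than click through regedit, here is a minimal sketch in Python that just builds the text of such a file. The function names are mine, not from the release notes, and I am assuming the standard .reg layout where a DWORD is written as eight hex digits (so decimal 60 becomes 0000003c). Check the output before importing it anywhere.

```python
def reg_dword(value: int) -> str:
    """Format an integer the way a .reg file writes a REG_DWORD (8 hex digits)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a REG_DWORD must fit in 32 bits")
    return f"dword:{value:08x}"


def disk_timeout_reg(seconds: int = 60) -> str:
    """Build the text of a .reg file that sets the disk class TimeOutValue.

    The key path is the one quoted from the release notes above; the helper
    itself is a hypothetical convenience, not a Dell or Microsoft tool.
    """
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        r"[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk]",
        f'"TimeOutValue"={reg_dword(seconds)}',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(disk_timeout_reg(60))
```

Remember the reboot is still required after the value is in place.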
For any environment (clustered or not) using multi-path I/O, make the following changes:
– Add or modify the UseCustomPathRecoveryInterval key, and set its value to 1.
– Add or modify the PDORemovePeriod key, and set its value to 120.
– Add or modify the PathRecoveryInterval key to 60, or half the value of the PDORemovePeriod key, if it exists.
These also look like they are added by a VMware Tools install.
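The "half the value of PDORemovePeriod" rule in the multi-path settings is easy to get wrong if you ever change the remove period, so here is a tiny sketch of how I read it. The function name is made up for illustration, and I am assuming the rule means PathRecoveryInterval should simply track half of whatever PDORemovePeriod is set to.

```python
def mpio_settings(pdo_remove_period: int = 120) -> dict:
    """Return the three multi-path values from the release notes.

    Assumption: PathRecoveryInterval is kept at half of PDORemovePeriod,
    which gives the recommended 60 when the remove period is 120.
    """
    return {
        "UseCustomPathRecoveryInterval": 1,
        "PDORemovePeriod": pdo_remove_period,
        "PathRecoveryInterval": pdo_remove_period // 2,
    }


if __name__ == "__main__":
    for name, value in mpio_settings().items():
        print(f"{name} = {value}")
```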
This is also an interesting tip for Vista or Server 2008.
Accessing PS Series Groups Using iSCSI Initiators on Microsoft Vista or Windows Server 2008. To support group access from initiators running on these operating systems, you must enable ICMP echo requests for ICMPv4, or for ICMPv6, if using an IPv6 initiator.
Turn on the other SAN. Name it. Add it to the group. RAID 50 again.
Alrighty, now I’m really gonna make your eyes bleed. Configuring VMware vSphere Software iSCSI with Dell EqualLogic PS Series Storage.pdf
***** UPDATE: They released a newer version of this guide (I have to wonder if I influenced it with this blog post). It lays out more clearly a one-to-one setup of virtual NICs to physical NICs for multipathing. You can find it here
(You could also google that up.) So, following that guide, we created a virtual switch with jumbo frames, created 8 vnics (the maximum), assigned network adapters, associated the VMkernel ports with the physical adapters, enabled iSCSI, and finally bound the VMkernel ports. At the end of the PDF there is a scripting section that will help automate this.
OK, on second thought, don’t create 8 virtual NICs. The EqualLogic has a limit of 512 iSCSI connections, and every time I created a volume I was eating up 24 of them. So now I am reducing this to 4 virtual NICs to match the four physical ports, which probably makes more sense anyway. I did learn that each pool can have 512 connections, so maybe what I should have done was create separate pools. But once we had added all of our space into the default pool, this wasn’t an option any more. To learn more about storage pools you could read this: http://sites.google.com/site/mellerbeck/Home/DeployingPoolsandTieredStorageinaPSSeriesSAN..pdf?attredirects=0&d=1
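To put numbers on the connection budget, here is a rough back-of-the-envelope sketch. It assumes every bound VMkernel port on every host opens one iSCSI connection per volume, which is how I arrive at 24 (3 hosts times 8 ports). That model and the function names are my own assumptions, so treat the results as estimates, not vendor limits.

```python
POOL_CONNECTION_LIMIT = 512  # the per-pool figure mentioned in this post


def connections_per_volume(hosts: int, vmk_ports_per_host: int) -> int:
    """Connections one volume consumes, assuming one per bound port per host."""
    return hosts * vmk_ports_per_host


def max_volumes(hosts: int, vmk_ports_per_host: int,
                limit: int = POOL_CONNECTION_LIMIT) -> int:
    """How many volumes fit under the limit with this host and port layout."""
    return limit // connections_per_volume(hosts, vmk_ports_per_host)


if __name__ == "__main__":
    for ports in (8, 4):
        per_volume = connections_per_volume(3, ports)
        print(f"{ports} ports/host: {per_volume} connections per volume, "
              f"room for {max_volumes(3, ports)} volumes")
```

With 8 ports per host that works out to room for 21 volumes; dropping to 4 ports doubles it to 42.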
And the explanation seems to be
“In the case of your FC Storage VMWare knows that the end point is two different WWNs so it knows it has different paths. In the case of iSCSI you are connecting to just our Group IP so it doesn’t know if the path is Fully Redundant. This seems to be the same with All iSCSI storage from what we can tell.”
So now let’s put the SAN through its paces. The first thing I think you should test is all of the redundant paths. So create a volume and attach to it. I put a VM on it and started it up. Then I started pulling connections, from least dangerous to most dangerous. First I simulated a dead NIC by pulling its Ethernet cable. Next I pulled power to the slave switch. Then I pulled power to the master switch, watching how long failover took and whether the VMs survived. Oh, you might want to make sure you install these patches on your vSphere boxes beforehand https://www.equallogic.com/enewsletter/technnote_0072009.html or else your volume might not stay connected. Finally I pulled power to the master controller on the EqualLogic. All of these tests helped verify that the redundant paths were working as advertised.
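If you want an actual number for "how long failover took" rather than a gut feeling, one option is to log a timestamped reachability probe against the test VM (say, one ping per second) while you pull cables, and then find the longest gap. The sketch below only does the analysis part. The sample format, a list of (timestamp in seconds, reachable) pairs, is something I made up for illustration, and how you collect the probes is up to you.

```python
from typing import Iterable, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest gap in seconds between two successful probes
    that had at least one failed probe between them.

    An outage that is still in progress at the end of the log is not
    counted, because we do not know when it ended.
    """
    longest = 0.0
    last_ok = None       # timestamp of the most recent successful probe
    saw_failure = False  # True once a probe has failed since last_ok
    for timestamp, ok in samples:
        if ok:
            if last_ok is not None and saw_failure:
                longest = max(longest, timestamp - last_ok)
            last_ok = timestamp
            saw_failure = False
        else:
            saw_failure = True
    return longest


if __name__ == "__main__":
    # Hypothetical log: the VM stops answering at t=2 and is back at t=4.
    log = [(0, True), (1, True), (2, False), (3, False), (4, True), (5, True)]
    print(f"longest outage: {longest_outage(log)} seconds")
```

Comparing that number against the disk timeout you set inside the guests tells you how much headroom you really have.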
Ok, now that that is verified, let’s start stress testing the infrastructure. But first, install Dell EqualLogic SAN HeadQuarters so we can get some visibility into the pounding we are giving the SAN. You download SAN HeadQuarters from the support site. Unfortunately it’s an EXE that is named Setup.exe; in fact just about everything you download is named Setup.exe grrr….. bad form, bad form. Ok, so to use SAN HeadQuarters you will need to add a read-only SNMP community name. In the Group Configuration of your EqualLogic, click the SNMP tab, and then click Add to add one. Now when you install SAN HeadQuarters you will use that public SNMP community name.
So now you can see your SAN activity, here is a snapshot of my SAN being really bored.
If that hasn’t scared you enough you can always read part two of this post.
So, a couple of addenda. EqualLogic recommends (linked here: http://bit.ly/6Iw3ot):
Removing all services other than TCP/IP from the SAN interfaces and unchecking the “Register this connection’s address in DNS” box, to reduce the possibility of any non-iSCSI traffic traversing the SAN interface. Also, change the binding order of the network cards to have the SAN one last. This will cause the server to report the first bound interface (usually the public/LAN interface) to DNS first, making that the preferred interface for the server.
If you are installing SQL or Exchange on an iSCSI volume you want to set service dependencies http://bit.ly/5mDqWn
When choosing between basic and dynamic disks, only Server 2008 is supported for iSCSI dynamic disks; otherwise choose basic http://bit.ly/8vVjHY
This is a decent guide on Server 2003 on iSCSI http://sites.google.com/site/mellerbeck/Home/DeployingMicrosoftWindowsServer2003inaniSCSISAN..pdf?attredirects=0&d=1
One important thing to take from the guide is aligning partitions even for the OS partition. I talk a lot more about that here
I would also recommend doing this to all VMs (unless someone tells me otherwise??)
We have found that on Windows file servers, the system automatically updates the “Last Access Time” field of the directory entry for each file touched, which can prove to have a very relevant impact on snapshot and replication utilization.
To disable Last Access Time handling on an NTFS filesystem, add the following key to the Registry on your server:
And set it to 1
It requires a reboot of the host to take effect.