These are my notes on investigating EqualLogic replication and (hopefully) how to optimize it somewhat! First, a Sizing Replication Space doc from EqualLogic.
Please, please, please, share any thoughts or comments!
First, how to set it up. Sometimes a video is worth a thousand words.
Then again, sometimes words are nice too 🙂 See this doc: http://sites.google.com/site/mellerbeck/Home/equallogic_SRM.pdf?attredirects=0&d=1
It is written for setting up Site Recovery Manager, but you have to set up replication first, and it lays that out:
Step 1: Configure Replication Partnership between Protected Site’s PS Series Group and Recovery Site’s PS Series Group
1. From the Protected Site Group Manager click on Replication Partners and then, under the
Activities tab, click on Configure partner.
2. Enter in the Group Name of the Recovery Site (case sensitive), the Group IP address and
a description. Click Next.
3. Enter in any contact information in the next screen and click Next.
4. On the next screen, there are two password fields. The first field is the Password for
partner. This is the password that the primary group will give to the recovery group when
establishing a connection for replication. The second field is the Password obtained from
partner. This is the password that the primary group expects to receive from the partner.
Enter in the information and click Next.
5. Configure Delegated space. This is space that is created from local free space on the
Protected Site group to store replicas from the partner. If there are going to be no replicas
sent to the Protected Site this value can be left at 0. Choose the amount of delegated space,
the storage pool which the space will come from, and click Next.
6. Verify all of the information is correct and click Finish.
7. This creates the partnership between the Protected Site and the Recovery Site. Now follow
the same steps to configure the partnership between the Recovery Site and the Protected
Site. When configuring the Recovery Site, take into account the number of volumes that
are being replicated and the total space. For example, four 100GB volumes being
replicated with 200% reserve space plus a little buffer will require 1TB of delegated space
on the Recovery Site.
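The sizing example in item 7 can be sketched as a quick calculation. This is only a rough sketch: the function name and the `buffer_fraction` parameter are my own, and the 25% buffer is an assumption chosen so that the doc's "a little buffer" lands on roughly 1TB.

```python
def delegated_space_gb(volume_sizes_gb, replica_reserve_pct=200, buffer_fraction=0.25):
    """Estimate the delegated space (in GB) the receiving group needs.

    volume_sizes_gb: sizes of the volumes being replicated.
    replica_reserve_pct: replica reserve per volume (200% is the doc's default).
    buffer_fraction: extra headroom on top of the reserve (assumed value).
    """
    reserved = sum(size * replica_reserve_pct / 100 for size in volume_sizes_gb)
    return reserved * (1 + buffer_fraction)


# Four 100GB volumes at 200% reserve: 800GB reserved, 1000GB with the buffer.
print(delegated_space_gb([100, 100, 100, 100]))
```

Setting `buffer_fraction=0` gives the bare 800GB reserve, which makes it easy to see how much of the 1TB is headroom.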
Step 2: Configure Datastore Volumes for Replication
1. From the Protected Site Group Manager, click on a Datastore volume that you wish to
protect. In the Activities tab, click on Configure replication.
2. In the next screen, choose the Replication partner that this volume will be replicated to.
Site Recovery Manager 1.0 supports only one Recovery Site, so all Datastore volumes
should be replicated to the Recovery Site that was configured in step 1. Select the
percentage of Replica reserve on the partner. This is the amount of space on the
Recovery Site that is reserved for this volume to replicate. The default and recommended
amount is 200%. The next field is the Local replication reserve. This is space reserved
on the local group to track replication changes and to keep fast failback snapshots. The
default is 100%. There is also a check box to Allow temporary use of free pool space. If
this is checked, then it will use free space to track changes if the local replication reserve is
not enough. Choose your options and click Next.
3. In the next screen, there are two ways to create the initial data transfer. Automatic will
push the replica across the WAN to the Recovery Site group. Manual will allow you to
offload the data to physical media and transfer it to the remote site. There is also a check
box to Keep failback snapshot. This is not selected by default, but for Site Recovery
Manager failback scenarios, this should be checked. This will save time in case of a
failover and the PS Series Group at the Protected Site is still available to failback to. See
the section on Failback for more information. Make your selection and click Next.
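To keep the two reserves from Step 2 straight, here is a small sketch of what one datastore volume costs on each side. The class and method names are made up for illustration; only the 200% and 100% defaults come from the doc above, and the 500GB volume is just an example size.

```python
from dataclasses import dataclass


@dataclass
class VolumeReplicationReserve:
    volume_gb: float
    replica_reserve_pct: float = 200.0  # reserved on the partner (Recovery Site)
    local_reserve_pct: float = 100.0    # reserved on the local (Protected Site) group

    def partner_reserve_gb(self) -> float:
        # Space taken out of the partner's delegated space for this volume.
        return self.volume_gb * self.replica_reserve_pct / 100

    def local_reserve_gb(self) -> float:
        # Space used locally to track changes and hold failback snapshots.
        return self.volume_gb * self.local_reserve_pct / 100


datastore = VolumeReplicationReserve(volume_gb=500)
print(datastore.partner_reserve_gb(), datastore.local_reserve_gb())
```

This ignores the "Allow temporary use of free pool space" option, which lets the local side borrow beyond its reserve.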
Once you start messing with replication, you will probably start to wonder what causes data to need replicating.
The first thing that can cause this is Last Access Time updates on Windows file servers:
We have found that on Windows file servers, the system automatically updates the “Last Access Time” field of the directory entry for each file touched, which can prove to have a very relevant impact on snapshot and replication utilization.
To disable Last Access Time handling on an NTFS filesystem, add the following key to the Registry on your server:
And set it to 1
It requires a reboot of the host to take effect.
The second thing they suggest is turning off mailbox management for Exchange 2003 or daily defragmentation in Exchange 2007.
This got me thinking about other things I might be able to do to limit writes to the volumes. My theory is that if you use VCB for your VMware backups, the snapshot it creates might significantly increase replication traffic. If so, it might make sense to create a separate snapshot volume and redirect the snapshots there: http://www.vladan.fr/how-to-create-snapshot-in-another-location-then-the-vms-folder/ or http://kb.vmware.com/kb/1002929
By the same logic, it might also make sense to change the location of the swap file, as laid out here: http://www.vladan.fr/how-to-specify-different-swapfile-location-in-vmware-esx-35/
Another tool that can help with replication is a WAN optimizer. This sentence is very interesting:
|Does the PS Array encrypt or compress replication data?|
|No, the Dell EqualLogic PS Array series does not compress or encrypt data in any way. It transfers the raw disk blocks across the network for replication. If compression or encryption is needed in an installation, it will need to be provided by VPN encryption or a WAN accelerator device. Examples of WAN accelerator devices are from companies like Riverbed, Certeon, Bluecoat, etc.|
If you use a Riverbed, this is how to configure it:
|Riverbed devices causing “partner down” status during replication|
|The inline placement of Riverbed devices can result in failed attempts to start replication of volumes. The symptoms seen are:
1. “partner down” status on some replication attempts while other replications to the same partner are working fine.
The reason for this behavior is a combination of Riverbed and Equallogic optimizations. A brief summary of the problem follows:
The Riverbed device will substitute its MAC address for packets going through it. The Equallogic array has a routing optimization whereby it will create cached routes based on incoming MAC addresses. The result is we have host routes to a remote replication partner that are directed to the Riverbed device.
When we try to establish connections to the remote replication partner after this route is established we expect the Riverbed to act as a proper router. It does not. It will drop TCP SYN packets that have the Riverbed as the destination MAC.
The Equallogic firmware will flush cache routes after 20 minutes at which time you may see some replications begin again but the bad cached route is reentered and the problem resumes for subsequent replications.
The solution to the problem is to install “fixed target rules” on the Riverbed device. The following procedure was supplied by Riverbed Tech Support and installed on the Riverbed devices:
“To workaround the problem, setup fixed target rules on the Riverbed Steel Head (SH) for traffic between Equallogic group ip address to point to the far end group.
To setup the fixed target rule, you can go to GUI -> Setup -> Optimization Service -> In-path rules -> Add New Rules session:
* Type: Fixed-Target
Click on “Add Rule” button
You will need to define similar rule on the remote SH with reverse source/destination subnet for the connection made from remote end to this location.
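The "similar rule on the remote SH with reverse source/destination subnet" part is easy to get backwards, so here is a sketch of the rule pair as plain data. To be clear, this is not Riverbed syntax: the field names are hypothetical (the rule fields did not survive in my notes beyond the type), and the addresses are documentation-range placeholders.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class FixedTargetRule:
    source_subnet: str  # subnet of the local EqualLogic group (placeholder)
    dest_subnet: str    # subnet of the far-end EqualLogic group (placeholder)
    target_ip: str      # far-end appliance this rule points at (hypothetical field)

    def __post_init__(self):
        # Fail early on typos; these calls raise ValueError for bad input.
        ip_network(self.source_subnet)
        ip_network(self.dest_subnet)
        ip_address(self.target_ip)


def mirrored(rule: FixedTargetRule, local_appliance_ip: str) -> FixedTargetRule:
    """Build the matching rule for the remote appliance: subnets swapped,
    target pointing back at the appliance on this side."""
    return FixedTargetRule(rule.dest_subnet, rule.source_subnet, local_appliance_ip)


site_a_rule = FixedTargetRule("192.0.2.0/24", "198.51.100.0/24", "198.51.100.10")
site_b_rule = mirrored(site_a_rule, "192.0.2.10")
print(site_a_rule)
print(site_b_rule)
```

Mirroring the remote rule back again should give the original rule, which is a handy sanity check before typing anything into the GUI.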
This is a really interesting post on one person's experience with Riverbed and EqualLogic: http://www.stevenjbradbury.com/component/content/article/43-front-page-article/74-riverbed-model-shortfalls
What is good to note is that he is getting 95% data reduction (which I think I am going to need, since I seem to have really large amounts of data being sent out):
After 2 weeks of data from our Steelhead 2050 install, I’m super impressed with what the steelheads are capable of…but a warning, make sure you size the device properly for the amount of data you’re going to be moving. The steelheads we just installed are reducing data over the WAN link by 95%, but now we’re running into a disk I/O bottleneck. The steelhead disks cannot keep up with the SAN on the LAN side.
- SATA disk performance is bad, especially with only 4 SATA spindles. Using default SDR and adaptive SDR mode we see very high disk pressure, data reduction over the WAN is awesome, 95% reduction for equallogic SAN replication. But the steelhead disks cannot keep up with the SAN, over the course of a week replication times on the equallogic degraded and disk pressure on the steelheads went up. Turning on SDR-M cut replication times in half, but obviously hurt data reduction since we’re only using 6GB of RAM instead of 400GB of disk. For $40,000 units, why is SATA used?
- Right now the model sizing is done by number of TCP connections, WAN throughput, and storage capacity. I don’t need more TCP connections, or WAN throughput or storage capacity, I need faster disk. What’s the point of spending another $40,000 to jump up a model that only gives me two more SATA disks and a bunch of stuff I don’t need?
- Why not build units just dedicated for site to site data replication?
- Fast disk/the ability to use your own DAS
- No RSP BS (seriously, who the hell is going to run a VM on one of these?)
- More memory
- Less TCP connections
This post and graph show it really well! http://www.stevenjbradbury.com/component/content/article/43-front-page-article/71-riverbed-install
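To get a feel for what a 95% reduction means for a replication window, here is a back-of-the-envelope sketch. The 100GB of changed data and the 10Mbps link are made-up inputs, not measurements from my setup or from that post, and the calculation assumes the WAN is the only bottleneck (which, per the quote above, it may not be once the Steelhead disks become the limit).

```python
def estimate_wan_transfer(replicated_gb, reduction_pct, wan_mbps):
    """Return (GB actually sent over the WAN, hours to send it).

    replicated_gb: raw data the array wants to replicate.
    reduction_pct: data reduction from the WAN optimizer (0 for none).
    wan_mbps: usable WAN bandwidth in megabits per second.
    Uses decimal units: 1 GB = 8000 megabits.
    """
    wan_gb = replicated_gb * (100 - reduction_pct) / 100
    seconds = wan_gb * 8 * 1000 / wan_mbps
    return wan_gb, seconds / 3600


for pct in (0, 95):
    sent_gb, hours = estimate_wan_transfer(100, pct, 10)
    print(f"{pct}% reduction: {sent_gb:.1f} GB over the WAN, about {hours:.1f} hours")
```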
This is another post that lays out how EqualLogic does replication: http://opengarble.blogspot.com/2009/04/equallogic-replication.html
Right now I am troubleshooting my VMs one by one to try to figure out which is causing the most replication, moving each one onto a replicated volume to see if it is the cause. I will also be testing the Last Access Time registry setting to see if it helps any.
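A simple way to keep score while doing this one-VM-at-a-time test is to compare each replication size against a baseline and rank the differences. The sketch below assumes you note the replicated amount by hand after each test; the VM names and MB figures are placeholders, not real results.

```python
def rank_by_replication_cost(baseline_mb, observed_mb_by_vm):
    """Rank VMs by how much extra data replicated while each was on the volume.

    baseline_mb: replication size for the volume with no test VM on it.
    observed_mb_by_vm: replication size seen with each VM on the volume.
    Returns (vm, extra_mb) pairs, largest first.
    """
    deltas = {vm: observed - baseline_mb for vm, observed in observed_mb_by_vm.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Placeholder numbers purely for illustration.
observations = {"fileserver": 5200, "exchange": 3100, "webserver": 450}
for vm, extra_mb in rank_by_replication_cost(200, observations):
    print(f"{vm}: about {extra_mb} MB extra per replication")
```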
Another post that lays out similar findings to this one: http://www.interworksinc.com/blogs/bfair/2009/09/01/unnecessarily-large-equallogic-snapshots
This is an interesting post on using lessfs http://www.lessfs.com/wordpress/?page_id=172 for more efficient backup.