OK, it's not quite three years, but it's only a few months short, and suddenly I've received several emails and a couple of comments wanting followups to my Equallogic entry from back in early 2008, so here we go. For the most part I probably haven't posted any more about them because I don't really think things have changed that much. The array just sits there, does its thing and we live with it, which is good, but I'll try to outline some of our experiences with the product since we purchased our first one in 2006 and 3 more in 2007.
You might want to see this followup, especially regarding reliability.
The Good
- Reliability -- Although I'm afraid to post this, so far, we have experienced exactly ZERO service issues with our 4 storage arrays. No drive failures, no controller failures, no downtime, no problems at all. The arrays are always there, always working, just as you would expect them to be. Compared to our EMC arrays, which experienced controller failures, power supply/UPS failures, and multiple drive failures, not to mention a software failure that caused downtime and data loss, these things have been rock solid. We're pretty happy with no failures.
- Easy Administration -- To say that administering an Equallogic array is easy is almost an understatement; this thing takes virtually no time to administer. We've upgraded firmware a couple of times with no major problems. Equallogic took several of our UI suggestions, such as making it a little more deliberate to actually delete a volume (a volume now has to be taken offline first, and deleting the volume is very clear). This adds a little overhead to the delete operation, but I think it's a good trade. It was far too easy to delete a live volume before; now you must be deliberate to actually destroy data. It's just a few clicks, but clicks that make you think.
- Improved Features -- Equallogic has released firmware versions which added features to the array, like a staged firmware upgrade so that the array is only offline for about 15 seconds during firmware upgrades, firmware upgrades via the GUI, improved replication, RAID 6 and thin provisioning. RAID 6 was a big one for us because it allowed us to feel comfortable using the "no-spares" option and thus gave us back 8 drives worth of space and IOPS (2 drives per array). This was previously a huge downside to Equallogic: our previous storage allowed global hot spares, but with Equallogic each shelf had to have its own spares, which was a lot of overhead.
The not-so-Good
- Performance -- The array can provide reasonable speeds for sequential operations, but still seems to lack the horsepower required to perform really well with high IOP workloads. This is noticeable primarily with read latency, which can be quite poor on an array with a consistently moderate I/O load. For example, our PS400E hosts approximately 20 virtual machines. These VM's have fairly low IOP loads individually, but combined they average about 160 IOPS during idle periods, and double or triple that during the workday. Because they are VM's performing different tasks it's a very random I/O pattern. When the load climbs above 250 IOPS, read latency increases significantly. At 150 IOPS read latency is <6ms, but at 200-300 IOPS it can be as high as 12ms or more. This can have a very significant impact on loads that perform lots of small read I/O, like copying lots of small files or sequential database reads, but even basic tasks like opening event logs or listing directories are noticeably slower. This seems to be far worse than even our old EMC array. I wonder if it's due to a relatively small cache by today's standards, perhaps causing more I/O to be pushed to the drives than on other arrays. The array also seems to be unable to cache writes significantly, as even a moderate write load has a very negative impact on read latency. This is most noticeable on our PS400E because it has a lot of space and a large number of VM's sharing it. I suspect this wouldn't be nearly as bad if we had two arrays that were half the size but with twice the spindles. Once again, larger caches would be nice. The PS3900XV arrays (15K SAS drives) are much better, but still provide higher latency (3.5 ms) than the old EMC fiber channel drives on anything other than very light load.
We ended up having to add memory to many of our VM's and a couple of hosts to allow more OS caching, as an attempt to offset the poorer read latency and get performance on par with our previous storage solution (EMC CX700). I'm not really sure if this is a bad thing, but I think it says something about the EQ array possibly not being as efficient at caching as the CX, which actually had much slower drives.
- Snapshot/Replication -- We pretty much had to abandon our use of storage based snapshots and replication, which we previously made heavy use of on both our Netapp and EMC boxes. The storage and network overhead of Equallogic's solution is simply too much to bear from a cost perspective. Had we purchased enough storage to use the Equallogic snapshot solutions in the same way we had used previous products, we could have just purchased the other products, and their software licenses, and still saved money. To give you just one example, we have always snapped our VMFS volumes that store our virtual machines. We have a 1TB volume which houses just a few (3) fairly heavy use VM's. With the EMC array this volume would use about 50GB of snap reserve each day, which means if we wanted to store a week's worth of snapshots we would only need 1.35TB of space. That's a 35% overhead, which seemed like a reasonable trade for the ability to have a week's worth of copies of these critical VM's that could be restored quickly in the event of a problem. With the Equallogic array, this same snapshot strategy used over 500GB EACH DAY!! That means to keep 7 days of snapshots for our 1TB volume we would need to allocate 4.5TB of space, a 350% overhead.
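The overhead figures above are simple arithmetic, and a small sketch makes it easy to plug in your own change rates. The function name and the choice to treat 1TB as 1000GB (to match the round numbers in the post) are assumptions for illustration only.

```python
def snapshot_overhead(volume_gb, daily_snap_gb, days):
    """Return (total_gb, overhead) for keeping `days` daily snapshots.

    overhead is the snapshot reserve as a fraction of the volume size.
    """
    reserve_gb = daily_snap_gb * days
    return volume_gb + reserve_gb, reserve_gb / volume_gb


# Figures from the post, treating 1TB as 1000GB to match its round numbers.
for label, daily_gb in (("EMC", 50), ("Equallogic", 500)):
    total_gb, overhead = snapshot_overhead(1000, daily_gb, 7)
    print(f"{label}: {total_gb / 1000:.2f}TB allocated, {overhead:.0%} overhead")
```

Swapping in your own volume size and observed daily snapshot growth gives a quick estimate of how much raw capacity a retention policy would really cost.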
We once again ended up working around this issue with the OS. Most of our systems are Linux so we simply started using LVM snapshots instead of storage array snapshots. This works fine for the Linux systems and is far more space efficient, but requires a reasonable amount of administration overhead and scripting. For the Windows systems we still take nightly snaps of the volumes, but beyond that we rely on backups and VSS.
Replication really wasn't any different: it simply needed too much bandwidth when we attempted to replicate our Oracle database (which we had previously done for years with EMC MirrorView/A). When we finally put in a WAN acceleration solution last year we tried the replication again, and it worked much better, but we decided to just license and use Oracle DataGuard, and rsync the applications. We also make light use of a VMware based replication product that uses VMware snapshots to replicate data. These options are all far more bandwidth friendly than the Equallogic replication. I'm sure Equallogic replication would be fine across a LAN, or a fast WAN link with limited changes, but it's still far too much bandwidth for little reward.
- Scripting -- I still find the CLI for the Equallogic horribly bad, especially for automation. We wrote scripts which do the handful of things we needed, but this is a real weak point. I had hoped to make our scripts a more functional toolkit, but since we ended up making far less use of the Equallogic snapshots and replication than we did with our previous arrays, due to their significant overhead as mentioned above, it ended up being not worth the effort.
- Technology -- I'll say just one thing: 10Gb Ethernet. I still don't think Equallogic offers a 10Gb Ethernet solution for their storage arrays. I know they'll tell me that this isn't important because a single array can only deliver about 200MB/sec of bandwidth anyway, but that's only assuming one array. With a volume spread across multiple arrays it's quite easy to get 400MB/sec or more, but only with load balancing from the host. With a single 10Gb connection on a host it's exceptionally difficult to configure the system to get the full throughput available when the storage array only has 1Gb connections. This is especially true in VMware, which has no real load balancing to speak of. If I'm purchasing a VMware host with a 10Gb connection do I only want 120MB/sec throughput to my VM's? Of course not. Equallogic, wake up!! Your competition offers 8Gb (Fiber Channel) and 10Gb (other vendors' iSCSI) interconnections and is getting faster; you need to as well. At least they do have a really cool SSD solution.
I could probably find more stuff, but that's it for now. The basic truth is, based on the functionality we currently use, I could have purchased a much cheaper solution that would have been just as effective, although probably more complex to set up and administer. I'm not sure if we'll stick with Equallogic. The Dell purchase has certainly caused them to lose some of their luster in my eyes, and they certainly don't seem to be innovating much right now. When we go shopping next time (probably still a year or two away, we've focused on storage management to contain growth this year) I'll certainly survey the landscape for a mover and shaker in the storage space. Any suggestions?
Thursday, June 4. 2009 at 15:25 (Reply)
The snapshots are atrociously large with the 16MB chunk allocations and we are looking at mitigating this through the host as well.
Please continue to post when you have changes and find a working solution. I had hopes for the company... until Dell purchased them. Don't get me wrong, I'm an Austin Texas-Ex and all, but Dell sure has screwed up enough companies to date, they could have left this one alone.
Sterling.
Thursday, June 4. 2009 at 16:06 (Reply)
Since we now rely on so many host-based solutions for snapshots and replication we're strongly considering just going back to cheap, boring storage. Boxes with plenty of raw performance but limited features can be had for a song nowadays.
That being said, one company that does appear to be innovating at first glance is Pillar Data. Their application aware storage, the ability to assign priority to application, goes so far as to place higher priority data closer to the edges of the physical disks, where access is faster. They also include huge caches, something which I've never understood why others don't do, memory is relatively cheap. I'm still researching them, but at first glance they seem impressive.
Only weakness is no current support for RAID 6, which would make me a little nervous if I went with drives larger than 750GB because of BER but a recent conversation with their CTO says RAID 6 is coming later this year.
Anyone else out there using or know anything good or bad about Pillar Data?
Monday, January 11. 2010 at 10:26 (Reply)
I haven't got the kit yet so I can't comment on performance, but an added pro for the list is EQL's loyalty to the customer. We ordered kit in December, expecting delivery in early January. For some reason they can't meet the delivery date (I didn't ask), so they have cancelled the order and re-ordered a PS6000XV at no additional cost.
I wonder what real performance difference or gain we have got from the unfortunate delay. Another bonus: the new array is only going to arrive three days after the original delivery date.
Monday, June 8. 2009 at 12:06 (Reply)
thanks for sharing !
I was wondering if EqualLogic showed any signs of wanting to resolve the snapshot space issue ?
In your previous post, it seemed that EQL support implied it was "by design" and not meant to evolve.
BTW, your captcha is quite hard to read !
Monday, June 8. 2009 at 14:46 (Link) (Reply)
I suspect it may also have something to do with fragmentation. Based on how I believe their technology works, if you take a snapshot, and then data is written to a "page" they allocate a completely new page and keep the old one as the snapshot. This leads to the volumes kind of fragmenting all over the array, but 16MB fragmentation isn't so bad. However, with small "pages" the fragmentation overhead could quickly become significant, and volumes with lots of random I/O could have a lot of small fragments spread all over the disk.
I actually think even their current concept causes fragmentation. I haven't had time to prove it, but it appears that, as you make snapshots, since new data is always written to new space, a volume has a tendency to become very spread out on the spindles, increasing access latency. Perhaps their technology "defragments" these pages, but it certainly "feels" like my volumes get slower as they age, and my theory is that they get fragmented.
I theorize this because I don't seem to get the same amount of degradation from volumes where I don't make heavy use of snapshots. Now, the degradation is not huge, just a few milliseconds of access time, but that's very noticeable on small block random access.
It's too bad though, Equallogic still has so many positives, easy administration, solid hardware, enterprise features, and they really do "get" iSCSI in that their array is designed to make the most of it. But in three years other vendors have closed the gap, and Equallogic products haven't advanced very much except for bigger capacities and of course SSD, but everybody's doing that. You can't differentiate yourself with those features.
Monday, June 8. 2009 at 17:55 (Link) (Reply)
Thanks for the feedback and for your business over the last few years. Most importantly, your account team will be speaking with you shortly to make sure you are getting the technical support necessary.
As for some of your other comments, rest assured that the EqualLogic engineering team at Dell is innovating. We do try to listen, and work to provide high quality products. As with any product line, there is always room for improvement. I can’t comment on all of the coming improvements that may address your concerns in this public forum. What I can do is confirm that EqualLogic arrays with 10GbE interfaces are planned for later this year. You’d want to talk to your sales team for further roadmap information.
Thanks again for your business. I hope you will continue to use the EqualLogic PS Series as your storage platform.
Sincere regards,
Eric Schott
EqualLogic Product Management
Dell
Monday, June 8. 2009 at 18:42 (Link) (Reply)
I'm sure 10Gb ethernet is coming, but it wouldn't be innovative, just an improvement. Innovative is what Equallogic did to start with: building an array that expands by simply adding units, and allowing data to move between arrays without interruption with just a few clicks. I want to see some more innovation in the actual product, you know, things like this:
1. QoS -- Allow volumes to be prioritized over other volumes. As more and more systems share the same spindles this becomes more important and several vendors are adding this.
2. Smart Placement -- Another feature which improves response times by placing frequently used data where it can be accessed the fastest.
3. Thin Provision Space Reclamation -- at least one vendor is offering this. Right now thin provisioning is only marginally useful because, if a volume is ever filled, it stays allocated even if the data is removed. The idea here is to use sdelete, or another tool to write zeros to deleted data. The storage array then frees these zero'd blocks back to the thin provisioned pool.
4. More Caching -- RAM is fairly inexpensive but can be a huge boost to performance. Some vendors have figured this out and are including large caches, as much as 16GB or more per controller. This can significantly lower IOP loads. I think even the new 6000 series Equallogic arrays are only using 4GB; my CX array had that back in 2004.
Since some are already doing all of these things I'm not even sure we can consider those "innovative", but it wouldn't hurt.
Tuesday, June 9. 2009 at 01:48 (Reply)
Go with NetApp.
John
Tuesday, June 9. 2009 at 07:56 (Link) (Reply)
I'm not saying I would never go back, but it put such a sour taste in my mouth that I've barely spoken to them since. Interestingly, my Equallogic sales rep at the time of my purchase is now the NetApp rep for my area.
Tuesday, June 9. 2009 at 03:49 (Reply)
EqualLogic right now is playing catchup with what happens in the virtual & OS world (support for HyperV, vSphere, Win2008, Win7, and so on).
While this is necessary, core innovation seems frozen. Maybe a lack of manpower ? Shouldn't be, with Dell behind them...
Out of the 4 suggestions above :
1- somewhat less needed with virtualisation, as there are fewer volumes. Besides, VMware can do it at the VM level (disk shares). Would be nice of course.
2- that one would be really needed AND competitive, especially with the SSD PS arrays. EMC is the only one to do it with FAST and VMax (I might have missed smaller players though), besides EqualLogic might already have a good foundation to add it. Aren't they already doing transparent load-balancing ?
3- a must. Can't be very hard to integrate this feature in the HIT.
4- more cache is always nice, but is it really an apple-to-apple comparison ? You can't have more than 16 spindles behind an EQL module, unlike an CX one. I also clearly don't see these 2 products (CX4 and PS array) playing in the same field, CX4 being a higher tier.
Tuesday, June 9. 2009 at 08:19 (Reply)
Here are my responses to your comments:
1- somewhat less needed with virtualisation, as there are fewer volumes. Besides, VMware can do it at the VM level (disk shares). Would be nice of course.
Interestingly, I'm not convinced that virtualization doesn't make this more important. First, QoS isn't just priority, it's also about data placement. For example, Pillar Data's solution places high priority data closer to the outer edges of the disk where it's faster to access. VMware can't do that.
Also, the VMware "disk shares" feature is very weak in my opinion. For one, "disk shares" are relevant only to other VM's on a specific host or resource pool. A VM with a disk share of 10,000 on a host with 10 VM's, with the other 9 VM's having a priority of 2000, would still have less priority than a single VM on a host by itself. Simply moving VM's around can change the effective priority of the I/O for a given VM.
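To put rough numbers on that point, here is a minimal sketch. It assumes shares split contended disk bandwidth proportionally among the VM's on one host, which is a simplification of how VMware behaves, and the function name is made up for illustration.

```python
def share_fraction(vm_shares, other_vm_shares):
    """Fraction of contended disk bandwidth one VM gets on a single host.

    Assumption: bandwidth is divided in proportion to shares, and shares
    only compete against other VM's on the same host.
    """
    return vm_shares / (vm_shares + sum(other_vm_shares))


# A 10,000-share VM competing with nine 2,000-share VM's on the same host.
crowded = share_fraction(10_000, [2_000] * 9)
# A default-priority VM that happens to sit alone on its host.
alone = share_fraction(1_000, [])

print(f"high-share VM on a busy host: {crowded:.0%}")
print(f"lone VM on its own host:      {alone:.0%}")
```

Under this model the "high priority" VM ends up with roughly a third of its host's disk bandwidth, while the lone VM has no competition at all, and the array itself knows nothing about either setting.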
4- more cache is always nice, but is it really an apple-to-apple comparison ? You can't have more than 16 spindles behind an EQL module, unlike an CX one. I also clearly don't see these 2 products (CX4 and PS array) playing in the same field, CX4 being a higher tier.
But we're talking about innovation here. I wasn't comparing the Equallogic solution to a CX4, I was comparing it to my old CX700, a head unit purchased almost 5 years ago. Actually, the fact that you can only have 16 spindles is exactly why they NEED more cache. Other vendors can put more spindles which helps spread out the IOPS across a larger set of disks.
The Equallogic solution has only 16 spindles (at most) per cache even though their array can hold 8TB or more. Having 8TB of disk with 4GB worth of cache is pitiful. Yes, you can spread this out across multiple arrays, but that means you have to buy many smaller arrays, which has a lot of additional costs with the Equallogic way since you have to pay for the controllers for each shelf. For other vendors it's fairly inexpensive to add disk shelves to get more IOPS from the same cache controller. As you have more data, or lots of VM's sharing an array, this is a serious problem.
Tuesday, June 16. 2009 at 09:38 (Reply)
I have NetApp and Equallogic.
As NetApp prices are laughable at times, we decided to load our 2000 users onto Exchange and use snapshots instead of nightly tape based backup as we were struggling with the backup window. (8pm to 8am)
Our Exchange snapshots are almost 100% of the Exchange data.
So for a 2TB Exchange system, we have 5 daily snapshots totalling 10TB. And then at the weekend when users aren't around we take a tape backup to purge the logs and not snapshot.
We are still looking into why the snapshots are this big, but "copy on write" snapshots might mean that as users open a mailbox the whole area is tagged as written?
We still aren't sure whether Symantec Enterprise Vault is also possibly writing into the stores.
It's frustrating. I'm not sure that Dell know what to do with Equallogic. It's all about the software, and I'm not sure Dell have enough developers to compete with the army of coders at NetApp.
Like you, we still like Equallogic kit despite its shortcomings but are slowly realising that despite what the Dell chap just posted, proper innovation and "fixes" aren't coming. It's a travesty, as it could turn a good product into a great product.
For me, the large snapshots are the #1 area that needs attention.
I'm going to do a similar test on NetApp, and I will take snapshots and post back my findings if anyone is interested.
Hugh
Tuesday, June 16. 2009 at 10:11 (Link) (Reply)
The biggest cause of snapshot space usage I've seen is the nightly Exchange online defrag run. This obviously moves a lot of blocks and can cause tens of gigabytes of snapshot usage in a matter of 30 minutes or so. The only workaround I've come up with is to configure Exchange to only run the online defrag on the weekend, not nightly. That helps to keep the snapshot space at least partially under control, but it's not a silver bullet or anything.
The honest truth is, Equallogic snapshots are virtually useless for any moderately changing database backend, we see this same issue for Exchange, Oracle, and MSSQL databases.
I suspect the Netapp will be far better as it seems to use a much smaller allocation size for snapshot writes. I don't know the actual size, but as I noted above, we saw snapshot growth orders of magnitude less than the Equallogic on both Netapp and EMC Clariion CX arrays.
I agree, this is a huge issue, but I'm not sure how they can address it. Based on my understanding of their storage technology it's pretty much built around this "page" allocation method. That's how they move volumes between arrays and spread data out across multiple arrays efficiently and with minimal resource overhead. Going to a smaller page size would likely cause a performance hit or lower the limits on the number of snapshots because of resource constraints since they'd have to track many more changes per snapshot.
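A rough way to see why allocation size matters so much is to count how many distinct pages a batch of small writes lands on. The sketch below is only a model: it assumes each first write to a page after a snapshot preserves the whole page (using the 16MB figure quoted earlier in these comments), and it scatters writes uniformly across the volume, which is harsher than most real workloads. The 4KB and 256KB sizes are just comparison points, not any vendor's documented value.

```python
import random


def snapshot_space_gb(volume_gb, page_kb, writes, write_kb=8, seed=0):
    """Estimate snapshot space after `writes` random small writes.

    Assumption: the first write to a page after a snapshot preserves the
    whole page, so space consumed = distinct pages touched * page size.
    """
    rng = random.Random(seed)
    pages_in_volume = volume_gb * 1024 * 1024 // page_kb
    # A write larger than the page size spans several consecutive pages.
    pages_per_write = max(1, write_kb // page_kb)
    touched = set()
    for _ in range(writes):
        start = rng.randrange(pages_in_volume - pages_per_write + 1)
        touched.update(range(start, start + pages_per_write))
    return len(touched) * page_kb / (1024 * 1024)


# 100,000 random 8KB writes (under 1GB of changed data) on a 1TB volume.
for page_kb in (4, 256, 16 * 1024):
    used = snapshot_space_gb(1024, page_kb, 100_000)
    print(f"{page_kb}KB pages: {used:.1f}GB of snapshot space")
```

The same amount of changed data costs wildly different amounts of snapshot space depending on the page size, which is consistent with the gap we saw between arrays, although the real numbers depend heavily on how clustered the writes are.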
I hope they have a solution to this issue, but I'm becoming doubtful, it hasn't changed at all in the last 3 years because it's simply part of their design. It's really a killer because snapshots are such a basic component of all storage systems nowadays and their ridiculous storage requirements pretty much destroy their cost advantages over Netapp and other competitors.
We probably have at least another year before we'll really need to do anything for our Tier 1 storage so we'll wait and hope, but I'm not holding my breath.
Friday, July 17. 2009 at 14:31 (Reply)
From my understanding, ZFS has variable block sizing so snapshots and replications would be small. And it's a regular x86 server so you can add a 10g interface with no problems.
My only concern is ease of use.
Sunday, June 28. 2009 at 14:51 (Reply)
since real unbiased information is so scarce when it comes to comparing SAN solutions I am really really thankful for your site and comments.
We are a small IT company that is looking for a centralised storage solution for our 35-45 virtual (Hyper-V) and physical servers right now and so far it has been very hard to get real info as to the pros and cons of different storage solutions. Since we cannot afford to buy every product available just to find out where it might not be so perfect, we need to find as much unbiased info as we can get.
So far we have looked at solutions from HP (MSA2000i G2), Lefthand(P4000), Supermicro (SX4), Datacore San Melody, Dell Equallogic (P4000XV) and Netapp (FAS2020). We are also thinking about building the SAN ourselves using either Openfiler or open-e (the latter is also driving the Supermicro SX4). Since we are on a tight budget but do have high performance requirements for SQL, Exchange etc. we do have a bit of a problem with most vendors.
The Dell Equallogic P4000XV is in our budget as is the Supermicro SX4. Netapp went to great lengths to keep up but seems to be a very costly solution in the long run (once you are dependent on them they won't go that far I guess). The Equallogic snapshot problem you mentioned could be a problem for us because the P4000XV can only team up with one more unit. After that one has to go with the much much more expensive P6000 series and that one is way way over our budget.
As for measuring performance, do you know of any tool that will quickly and reliably test a SAN for future workloads regarding latency, IOPs and Read/Write performance ? Currently we don't have that much experience with centralised storage and we are wondering if a single SAN can keep up with 40-50 servers or if we should go with multiple cheap san solutions (Openfiler etc.) for just a few servers each.
Thanks and keep up the great work here (and in your job !)
Alex
Sunday, June 28. 2009 at 20:55 (Link) (Reply)
We loved Netapp but the cost to keep them was simply too expensive. We've been reasonably happy with Equallogic, and I still think there's a chance they could keep us, but the snapshot issue is pretty huge as we really miss using snapshots like we used to.
If you don't plan to keep many snapshots of the same volume at a time the EQL solution is probably OK, however, if you're planning to take more than a few daily snapshots I think you'll find them to be space hogs.
We actually just built a "SAN" with Openfiler and an Enhance Technology RS16i. We purchased the shelf and 16 1.5TB drives all for just over $7,000. The box is capable of great throughput (480-600MB/sec) even with just one shelf, but its random I/O isn't very great. Great place for disk based backups and bulk storage though.
There are plenty of tools for testing the SAN, I'm a Linux command line type of guy so command line tools are my favorite. For quick testing I use 'iozone' with a single record size (usually 8k) and usually do a run with both IOP and then throughput mode with varying parallel workers. IOzone is also available for Windows.
If I really want to do some good testing I use 'fio', a command line benchmark that can be configured to perform almost any type of I/O you want and can record the I/O pattern for 100% repeatable testing. It's written by Jens Axboe, the block layer maintainer for the Linux kernel.
If you're using Windows I guess IOmeter is pretty good (and is available for Linux too). It can simulate quite a diverse set of loads but for some reason I don't like it.
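As a starting point for the kind of random-read test described above, here is a small sketch that assembles an fio command line. The helper name, the target path and the chosen values (8k blocks, queue depth 16, 4 workers, 60 seconds) are illustrative assumptions, not a recommended configuration; `libaio` is Linux-specific, so check the fio documentation for the engine that suits your platform.

```python
import shlex


def fio_randread_cmd(target, block_size="8k", jobs=4, runtime_s=60, size="1g"):
    """Build an fio argument list for a time-based random read test.

    `target` is a hypothetical test file or device path chosen by the caller.
    """
    return [
        "fio",
        "--name=randread",
        f"--filename={target}",
        "--rw=randread",          # random reads
        f"--bs={block_size}",     # single record size, like the iozone runs
        "--direct=1",             # bypass the host page cache
        "--ioengine=libaio",      # Linux async I/O engine
        "--iodepth=16",
        f"--numjobs={jobs}",      # parallel workers
        f"--size={size}",
        f"--runtime={runtime_s}",
        "--time_based",
        "--group_reporting",
    ]


cmd = fio_randread_cmd("/mnt/san-test/fio.dat")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Varying `jobs` between runs is a simple way to see how latency and IOPS change as the number of simultaneous random readers grows.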
Monday, June 29. 2009 at 05:28 (Reply)
thanks for the information. Do you think that random I/O on the RS16 is weak because of Openfiler or because of the use of SATA instead of 15krpm discs ?
Any other shortcomings of Openfiler compared to a "real SAN" you would like to share ? For instance problems with snapshots or integration with vmware etc.
Thanks and have a nice day !
Wednesday, July 1. 2009 at 21:17 (Reply)
The controllers in high end arrays are generally very well tuned by the builders, and those builders spend a significant amount of R&D on designing I/O elevators that recognize disk access patterns and tune the array to perform as well as possible even with multiple writers. They attempt to minimize head movement, and are well tuned for their exact hardware.
The Openfiler/RS16 setup is heavily dependent on the Linux I/O schedulers, which are quite good, but far more generic, and can easily be fooled when there are multiple random readers.
I base this on tests that show that the Equallogic and RS16 arrays provide similar random IOP performance with one or two random readers, but as you ramp the number of simultaneous random readers up, the Equallogic array maintains its performance while the Openfiler/RS16 starts degrading quickly.
I'm sure that write semantics work against the Openfiler/RS16 combo as well. The Equallogic and other dual controller arrays generally utilize reasonably large non-volatile write caches, which means they can group writes together but still tell the hosts the data is already on disk. The Openfiler/RS16 combo has to commit data to disk much more aggressively since a power loss or malfunction would risk losing too much data. This has a negative impact on the IOP performance.
We get excellent sequential read and write performance from the Openfiler/RS16 combo, but the random IOP performance is nowhere near as good as the Equallogic. We can compensate for this somewhat by using a very large cache on the Openfiler server which can provide huge amounts of read cache. We still think the Openfiler solution is quite good, but we don't think it will meet our high uptime, high IOP requirements.
Monday, October 19. 2009 at 15:16 (Reply)
We are a 50+ VMhost (over 350 guest) VMware site that boots all VMhosts as well as 25 Citrix Xen servers from our CX3-80FC (which also houses our Oracle and SQL databases), 100TB+ total.
Mgt is listening to the sales guys from Dell VARs and getting all glassy-eyed (pushing the PS6000XV).
Wondering what your total storage size is and if you have any experiences with Oracle or booting from your eql arrays?
Monday, October 19. 2009 at 21:50 (Link) (Reply)
We currently have a total of about 20TB of usable space on four EQL arrays and we have a reasonable amount of Oracle ERP data (production and development about 1.5TB total) and something like 500GB of MSSQL data.
I have no doubt that an EQL solution could be configured to support your environment and provide sufficient performance, but if you make any use of snapshot beware of the extra space overhead compared to EMC, and it's questionable that they could "outperform them tenfold".
Thursday, November 5. 2009 at 16:46 (Reply)
We are in the process of making a decision between NetApp and EQL, and in trying to make that decision it's nearly impossible to find people that professionally address pros and cons of these systems. If they do have something to say, they don't back it up with any facts or any details, which is what makes your various posts on EQL refreshing.
Again, thank you!
Friday, November 6. 2009 at 16:57 (Reply)
He said that with replication, it IS block level (actually a min of 256kb can be sent) and that it's not bandwidth intensive. Care to share a bit more detail on your replication experiences?
And on snapshots, do you feel like Exchange is the only culprit or have you had problems with non-exchange data too?
I hope I'm not taking up too much of your time with this request, but we are trying to make the best business decision we can, and trying to wade through all the BS is tough. You seem like a knowledgeable and independent resource, hence my request.
If you like, you're welcome to respond to me via email if you don't want to post your details online.
Saturday, November 7. 2009 at 00:01 (Link) (Reply)
Let me start with replication. Yes, it's true that the EQL replication is not as bad as their snapshots, but I still wouldn't consider it great. We were replicating an Oracle database that was about 150GB. We had been replicating this for years with no WAN acceleration product or anything over a 3GB WAN link using a FCoIP gateway between our EMC storage systems and using MirrorView/A on our Clariion. The Oracle database used 8K blocks, and the MirrorView/A product appeared to replicate the same block size that the LUN was allocated in (in our case 4K). So, changing an 8K block would replicate an 8K block, with a little overhead, I'd guess an 8K change would take about 10K worth of bandwidth.
OK, now how did that work with Equallogic? Obviously it was a database, so blocks were being changed all over the place and every 8K change would replicate 256K worth of data so I'd say EQL only used 25x more bandwidth. OK, in the real world it wasn't quite that bad, sometimes a few Oracle blocks were changed in the same 256K EQL block, but I'd still estimate it took 8-10x the bandwidth of the EMC solution. It was so much, it was completely unworkable as a solution for us.
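The back-of-the-envelope numbers above can be reproduced with a short sketch. It assumes, as described, that EQL ships a whole 256K unit for any change inside it while the EMC setup ships each changed 8K block plus roughly 2K of overhead; the function and parameter names are made up for illustration.

```python
import math


def eql_vs_emc_bandwidth(changed_blocks=1, block_kb=8,
                         eql_unit_kb=256, emc_overhead_kb=2):
    """Ratio of data replicated by EQL vs EMC for one changed region.

    `changed_blocks` is how many database blocks changed close enough
    together to share a replication unit (an assumption of this model).
    """
    changed_kb = changed_blocks * block_kb
    emc_sent_kb = changed_blocks * (block_kb + emc_overhead_kb)
    eql_sent_kb = math.ceil(changed_kb / eql_unit_kb) * eql_unit_kb
    return eql_sent_kb / emc_sent_kb


# One isolated 8K change versus a few changes sharing a 256K unit.
for blocks in (1, 2, 3):
    ratio = eql_vs_emc_bandwidth(blocks)
    print(f"{blocks} changed block(s) per unit: {ratio:.1f}x the bandwidth")
```

With one isolated change per unit the model gives the 25x worst case, and with around three changed blocks sharing a unit it lands in the 8-10x range I estimated from real-world behaviour.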
Just to drive it home, with the EMC solution we were replicating the database every 5 minutes, although sometimes, during very heavy periods, the replication would get behind and lag 20-30 minutes. With the EQL setup? Well, the 8AM replication cycle would usually get done sometime around lunch time, and we'd be lucky to get another cycle by 6PM. That's right, we went from multiple replications an hour to two during the business day. I suppose if your application is always writing new, large chunks of data, it wouldn't make much difference, but in my experience that's the exception, not the norm.
As far as snapshot space, I would consider it a problem with anything. It's certainly more of a problem with applications like Exchange and SQL databases, but I'd still consider it an issue with simple file servers. It was a big deal with our VMFS volumes for our VMware infrastructure as well. For example, we have a file server that is used by our Sales and Marketing department. This server has 1.2TB of shared data, but typically sees only about 5-10GB of changes per day. How big were daily EQL snapshots on this server? Typically 200GB or MORE!!! With the EMC it was maybe 10-20GB.
I guess my point is, the EQL snapshots are ALWAYS bigger than the competition. Sure, on a 100GB server with a low change rate, maybe it would only be 1-2GB a day, but a competitor might only be 100-200MB a day. Which one is going to be more space efficient? Which solution will let you keep more snapshots longer no matter the change rate? It will always be any vendor other than EQL.
To be kind, their "defragment" argument is a bunch of bunk in my opinion. Sure, this helps a little with the daily change rate, but only in the short term. To keep this benefit you have to keep defragging, and guess what, the defragging pass will obviously make your snapshot huge. I would say defragging probably helped reduce the size maybe 10-20% for a few days, but that's still huge compared to everyone else.
To be totally honest, I'm disappointed to hear that EQL is not being upfront about their shortcomings in this area. EQL used to know what they were good at (ease of use, best-of-class iSCSI support) and what their weaknesses were (snapshots that are usable and fast, but not as flexible and efficient as others). One of their biggest pluses was that they were honest about not being everything to everybody or best-of-breed across the board. Based on your comment it sounds like Dell might be putting an end to that honest side that was so attractive, instead turning into the same marketing machine that is so frustrating in the storage world. I suppose I can understand it: NetApp has taken to calling snapshots "backups" (I actually attended a NetApp presentation on Oracle where they kept asking "how nice would it be to backup your database every 5 minutes?", except they were referring to snapshots, which, IMHO, are not backups), and I'm pretty sure EMC has always been a marketing company that happened to sell storage products. Maybe with all that pressure in a cut-throat market, they don't know what else to do.
I'd say, remember EQL for what they are: an "all-in-one" with the best iSCSI implementation, easiest setup and administration, excellent block performance, and great integration and monitoring tools to go with it. If snapshots and replication are critical components, well, you'll have to weigh that, but we've learned to live with other solutions there and have probably still saved money over the NetApp product. I haven't shopped NetApp in years, so maybe they're more cost competitive now, but when we purchased, the EQL solution was about half the price of NetApp with replication. If the price spread were closer, it would be tougher to decide.
Monday, November 9. 2009 at 09:44 (Reply)
Monday, November 9. 2009 at 11:39 (Link) (Reply)
I'm not aware of it having any such feature. They have what they call their "intelligent" volume placement, which I was pretty hard on in my original review. Basically, you can set a volume so that it "prefers" to be on a RAID-50, RAID-10, RAID-5, or RAID-6. Assuming you have these different RAID types and enough space in the pool, it will move the volume to the arrays containing this RAID type. There is also an "automatic" type that's supposed to use heuristics to automatically place volumes on the RAID type that is ideal based on their load, but it only works if you use DIFFERENT RAID TYPES (like some arrays that are RAID-50, and others that are RAID-10). Simply having different speed arrays doesn't work. For example, I have a 7.2K SATA array and a 15K SAS array, but they're both RAID-50, so I'm out of luck with this feature and have to place my volumes manually. If there's anything "intelligent" about any of that, I'm hard pressed to find it.
As far as I know, there is NO WAY to prioritize any volume over any other volume on the same array or set of arrays. If I'm mistaken on that, I'd love to know about it. I suppose this could be a feature that I'm unaware of with the 6000 series (I only own 5000's).
What some vendors are doing with QoS is quite impressive. Things like actually taking the placement of data on the physical disk into account (latency vs throughput changes based on where data is placed on the platters), as well as the amount of cache allocated, etc. My opinion is EQL has nothing here, except marketing claims about "intelligent" data placement, which is actually just a lame heuristic that puts busy volumes on a faster RAID group.
Now it sounds like I'm bashing EQL too much, which is strange because we still really like our arrays, but I'm also trying to be honest. Make me live with a NetApp for a while and I'll probably find bad stuff to say about them as well, although, if it's like my previous experience, it will be more about the company and its tactics than the actual hardware.
Thursday, November 19. 2009 at 10:31 (Reply)
I am very much interested in your scripting tools. I need to list snapshots for a given volume and set them on and offline, but I am battling with the odd output formatting of the telnet interface.
Do you see any possibility to publish your script collection?
Thanks!
Friday, November 20. 2009 at 16:10 (Link) (Reply)
#
# Parse the output of the "snapshot show" command and return the list of
# snapshots.
#
# Takes two arguments:
#   A reference to the scalar (typically the hash element keyed by the
#     volume name) that will hold an array of snapshot names.
#   A reference to a list that contains the output of the "snapshot show"
#     command as returned from the DoCommandOnArray() function.
#
# Returns:
#   Nothing useful
#
# Here is typical output from a "snapshot show" command:
#
# Name                        Permission Status      Schedule Connections
# --------------------------- ---------- ----------- -------- -----------
# test-1-2004-03-18-11:34:33. read-write online               0
#   1
# test-1-2004-03-18-11:42:45. read-write offline              0
#   10
# test-1-2004-03-18-13:25:43. read-write offline              0
#   13
# test-1-2004-03-18-13:30:28. read-write offline              0
#   16
#
# This subroutine pulls the list apart and appends each snapshot name to the
# array held in the caller's scalar. The other columns (permission, status,
# schedule and connections) are parsed but not returned. It also takes care
# of the fact that the data in some columns ("Name" in particular) can wrap
# across multiple lines.
#
sub ParseSnapShowCmd(\$@) {
    my ($OutHashElement_ref, $InList_ref) = @_;
    my $FSkip = 1;
    my ($Line, $Name, $Perms, $Status, $Sched, $Conns);
    foreach $Line (@$InList_ref) {
        #
        # Trim terminators and toss empty lines.
        #
        chomp($Line);
        next if $Line =~ /^$/;
        #
        # Ignore the column headings.
        #
        if ($FSkip) {
            $FSkip = 0 if ($Line =~ /^--------/);
            next;
        }
        #
        # Parse the line.
        #
        my @Cols = unpack("A27xA10xA11xA8xA11", $Line);
        if ($Cols[0] =~ /^ {2}/) {
            #
            # If the first field begins with two spaces, this is an
            # overflow of one or more columns from the previous row.
            #
            $Cols[0] =~ s/^\s+//g; $Name   .= $Cols[0];
            $Cols[1] =~ s/^\s+//g; $Perms  .= $Cols[1];
            $Cols[2] =~ s/^\s+//g; $Status .= $Cols[2];
            $Cols[3] =~ s/^\s+//g; $Sched  .= $Cols[3];
            $Cols[4] =~ s/^\s+//g; $Conns  .= $Cols[4];
        }
        else {
            #
            # This is a new row.
            #
            push (@{$$OutHashElement_ref}, $Name) if ($Name);
            ($Name, $Perms, $Status, $Sched, $Conns) = @Cols;
        }
    }
    #
    # Write out last entry, if pending.
    #
    push (@{$$OutHashElement_ref}, $Name) if ($Name);
    return;
}
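For anyone who would rather not work in Perl, here is a rough Python sketch of the same idea. It is not a port of our production tooling: the column widths are simply read off the unpack template above (27, 10, 11, 8 and 11 characters with single-space separators), the function names are made up, and it assumes you already have the raw "snapshot show" output as a list of lines.

```python
# Column widths taken from the Perl template "A27xA10xA11xA8xA11".
WIDTHS = (27, 10, 11, 8, 11)


def split_columns(line):
    """Slice one fixed-width row into its five columns."""
    cols, pos = [], 0
    for width in WIDTHS:
        # Like Perl's "A" format: keep leading spaces, drop trailing ones.
        cols.append(line[pos:pos + width].rstrip())
        pos += width + 1  # skip the single-space separator
    return cols


def parse_snap_show(lines):
    """Return the snapshot names found in "snapshot show" output."""
    names = []
    in_header = True
    name = ""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if in_header:
            # Everything up to and including the dashed rule is heading.
            if line.startswith("--------"):
                in_header = False
            continue
        cols = split_columns(line)
        if cols[0].startswith("  "):
            # Continuation row: the wrapped tail of the previous name.
            name += cols[0].strip()
        else:
            if name:
                names.append(name)
            name = cols[0]
    if name:
        names.append(name)
    return names
```

Calling parse_snap_show() on the sample output shown in the Perl comment should give back the four full snapshot names, with the wrapped suffixes ("1", "10", "13", "16") glued back on.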
Thursday, November 26. 2009 at 11:14 (Reply)
thanks very much.
FYI: I found an alternative written in Ruby (http://ruby-eql.rubyforge.org/) which works over ssh and has many functions already included.
Thursday, November 26. 2009 at 11:46 (Link) (Reply)
Monday, February 22. 2010 at 10:45 (Reply)
I can get the output of snapshot show easily enough but am not sure how to then use it with the subfunction and then get something meaningful out.
Hope you can help, no worries if you can't.
Wednesday, November 25. 2009 at 13:14 (Reply)
Go with Dell/EMC CX4 or EQL PS6000XV.
The only application is SQL Server 2008 on a cluster. This is effectively DAS.
The access pattern will be random with plenty of table scans. I am hoping that SQL Server will cache a significant amount of the database, which is only 5GB.
My inclination is to go for the EQL box due to the additional software and simplicity. Any counter arguments why I should stay with EMC?
Thanks
Dan
Wednesday, November 25. 2009 at 14:45 (Link) (Reply)
I really don't have an answer or argument either way. For such a small application, I'd probably save my money on a storage system.
Saturday, December 12. 2009 at 20:52 (Reply)
The info you've provided on snapshots is great to know and will be helpful in determining if that is the route to take for us. I found an interesting post however that suggested turning off the "Last accessed date" flag for files in Windows systems and that this dramatically changed snapshot sizes:
http://www.interworksinc.com/blogs/bfair/2009/09/01/unnecessarily-large-equallogic-snapshots
Care to comment?
Saturday, December 12. 2009 at 21:49 (Link) (Reply)
The article on snapshot space is interesting, but I'm not really buying it. I'm sure that disabling anything that writes small updates would indeed help with EQL snapshots, because the problem with EQL snapshots is well understood: if even a single bit is changed in a 16MB block, then that entire block is used immediately. I know of no other snapshot technology that works like this. I tried all these great "tricks" that EQL had me do to improve their snapshot space, but at the end of the day, EQL snapshots were still always orders of magnitude bigger than the competitors', or the space used by OS-based snapshots. With our EMC array I had volumes where 20% snapshot space could easily hold snapshots for 30 days or more; with EQL I'd need 100% snapshot reserve to do the same thing, even after all the tweaks. If you're not using the space for anything else, not a big deal, but assuming you like your expensive storage to be used efficiently, well, EQL just doesn't, and no amount of tweaking changes that.
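To see why small scattered writes hurt so much, here is a toy Python model of the 16MB behaviour described above. It is only a sketch: it assumes each write fits inside one 16MB page and that a page is preserved whole the first time anything in it changes; the function name and the example workload are invented for illustration.

```python
import random

PAGE_KB = 16 * 1024  # 16MB snapshot granularity described in the text


def snapshot_space_kb(write_offsets_kb, page_kb=PAGE_KB):
    """Space a snapshot must hold if every touched page is kept whole.

    write_offsets_kb: starting offsets (in KB) of small writes, each assumed
    to fall entirely within a single page.
    """
    touched_pages = {offset // page_kb for offset in write_offsets_kb}
    return len(touched_pages) * page_kb


if __name__ == "__main__":
    rng = random.Random(42)
    volume_kb = 100 * 1024 * 1024      # a 100GB volume
    writes = 2000                      # 2000 scattered 8K writes
    offsets = [rng.randrange(0, volume_kb, 8) for _ in range(writes)]
    real_mb = writes * 8 / 1024
    snap_mb = snapshot_space_kb(offsets) / 1024
    print(f"data actually changed: ~{real_mb:.0f}MB")
    print(f"snapshot space consumed: ~{snap_mb:.0f}MB")
```

The more random the writes, the closer every 8K change gets to costing a full 16MB of reserve, which matches what we saw in practice.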
Wednesday, January 20. 2010 at 14:21 (Reply)
For the Simple File Server example, did you try disabling the Last Accessed stamp that Windows places on the files? Equallogic has a KB article indicating that this can cause large snapshots for file servers.
No excuse for Exchange/VMware, though.
Wednesday, January 20. 2010 at 22:30 (Reply)
Tuesday, February 2. 2010 at 17:57 (Reply)
We are comparing an NS-480 and EQL. We have 140 VM's and want to replicate using SRM, and we need to get the EQL performance up. I have a PS6000XV with SAS and I am only able to push 70 Mbps R/W from a guest OS. Have you touched the 10G models? How do they compare? I was optimistic about the EQL line, but replication, snaps and overall performance have me leaning EMC's way.
Tuesday, February 2. 2010 at 22:13 (Link) (Reply)
Make sure you test a lot of concurrent access. You have 140 VM's so there will be a good amount of contention for the disk, and it's important that the array not choke when 5-10 hosts are really working the array. This is a place where the EQL arrays do shine compared with my EMC experience. Admittedly that was an older EMC now, and with enough spindles you can spread the load, but in our experience the EQL generally does perform as well with only a single or dual load, but slowly overtakes the other system as the number of concurrent loads ramps up.
Tuesday, February 16. 2010 at 08:00 (Reply)
Tuesday, February 16. 2010 at 10:11 (Link) (Reply)
1. If you're happy with the performance, it doesn't really matter how much link bandwidth you're using. We've seen our arrays saturate 1Gb links many times during RMAN backups, etc., but that doesn't mean yours should, only that it could.
2. Think "scale-out". You have one small array; most people "clamouring" for 10Gb would have more than that. We have three arrays, and a volume spread across all three can easily deliver 450-600MB/sec. In this scenario EQL uses the links to transfer data between arrays prior to sending it to the host. The 1Gb links can quickly become a bottleneck. Now imagine this case with 8 arrays.
3. Think "simplicity". Most of our servers have at least 4 iSCSI interfaces, some more. Even for servers that currently have 10Gb connectivity I have to play "tricks" to get multiple iSCSI connections to the EQL array to achieve full bandwidth. If the EQL array had 10Gb interfaces, any single iSCSI connection could provide full connectivity. 10Gb will reduce the cabling in our racks by at least 4x, always one of the most annoying parts of a rack install.
4. Think "IOPS". There's far more to storage than just bandwidth. While one of my previous arguments for iSCSI was that network latency was only a minimal part of the latency equation, solid-state media like flash and huge array caches are changing this game tremendously. Most 10Gb rack switches support "cut-through" switching that allows them to switch even jumbo frames in ~3 microseconds.
Basically, 10Gb ethernet is coming to most racks in the datacenter, and unless EQL wants to be left as the small market player they need 10Gb options (which they now thankfully have). Certainly, if 1Gb is good for you, fits your budget, and you don't have scale-out or performance issues, it's still a fine option, but for many of us (perhaps most of us) 10Gb is the way forward.
PS -- Can you believe that was just the summary?
Tuesday, February 16. 2010 at 14:02 (Reply)
I do find this peaking at 120Mbps odd though, because the EMC SATA-based AX150 peaked at around the same throughput. The EQL's IOPS numbers are a definite improvement though. I monitor those per host on Cacti and with operations that soak up as much as they can get (Exchange 2007 nightly Store online defrag) the numbers shot up.
What bothers me is that I don't really see a blazing throughput while moving VMDKs around. They should really work the array - large sequential I/O - but they seem no faster than the EMC, despite being populated with 16 SAS drives @ 15k RPM versus crummy 7.2K RPM SATA. 120Mbps is only 15MB/s which explains the 30 or so MB/s when I move a VM. Should I be concerned?
Tuesday, February 16. 2010 at 14:43 (Link) (Reply)
For copying VMDK files around, I think the throughput you're seeing is about all you should expect. It's what I see as well. Copying a VMDK is sequential I/O, but it appears that VMware limits the queue depth significantly and performs very little (maybe even no) read-ahead. I suspect this may be intentional to keep move operations from overwhelming the underlying disk and impacting performance of your running VM's, but I really don't know. The only thing that seems to improve VMDK copies is decreasing latency, but that only helps a little. When we had our EMC CX700 with 4Gb fiber channel and a bucket load of fiber channel drives we were able to get about 50MB/sec copying VMDK files around, but I suspect that's mostly because of the improved latency of fiber channel and the manual tweaks to read-ahead that we performed on the VMDK volumes.
To really test performance, put some load on a VM, or a couple of VM's. I usually run a nice, quick, parallel iozone test. Something like "iozone -s 4g -r 8192 -t 4". This will start four threads, each writing/reading a 4GB file. Keep in mind that iozone's -r takes the record size in KB by default, so 8192 means 8MB records; use "-r 8k" if you want 8K chunks. If you can't push into the 200MB/sec range doing that, well, then something's wrong. Otherwise, you're probably good.
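Since megabits and megabytes keep getting mixed up in this thread, here is a tiny Python sketch for sanity-checking the numbers. It ignores iSCSI and TCP overhead entirely, so treat the results as upper bounds; the helper names are made up.

```python
import math


def mbps_to_mb_per_sec(megabits_per_sec):
    """Convert megabits/sec to megabytes/sec (8 bits per byte)."""
    return megabits_per_sec / 8


def links_needed(target_mb_per_sec, link_gbps=1.0):
    """Minimum number of links to carry the target, ignoring protocol overhead."""
    link_mb_per_sec = link_gbps * 1000 / 8
    return math.ceil(target_mb_per_sec / link_mb_per_sec)


if __name__ == "__main__":
    print(mbps_to_mb_per_sec(120))       # the 120Mbps peak, in MB/sec
    print(links_needed(200))             # 1Gb links needed for 200MB/sec
    print(links_needed(200, link_gbps=10))
```

A single 1Gb link tops out around 125MB/sec, so hitting the 200MB/sec range in the iozone test means the load has to be spread over at least two 1Gb paths.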
Friday, February 19. 2010 at 18:19 (Reply)
Would you have any insight on whether EQL are affected by LUN Queue Depth?
For example, when you run a virtual cluster of 6 nodes running 20 VMs, you will have a very random workload that can be very bursty, so even if your average IOPS is well within the capability of the underlying spindles, you could experience slowdown if the queues get filled. My understanding with NetApp is that they have a deep port-based and/or global queue, so they don't get bogged down unless the sustained workload is too much. Does EQL operate the same way, given that it has just one global IP and balances incoming IO among the 4 interfaces?
Saturday, February 20. 2010 at 10:56 (Link) (Reply)
However, it's important to note that the NetApp architecture is completely different from that of the EQL. As far as I know, NetApp still follows the traditional "head-unit" model. You purchase a "head-unit" and then attach it to disk shelves. You then carve up these disks into RAID groups, and create LUN's on the RAID groups. The "queue depth" problem would happen when some server was writing huge amounts of data and keeping the disks of one of the RAID groups busy. If this host continued to write until it filled the "queues" on the head unit, this could impact performance on other RAID groups even though their disks are idle.
The EQL is completely different. Each and every array in the group has its own controller, and each array has exactly one "RAID group" for all drives in that array. LUN's can be "locked" to a specific array, or spread across all arrays in a given resource pool. As you add arrays, you get more ports, more cache, and thus more queues. This is the cornerstone of their "scale-out" architecture.
Yes, the entire group does indeed have a single IP address, however, this is for simplicity, and is not a limiting factor. If you have 3 EQL arrays in a group, and the LUN is spread across all three, the incoming connection will be redirected to the "least-loaded" of the available ports (based on the arrays this might be 9 or 12 ports). The EQL monitors activity on the ports, and if the load pattern changes, it will transparently move the connections to other ports using ARP to equally balance the connections across the available ports in the group.
If a LUN is locked to a specific array in the group, incoming connections for that LUN will be sent to one of the ports on that array. If you move the LUN to another array, the connection will be transparently redirected to a port on the new array.
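As a mental model only, the redirect step can be pictured like the Python sketch below. This is NOT Equallogic's actual algorithm, which I have no visibility into; the (array, port) naming, the load numbers and the function name are all hypothetical.

```python
from typing import Dict, Iterable, Tuple

Port = Tuple[str, int]  # (array name, port index); hypothetical naming


def least_loaded_port(loads: Dict[Port, float], arrays: Iterable[str]) -> Port:
    """Pick the least-loaded port among the arrays that hold the LUN.

    loads maps each port to some load figure (e.g. active connections).
    Ties are broken by port name so the choice is deterministic.
    """
    wanted = set(arrays)
    candidates = [port for port in loads if port[0] in wanted]
    if not candidates:
        raise ValueError("no ports available on the requested arrays")
    return min(candidates, key=lambda port: (loads[port], port))


if __name__ == "__main__":
    loads = {("array1", 0): 5, ("array1", 1): 2,
             ("array2", 0): 1, ("array2", 1): 7}
    # LUN spread across both arrays: any port in the group is a candidate.
    print(least_loaded_port(loads, ["array1", "array2"]))
    # LUN locked to array1: only that array's ports are considered.
    print(least_loaded_port(loads, ["array1"]))
```

The point is simply that the candidate set grows with every array you add when a LUN is spread across the pool, and shrinks to one array's ports when the LUN is locked.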
Monday, February 22. 2010 at 10:42 (Reply)
Let's try again.
I understand how the scale-out architecture helps alleviate this issue with more ports, cache, etc. working in parallel, but the EQL has only one RAID group per array, so would it matter if I created one volume and placed all of the VMs on it, or more, smaller volumes with fewer VMs each, if the total workload is the same? The traditional model would say you should break it up into smaller 'chunks', because if you funnel the entire bursty workload onto a single large LUN, you will flood the queue and slow everything down even though the disks aren't doing much. Does this apply to EQL?
Does my question make more sense now?
Wednesday, March 31. 2010 at 13:31 (Reply)
I'm curious... what functionality were you looking to implement in your scripts? What were you hoping to accomplish?
Wednesday, March 31. 2010 at 17:05 (Reply)
I expected that something called a "scripting toolkit" would provide callable functions like "createSnapshot" or "listSnapshots", where you would call the functions with volume names and options and they would return results. The "scripting toolkit" from Equallogic is basically a few function calls that wrap opening a telnet session.
On top of that, while it's easy enough to create scripts that create snapshots or delete snapshots, it becomes more difficult to do things like list snapshots, parse them, and automatically mount them.
For example, we had tools for our EMC array that could take snapshots of our Oracle servers and present them to another host. An admin could type something like "orasnap create " and the script would create the snap of the volume, make the changes on the EMC array to present the snap to the host, and mount the volume on the host. An equivalent "orasnap delete" command would do the reverse. The admsnap and navicli tools made running such commands, and parsing the resulting data, trivial. We also had scripts to do simple things like "listsnaps " and "mountsnap ".
Performing the same tasks on an EQL array is more difficult, mainly due to parsing the horrid CLI output. In the end, if the "scripting toolkit" simply provided functions for creating reasonable output, that would be a huge step.
I don't mean to imply that such scripts couldn't be created; they certainly can, and we created them, but the complexity level for doing the same functions under EQL was much higher. Some of our scripts for EMC were 25-30 lines, while the equivalent script for Equallogic was 300-400 lines, mainly due to parsing functions.
Tuesday, April 13. 2010 at 19:15 (Link) (Reply)
We will be using VMware SRM to protect Tier 1 applications so I am not sure how to present this host. Any suggestions are welcome.
Jonny