Buyers Beware - All Data De-duplication is NOT created equal
Nine areas of functionality you should test before buying a de-dupe device
Three years ago data de-duplication sounded like magic. Now storage manufacturers are racing to claim a portion of the data de-duplication marketplace.
Data De-Duplication Defined
Essentially, data de-duplication is the ability of an appliance, or of a software application running on a server with disk attached, to compare segments of data being written to it with data segments that already reside on it. If duplicate data is found, a pointer to the original data is established instead of storing the duplicate segments again, thereby removing or "de-duplicating" the redundant segments from the volume.
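To make the pointer idea concrete, here is a minimal Python sketch of a segment store. It is an illustration only: the class name DedupeStore, the tiny fixed segment size and the use of SHA-256 as the segment fingerprint are assumptions made for this example, not a description of how any particular appliance works.

```python
import hashlib


class DedupeStore:
    """Toy segment store: each unique segment is kept once, keyed by its hash."""

    def __init__(self, segment_size=4096):
        self.segment_size = segment_size  # fixed size, chosen only for simplicity
        self.segments = {}                # fingerprint -> segment bytes
        self.files = {}                   # file name -> list of fingerprints ("pointers")

    def write(self, name, data):
        pointers = []
        for i in range(0, len(data), self.segment_size):
            segment = data[i:i + self.segment_size]
            fingerprint = hashlib.sha256(segment).hexdigest()
            # Store the segment only if it has not been seen before.
            self.segments.setdefault(fingerprint, segment)
            pointers.append(fingerprint)
        self.files[name] = pointers

    def read(self, name):
        # Rebuild the file by following its pointers.
        return b"".join(self.segments[fp] for fp in self.files[name])

    def stored_bytes(self):
        return sum(len(s) for s in self.segments.values())


store = DedupeStore(segment_size=4)
store.write("monday", b"AAAABBBBCCCC")
store.write("tuesday", b"AAAABBBBDDDD")   # only the last segment is new
print(store.stored_bytes(), "bytes stored for 24 bytes written")
```

The second write adds only one new segment; the two segments it shares with the first write are recorded as pointers.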
This ability has fundamentally changed how disk-to-disk backup can be used in the overall backup strategy. Without data de-duplication, the disk part of disk-to-disk backup is just a cache. This means that even with today's prices on SATA hard disks, the cost to house that data for a significantly long period of time is prohibitive. Most customers can only store one to two weeks' worth of data on disk before they have to spool to tape and free up disk space. The result is that customers are still burdened with all the challenges associated with tape for backup and off-site disaster recovery.
Now that data de-duplication has caught on, thanks mostly to market leader Data Domain, other manufacturers are entering the market or are adding the ability to their standard disk solutions. There are components of these appliances that buyers should be aware of prior to purchasing or during an evaluation of a data de-duplication solution. Many of these items would be overlooked in the normal manufacturer interview and evaluation testing. Some in fact would only become apparent long after the evaluated unit is purchased. In many cases you may be tempted to treat these as "check mark" items, but you really need to peel back the onion on these features.
Test In-line vs. Post Process and Replication
As explained in an earlier article, a key architecture item to explore is whether the de-duplication process occurs in-line or post process. The point at which the data de-duplication process occurs has a critical impact on the data replication process. One of the key advantages of some data de-duplication appliances is their ability to replicate to a second box at a remote site, providing built-in and automated electronic vaulting of your backups.
Make sure that you can begin the data replication process as soon as the primary appliance begins receiving data. Some devices require that you wait until the entire backup job is complete prior to starting that process, causing a delay in updating the DR site and possibly putting you outside of your DR goals. Even some in-line de-duplication solutions have to wait for the entire backup saveset to complete before replication can start. Backup savesets can be quite large and this too can cause a delay in bringing the DR site into sync.
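A rough way to reason about that delay is sketched below in Python. This is a deliberately simplified model with hypothetical numbers: the function name dr_sync_hours and the six-hour and four-hour figures are assumptions for illustration, and it assumes replication can keep pace with the backup when the two overlap.

```python
def dr_sync_hours(backup_hours, transfer_hours, wait_for_backup):
    """Hours from backup start until the DR site holds the full backup."""
    if wait_for_backup:
        # Replication cannot start until the whole backup job has finished.
        return backup_hours + transfer_hours
    # Replication overlaps the backup; the slower of the two sets the finish time.
    return max(backup_hours, transfer_hours)


# Hypothetical job: 6 hours to back up, 4 hours of WAN transfer.
waits = dr_sync_hours(6, 4, wait_for_backup=True)
overlaps = dr_sync_hours(6, 4, wait_for_backup=False)
print(f"waits for job: {waits} h, replicates as data arrives: {overlaps} h")
```

Plug in your own backup window and WAN transfer time to see whether a device that waits for job completion would still meet your DR goals.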
If you are not replicating, then you are probably going to move the backup from disk to tape as soon and as quickly as possible. Try making this move to tape, if you even can, right after the backup completes. Is performance impacted so much that the tapes cannot be fed at full speed, causing shoe-shining that greatly degrades throughput? See what impact it has on the post process de-dupe as well; the combined jobs may render the appliance unusable, forcing you to pick one or the other.
Most importantly, understand the management overhead required to administer post-process designs. The additional complexity is often overlooked; however, it needs to be compared with the simplicity of automatic electronic vaulting with inline de-dupe systems. Backup environments are extremely complex today; why add further complexity when it’s not needed?
Test how Granular the De-Dupe
Another important capability to examine is the level of granularity. Test how the algorithm that the data de-duplication appliance uses to identify redundant segments performs on your specific data set. Results can vary significantly from solution to solution. The data de-duplication phrase is tossed around in a very cavalier manner today. To be effective, data de-duplication needs to be done at a sub-file level using variable length segments. File level de-duplication, or de-duping across identically named files, would not de-dupe out daily backups of your database or Exchange environment, whereas sub-file de-duplication will.
For example, suppose you are backing up a large database that changes throughout the day. With the typical backup application you have to back up, and more importantly store, the entire database with each backup. An incremental won't help you; you are counting on data de-duplication. If you were to back up that database to a device without data de-duplication, or with only file level de-duplication, you would have to store that entire database every day, just like your backup application. With a de-duplication device you only have to store the net changes to that database. That creates a huge increase in storage efficiency. With variable length segment level de-duplication you can back up the same database to the device on two successive nights and, because the device can identify redundant segments, only the segments that have changed will be stored. All the redundant data will have pointers established. Granularity of the data de-duplication is critical.
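The storage arithmetic behind that example can be sketched in a few lines of Python. The figures used here (a 1 TB database, a 2% daily change rate and 30 days of retention) are hypothetical, as is the function name stored_tb; the model also ignores compression and metadata overhead.

```python
def stored_tb(db_size_tb, daily_change, days, dedupe):
    """Disk consumed by nightly full backups of one database, in TB."""
    if not dedupe:
        # Every nightly full is stored in its entirety.
        return db_size_tb * days
    # First full is stored once; each later night adds only the changed data.
    return db_size_tb + db_size_tb * daily_change * (days - 1)


without = stored_tb(1.0, 0.02, 30, dedupe=False)
with_dedupe = stored_tb(1.0, 0.02, 30, dedupe=True)
print(f"without de-dupe: {without:.2f} TB, with sub-file de-dupe: {with_dedupe:.2f} TB")
```

Substitute your own database size, change rate and retention period to estimate what a sub-file approach should save before you test a device.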
Also test how the device handles data from applications that can shift the whole file, making it appear different even at a segment level. This happens in some of the standard office productivity applications on re-saves. The new XML format of Microsoft Office 2007 is something to test specifically. De-dupe approaches that use fixed length or fixed block segments have problems detecting these changes.
In addition to segment level granularity, the de-duplication should be data independent. Some solutions require an understanding of the actual data being sent to them. For example, they understand that your Exchange data is in fact Exchange data, claiming extra de-duplication efficiency. The risk with this approach is that you will have to count on that supplier to keep up to date with changes to data formats and layouts. This may effectively constrain your ability to deploy new versions of existing applications or new software applications. Additionally, this limits the usefulness to application specific silos rather than providing a general purpose de-dupe storage system.
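The shifting problem is easy to reproduce. The Python sketch below inserts a single byte at the front of a block of data and compares how many segments a fixed length approach and a content-defined (variable length) approach can re-use. It is a toy under stated assumptions: the boundary rule (a CRC32 of the previous 16 bytes, modulo 64) is chosen for brevity, and real products use their own rolling hashes plus minimum and maximum segment sizes.

```python
import hashlib
import random
import zlib


def fixed_chunks(data, size=64):
    """Cut the data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, divisor=64):
    """Cut wherever a hash of the preceding `window` bytes hits a chosen value."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        if zlib.crc32(data[i - window:i]) % divisor == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks


def shared_fraction(old, new, chunker):
    """Fraction of the new data's segments already present in the old data."""
    known = {hashlib.sha256(c).digest() for c in chunker(old)}
    new_chunks = chunker(new)
    hits = sum(1 for c in new_chunks if hashlib.sha256(c).digest() in known)
    return hits / len(new_chunks)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
shifted = b"X" + original  # one inserted byte shifts everything after it

fixed_ratio = shared_fraction(original, shifted, fixed_chunks)
variable_ratio = shared_fraction(original, shifted, variable_chunks)
print(f"fixed: {fixed_ratio:.0%} re-used, variable: {variable_ratio:.0%} re-used")
```

Because the variable length boundaries depend on the content rather than on byte offsets, they move along with the data and the segments after the insertion still match.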
Test the effects of failure
Let’s face it, things go wrong with hardware. Drives fail, interface cards fail and systems get disconnected from the network. Test these. Force a drive failure; remove a power cord. Turn off a unit during a backup operation. What happens?
Once you are two or three weeks into your testing, drive failure is an important condition to examine. Testing drive failure means more than failing a drive and making sure that you can keep writing data to the unit or recovering data from it. Most if not all data de-duplication devices will use RAID 5 or RAID 6 and be able to stay online while the failed drive is replaced and the data on that drive rebuilt. What is system performance like during the failed state, while the system is performing the rebuild activity? In many cases performance grinds to a halt, making the unit unusable for backups or recoveries. This is compounded by the fact that most data de-duplication devices use large capacity SATA drives. Vendor implementations of disk reconstruction vary significantly, in some cases taking over 24 hours. The net effect is that you have a unit that is almost totally unusable for over 24 hours, and you may have to resort to tape. In addition, if you are looking at a solution that only uses RAID 5, you are also exposed to total data loss if you suffer a read error or a second drive failure during this long rebuild time.
What happens to data if there is a failure in the middle of a write operation? Test this by pulling the plug in the middle of a backup or archive operation. Backup applications vary in how this failure case is handled, but what happens to the written data? Test recovery of some of the data that was being written when the failure occurred. Does the backup or archive software think that data was written correctly when in fact it was not? Corrupted data is bad, but corrupted data without notification may be terminal.
The other failure condition to test is failure of a WAN connection. During replication to the remote appliance, fail the WAN connection. What happens? I have seen cases where a complete resynchronization has to occur.
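A back-of-the-envelope estimate shows why rebuilds on large SATA drives run so long. The Python sketch below uses hypothetical figures (a 1,000 GB drive and an effective rebuild rate of 10 MB/s while the unit is still serving backups); measure the real rate on the device you are evaluating.

```python
def rebuild_hours(drive_gb, rebuild_mb_s):
    """Hours to rebuild one drive at a sustained effective rate."""
    total_mb = drive_gb * 1000      # decimal units: 1 GB = 1,000 MB
    return total_mb / rebuild_mb_s / 3600


# Assumed values for illustration only.
hours = rebuild_hours(drive_gb=1000, rebuild_mb_s=10)
print(f"estimated rebuild time: {hours:.1f} hours")
```

Halving the effective rebuild rate doubles the window during which a RAID 5 set is exposed to a second failure.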
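One way to catch silent corruption is to record checksums of the source files before the backup and compare them against what you restore after the failure test. The Python sketch below is one possible approach, not a feature of any backup product: the function names are made up for this example, and the temporary directories simply stand in for your source data and your restore target.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, block_size=1 << 20):
    """SHA-256 of a file, read in blocks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's relative path to its checksum, before the backup runs."""
    root = Path(root)
    return {str(p.relative_to(root)): file_digest(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def verify_restore(manifest, restored_root):
    """Return (path, problem) pairs for files that are missing or altered."""
    restored_root = Path(restored_root)
    problems = []
    for rel, expected in manifest.items():
        target = restored_root / rel
        if not target.is_file():
            problems.append((rel, "missing"))
        elif file_digest(target) != expected:
            problems.append((rel, "checksum mismatch"))
    return problems


# Demonstration with stand-in directories: one restored file is corrupted.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    Path(src, "a.dat").write_bytes(b"good data")
    Path(src, "b.dat").write_bytes(b"more data")
    manifest = build_manifest(src)
    Path(dst, "a.dat").write_bytes(b"good data")
    Path(dst, "b.dat").write_bytes(b"more dat?")   # simulated silent corruption
    problems = verify_restore(manifest, dst)

print(problems)
```

Any file the backup software reports as good but that appears in the problem list is exactly the unannounced corruption this test is meant to expose.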
Test Performance during routine housekeeping
Also remember that data de-duplication appliances are very intelligent disk stores. There are a slew of background operations that happen to keep everything in order, for example cleaning up orphaned segments, data integrity checking and other housekeeping chores. Test the performance of the unit while these operations are happening, as well as during a replication process. Testing how the appliance performs while under this load is critical. Send backup jobs while a replication is occurring and then, while those two processes are active, attempt a large restore. Make sure that all these processes maintain an acceptable performance level.
In the inverse world of backups many of these tasks are scheduled to happen during the day. Test a midday restore and of course a backup.
Test backup and recovery over time
Similar to drive failure testing, once the unit has been in place for two or three weeks and you have run two or three weeks of repeated backups (this causes fragmentation), test its ability to recover data that was written two or three weeks ago. In our testing, recovering older data can suffer as much as a 90% performance hit! Remember that while this is the maximum amount of time that you can expect to evaluate an appliance, this problem will get exponentially worse in production as you reach the end of your retention period. Also test recovery of data that you assume to be unique and data that you know to be redundant. Different systems will behave differently with those data sets. But always test recovery of old data, because by its very nature data de-duplication spreads data segments out all over the device, and the device depends on the quality of the background tasks mentioned above to keep things in order. This wrinkle is especially painful. Remember, you bought a data de-duplication device so you could increase your retention of data on disk; it's ironic if that very retention is unusable as the data ages.
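To put a number on the difference, time a restore of recent data and a restore of old data and compare the rates. The Python sketch below is a minimal measuring aid under simple assumptions: the function names are invented for this example, it reads plain files from a mounted path, and operating system caching on the test host can flatter the results if the same file is read twice.

```python
import time


def read_throughput_mb_s(path, block_size=1 << 20):
    """Read a file from start to finish and return the observed rate in MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += len(block)
    elapsed = time.perf_counter() - start
    return (total / 1e6) / elapsed if elapsed > 0 else 0.0


def slowdown_pct(recent_mb_s, old_mb_s):
    """Percentage drop in throughput when restoring old data versus recent data."""
    return (recent_mb_s - old_mb_s) / recent_mb_s * 100.0


# Example with made-up rates: 100 MB/s for last night's data, 10 MB/s for old data.
print(f"performance hit: {slowdown_pct(100.0, 10.0):.0f}%")
```

Repeat the measurement each week of the evaluation so you can see the trend rather than a single data point.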
Test Concurrent Operations
One of the advantages of disk is that it can act like multiple “virtual” tape drives. Make sure that you test concurrent throughput. Also pay attention to the number of disk drives required to achieve that performance. Some de-dupe systems provide high performance while using disks efficiently; however, many others require lots of drives. The large drive count may make the numbers better, but it adds management complexity and operational costs, compounds your risk of RAID failure with SATA, and may not be feasible in your environment anyway. Run multiple write operations to the device simultaneously. If your software supports multiple media servers or backup servers, have those all running simultaneous jobs. Make sure performance remains acceptable.
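If you want a quick sanity check outside the backup application, a small script can drive several write streams at once against a share exported by the device. The Python sketch below is a rough aid, not a benchmark: the function names and sizes are assumptions, it writes random (and therefore unique) data, which is a worst case for de-duplication, and the temporary directory stands in for the mount point of the appliance. Real backup data will behave differently.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def write_stream(path, total_mb):
    """Write total_mb of random data, 1 MB at a time, and force it to disk."""
    with open(path, "wb") as f:
        for _ in range(total_mb):
            f.write(os.urandom(1024 * 1024))   # fresh random block each time
        f.flush()
        os.fsync(f.fileno())
    return total_mb


def concurrent_write_test(target_dir, streams=4, mb_per_stream=4):
    """Run several write streams in parallel; return (MB written, aggregate MB/s)."""
    paths = [os.path.join(target_dir, f"stream_{i}.bin") for i in range(streams)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=streams) as pool:
        written = sum(pool.map(lambda p: write_stream(p, mb_per_stream), paths))
    elapsed = time.perf_counter() - start
    return written, written / elapsed


# The temporary directory is a placeholder for the appliance's mount point.
with tempfile.TemporaryDirectory() as target:
    written_mb, rate = concurrent_write_test(target, streams=4, mb_per_stream=4)

print(f"{written_mb} MB written at {rate:.1f} MB/s aggregate")
```

Run it with one stream first and then with several, and compare the aggregate rates to see how the device scales.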
Also test concurrent read operations, especially if you are going to be pushing to tape. You need to be able to sustain top performance with multiple reads from the device in order to get the maximum from the new generation of tape drives. Again it would be ironic to invest in a disk-to-disk backup solution only to find out that you STILL can’t push your tape drive. Also test jobs of different sizes and if possible different applications simultaneously.
Remember, you back up in order to recover, and during a big recovery you will want to perform concurrent recoveries from the disk to shorten your recovery window. Again, some data de-duplication appliances will suffer significant performance losses during this process. Many backup applications will try to run multiple recovery streams as a default for performance reasons. If your data de-duplication appliance is not up to the task, your recoveries may actually end up going slower than if you were using tape. The workaround would be to manually manage a single stream recovery process. When the pressure is on and you need to recover a server, the last thing you want to have to do is baby-sit your device through the process.
Test Single Streams
Don’t test only concurrent streams; also test single stream recovery, for example of a database that needs to be recovered quickly. You typically need that one file or database back, not a full system recovery that is going to require concurrent reads. In this scenario it is critical to get that data back fast. Specifically, do this single stream recovery from an older data set.
Test Multiple Protocols
If possible test these multiple streams through multiple protocols. If your appliance can support network attach through NFS and CIFS for example, run several backup jobs through a Windows backup server and several through a UNIX or Linux backup server. If your appliance can support fibre attached as well, test all three.
Test the cost of Free
Be wary of a supplier that, for example, “gives” you the de-duped capacity. Free is not free: it's not just the cost of the storage but also the run costs of that storage. An example is a supplier that tries to match a de-duped 20TB with a “real” 20TB of SATA disk. Remember that in the long run you have to pay to power, cool and, most importantly, manage this capacity. Beyond these run costs, hard costs may come from the additional storage through software licensing of the SAN array and of the backup software, as most of these licenses are capacity based. Also be wary of a vendor that offers de-dupe as a later software add-on. This is typically done post process and means you have all the expense and complexity associated with post-process de-duplication, as mentioned in the earlier article.
As you can see, there is more than meets the eye to selecting a data de-duplication device. There are a lot of vendors scrambling to catch the Data De-Dupe wave. To be successful surfing that wave, make sure you work through all of the above at a minimum.
Tuesday, October 9, 2007