Optimized Migrations
Increasing the ROI of Nearline Disk Repositories
Because of the tightening economic situation, using disk to store archive data and reducing the amount of capacity on primary storage is becoming a more popular project each day. The cost benefit of moving data off primary storage and onto higher-capacity nearline storage is significant. However, just as in other areas of data center optimization, customers are looking for opportunities to optimize that secondary tier even further.
The basic premise of a disk archive is to move old data off primary storage and onto a less expensive secondary storage platform. Primary storage is designed for high-speed, low-latency and highly available access to mission-critical information, but it has also become clogged with older, seldom-accessed data. Nearline and archive storage address this issue by using very high-capacity, slower-access drives, often with the same levels of high availability and with some form of data retention capability built in.
Depending on the system, primary storage can average about $6 to $10 per GB to store data, whereas disk archives are trying to deliver systems that can store data at $3 to as low as $2 per GB.
This represents a significant cost savings, but customers want to increase the delta even further, and as a result some disk manufacturers have added data reduction technologies to their systems. The challenge with standard deduplication is that it is not nearly as effective when used for online storage as it is with disk backup, where it was made popular. To be successful, deduplication must find a high amount of redundant data, hence its success in backup, where 80% or more of every full backup job is identical to the previous full backup job. As a result, data reduction ratios of 20X are not uncommon in full backups.
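To make the link between redundancy and reduction ratio concrete, here is a minimal sketch of fixed-size block deduplication. It is an illustration under stated assumptions, not any vendor's implementation: the function name `dedup_ratio`, the 4 KB block size and the use of SHA-256 fingerprints are all hypothetical choices made for this example.

```python
import hashlib


def dedup_ratio(streams, block_size=4096):
    """Return logical bytes divided by unique stored bytes.

    `streams` is an iterable of bytes objects (for example, successive
    full backups). Blocks are fixed-size and identified by SHA-256 digest;
    both choices are assumptions made for this illustration.
    """
    seen = set()   # fingerprints of blocks already stored
    logical = 0    # bytes presented to the store
    stored = 0     # bytes actually kept after deduplication
    for data in streams:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            logical += len(block)
            digest = hashlib.sha256(block).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(block)
    return logical / stored if stored else 0.0
```

In this simplified model the ratio depends on both how much of each stream repeats and how many streams are retained. A single copy of unique data yields 1X no matter how clever the comparison is, which is why a data set without repeated near-identical copies sees far lower ratios than a series of full backups.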
Online data sets don’t benefit from the same level of almost identical file similarity that backup jobs provide. Unless the customer is archiving the same database or VMware images every night, where there would be a significant amount of redundant data, the typical archive system only sees about a 3X benefit from deduplication. When you factor in the performance impact of deduplication and the need for all the disks to be active so deduplication comparisons can be made, the value of onboard deduplication comes into question. Onboard deduplication in an archive requires extra expense in CPUs and reduces the ability to do power management.
Nearline storage systems and disk-based archives have a second challenge beyond cost-effective capacity: in most cases they are systems, not solutions. Most offer no built-in migration capabilities. The question always arises: “How do I get data to this thing?” The customer must look for assistance beyond the archive supplier to complete the offering and make it a solution.
This actually becomes an advantage because the customer is free to consider migration solutions that can also perform data optimization during the move. This is the ideal complement to disk-based archiving because it performs the data movement while also applying more advanced data reduction techniques for maximum cost reduction.
Companies like Ocarina Networks provide software solutions that will scan the primary environment for old data, optimize that data for minimal storage footprint and then store the data on the disk archive. This can all be done transparently to the user without intervention.
The net effect of an optimized migration application is that it can reduce the cost of the archive from $3 to $5 per GB to potentially under $1 per GB.
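The arithmetic behind that claim can be sketched in a few lines. This assumes the simplest possible model, in which the effective cost is the raw cost per GB divided by the data reduction ratio; the function name `effective_cost_per_gb` is hypothetical, and software licensing and other costs are deliberately left out.

```python
def effective_cost_per_gb(raw_cost_per_gb, reduction_ratio):
    """Cost per GB of logical data after data reduction.

    Simplifying assumption: only raw capacity cost is modeled, and a
    reduction ratio of N means N GB of logical data fit in 1 GB of disk.
    """
    if reduction_ratio <= 0:
        raise ValueError("reduction_ratio must be positive")
    return raw_cost_per_gb / reduction_ratio
```

Under this model, a $5 per GB archive at an 8X reduction works out to about $0.63 per GB, and a $3 per GB archive at 10X works out to $0.30 per GB. At the 3X typical of onboard deduplication, the same archives land between $1.00 and roughly $1.67 per GB.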
The Architecture of Optimized Migration
The first component is the optimizer, which performs the actual optimization. This is typically an out-of-band appliance or a software module that has been added to NAS solutions from vendors like HP and BlueArc. During user-defined non-peak hours, the optimizer scans its assigned file systems for data that has not been accessed in a user-defined period of time. Once identified, that data is assessed for optimization.
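The discovery step described above can be sketched as a simple file system walk. This is a minimal illustration, not how any particular optimizer works: the function name `find_inactive_files` and its parameters are hypothetical, and it assumes the file system records last-access times reliably, which is not the case on volumes mounted with access-time updates disabled.

```python
import os
import time


def find_inactive_files(root, max_idle_days, now=None):
    """Yield (path, size_in_bytes) for files not accessed in max_idle_days.

    Assumption: st_atime reflects real user access. `now` can be passed
    in (seconds since the epoch) to make the scan reproducible.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # 86400 seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if info.st_atime < cutoff:
                yield path, info.st_size
```

A real optimizer would run such a scan only inside the configured non-peak window and would hand each candidate to the optimization step rather than simply listing it.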
One of the advantages of a nearline disk archive is that recall time, compared to traditional tape or optical, is very fast and often unnoticeable to the user. This allows that user-defined period of time to be more aggressive, maximizing the potential cost savings and increasing the period of time between primary storage purchases.
Since optimized migration works only on data that is not currently active, it can afford a more detailed analysis of each file. Inline or real-time deduplication processes must perform these tasks very quickly, so they can only make broad comparisons of the data and will likely miss duplicate data in favor of speed. Optimized migration, on the other hand, has the time to do a much more detailed analysis of the files to be migrated.
The detailed comparison includes examining the inside of complex files like PDFs or PowerPoint files to find images and other data that may already be stored elsewhere in the archive. In addition, optimized migration solutions can perform optimization on rich media content like images, audio and soon video to look for similar data within those data sets. For a detailed description of how optimized migration meets the challenge of rich media, see our related article “Data Reduction for Online Image Storage”.
The result is a slightly slower deduplication process on inactive data in exchange for an increase in optimization of the nearline or archive disk area. In cases where standard deduplication only achieved a 3X optimization, optimized migration achieved an 8X to 10X optimization, thus further driving down the cost of the disk archive area. This ROI is made even more significant when you consider that most disk archives are replicated for redundancy, so this 10X savings is seen twice and the bandwidth needed to perform the replication is also reduced by that same 10X.
The final component is retrieval from the archive area. Because of the optimized migration, this retrieval requires data reconstruction, which might be a concern. The component that performs this operation is often called the reader. It is an in-band software module that can be placed on the same hardware as the optimizer or built into the NAS systems mentioned above.
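The replication effect can be worked through with a small model. This is a sketch under simplifying assumptions: every copy holds the same reduced data, replicas are seeded entirely over the network, and the names `ArchiveFootprint` and `footprint` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ArchiveFootprint:
    stored_tb: float      # physical capacity consumed across all copies
    replicated_tb: float  # data sent over the wire to populate replicas


def footprint(logical_tb, reduction_ratio, copies=2):
    """Estimate capacity and replication traffic for a replicated archive.

    Assumptions: data is reduced before it is replicated, and each of the
    `copies` sites stores the same reduced data.
    """
    if reduction_ratio <= 0 or copies < 1:
        raise ValueError("reduction_ratio must be positive and copies >= 1")
    physical_tb = logical_tb / reduction_ratio
    return ArchiveFootprint(
        stored_tb=physical_tb * copies,
        replicated_tb=physical_tb * (copies - 1),
    )
```

For 100 TB of logical data kept in two copies, a 10X reduction gives 20 TB stored and 10 TB replicated, while a 3X reduction gives roughly 66.7 TB stored and 33.3 TB replicated.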
Not only is the reconstruction and access to these files seamless to the users, it also does not have the same performance impact that the optimization process does. This is important because while taking extra time to optimally store inactive data is not of major significance, time to retrieval may be. Keeping the impact of optimization out of the retrieval process is critical for convincing users to aggressively migrate old data.
Five Way ROI of Optimized Migration
In tight economic times, the faster a project delivers a return on its investment, the faster it can start saving the organization money. The more areas in which the solution delivers that return, the faster the ROI is achieved. With most solutions, an ROI derived from two areas of impact is considered attractive; optimized migration delivers a five-way ROI.
First, it eliminates the need for a separate discovery and migration software application. These applications often cost $5,000 to $10,000 per TB. It also eliminates the need for a global file system type of tool, since it manages where the data will go; global file systems, even without implementation, can cost upwards of $10,000 to $15,000. Optimized migration, as stated earlier, also increases the storage efficiency of the typical archive from 3X to 10X, driving down the per-GB cost of archive storage from an already efficient $3 to $5 per GB to under $1 per GB.
Finally, it enables green archives that power down disks. Onboard deduplication typically requires that all the drives on the disk archive remain spinning because the deduplication comparisons are global. Optimized migration maintains its own metadata and does not need to verify against older, non-spinning drives. This allows for maximum power efficiency and maximum archive capacity. Archives with onboard deduplication reduce the cost to store data, not the cost to power that data. Archives with MAID that leverage optimized migration can deliver optimized capacity costs at minimal power utilization.
Thursday, March 19, 2009