Any storage system holds a lot of redundant data – the same files stored by one person or by many different people. Data De-duplication removes that redundant data from the storage system by retaining just one copy and placing pointers to this original copy everywhere else. This saves a lot of storage space, especially in multi-user storage/backup scenarios.
What is Data De-duplication?
Simply put, Data De-duplication is the process of removing duplicate data from storage systems. Consider this scenario: You are the technical manager of an IT services company. There is a huge installation file, which you send to 10 of your subordinates. They promptly download a copy to their computers. Now there are 11 copies of exactly the same data, across the computers and the email server.
When a backup is taken for the day / continuously (using Continuous Data Protection or similar technologies), all 11 copies of the same file are stored/backed up individually (within their respective directories), wasting storage resources. Instead, if the storage system can recognize that these files are identical, it can maintain just one full copy and keep pointers to it everywhere else.
So, Data De-duplication is a technology that actively looks for matching patterns in all incoming data and replaces redundant data with pointers to an original copy that is saved just once. This saves a lot of disk space, and also the bandwidth required for off-site backup / Disaster Recovery scenarios.
Data De-duplication can be done at the file level or the block level. It is often used along with compression technologies to save even more disk space. Mostly, it is a transparent technology applied by the storage system without any user intervention, either as the data arrives (in-line) or as a post-processing activity.
Types of Data De-duplication:
There are two major types of data de-duplication – Block level & File level.
1. Block level Data De-duplication: All data entering the storage system, irrespective of file size, is divided into small equal-sized blocks (a few KB each). Each block is run through a hashing algorithm, and the resulting hash value is stored in a separate index file. Every incoming block is hashed the same way, and its hash is compared with the previously stored values (of existing blocks) in the index file to check for an exact match.
If the hashes match, the incoming block is discarded and a pointer is stored in its place, referencing the hash value of the matching block in the index file. If the hash does not match any existing value, the block is stored in full and its hash is added as a new entry in the index file.
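The steps above can be sketched in a few lines of Python. This is a toy illustration, not any vendor's implementation: the 4 KB block size, SHA-256 hash, and in-memory dicts standing in for the index file and block store are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real systems tune this (a few KB)

def dedup_store(data: bytes, index: set, store: dict) -> list:
    """Split data into fixed-size blocks and store each unique block once.

    Returns a list of hash values (the "pointers") that reconstruct the data.
    `index` plays the role of the index file; `store` holds the actual blocks.
    """
    pointers = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in index:       # no match: store the block in full
            store[digest] = block
            index.add(digest)         # new entry in the index
        pointers.append(digest)       # match: only a pointer is kept
    return pointers

def restore(pointers: list, store: dict) -> bytes:
    """Follow the pointers to rebuild the original data."""
    return b"".join(store[p] for p in pointers)
```

Storing the same data a second time adds nothing to the block store – only a second list of pointers is produced.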
This technique saves more disk space than file level De-duplication because data is divided into small blocks, and wherever a block repeats it is replaced by a pointer. The probability of finding duplicate blocks is higher with smaller reference units (blocks).
Also, if the original file changes, only the blocks affected by the change are re-written, instead of re-writing the whole file. Some vendors use variable-sized blocks, instead of fixed blocks, for obtaining better storage efficiency with larger files.
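To give a flavour of how variable-sized blocks can be cut, here is a toy content-defined chunking sketch: block boundaries are placed wherever a rolling checksum over recent bytes matches a bit pattern, so boundaries follow the content rather than fixed offsets. The window size and mask are arbitrary choices for illustration, not any vendor's actual parameters.

```python
def chunk_boundaries(data: bytes, window: int = 16, mask: int = 0x1FFF) -> list:
    """Return end offsets of content-defined chunks (toy example).

    A boundary is declared when the low bits of a simple rolling checksum
    are all zero; with a 13-bit mask this averages roughly 8 KB chunks.
    """
    boundaries, h = [], 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF  # crude rolling checksum
        if i >= window and (h & mask) == 0:
            boundaries.append(i + 1)     # cut here; content decides the spot
            h = 0
    boundaries.append(len(data))         # final chunk ends at end of data
    return boundaries
```

Because boundaries depend on content, inserting a few bytes near the start of a large file shifts only nearby chunks; later chunks keep their old boundaries and still de-duplicate, which is why variable-sized blocks help with larger files.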
But the higher efficiency of block level De-duplication comes at a cost – more processing resources, more RAM (primary memory) utilization and more storage space (secondary memory, for storing the index files).
2. File Level Data De-duplication: File level De-duplication works like block level De-duplication, but instead of dividing data into small blocks and comparing those blocks, whole files are compared with incoming files. So, if the same file needs to be stored again anywhere on the storage system, only a pointer to the location of the initial copy is stored, instead of storing the whole file again and again in different places.
The storage space saved by this technique is usually less than with the block level process, but it is still significant. It also offers faster operation, lower overheads (processing capacity, RAM capacity, etc.) and lower cost.
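Applied to the 11-copies scenario from the introduction, file level De-duplication reduces to hashing whole files. The sketch below makes the same assumptions as the block level example (SHA-256, in-memory dicts); the filenames are hypothetical.

```python
import hashlib

def file_dedup_store(name: str, data: bytes, index: dict, store: dict) -> str:
    """Store a whole file once; later copies become pointers to the first.

    `index` maps each filename to a hash; `store` holds one copy per hash.
    """
    digest = hashlib.sha256(data).hexdigest()
    if digest not in store:
        store[digest] = data   # first copy is stored in full
    index[name] = digest       # every duplicate is just a pointer entry
    return digest
```

Eleven users saving the same installer produce eleven index entries but only one stored copy.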
To achieve even more disk space savings, Data Compression is often used along with Data De-duplication, so that data is compressed using various compression algorithms before it is actually saved to disk. The compression efficiency depends on the algorithm used, and often results in a 2:1 disk space reduction.
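Combining the two is straightforward: each unique piece of data is compressed before it is written, and decompressed on retrieval. A minimal sketch, using Python's standard zlib as a stand-in for whatever algorithm a real system would choose:

```python
import hashlib
import zlib

def store_compressed(data: bytes, store: dict) -> str:
    """De-duplicate by hash, then compress each unique copy before storing."""
    digest = hashlib.sha256(data).hexdigest()
    if digest not in store:
        store[digest] = zlib.compress(data)  # only unique data is compressed
    return digest

def load(digest: str, store: dict) -> bytes:
    """Reverse process on retrieval: decompress the stored copy."""
    return zlib.decompress(store[digest])
```

For repetitive data the compressed copy is much smaller than the original, on top of the savings from storing it only once.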
Of course, while retrieving the stored data, the reverse processes of Data De-duplication and Data Compression need to be applied, which adds processing overhead. The same is true when transferring data between multi-vendor storage/backup devices that do not recognize each other's De-duplication techniques.