Journal of Networks, Vol 6, No 12 (2011), 1705-1712, Dec 2011
doi:10.4304/jnw.6.12.1705-1712

A Method of Object-based De-duplication

Fang Yan, YuAn Tan

Abstract


Today the world is awash in unstructured data, not only because of the Internet, but also because data that used to be collected on paper or on media such as film, DVDs and compact discs has moved online [1]. Most of this data is unstructured and comes in diverse formats such as e-mail, documents, graphics, images, and videos. Object storage has a clear advantage in managing the complexity and scale of unstructured data. Object-based data de-duplication is currently the most advanced method and is an effective solution for detecting duplicate data. It can detect common embedded data across completely unrelated files in the very first backup, even when the physical block layout changes. However, almost all current research on data de-duplication ignores the content of different file types and has no knowledge of the backup data format. Such methods have been shown not to achieve optimal performance for compound files.

In our proposed system, we first extract objects from files and then obtain Object_IDs by applying a hash function to the objects. The resulting Object_IDs serve as indexing keys in a B+-tree-like index structure. This avoids the need for a full object index and reduces the search time for duplicate objects to O(log n). We also introduce a new concept, the duplicate object resolver. The object resolver mediates access to all objects and is the central point for managing their metadata and indexes. All objects are addressable by their IDs, which are universally unique. The resolver stores metadata in triple format. This improved metadata management strategy allows object properties to be set, added and resolved with high flexibility, and allows the same metadata to be reused among duplicate objects.
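The flow described above (hash each object to an Object_ID, look the ID up in an ordered index, keep metadata as triples) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the class name ObjectResolver and its methods are hypothetical, SHA-256 stands in for the unspecified hash function, and a sorted list with binary search stands in for the B+-tree-like index. Lookups are O(log n) as in the abstract, but inserting into a Python list is O(n), which a real B+ tree avoids.

```python
import bisect
import hashlib


class ObjectResolver:
    """Hypothetical resolver: an ordered Object_ID index plus a triple store."""

    def __init__(self):
        self._ids = []      # sorted Object_IDs; stand-in for a B+-tree-like index
        self._triples = []  # metadata as (object_id, property, value) triples

    @staticmethod
    def object_id(data):
        # Assumption: SHA-256; the abstract does not name the hash function.
        return hashlib.sha256(data).hexdigest()

    def _find(self, oid):
        # Binary search over the sorted IDs: O(log n) comparisons.
        i = bisect.bisect_left(self._ids, oid)
        return i if i < len(self._ids) and self._ids[i] == oid else -1

    def store(self, data):
        """Return (Object_ID, is_duplicate) for an extracted object."""
        oid = self.object_id(data)
        if self._find(oid) >= 0:
            return oid, True           # already indexed: duplicate object
        bisect.insort(self._ids, oid)  # keep the index ordered
        return oid, False

    def set_property(self, oid, prop, value):
        # Metadata is attached to the Object_ID, so duplicates share it.
        self._triples.append((oid, prop, value))

    def resolve(self, oid):
        # Collect every property recorded for this Object_ID.
        return {p: v for (o, p, v) in self._triples if o == oid}


resolver = ObjectResolver()
first_id, first_dup = resolver.store(b"embedded image bytes")
second_id, second_dup = resolver.store(b"embedded image bytes")
resolver.set_property(first_id, "type", "image")
```

Because the metadata triples are keyed by Object_ID rather than by file, the second copy of the object resolves to the same properties as the first without storing them again.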



Keywords


data de-duplication, object-based, backup, object index, metadata

References


Dell product group, Object Storage — A Fresh Approach to Long-Term File Storage, A Dell Technical White Paper.

Tony A, Biggar H. Data De-Duplication and Disk-to-Disk Backup Systems: Technical and Business Considerations. The Enterprise Strategy Group Technical Report. 2007.

Biggar H. Experiencing Data De-Duplication: Improving Efficiency and Reducing Capacity Requirements. The Enterprise Strategy Group Technical Report. 2007.

William J. Bolosky, Scott Corbin, David Goebel, and John R. Douceur, Single Instance Storage in Windows 2000, In Proceedings of the 4th USENIX Windows Systems Symposium, USENIX Association, Berkeley, CA, USA, 2000.

An in-depth look at data deduplication methods, The Enterprise Strategy Group Technical Report, www.falconstor.com.

A. Muthitacharoen, B. Chen, and D. Mazieres. A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP’01), pages 174–187, Banff, Canada, October 2001.

Goutham Rao, San Jose, Eric Brueggemann, Carter George, Object deduplication and application aware snapshots, patent application publication, US, 2010.

Zhu B, Li K, Patterson H. Avoiding the disk bottleneck in the Data Domain deduplication file system. In: Proceedings of the 6th USENIX Conference on File and Storage Technologies. 2008.

George Forman, Kave Eshghi, Stephane Chiocchetti, Finding Similar Files in Large Document Repositories. In the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’05), Chicago, USA, August 2005.

R. Bayer and E. McCreight, "Organization and Maintenance of Large Ordered Indexes", Acta Informatica, Volume 1, Springer, Berlin/Heidelberg, New York, 1972, pp. 173-189.

S. Walter, T. Thiago, M. Carla and Wagner Meira Jr., "A Scalable Parallel Deduplication Algorithm", 19th International Symposium on Computer Architecture and High Performance Computing, IEEE Computer Society, Brazil, 2007, pp. 79-86.

W. You et al., "PRUN: Eliminating Information Redundancy for Large Scale Data Backup System", International Conference on Computational Sciences and Its Applications (ICCSA 2008), IEEE Computer Society, Italy, 2008.

V. Henson and R. Henderson. Guidelines for Using Compare-by-Hash. Forthcoming, 2005. http://infohost.nmt.edu/~val/review/hash2.html

Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezise G, Camble P. Sparse indexing: large scale, inline deduplication using sampling and locality. In: Proceedings of the 7th USENIX Conference on File and Storage Technologies. 2009.

Quinlan S, Dorward S. Venti: a new approach to archival storage. In: Proceedings of the Conference on File and Storage Technologies. 2002, 89–101.

  




Journal of Networks (JNW, ISSN 1796-2056)

Copyright @ 2006-2012 by ACADEMY PUBLISHER – All rights reserved.