Hash Functions and Applications
Published in Jonathan Katz, Yehuda Lindell, Introduction to Modern Cryptography, 2020
If H is a collision-resistant hash function, the hash (or digest) of a file serves as a unique identifier for that file. (If any other file is found to have the same digest, this implies a collision in H.) The hash H(x) of a file x can thus serve as a “fingerprint” for x, and one can check whether two files are equal by comparing their digests. This simple idea has many applications.

Virus fingerprinting: Virus scanners identify whether incoming files are potential viruses. Often, this is done not by analyzing the incoming file to determine whether it is malicious, but instead simply by checking whether the file is in a database of previously identified viruses. The observation here is that rather than comparing the file to each virus in the database, it suffices to compare the hash of the file to the hashes (i.e., fingerprints) of known viruses. This can lead to improved efficiency, as well as reduced communication if the database is stored remotely.

Deduplication: Data deduplication is used to eliminate duplicate copies of data, especially in the context of cloud storage where multiple users rely on a single cloud service to store their data. The key insight is that if multiple users wish to store the same file (e.g., a popular video), then the file only needs to be uploaded and stored once, and need not be uploaded and stored separately for each user. Deduplication can be achieved by first having a user upload a hash of the new file they want to store; if a file with this hash is already stored on the server, then the cloud-storage provider can simply add a pointer to the existing file to indicate that this specific user has also stored it, thus saving both communication and storage. The soundness of this approach follows from the collision resistance of the hash function.

Peer-to-peer (P2P) file sharing: In P2P file-sharing systems, servers store different files and can advertise the files they hold by broadcasting the hashes of those files. Those hashes serve as unique identifiers for the files and allow clients to easily find out which servers host a particular file (identified by its hash).
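As a concrete illustration of the fingerprinting idea, the following Python sketch computes SHA-256 digests of files and uses them for hash-based lookup, as in the virus-fingerprinting and P2P examples above. The database variables and function names are hypothetical placeholders for illustration, not part of the original text.

```python
# A minimal sketch of hash-based file fingerprinting, assuming SHA-256 plays
# the role of the collision-resistant hash H; the digest databases below are
# hypothetical placeholders.
import hashlib

def fingerprint(path: str) -> str:
    """Compute H(x), the fingerprint of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return h.hexdigest()

# Virus fingerprinting: compare the file's digest against the digests of
# previously identified viruses, rather than against the viruses themselves.
known_virus_digests: set[str] = set()   # fingerprints of known viruses go here

def is_known_virus(path: str) -> bool:
    return fingerprint(path) in known_virus_digests

# P2P file sharing: servers advertise the digests of the files they hold, so
# a client can look up which servers host a file identified by its digest.
servers_by_digest: dict[str, list[str]] = {}  # digest -> advertising servers

def find_hosts(digest: str) -> list[str]:
    return servers_by_digest.get(digest, [])
```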
A secure and efficient data deduplication framework for the internet of things via edge computing and blockchain
Published in Connection Science, 2022
Zeng Wu, Hui Huang, Yuping Zhou, Chenhuang Wu
The Internet of Things (IoT) is an extended network based on the Internet and an essential part of the new generation of information technology (Afzal et al., 2008). IoT devices collect, process, and share relevant data (Karati et al., 2021; Lin et al., 2021) and store it with a cloud service provider (CSP) (Cui et al., 2015; C. Zhang et al., 2018; Z. Zhang et al., 2017). However, the data collected by IoT devices contains a growing amount of duplicates. A large amount of duplicate data results in a massive waste of resources, bringing enormous cost pressure on CSPs (Stanek & Kencl, 2016). Data deduplication (Bolosky et al., 2000) is a data-storage optimisation technology that retains a single physical copy of data in the cloud to avoid storing duplicates, thereby saving storage costs (Hovhannisyan et al., 2018; Y. Tian et al., 2014). Therefore, data deduplication is widely used in IoT scenarios and has been the topic of much research. However, many problems remain to be solved.
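The upload-hash-first style of deduplication described in the excerpts above can be written out in a few lines. The Python sketch below assumes SHA-256 digests as file identifiers; the DedupStore class and its method names are hypothetical illustrations, not an API from any of the cited works.

```python
# A minimal sketch of hash-based deduplicated storage: the server keeps one
# physical copy per digest and only records a pointer for repeat uploads.
import hashlib

class DedupStore:
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}      # digest -> single physical copy
        self._owners: dict[str, set[str]] = {}  # digest -> users holding a pointer

    def has(self, digest: str) -> bool:
        """Step 1: the client sends only the digest to ask whether the file is known."""
        return digest in self._blobs

    def add_pointer(self, digest: str, user: str) -> None:
        """Record that `user` also stores the (already uploaded) file."""
        self._owners.setdefault(digest, set()).add(user)

    def upload(self, data: bytes, user: str) -> str:
        """Step 2: upload the file only when its digest was not yet stored."""
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        self.add_pointer(digest, user)
        return digest

# Client-side protocol: hash locally, ask the server, upload only if needed.
def store_file(store: DedupStore, user: str, data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    if store.has(digest):
        store.add_pointer(digest, user)  # saves both communication and storage
    else:
        store.upload(data, user)
    return digest
```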
SEEDDUP: A Three-Tier SEcurE Data DedUPlication Architecture-Based Storage and Retrieval for Cross-Domains Over Cloud
Published in IETE Journal of Research, 2023
Cloud storage has attracted many companies today [1]. Data deduplication is a compression technique that removes duplicate copies of files, and it has gained increasing interest in cloud storage. Most cloud-storage solutions, such as Dropbox, Mozy, Google Drive, and Wuala, have adopted deduplication to save network bandwidth and cloud storage space [2]. According to an IDC (International Data Corporation) report, the size of data in the whole digital world was expected to reach 40 trillion GB by the end of 2020 [3,4]. Multimedia data files, such as images, video, audio, or text documents, account for around 90% of this data. Handling such data securely and efficiently in the cloud is a major challenge [5]. For data deduplication, files of various formats, such as mp3, jpg, pdf, obb, and unknown formats, are used [6]. The computation costs for data storage, retrieval, and management are rapidly increasing, and keeping storage costs low has become a great challenge in the field of cloud computing [7].

Data deduplication can be classified into four types: file-level chunking, fixed-size chunking, variable-size chunking, and content-aware chunking [9]. Here, each file is partitioned into a number of small chunks. File-level chunking can eliminate redundant data and is easy to apply. Fixed-size chunking eliminates less redundant data, and its matching probability is relatively poor. Variable-size chunking can easily separate duplicate from non-duplicate information and also minimizes the amount of duplicate information, but its major disadvantage is that it consumes more CPU resources. The content-aware chunking algorithm determines the chunk size based on the content of the file; it can easily handle small updates to the data, and it does not lead to large variance in chunk size. Content-aware chunking is therefore considered the most suitable method for data deduplication [10,11]. Next, hash values (also referred to as fingerprints) are computed for each chunk with hashing algorithms such as MD5, SHA-1, and SHA-256. These hash values uniquely identify the chunks and are transmitted to a centralized cloud for deduplication [4,12,13].
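The chunking-then-fingerprinting pipeline described above can be illustrated with a short sketch. The Python code below is a minimal, illustrative take on content-aware (content-defined) chunking using a simple polynomial rolling hash, followed by SHA-256 chunk fingerprints; the window size, boundary mask, chunk-size bounds, and function names are assumptions chosen for illustration, not parameters from the cited works.

```python
# A minimal sketch of content-defined chunking with per-chunk fingerprints.
import hashlib

WINDOW = 48                 # bytes in the rolling window (assumed)
MASK = (1 << 12) - 1        # boundary when low 12 bits are zero (~4 KiB average)
MIN_CHUNK, MAX_CHUNK = 1024, 16384
PRIME = 1000003
MOD = 1 << 64

def chunk(data: bytes):
    """Yield variable-size chunks whose boundaries depend on the content."""
    pow_w = pow(PRIME, WINDOW - 1, MOD)
    start, h = 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # drop the oldest byte so the hash covers only the last WINDOW bytes
            h = (h - data[i - WINDOW] * pow_w) % MOD
        h = (h * PRIME + byte) % MOD
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def fingerprints(data: bytes) -> list[str]:
    """SHA-256 fingerprint of each chunk; duplicates share the same fingerprint."""
    return [hashlib.sha256(c).hexdigest() for c in chunk(data)]
```

Because chunk boundaries are chosen from the content itself, inserting a few bytes into a file shifts only the chunks around the edit, so most chunk fingerprints, and hence most stored chunks, remain unchanged.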