Analysis of the Efficiency of Data Stream Chunking Methods for Data Deduplication Systems – Herald of Khmelnytskyi National University

ANALYSIS OF THE EFFICIENCY OF DATA CHUNKING METHODS FOR DATA DEDUPLICATION SYSTEMS

Pages: 24-27. Issue: No. 6, 2022 (315)
Authors:
BARNA Andrii
Lviv Polytechnic National University
ORCID ID: 0000-0002-6692-7496
e-mail: andrii.o.barna@lpnu.ua
KAMINSKY Roman
Lviv Polytechnic National University
ORCID ID: 0000-0002-0563-5748
e-mail: kaminsky.roman@gmail.com
DOI: https://www.doi.org/10.31891/2307-5732-2022-315-6-24-27

Abstract (translated from the original Ukrainian)

The paper presents the results of comparing the efficiency of two data stream chunking methods used in data deduplication systems: the classic TTTD and the new CB-TTTD.
Keywords: deduplication, fragmentation, hashing.

Extended abstract in English

The amount of data that needs to be stored worldwide is growing significantly. More and more companies are turning to deduplication systems, which increase the effective capacity of data storage and reduce storage costs. Deduplication not only reduces the total amount of information in storage but also reduces the load on networks by eliminating the need to retransmit duplicate data. In this work, we consider the stages that any deduplication system includes, namely chunking, hashing and indexing, and mapping. The effectiveness of a deduplication system depends primarily on the method chosen to divide the data stream at the chunking stage. We consider the classic Two Threshold Two Divisor (TTTD) method, which is widely used in modern deduplication systems. This method uses the Rabin fingerprint to compute the hash value of a substring. The formula for calculating the hash of the first substring and the formula for calculating the hash of each subsequent substring are given. The other method we investigate is Content Based Two Threshold Two Divisor (CB-TTTD), which uses new hash functions to fragment the data stream; the corresponding formulas for the first and each subsequent substring are also given. To test the effectiveness of the two methods, we developed a test deduplication system, implemented both fragmentation methods, and tested their performance on two sets of text data. We also modified both methods by adding a new string-splitting condition based on the content specifics of the test data. The results of comparing the classical and modified methods are given. Using metrics for comparing the efficiency of data fragmentation methods, we obtained experimental data from which conclusions can be drawn about the feasibility of using CB-TTTD as an alternative to TTTD in new deduplication systems.
The obtained data can be used to develop new highly efficient data deduplication systems and to improve existing solutions.
Keywords: deduplication, chunking, hashing.
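
The chunking, hashing and indexing, and mapping stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' test system. The hash is a simple polynomial rolling hash standing in for a true Rabin fingerprint, and the names tttd_chunks and deduplicate, the constants BASE and MOD, the window size, and the threshold and divisor defaults are all illustrative assumptions. The new hash functions of CB-TTTD and the content-based splitting condition are not reproduced here, because the abstract does not specify them.

```python
import hashlib
from collections import deque

BASE = 257            # assumed polynomial base (illustrative)
MOD = (1 << 61) - 1   # assumed modulus (illustrative)


def tttd_chunks(data, t_min=460, t_max=2800, d_main=540, d_backup=270, window=48):
    """Split bytes into chunks using a TTTD-style rule (sketch, see assumptions above).

    First window:  H = sum(b[i] * BASE**(window-1-i)) mod MOD
    Each next one: H' = ((H - b_out * BASE**(window-1)) * BASE + b_in) mod MOD
    """
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    chunks, win = [], deque()
    start, backup, h, i = 0, -1, 0, 0
    while i < len(data):
        if len(win) == window:                    # slide: drop the oldest byte
            h = (h - win.popleft() * top) % MOD
        h = (h * BASE + data[i]) % MOD            # take in the new byte
        win.append(data[i])
        length = i - start + 1
        cut = -1
        if length >= t_min:                       # lower threshold: no cut before it
            if h % d_backup == d_backup - 1:      # backup divisor: remember position
                backup = i + 1
            if h % d_main == d_main - 1:          # main divisor: cut here
                cut = i + 1
            elif length >= t_max:                 # upper threshold: forced cut
                cut = backup if backup != -1 else i + 1
        if cut != -1:
            chunks.append(data[start:cut])
            start, backup, h, i = cut, -1, 0, cut  # restart scanning after the cut
            win.clear()
            continue
        i += 1
    if start < len(data):
        chunks.append(data[start:])               # trailing chunk may be shorter than t_min
    return chunks


def deduplicate(chunks):
    """Hashing/indexing and mapping: store unique chunks, keep an ordered recipe."""
    store, recipe = {}, []
    for chunk in chunks:
        digest = hashlib.sha1(chunk).hexdigest()
        store.setdefault(digest, chunk)           # index holds each unique chunk once
        recipe.append(digest)                     # mapping rebuilds the original stream
    return store, recipe


if __name__ == "__main__":
    block = bytes((i * 131 + (i >> 3) * 17) % 256 for i in range(6000))
    stream = block + b"inserted bytes" + block
    store, recipe = deduplicate(tttd_chunks(stream))
    print(len(recipe), "chunks,", len(store), "unique")
```

Because cut points depend on a hash of the content rather than on fixed offsets, an insertion in the middle of a stream tends to shift only nearby boundaries, which is what allows repeated regions to map to the same stored chunks.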

