Problems and Algorithms of Streaming Data Processing – Herald of Khmelnytskyi National University

PROBLEMS AND ALGORITHMS OF STREAMING DATA PROCESSING

ALGORITHMS AND CHALLENGES IN STREAMING DATA PROCESSING

Pages: 42-48. Issue: No. 5, Vol. 2, 2023 (327)
Authors:
ZAVALIY Taras
Lviv Polytechnic National University
ORCID ID: 0009-0002-7544-782X
e-mail: taras.i.zavalii@lpnu.ua
DOI: https://www.doi.org/10.31891/2307-5732-2023-327-5-42-48

Abstract (translated from Ukrainian)

The paper considers the basic concepts of data analysis in the context of working with data streams rather than stored datasets. The basic principles and algorithms are the same in both cases, but streaming data imposes significant constraints on memory and time and requires additional methods for accumulation, filtering, and preprocessing. These methods are mostly oriented toward raw data. The article presents a comparative analysis of the main types of algorithms and considers current problems in data stream analysis. A brief description of the Kafka message broker and the Spark Streaming framework is given.
Key words: streaming data, stream processing, online data analysis, message queues.

Extended abstract in English

This paper considers the basic methods and tools of data analysis in the context of data streams rather than batches. The fundamental principles and algorithms are the same in both cases, but streaming data imposes significant constraints on memory and time and requires additional methods for accumulation, filtering, and preprocessing. These methods are mostly applied to raw data, which is now produced continuously in many areas: sports analytics, medical analytics, patient monitoring, real-time stock market analysis, website visitor analysis, infrastructure monitoring, and predictive maintenance, as well as scientific research projects that gather vast amounts of data.
The paper provides a comparative analysis of the main types of algorithms and discusses current applied problems in stream processing and online data analysis. Specifically, algorithms such as Stream DBScan, DGIM, HyperLogLog, Bloom filter, and Count-Min Sketch are described and compared in terms of their application and computational complexity. A brief description of the Kafka message broker and the Spark Streaming framework is presented, although the number of available tools and frameworks keeps growing. Such tools support windowing, event-time processing, and state management, offer machine learning libraries, and enable advanced analytics on streaming data. They also address scalability and provide the throughput needed to handle large volumes of data.
From a technical standpoint, two factors are equally important for streaming data analysis: the choice of the technological stack and the choice of the algorithm. The paper states that the most important tasks are obtaining raw streaming data, selecting the optimal analysis algorithm, and accounting for the specifics of the data. A further challenge for future research is combining different stream processing algorithms in a multi-stage distributed architecture to achieve a higher-quality resulting model.
Key words: streaming data, stream processing, online data analysis, message queues.
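
As an illustration of the memory constraints mentioned in the abstract, here is a minimal Python sketch of one of the listed algorithms, Count-Min Sketch, which estimates item frequencies in a stream using a fixed-size table. It is not taken from the paper: the class name `CountMinSketch`, the default `width` and `depth`, and the use of `hashlib.blake2b` to derive one hash per row are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory frequency estimator (illustrative sketch, not the paper's code)."""

    def __init__(self, width=1000, depth=5):
        # width and depth are assumed defaults; they trade memory for accuracy.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive a per-row hash by prefixing the item with the row number.
        data = f"{row}:{item}".encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum across rows
        # is an upper bound on the true count that never underestimates it.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```

After feeding a stream item by item with `add`, `estimate(item)` returns a value that is at least the true count of that item; memory use depends only on `width * depth`, not on the length of the stream.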

Post Author: Горященко Сергій