INFORMATION CONCENTRATION OF CONTENT IN TEXT
SEARCHING FOR CONTEXT IN THE TEXT
Pages: 80-83. Issue: No. 4, 2019 (275)
Authors:
O.V. DZHURABAIEV, O.V. BARMAK, E.A. MANZIUK, T.K. SKRYPNYK
Khmelnytskyi National University
DOI: https://www.doi.org/10.31891/2307-5732-2019-275-4-80-83
Peer review: 27.06.2019
Printed: 17.07.2019
Abstract (translated from the Ukrainian original)
The paper proposes an approach to studying textual information and describes its theoretical basis. The approach treats textual information as a signal. An information technology was implemented, and experiments were carried out; the results are described and plots of the text are constructed.
Keywords: information technology, text analysis.
Extended abstract in English
Keyword search is no longer a difficult task, because effective algorithms for it exist. The most popular are TF-IDF and Bag of Words. Their main disadvantages are that stopwords are removed and that the position of each word in the text is not taken into account. The aim of the research is to develop an information technology for finding context and to test its effectiveness at finding keywords in a text without removing stopwords and while taking the position of each word into account. A further aim is to develop an information technology that finds places of content concentration in a text with minimal time and low CPU usage and that returns the correct result for a certain range of tasks, provided the input data stay within the stated limits. The paper proposes an approach, based on an analogy with the physical phenomenon of a signal, for constructing a “meaning recognizer” that requires neither a training base nor deep machine analysis of the text and returns an approximate result. The approach is to normalize the text, build the amplitude and phase vectors, and then plot the dependencies of the calculated parameters and visualize the text. The results of experiments on recognizing content in test data are also described. They show that the approach is most effective when the text belongs to a specific category. The information technology for finding content in textual information makes it possible to present a text graphically as a three-dimensional model, in which grouped concentrations can be identified. Ultimately, this allows groups of words that form a feature vector of content concentration to be clustered visually. Thus, textual information is presented as a clustered three-dimensional model based on content concentration, expressed through the key words of the content.
The basic characteristics of textual information are identified as its basic representation after transformation into numerical dimensional characteristics. This representation is the basis for further research on text clustering and classification. The results confirm that the method is effective when the text belongs to a single category. When several texts of a similar category are studied, a set of words that best characterizes these texts (the classifier’s core) can be built. Texts can also be studied visually as surfaces.
Keywords: information technology, text processing.
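Illustrative sketch (not part of the paper). The abstract names the steps of the approach (normalize the text, build amplitude and phase vectors) but does not define the two vectors. The Python sketch below is therefore only one possible reading: it assumes, purely for illustration, that a word's amplitude is its number of occurrences and its phase is its mean relative position in the text. The function names `normalize`, `amplitude_phase` and `top_words` are hypothetical. Stopwords are deliberately kept, as the abstract requires.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase the text and split it into word tokens; stopwords are kept."""
    return re.findall(r"\w+", text.lower())


def amplitude_phase(tokens):
    """Return {word: (amplitude, phase)} for a list of tokens.

    ASSUMPTION (not taken from the paper): amplitude is the number of
    occurrences of the word, and phase is the mean relative position of
    those occurrences, a value in [0, 1).
    """
    positions = defaultdict(list)
    for index, word in enumerate(tokens):
        positions[word].append(index / len(tokens))
    return {
        word: (len(pos), sum(pos) / len(pos))
        for word, pos in positions.items()
    }


def top_words(text, k=3):
    """Rank words by assumed amplitude; ties go to the earlier mean position."""
    features = amplitude_phase(normalize(text))
    ranked = sorted(features.items(), key=lambda item: (-item[1][0], item[1][1]))
    return [word for word, _ in ranked[:k]]
```

Under these assumptions, each word becomes a point (amplitude, phase) that could be plotted or clustered; the three-dimensional model and the clustering step described in the abstract are not reproduced here.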