Popov V.V., Shtelmakh T.V. Natural Text: Mathematical Methods of Attribution
DOI: https://doi.org/10.15688/jvolsu2.2019.2.13
Vladimir V. Popov
Candidate of Sciences (Physics and Mathematics), Associate Professor, Department of Computer Science and Experimental Mathematics, Volgograd State University
Prosp. Universitetsky, 100, 400062 Volgograd, Russia
https://orcid.org/0000-0003-0419-2874
Tatyana V. Shtelmakh
Senior Lecturer, Department of Computer Science and Experimental Mathematics, Volgograd State University
Prosp. Universitetsky, 100, 400062 Volgograd, Russia
https://orcid.org/0000-0002-5320-7406
Abstract. The article proposes two algorithms for filtering substandard texts. The first relies on the observation that n-gram frequencies in a good-quality text obey Zipf's law, whereas the law no longer holds once the words of the text are rearranged. Comparing the frequency characteristics of the source text with those of the text obtained by permuting its words therefore allows conclusions to be drawn about the quality of the source text. The second algorithm calculates and compares the rate at which new words appear in good-quality texts and in randomly generated texts. In a good text this rate is, as a rule, uneven, whereas in randomly generated texts the unevenness is smoothed out, which makes it possible to detect low-quality texts. The methods used to solve the filtering problem are statistical and are based on calculating various frequency characteristics of the text. Unlike the "bag of words" model, a graph model of the text, in which the vertices are words or word forms and the edges are pairs of words, and models with higher-order structures, which use the frequency characteristics of n-grams with n > 2, take into account the mutual arrangement of word pairs and word triples within a common part of the text, for example within one sentence or one n-gram.
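The two ideas in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the exact statistics the authors compute, so everything here is an assumption made for illustration: the function names, the share of repeated n-grams as a crude stand-in for a Zipf-type frequency profile, and the fixed-size windows used to measure the rate at which new words appear.

```python
import random
from collections import Counter


def ngram_counts(words, n=2):
    """Count every n-gram (as a tuple of n consecutive words) in the word list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def repeated_ngram_share(words, n=2):
    """Share of n-gram occurrences that belong to n-grams seen more than once.

    Assumption: used here as a simple proxy for the frequency profile,
    not the statistic from the article.
    """
    counts = ngram_counts(words, n)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


def shuffled_copy(words, seed=0):
    """Return a randomly permuted copy of the words (the original is untouched)."""
    rng = random.Random(seed)
    permuted = list(words)
    rng.shuffle(permuted)
    return permuted


def new_word_rates(words, window):
    """Number of previously unseen words in each consecutive full window."""
    seen = set()
    rates = []
    for start in range(0, len(words) - window + 1, window):
        new_in_window = 0
        for word in words[start:start + window]:
            if word not in seen:
                seen.add(word)
                new_in_window += 1
        rates.append(new_in_window)
    return rates


if __name__ == "__main__":
    text = "the cat sat on the mat the cat sat on the rug".split()
    permuted = shuffled_copy(text)

    # First idea: compare n-gram statistics before and after permutation.
    print(repeated_ngram_share(text), repeated_ngram_share(permuted))

    # Second idea: compare how unevenly new words appear along each text.
    print(new_word_rates(text, 4), new_word_rates(permuted, 4))
```

On a text of realistic length, one would compare the two printed pairs: under the assumptions above, a natural text is expected to differ noticeably from its permuted copy, while a randomly generated pseudo-text is expected to differ little. Any decision threshold would have to be chosen empirically.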
Key words: natural text, pseudo-text, text filtering, Zipf's law, n-grams, the rate of appearance of new words, "bag of words" model of the text, graph model of the text.
Citation. Popov V.V., Shtelmakh T.V. Natural Text: Mathematical Methods of Attribution. Vestnik Volgogradskogo gosudarstvennogo universiteta. Seriya 2. Yazykoznanie [Science Journal of Volgograd State University. Linguistics], 2019, vol. 18, no. 2, pp. 147-158. (in Russian). DOI: https://doi.org/10.15688/jvolsu2.2019.2.13
Natural Text: Mathematical Methods of Attribution by Popov V.V., Shtelmakh T.V. is licensed under a Creative Commons Attribution 4.0 International License.