Grigoryeva E.G., Klyachin V.A., Pomelnikov Yu.V., Popov V.V. Algorithm of Key Words Search Based on Graph Model of Linguistic Corpus


Elena G. Grigoryeva

Candidate of Sciences (Physics and Mathematics), Associate Professor, Department of Computer Science and Experimental Mathematics, Volgograd State University

Prosp. Universitetsky, 100, 400062 Volgograd, Russian Federation

Vladimir A. Klyachin

Doctor of Sciences (Physics and Mathematics), Associate Professor, Head of Department of Computer Science and Experimental Mathematics, Volgograd State University

Prosp. Universitetsky, 100, 400062 Volgograd, Russian Federation

Yuriy V. Pomelnikov

Candidate of Sciences (Physics and Mathematics), Associate Professor, Department of Computer Science and Experimental Mathematics, Volgograd State University

Prosp. Universitetsky, 100, 400062 Volgograd, Russian Federation

Vladimir V. Popov

Candidate of Sciences (Physics and Mathematics), Associate Professor, Department of Computer Science and Experimental Mathematics, Volgograd State University

Prosp. Universitetsky, 100, 400062 Volgograd, Russian Federation

Abstract. One of the problems of computer corpus linguistics is the automatic determination of keywords in a text. The usual solution is a statistical method based on calculating various frequency characteristics of the text. The most commonly used model here is the “bag of words”, which does not take into account the order of words in the text. In this paper, we propose a graph model of the text that allows us to calculate the frequency characteristics of words not only within the “bag of words” model, but also with respect to the co-occurrence of pairs of words in some common part of the text, for example, in one sentence. To work with this model, a software model is constructed in the form of a database schema intended for storing various statistical information about the text. Based on this data model, the article proposes an algorithm for determining the keywords of a text; the algorithm is implemented in the Python programming language.
When analyzing a document d of a linguistic corpus D, our algorithm creates a list of about 40 words with the largest tf-idf measure and chooses from them the 20 words that are used most often in the document d. We regard these words as the vertices of a graph G, in which the multiplicity of the edge connecting vertices t and t’ is equal to the number of sentences in document d that contain both words. Approximately 10 vertices of the graph with the greatest degree are selected. The words corresponding to these vertices are taken as the key words of document d.
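
The selection steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`split_sentences`, `tokenize`, `keywords`), the regular-expression sentence splitter, the particular tf-idf variant (relative frequency times the logarithm of inverse document frequency), and the alphabetical tie-breaking are all assumptions made for illustration. The sketch also works with raw lowercased word forms rather than base forms of words, and it does not use a database schema.

```python
import math
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    # Assumption: lowercased runs of letters stand in for base forms of words.
    return re.findall(r"[^\W\d_]+", text.lower())


def keywords(doc, corpus, n_tfidf=40, n_freq=20, n_keys=10):
    """Return up to n_keys key words of doc, a document of the list corpus."""
    sentences = [tokenize(s) for s in split_sentences(doc)]
    tf = Counter(w for s in sentences for w in s)
    total = sum(tf.values())
    doc_sets = [set(tokenize(d)) for d in corpus]

    def tfidf(w):
        # Hypothetical tf-idf variant; max(df, 1) guards against doc not in corpus.
        df = sum(1 for ws in doc_sets if w in ws)
        return (tf[w] / total) * math.log(len(doc_sets) / max(df, 1))

    # Step 1: words with the largest tf-idf (ties broken alphabetically).
    candidates = sorted(tf, key=lambda w: (-tfidf(w), w))[:n_tfidf]
    # Step 2: among them, the words used most often in the document.
    vertices = set(sorted(candidates, key=lambda w: (-tf[w], w))[:n_freq])

    # Step 3: edge multiplicity = number of sentences containing both words.
    edges = Counter()
    for s in sentences:
        for a, b in combinations(sorted(set(s) & vertices), 2):
            edges[(a, b)] += 1

    # Step 4: vertex degree counted with edge multiplicities.
    degree = Counter()
    for (a, b), m in edges.items():
        degree[a] += m
        degree[b] += m
    return sorted(vertices, key=lambda w: (-degree[w], w))[:n_keys]
```

The default parameters mirror the approximate sizes given above (40, 20, and 10); for a short test document they can be reduced, for example `keywords(doc, corpus, n_tfidf=6, n_freq=4, n_keys=2)`.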

Key words: graph, text, word, text split, statistical measure tf-idf, key word, base form of word.

Citation. Grigoryeva E.G., Klyachin V.A., Pomelnikov Yu.V., Popov V.V. Algorithm of Key Words Search based on Graph Model of Linguistic Corpus. Vestnik Volgogradskogo gosudarstvennogo universiteta. Seriya 2, Yazykoznanie [Science Journal of Volgograd State University. Linguistics], 2017, vol. 16, no. 2, pp. 58-67. (in Russian). DOI:

