Klyachin V.A., Khizhnyakova E.V. Attribution of Media Texts Based on a Trained Natural Language Model and Linguistic Assessment of Identification Quality

DOI: https://doi.org/10.15688/jvolsu2.2024.5.3

Vladimir A. Klyachin

Doctor of Sciences (Physics and Mathematics), Professor, Head of the Department of Computer Sciences and Experimental Mathematics, Volgograd State University

Prosp. Universitetsky, 100, 400062 Volgograd, Russia

https://orcid.org/0000-0003-1922-7849

Ekaterina V. Khizhnyakova

Senior Lecturer, Department of Computer Sciences and Experimental Mathematics, Junior Researcher, Department of Translation Studies and Linguistics, Volgograd State University

Prosp. Universitetsky, 100, 400062 Volgograd, Russia

https://orcid.org/0000-0002-7914-9988


Abstract. Effective systems for filtering media texts are needed for the development of artificial intelligence systems, in particular large language models, which should be trained on "correct" text samples free of signs of disinformation, infodemic content, and unreliability. The article presents the results of automatically detecting high-quality media texts, as well as text samples with infodemic features, using a natural language model trained on a manually labeled corpus. Experts labeled the corpus manually based on a parameterization of the text content. The goal of the work is to build a model of the language of media messages, assess its quality, and identify detection errors caused by the linguistic characteristics of the texts. Creating such a model is a condition for increasing the efficiency and quality of artificial intelligence systems. Test use of the trained natural language model shows that it filters media texts with fairly high accuracy. The support vector machine method proved the most effective. The share of incorrectly recognized informative texts that meet the criteria of reliability and novelty is low, at 6.2 percent. The share of incorrectly recognized uninformative texts is approximately 3.9 percent, which indicates fairly high efficiency of the developed model. Errors in the detection of informative texts are associated with the use of proper names (anthroponyms, toponyms) and numerals in the headings. Misclassified texts that contain signs of fakes and misinformation are characterized by statements with speech verbs, which are also frequent in informative texts.

Key words: media text, neural network, language model, machine learning method, corpus, automatic detection.

Citation. Klyachin V.A., Khizhnyakova E.V. Attribution of Media Texts Based on a Trained Natural Language Model and Linguistic Assessment of Identification Quality. Vestnik Volgogradskogo gosudarstvennogo universiteta. Seriya 2. Yazykoznanie [Science Journal of Volgograd State University. Linguistics], 2024, vol. 23, no. 5, pp. 31-46. (in Russian). DOI: https://doi.org/10.15688/jvolsu2.2024.5.3

Attribution of Media Texts Based on a Trained Natural Language Model and Linguistic Assessment of Identification Quality by Klyachin V.A., Khizhnyakova E.V. is licensed under CC BY 4.0

Attachments:
3_Klyachin_Khizhnyakova.pmd.pdf
URL: https://l.jvolsu.com/index.php/en/component/attachments/download/3023