Вестник Волгоградского государственного университета. Серия 2. Языкознание

1998-9911

2409-1979

10.15688/jvolsu2.2024.5.1

Лексикографические проблемы систем машинного перевода: на пути от буквального до нейронного

Lexicographic Problems of Machine Translation Systems: On the Way from Literal to Neural

Беляева

Лариса Николаевна

Beliaeva

Larisa

lauranbel@gmail.com

0000-0002-8622-4595

Камшилова

Ольга Николаевна

Kamshilova

Olga

onkamshilova@gmail.com

0000-0002-1488-2206

Herzen State Pedagogical University of Russia (Saint Petersburg, Russian Federation)

Российский государственный педагогический университет им. А.И. Герцена (Санкт-Петербург, Российская Федерация)

Saint Petersburg University of Management Technologies and Economics (Saint Petersburg, Russian Federation)

Санкт-Петербургский университет технологий управления и экономики (Санкт-Петербург, Российская Федерация)

27.12.2024

Vol. 23, no. 5, pp. 6–19; 13.05.2024; 20.08.2024

CC BY 4.0

В статье рассматриваются актуальные вопросы интерпретации современными системами машинного перевода (МП) лексики, неизвестной этим системам (out-of-vocabulary words), в контексте изменений форм и ведения автоматического словаря. Дан критический очерк типологии систем МП и стратегий их развития. Описаны особенности этих стратегий и влияние на них развивающихся программных средств и технологий. Проанализированы формы ведения словарной поддержки, меняющиеся под воздействием технологических условий. Показано, что при любой системе МП ее лингвистическое обеспечение и структура автоматических словарей становятся принципиально важными для поддержания качества перевода. При всем успехе развития нейронных систем МП (НМП) их автоматически пополняемые словарные базы не фиксируют слова, характеризующиеся терминологической спецификой и низкой частотой в массивах и корпусах текстов, на которых обучается система. На примере анализа результатов двух востребованных НМП – Google Translate и Yandex Translate – доказано, что обработка и унификация перевода слов, не вошедших в словари системы, прежде легко решавшаяся пользователями всех типов систем МП на основе пополнения и ведения автоматического словаря, остается по-прежнему актуальной проблемой и требует особого подхода при редактировании результатов НМП.

The article discusses how modern machine translation (MT) systems interpret out-of-vocabulary words, in the context of changing forms and methods of maintaining an automatic dictionary. It provides a critical outline of the typology of MT systems and of the strategies for their development. It describes the impact of fast-developing software and technologies on these strategies and analyzes the changes they bring to the forms of dictionary support. The research shows that, whatever the MT system, its linguistic support and the structure of its automatic dictionaries are fundamentally important for maintaining translation quality. Despite all the success of neural MT (NMT) systems, their automatically updated vocabulary databases do not record words that are terminologically specific and have low frequency in the text collections and corpora on which the system is trained. An analysis of translations produced by two popular NMT systems, Google Translate and Yandex Translate, shows that they fail to process and unify the translation of words absent from the system dictionaries, a task that users of all types of MT systems used to solve easily by updating and maintaining an automatic dictionary. With statistics-based automatic dictionaries, this remains a pressing problem and requires a special approach when editing MT results.

machine translation strategy, machine translation, typology of machine translation systems, automatic dictionary, out-of-vocabulary words, linguistic support

машинный перевод, стратегия машинного перевода, типология систем машинного перевода, автоматический словарь, неизвестное слово, лингвистическая поддержка

Belyaeva L.N., 2016. Lingvisticheskiye tekhnologii v sovremennom setevom prostranstve: language worker v industrii lokalizatsii [Linguistic Technologies in the Modern Network Space: Language Worker in the Localization Industry]. Saint Petersburg, Kn. dom Publ. 134 p.

Belyaeva L.N., 2022. Mashinnyy perevod v sovremennoy tekhnologii protsessa perevoda [Machine Translation in Modern Translation Technology]. Izvestiya RGPU im. A.I. Gercena [Izvestia: Herzen University Journal of Humanities & Sciences], no. 203, pp. 22-30.

Belyaeva L.N., Kamshilova O.N., Shubina N.L., 2023. Nauchnaya statya v tekhnologicheskom prostranstve mashinnogo perevoda: pravila i procedury redaktirovaniya: ucheb. posobie [Scientific Article in the Technological Space of Machine Translation: Editing Rules and Procedures. Textbook]. Saint Petersburg, Kn. dom Publ. 90 p.

Nuriev V.A., 2019. Arkhitektura sistemy neyronnogo mashinnogo perevoda [Architecture of a Neural Machine Translation System]. Informatika i ee primeneniya [Informatics and Applications], vol. 13, no. 3, pp. 90-96. DOI: https://doi.org/10.14357/19922264190313

Rarenko M.B., 2021. Mashinnyy perevod: ot perevoda «po pravilam» k neyronnomu perevodu (Obzor) [Machine Translation: From Translation “According to the Rules” to Neural Translation (Review)]. Sotsialnye i gumanitarnye nauki. Otechestvennaya i zarubezhnaya literatura. Seriya 6. Yazykoznanie: RZh [Social Sciences and Humanities. Domestic and Foreign Literature. Series 6. Linguistics. Abstract Journal. INION RAN], no. 3, pp. 70-79. DOI: https://doi.org/10.31249/ling/2021.03.05

Almansoori A., Al Mansoori S., Alshamsi M., Salloum S.A., Shaalan K., 2020. Development of Machine Translation Models: A Systematic Review. International Journal of Control and Automation, vol. 13, no. 2, pp. 1462-1483.

Araabi A., Monz C., Niculae V., 2022. How Effective Is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation? URL: https://arxiv.org/abs/2208.05225v1

Brottrager J., Stahl A., Arslan A., Brandes U., Weitin T., 2022. Modeling and Predicting Literary Reception. Journal of Computational Literary Studies, vol. 1, iss. 1, pp. 1-27. DOI: 10.26083/tuprints-00023250

Dankers V., Bruni E., Hupkes D., 2022. The Paradox of the Compositionality of Natural Language: A Neural Machine Translation Case Study. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Vol. 1: Long Papers, pp. 4154-4175. DOI: https://doi.org/10.48550/arXiv.2108.05885

Devlin J., Chang M.-W., Lee K., Toutanova K., 2019. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Vol. 1. Long and Short Papers, pp. 4171-4186. DOI: https://doi.org/10.18653/v1/N19-1423

Khoong E.C., Rodriguez J.A., 2022. A Research Agenda for Using Machine Translation in Clinical Medicine. Journal of General Internal Medicine, vol. 37, iss. 5, pp. 1275-1277. DOI: 10.1007/s11606-021-07164-y

Lankford S., Afli H., Way A., 2021. Transformers for Low-Resource Languages: Is Féidir Linn! Proceedings of the 18th Biennial Machine Translation Summit Virtual USA, August 16–20. Vol. 1. MT Research Track, pp. 48-61. DOI: https://doi.org/10.48550/arXiv.2403.01985

Liu X., Sun T., He J., Wu J., Wu L., Zhang X., Jiang H., Cao Z., Huang X., Qiu X., 2022. Towards Efficient NLP: A Standard Evaluation and a Strong Baseline. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, Association for Computational Linguistics, pp. 3288-3303.

Peris Á., Casacuberta F., 2019. Online Learning for Effort Reduction in Interactive Neural Machine Translation. Computer Speech & Language, vol. 58, pp. 98-126. DOI: https://doi.org/10.48550/arXiv.1802.03594

Popović M., 2017. chrF++: Words Helping Character n-Grams. Proceedings of the Second Conference on Machine Translation. Copenhagen, s.n., pp. 612-618.

Sennrich R., Haddow B., Birch A., 2015. Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909v5 [cs.CL]. DOI: https://doi.org/10.48550/arXiv.1508.07909

Tars M., Tättar A., Fišel M., 2022. Cross-Lingual Transfer from Large Multilingual Translation Models to Unseen Under-Resourced Languages. Baltic Journal of Modern Computing, vol. 10, iss. 3, pp. 435-446. DOI: https://doi.org/10.22364/bjmc.2022.10.3.16

Toral A., 2019. Post-Editese: An Exacerbated Translationese. Proceedings of Machine Translation Summit XVII. Vol. 1. Research Track. Dublin, European Association for Machine Translation, pp. 273-281.

Zhu C., Yu H., Cheng Sh., Luo W., 2020. Language-Aware Interlingua for Multi-Lingual Neural Machine Translation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, Association for Computational Linguistics, pp. 1650-1655.

Zhuang F., Qi Z., Duan K., Xi D., Zhu Y., Zhu H., Xiong H., He Q., 2021. A Comprehensive Survey on Transfer Learning. Proceedings of the IEEE, vol. 109, iss. 1, pp. 43-76. DOI: 10.1109/JPROC.2020.3004555