Google/Yandex Translation Detection in the Patterns Identifying System of Multilingual Texts

The objective of this work is to develop a script for evaluating how well online translators translate text from one language to another. For this purpose, we used Google Translate and Yandex.Translate. The analysis used examples in English, Kazakh and Russian drawn from 147 news items and about 1800 sentences. The texts were taken from the Internet resource astana.gov.kz. A corpus of parallel texts in the three languages was created. The script was developed for the “sentence” pattern, with the prospect of extending it to the “text” pattern. We analyzed errors in the following categories: untranslated/omitted words, extra words, incorrect word endings, incorrect word order, punctuation errors, mutilate translation and incorrect translation. Based on the analysis of the obtained data, we conclude that Russian text is better translated into Kazakh or English with Yandex.Translate than with Google Translate. The developed comparison script and error analysis script are openly available on the Internet.


I. INTRODUCTION
In the list of languages by number of native speakers, English is ranked 3rd (379 million people), Russian 7th (154 million people), and Kazakh 76th (12.9 million people) [1]. According to an analysis of multilingualism, 13% of the world's population speak three languages fluently and 42% speak two [2]. In the Republic of Kazakhstan, knowledge of three languages (Russian, Kazakh and English) is almost a prerequisite for career growth and higher pay. The state language of the Republic of Kazakhstan is Kazakh. Kazakh and Russian are used on equal grounds. Since 2014 all schools have taught English from grade 1.
The language industry is big business. In their "Global Language Services Market 2020-2024" research, TechNavio analysts forecast market growth of 9.72 billion USD over 2020-2024, with a CAGR of 4% [3].
Taken together, these facts indicate that the quality of online translators is a pressing issue.
The objective of this work is to develop a script for evaluating how well online translators translate text from one language to another. For this purpose, we used Google Translate [4] and Yandex.Translate [5], with examples in English, Kazakh and Russian. Research papers comparing the output of online translators are available on the Internet, but they explore other language combinations and other algorithms for comparing translations, and they do not report the errors that we found [6][7][8][9]. We chose only these two online translators because both can translate between all three languages we need. Google Trends confirms the demand for these translators [10]. Translation analysis was performed with the similar_text function [11] and with token_set_ratio from the FuzzyWuzzy library [12]. The script currently works for the "sentence" pattern. We plan to refine the system for the "text" pattern and to extend the analysis to other common online translators. similar_text is a PHP string function that is widely used where fuzzy string comparison is necessary. The token_set_ratio function from the FuzzyWuzzy library is currently also a popular PHP solution for fuzzy string comparison [13,14].
To perform the experiments, a corpus of parallel texts comprising 1773 sentences in Kazakh, Russian and English was collected. We used news published on the official Internet resource of the akimat of Nur-Sultan: http://astana.gov.kz/en.
The research contribution is a script for evaluating translation quality using the similar_text/token_set_ratio coefficients for different language pairs.
This script is required in the multilingual text pattern identification system (hereinafter, the system), which generates a hybrid translation from the sentences with the highest similar_text/token_set_ratio coefficient, based on Yandex and Google translations. How will the system work?
The user enters the text (for example, Russian text), selects the language to translate it into (for example, English), and clicks the "Translate" button. The system performs direct and double translation of the received text through Google Translate and Yandex.Translate. A double translation is a translation through an interim language: for example, Russian text is translated into English and then back from English into Russian. The similar_text/token_set_ratio coefficients show how much the original sentence changed after the double translation. The higher the coefficients, the fewer changes occurred in the text, which is taken to mean that the direct translation of this sentence into English provided by the online translator is of high quality.
The system calculates the similar_text/token_set_ratio coefficients for each sentence using the script described in this article. Based on these coefficients, it then decides whether to display the Google Translate or the Yandex.Translate translation. The system displays the translation with the highest coefficient, that is, the translation that is more stable under double translation.
Thus, the user gets the most reliable translation, which is compiled with the help of two of the most popular online translators in Kazakhstan.
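The selection step described above can be sketched as follows. This is a minimal illustration in Python, not the authors' PHP script: the names `round_trip_score` and `pick_translation`, the sample sentences, and the use of `difflib.SequenceMatcher` as a stand-in for the similar_text/token_set_ratio coefficients are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def round_trip_score(source: str, round_trip: str) -> float:
    """Stand-in similarity coefficient in the range 0..100 (assumption:
    the real system would use similar_text or token_set_ratio here)."""
    return 100.0 * SequenceMatcher(None, source, round_trip).ratio()


def pick_translation(source, candidates, score=round_trip_score):
    """Return (translator, direct translation) with the highest coefficient.

    `candidates` maps a translator name to a pair:
    (direct translation, double translation back into the source language).
    """
    best = max(candidates, key=lambda name: score(source, candidates[name][1]))
    return best, candidates[best][0]


# Hypothetical data for one sentence; not taken from the corpus.
source = "мама мыла раму"
candidates = {
    "google": ("Mom washed the frame", "мама мыла раму"),
    "yandex": ("Mother soap frame", "мать мыло рама"),
}
chosen = pick_translation(source, candidates)
print(chosen)
```

Passing the scoring function as a parameter keeps the selection logic independent of which coefficient is eventually used.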

A. TOOLS
A PHP script, "Text Comparison", was developed for the analysis [15].
The program is written in PHP and saved in "UTF-8 without BOM" encoding. The code can be easily adjusted to perform a similar analysis for many other language pairs.
In the program code we used data cleansing for text normalization.
A short description of the regular expression: the "u" flag indicates that the pattern and the text are in UTF-8, not just Latin letters. The "i" flag makes matching case-insensitive. The range "а-я" does not include the "ё" symbol, so we specify it additionally, and "\d" matches any digit. In addition, $replacement contains a space, so all characters other than letters, digits, and spaces are replaced with a space.
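The normalization described above can be illustrated with a short Python sketch. The exact pattern used in the authors' PHP script is not reproduced in the text, so the character class below is an assumption reconstructed from the description (Latin letters, "а-я" plus "ё", digits, whitespace); the names `NON_TEXT` and `normalize` are hypothetical.

```python
import re

# Assumed pattern, reconstructed from the description: keep Latin letters,
# Cyrillic а-я plus ё, digits and whitespace; everything else becomes a space.
# Python str patterns are Unicode-aware, which plays the role of PHP's "u" flag.
NON_TEXT = re.compile(r"[^a-zа-яё\d\s]", re.IGNORECASE)


def normalize(text: str) -> str:
    """Lowercase the text and replace every other character with a space."""
    return NON_TEXT.sub(" ", text.lower())


print(normalize("Start/Finish"))
print(normalize("10 km(from 16 years old)"))
```

Under this assumed pattern, Kazakh-specific letters such as "ә" or "қ" fall outside the "а-я" range and would also be replaced by spaces, so the class would need extending before normalizing Kazakh text.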
The Porter stemmer is a stemming algorithm published by Martin Porter in 1980 [18]. The original version of the stemmer was intended for English and was written in BCPL. Porter later created the Snowball project and, using the basic idea of the algorithm, wrote stemmers for common Indo-European languages, including Russian [19]. The algorithm does not use a dictionary of word stems; instead, it removes word endings and suffixes by applying a set of language-specific rules.
similar_text computes the similarity of two strings using the Oliver algorithm [11]. The function returns the number of matching characters and stores the similarity of the two strings, as a percentage, in $percent, which is passed by reference.
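To make the coefficient concrete, the following Python sketch re-implements the idea behind similar_text: find the longest common substring, then recurse on the parts to its left and right, and report twice the matched length as a share of the combined length. It is an illustrative port, not PHP's source code. One assumption to note: PHP compares bytes, whereas this sketch compares Unicode code points, so values for Cyrillic text in UTF-8 may differ from PHP's.

```python
def similar_chars(a: str, b: str) -> int:
    """Count matching characters: longest common substring plus recursion."""
    best_len = best_i = best_j = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:  # keep the first longest match found
                best_len, best_i, best_j = k, i, j
    if best_len == 0:
        return 0
    return (best_len
            + similar_chars(a[:best_i], b[:best_j])
            + similar_chars(a[best_i + best_len:], b[best_j + best_len:]))


def similar_text_percent(a: str, b: str) -> float:
    """Similarity as a percentage: 2 * matches / (len(a) + len(b)) * 100."""
    total = len(a) + len(b)
    return similar_chars(a, b) * 2 * 100 / total if total else 0.0


print(similar_chars("World", "Word"))
print(round(similar_text_percent("World", "Word"), 2))
```

For "World" and "Word" the longest common substring is "Wor"; the remainders "ld" and "d" share one more character, giving 4 matches out of 9 characters in total.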
The token_set_ratio function comes from the FuzzyWuzzy library [20]. Of the four functions available in the FuzzyWuzzy library, we selected token_set_ratio. This function does not depend on word order or word repetition, and it produces the best result for matching strings [12]. "Token_set_ratio=100" means a 100% match of the compared strings.
To translate the manually compiled corpus of parallel texts in Kazakh, Russian and English, the free online translators Google Translate and Yandex.Translate were used. We also intended to include Bing [21] in the comparison, but this translator does not support the Kazakh language.
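The order- and repetition-independence of token_set_ratio can be illustrated with a simplified Python sketch. This is a re-implementation of the general idea using the standard-library `difflib`, not the FuzzyWuzzy code itself; the names `ratio` and `token_set_ratio_sketch` are hypothetical, and the library's own preprocessing and rounding may differ.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Similarity of two strings as an integer percentage."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    """Compare two strings as sets of tokens, ignoring order and repeats."""
    tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = f"{common} {only1}".strip()
    combined2 = f"{common} {only2}".strip()
    # The best of the three pairwise comparisons is the final score.
    return max(ratio(common, combined1),
               ratio(common, combined2),
               ratio(combined1, combined2))


print(token_set_ratio_sketch("fuzzy was a bear", "fuzzy fuzzy was a bear"))
print(token_set_ratio_sketch("мама мыла раму", "раму мыла мама"))
```

In this sketch, a sentence whose words are a subset of the other sentence's words also scores 100, so omitted or extra words are not penalized by this measure.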

B. RESEARCH METHODOLOGIES
For the research we created a corpus of parallel texts in three languages, produced by a professional interpreter: English (En), Russian (Ru) and Kazakh (Kz). For the corpus we used news published on the official Internet resource of the akimat of Nur-Sultan, http://astana.gov.kz/en, from 26.12.2018 to 28.11.2019. Note that sometimes a news item was first written in Russian or Kazakh and then translated by a professional interpreter into the two other languages.
When creating the corpus, it was essential to have all three language versions of each text and the same number of sentences in each. If the number of sentences in the translations did not match, the sentence was omitted from the text. There were 48 news items and 591 sentences for each language, for a total of 144 news items and 1773 sentences compiled by a professional interpreter. Each news item in the corpus files is separated from the next by a double line break. The corpus and all the research data can be downloaded here [22].
After forming the corpus, we started creating translations of sentences in accordance with Fig. 1.
Both online translators provided translations through interim languages (RuEnRu, RuKzRu) and direct translations (KzRu, EnRu). Table 1 shows the average number of words in a sentence, as well as the longest and the shortest sentences by word count.
To compare texts for similarity we performed text normalization. All strings were converted to lowercase. All characters other than letters, digits, and spaces were replaced with spaces, since the corpus contained cases such as "start/finish", "10 km(from 16 years old)", etc.
Then we applied the Porter stemmer for the Russian language.
Next, we performed a fuzzy string comparison using the PHP similar_text function and the FuzzyWuzzy token_set_ratio function, before and after applying the Porter stemmer.

A. COMPARISON ANALYSIS OF SIMILAR_TEXT AND TOKEN_SET_RATIO
Based on the data obtained, we decided not to use the Porter stemmer in the system for identifying patterns of multilingual texts. First, using the stemmer roughly doubles the script execution time (from 1.98 seconds to 4.19 seconds on average). Second, Porter stemmers are not available for all languages, which would be an obstacle when adding other languages to the system.
To evaluate the translation quality of sentences, we selected three parameters: "correct translation", "mutilate translation", and "incorrect translation".
"Correct translation" means that the whole sentence is translated correctly. Synonyms are acceptable in such a sentence.
"Mutilate translation" (also referred to as "distorted translation") means that the meaning of the translated sentence is slightly distorted, but generally remains the same.
"Incorrect translation" means that the meaning of the translated sentence is significantly distorted.
As a result of the error analysis, we formed a table of intervals for the system for identifying patterns of multilingual texts (Table 2).
Unfortunately, the analysis shows a strong overlap between the intervals, so we cannot be sure that a sentence translated from one language into another using Yandex.Translate or Google Translate will fall under a specific translation quality parameter. Therefore, we decided not to include the table in the system.
Which is the more reliable indicator of translation quality: Similar_text (ST) or token_set_ratio (TSR)? To answer this, Ru-RuEnRuGoogle comparison graphs were created for the "Incorrect translation" (Fig. 2) and "Distorted translation" (Fig. 3) parameters. Similar_text gives the more accurate result, showing lower coefficients for these parameters.

B. ERROR ANALYSIS
When developing the script for the system for identifying patterns of multilingual texts, the following errors were noticed. The Google Chrome browser (Version 81.0.4044.138) was used to create the double translations. In the process, we discovered that line breaks are not always preserved in the translation. In particular, Yandex.Translate does not recognize the line break "cr" on its own, only the combination "cr" "lf" (Fig. 4). Google Translate recognizes all types of line breaks [23].
To preserve all line breaks after translation in Yandex.Translate, two replacements were made in the Notepad++ text editor in "Extended" search mode: first "\r" was replaced with "\r\n" (Fig. 5), then "\n\n" was replaced with "\n" (Fig. 6). All news items remain separated by two line breaks.
[Chart: coefficient by sentence for "mutilate translation"; series: similar_text, token_set_ratio]
Figure 5. Replacing "\r" with "\r\n" in Notepad++
Figure 6. Replacing "\n\n" with "\n" in Notepad++
Yandex.Translate also does not handle quotation marks correctly after translation. For more information, see the files whose names contain the word "yandex" at the link [22].
As a result of comparing the Russian text with the double and direct translations, the following table was formed. To create the table, we used a modified version of the main comparison script that produces a table for Excel [15].
Comparing the Ru text with the direct translations from Kz and from En (KzRuGoogle, KzRuYandex, EnRuGoogle, EnRuYandex), we can say that the ability to determine the identity of multilingual texts created by professional interpreters is low. If the percentages for "Mutilate translation" and "Incorrect translation" are added up, the similarity in meaning of the Ru and Kz texts is easier to detect using Yandex.
Thus, although it is better to translate Ru to Kz or En in Yandex.Translate, the system can take translation subtleties into account using Similar_text calculations, which will help generate text based on the more accurate sentence translations from the online translators.
Detected errors can be used to further improve the program for comparing two texts.
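As one example of such an improvement, the manual Notepad++ line-break fix described above could be scripted. The Python sketch below applies the same two replacements in the same order; the function name `fix_line_breaks` is hypothetical, and the sketch assumes input that uses "\r" or "\r\n" line breaks, as in the case described.

```python
def fix_line_breaks(text: str) -> str:
    """Apply the two replacements in order: CR -> CRLF, then LF LF -> LF."""
    # Step 1 turns a lone "\r" into "\r\n" and an existing "\r\n" into "\r\n\n".
    # Step 2 collapses the doubled "\n" produced by step 1.
    return text.replace("\r", "\r\n").replace("\n\n", "\n")


# A lone CR becomes CRLF; existing CRLF sequences are left unchanged.
print(repr(fix_line_breaks("first\rsecond")))
print(repr(fix_line_breaks("news one\r\n\r\nnews two")))
```

Note that the second replacement would also collapse blank lines in text that uses bare "\n" line breaks, so this sketch is only suitable for the CR and CRLF input it was written for.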

IV. CONCLUSION
The collected corpus of parallel texts in English, Kazakh and Russian is available for download and further use in research on multilingual texts [22]. At the time of writing, we had not found a similar corpus on the Internet. We expect it to save time for other researchers.
The program code can be used to study many other language pairs based on the Latin or Cyrillic alphabet. Modifying the code to address the errors described in clause 3.3 Error Analysis of this article should yield more accurate results.
The main comparison script "Text Comparison" and the script for creating a sentence comparison table for Excel are openly available on the Internet [15]. The developed scripts are simple, so they can be extended through further development and used for comparison with other online translators as well.
Highlighting sentences in a color that depends on the similar_text coefficient will allow interpreters to quickly produce translations that do not require high accuracy. They will be able to focus on sentences with a low similar_text coefficient, while skipping sentences with a high coefficient, which indicates a more accurate translation.