APPROACH FOR MINIMIZATION OF PHONEME GROUPS IN AUTHORSHIP ATTRIBUTION

The mathematical support developed for authorship attribution software combines two statistical methods (the Student's t-test and the Kolmogorov-Smirnov test) with a statistical model for determining significant differences between styles. Combining the methods enhances the test validity of authorship attribution, because the same results must be obtained by both methods. The model identifies a consonant phoneme group with a high style identification capability, taking the phoneme position in a word into account: the greater the number of significant differences, the higher the authorship identification capability of the phoneme group. The system software implements the algorithms of the combined methods and the statistical model. It is written in Java, which provides platform independence. Minimizing the number of consonant phoneme groups makes style and authorship attribution more automated. The results of comparing the scientific, belles-lettres, conversational and newspaper styles are presented. The data obtained indicate that the combination of methods and the developed statistical model improve the test validity of style and authorship attribution.


INTRODUCTION
We live in an era of large-scale implementation of information technologies in all fields of human activity. When the author of a text is unknown, authorship attribution is performed. An author can be identified on different language levels. In this study the texts are differentiated on the phonological level, which has the advantage of an unaltered number of elements. Authorship attribution of a text on the phonological level can be performed according to the following criteria of differentiation: absolute frequency of phoneme groups; relative frequency of phoneme groups; average frequency of phoneme groups.
The texts by different authors can be differentiated according to the following positions of phonemes in a word: arbitrary position in a word; at the beginning of a word; at the end of a word.
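The three positions can be illustrated with a short Java sketch. It assumes that each word is already available as a list of transcription symbols and that a phoneme group is given as a set of symbols; the class name, the enum and the method are hypothetical and serve only as an illustration.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of counting a phoneme group by position in a word. */
final class PositionCounter {

    enum Position { ARBITRARY, WORD_START, WORD_END }

    /** Counts occurrences of the group's phonemes in the requested position. */
    static int count(List<List<String>> words, Set<String> group, Position position) {
        int total = 0;
        for (List<String> word : words) {
            if (word.isEmpty()) {
                continue; // nothing to inspect in an empty word
            }
            switch (position) {
                case WORD_START:
                    if (group.contains(word.get(0))) total++;
                    break;
                case WORD_END:
                    if (group.contains(word.get(word.size() - 1))) total++;
                    break;
                default: // ARBITRARY: every phoneme of the word is considered
                    for (String phoneme : word) {
                        if (group.contains(phoneme)) total++;
                    }
            }
        }
        return total;
    }
}
```

Counting by position in this way yields three separate frequency series for the same phoneme group, one for each position.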
In addition, the texts can be differentiated by one group of phonemes, two or three phoneme groups and all phoneme groups.
The most effective approach to the above-mentioned problem is to differentiate texts by one, two or three of the eight phoneme groups in which significant differences are found when texts by different authors are compared. Such simplification of the authorship attribution process can be regarded as minimization of the number of phoneme groups.
The frequency of occurrence of phonemes is influenced by such factors as language, style and authorial style. When a text is significantly marked with stylistic and authorial elements, it can be revealed by the phoneme group which is characterized by a high authorship identification capability.
In addition, the reliability and accuracy of the results of authorship attribution are higher on the phonological level than on other language levels, because this level is a closed system with an unaltered number of elements. Consequently, the differentiation of texts by different authors on the phonological level with the help of modern information technologies (IT) is an important research task.

RELATED WORK
The problem of determining the features of an author's individual style is widely represented in current research in the field of IT [1,2,3]. An analysis of recent publications shows that authorship attribution usually relies on short words (prepositions, conjunctions), which are characterized by relative stability of use. The most common methods are statistical tests. Thus, a statistical method, discriminant analysis, is used for differentiating functional words with an accuracy of 81.5% [4]. The method applied is less powerful than the Kolmogorov-Smirnov test. The linear regression method and the support vector machine algorithm are used to differentiate functional words and punctuation marks in texts by different authors. The accuracy is 90%, 95% and 97% [5,6,7]. This combination of methods is more powerful and gives better results, but the lexical level on which it is used is less strictly arranged than the phonological one, on which still better results could be obtained. The support vector machine algorithm is used to establish the authorship of documents. Letters and stop words are differentiated with an accuracy of 97.22% [8]. A combination of methods could be more efficient. Uni-grams and bi-grams are differentiated with the help of the Kolmogorov-Smirnov test for author identification. The accuracy is 95% and 97% [9,10]. Phoneme-based research could demonstrate higher efficacy. Functional words, prepositions, pronouns, nouns and verb forms from two authorial styles are differentiated by the chi-square test with an accuracy of 90% and 95% in texts on the judiciary [11]. The chi-square test is less powerful than the Kolmogorov-Smirnov test [12]. The Student's t-test is used for revealing the difference between two compared authors. The first words of sentences and punctuation marks are differentiated with an accuracy of 66.7% [13]. The Student's t-test and the chi-square test are less powerful than the Kolmogorov-Smirnov test [12].
The rank-frequency relations for the phonemes of a text depend on its author and can be described by the Dirichlet distribution [14]. The ranking method is less powerful than the hypothesis method [15]. In stylometric studies, it is proposed to use linguametric coefficients when dealing with the corpus of Ukrainian language texts of the scientific style. The results are obtained with an accuracy of 77% and 87%. In this research, the openness of the lexical and syntactic systems causes difficulties in describing language phenomena with a high degree of accuracy [16,17]. Determining the characteristics of an author's style in a piece of fiction is a more complicated task than identifying the author of a scientific article. Components of the lexical-semantic groups of the belles-lettres style are more numerous and vary in terms of emotionality and expressiveness. Therefore, the level of accuracy of authorship attribution of literary works may be lower than that of scientific articles [18,19].
Modern IT allows us to differentiate the stylistic characteristics of an author's style, but this task is rather complicated because stylistic devices are not used as frequently as words of the common wordstock [20].
Online messages have their own specific genre, which allows us to distinguish between authorial distinctive features. Due to penetration of a great number of conversational elements, typical of this genre, the level of accuracy is lower than in the scientific style [21].
The analysis of the above-mentioned research has shown that differentiation of functional styles and texts by different authors on the phonological level is a topical task nowadays. The problem of enhancing the test validity of authorship attribution and the problem of minimizing the number of phoneme groups by which styles differ essentially still remain to be solved. In order to increase the reliability of the results of authorship attribution, it is necessary to combine the most efficient methods [12,15] and to differentiate texts on the most strictly arranged language level, the phonological one. The optimal level of significance is the classical level of 0.05. On this level the greatest number of results is obtained. On the level of significance of 0.01 some of the results are lost. The use of modern information technology tools makes authorship attribution fast.

PURPOSE AND OBJECTIVES OF THE RESEARCH
The purpose of the study is to minimize the number of consonant phoneme groups in which the texts by different authors differ essentially on the phonological level. To attain the goal, the following tasks must be fulfilled:
- a combination of statistical methods and a statistical model that improve the test validity of authorship attribution must be proposed;
- the phoneme group with the highest style identification capability in a certain phoneme position in a word must be determined;
- the system software for automating style and authorship attribution must be developed.

MATHEMATICAL SUPPORT FOR SYSTEM SOFTWARE
The mathematical support of the system software for authorship attribution includes a combination of statistical methods and a statistical model for identifying the consonant phoneme group with the highest author and style identification capability. The combination of methods includes the Student's t-test and the Kolmogorov-Smirnov test. The test validity of style and authorship attribution can be improved if the same results are obtained by the two statistical tests applied. The Student's t-test is a parametric method and can be used only if the consonant phoneme group frequencies follow the normal distribution. The Kolmogorov-Smirnov test is a nonparametric test and does not require the consonant phoneme group frequencies to follow the normal distribution.
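A minimal Java sketch of this combination is given below. It assumes that the frequencies of one phoneme group in the portions of two texts are available as arrays. The class and method names are hypothetical, the t statistic is computed with separate variance estimates for the two samples (an assumed form), and the critical values are supplied by the caller rather than computed.

```java
import java.util.Arrays;

/** Hypothetical helper: an illustration of combining the two tests, not the authors' code. */
final class TwoTestCheck {

    /** Arithmetic mean of the portion frequencies. */
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    /** Unbiased sample variance (divides by n - 1). */
    static double variance(double[] x) {
        double m = mean(x);
        double ss = 0.0;
        for (double v : x) ss += (v - m) * (v - m);
        return ss / (x.length - 1);
    }

    /** Two-sample t statistic with separate variance estimates (assumed form). */
    static double tStatistic(double[] a, double[] b) {
        double se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
        return (mean(a) - mean(b)) / se;
    }

    /** Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs. */
    static double ksStatistic(double[] a, double[] b) {
        double[] x = a.clone();
        double[] y = b.clone();
        Arrays.sort(x);
        Arrays.sort(y);
        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < x.length && j < y.length) {
            double v = Math.min(x[i], y[j]);
            while (i < x.length && x[i] <= v) i++; // advance past all values <= v
            while (j < y.length && y[j] <= v) j++;
            d = Math.max(d, Math.abs((double) i / x.length - (double) j / y.length));
        }
        return d;
    }

    /** A difference is accepted only when both tests exceed their critical values. */
    static boolean bothSignificant(double[] a, double[] b, double tCritical, double ksCritical) {
        return Math.abs(tStatistic(a, b)) > tCritical && ksStatistic(a, b) > ksCritical;
    }
}
```

Requiring agreement of both statistics mirrors the idea that a result confirmed by a parametric and a nonparametric test is more trustworthy than a result from either test alone.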
The algorithm for applying the proposed combination of two statistical tests includes the following steps:
Step 1. The average frequency of occurrence of consonant phoneme groups is determined:
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{m} x_i n_i, \qquad (1)$$
where $m$ is the number of intervals, $x_i$ is the middle of an interval, $n_i$ is the number of frequencies falling into an interval, and $N$ is the number of portions of a sample.
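Step 2. The theoretical normal distribution is determined:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad (2)$$
where $\mu$ and $\sigma^2$ are respectively the average value and the variance of the general population. These values are unknown and are replaced by the known $\bar{x}$ and $S^2$, where $S^2$ is an unbiased estimator of dispersion:
$$S^2 = \frac{1}{N-1}\sum_{i=1}^{m} n_i (x_i - \bar{x})^2. \qquad (3)$$
Step 3. The theoretical frequency is determined:
$$n'_i = N h f(x_i), \qquad h = b_i - a_i. \qquad (4)$$
In the given expressions, $x_i$ is the middle of an interval, $a_i$ is the lower limit of an interval, $b_i$ is the upper limit of an interval, and $h$ is the interval width.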
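Step 4. Compliance of the frequencies of all eight consonant phoneme groups with the normal distribution law is determined by the Pearson's test [22]:
$$\chi^2 = \sum_{i=1}^{m} \frac{(n_i - n'_i)^2}{n'_i}, \qquad (5)$$
where $\chi^2$ is the statistic value of the Pearson's test, and $n_i$ and $n'_i$ are the observed and theoretical frequencies of an interval. The level of significance $\gamma$ is 0.05 and 0.03; the number of degrees of freedom is determined from $n$, the number of portions in a sample.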
Step 5. Significant differences between the styles in all eight consonant phoneme groups are determined by the Student's t-test (6), where $t$ is the value of the Student's t-test and $S^2$ is an unbiased estimator of dispersion [22].
Step 6. Significant differences between the styles in all eight consonant phoneme groups are determined by the Kolmogorov-Smirnov test (7), using the value of the Kolmogorov-Smirnov statistic as a function of $z$ [9].
Step 7. The consonant phoneme group in which significant differences are determined by both the Student's t-test and the Kolmogorov-Smirnov test is considered the group with a high style and author identification capability.
Step 8. The texts are differentiated by the selected phoneme group (the one with a high style and author identification capability) in the selected phoneme position in a word.
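Steps 1 to 4 can be sketched in Java as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, all intervals are assumed to have the same width, and the theoretical frequency is taken as the normal density at the interval middle multiplied by the interval width and the sample size (an assumed form). Comparing the resulting statistic with a critical value is left to the caller.

```java
/** Hypothetical sketch of the normality check for one phoneme group; not the authors' code. */
final class NormalityCheck {

    /** Step 1: mean of grouped data, sum(x_i * n_i) / N. */
    static double groupedMean(double[] mid, int[] count) {
        double sum = 0.0;
        int total = 0;
        for (int i = 0; i < mid.length; i++) {
            sum += mid[i] * count[i];
            total += count[i];
        }
        return sum / total;
    }

    /** Step 2: unbiased variance of grouped data, sum(n_i * (x_i - mean)^2) / (N - 1). */
    static double groupedVariance(double[] mid, int[] count) {
        double mean = groupedMean(mid, count);
        double ss = 0.0;
        int total = 0;
        for (int i = 0; i < mid.length; i++) {
            ss += count[i] * (mid[i] - mean) * (mid[i] - mean);
            total += count[i];
        }
        return ss / (total - 1);
    }

    /** Normal density with the sample mean and standard deviation plugged in. */
    static double normalDensity(double x, double mean, double sd) {
        double u = (x - mean) / sd;
        return Math.exp(-0.5 * u * u) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    /** Step 3 (assumed form): theoretical frequency N * h * f(x_i), h being the interval width. */
    static double[] theoreticalFrequencies(double[] mid, int[] count, double width) {
        double mean = groupedMean(mid, count);
        double sd = Math.sqrt(groupedVariance(mid, count));
        int total = 0;
        for (int c : count) total += c;
        double[] expected = new double[mid.length];
        for (int i = 0; i < mid.length; i++) {
            expected[i] = total * width * normalDensity(mid[i], mean, sd);
        }
        return expected;
    }

    /** Step 4: Pearson statistic, sum((observed - expected)^2 / expected). */
    static double chiSquare(int[] observed, double[] expected) {
        double chi = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double diff = observed[i] - expected[i];
            chi += diff * diff / expected[i];
        }
        return chi;
    }
}
```

If the statistic does not exceed the critical value for the chosen level of significance, the frequencies are treated as normally distributed and the Student's t-test of Step 5 may be applied.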
The algorithm for determining the style and author identification capability of consonant phoneme groups is presented in Fig. 1. In the statistical model, n is the number of significant differences between the compared texts and sc is the style identification capability of a certain phoneme group. If the maximum number of significant differences is 10, it corresponds to 1, and the following expression is true: 10 × 1 = const.
According to the developed model, the following dependence holds: the greater the number of significant differences obtained, the higher the style and author identification capability of the phoneme group, as shown in Table 1, where 18 is the maximum number of significant differences and 1 is the highest indicator of the style and author identification capability of the phoneme group.
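The dependence can be illustrated with a small Java sketch. It assumes a simple proportional scaling in which the maximum number of significant differences maps to 1; this scaling, like the class and method names, is an assumption made for illustration and is not taken from the model itself.

```java
/** Hypothetical illustration of ranking phoneme groups by significant differences. */
final class IdentificationCapability {

    /** Assumed proportional scaling: the maximum count corresponds to 1. */
    static double capability(int significant, int maxSignificant) {
        if (maxSignificant <= 0 || significant < 0 || significant > maxSignificant) {
            throw new IllegalArgumentException("counts out of range");
        }
        return (double) significant / maxSignificant;
    }

    /** Index of the group with the greatest number of significant differences. */
    static int bestGroup(int[] significantPerGroup) {
        int best = 0;
        for (int i = 1; i < significantPerGroup.length; i++) {
            if (significantPerGroup[i] > significantPerGroup[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

With such a ranking, only the group with the greatest number of significant differences needs to be kept for further comparisons.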
Consequently, the developed model based on the proposed combination of statistical methods allows us to minimize the number of phoneme groups by which the samples differ essentially and to make style and authorship attribution more automated. The level of test validity is 95% and 97%.

SYSTEM SOFTWARE
The structure of the developed system is based on a modular principle and includes four main modules: the user interface, the data input module, the statistical data processing module and the data output module. The data input module converts the text into transcription and forms consonant phoneme samples. The data output module presents the statistically processed results in a format convenient for the user. The user interface connects the system software with the user and supports work with files, the sequence of data processing, etc. The statistical data processing module implements the methods and the model applied.
The use of the modular principle makes it possible to quickly modify and improve the developed software product.
Information support is of particular importance in the process of software development. The constructed scheme of data conversion in the developed system is shown in Fig. 2. The first step of the system's operation is to load two files with texts. Then the texts are transcribed. The next steps involve creating a sample of consonant phonemes, dividing the sample into portions and counting the number of phonemes in each portion. Finally, statistical processing of the data is performed.
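The division into portions and the counting step can be sketched in Java as follows. The sketch assumes that the sample is a list of transcription symbols, that portions have equal size, and that a trailing incomplete portion is dropped; these choices and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of dividing a sample into portions and counting a phoneme group. */
final class PortionCounter {

    /** Splits the sample into consecutive portions of equal size; a trailing partial portion is dropped. */
    static List<List<String>> split(List<String> sample, int portionSize) {
        if (portionSize <= 0) {
            throw new IllegalArgumentException("portionSize must be positive");
        }
        List<List<String>> portions = new ArrayList<>();
        for (int start = 0; start + portionSize <= sample.size(); start += portionSize) {
            portions.add(new ArrayList<>(sample.subList(start, start + portionSize)));
        }
        return portions;
    }

    /** Counts, for every portion, how many of its phonemes belong to the given group. */
    static int[] countInGroup(List<List<String>> portions, Set<String> group) {
        int[] counts = new int[portions.size()];
        for (int p = 0; p < portions.size(); p++) {
            for (String phoneme : portions.get(p)) {
                if (group.contains(phoneme)) {
                    counts[p]++;
                }
            }
        }
        return counts;
    }
}
```

The resulting array of counts per portion is the kind of input that the statistical processing step works on.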
The system software is developed in the Java language, which provides platform independence. An example of the created diagram of software classes is shown in Fig. 3. The developed interface includes 7 tabs, the names of which correspond to the steps of the system's algorithm: "Text", "Transcription", "Sample formation", "Sample division into portions", "Sample division into groups", "Counting phonemes in portions", "Counting phonemes in groups", "The Kolmogorov-Smirnov's criterion".
The proper organization of the system provides the user with a user-friendly and intuitive interface.
Two jPanels are created in the "Text" tab; two text fields, jTextAreaLeft and jTextAreaRight, are placed on the panels; a file is selected using jFileChooser; the text fields are filled with the contents of the files using the setText method.
In the "Transcription" tab, we work with the Word.txt and Transcription.txt files. The files are filled with new texts and therefore the sample is dynamic. ArrayList<String> list structures allow us to work with dynamic data.
In the "Sample formation" tab, transcription symbols for consonants are singled out from the symbolic stream of transcription symbols. The sample is formed from the first 51,000 consonant phonemes.
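A minimal Java sketch of this step is shown below. It assumes that the transcription is available as a list of symbols and that the set of consonant symbols is known; the class, method and constant names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of forming a consonant sample from a transcription stream. */
final class SampleFormation {

    /** Sample size stated in the text. */
    static final int SAMPLE_SIZE = 51_000;

    /** Keeps consonant symbols in their original order until the limit is reached. */
    static List<String> consonantSample(List<String> transcription, Set<String> consonants, int limit) {
        List<String> sample = new ArrayList<>();
        for (String symbol : transcription) {
            if (sample.size() == limit) {
                break; // the sample is complete
            }
            if (consonants.contains(symbol)) {
                sample.add(symbol);
            }
        }
        return sample;
    }
}
```

A call such as consonantSample(symbols, consonants, SampleFormation.SAMPLE_SIZE) would return at most 51,000 consonant symbols, or fewer if the transcription is too short.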
List classes from the java.util package are used to create the "Sample division into portions", "Counting phonemes in portions" and "Counting phonemes in groups" tabs. The number of phonemes in portions and groups is counted in these tabs.
The Colt mathematical library is used for the Pearson's, Student's and Kolmogorov-Smirnov's criteria. Consequently, the developed system software allows us to automate the process of identifying a group of phonemes with the highest style and authorship identification capability.

TESTING RESULTS OF THE SYSTEM SOFTWARE FOR DETERMINING STYLE IDENTIFICATION CAPABILITY OF A PHONEME GROUP FOR AUTHORSHIP ATTRIBUTION OF A TEXT
The combination of methods, the model and the system software based on them are used to perform style and authorship attribution. The scientific style is compared with the conversational, belles-lettres and newspaper styles. In this comparison, a high style identification capability of the occlusive phoneme group is determined. For style differentiation by the Student's t-test, it is first necessary to perform the Pearson's test. The results of verifying by the Pearson's test (5) whether the frequencies of the occlusive phoneme group comply with the law of normal distribution are given in Table 2. These results show that the frequencies of the occlusive phoneme group comply with the law of normal distribution.
The results obtained by the Pearson's test make it possible to apply the Student's t-test for style differentiation. The average frequency of occurrence of consonant phoneme groups is a style differentiation criterion. Two styles are compared in a fixed phoneme group by the values of average frequencies (6). The levels of significance are 5% and 3%. If the difference of the values of average frequencies of two styles compared is significant, the styles differ essentially.
As the results show, significant differences have been obtained in 15 of 18 comparisons in the occlusive phoneme group. The same results have been established by the Kolmogorov-Smirnov's test (7). The results of the comparisons are shown in Table 3, where the symbol "+" means an essential difference and the symbol "-" stands for an unessential difference. The analysis of the results given in Table 3 allows us to conclude that the occlusive phoneme group has the highest style and authorship identification capability in the arbitrary position in a word and at the end of a word. On the basis of the obtained results, a statistical model for determining the style identification capability of the occlusive phoneme group has been developed (Fig. 4).
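The tallying of such a table can be sketched in Java as follows, assuming that each test's marks are stored as a boolean array with true standing for "+"; the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of tallying the comparison marks produced by the two tests. */
final class ComparisonTally {

    /** Number of comparisons marked as significant ("+"). */
    static int countSignificant(boolean[] marks) {
        int n = 0;
        for (boolean mark : marks) {
            if (mark) n++;
        }
        return n;
    }

    /** True when both tests give the same mark for every comparison. */
    static boolean sameResults(boolean[] tTestMarks, boolean[] ksMarks) {
        return Arrays.equals(tTestMarks, ksMarks);
    }
}
```

When sameResults holds, the count returned by countSignificant is the number of significant differences confirmed by both tests.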
According to the results of the study, it can be concluded that style and authorship attribution can be performed in the occlusive phoneme group. Minimizing the number of consonant phoneme groups from 8 to 1 makes style and authorship attribution more automated. In the model in Fig. 4 the following designations are used: JG - the occlusive phoneme group; AP - arbitrary position in a word; EW - position at the end of a word; 15 - the number of significant differences revealed in a comparison of the scientific, belles-lettres, conversational and newspaper styles; 1 - the highest index of style identification capability.
Consequently, the results of testing by the Student's t-test and the Kolmogorov-Smirnov's test show that significant differences have been obtained in most cases of style comparisons. This proves high reliability of this combination of statistical methods for style and authorship attribution.

CONCLUSION
The proposed combination of statistical methods (the Student's t-test and the Kolmogorov-Smirnov test) allows us to improve the test validity of style and authorship attribution by obtaining the same results by the two tests applied. The level of test validity is 95% and 97%.
The developed statistical model makes it possible to determine a group of phonemes with the highest style and authorship identification capability. In the research this group is the occlusive phoneme group in the arbitrary position in a word and at the end of a word in a comparison of the scientific, belles-lettres, conversational and newspaper styles. The research results allow us to minimize the number of phoneme groups in which the compared styles differ essentially, simplifying the process of style and authorship attribution and making it more automated.
The developed system software allows us to carry out style and authorship attribution more efficiently: in one consonant phoneme group instead of eight.
In further research, another combination of statistical methods will be tested for improving efficacy of style and authorship attribution.