BOTNET DETECTION APPROACH BASED ON THE DISTRIBUTED SYSTEMS

The paper presents a botnet detection approach for distributed systems. It is based on a three-level model of the botnet's components: the command and control center, the control centers, and the basic elements of the botnet (bots). The novel framework provides the ability to detect known and unknown botnets and consists of the host and the network levels. At the host level, the detection procedure is based on Bayes classification. The network level extends the results obtained at the host level to the rest of the local area network. The proposed approach exchanges the results obtained by the Bayes classification for further use by other program units of the distributed system. The results of the developed classifier show that the representation of the botnets' samples for different classes and subclasses is sufficient for efficient botnet detection. The proposed technique demonstrates promising results for botnet detection in distributed systems.


INTRODUCTION
Trends in malware development and spreading show an active extension of malware's technical capabilities. The main motivating factors behind its creation are financial gain and political advantage. One of the most rapidly evolving kinds of malware is the botnet, which allows an attacker to gain remote access to users' computers. The harm caused by such malicious software is rapidly increasing [1].
The architecture of modern antivirus tools has a single central control center. Such tools as ESET Endpoint Security for Windows (endpoint security for corporate networks) [2], Dr.Web CureNet! [3], Symantec Endpoint Protection [4], Malwarebytes Endpoint Security [5], and Cisco® Network Admission Control (NAC) [6] function in a centralized way. Kaspersky Administration Kit Antivirus works autonomously, and in critical situations decisions are made without the administrator's participation. However, it also organizes the interaction of system components in a centralized way [7]. The mentioned tools are based on methods that do not sufficiently take into account all stages of botnet functioning and the possible botnet structures, which decreases the botnet detection efficiency.
Therefore, the development of new methods and tools for efficient botnet detection in distributed systems is an urgent problem.

RELATED WORKS
In recent years, a great number of botnet detection approaches based on machine learning have been developed [8,9]. In [10], botnet detection is based on the analysis of the botnet's group activity in the network. The behavior of HTTP bots is analyzed with a histogram in order to determine the number of web requests and their diversity over time. The proposed method uses correlation analysis to detect botnets based on the similarity and the correlation of their group activities in the network.
International Journal of Computing, www.computingonline.net, computing@computingonline.net, Print ISSN 1727-6209, On-line ISSN 2312-5381
An agent-based system for modeling botnet architectures, which takes into account various mechanisms of botnet functioning, is presented in [11]. It is based on the necessity to take into consideration the special aspects of the botnet structure, as this is important for gathering the characteristics of the botnets. In the articles [12][13][14][15][16], methods for botnet detection based on traffic analysis and anomaly detection are presented. The disadvantage of this technique is the need for constant traffic analysis and for obtaining the required features, which attackers can change rapidly. Moreover, the technique does not take into account the botnets' architectures. In [17,18], the botnet detection methods are based on signatures. The technique requires capturing a great number of packets and comparing them with the pre-configured attack templates from a database. The common disadvantage of these methods is the need to update the templates, which results in the inability to detect new botnets or their nodes. In [19], a mechanism for the analysis of botnet activities in the IoT based on machine learning techniques is presented. It is based on network flow identifiers that can track suspicious botnet activities. In [20][21][22], methods for botnet detection in corporate area networks (CAN), which include the use of a multi-agent system, are presented. Botnet detection is based on the analysis of the botnets' behavior in the CAN. The method is able to detect bots that use such evasion techniques as cycling of IP mapping, "domain flux", "fast flux", and DNS tunneling. In [23], a system of baits for provoking malware, located in a distributed system, is developed. To identify new botnets, the system requires the permanent addition of malware behaviors. In [24], a botnet detection approach uses unsupervised machine learning and similarity analysis between benign traffic data and botnet traffic data.
Known methods and tools do not provide high efficiency of botnet detection. This is due to the development of new techniques for botnet distribution in networks and computer systems and to the appearance of new capabilities of botnet functioning. Moreover, network antivirus tools are mainly based on a rigidly centralized architecture, which intruders also exploit to attack the computer systems that contain such a center.
Therefore, the development of new effective methods and tools for botnet detection, which take into account the prospective possibilities of botnet functioning and the distributed architecture, is a relevant problem.

THE STRUCTURE OF THE CONTROLLED DISTRIBUTED BOTNET
The botnet is a distributed software system that includes a great number of nodes (bots), which communicate via malicious software. Structurally, botnets include nodes that are assigned to control the network and maintain its integrity, and end nodes that are aimed at carrying out the malicious actions. An attacker controls the botnet directly through a command-and-control center (C&C) or via other intermediate remote control centers [20,25,26].
Let us define the botnet's components: the command-and-control centers, the control centers, and the basic elements of the botnet (bots). The structure of the botnet is shown in Fig. 1. Let us define the basic elements of the botnet as the subsets $E_{3,j_3}$, $j_3 = 1, 2, \dots, n_3$, where $n_3$ is the number of the botnet's basic elements, and the botnet's control centers as the subsets $E_{2,j_2}$, $j_2 = 1, 2, \dots, n_2$, where $n_2$ is the number of control centers of the botnet.
Let us define the command-and-control centers of the botnet as the subsets $E_{1,j_1}$, $j_1 = 1, 2, \dots, n_1$, where $n_1$ is the number of command-and-control centers of the botnet. The number of these centers may vary.
Botnets may have different architectures, depending on the topology and communication elements: multi-server, hierarchical, random (peer-to-peer), and hybrid. The architecture presented in Fig. 1 is generalized and covers these topologies. Let us present the botnet as the union of its components:
$$E = \bigcup_{j_1=1}^{n_1} E_{1,j_1} \;\cup\; \bigcup_{j_2=1}^{n_2} E_{2,j_2} \;\cup\; \bigcup_{j_3=1}^{n_3} E_{3,j_3}, \tag{1}$$
where $E$ is the set of the botnet's components. The elements of the subsets $E_{i,j_i}$, $i = 1, 2, 3$, are the functions $f_1, f_2, f_3$. Different elements of the subsets $E_{i,j_i}$ may include the same functions. The functional load of each of the assigned functions depends on the type of the operating system and its API functions. Let us consider the botnet's malicious action as a sequence of API calls.
To represent the botnet's malicious behavior, let us define its main items:
• $v_{p,e}^{s}$, a vector that describes its malicious actions;
• $p$, the vector number;
• $s$, the number of the variant of the malicious action's presentation via API functions;
• $e$, the number of the vector components $x_{p,e}$.
The vector components $x_{p,e}$ are the numbers of the API functions that may be executed by the botnets. The task of the classifier is to assign the analyzed vector $v_{p,e}^{s}$ to one of the botnets' classes.
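As an illustration of this representation, the following minimal sketch turns a monitored sequence of API calls into a vector of API function numbers. The catalogue `API_INDEX`, its numbering, and the helper `encode_calls` are hypothetical and chosen only for the example; the paper's own catalogue of API functions is not reproduced here.

```python
from collections import Counter

# Hypothetical catalogue: API function name -> API function number.
# Only a few illustrative Windows API names are listed.
API_INDEX = {
    "CreateFileW": 1,
    "WriteFile": 2,
    "RegSetValueExW": 3,
    "connect": 4,
    "send": 5,
}


def encode_calls(api_calls):
    """Map a sequence of API names to a vector of API function numbers.

    Names that are not in the catalogue are skipped.
    """
    return [API_INDEX[name] for name in api_calls if name in API_INDEX]


trace = ["connect", "send", "send", "CreateFileW", "Sleep", "WriteFile"]
vector = encode_calls(trace)
repetitions = Counter(vector)  # how often each API function number occurs

print(vector)          # [4, 5, 5, 1, 2]
print(repetitions[5])  # 2
```

Keeping the repetition counts alongside the vector is convenient because the multinomial model used later depends on how many times each API function occurs.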
Based on the structure presented in Fig. 1, the reference botnet model, represented via the vectors of malicious actions, is assigned to a class $K_y$. Known types of botnets are characterized by different functional possibilities. Furthermore, some malicious actions implemented via API functions may occur more frequently.
One malicious botnet action may be described by more than one vector, each containing a set of the API functions most often called to perform the specified malicious action. The botnets' classes are characterized by these vectors. Because the structure of the botnet may include basic elements, control centers, and command-and-control centers, and the corresponding vector of malicious actions cannot be compared with the whole class, each class can be divided into subclasses.
To establish whether the resulting vector represents malware or benign software, the naive Bayesian classifier is used. Its main benefits are the following: it is simple and easy to implement, it does not require much training data, it handles both continuous and discrete data, it is highly scalable with the number of predictors and data points, it is fast and can be used to make real-time predictions, and it is not sensitive to irrelevant botnet features.
Let us define $X = \{x_{p,1}, x_{p,2}, \dots, x_{p,e}\}$ as the sample formed on the basis of the API calls for the vector $v_{p,e}^{s}$, and $A$ as a hypothesis about the membership of the values of $X$ in one of the botnets' classes $K_y$, where $y = 0, 1, \dots, 6$. To solve the classification problem, we evaluate the probability that the sample $X$ belongs to the class $K_y$, taking into account the knowledge about the botnets' actions. For this purpose we need the probability $P(A \mid X)$ that the hypothesis $A$ holds for the data of the sample $X$. Let us evaluate this a posteriori probability $P(A \mid X)$, the probability of $A$ given the actions in the sample $X$, using Bayes' theorem.
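For reference, Bayes' theorem for a class-membership hypothesis $A$ and a sample of API calls $X$ can be written in its standard form as follows. It is given here as a reminder and is not necessarily the exact equation used by the authors.

```latex
P(A \mid X) = \frac{P(X \mid A)\, P(A)}{P(X)}
```

Since $P(X)$ is the same for every hypothesis, comparing classes only requires the numerator $P(X \mid A)\, P(A)$.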
Each botnet class is defined and represented by a set of pairs of vectors of malicious actions. The sample $X$ belongs to the class $K_{y_1}$ with the highest value of the a posteriori probability if and only if the following condition is fulfilled:
$$P(K_{y_1} \mid X) > P(K_{y_2} \mid X), \tag{2}$$
where $0 \le y_1 \le 6$, $0 \le y_2 \le 6$, $y_1 \ne y_2$.
So, we search for the class $K_y$ with the highest value of the probability $P(K_y \mid X)$.
To assign the vector $v_{p,e}^{s}$ to a certain botnet class, the product of the probabilities of the API functions included in the vector of potentially suspicious actions is evaluated. For this purpose, the multinomial generative model was used, which takes into account the number of repetitions of the API functions and does not take into account the absence of API functions.
The membership of the vector $v_{p,e}^{s}$ in the class $K_y$ or its subclass $K_{y,g}$ is determined by calculating the probabilities for each class or subclass using the Bayes classifier:
$$P(v_{p,e}^{s} \mid K_{y,g}) = P(|n_{v_p}|)\, n_{v_p}! \prod_{i} \frac{P(a_i \mid K_{y,g})^{N_i}}{N_i!}, \tag{3}$$
where $n_{v_p}$ is the number of API calls in the vector, $a_i$ is the $i$-th API function, and $N_i$ is the number of its repetitions in the vector.
To conduct the training procedure, the probabilities $P(a_i \mid K_{y,g})$ are to be estimated. For this purpose, we evaluate the optimal estimates of the probabilities that an API function will be present in each class or subclass, modifying the result with Laplace smoothing to avoid the "zero-frequency" problem:
$$\hat{P}(a_i \mid K_{y,g}) = \frac{1 + N_{i,y,g}}{M + \sum_{j=1}^{M} N_{j,y,g}}, \tag{4}$$
where $N_{i,y,g}$ is the number of occurrences of the API function $a_i$ in the training vectors of $K_{y,g}$ and $M$ is the total number of API functions.
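The following minimal sketch shows one way a multinomial naive Bayes classifier with Laplace (add-one) smoothing could be implemented over vectors of API function numbers. It is an illustration under stated assumptions, not the authors' implementation: the class name `MultinomialBayes`, the labels, and the toy training data are hypothetical, and the factors that do not depend on the class (the length term and the factorials) are dropped because they do not change which class scores highest.

```python
import math
from collections import Counter


class MultinomialBayes:
    """Multinomial naive Bayes over vectors of API function numbers."""

    def __init__(self, vocabulary_size):
        self.vocabulary_size = vocabulary_size  # total number of API functions
        self.api_counts = {}                    # label -> Counter of API numbers
        self.samples = Counter()                # label -> number of training vectors

    def fit(self, vectors, labels):
        for vector, label in zip(vectors, labels):
            self.api_counts.setdefault(label, Counter()).update(vector)
            self.samples[label] += 1

    def log_likelihood(self, vector, label):
        counts = self.api_counts[label]
        denominator = sum(counts.values()) + self.vocabulary_size
        # Laplace smoothing: (1 + count) / (vocabulary size + total count).
        return sum(
            repeats * math.log((counts[api] + 1) / denominator)
            for api, repeats in Counter(vector).items()
        )

    def log_posteriors(self, vector):
        total = sum(self.samples.values())
        return {
            label: math.log(self.samples[label] / total)
            + self.log_likelihood(vector, label)
            for label in self.api_counts
        }

    def predict(self, vector):
        scores = self.log_posteriors(vector)
        return max(scores, key=scores.get)


# Toy data: two hypothetical classes described by API function numbers 1..5.
clf = MultinomialBayes(vocabulary_size=5)
clf.fit(
    [[1, 1, 2], [1, 2, 2], [4, 5, 5], [5, 5, 4]],
    ["spy", "spy", "ddos", "ddos"],
)
print(clf.predict([5, 4, 5]))  # ddos
```

Working with logarithms avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps the likelihood finite for API functions that never occurred in a class during training.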

LEARNING PROCEDURE
The learning procedure involves the following stages:
1) The subclasses are defined via one presentation of API functions for each of them, and the probabilities for each of the subclasses and classes are calculated and defined as primary.
2) Each next known variation of the presentation of the botnet's malicious action via a vector of API functions is classified by the Bayes classifier into the classes and subclasses. If the received presentation is assigned correctly to the specified subclass, then its marked elements are added to the subclass as a separate sample. If the obtained result does not classify it into the required subclass, it is still placed in this subclass, but at the same time comparisons with other values of the classes are performed; the result of the comparisons is the deviation from the initial values of the probabilities. If the resulting probability differs significantly (by more than the 10% threshold) from the primary probability of a subclass or class, then a new separate subclass of this class is created. At each learning stage, the probabilities for each API function of the subclasses and classes are calculated; for those subclasses and classes where the divergence from the primary classes is more than 10%, a new subclass is created within the subclass. All probabilities obtained after several iterations are averaged and considered as the appropriate probabilities for use in further calculations.
3) After the basic training phase is completed and new data is added to the Bayes classifier, the deviations for the additional subclasses are verified, and the divergence between their mean probability values is evaluated. If the divergence is less than 10%, then the data of the additional subclass is added to the subclass, and all probabilities and their mean values are recalculated.
4) At each stage, the difference between the mean probability values and the difference between the probability values obtained by the classifier are evaluated. If the difference is more than 10% for some subclasses, then training is continued for them by adding additional data and repeating steps 1-3.
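The 10% rule used in the stages above can be sketched as follows. The text does not specify how the divergence between probabilities is measured, so this example assumes a mean relative deviation over the API functions of the primary subclass; the function names and the returned labels are hypothetical.

```python
THRESHOLD = 0.10  # the 10% divergence threshold from the learning procedure


def mean_relative_deviation(primary, candidate):
    """Average relative difference between two sets of API probabilities.

    Both arguments map an API function number to its estimated probability.
    The primary probabilities are assumed to be non-zero (e.g. smoothed).
    """
    deviations = [
        abs(candidate.get(api, 0.0) - probability) / probability
        for api, probability in primary.items()
    ]
    return sum(deviations) / len(deviations)


def place_sample(primary, candidate, threshold=THRESHOLD):
    """Decide whether new data extends a subclass or starts a new one."""
    if mean_relative_deviation(primary, candidate) > threshold:
        return "new_subclass"
    return "merge"


primary = {1: 0.5, 2: 0.5}
print(place_sample(primary, {1: 0.52, 2: 0.48}))  # merge
print(place_sample(primary, {1: 0.70, 2: 0.30}))  # new_subclass
```

Other divergence measures could be substituted in `mean_relative_deviation` without changing the decision logic.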
The vector $v_{p,e}^{s}$ may not be assigned to any given class or subclass. This means that the analyzed object probably does not include malicious actions. This decision is based not only on the search for the maximum probability calculated by Bayes' theorem, but also on the correspondence between this probability and the threshold values of the classes and subclasses defined during the classifier's learning process. The reason is that the executable process represented by the vector $v_{p,e}^{s}$ may belong to benign software. In this case, further analysis is interrupted.
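A possible form of this decision is sketched below: the best class is accepted only if its normalized posterior reaches the threshold of that class, and otherwise the vector is treated as benign. The normalization over the candidate classes, the function names, and the threshold values are assumptions made for the example; the paper does not state how the thresholds are applied.

```python
import math


def normalize(log_scores):
    """Turn per-class log scores into probabilities that sum to one."""
    peak = max(log_scores.values())
    weights = {label: math.exp(score - peak) for label, score in log_scores.items()}
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}


def classify_or_reject(log_scores, thresholds):
    """Return the best class, or None if its posterior is below its threshold."""
    posteriors = normalize(log_scores)
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] >= thresholds[best] else None


thresholds = {"A": 0.8, "B": 0.8}  # hypothetical per-class thresholds
print(classify_or_reject({"A": -1.0, "B": -9.0}, thresholds))  # A
print(classify_or_reject({"A": -1.0, "B": -1.1}, thresholds))  # None
```

Subtracting the largest log score before exponentiating keeps the computation stable when the scores are large negative numbers.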
The self-learning procedure is carried out according to the training scheme (steps 1-4). After the vector of possible malicious actions is analyzed, its data is added to the class and the class probabilities are recalculated. If the deviations of the probabilities are within the specified thresholds for one of the subclasses, then after the classification is completed the new data is included in this subclass, and the probabilities of the entire classifier and the mean values of its probability deviations are recalculated.
After a new item is added to the classifier of a program unit, the obtained information is sent to the other program units of the distributed system for use.
The decision on where to process the obtained vector of malicious actions is made on the basis of the workload of the computer system in which the data were collected. If the workload is high, the obtained vectors are sent to other program units for processing. After the data processing is complete, the results are sent back.
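The workload-based choice of the processing location could look like the following sketch. The load scale from 0 to 1, the `high_load` limit, and the choice of the least loaded peer are assumptions made for illustration; the text only states that vectors are sent to other program units when the local workload is high.

```python
def choose_processing_unit(local_load, peer_loads, high_load=0.75):
    """Pick where a feature vector should be classified.

    local_load: workload of the collecting computer system, from 0.0 to 1.0.
    peer_loads: mapping of peer program unit name -> its reported workload.
    Returns "local" or the name of the selected peer.
    """
    if local_load < high_load or not peer_loads:
        return "local"
    # Assumption: offload to the least loaded peer program unit.
    return min(peer_loads, key=peer_loads.get)


print(choose_processing_unit(0.30, {"unit-2": 0.10}))                  # local
print(choose_processing_unit(0.90, {"unit-2": 0.60, "unit-3": 0.20}))  # unit-3
```

Falling back to local processing when no peer is known keeps the detection running even if the rest of the distributed system is unreachable.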
If the analyzed software is classified as malicious, its vector is added to the classifier and sent to all program units of the distributed system.

THE STAGES OF THE BOTNET DETECTION APPROACH BASED ON THE DISTRIBUTED ARCHITECTURE
The botnet detection approach based on the distributed architecture involves the following stages:
1. Obtaining information about the active processes using active monitoring (starting from the first API function of each process that is performed after the start of the computer system).
2. Gathering the monitoring data into a vector after possible malicious actions are detected in the computer system.
3. Forming the feature vector based on the determined potentially malicious actions. The components of the feature vector are the API functions.
4. Deciding where the feature vector is to be processed.
5. If the computing load of the computer system is low, the information is processed on this computer system; otherwise it is sent to another specified program unit of the distributed system.
6. Classifying the vector and analyzing the results.
6.1. If the feature vector has been assigned to one of the botnets' classes, this information is sent to all classifiers of all program units.
6.2. If the feature vector has been assigned to several botnet classes, other program units of the distributed system are involved in the feature vector analysis.
6.3. If the similarity with the available botnet classes is low, but other program units of the distributed system have decided that the feature vector contains malicious actions, a new botnet class is created, and the classification information is updated and sent to all program units of the distributed system.
6.4. If the analyzed feature vector does not contain malicious actions, the analysis is completed.
6.5. If the feature vector corresponds to malicious behavior, the analyzed executable is stopped.
6.6. Searching for similar processes in other computer systems of the network, based on the obtained information, using the installed program units of the distributed system.
7. Calculating, for each program unit of the distributed antivirus system, the probability that the computer system is infected.
Thus, the developed technique is based on the distributed architecture, uses the Bayes classifier, and is able to detect new botnets.
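The branching of stages 6.1 to 6.4 can be summarized in a small decision function. This is only a sketch of the control flow: the function name, its arguments, and the returned action labels are hypothetical, and the actual exchange of messages between program units is not modeled.

```python
def decide(matched_classes, peers_report_malicious):
    """Map a classification outcome to the next action of a program unit.

    matched_classes: botnet classes the feature vector was assigned to.
    peers_report_malicious: True if other program units judged the
    vector to contain malicious actions.
    """
    if len(matched_classes) == 1:
        return "broadcast_class"    # stage 6.1: notify all classifiers
    if len(matched_classes) > 1:
        return "ask_peers"          # stage 6.2: involve other program units
    if peers_report_malicious:
        return "create_new_class"   # stage 6.3: unknown botnet, new class
    return "finish"                 # stage 6.4: no malicious actions found


print(decide(["K1"], False))        # broadcast_class
print(decide(["K1", "K3"], False))  # ask_peers
print(decide([], True))             # create_new_class
print(decide([], False))            # finish
```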
The architecture of the distributed system is presented in Fig. 2.

EXPERIMENTS
The purpose of the experiments was to verify the efficiency of the botnet detection technique using the Bayes classifier. To carry out the experiments, 28 artificial botnets were constructed and grouped by classes. These botnets had the functional properties of the following bot classes: Agobot, SDBot, Spybot, evilbot, DSNX, G-sys (remote control, usage of system vulnerabilities, server attacks, system spying, etc.).
The obtained malicious programs included 25 structural elements with three functioning stages, which used 81 API functions [20]. Not all botnets used in the experiments contained all possible structural elements and functions.
Each malicious sample was presented as a vector, taking into account the variations of its presentation via API functions, and all samples were assigned to the botnet classes and subclasses.
To conduct the experiment, a local network with 19 computer systems was employed. Each computer system contained the program unit and no other antivirus tools.
First, the program units used a classifier that had been trained on none of the generated botnet samples. One computer system contained the command-and-control center, the control centers were located in 3 computer systems, and 15 computer systems were infected with the botnet's bots.
The generated botnets were installed one at a time. After each experiment was completed, all computer systems in the network were completely updated, except for the classifiers.
Each experiment for each botnet sample lasted 96 hours. The botnets selected for the experiment use the strategy of obtaining complete control over the computer system.
The experiments involved extracting the vectors of possible malicious actions by monitoring API calls in the computer system. The obtained vectors were analyzed by the classifier of the program unit. The experiments were carried out with both the trained and the untrained classifier.
The aim of the experiments was to determine the botnet detection efficiency rates for classes and subclasses using the Bayes classification.
To evaluate the efficiency of the method, let us consider its main parameters:
• $P_{1,1}$ and $P_{1,2}$, the rates of correctly classified vectors of the botnets' malicious samples with respect to the botnets' classes using the trained and untrained classifiers, respectively;
• $P_{2,1}$ and $P_{2,2}$, the rates of correctly classified vectors of the botnets' malicious samples with respect to the botnets' subclasses using the trained and untrained classifiers, respectively;
• $P_{3,1}$ and $P_{3,2}$, the rates of computer systems correctly identified as infected using the trained and untrained classifiers, respectively;
• $P_{4,1}$ and $P_{4,2}$, the false positive rates for the trained and untrained classifiers, respectively;
• $P_{5,1}$ and $P_{5,2}$, the rates of malicious samples assigned to a wrong botnet class using the trained and untrained classifiers, respectively.
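Rates of this kind can be computed from pairs of true and predicted labels, as in the sketch below. The exact definitions used in the experiments are not given in the text, so the formulas here are assumptions: `None` stands for benign software, and the toy data is invented purely to exercise the function and is not taken from the experiments.

```python
def ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def detection_rates(pairs):
    """Compute rates from (true_label, predicted_label) pairs.

    A label of None means benign; any other value is a botnet class.
    """
    malicious = [(t, p) for t, p in pairs if t is not None]
    benign = [(t, p) for t, p in pairs if t is None]
    return {
        # malicious vectors assigned to their own class
        "correct_class": ratio(sum(t == p for t, p in malicious), len(malicious)),
        # malicious vectors assigned to a different botnet class
        "wrong_class": ratio(
            sum(p is not None and p != t for t, p in malicious), len(malicious)
        ),
        # benign vectors flagged as belonging to some botnet class
        "false_positive": ratio(sum(p is not None for _, p in benign), len(benign)),
    }


# Invented toy outcomes, not experimental data.
outcomes = [("A", "A"), ("A", "B"), ("B", "B"), ("B", None), (None, None), (None, "A")]
print(detection_rates(outcomes))
# {'correct_class': 0.5, 'wrong_class': 0.25, 'false_positive': 0.5}
```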
The results of the experiment for seven botnet classes are presented in Table 1. The results demonstrated that the accuracy of the botnet sample classification is 66% for the classifier trained without the botnet samples and 88% for the classifier trained using the generated vectors. The obtained results were averaged, and their dispersion relative to the mean value is 1%.
The deviation between the results of the trained and untrained classification for each class and subclass was 21.5%. The dispersion for each class and subclass with both ways of classification was less than 5%. The false positive rates were 11.7% for the trained classifier and 31.97% for the untrained classifier.
The rates of assignment of the malicious samples to a wrong botnet class using the trained and untrained classifiers were 0.01% and 2.14%, respectively.

CONCLUSION
The paper presents a botnet detection approach for distributed systems. The novel framework provides the ability to detect botnets. It consists of two parts: the host level and the network level.
At the host level, the detection procedure is based on Bayes classification. The network level extends the results obtained at the host level to the rest of the local area network. The approach exchanges the results obtained by the Bayes classification for further use by other program units of the distributed system.
The results of the developed classifier show that the representation of the botnets' samples for different classes and subclasses is sufficient for efficient botnet detection. The results of the experiment demonstrated that the accuracy of botnet detection is up to 88%.

THE FUTURE WORK
Future work is to develop new methods for botnet detection that focus on the architecture of distributed systems. These methods should build on the advantages over host-based methods and be extended with new botnet samples for more efficient botnet detection.