A FRAMEWORK FOR INCREMENTAL PARALLEL MINING OF INTERESTING ASSOCIATION PATTERNS FOR BIG DATA

Association rule mining plays an important role in Big Data analysis in distributed environments. The massive volume of data creates a pressing need for novel, parallel and incremental association rule mining algorithms that can handle Big Data. In this paper, we propose a framework for incremental parallel mining of interesting association rules for Big Data. The framework incorporates interestingness measures into the mining process. It processes incremental data, which usually arrives at different times, and discovers the knowledge that matters to the user by processing only the new data, without restarting from scratch. One of its main features is that it takes into account the user's domain knowledge, which increases monotonically. A model that incorporates the users' beliefs during the extraction of patterns is attractive, effective and efficient. The proposed framework is implemented and run on public datasets, and it is evaluated on the basis of the interesting rules that are found.


INTRODUCTION
Recent advances in digital data collection and data acquisition technologies have opened new avenues to acquire and store increasingly massive volumes of data. This rapid growth of data leads to several considerable issues, such as storage, security, scalability, and extraction of interesting knowledge, which are difficult to handle using conventional techniques, methods, and tools. Data are useful only if they can be interpreted and analyzed and if conclusions can be drawn from them [1][2][3].
Big Data mining refers to extraction techniques that are applied to Big Data in order to retrieve interesting patterns from a massive volume of data [4]. Association rule mining plays an important role in Big Data analysis in distributed environments [5].
Although many efficient algorithms have been developed to extract association rules, traditional algorithms do not work well on Big Data [6][7][8][9]. Their main drawback is that they consider neither the data size nor the time at which the data arrives, and therefore build a model in a batch manner. In contrast, an incremental algorithm constructs and refines the model as new data arrives at different times [10][11][12][13].
The aim of this paper is to propose a framework for incremental parallel mining of interesting association rules for Big Data. One of the main advantages of the proposed framework is that it handles time-changing Big Data and user domain knowledge. This is useful when many datasets arrive at different times or from a distributed environment, since it is desirable to update the discovered patterns each time new data arrives. The incremental and parallel nature of the proposed framework makes it possible to extract the patterns that are interesting at the current time with regard to the previously discovered patterns, rather than exhaustively extracting all patterns.
Parallel and incremental association rule algorithms that incorporate the users' domain knowledge during the extraction of patterns are attractive, effective and efficient for the knowledge discovery in databases (KDD) process.

RELATED WORKS
Frequent itemset mining algorithms include the Apriori method [14] and the tree-based method [15]. Parallel frequent itemset mining algorithms based on Apriori [14] include those in [6, 7, 16]. They are categorized as count distribution (e.g., parallel data mining (PDM) [6] and fast parallel mining [7]) and data distribution (DD) [17]. These approaches assume that each processor of a parallel system calculates the local support counts of all candidate itemsets; all processors then compute the total support counts of the candidates by exchanging their local support counts. Other parallel frequent itemset mining algorithms are based on tree methods [15]. Examples are the Parallel FP-Growth algorithm (PFP-Growth) for clustered systems [18], the load-balanced parallel FP-Growth algorithm [19], an efficient parallel algorithm that uses the message passing interface on a shared-nothing multiprocessor system [9], and the Parallel FP-Growth algorithm for mining frequent patterns [8]. The PFP algorithm uses the MapReduce parallel programming model to analyze and mine data [8,20]. It splits the database into small chunks, uses MapReduce in three phases to count values, group items, and build trees, and finally combines the results of these phases. The main drawbacks of PFP-Growth are that it does not work on an incremental database and does not use any subjective measure of interestingness.
Many algorithms have been developed for mining incremental association rules [21][22][23][24][25]. Their main idea is to update the discovered model when a new data stream arrives. In [24], the DEMON algorithm is proposed to handle evolving data more effectively and efficiently. In [25], the DELI algorithm is proposed for monitoring changes in the data stream; it uses statistical methods for the updating process.
The DELI algorithm uses a sampling method to estimate the support counts, using approximate upper and lower bounds on the number of changes in the newly discovered association rules. As the lower bound gets smaller, the changes in the association rules get smaller, so model maintenance is not required. Although these algorithms are incremental, they do not reuse the previously discovered knowledge when new data arrives at a new time instance. In [22,23], the Fast UPdate (FUP) algorithm is proposed, which is incremental in nature and mines association rules in huge databases. It scans the database to verify whether there are large itemsets and computes the large itemsets in the updated database.
The main purpose of this algorithm is to update association patterns efficiently in the updated database. The algorithm is extended to FUP* and FUP2, which scan the database k times. In [26], Parallel Incremental FP-Growth (PIFP-Growth) is proposed to improve the PFP algorithm [8] so that it can handle an incremental database. PIFP-Growth is based on MapReduce [20] for parallelized incremental mining. The drawbacks of these algorithms are that they have many time-consuming stages and run MapReduce several times: PFP uses MapReduce in three of its five stages, while PIFP uses MapReduce in four of its seven stages. In addition, neither algorithm uses any subjective measure of interestingness.
The novelty measure of discovered patterns is proposed in [10][11][12]. It is quantified with respect to known knowledge, and it eliminates the patterns that are not interesting from the user's point of view. In our work, we take advantage of this novelty measure of interestingness. Although PFP and PIFP are proposed to deal with parallel and incremental Big Data mining, both approaches are based on traditional FP-Growth [15] and use the MapReduce programming model. Our framework can use any frequent pattern mining algorithm that uses MapReduce. It is similar to PFP and PIFP in that it uses MapReduce to achieve parallelism, but it differs from PIFP in its incremental manner. The major differences between the proposed framework and PFP and PIFP are:
- PFP uses MapReduce in three of its five stages and PIFP uses MapReduce in four of its seven stages, while the proposed framework uses MapReduce in only two of its four stages.
- Neither PFP nor PIFP considers the previously discovered patterns when new data arrives, while the framework updates the model with novel patterns as a new data stream arrives.
- PIFP resets the threshold value as new data arrives and updates the old local tree, while our approach constructs a different local tree as new data arrives and generates new frequent items.
- To achieve parallelism, PFP and PIFP divide items into groups and generate group-dependent transactions to build trees and extract frequent items, while the framework uses MapReduce to construct trees directly from transactions after pruning the infrequent items that do not meet the minimum support criterion.
Even though PFP and PIFP are based on FP-Growth, which includes two steps, the framework adds extra steps in order to update the model as new data arrives and to guarantee that the discovered patterns are interesting.
The rest of the paper is organized as follows. In section 3, we present the problem statement. A Framework for Incremental Parallel Mining of Interesting Association Patterns is presented in section 4. In section 5 a detailed example is illustrated. In section 6, the experimental results are presented and the conclusion is given in section 7.

PROBLEM STATEMENT
, and F-List is generated to construct local trees FP-Treem. Subsequently, frequent items are extracted and association rules are generated to form model Ti .
Let Mi and Mi+1 be two models discovered at time instances ti and ti+1 from datasets

A FRAMEWORK FOR INCREMENTAL PARALLEL MINING OF INTERESTING ASSOCIATION PATTERNS
In this paper, we present a framework that efficiently discovers interesting patterns from Big Data. It uses MapReduce [20] to process data in parallel. Our proposed framework is similar to the PFP algorithm [8], except that a rule generated from the frequent itemset list in PFP may not be interesting. At time Ti, our framework computes the novelty aspect of the interestingness measure with respect to the existing model MTi and prunes uninteresting patterns that are not significant in the current dataset. The framework is shown in Fig. 1. It comprises three phases: building local trees, finding frequent itemsets, and building the incremental interesting model. These phases are explained in the following subsections.

BUILDING LOCAL TREES
In this phase, the Big Data is divided into m small parts, where m can be set manually, and the parts are distributed among P processors using the MapReduce parallel programming model. On each processor, MapReduce first reads each small part to perform a parallel count and integrates the count results into a frequent list called the F-List; it then sorts the items of the F-List in descending order of support. Finally, MapReduce performs a second pass over each small part and builds a local FP-Tree. The outputs of this phase are the local trees FP-Treem. The steps required to build local trees are presented in Algorithm 1.
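The counting, pruning and sorting described in this phase can be sketched in a few lines of Python. This is a minimal single-process illustration, not the Hadoop implementation: the names PARTITIONS, local_counts, build_f_list and order_transaction are hypothetical, the alphabetical tie-break is an assumption, and the transactions are invented so that the merged counts agree with the F-list quoted in the detailed example (they are not the contents of Table 1).

```python
from collections import Counter

# Hypothetical transactions, split into m = 3 partitions. They are chosen so
# that the merged counts agree with the F-list quoted in the worked example;
# they are not taken from Table 1.
PARTITIONS = [
    [["A", "B", "C", "D", "E"], ["A", "B", "C", "D"]],
    [["A", "B", "C", "D", "F"], ["B", "D", "E", "G"]],
    [["B", "D", "E"], ["A", "B", "C", "E", "G"]],
]


def local_counts(partition):
    """Map step: count how many transactions of one partition hold each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def build_f_list(partitions, min_support):
    """Reduce step: merge local counts, prune infrequent items, sort."""
    total = Counter()
    for partition in partitions:
        total.update(local_counts(partition))
    n_transactions = sum(len(p) for p in partitions)
    min_count = min_support * n_transactions
    frequent = {item: c for item, c in total.items() if c >= min_count}
    # Descending support; ties are broken alphabetically (an assumption).
    return sorted(frequent.items(), key=lambda kv: (-kv[1], kv[0]))


def order_transaction(transaction, f_list):
    """Keep only frequent items and sort them by their F-list position."""
    rank = {item: pos for pos, (item, _) in enumerate(f_list)}
    return sorted((i for i in set(transaction) if i in rank), key=rank.get)


F_LIST = build_f_list(PARTITIONS, min_support=0.5)
```

The ordered transactions returned by order_transaction are what would be inserted into a local FP-Tree during the second pass; the tree construction itself is omitted here.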

FINDING FREQUENT ITEMSETS
In this phase, the FP-Treem generated in the first phase are taken by the Mappers, which connect the trees from different nodes with each other. Subsequently, the Reducers extract the frequent itemsets from the trees and save them in memory temporarily. The output of this phase is the list of frequent itemsets. The following steps are required to find the frequent itemsets; the algorithm is presented in Algorithm 2.
1. Divide the F-list into a number of groups (mGroups), each called a G-list.
2. Send each G-list to a different processor, each of which runs MapReduce.
3. On every processor, examine the items of the F-list in reverse order (from the last item to the first) to determine whether they belong to the G-list.
   a. The Mapper reads all paths of each item in the different FP-Trees and extracts a temporary local F-list for each item.
   b. The Reducer constructs a temporary local tree for each item based on its paths and its temporary local F-list.
   c. For every temporary local tree, the Reducer extracts the local frequent items for the items with unique paths; otherwise, the previous steps are repeated.
4. Merge the local frequent item lists on the different processors to form the frequent items.
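Step 3 can be illustrated with a small Python sketch that works on ordered transactions instead of actual FP-Trees. It collects the prefix paths of one item, builds the temporary local F-list, and prunes it by minimum support. The names ORDERED, prefix_paths, conditional_f_list and frequent_itemset_for are hypothetical, the transactions are invented for illustration, and the last function is a deliberate simplification that treats the surviving prefix items as a single itemset rather than recursing on a conditional tree.

```python
from collections import Counter

# Hypothetical transactions, already restricted to frequent items and sorted
# in F-list order (B, D, A, C, E). They are illustrative, not Table 2.
ORDERED = [
    ["B", "D", "A", "C", "E"],
    ["B", "D", "A", "C"],
    ["B", "D", "A", "C"],
    ["B", "D", "E"],
    ["B", "D", "E"],
    ["B", "A", "C", "E"],
]


def prefix_paths(ordered_transactions, item):
    """Collect the prefix that precedes `item` in every transaction holding it."""
    paths = []
    for transaction in ordered_transactions:
        if item in transaction:
            paths.append(transaction[: transaction.index(item)])
    return paths


def conditional_f_list(paths, min_count):
    """Count items over the prefix paths and drop those below min_count."""
    counts = Counter(i for path in paths for i in path)
    return {i: c for i, c in counts.items() if c >= min_count}


def frequent_itemset_for(ordered_transactions, item, min_count):
    """Return (itemset, support) built from the surviving prefix items.

    Simplification: the surviving items are treated as one itemset and its
    support is recounted directly, instead of recursing on a conditional tree.
    """
    kept = conditional_f_list(prefix_paths(ordered_transactions, item), min_count)
    itemset = frozenset(kept) | {item}
    support = sum(1 for t in ordered_transactions if itemset <= set(t))
    return itemset, support
```

With these invented transactions and a minimum count of 3, the prefix paths of E keep only B and D, so the call frequent_itemset_for(ORDERED, "E", 3) yields the itemset {B, D, E} with support 3.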

BUILDING INCREMENTAL INTERESTING MODEL
In this phase, association patterns are extracted from the frequent itemsets. The patterns are evaluated using the confidence measure, and those that do not satisfy this criterion are pruned, resulting in a set of strong association patterns. These are then subjected to the novelty criterion [11] to decide whether they are interesting. This phase takes into consideration the existing model Mi, which represents the known association rules, and results in the discovery of Mi+1. For each frequent itemset, only novel rules are extracted and used to update the model to Mi+1. We compute the novelty degree of a rule with the novelty measure (NM) presented in [10], as shown in equation 1, where S1 and S2 are two conjunct sets with cardinalities |S1| and |S2| respectively, and K denotes the pairs of compatible conjuncts between S1 and S2, indexed by i. The algorithm computes the novelty measure (NM) at every stage of rule generation to determine whether a rule is likely to lead to an interesting rule. A rule becomes a candidate for the next stage of rule generation if its NM value is 1 or if the relevance factor of the closest rule in M is less than the relevance factor threshold value. An interestingness value of 1 for a partial temporal rule indicates that this rule is unlikely to expand to any existing temporal association rule. The following steps are required to build the incremental interesting model; the algorithm is presented in Algorithm 3.
1. Generate association rules R from frequent item list.
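The confidence-based pruning that precedes the novelty check can be sketched as follows. The sketch enumerates every rule X -> Y whose items together form a given frequent itemset and keeps the rules whose confidence, support(X union Y) / support(X), reaches the threshold. TRANSACTIONS, support_count and strong_rules are hypothetical names, and the transactions are invented for illustration, so the surviving rules are not meant to reproduce Table 3.

```python
from itertools import combinations

# Hypothetical transactions (illustrative only, not Table 1).
TRANSACTIONS = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D", "F"},
    {"B", "D", "E", "G"},
    {"B", "D", "E"},
    {"A", "B", "C", "E", "G"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def strong_rules(itemset, transactions, min_confidence):
    """Enumerate rules X -> Y with X union Y = itemset and keep the strong ones."""
    itemset = frozenset(itemset)
    whole = support_count(itemset, transactions)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            x = frozenset(antecedent)
            x_count = support_count(x, transactions)
            if x_count == 0:
                continue  # antecedent never occurs, confidence is undefined
            confidence = whole / x_count
            if confidence >= min_confidence:
                rules.append((x, itemset - x, confidence))
    return rules
```

For example, strong_rules({"B", "D", "E"}, TRANSACTIONS, 0.6) drops the rule B -> D,E on these invented transactions, because B occurs in all six of them while the full itemset occurs in three, giving a confidence of 0.5.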

A DETAILED EXAMPLE
For a better understanding of our framework, consider a Big Data set D that arrives at time T1, denoted by D0. It contains 6 transactions, as shown in Table 1. Suppose D0 is partitioned into 3 parts for the sake of parallel mining, i.e., m = 3, each of which is called dPi, i = 1, 2, 3, where dPi ⊂ D0. Table 2 shows the data in every partition, which is sent to a different computer Pi. Each computer Pi in turn computes the support of its items using the Mapper and stores the counts in a local F-list. Local F-lists are generated in this way from P1, P2 and P3 respectively. Subsequently, the local F-lists are merged into a cumulative list called the F-list: F-list = {A=4, B=6, C=4, D=5, E=4, F=1, G=2}. With a minimum support of 50%, the items G and F are eliminated from the F-list. The F-list is then sorted by support in descending order: F-list = {B=6, D=5, A=4, C=4, E=4}. The final F-list is the reference for every Pi, where the local trees are constructed using Reduce. During the construction of local trees, the items in every transaction are sorted according to their position in the F-list, and items that are not in the F-list are ignored, as shown in the third column of Table 2. The ordered frequent items are used to construct local trees whose roots are set to null. The local trees are constructed using the FP-Growth algorithm on every computer Pi, as shown in Fig. 2. These local trees and the F-list, which are maintained in the memory of Pi by MapReduce, are the outcome of the first stage of the proposed framework. As FP-Growth uses a bottom-up strategy, the last item in the F-list is considered first, which is E in our example. All paths of E are examined in all local trees. Note that the item A is removed due to the minimum support criterion. Finally, the frequent items are generated, as the path of the item C is unique.
Similarly, the same process is executed for the remaining items, and all frequent items are merged together to form the outcome of this stage. In our example, the frequent items list is {[B,D,E:3], [A,B,C,D:3]}. The next stage is to generate association rules from the frequent itemsets generated in the previous stage. Table 3 shows the corresponding set of discovered association rules for the frequent items {B,D,E:3}, assuming a confidence threshold of 0.6. Notice that in Table 3 the rules R1, R3, and R7 are eliminated by the confidence criterion, which is set to 60%.
The remaining rules are subjected to the novelty measure, whose threshold is set to 50% in the example. Because the data arrives at time T1, no comparison is made against model M1, as no novel rules have been discovered before T1. The last two columns of Table 3 show the computation of the novelty degree of the rules; the rules that are not novel are eliminated as uninteresting. Subsequently, the model M1 is updated as shown in Fig. 3. Now suppose another dataset D1 arrives at time T2, as shown in Table 4. The same stages are repeated, taking into consideration the novel rules in the model M1. Table 5 shows the corresponding set of discovered association rules for the frequent items {B,C,E:2}, assuming a confidence threshold of 0.6. Notice that in Table 5 the rules {R13, R17, R18} are found to be novel, and hence the model M1 is updated incrementally to form model M2, as shown in Fig. 4.
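The incremental update from M1 to M2 can be sketched as a filter that admits a strong rule into the model only if it is sufficiently far from every rule already known. The sketch below is only an outline of that control flow. It does not implement the novelty measure of [10]: as a clearly labeled stand-in, the deviation between two rules is taken to be the mean Jaccard distance between their antecedents and between their consequents. All function names are hypothetical, and comparing each candidate against rules admitted earlier in the same increment is an assumption.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; defined as 0 when both sets are empty."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def deviation(rule, known):
    """Stand-in deviation (NOT the NM of [10]): mean Jaccard distance of
    the antecedents and of the consequents of the two rules."""
    return (jaccard_distance(rule[0], known[0])
            + jaccard_distance(rule[1], known[1])) / 2


def novelty(rule, model):
    """Distance to the closest known rule; 1.0 when the model is empty."""
    return min((deviation(rule, known) for known in model), default=1.0)


def update_model(model, candidate_rules, phi):
    """Return the model extended with the rules whose novelty is at least phi.

    Assumption: each candidate is also compared against rules admitted
    earlier in the same increment.
    """
    updated = list(model)
    for rule in candidate_rules:
        if novelty(rule, updated) >= phi:
            updated.append(rule)
    return updated


# Rules are (antecedent, consequent) pairs of frozensets; values are invented.
R_A = (frozenset({"E"}), frozenset({"B", "D"}))
R_B = (frozenset({"D", "E"}), frozenset({"B"}))
R_C = (frozenset({"C"}), frozenset({"B", "E"}))

M1 = update_model([], [R_A, R_B], phi=0.5)   # first increment, empty model
M2 = update_model(M1, [R_A, R_C], phi=0.5)   # R_A is already known at T2
```

In this run the rule R_A is admitted at the first increment and rejected at the second, because its distance to the closest known rule is then 0, while R_C is admitted as novel. Only the new increment's rules are examined at each step, which is the point of the incremental design.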

EXPERIMENTAL STUDIES
In this section, we present experimental results, in particular those related to the performance of the proposed framework. We conducted two experiments: the first is described in section 6.1 and the second in section 6.2. The proposed framework and the other algorithms are written in Java and implemented on Hadoop. All experiments were conducted on a PC with an Intel Core i5 2.6 GHz processor and 4 GB of main memory, running Microsoft Windows 10 64-bit. The experiments use real-life datasets available at http://kdd.ics.uci.edu. The datasets are considered as evolving with time and are divided into three increments, D1, D2, and D3, which are assumed to arrive at times T1, T2, and T3 respectively. Table 6 shows the characteristics of these datasets.

Table 6. Characteristics of the datasets
Dataset        Time   Transactions   Items   Avg. transaction length
Kosarak        T1     330001         31783   8
Kosarak        T2     330001         32098   8
Kosarak        T3     330000         32218   8
Accidents      T1     113395         395     33
Accidents      T2     113394         398     33
Accidents      T3     113394         385     33
T40I10D100K    T1     33334          942     39
T40I10D100K    T2     33333          941     39
T40I10D100K    T3     33333          942     39
T10I4D100K     T1     33334          868     10
T10I4D100K     T2     33333          870     10
T10I4D100K     T3     33333          869     10

FIRST EXPERIMENT
In this experiment, the performance of the proposed framework is compared to the FP-Growth algorithm, PFP-Growth, and PIFP. Since FP-Growth, PFP-Growth and PIFP discover similar numbers of rules, we perform the comparison against the FP-Growth algorithm only. The performance is measured in terms of the number of discovered rules for various thresholds of minimum support and confidence and a fixed novelty threshold Φ = 0.50. The dataset used is Kosarak; it is considered as evolving with time and is partitioned into three parts representing times T1, T2, and T3 respectively, as shown in Table 6. As Table 7 shows, the proposed framework discovers fewer rules than FP-Growth for all tested values of minimum support and confidence. Fig. 5, Fig. 6, and Fig. 7 show the reduction in the number of discovered rules on the Kosarak dataset at times T1, T2, and T3.

SECOND EXPERIMENT
The objective of the second experiment is to show the effectiveness of our framework in reducing the number of discovered rules compared to the PIFP algorithm. The number of discovered rules is expected to keep decreasing over time. We work with four datasets, which we consider as evolving with time and partition into 3 increments, D1, D2 and D3, mined at times T1, T2 and T3 respectively. For each dataset, the minimum support and minimum confidence are fixed and the novelty threshold Φ varies. We observe that the number of interesting rules decreases in our framework, in contrast to the number of rules discovered by the PIFP algorithm at T1, T2, and T3. Intuitively, the interesting rules discovered by our framework at time T1 are no longer interesting at time T2, and the interesting rules discovered at time T2 are no longer interesting at time T3. Moreover, as the value of the novelty threshold Φ increases, the number of discovered interesting rules decreases at each time, as expected. The results are presented in Table 8.

CONCLUSION AND FUTURE WORK
In this research, we proposed a framework for incremental parallel mining of interesting association rules for Big Data. The proposed approach incorporates an interestingness measure into the mining process. It builds a self-upgrading model that uses the novelty criterion to reflect user subjectivity and extracts patterns incrementally from datasets that arrive at different points in time. Our future work includes enhancing the framework to create an association system in which the model can adapt to a data stream environment.