INTEGRATED EFFECT OF DATA CLEANING AND SAMPLING ON DECISION TREE LEARNING OF LARGE DATA SETS

Dipak V. Patil; Rajankumar S. Bichkar

doi:10.47839/ijc.11.3.565

Keywords:

Large data sets, decision tree, data cleaning, data sampling, speedup, classification accuracy.

Abstract

The growing use of technology in all walks of life has produced tremendous growth in the data available for data mining, and the knowledge contained in that data can be used to improve decision-making. Such data usually contains some noise or outliers, which degrade the classification performance of a classifier built on it. Learning from a large data set is also slow, because the whole data set has to be processed serially. Earlier work has shown that random data reduction techniques can be used to build optimal decision trees. We therefore integrate data cleaning and data sampling to overcome the problems of handling large data sets. In the proposed technique, outliers are first filtered out to obtain a clean data set of improved quality; random sampling is then applied to this clean data set to obtain a reduced data set, which is used to construct an optimal decision tree. Experiments on several data sets show that the proposed technique builds decision trees with higher classification accuracy than trees built on the complete data set. The classification filter improves the quality of the data, and sampling reduces its size. As a result, the proposed method constructs more accurate, optimally sized decision trees and avoids problems such as overloading memory and the processor with large data sets. In addition, the time required to build a model on the clean data is significantly reduced, which yields a significant speedup.
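
The clean-then-sample pipeline described above can be sketched roughly as follows. The abstract does not give the details of the classification filter or of the tree learner, so this is only an illustration under stated assumptions: the filter is assumed to be a cross-validated one (an instance is dropped when a tree trained on the other folds misclassifies it), and a small Gini-based tree on numeric features stands in for whatever decision tree algorithm the authors actually used. All function names and parameters (`classification_filter`, `random_sample`, `clean_sample_build`, `folds`, `fraction`) are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, depth=0, max_depth=5):
    """Grow a small binary tree on numeric features (stand-in learner).

    A leaf is a class label; an internal node is the tuple
    (feature index, threshold, left subtree, right subtree).
    Class labels therefore must not themselves be tuples.
    """
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    best = None
    for f in range(len(rows[0])):
        # Dropping the largest value keeps both sides of every split non-empty.
        for t in sorted(set(r[f] for r in rows))[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # all rows identical, nothing left to split on
        return majority
    _, f, t = best
    lo = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    hi = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in lo], [y for _, y in lo], depth + 1, max_depth),
            build_tree([r for r, _ in hi], [y for _, y in hi], depth + 1, max_depth))


def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def classification_filter(rows, labels, folds=5, seed=0):
    """Assumed filter: drop instances misclassified by a tree trained on the other folds."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    keep = []
    for k in range(folds):
        held_out = set(idx[k::folds])
        train = [i for i in idx if i not in held_out]
        tree = build_tree([rows[i] for i in train], [labels[i] for i in train])
        keep += [i for i in held_out if predict(tree, rows[i]) == labels[i]]
    keep.sort()
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_sample(rows, labels, fraction, seed=0):
    """Simple random sample without replacement of the given fraction."""
    n = max(1, int(len(rows) * fraction))
    chosen = random.Random(seed).sample(range(len(rows)), n)
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


def clean_sample_build(rows, labels, fraction=0.5, seed=0):
    """Filter first, then sample the clean data, then build the tree."""
    clean_rows, clean_labels = classification_filter(rows, labels, seed=seed)
    s_rows, s_labels = random_sample(clean_rows, clean_labels, fraction, seed)
    return build_tree(s_rows, s_labels), len(clean_rows), len(s_rows)
```

Calling `clean_sample_build(rows, labels, fraction=0.5)` on a list of numeric feature vectors and their labels returns the tree together with the sizes of the clean and the reduced data sets. Comparing the accuracy and build time of this tree with one built by `build_tree` on the complete data would mirror the comparison the abstract reports, although the figures obtained from this sketch say nothing about the authors' results.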

International Journal of Computing
