EFFICIENT PREPROCESSING FOR WEB LOG COMPRESSION

Authors

  • Sebastian Deorowicz
  • Szymon Grabowski

DOI:

https://doi.org/10.47839/ijc.7.1.487

Keywords:

Web logs, text compression, table compression

Abstract

Web log files, which record user activity on a server, may grow by hundreds of megabytes a day, or even more, on popular sites. They are usually archived to enable further analysis, e.g., for detecting attacks or other server abuse patterns. In this work we present a specialized lossless preprocessor for Apache web logs and test it in combination with several popular general-purpose compressors. Our method works on the individual fields of the log data (each storing information such as the client’s IP, date/time, requested file or query, download size in bytes, etc.) and uses compression techniques such as finding and extracting common prefixes and suffixes, dictionary-based phrase sequence substitution, move-to-front coding, and more. The test results show that the proposed transform improves the average compression ratio by a factor of 2.70 for gzip and 1.86 for bzip2.
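
To make the field-wise idea concrete, here is a minimal Python sketch of three of the techniques named in the abstract: splitting a log line into fields, prefix extraction against the previous value, and move-to-front coding. It is an illustration under stated assumptions and not the authors' preprocessor: the Apache Common Log Format regular expression, the function names, and the choice to emit (index, literal) pairs are all hypothetical, and the abstract does not say which technique is applied to which field.

```python
import re
from typing import List, Optional, Tuple

# Assumed layout: Apache Common Log Format (host ident user [time] "request" status bytes).
CLF = re.compile(r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

def split_fields(line: str) -> Tuple[str, ...]:
    """Split one log line into its fields so each column can be coded separately."""
    m = CLF.fullmatch(line)
    if m is None:
        raise ValueError("line does not match the assumed format")
    return m.groups()

def prefix_encode(values: List[str]) -> List[Tuple[int, str]]:
    """Replace each value by (length of prefix shared with the previous value, rest)."""
    out, prev = [], ""
    for v in values:
        k = 0
        for x, y in zip(prev, v):
            if x != y:
                break
            k += 1
        out.append((k, v[k:]))
        prev = v
    return out

def prefix_decode(pairs: List[Tuple[int, str]]) -> List[str]:
    """Invert prefix_encode."""
    out, prev = [], ""
    for k, rest in pairs:
        prev = prev[:k] + rest
        out.append(prev)
    return out

def mtf_encode(values: List[str]) -> List[Tuple[int, Optional[str]]]:
    """Move-to-front: emit the position in a recency list, or (-1, literal) if unseen."""
    recent: List[str] = []
    out: List[Tuple[int, Optional[str]]] = []
    for v in values:
        if v in recent:
            i = recent.index(v)
            recent.pop(i)
            out.append((i, None))
        else:
            out.append((-1, v))
        recent.insert(0, v)  # the most recently seen value moves to the front
    return out

def mtf_decode(codes: List[Tuple[int, Optional[str]]]) -> List[str]:
    """Invert mtf_encode."""
    recent: List[str] = []
    out: List[str] = []
    for i, literal in codes:
        v = literal if i == -1 else recent.pop(i)
        recent.insert(0, v)
        out.append(v)
    return out
```

In a sketch like this, recently repeated values (for example, the same client IP in consecutive requests) turn into small indices, and consecutive values that share a leading part shrink to a short remainder. The transformed columns would then be passed to a general-purpose compressor such as gzip or bzip2.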

Published

2014-08-01

How to Cite

Deorowicz, S., & Grabowski, S. (2014). EFFICIENT PREPROCESSING FOR WEB LOG COMPRESSION. International Journal of Computing, 7(1), 35-42. https://doi.org/10.47839/ijc.7.1.487
