EFFICIENT PREPROCESSING FOR WEB LOG COMPRESSION
DOI:
https://doi.org/10.47839/ijc.7.1.487
Keywords:
Web logs, text compression, table compression
Abstract
Web log files, which record user activity on a server, may grow by hundreds of megabytes a day, or even more, on popular sites. They are usually archived, as this enables further analysis, e.g., for detecting attacks or other server abuse patterns. In this work we present a specialized lossless Apache web log preprocessor and test it in combination with several popular general-purpose compressors. Our method works on individual fields of log data (each storing information such as the client’s IP, date/time, requested file or query, download size in bytes, etc.) and uses compression techniques such as finding and extracting common prefixes and suffixes, dictionary-based phrase sequence substitution, move-to-front coding, and more. The test results show that the proposed transform improves the average compression ratios 2.70 times in the case of gzip and 1.86 times in the case of bzip2.
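To make the field-wise idea concrete, the following is a minimal Python sketch of one of the techniques named in the abstract, move-to-front coding, applied to a single column of log values. It is an illustration under stated assumptions, not the preprocessor described in the paper: the Common Log Format regular expression and the helper names `split_fields`, `mtf_encode`, and `mtf_decode` are hypothetical, and the paper's transform combines several further steps that are not shown here.

```python
import re

# Assumed layout (Common Log Format): host ident user [time] "request" status bytes.
# Real Apache logs may carry extra fields; this pattern is only for illustration.
CLF = re.compile(r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')


def split_fields(lines):
    """Turn log lines into per-field columns so each can be coded separately."""
    rows = [CLF.match(line).groups() for line in lines]
    return [list(col) for col in zip(*rows)]


def mtf_encode(values):
    """Move-to-front coding over whole field values (strings).

    A value seen before is replaced by its position in a recency list
    (0 = most recently used); a new value is emitted as a literal string.
    Either way the value is then moved to the front of the list.
    """
    recent, out = [], []
    for v in values:
        if v in recent:
            i = recent.index(v)
            out.append(i)
            recent.pop(i)
        else:
            out.append(v)
        recent.insert(0, v)
    return out


def mtf_decode(codes):
    """Invert mtf_encode: integers are recency positions, strings are literals."""
    recent, out = [], []
    for c in codes:
        v = recent.pop(c) if isinstance(c, int) else c
        recent.insert(0, v)
        out.append(v)
    return out
```

Applied to a host column such as `['10.0.0.1', '10.0.0.2', '10.0.0.1']`, `mtf_encode` returns `['10.0.0.1', '10.0.0.2', 1]`: repeated clients turn into small integers, which a general-purpose compressor such as gzip or bzip2 can typically code more compactly than the repeated strings.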
License
International Journal of Computing is an open access journal. Authors who publish with this journal agree to the following terms:
• Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
• Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
• Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.