A METHOD FOR AUTOMATIC TEXT SUMMARIZATION BASED ON RHETORICAL ANALYSIS AND TOPIC MODELING
Keywords: natural language processing, automatic summarization, rhetorical structure theory, discourse markers, additive regularization, topic modeling.
Abstract: This article describes an original method for automatic summarization of scientific and technical texts based on rhetorical analysis and topic modeling. The proposed method combines a linguistic knowledge base with machine learning. Key terms are detected by topic modeling: first, unigram topic models containing only one-word terms are constructed; these models are then extended with multiword terms. The most significant fragments of the source document are identified during rhetorical analysis with the help of discourse markers. When the importance of text fragments is evaluated, keywords, multiword terms, and the scientific lexicon characteristic of scientific and technical texts are also taken into account. A linguistic knowledge base has been created to store information about the markers and the scientific lexicon. The experiments showed that the method is effective, needs a comparatively small amount of training data, and can be adapted to texts in other subject fields and other languages.
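The fragment-scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the marker list, the lexicon, the weights, and the function names `score_sentence` and `summarize` are all hypothetical, and the key terms are passed in as a plain dictionary rather than taken from a trained topic model. The sketch only shows how discourse markers, scientific lexicon, and key terms might be combined into one importance score per sentence.

```python
import re

# Hypothetical resources, invented for illustration only.
# Positive weights mark fragments likely to carry main content,
# negative weights mark fragments likely to be secondary.
DISCOURSE_MARKERS = {"in conclusion": 2.0, "we propose": 2.0, "for example": -1.0}
SCIENTIFIC_LEXICON = {"method", "experiment", "results"}
LEXICON_WEIGHT = 0.5  # assumed weight per scientific-lexicon word


def score_sentence(sentence, key_terms):
    """Combine marker, lexicon, and key-term evidence into one score.

    key_terms maps a (possibly multiword) term to a weight; in the
    described method such terms would come from topic modeling.
    """
    text = sentence.lower()
    tokens = re.findall(r"[a-z]+", text)
    marker_score = sum(w for m, w in DISCOURSE_MARKERS.items() if m in text)
    lexicon_score = LEXICON_WEIGHT * sum(1 for t in tokens if t in SCIENTIFIC_LEXICON)
    term_score = sum(w for term, w in key_terms.items() if term in text)
    return marker_score + lexicon_score + term_score


def summarize(sentences, key_terms, n=2):
    """Return the n highest-scoring sentences in their original order."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_sentence(sentences[i], key_terms),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:n])]
```

Called with a list of sentences and a dictionary such as `{"topic modeling": 1.5}`, `summarize` keeps the sentences with the strongest combined evidence. A real system would replace the substring matching with proper rhetorical analysis and draw term weights from the topic model.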
License: International Journal of Computing is an open access journal. Authors who publish with this journal agree to the following terms:
• Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
• Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
• Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.