
A METHOD FOR AUTOMATIC TEXT SUMMARIZATION BASED ON RHETORICAL ANALYSIS AND TOPIC MODELING

Tatiana Batura, Aigerim Bakiyeva, Maria Charintseva

Abstract


This article describes an original method for the automatic summarization of scientific and technical texts that combines rhetorical analysis with topic modeling. The proposed method uses both a linguistic knowledge base and machine learning. Key terms are detected by topic modeling: first, unigram topic models containing only one-word terms are constructed; these models are then extended with multiword terms. The most significant fragments of the original document are identified during rhetorical analysis with the help of discourse markers. The importance of each text fragment is evaluated taking into account keywords, multiword terms, and the scientific lexicon characteristic of scientific and technical texts. A linguistic knowledge base was created to store information about the markers and the scientific lexicon. The experiments showed that the method is effective, needs a comparatively small amount of training data, and can be adapted to texts from different subject fields and in other languages.
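
The fragment-scoring idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the marker list, the lexicon, the weights, and the function names (`split_sentences`, `score_sentence`, `summarize`) are all hypothetical, and the key terms are passed in by hand rather than produced by a topic model. The sketch only shows how discourse markers, key terms, and scientific lexicon might jointly rank sentences for an extractive summary.

```python
import re

# Hypothetical resources standing in for the linguistic knowledge base;
# the entries and weights are illustrative assumptions, not the paper's data.
DISCOURSE_MARKERS = {"in conclusion": 2.0, "we propose": 2.0, "the results show": 1.5}
SCIENTIFIC_LEXICON = {"method", "experiment", "evaluation"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentence(sentence, key_terms):
    """Sum marker weights, key-term hits (1.0 each) and lexicon hits (0.5 each)."""
    lowered = sentence.lower()
    tokens = re.findall(r"[a-z]+", lowered)
    marker_score = sum(w for m, w in DISCOURSE_MARKERS.items() if m in lowered)
    # key_terms may mix one-word and multiword terms; both are matched as substrings.
    term_score = sum(1.0 for t in key_terms if t in lowered)
    lexicon_score = 0.5 * sum(1 for tok in tokens if tok in SCIENTIFIC_LEXICON)
    return marker_score + term_score + lexicon_score


def summarize(text, key_terms, k=2):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_sentence(sentences[i], key_terms),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]


if __name__ == "__main__":
    sample = (
        "Summarization is hard. "
        "We propose a method based on topic modeling. "
        "The weather was nice. "
        "The results show that the method works in the experiment."
    )
    for sentence in summarize(sample, {"topic modeling", "summarization"}, k=2):
        print(sentence)
```

In a fuller pipeline, the hand-written `key_terms` set would presumably be replaced by terms extracted from a trained topic model, and the marker dictionary by entries from the knowledge base.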

Keywords


natural language processing; automatic summarization; rhetorical structure theory; discourse markers; additive regularization; topic modeling.
