Ru-RSTreebank. Russian Discourse Corpus.

Ru-RSTreebank is a corpus of Russian texts annotated within the framework of Rhetorical Structure Theory, which was developed in the 1980s by W. Mann and S. Thompson.

Purpose

The corpus is intended for researchers interested in studying written discourse. It supports a range of experiments in automatic text analysis that make use of its data on discourse relations.

Possible applications include text generation, fact extraction, automatic summarization, and anaphora resolution.

Contents

Corpus volume: 333 texts, about 328 000 tokens.

Genres: news, popular science, scientific articles, and blogs.

News

December 2019: added texts from social media (blogs) to the corpus.
May 2019: launched the Ru-RSTreebank website.
August 2022: updated the markup for news and blogs; fixed structural errors and unreadable files, and improved internal markup consistency.

Citation

When quoting or mentioning project materials, please cite one of the following:

A: Pisarevskaya D. et al. Towards building a discourse-annotated corpus of Russian // Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2017”. – 2017. – P. 194-204.
B (rhetorical parser description): Chistova E. et al. RST Discourse Parser for Russian: An Experimental Study of Deep Learning Models // Proceedings of Analysis of Images, Social Networks and Texts (AIST). – 2020. – P. 105-119.