Models

Here you can learn about the trained models for cause-effect connectives mining.

Vectors we used: Download

To train the Word2Vec models, we used a stemmed corpus of Russian mass-media texts (about 360 000 000 words). During preprocessing, we removed only a small number of stop-words: particles and some very frequent conjunctions. Punctuation marks and the remaining stop-words were not removed from the articles before training, because punctuation and function words are important for discourse markers.
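To make this step concrete, here is a minimal preprocessing sketch. The stop-word list in it is a small hypothetical subset chosen for illustration, not the list actually used; the point is only that a few particles/conjunctions are dropped while punctuation is kept as separate tokens.

```python
import re

# Hypothetical, illustrative stop-word list (particles and very frequent
# conjunctions); the real list used for the released models is not shown here.
STOPWORDS = {"zhe", "li", "by", "i", "a"}

def tokenize(sentence: str) -> list[str]:
    """Split a stemmed sentence into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def preprocess(sentence: str) -> list[str]:
    """Drop only the small stop-word list; punctuation tokens are kept."""
    return [tok for tok in tokenize(sentence) if tok.lower() not in STOPWORDS]

print(preprocess("On zhe opozdal, potomu chto avtobus slomalsja."))
# -> ['On', 'opozdal', ',', 'potomu', 'chto', 'avtobus', 'slomalsja', '.']
```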

We provide two Word2Vec models that differ in their preprocessing. Both are CBOW models trained with the Gensim Python package for 10 epochs.
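A minimal training sketch with Gensim (4.x parameter names) is shown below. Only the CBOW architecture and the 10 epochs come from the description above; the vector size, context window, and frequency cutoff are illustrative assumptions, not the settings of the released models.

```python
from gensim.models import Word2Vec

# `corpus` is an iterable of token lists produced by the preprocessing above.
corpus = [
    ["on", "opozdal", ",", "potomu_chto", "avtobus", "slomalsja", "."],
    # ... the full stemmed mass-media corpus goes here
]

model = Word2Vec(
    sentences=corpus,
    sg=0,             # sg=0 selects the CBOW architecture
    epochs=10,        # 10 training epochs, as described above
    vector_size=300,  # illustrative dimensionality
    window=5,         # illustrative context window
    min_count=5,      # illustrative frequency cutoff
)
model.save("cause_effect_cbow.w2v")
```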

For the first model, before training, we combined the lemmas of each cause-effect connective from our initial cause-effect list into one token (e.g. potomu_chto). For the second model, in addition to concatenating multi-word connectives as in the first model, we took a further step: we also concatenated into multi-word tokens all 3-grams that match the patterns enumerated below (a sketch of the concatenation step follows the list).

  1. the 3-gram occurs at the beginning of a sentence, after a dash, or after a semicolon, and begins with a lemma of eto ’this’;  
  2. the 3-gram occurs after a comma and begins with a lemma of chto ’what’;  
  3. the 3-gram contains a lemma of to ’that’, followed by a comma, which is in turn followed by a lemma of chto ’what’ or kak ’lit. as’.
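Below is a minimal sketch of the concatenation step, using a small illustrative subset of connectives rather than the full cause-effect list. The 3-gram patterns for the second model can be handled the same way, with the membership check replaced by the positional conditions listed above.

```python
# Illustrative subset of multi-word connectives (lemma sequences).
CONNECTIVES = [
    ["potomu", "chto"],   # potomu chto 'because'
    ["tak", "kak"],       # tak kak 'since'
    ["v", "rezultate"],   # v rezul'tate 'as a result'
]

def merge_connectives(tokens: list[str]) -> list[str]:
    """Replace every multi-word connective with one underscore-joined token."""
    out, i = [], 0
    while i < len(tokens):
        for conn in CONNECTIVES:
            if tokens[i:i + len(conn)] == conn:
                out.append("_".join(conn))
                i += len(conn)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_connectives(["on", "opozdal", ",", "potomu", "chto", "avtobus", "slomalsja"]))
# -> ['on', 'opozdal', ',', 'potomu_chto', 'avtobus', 'slomalsja']
```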

After that, we trained both models on the resulting corpus.