Models
Here you can learn about the trained models for cause-effect connective mining.
Vectors we used: Download
For Word2Vec model training, we used a corpus of stemmed Russian mass-media texts (about 360 000 000 words). During preprocessing, we removed only a small number of stop-words: particles and some very frequent conjunctions. Punctuation marks and other stop-words were not removed from the articles before training, because punctuation and function words are important for detecting discourse markers.
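A minimal sketch of this light filtering step, assuming tokenized, stemmed input; the stop-word entries below are purely illustrative, since the actual list of particles and conjunctions is not given here:

```python
# Hypothetical stop-word set; the real list used for the corpus is not
# reproduced in this description.
STOPWORDS = {"zhe", "li", "by"}

def filter_tokens(tokens):
    # Drop only the small stop-word set; punctuation and other function
    # words are kept, since discourse markers depend on them.
    return [t for t in tokens if t not in STOPWORDS]

print(filter_tokens(["on", "zhe", "skazal", ",", "chto", "..."]))
# ['on', 'skazal', ',', 'chto', '...']
```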
We provide two Word2Vec models that differ in preprocessing. Both are CBOW models trained with the Gensim Python package for 10 epochs.
For the first model, before training, we joined the lemmas of each cause-effect connective from our initial cause-effect list into a single token (e.g., potomu_chto). For the second model, in addition to concatenating multi-word connectives as in the first model, we took one further step: we concatenated all 3-grams matching the patterns enumerated below into multi-word tokens.
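A sketch of the connective-concatenation step, assuming the connective list is stored as plain space-separated lemma sequences; "potomu chto" comes from the text above, while the second entry and the helper itself are illustrative:

```python
import re

# Illustrative connective list; the real initial cause-effect list is
# provided separately.
CONNECTIVES = {"potomu chto", "iz-za togo chto"}

def merge_connectives(text, connectives):
    # Replace each multi-word connective with a single underscore-joined
    # token; longest connectives first, so sub-phrases don't match early.
    for conn in sorted(connectives, key=len, reverse=True):
        merged = conn.replace(" ", "_")
        text = re.sub(r"\b" + re.escape(conn) + r"\b", merged, text)
    return text

print(merge_connectives("eto sluchilos potomu chto on ushel", CONNECTIVES))
# 'eto sluchilos potomu_chto on ushel'
```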
After that, we trained both models on this corpus.
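A minimal training sketch matching the setup described above (CBOW, 10 epochs), assuming Gensim 4.x, where the epoch count is passed as epochs and sg=0 selects CBOW. The vector size, window, and minimum count are not stated in the text, so the values below are placeholders, as is the output file name:

```python
from gensim.models import Word2Vec

# One tokenized, stemmed sentence per list; connectives already merged
# into single tokens during preprocessing.
sentences = [
    ["eto", "sluchilos", "potomu_chto", "on", "ushel", "."],
    # ... the rest of the corpus
]

model = Word2Vec(
    sentences,
    sg=0,             # CBOW architecture, as stated above
    epochs=10,        # 10 training epochs, as stated above
    vector_size=300,  # placeholder: dimensionality not given in the text
    window=5,         # placeholder: context window not given in the text
    min_count=1,      # placeholder: frequency cutoff not given in the text
)
model.save("w2v_cause_effect.model")  # hypothetical file name
```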