Dang, A., Moh’d, A., Islam, A., Minghim, R., Smit, M., & Milios, E., Reddit Temporal N-gram Corpus and its Applications on Paraphrase and Semantic Similarity in Social Media using a Topic-based Latent Semantic Analysis, in Proceedings of COLING ’16, Osaka, Japan (to appear)
This paper introduces a new large-scale n-gram corpus that is created specifically from social media
text. Two distinguishing characteristics of this corpus are its monthly temporal attribute and that it is
created from 1.65 billion comments of user-generated text in Reddit. The usefulness of this corpus is
exemplified and evaluated by a novel Topic-based Latent Semantic Analysis (TLSA) algorithm. The
experimental results show that unsupervised TLSA outperforms all state-of-the-art unsupervised
and semi-supervised methods in the SemEval-2015 task on paraphrase and semantic similarity in Twitter.
Reddit temporal n-gram corpus
Because the corpus is very large, we are applying for Compute Canada resources so that queries against it can return results in a reasonable time. We will provide a link when it is ready.
Reddit Comment Extraction: GitHub
Reddit N-gram Tokenizer: GitHub
Reddit Big Query Upload: GitHub
Topic-based LSA (TLSA): GitHub
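
To make the "monthly temporal attribute" of the corpus concrete, here is a minimal sketch of counting n-grams per calendar month. It is not the code from the repositories listed above. The whitespace tokenizer, the function names, and the input shape (pairs of a Unix timestamp and a comment body, loosely modelled on Reddit's `created_utc` and `body` fields) are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    # The actual Reddit N-gram Tokenizer may behave differently.
    return text.lower().split()


def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return zip(*(tokens[i:] for i in range(n)))


def monthly_ngram_counts(comments, n=2):
    # comments: iterable of (created_utc, body) pairs (assumed input shape).
    # Returns {"YYYY-MM": Counter({ngram_string: count})}.
    counts = defaultdict(Counter)
    for created_utc, body in comments:
        month = datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m")
        counts[month].update(" ".join(gram) for gram in ngrams(tokenize(body), n))
    return counts


if __name__ == "__main__":
    sample = [
        (1420070400, "the cat sat"),   # 2015-01-01 UTC
        (1422748800, "the cat ran"),   # 2015-02-01 UTC
    ]
    for month, counter in sorted(monthly_ngram_counts(sample).items()):
        print(month, dict(counter))
```

Keying the counts by month keeps each period's frequencies separate, so the usage of an n-gram can be compared across months.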