About

AffectiveTweets is a WEKA package for analyzing the emotion and sentiment expressed in tweets. The source code is hosted on GitHub.

The package implements WEKA filters for calculating state-of-the-art affective analysis features from tweets that can be fed into machine learning algorithms. Many of these features were drawn from the NRC-Canada System. It also implements methods for building affective lexicons and distant supervision methods for training affective models from unlabelled tweets.

Descriptions of the filters, installation instructions, and examples are given below.

Official Baseline System

The package was made available as the official baseline system for the WASSA-2017 Shared Task on Emotion Intensity (EmoInt) and for SemEval-2018 Task 1: Affect in Tweets.

Five participating teams used AffectiveTweets in WASSA-2017 to generate feature vectors, including the teams that eventually ranked first, second, and third. For SemEval-2018, the package was used by 15 teams.

Relevant Papers

The most relevant papers on which this package is based are:

Citation

Please cite the following paper if using this package in an academic publication:

You are also welcome to cite a previous publication describing the package:

  • S. M. Mohammad and F. Bravo-Marquez, Emotion Intensities in Tweets. In Proceedings of the Sixth Joint Conference on Lexical and Computational Semantics (*SEM 2017), August 2017, Vancouver, Canada. (pdf)

You should also cite the papers describing any of the lexicons or resources you are using with this package.

  • Here is the BibTeX entry for the package along with the entries for the resources listed below.

  • Here is the BibTeX entry just for the package.

The individual references for each resource can be found through the links provided below.

Filters

Tweet-level Filters

  1. TweetToSparseFeatureVector: calculates sparse features, such as word and character n-grams, from tweets. There are parameters for filtering out infrequent features (e.g., n-grams occurring in fewer than m tweets) and for setting the weighting approach (Boolean or frequency-based). A Java usage sketch is given after this list.

    • Word n-grams: extracts word n-grams from n=1 to a maximum value.
      • Negations: adds a prefix to words occurring in negated contexts, e.g., I don't like you => I don't NEG-like NEG-you. The prefixes affect only word n-gram features. The scope of a negation ends at the next punctuation mark matching the regular expression [\.|,|:|;|!|\?]+.
    • Character n-grams: calculates character n-grams.
    • POS tags: tags tweets using the CMU Tweet NLP tool and creates a vector space model based on the sequence of POS tags. BibTeX
    • Brown clusters: maps the words in a tweet to Brown word clusters and creates a low-dimensional vector space model. It can be used with n-grams of word clusters. The word clusters are also taken from the CMU Tweet NLP tool.
  2. TweetToLexiconFeatureVector: calculates features from a tweet using several lexicons.

    • MPQA: counts the number of positive and negative words from the MPQA subjectivity lexicon. BibTeX
    • Bing Liu: counts the number of positive and negative words from the Bing Liu lexicon. BibTeX
    • AFINN: calculates positive and negative variables by aggregating the positive and negative word scores provided by this lexicon. BibTeX
    • Sentiment140: calculates positive and negative variables by aggregating the positive and negative word scores provided by this lexicon, which was created from tweets annotated with emoticons. BibTeX
    • NRC Hashtag Sentiment lexicon: calculates positive and negative variables by aggregating the positive and negative word scores provided by this lexicon, which was created from tweets annotated with emotional hashtags. BibTeX
    • NRC Word-Emotion Association Lexicon: counts the number of words matching each emotion from this lexicon. BibTeX
    • NRC-10 Expanded: adds the emotion associations of the words matching the Twitter-specific expansion of the NRC Word-Emotion Association Lexicon. BibTeX
    • NRC Hashtag Emotion Association Lexicon: adds the emotion associations of the words matching this lexicon. BibTeX
    • SentiWordNet: calculates positive and negative scores using SentiWordNet. For a word occurring in multiple synsets, a weighted average of the synsets' sentiment distributions is computed, with weights equal to the reciprocal ranks of the word's senses so that more frequent senses receive higher weights (see the formula after this list). BibTeX
    • Emoticons: calculates a positive and a negative score by aggregating the word associations provided by a list of emoticons. The list is taken from the AFINN project.
    • Negations: counts the number of negating words in the tweet.
  3. TweetToInputLexiconFeatureVector: calculates features from a tweet using a given list of affective lexicons, where each lexicon is represented as an ARFF file. The features are calculated by adding or counting the affective associations of the words matching the given lexicons: numeric scores are added, and nominal associations are counted. All numeric and nominal attributes from each lexicon are considered. The NRC-Affect-Intensity lexicon is used by default. BibTeX

  4. TweetToSentiStrengthFeatureVector: calculates positive and negative sentiment strengths for a tweet using SentiStrength. Disclaimer: SentiStrength can only be used for academic purposes from within this package. BibTeX

  5. TweetToEmbeddingsFeatureVector: calculates a tweet-level feature representation using pre-trained word embeddings. A dummy embedding of zeroes is used for words with no corresponding embedding. The tweet vectors can be calculated using the following schemes:

    • Average word embeddings.
    • Add word embeddings.
    • Concatenation of the first k embeddings. Dummy values are added if the tweet has fewer than k words.
  6. TweetNLPPOSTagger: runs the Twitter-specific POS tagger from the CMU TweetNLP library on the given tweets. POS tags are prepended to the tokens.
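
All of these tweet-level filters plug into WEKA's standard filtering API. Below is a minimal Java sketch for TweetToSparseFeatureVector; the dataset path is hypothetical, and the filter's package path (weka.filters.unsupervised.attribute) is an assumption based on recent releases, so check the Java documentation for the exact class locations and options.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
// Package path assumed; see the Java documentation for the exact location.
import weka.filters.unsupervised.attribute.TweetToSparseFeatureVector;

public class SparseFeaturesDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical path: any ARFF file with a string attribute holding the tweets.
        Instances tweets = DataSource.read("data/tweets.arff");
        tweets.setClassIndex(tweets.numAttributes() - 1);

        TweetToSparseFeatureVector filter = new TweetToSparseFeatureVector();
        filter.setInputFormat(tweets);

        // Adds the sparse n-gram attributes to the dataset.
        Instances vectors = Filter.useFilter(tweets, filter);
        System.out.println("Attributes after filtering: " + vectors.numAttributes());
    }
}
```

The other tweet-level filters (TweetToLexiconFeatureVector, TweetToEmbeddingsFeatureVector, and so on) follow the same setInputFormat/useFilter pattern.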
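For the SentiWordNet feature above, one way to write the reciprocal-rank weighted average is the following (our notation, not taken from the package: s_i is the sentiment score of the word's i-th ranked sense among its n senses):

```latex
\mathrm{score}(w) = \frac{\sum_{i=1}^{n} \frac{1}{i}\, s_i}{\sum_{i=1}^{n} \frac{1}{i}}
```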

Word-level Filters

  1. PMILexiconExpander: calculates the Pointwise Mutual Information (PMI) semantic orientation for each word in a corpus of tweets annotated with sentiment labels. The score is calculated by subtracting the PMI of the target word with the negative class from its PMI with the positive class (see the formula after this list). This is a supervised filter. BibTeX

  2. TweetCentroid: calculates word distributional vectors from a corpus of unlabelled tweets by representing each word as the centroid of the vectors of the tweets in which it appears. The vectors can be labelled using an affective lexicon to train a word-level affective classifier, which can in turn be used to expand the original lexicon. BibTeX, original paper

  3. LabelWordVectors: labels word vectors with an input lexicon in ARFF format. This filter is useful for training word-level affective classifiers.
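
The semantic orientation computed by PMILexiconExpander (first item above) can be written as follows, where the probabilities are estimated from the sentiment-annotated corpus; this is the standard PMI-based semantic orientation formulation, written out here for reference:

```latex
SO(w) = \mathrm{PMI}(w, \mathit{positive}) - \mathrm{PMI}(w, \mathit{negative}),
\qquad
\mathrm{PMI}(w, c) = \log_2 \frac{p(w, c)}{p(w)\, p(c)}
```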

Distant Supervision Filters

  1. ASA: Annotate-Sample-Average (ASA) is a lexicon-based distant supervision method for training polarity classifiers on Twitter in the absence of labelled data. It takes a collection of unlabelled tweets and a polarity lexicon in ARFF format and creates synthetic labelled instances. Each labelled instance is created by sampling, with replacement, a number of tweets containing at least one lexicon word with the desired polarity and averaging the feature vectors of the sampled tweets (a sketch of this sampling step is given after this list). BibTeX, original paper

  2. PTCM: The Partitioned Tweet Centroid Model (PTCM) is an adaptation of the TweetCentroid model for distant supervision. Since tweets and words are represented by the same feature vectors, a word-level classifier trained from a polarity lexicon and a corpus of unlabelled tweets can be used to classify the sentiment of tweets represented by sparse feature vectors; in other words, the labelled word vectors act as lexicon-annotated training data for message-level polarity classification. The model includes a simple modification to the tweet centroid model for increasing the number of labelled instances: the tweets associated with each word are partitioned into smaller disjoint subsets of a fixed size, and one centroid is calculated per partition and labelled according to the lexicon. BibTeX, original paper

  3. LexiconDistantSupervision: the most popular distant supervision approach for Twitter sentiment analysis. It takes a collection of unlabelled tweets and a polarity lexicon of positive and negative tokens in ARFF format. If a word from the lexicon is found in a tweet, the tweet is labelled with that word's polarity; tweets containing both positive and negative words are discarded. The word used for labelling can optionally be removed from the content. Emoticons are used as the default lexicon. original paper
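
As a rough illustration of the ASA sampling step described above, here is a self-contained Java sketch over dense feature vectors. The class and method names are ours for illustration and do not correspond to the filter's actual API:

```java
import java.util.List;
import java.util.Random;

public class AsaSketch {

    // Creates one synthetic labelled instance by sampling, with replacement,
    // sampleSize tweet vectors that contain at least one lexicon word of the
    // desired polarity, and averaging their feature vectors.
    static double[] syntheticInstance(List<double[]> polarityTweets,
                                      int sampleSize, Random rng) {
        int dim = polarityTweets.get(0).length;
        double[] centroid = new double[dim];
        for (int s = 0; s < sampleSize; s++) {
            double[] tweet = polarityTweets.get(rng.nextInt(polarityTweets.size()));
            for (int j = 0; j < dim; j++) {
                centroid[j] += tweet[j] / sampleSize;
            }
        }
        return centroid; // labelled with the polarity used to select the tweets
    }

    public static void main(String[] args) {
        // Toy vectors standing in for tweets containing a positive lexicon word.
        List<double[]> positiveTweets = List.of(
                new double[]{1, 0, 1},
                new double[]{0, 1, 1});
        double[] instance = syntheticInstance(positiveTweets, 2, new Random(42));
        System.out.println(java.util.Arrays.toString(instance));
    }
}
```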

Tokenizers

  1. TweetNLPTokenizer: a Twitter-specific String tokenizer based on the CMU Tweet NLP tool that can be used with the existing StringToWordVector WEKA filter (see the sketch below).
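
A minimal sketch of plugging the tokenizer into WEKA's StringToWordVector filter; the tokenizer's package path and the dataset path are assumptions, so check the Java documentation for the exact locations:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.TweetNLPTokenizer; // package path assumed
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TokenizerDemo {
    public static void main(String[] args) throws Exception {
        Instances tweets = DataSource.read("data/tweets.arff"); // hypothetical path
        tweets.setClassIndex(tweets.numAttributes() - 1);

        StringToWordVector bow = new StringToWordVector();
        bow.setTokenizer(new TweetNLPTokenizer()); // Twitter-aware tokenization
        bow.setInputFormat(tweets);

        Instances bagOfWords = Filter.useFilter(tweets, bow);
        System.out.println("Vocabulary size: " + (bagOfWords.numAttributes() - 1));
    }
}
```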

Other Resources

  1. Datasets: The package provides some tweets annotated with affective values in gzipped ARFF format in $WEKA_HOME/packages/AffectiveTweets/data/. The default location for $WEKA_HOME is $HOME/wekafiles.
  2. Affective Lexicons: The package provides affective lexicons in ARFF format. These lexicons are located in $WEKA_HOME/packages/AffectiveTweets/lexicons/arff_lexicons/ and can be used with the TweetToInputLexiconFeatureVector filter.

  3. Pre-trained Word Embeddings: The package provides a file with pre-trained word vectors trained with the Word2Vec tool in gzip-compressed format. It is a tab-separated file, with the word in the last column, located at $WEKA_HOME/packages/AffectiveTweets/resources/w2v.twitter.edinburgh.100d.csv.gz. However, this is a toy example trained from a small collection of tweets. We recommend downloading w2v.twitter.edinburgh10M.400d.csv.gz, which provides embeddings trained from 10 million tweets taken from the Edinburgh corpus. The parameters were calibrated for classifying words into emotions. More information is given in this paper.

Documentation

The Java documentation is available here.

Team

Main Developer

Contributors

Contact

  • Email: fbravo at dcc.uchile.cl
  • If you have questions about WEKA, please refer to the WEKA mailing list.