The package implements WEKA filters for calculating state-of-the-art affective analysis features from tweets that can be fed into machine learning algorithms. Many of these features were drawn from the NRC-Canada System. It also implements methods for building affective lexicons and distant supervision methods for training affective models from unlabelled tweets.
Descriptions of the filters, installation instructions, and examples are given below.
Official Baseline System
Five participating teams used AffectiveTweets in WASSA-2017 to generate feature vectors, including the teams that eventually ranked first, second, and third. For SemEval-2018, the package was used by 15 teams.
The most relevant papers on which this package is based are:
- Sentiment Analysis of Short Informal Texts. Svetlana Kiritchenko, Xiaodan Zhu and Saif Mohammad. Journal of Artificial Intelligence Research, volume 50, pages 723-762, August 2014. BibTeX
- Meta-Level Sentiment Models for Big Social Data Analysis. F. Bravo-Marquez, M. Mendoza and B. Poblete. Knowledge-Based Systems Volume 69, October 2014, Pages 86–99. BibTex
- Stance and sentiment in tweets. Saif M. Mohammad, Parinaz Sobhani, and Svetlana Kiritchenko. 2017. Special Section of the ACM Transactions on Internet Technology on Argumentation in Social Media 17(3). BibTeX
- Sentiment strength detection for the social Web. Thelwall, M., Buckley, K., & Paltoglou, G. (2012). Journal of the American Society for Information Science and Technology, 63(1), 163-173. BibTex
Please cite the following paper if using this package in an academic publication:
- F. Bravo-Marquez, E. Frank, B. Pfahringer, and S. M. Mohammad. AffectiveTweets: a WEKA Package for Analyzing Affect in Tweets. Journal of Machine Learning Research, Volume 20(92), pages 1-6, 2019. (pdf)
You are also welcome to cite a previous publication describing the package:
- S. M. Mohammad and F. Bravo-Marquez. Emotion Intensities in Tweets. In *Sem '17: Proceedings of the Sixth Joint Conference on Lexical and Computational Semantics (*Sem), August 2017, Vancouver, Canada. (pdf)
You should also cite the papers describing any of the lexicons or resources you are using with this package.
Here is the BibTex entry for the package along with the entries for the resources listed below.
Here is the BibTex entry just for the package.
The individual references for each resource can be found through the links provided below.
TweetToSparseFeatureVector: calculates sparse features, such as word and character n-grams, from tweets. There are parameters for filtering out infrequent features (e.g., n-grams occurring in fewer than m tweets) and for setting the weighting approach (boolean or frequency-based).
- Word n-grams: extracts word n-grams from n=1 to a maximum value.
- Negations: adds a prefix to words occurring in negated contexts, e.g., I don't like you => I don't NEG-like NEG-you. The prefixes only affect word n-gram features. The scope of negation ends at the next punctuation expression ([\.|,|:|;|!|\?]+).
- Character n-grams: calculates character n-grams.
- POS tags: tags tweets using the CMU Tweet NLP tool, and creates a vector space model based on the sequence of POS tags. BibTex
- Brown clusters: maps the words in a tweet to Brown word clusters and creates a low-dimensional vector space model. It can be used with n-grams of word clusters. The word clusters are also taken from the CMU Tweet NLP tool.
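The negation marking described in the list above can be sketched as follows. This is a minimal illustration, not the filter's actual code: the class name `NegationMarker`, the small negation word list, and the assumption that the tweet is already tokenized with punctuation as separate tokens are all hypothetical. Only the `NEG-` prefix and the punctuation expression come from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

final class NegationMarker {
    // Hypothetical negation word list, for illustration only.
    private static final Set<String> NEGATORS =
            Set.of("not", "no", "never", "don't", "can't", "won't");

    // Punctuation expression quoted in the description; a matching token closes the scope.
    private static final Pattern SCOPE_END = Pattern.compile("[\\.|,|:|;|!|\\?]+");

    /** Prefixes every token inside a negated context with "NEG-". */
    static List<String> mark(List<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean negated = false;
        for (String tok : tokens) {
            if (SCOPE_END.matcher(tok).matches()) {
                negated = false;              // punctuation ends the negated context
                out.add(tok);
            } else if (NEGATORS.contains(tok.toLowerCase())) {
                negated = true;               // the negating word itself is left unchanged
                out.add(tok);
            } else {
                out.add(negated ? "NEG-" + tok : tok);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(mark(List.of("I", "don't", "like", "you", ",", "ok")));
    }
}
```

The marked tokens would then feed only the word n-gram extraction, leaving the other feature types untouched.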
TweetToLexiconFeatureVector: calculates features from a tweet using several lexicons.
- MPQA: counts the number of positive and negative words from the MPQA subjectivity lexicon. BibTex
- Bing Liu: counts the number of positive and negative words from the Bing Liu lexicon. BibTex
- AFINN: calculates positive and negative variables by aggregating the positive and negative word scores provided by this lexicon. BibTex
- Sentiment140: calculates positive and negative variables by aggregating the positive and negative word scores provided by this lexicon created with tweets annotated by emoticons. BibTex
- NRC Hashtag Sentiment lexicon: calculates positive and negative variables by aggregating the positive and negative word scores provided by this lexicon created with tweets annotated with emotional hashtags. BibTex
- NRC Word-Emotion Association Lexicon: counts the number of words matching each emotion from this lexicon. BibTex
- NRC-10 Expanded: adds the emotion associations of the words matching the Twitter Specific expansion of the NRC Word-Emotion Association Lexicon. BibTex
- NRC Hashtag Emotion Association Lexicon: adds the emotion associations of the words matching this lexicon. BibTex
- SentiWordNet: calculates positive and negative scores using SentiWordNet. For words occurring in multiple synsets, we calculate a weighted average of the sentiment distributions of the synsets. The weights correspond to the reciprocal ranks of the senses, so that the most popular senses receive higher weights. BibTex
- Emoticons: calculates a positive and a negative score by aggregating the word associations provided by a list of emoticons. The list is taken from the AFINN project.
- Negations: counts the number of negating words in the tweet.
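As an illustration of the reciprocal-rank weighting mentioned for SentiWordNet, the sketch below averages the scores of a word's senses with weight 1/rank. The class and method names are hypothetical, and normalising by the sum of the weights is an assumption about what "weighted average" means here. The filter's own code may differ in detail.

```java
final class ReciprocalRankAverage {
    /**
     * scoresBySenseRank[0] is the score of the top-ranked (most common) sense,
     * scoresBySenseRank[1] the second sense, and so on.
     */
    static double weightedAverage(double[] scoresBySenseRank) {
        double weightedSum = 0.0;
        double weightTotal = 0.0;
        for (int i = 0; i < scoresBySenseRank.length; i++) {
            double weight = 1.0 / (i + 1);    // reciprocal rank: 1, 1/2, 1/3, ...
            weightedSum += weight * scoresBySenseRank[i];
            weightTotal += weight;
        }
        return weightTotal == 0.0 ? 0.0 : weightedSum / weightTotal;
    }

    public static void main(String[] args) {
        // Two senses: first sense positive score 0.6, second sense 0.0.
        System.out.println(weightedAverage(new double[] {0.6, 0.0}));
    }
}
```

With two senses scored 0.6 and 0.0, the weights are 1 and 0.5, so the first sense dominates the result.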
TweetToInputLexiconFeatureVector: calculates features from a tweet using a given list of affective lexicons, where each lexicon is represented as an ARFF file. The features are calculated by adding or counting the affective associations of the words matching the given lexicons. All numeric and nominal attributes from each lexicon are considered: numeric scores are added and nominal values are counted. The NRC-Affect-Intensity lexicon is used by default. BibTex
TweetToSentiStrengthFeatureVector: calculates positive and negative sentiment strengths for a tweet using SentiStrength. Disclaimer: SentiStrength can only be used for academic purposes from within this package. BibTex
TweetToEmbeddingsFeatureVector: calculates a tweet-level feature representation using pre-trained word embeddings. A dummy word embedding formed by zeroes is used for words with no corresponding embedding. The tweet vectors can be calculated using the following schemes:
- Average word embeddings.
- Add word embeddings.
- Concatenation of the first k embeddings. Dummy values are added if the tweet has fewer than k words.
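The three aggregation schemes can be sketched in plain Java as below. This is an illustrative sketch under stated assumptions, not the filter's implementation: the class `TweetEmbedding`, its method names, and the in-memory `Map` of embeddings are hypothetical, and the average is assumed to divide by the number of tokens, including those mapped to the zero vector.

```java
import java.util.List;
import java.util.Map;

final class TweetEmbedding {
    /** Sums the word vectors; unknown words contribute a zero vector. */
    static double[] add(List<String> tokens, Map<String, double[]> emb, int dim) {
        double[] sum = new double[dim];
        for (String tok : tokens) {
            double[] v = emb.getOrDefault(tok, new double[dim]);
            for (int i = 0; i < dim; i++) {
                sum[i] += v[i];
            }
        }
        return sum;
    }

    /** Averages the word vectors over all tokens (assumption: unknown words count too). */
    static double[] average(List<String> tokens, Map<String, double[]> emb, int dim) {
        double[] avg = add(tokens, emb, dim);
        if (tokens.isEmpty()) {
            return avg;
        }
        for (int i = 0; i < dim; i++) {
            avg[i] /= tokens.size();
        }
        return avg;
    }

    /** Concatenates the vectors of the first k tokens, zero-padded to k * dim values. */
    static double[] concatFirstK(List<String> tokens, Map<String, double[]> emb, int dim, int k) {
        double[] out = new double[k * dim];   // already filled with zeroes (the dummy values)
        for (int j = 0; j < Math.min(k, tokens.size()); j++) {
            double[] v = emb.getOrDefault(tokens.get(j), new double[dim]);
            System.arraycopy(v, 0, out, j * dim, dim);
        }
        return out;
    }
}
```

Averaging and adding give a vector with the embedding's own dimensionality, whereas concatenation yields k times that many attributes and preserves word order.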
TweetNLPPOSTagger: runs the Twitter-specific POS tagger from the CMU TweetNLP library on the given tweets. POS tags are prepended to the tokens.
PMILexiconExpander: calculates the Pointwise Mutual Information (PMI) semantic orientation for each word in a corpus of tweets annotated by sentiment. The score is calculated by subtracting the PMI of the target word with a negative sentiment from the PMI of the target word with a positive sentiment. This is a supervised filter. BibTex
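The score described above can be written as PMI(w, positive) - PMI(w, negative), with PMI(w, c) = log2(P(w, c) / (P(w) P(c))). The sketch below estimates the probabilities from raw counts. The class and parameter names are hypothetical, and there is no smoothing: a word that never occurs in one class yields an infinite score, so a practical implementation would need smoothing or a minimum-frequency threshold.

```java
final class PmiSemanticOrientation {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** PMI(w, c) = log2( count(w, c) * total / (count(w) * count(c)) ). */
    static double pmi(double countWordInClass, double countWord, double countClass, double total) {
        return log2((countWordInClass * total) / (countWord * countClass));
    }

    /**
     * wordPos / wordNeg: occurrences of the word in positive / negative tweets.
     * nPos / nNeg: total counts for the positive / negative class.
     */
    static double score(double wordPos, double wordNeg, double nPos, double nNeg) {
        double total = nPos + nNeg;
        double countWord = wordPos + wordNeg;
        return pmi(wordPos, countWord, nPos, total) - pmi(wordNeg, countWord, nNeg, total);
    }

    public static void main(String[] args) {
        // A word seen 8 times in positive tweets and 2 times in negative ones,
        // with 100 observations per class.
        System.out.println(score(8, 2, 100, 100));
    }
}
```

A positive score indicates a stronger association with the positive class, a negative score the opposite. The terms involving the word's overall frequency cancel in the subtraction, so the score depends only on the per-class counts.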
TweetCentroid: calculates word distributional vectors from a corpus of unlabelled tweets by treating them as the centroid of the tweet vectors in which they appear. The vectors can be labelled using an affective lexicon to train a word-level affective classifier. This classifier can be used to expand the original lexicon. BibTex, original paper
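A minimal sketch of the centroid idea follows, assuming each tweet is represented by a bag-of-words vector of term frequencies. The class name `TweetCentroidSketch` and the sparse `Map` representation are hypothetical. The filter itself supports further attribute types that are not modelled here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TweetCentroidSketch {
    /** Returns, for each word, the average of the vectors of the tweets containing it. */
    static Map<String, Map<String, Double>> centroids(List<List<String>> tweets) {
        Map<String, Map<String, Double>> sums = new HashMap<>();
        Map<String, Integer> tweetCounts = new HashMap<>();

        for (List<String> tweet : tweets) {
            // Bag-of-words vector of this tweet (term frequencies).
            Map<String, Double> vec = new HashMap<>();
            for (String tok : tweet) {
                vec.merge(tok, 1.0, Double::sum);
            }
            // Add the tweet vector to the running sum of every word it contains.
            for (String word : vec.keySet()) {
                Map<String, Double> sum = sums.computeIfAbsent(word, k -> new HashMap<>());
                vec.forEach((feature, value) -> sum.merge(feature, value, Double::sum));
                tweetCounts.merge(word, 1, Integer::sum);
            }
        }
        // Divide each sum by the number of tweets the word appeared in.
        for (Map.Entry<String, Map<String, Double>> e : sums.entrySet()) {
            int n = tweetCounts.get(e.getKey());
            e.getValue().replaceAll((feature, value) -> value / n);
        }
        return sums;
    }
}
```

For the two tweets "happy day" and "happy birthday", the vector for "happy" has weight 1.0 on "happy" and 0.5 on each of "day" and "birthday". Such vectors could then be labelled with a lexicon and passed to a classifier.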
LabelWordVectors: labels word vectors with an input lexicon in arff format. This filter is useful for training word-level affective classifiers.
Distant Supervision Filters
ASA: Annotate-Sample-Average (ASA) is a lexicon-based distant supervision method for training polarity classifiers in Twitter in the absence of labelled data. It takes a collection of unlabelled tweets and a polarity lexicon in arff format and creates synthetic labelled instances. Each labelled instance is created by sampling with replacement a number of tweets containing at least one word from the lexicon with the desired polarity, and averaging the feature vectors of the sampled tweets. BibTex, original paper
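The sample and average steps can be sketched as below, assuming the annotate step has already selected the feature vectors of the tweets that contain at least one lexicon word of the desired polarity. The class name `AsaSketch`, the dense `double[]` vectors, and the parameter `a` (number of tweets sampled per synthetic instance) are hypothetical names for illustration.

```java
import java.util.List;
import java.util.Random;

final class AsaSketch {
    /**
     * Builds one synthetic instance for a polarity class by sampling `a` tweet
     * vectors with replacement from `candidates` and averaging them.
     */
    static double[] sampleAndAverage(List<double[]> candidates, int a, Random rng) {
        int dim = candidates.get(0).length;
        double[] avg = new double[dim];
        for (int s = 0; s < a; s++) {
            // Sampling with replacement: the same tweet may be drawn more than once.
            double[] v = candidates.get(rng.nextInt(candidates.size()));
            for (int i = 0; i < dim; i++) {
                avg[i] += v[i] / a;
            }
        }
        return avg;
    }

    public static void main(String[] args) {
        List<double[]> positives = List.of(new double[] {1, 0}, new double[] {0, 1});
        double[] instance = sampleAndAverage(positives, 10, new Random(42));
        System.out.println(instance[0] + " " + instance[1]);
    }
}
```

Repeating the call once per desired instance, for each polarity class in turn, would produce the synthetic labelled training set.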
PTCM: The Partitioned Tweet Centroid Model (PTCM) is an adaptation of the TweetCentroidModel for distant supervision. As tweets and words are represented by the same feature vectors, a word-level classifier trained from a polarity lexicon and a corpus of unlabelled tweets can be used for classifying the sentiment of tweets represented by sparse feature vectors. In other words, the labelled word vectors correspond to lexicon-annotated training data for message-level polarity classification. The model includes a simple modification to the tweet centroid model for increasing the number of labelled instances, yielding partitioned tweet centroids. This modification is based on partitioning the tweets associated with each word into smaller disjoint subsets of a fixed size. The method calculates one centroid per partition, which is labelled according to the lexicon. BibTex, original paper
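The partitioning step can be illustrated as follows. This is a sketch, assuming dense tweet vectors and a hypothetical `partitionSize` parameter. How the package treats a final partition with fewer tweets than the fixed size is not stated above, so keeping it as a smaller partition is an assumption made here.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionedCentroids {
    /**
     * Splits the tweet vectors associated with one word into disjoint, consecutive
     * partitions of at most `partitionSize` tweets and returns one centroid per partition.
     */
    static List<double[]> centroids(List<double[]> tweetVectors, int partitionSize) {
        List<double[]> out = new ArrayList<>();
        for (int start = 0; start < tweetVectors.size(); start += partitionSize) {
            int end = Math.min(start + partitionSize, tweetVectors.size());
            double[] centroid = new double[tweetVectors.get(start).length];
            for (double[] v : tweetVectors.subList(start, end)) {
                for (int i = 0; i < centroid.length; i++) {
                    centroid[i] += v[i] / (end - start);
                }
            }
            out.add(centroid);
        }
        return out;
    }
}
```

Each returned centroid would receive the lexicon label of the word it belongs to, so a word appearing in many tweets contributes several labelled instances instead of one.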
LexiconDistantSupervision: This is the most popular distant supervision approach for Twitter sentiment analysis. It takes a collection of unlabelled tweets and a polarity lexicon in arff format of positive and negative tokens. If a word from the lexicon is found, the tweet is labelled with the word's polarity. Tweets with both positive and negative words are discarded. The word used for labelling the tweet can be removed from the content. Emoticons are used as the default lexicon. original paper
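The labelling rule can be sketched as below. The class name `LexiconLabeler`, the label strings, and the in-memory `Map` lexicon are hypothetical. Removing the matched word from the tweet content is described above as optional and is not shown.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class LexiconLabeler {
    /**
     * Returns the polarity label for a tokenized tweet, or an empty Optional when
     * the tweet has no lexicon word or contains both positive and negative words.
     */
    static Optional<String> label(List<String> tokens, Map<String, String> lexicon) {
        boolean hasPositive = false;
        boolean hasNegative = false;
        for (String tok : tokens) {
            String polarity = lexicon.get(tok);
            if ("positive".equals(polarity)) {
                hasPositive = true;
            } else if ("negative".equals(polarity)) {
                hasNegative = true;
            }
        }
        if (hasPositive == hasNegative) {
            return Optional.empty();          // no evidence, or conflicting evidence
        }
        return Optional.of(hasPositive ? "positive" : "negative");
    }

    public static void main(String[] args) {
        Map<String, String> emoticons = Map.of(":)", "positive", ":(", "negative");
        System.out.println(label(List.of("great", "game", ":)"), emoticons));
    }
}
```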
- TweetNLPTokenizer: a Twitter-specific String tokenizer based on the CMU Tweet NLP tool that can be used with the existing StringWordToVector Weka filter.
- Datasets: The package provides some tweets annotated by affective values in gzipped ARFF format in $WEKA_HOME/packages/AffectiveTweets/data/. The default location for $WEKA_HOME is $HOME/wekafiles.
- Affective Lexicons: The package provides affective lexicons in ARFF format. These lexicons are located in $WEKA_HOME/packages/AffectiveTweets/lexicons/arff_lexicons/ and can be used with the TweetToInputLexiconFeatureVector filter.
- Pre-trained Word-Embeddings: The package provides a file of pre-trained word vectors, trained with the Word2Vec tool, in gzip-compressed format. It is a tab-separated file with the word in the last column, located at $WEKA_HOME/packages/AffectiveTweets/resources/w2v.twitter.edinburgh.100d.csv.gz. However, this is a toy example trained from a small collection of tweets. We recommend downloading w2v.twitter.edinburgh10M.400d.csv.gz, which provides embeddings trained from 10 million tweets taken from the Edinburgh corpus. The parameters were calibrated for classifying words into emotions. More info in this paper.
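To make the file layout concrete, the sketch below parses lines in the format described above (tab-separated numeric values, with the word in the last column). It assumes the lines have already been read and decompressed, and that there is no header row. Both are assumptions, and the class `EmbeddingReader` is hypothetical. For the actual gzipped file, the lines could be obtained by wrapping a `FileInputStream` in `java.util.zip.GZIPInputStream` and a `BufferedReader`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class EmbeddingReader {
    /** Parses lines of the form "v1<TAB>v2<TAB>...<TAB>word" into a word-to-vector map. */
    static Map<String, double[]> parse(List<String> lines) {
        Map<String, double[]> embeddings = new HashMap<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                continue;                     // skip blank or malformed lines
            }
            double[] vector = new double[cols.length - 1];
            for (int i = 0; i < vector.length; i++) {
                vector[i] = Double.parseDouble(cols[i]);
            }
            embeddings.put(cols[cols.length - 1], vector);  // word is the last column
        }
        return embeddings;
    }
}
```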
The Java documentation is available here.
- Email: fbravo at dcc.uchile.cl
- If you have questions about Weka please refer to the Weka mailing list.