This page shows how to benchmark models built with AffectiveTweets against similar models built with the NLTK sentiment analysis module and Scikit-learn. We represent tweets using word n-grams and lexicon-based features, and train logistic regression models on the Twitter Message Polarity Classification dataset from the SemEval 2013 Sentiment Analysis Task.

The code for reproducing these experiments can be downloaded from here.

AffectiveTweets Scripts

The following bash scripts assume that AffectiveTweets and Weka are already installed. Declare the following variables according to your installation paths.

export WEKA_HOME=/home/fbravoma/wekafiles/
export WEKA_PATH=/home/fbravoma/weka-3-9-3/

We need to convert the training and testing datasets into ARFF format:

java -cp $WEKA_HOME/packages/AffectiveTweets/AffectiveTweets.jar:$WEKA_PATH/weka.jar weka.core.converters.SemEvalToArff benchmark/dataset/twitter-train-B.txt benchmark/dataset/twitter-train-B.arff


java -cp $WEKA_HOME/packages/AffectiveTweets/AffectiveTweets.jar:$WEKA_PATH/weka.jar weka.core.converters.SemEvalToArff benchmark/dataset/twitter-test-gold-B.tsv benchmark/dataset/twitter-test-gold-B.arff
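To give a rough idea of what such a conversion involves, here is a minimal Python sketch that turns tab-separated lines (two id columns, a sentiment label, the tweet text) into ARFF text. The function name `tsv_to_arff`, the attribute names, and the choice to drop the id columns are assumptions made for illustration; the layout produced by the actual SemEvalToArff converter may differ. The sketch also assumes every label is already one of the three listed classes.

```python
def tsv_to_arff(tsv_lines, relation="tweets"):
    """Build ARFF text from SemEval-style TSV lines (hypothetical layout)."""
    header = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute sentiment {positive,neutral,negative}",
        "",
        "@data",
    ]
    rows = []
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        label, text = fields[2], fields[3]
        # Escape backslashes and single quotes so the text fits in a quoted ARFF string.
        escaped = text.replace("\\", "\\\\").replace("'", "\\'")
        rows.append(f"'{escaped}',{label}")
    return "\n".join(header + rows)


print(tsv_to_arff(["1\t2\tpositive\tI'm happy"]))
```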

Logistic regression model using word n-grams (n=1,2,3,4).

We train a logistic regression model using word n-grams as features with marked negation using the Weka command-line interface:

java -Xmx4G -cp  $WEKA_PATH/weka.jar weka.Run weka.classifiers.meta.FilteredClassifier -v -o -t benchmark/dataset/twitter-train-B.arff -T benchmark/dataset/twitter-test-gold-B.arff -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.TweetToSparseFeatureVector -E 5 -D 3 -I 0 -F -M 3 -R -G 0 -taggerFile $WEKA_HOME/packages/AffectiveTweets/resources/model.20120919 -wordClustFile $WEKA_HOME/packages/AffectiveTweets/resources/50mpaths2.txt.gz -Q 4 -red -stan -stemmer weka.core.stemmers.NullStemmer -stopwords-handler \\\"weka.core.stopwords.Null \\\" -I 1 -U -tokenizer \\\"weka.core.tokenizers.TweetNLPTokenizer \\\"\" -F \"weka.filters.unsupervised.attribute.Reorder -R 3-last,2\"" -S 1 -W weka.classifiers.functions.LibLINEAR -- -S 7 -C 1.0 -E 0.001 -B 1.0 -P -L 0.1 -I 1000

LibLinear implements various linear models (e.g., SVMs, logistic regression), selected by changing the loss function. In this and the following benchmark experiments involving AffectiveTweets and LibLinear, we use L2-regularized logistic regression models.
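For reference, the objective of a binary L2-regularized logistic regression, in the form commonly given for LIBLINEAR, is 0.5 * w·w + C * Σ log(1 + exp(-y_i * w·x_i)) with labels y_i in {-1, +1}. The sketch below only evaluates this objective in plain Python for a given weight vector. It is an illustration, not the solver itself, and the function name is an assumption. The parameter `C` is taken to play the role of the `-C 1.0` option in the command above.

```python
import math


def l2_logistic_objective(w, X, y, C=1.0):
    """Evaluate 0.5*||w||^2 + C * sum(log(1 + exp(-y_i * w.x_i)))."""
    regularizer = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        # Numerically stable form of log(1 + exp(-margin)).
        if margin > 0:
            loss += math.log1p(math.exp(-margin))
        else:
            loss += -margin + math.log1p(math.exp(margin))
    return regularizer + C * loss


# Toy data: two points with labels in {-1, +1}.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
print(l2_logistic_objective([0.0, 0.0], X, y))
```

With all weights at zero, each example contributes log(2) to the loss, so the objective is 2·log(2).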

Results:

Time taken to test model on test data: 1.56 seconds

=== Error on test data ===

Correctly Classified Instances        2545               66.7453 %
Incorrectly Classified Instances      1268               33.2547 %
Kappa statistic                          0.4457
Mean absolute error                      0.2642
Root mean squared error                  0.3945
Relative absolute error                 59.454  %
Root relative squared error             83.6944 %
Total Number of Instances             3813     


=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.616    0.136    0.760      0.616    0.680      0.501    0.816     0.789     positive
                 0.829    0.388    0.617      0.829    0.708      0.442    0.799     0.724     neutral
                 0.361    0.037    0.644      0.361    0.463      0.416    0.851     0.559     negative
Weighted Avg.    0.667    0.229    0.681      0.667    0.658      0.462    0.814     0.725     


=== Confusion Matrix ===

    a    b    c   <-- classified as
  968  543   61 |    a = positive
  221 1360   59 |    b = neutral
   84  300  217 |    c = negative
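The accuracy and Kappa statistic reported above can be recomputed from the confusion matrix alone. The following sketch (plain Python, with a helper name of our own choosing) applies the standard Cohen's kappa formula, (observed agreement - expected agreement) / (1 - expected agreement), to the matrix printed by Weka.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows = actual, cols = predicted)."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance, given the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


ngram_confusion = [
    [968, 543, 61],
    [221, 1360, 59],
    [84, 300, 217],
]
correct = sum(ngram_confusion[i][i] for i in range(3))
print(correct)                                  # correctly classified instances
print(round(cohen_kappa(ngram_confusion), 4))   # Kappa statistic
```

The diagonal sums to 2545 of 3813 instances, and the kappa value agrees with the 0.4457 reported by Weka.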

Logistic regression model using word n-grams + Bing Liu's Lexicon

We train a logistic regression model using word n-grams and features derived from Bing Liu's Lexicon:

java -Xmx4G -cp $WEKA_PATH/weka.jar weka.Run weka.classifiers.meta.FilteredClassifier  -v -o -t dataset/twitter-train-B.arff -T dataset/twitter-test-gold-B.arff -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.TweetToSparseFeatureVector -E 5 -D 3 -I 0 -F -M 3 -R -G 0 -taggerFile $WEKA_HOME/packages/AffectiveTweets/resources/model.20120919 -wordClustFile $WEKA_HOME/packages/AffectiveTweets/resources/50mpaths2.txt.gz -Q 4 -red -stan -stemmer weka.core.stemmers.NullStemmer -stopwords-handler \\\"weka.core.stopwords.Null \\\" -I 1 -U -tokenizer \\\"weka.core.tokenizers.TweetNLPTokenizer \\\"\" -F \"weka.filters.unsupervised.attribute.TweetToLexiconFeatureVector -D -red -stan -stemmer weka.core.stemmers.NullStemmer -stopwords-handler \\\"weka.core.stopwords.Null \\\" -I 1 -U -tokenizer \\\"weka.core.tokenizers.TweetNLPTokenizer \\\"\" -F \"weka.filters.unsupervised.attribute.Reorder -R 3-last,2\"" -S 1 -W weka.classifiers.functions.LibLINEAR -- -S 7 -C 1.0 -E 0.001 -B 1.0 -P -L 0.1 -I 1000

Results:

Time taken to test model on test data: 10.73 seconds

=== Error on test data ===

Correctly Classified Instances        2612               68.5025 %
Incorrectly Classified Instances      1201               31.4975 %
Kappa statistic                          0.4779
Mean absolute error                      0.2471
Root mean squared error                  0.383 
Relative absolute error                 55.596  %
Root relative squared error             81.2491 %
Total Number of Instances             3813     


=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.641    0.143    0.758      0.641    0.695      0.514    0.837     0.807     positive
                 0.820    0.349    0.640      0.820    0.719      0.469    0.818     0.751     neutral
                 0.431    0.038    0.680      0.431    0.527      0.477    0.884     0.616     negative
Weighted Avg.    0.685    0.215    0.695      0.685    0.679      0.489    0.836     0.753     


=== Confusion Matrix ===

    a    b    c   <-- classified as
 1008  502   62 |    a = positive
  235 1345   60 |    b = neutral
   86  256  259 |    c = negative
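The weighted-average F-measure in the table above can likewise be derived from the confusion matrix: compute F1 per class as 2·TP / (row total + column total) and weight each class by its share of the test instances. The helper below is a small sketch with a name of our own choosing.

```python
def weighted_f1(confusion):
    """Support-weighted average of per-class F1 (rows = actual, cols = predicted)."""
    total = sum(sum(row) for row in confusion)
    col_totals = [sum(col) for col in zip(*confusion)]
    score = 0.0
    for i, row in enumerate(confusion):
        support = sum(row)
        denominator = support + col_totals[i]  # equals 2*TP + FN + FP
        f1 = 2 * row[i] / denominator if denominator else 0.0
        score += f1 * support / total
    return score


liu_ngram_confusion = [
    [1008, 502, 62],
    [235, 1345, 60],
    [86, 256, 259],
]
print(round(weighted_f1(liu_ngram_confusion), 3))
```

For this matrix the result agrees with the weighted F-measure of 0.679 reported by Weka.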

Logistic regression model using Bing Liu's Lexicon + SentiStrength

We train a logistic regression model using features derived from Bing Liu's Lexicon and the SentiStrength method:

java -Xmx4G -cp $WEKA_PATH/weka.jar weka.Run weka.classifiers.meta.FilteredClassifier -v -o -t benchmark/dataset/twitter-train-B.arff -T benchmark/dataset/twitter-test-gold-B.arff -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.TweetToSentiStrengthFeatureVector -L $WEKA_HOME/packages/AffectiveTweets/lexicons/SentiStrength/english -stan -stemmer weka.core.stemmers.NullStemmer -stopwords-handler \\\"weka.core.stopwords.Null \\\" -I 1 -U -tokenizer \\\"weka.core.tokenizers.TweetNLPTokenizer \\\"\" -F \"weka.filters.unsupervised.attribute.TweetToLexiconFeatureVector -D -red -stan -stemmer weka.core.stemmers.NullStemmer -stopwords-handler \\\"weka.core.stopwords.Null \\\" -I 1 -U -tokenizer \\\"weka.core.tokenizers.TweetNLPTokenizer \\\"\" -F \"weka.filters.unsupervised.attribute.Reorder -R 3-last,2\"" -S 1 -W weka.classifiers.functions.LibLINEAR -- -S 7 -C 1.0 -E 0.001 -B 1.0 -P -L 0.1 -I 1000

Results:

Time taken to test model on test data: 3.32 seconds

=== Error on test data ===

Correctly Classified Instances        2457               64.4375 %
Incorrectly Classified Instances      1356               35.5625 %
Kappa statistic                          0.4029
Mean absolute error                      0.3198
Root mean squared error                  0.4016
Relative absolute error                 71.9598 %
Root relative squared error             85.1839 %
Total Number of Instances             3813     


=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.622    0.171    0.718      0.622    0.666      0.463    0.794     0.714     positive
                 0.802    0.399    0.603      0.802    0.688      0.403    0.764     0.667     neutral
                 0.275    0.033    0.611      0.275    0.379      0.344    0.790     0.483     negative
Weighted Avg.    0.644    0.247    0.651      0.644    0.630      0.418    0.781     0.657     


=== Confusion Matrix ===

    a    b    c   <-- classified as
  977  537   58 |    a = positive
  278 1315   47 |    b = neutral
  106  330  165 |    c = negative

Logistic regression model using n-grams + Bing Liu's Lexicon + SentiStrength

Now we combine the features from the two previous examples:

java -Xmx4G -cp $WEKA_PATH/weka.jar weka.Run weka.classifiers.meta.FilteredClassifier  -v -o -t benchmark/dataset/twitter-train-B.arff -T benchmark/dataset/twitter-test-gold-B.arff -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.TweetToSparseFeatureVector -E 5 -D 3 -I 0 -F -M 3 -R -G 0 -taggerFile $WEKA_HOME/packages/AffectiveTweets/resources/model.20120919 -wordClustFile $WEKA_HOME/packages/AffectiveTweets/resources/50mpaths2.txt.gz -Q 4 -red -stan -stemmer weka.core.stemmers.NullStemmer -stopwords-handler \\\"weka.core.stopwords.Null \\\" -I 1 -U -tokenizer \\\"weka.core.tokenizers.TweetNLPTokenizer \\\"\" -F \"weka.filters.unsupervised.attribute.TweetToSentiStrengthFeatureVector -L $WEKA_HOME/packages/AffectiveTweets/lexicons/SentiStrength/english -stan -stemmer weka.core.stemmers.NullStemmer -stopwords-handler \\\"weka.core.stopwords.Null \\\" -I 1 -U -tokenizer \\\"weka.core.tokenizers.TweetNLPTokenizer \\\"\" -F \"weka.filters.unsupervised.attribute.TweetToLexiconFeatureVector -D -red -stan -stemmer weka.core.stemmers.NullStemmer -stopwords-handler \\\"weka.core.stopwords.Null \\\" -I 1 -U -tokenizer \\\"weka.core.tokenizers.TweetNLPTokenizer \\\"\" -F \"weka.filters.unsupervised.attribute.Reorder -R 3-last,2\"" -S 1 -W weka.classifiers.functions.LibLINEAR -- -S 7 -C 1.0 -E 0.001 -B 1.0 -P -L 0.1 -I 1000

Results:

Time taken to test model on test data: 13.04 seconds

=== Error on test data ===

Correctly Classified Instances        2648               69.4466 %
Incorrectly Classified Instances      1165               30.5534 %
Kappa statistic                          0.4949
Mean absolute error                      0.2401
Root mean squared error                  0.3786
Relative absolute error                 54.0243 %
Root relative squared error             80.3157 %
Total Number of Instances             3813     


=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.649    0.139    0.766      0.649    0.703      0.527    0.845     0.812     positive
                 0.823    0.335    0.649      0.823    0.726      0.485    0.826     0.766     neutral
                 0.463    0.039    0.690      0.463    0.554      0.502    0.896     0.642     negative
Weighted Avg.    0.694    0.208    0.704      0.694    0.689      0.505    0.845     0.766     


=== Confusion Matrix ===

    a    b    c   <-- classified as
 1021  485   66 |    a = positive
  232 1349   59 |    b = neutral
   80  243  278 |    c = negative

Logistic regression model using n-grams + SentiStrength + all lexicons

Now we include features from all the lexicons implemented by AffectiveTweets:

java -Xmx4G -cp $WEKA_PATH/weka.jar weka.Run weka.classifiers.meta.FilteredClassifier -v -o -t benchmark/dataset/twitter-train-B.arff -T  benchmark/dataset/twitter-test-gold-B.arff -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.TweetToSparseFeatureVector -E 5 -D 3 -I 0 -F -M 3 -R -G 0 -taggerFile $WEKA_HOME/packages/AffectiveTweets/resources/model.20120919 -wordClustFile $WEKA_HOME/packages/AffectiveTweets/resources/50mpaths2.txt.gz -Q 4 -red -stan -stemmer weka.core.stemmers.NullStemmer -stopwords-handler \\\"weka.core.stopwords.Null \\\" -I 1 -U -tokenizer \\\"weka.core.tokenizers.TweetNLPTokenizer \\\"\" -F \"weka.filters.unsupervised.attribute.TweetToSentiStrengthFeatureVector -L $WEKA_HOME/packages/AffectiveTweets/lexicons/SentiStrength/english -stan -stemmer weka.core.stemmers.NullStemmer -stopwords-handler \\\"weka.core.stopwords.Null \\\" -I 1 -U -tokenizer \\\"weka.core.tokenizers.TweetNLPTokenizer \\\"\" -F \"weka.filters.unsupervised.attribute.TweetToLexiconFeatureVector -F -D -R -A -N -P -J -H -Q -red -stan -stemmer weka.core.stemmers.NullStemmer -stopwords-handler \\\"weka.core.stopwords.Null \\\" -I 1 -U -tokenizer \\\"weka.core.tokenizers.TweetNLPTokenizer \\\"\" -F \"weka.filters.unsupervised.attribute.Reorder -R 3-last,2\"" -S 1 -W weka.classifiers.functions.LibLINEAR -- -S 7 -C 1.0 -E 0.001 -B 1.0 -P -L 0.1 -I 1000

Results:

Time taken to test model on test data: 13.29 seconds

=== Error on test data ===

Correctly Classified Instances        2706               70.9677 %
Incorrectly Classified Instances      1107               29.0323 %
Kappa statistic                          0.5215
Mean absolute error                      0.2289
Root mean squared error                  0.372 
Relative absolute error                 51.5039 %
Root relative squared error             78.9079 %
Total Number of Instances             3813     


=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.658    0.131    0.779      0.658    0.713      0.544    0.854     0.823     positive
                 0.831    0.319    0.663      0.831    0.738      0.509    0.835     0.772     neutral
                 0.514    0.037    0.720      0.514    0.600      0.550    0.913     0.687     negative
Weighted Avg.    0.710    0.197    0.720      0.710    0.706      0.530    0.855     0.779     


=== Confusion Matrix ===

    a    b    c   <-- classified as
 1034  471   67 |    a = positive
  224 1363   53 |    b = neutral
   70  222  309 |    c = negative

NLTK + Scikit-learn Scripts

Now we will build similar models using Python 3.6.

First we need to import the following libraries.

import pandas as pd       
from nltk.tokenize import TweetTokenizer
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.sentiment.util import  mark_negation
from nltk.corpus import opinion_lexicon

from sklearn.feature_extraction.text import CountVectorizer  
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import confusion_matrix, cohen_kappa_score, classification_report
import numpy as np

Make sure to install all of them using pip or conda.

Next, load training and testing datasets as pandas dataframes:


# load training and testing datasets as a pandas dataframe
train_data = pd.read_csv("dataset/twitter-train-B.txt", header=None, delimiter="\t",usecols=(2,3), names=("sent","tweet"))
test_data = pd.read_csv("dataset/twitter-test-gold-B.tsv", header=None, delimiter="\t",usecols=(2,3), names=("sent","tweet"))

# map the labels objective-OR-neutral and objective to neutral
train_data.sent = train_data.sent.replace(['objective-OR-neutral','objective'],['neutral','neutral'])

# use a Twitter-specific tokenizer
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)

Logistic regression model using word n-grams (n=1,2,3,4).

We replicate the model we created previously using AffectiveTweets. N-grams are extracted using CountVectorizer from Scikit-learn. Words that fall within the scope of a negation word are marked.

vectorizer = CountVectorizer(tokenizer = tokenizer.tokenize, preprocessor = mark_negation, ngram_range=(1,4))  
log_mod = LogisticRegression(solver='liblinear',multi_class='ovr')   
text_clf = Pipeline([('vect', vectorizer), ('clf', log_mod)])

text_clf.fit(train_data.tweet, train_data.sent)

predicted = text_clf.predict(test_data.tweet)

conf = confusion_matrix(test_data.sent, predicted)
kappa = cohen_kappa_score(test_data.sent, predicted) 
class_rep = classification_report(test_data.sent, predicted)



print('Confusion Matrix for Logistic Regression + ngram features:')
print(conf)
print('Classification Report')
print(class_rep)
print('kappa:'+str(kappa))

Results:

Confusion Matrix for Logistic Regression + ngram features:
[[ 172  335   94]
 [  31 1433  176]
 [  48  620  904]]
Classification Report
             precision    recall  f1-score   support

   negative       0.69      0.29      0.40       601
    neutral       0.60      0.87      0.71      1640
   positive       0.77      0.58      0.66      1572

avg / total       0.68      0.66      0.64      3813

kappa:0.4236034809946826
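To make the feature space of the n-gram model above more concrete, the following sketch shows the general idea of negation marking and word n-gram extraction on a list of tokens. It is a simplified stand-in written in plain Python, not the code used by NLTK or CountVectorizer: the negation word list, the punctuation rule, and both function names are assumptions chosen for illustration.

```python
import re

NEGATION_WORDS = {"not", "no", "never", "cannot"}  # small illustrative list
CLAUSE_PUNCT = re.compile(r"^[.:;!?,]$")


def mark_negated_tokens(tokens):
    """Append _NEG to tokens between a negation word and the next clause punctuation."""
    marked = []
    in_scope = False
    for token in tokens:
        if token in NEGATION_WORDS or token.endswith("n't"):
            in_scope = True
            marked.append(token)
        elif CLAUSE_PUNCT.match(token):
            in_scope = False
            marked.append(token)
        else:
            marked.append(token + "_NEG" if in_scope else token)
    return marked


def word_ngrams(tokens, max_n=4):
    """Return all word n-grams for n = 1..max_n, joined by spaces."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = mark_negated_tokens("i do not like this . great".split())
print(tokens)
print(word_ngrams(tokens, max_n=2))
```

In this toy sentence, "like" and "this" are marked because they follow "not", while "great" is left untouched because the full stop closes the negation scope.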

Logistic regression model using word n-grams + Bing Liu's Lexicon

We now replicate the second model created using AffectiveTweets: a logistic regression trained on word n-grams and features calculated from Bing Liu's Lexicon.

First, we need to make sure that the required NLTK resources are installed:

import nltk
nltk.download('opinion_lexicon')

We extend Scikit-learn classes BaseEstimator and TransformerMixin to implement a feature extractor that uses Bing Liu's lexicon:


class LiuFeatureExtractor(BaseEstimator, TransformerMixin):
    """Takes in a corpus of tweets and calculates features using Bing Liu's lexicon"""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.pos_set = set(opinion_lexicon.positive())
        self.neg_set = set(opinion_lexicon.negative())

    def liu_score(self,sentence):
        """Calculates the number of positive and negative words in the sentence using Bing Liu's Lexicon""" 
        tokenized_sent = self.tokenizer.tokenize(sentence)
        pos_words = 0
        neg_words = 0
        for word in tokenized_sent:
            if word in self.pos_set:
                pos_words += 1
            elif word in self.neg_set:
                neg_words += 1
        return [pos_words,neg_words]

    def transform(self, X, y=None):
        """Applies liu_score to an iterable of tweets (e.g., a pandas Series)"""
        values = []
        for tweet in X:
            values.append(self.liu_score(tweet))

        return(np.array(values))

    def fit(self, X, y=None):
        """This function must return `self` unless we expect the transform function to perform a 
        different action on training and testing partitions (e.g., when we calculate unigram features, 
        the dictionary is only extracted from the first batch)"""
        return self

We can combine word n-gram features and features derived from Bing Liu's lexicon using the class FeatureUnion from Scikit-learn:

liu_feat = LiuFeatureExtractor(tokenizer)
vectorizer = CountVectorizer(tokenizer = tokenizer.tokenize, preprocessor = mark_negation, ngram_range=(1,4))  
log_mod = LogisticRegression(solver='liblinear',multi_class='ovr')   
liu_ngram_clf = Pipeline([ ('feats', 
                            FeatureUnion([ ('ngram', vectorizer), ('liu',liu_feat) ])),
    ('clf', log_mod)])


liu_ngram_clf.fit(train_data.tweet, train_data.sent)
pred_liu_ngram = liu_ngram_clf.predict(test_data.tweet)


conf_liu_ngram = confusion_matrix(test_data.sent, pred_liu_ngram)
kappa_liu_ngram = cohen_kappa_score(test_data.sent, pred_liu_ngram) 
class_rep_liu_ngram = classification_report(test_data.sent, pred_liu_ngram)

print('Confusion Matrix for Logistic Regression + ngrams + features from Bing Liu\'s Lexicon')
print(conf_liu_ngram)
print('Classification Report')
print(class_rep_liu_ngram)
print('kappa:'+str(kappa_liu_ngram))

Results:

Confusion Matrix for Logistic Regression + ngrams + features from Bing Liu's Lexicon
[[ 236  290   75]
 [  44 1395  201]
 [  59  529  984]]
Classification Report
             precision    recall  f1-score   support

   negative       0.70      0.39      0.50       601
    neutral       0.63      0.85      0.72      1640
   positive       0.78      0.63      0.69      1572

avg / total       0.70      0.69      0.68      3813

kappa:0.4763629485702495
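Conceptually, the FeatureUnion step used above places the columns produced by each feature extractor side by side for every tweet. The sketch below mimics that idea with plain Python lists. The two toy extractors and the helper `union_features` are hypothetical; the real FeatureUnion additionally fits each transformer and works with sparse matrices.

```python
def union_features(extractors, documents):
    """Concatenate, per document, the feature vectors returned by each extractor."""
    rows = []
    for doc in documents:
        row = []
        for extract in extractors:
            row.extend(extract(doc))
        rows.append(row)
    return rows


def length_features(doc):
    """Toy extractor: number of whitespace-separated tokens."""
    return [len(doc.split())]


def toy_lexicon_features(doc):
    """Toy extractor: counts of positive and negative words from a tiny made-up lexicon."""
    tokens = doc.lower().split()
    positive = sum(token in {"good", "great"} for token in tokens)
    negative = sum(token in {"bad", "awful"} for token in tokens)
    return [positive, negative]


features = union_features(
    [length_features, toy_lexicon_features],
    ["good movie", "bad awful day"],
)
print(features)
```

Each row holds the token count followed by the positive and negative word counts, so the classifier sees a single combined feature vector per tweet.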

Logistic regression model using Bing Liu's Lexicon + Vader

Unfortunately, SentiStrength is not implemented in the NLTK sentiment module. However, NLTK implements Vader, which is another popular lexicon-based sentiment analysis method.

We train a logistic regression model using features derived from Bing Liu's lexicon and Vader.

First, we need to make sure that the required NLTK resources are installed:

import nltk
nltk.download('opinion_lexicon')
nltk.download('vader_lexicon')

We implement another feature extractor that calculates features using Vader:

class VaderFeatureExtractor(BaseEstimator, TransformerMixin):
    """Takes in a corpus of tweets and calculates features using the Vader method"""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.sid = SentimentIntensityAnalyzer()


    def vader_score(self,sentence):
        """ Calculates sentiment scores for a sentence using the Vader method """
        pol_scores = self.sid.polarity_scores(sentence)
        return(list(pol_scores.values()))

    def transform(self, X, y=None):
        """Applies vader_score on a data.frame containing tweets """
        values = []
        for tweet in X:
            values.append(self.vader_score(tweet))

        return(np.array(values))

    def fit(self, X, y=None):
        """Returns `self` unless something different happens in train and test"""
        return self





vader_feat = VaderFeatureExtractor(tokenizer)
liu_feat = LiuFeatureExtractor(tokenizer)

log_mod = LogisticRegression(solver='liblinear',multi_class='ovr')   
vader_liu_clf = Pipeline([ ('feats', 
                            FeatureUnion([ ('vader', vader_feat), ('liu',liu_feat) ])),
    ('clf', log_mod)])


vader_liu_clf.fit(train_data.tweet, train_data.sent)
pred_vader_liu = vader_liu_clf.predict(test_data.tweet)


conf_vader_liu = confusion_matrix(test_data.sent, pred_vader_liu)
kappa_vader_liu = cohen_kappa_score(test_data.sent, pred_vader_liu) 
class_rep_vader_liu = classification_report(test_data.sent, pred_vader_liu)

print('Confusion Matrix for Logistic Regression + Vader + features from Bing Liu\'s Lexicon')
print(conf_vader_liu)
print('Classification Report')
print(class_rep_vader_liu)
print('kappa:'+str(kappa_vader_liu))

Results:

Confusion Matrix for Logistic Regression + Vader + features from Bing Liu's Lexicon
[[ 169  323  109]
 [  51 1275  314]
 [  58  491 1023]]
Classification Report
             precision    recall  f1-score   support

   negative       0.61      0.28      0.38       601
    neutral       0.61      0.78      0.68      1640
   positive       0.71      0.65      0.68      1572

avg / total       0.65      0.65      0.63      3813

kappa:0.408231856331834
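The Vader extractor above builds its feature vector with `list(pol_scores.values())`, so the column order follows the order of the keys in the returned dictionary (`neg`, `neu`, `pos`, `compound`). Dictionary ordering is only a language guarantee from Python 3.7 onwards, so one possible alternative is to select the scores by explicit key. The sketch below illustrates this with a hand-written dictionary; the helper name and the score values are made up for the example.

```python
VADER_KEYS = ("neg", "neu", "pos", "compound")


def scores_to_vector(pol_scores, keys=VADER_KEYS):
    """Return the scores in a fixed column order, independent of dictionary ordering."""
    return [pol_scores[key] for key in keys]


# Hypothetical scores, deliberately written in a different key order.
example_scores = {"compound": 0.6249, "pos": 0.577, "neu": 0.423, "neg": 0.0}
print(scores_to_vector(example_scores))
```

Whatever the order of the input dictionary, the vector always lists the negative, neutral, positive and compound scores in that order.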

Logistic regression model using n-grams + Bing Liu's Lexicon + Vader

We now combine the feature space of all the previous examples:


ngram_lex_clf = Pipeline([ ('feats', 
                            FeatureUnion([ ('ngram', vectorizer), ('vader',vader_feat),('liu',liu_feat)  ])),
    ('clf', log_mod)])


ngram_lex_clf.fit(train_data.tweet, train_data.sent)
pred_ngram_lex = ngram_lex_clf.predict(test_data.tweet)


conf_ngram_lex = confusion_matrix(test_data.sent, pred_ngram_lex)
kappa_ngram_lex = cohen_kappa_score(test_data.sent, pred_ngram_lex) 
class_rep = classification_report(test_data.sent, pred_ngram_lex)


print('Confusion Matrix for Logistic Regression + ngrams + features from Bing Liu\'s Lexicon and the Vader method')
print(conf_ngram_lex)
print('Classification Report')
print(class_rep)
print('kappa:'+str(kappa_ngram_lex))

Results:

Confusion Matrix for Logistic Regression + ngrams + features from Bing Liu's Lexicon and the Vader method
[[ 268  261   72]
 [  45 1387  208]
 [  56  493 1023]]
Classification Report
             precision    recall  f1-score   support

   negative       0.73      0.45      0.55       601
    neutral       0.65      0.85      0.73      1640
   positive       0.79      0.65      0.71      1572

avg / total       0.72      0.70      0.70      3813

kappa:0.5058311344923361

Summary of Results

The following table summarises all the experiments above:

Features	Implementation	Kappa Score	F1 Score	Time (Seconds)
Word n-grams	Scikitlearn + NLTK	0.42	0.64	30.7
Word n-grams	AffectiveTweets	0.45	0.66	13.0
Word n-grams + Liu Lexicon	Scikitlearn + NLTK	0.48	0.68	13.4
Word n-grams + Liu Lexicon	AffectiveTweets	0.48	0.68	27.4
Liu Lexicon + Vader	Scikitlearn + NLTK	0.41	0.63	8.9
Liu Lexicon + SentiStrength	AffectiveTweets	0.40	0.63	31.9
Word n-grams + Liu Lexicon + Vader	Scikitlearn + NLTK	0.51	0.70	16
Word n-grams + Liu Lexicon + SentiStrength	AffectiveTweets	0.49	0.69	68.5
Word n-grams + All lexicons + SentiStrength	AffectiveTweets	0.52	0.71	74.6

The execution time is averaged over 10 repetitions of each model.
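The measurement procedure is not detailed here, but averaging the runtime over repeated runs could look roughly like the sketch below. The helper `average_runtime` is a hypothetical name, and `fn` stands for any callable that trains and evaluates one model (for example, a function wrapping the `fit` and `predict` calls shown earlier).

```python
import statistics
import time


def average_runtime(fn, repetitions=10):
    """Run fn several times and return the mean wall-clock duration in seconds."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


# Placeholder workload standing in for a full train-and-test run.
def toy_workload():
    sum(i * i for i in range(10000))


print(average_runtime(toy_workload, repetitions=10))
```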

Bear in mind that only two models (word n-grams and word n-grams + Liu Lexicon) can be directly compared across both implementations (AffectiveTweets and Scikit-learn + NLTK), as they use the same features and the same learning scheme. The other examples, such as Liu Lexicon + Vader and Liu Lexicon + SentiStrength, show how similar models can be implemented using the two different tools.

The experiments were performed on an Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz with 16 GB of RAM running Ubuntu 16.04.4 LTS. AffectiveTweets models were run using Weka 3.9.3 and Java 8 (Oracle version). Scikit-learn + NLTK models were run using Python 3.6.4 (Anaconda version), Scikit-learn 0.20.3 and NLTK 3.4.1.