Sentiment Analysis Analysis – Neural Networks

In the next set of topics we will dive into different approaches to solving the "hello world" problem of the NLP world: sentiment analysis.

Check out the other parts of the series: Part 1, Part 2, Part 3

The code for this implementation is at https://github.com/iolucas/nlpython/blob/master/blog/sentiment-analysis-analysis/neural-networks.ipynb

The Code

We will use two machine learning libraries:

  • scikit-learn to create one-hot vectors from our text and to split the dataset into train, test, and validation sets;
  • TensorFlow to create the neural network and train it.

Our dataset is composed of movie reviews and labels telling whether each review is negative or positive. Let’s load the dataset:

The reviews file is a little big, so it is in zip format. Let’s extract it with the zipfile module:

import zipfile

# Extract reviews.txt and labels.txt into the current directory
with zipfile.ZipFile("reviews.zip", 'r') as zip_ref:
    zip_ref.extractall(".")

Now that we have the reviews.txt and labels.txt files, we load them into memory:

# splitlines() avoids a trailing empty entry when the file ends with a newline
with open("reviews.txt") as f:
    reviews = f.read().splitlines()
with open("labels.txt") as f:
    labels = f.read().splitlines()

# Tokenize each review into a list of words
reviews_tokens = [review.split() for review in reviews]

Next we set up the encoder that will transform our review tokens into binary vectors, with the help of the MultiLabelBinarizer class:

from sklearn.preprocessing import MultiLabelBinarizer
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)
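
To see what the binarizer does, here is a hypothetical toy example (not part of the original notebook): each transformed vector has one position per vocabulary word, set to 1 if that word appears in the review.

toy_enc = MultiLabelBinarizer()
toy_enc.fit([["good", "movie"], ["bad", "movie"]])
print(toy_enc.classes_)                        # ['bad' 'good' 'movie']
print(toy_enc.transform([["good", "movie"]]))  # [[0 1 1]]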

After that we split the data into training and test sets with the train_test_split function. We then split the test set in half to generate a validation set:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.4, random_state=None)
split_point = int(len(X_test)/2)
X_valid, y_valid = X_test[split_point:], y_test[split_point:]
X_test, y_test = X_test[:split_point], y_test[:split_point]
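
As a quick sanity check (not in the original notebook), we can print the resulting set sizes; with test_size=0.4 split in half, we expect roughly a 60/20/20 split:

print(len(X_train), len(X_valid), len(X_test))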

We then define two functions: label2bool, which converts a string label into a binary vector of two elements, and get_batch, a generator that yields successive slices of the dataset during iteration:

def label2bool(labels):
    # "positive" -> [1, 0], anything else -> [0, 1]
    return [[1,0] if label == "positive" else [0,1] for label in labels]

def get_batch(X, y, batch_size):
    # Yield successive (X, y) slices of at most batch_size elements
    for batch_pos in range(0, len(X), batch_size):
        yield X[batch_pos:batch_pos+batch_size], y[batch_pos:batch_pos+batch_size]
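
As a hypothetical usage sketch (not in the original notebook), iterating over a toy list with batch_size=2 yields slices of at most two elements, with the final batch holding the remainder:

for X_batch, y_batch in get_batch([1, 2, 3, 4, 5], ["a", "b", "c", "d", "e"], 2):
    print(X_batch, y_batch)
# [1, 2] ['a', 'b']
# [3, 4] ['c', 'd']
# [5] ['e']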

TensorFlow connects expressions in structures called graphs. We first clear any existing graph, then get the vocabulary length and declare the placeholders that will be used to feed in our text data and labels:

import tensorflow as tf

tf.reset_default_graph()
vocab_len = len(onehot_enc.classes_)  # one input unit per vocabulary word
inputs_ = tf.placeholder(dtype=tf.float32, shape=[None, vocab_len], name="inputs")
targets_ = tf.placeholder(dtype=tf.float32, shape=[None, 2], name="targets")

This post is not meant to be a TensorFlow tutorial; for more details visit tensorflow.org/get_started

We then create our neural network:

  • h1 is the hidden layer that receives the review word vectors as input;
  • logits is the final layer that receives h1 as input;
  • output is the result of applying the sigmoid function to the logits;
  • loss is the expression that calculates the current error of the neural network;
  • optimizer is the expression that adjusts the weights of the neural network in order to reduce the loss;
  • correct_pred and accuracy are used to calculate the current accuracy of the neural network, ranging from 0 to 1.

# Hidden layer with 500 ReLU units
h1 = tf.layers.dense(inputs_, 500, activation=tf.nn.relu)
# Output layer: one logit per class (positive, negative)
logits = tf.layers.dense(h1, 2, activation=None)
output = tf.nn.sigmoid(logits)
# Softmax cross-entropy between the logits and the one-hot targets
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=targets_))
optimizer = tf.train.AdamOptimizer(0.001).minimize(loss)
# Fraction of examples where the predicted class matches the target class
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(targets_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')

We then train the network, periodically printing its current accuracy and loss:

epochs = 10
batch_size = 3000

sess = tf.Session()

# Initializing the variables
sess.run(tf.global_variables_initializer())
for epoch in range(epochs):
    for X_batch, y_batch in get_batch(onehot_enc.transform(X_train), label2bool(y_train), batch_size):         
        loss_value, _ = sess.run([loss, optimizer], feed_dict={
            inputs_: X_batch,
            targets_: y_batch
        })
        print("Epoch: {} \t Training loss: {}".format(epoch, loss_value))

    acc = sess.run(accuracy, feed_dict={
        inputs_: onehot_enc.transform(X_valid),
        targets_: label2bool(y_valid)
    })

    print("Epoch: {} \t Validation Accuracy: {}".format(epoch, acc))

test_acc = sess.run(accuracy, feed_dict={
    inputs_: onehot_enc.transform(X_test),
    targets_: label2bool(y_test)
})
print("Test Accuracy: {}".format(test_acc))

With this network we got an accuracy of 90%! With more data and a bigger network we could improve this result even further!
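
As a final sketch (not part of the original notebook), here is how we could score a brand-new review with the trained network; my_review is a made-up example, and words outside the training vocabulary are simply ignored by the binarizer:

my_review = "this movie was great"  # hypothetical input
review_vec = onehot_enc.transform([my_review.split()])
probs = sess.run(output, feed_dict={inputs_: review_vec})
print("positive" if probs[0][0] > probs[0][1] else "negative")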

Please leave any questions and comments below!