For this project we will use the text8 dataset (which can be downloaded here). This dataset is a dump of cleaned Wikipedia text. More details here.

First, we import the necessary libraries.

```
from collections import Counter, defaultdict
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from sklearn.manifold import TSNE
```

Next we will create a dataset class to manage our feature extraction and batch generation. We must build a co-occurrence matrix over the vocabulary to feed the GloVe model. We will generate a vocabulary of ~190k words. A dense matrix would have 190,000² ≈ 36 billion entries; at 32 bits per value, that is ~135 GB of memory, too much to store and process. To handle this we can leverage the fact that most entries of this matrix are 0, so we only need to store the non-zero values, which drastically reduces the amount of memory necessary.
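As a quick back-of-the-envelope check of these numbers (a standalone sketch, not part of the model code):

```
vocab_size = 190_000

entries = vocab_size ** 2      # ~36.1 billion entries in a dense matrix
bytes_needed = entries * 4     # 4 bytes per 32-bit value

print(entries)                 # 36100000000
print(bytes_needed / 2**30)    # ~134.5 GiB
```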

```
class GloveDataset:
    def __init__(self, text, n_words=200000, window_size=5):
        self._window_size = window_size
        self._tokens = text.split(" ")[:n_words]
        word_counter = Counter()
        word_counter.update(self._tokens)
        self._word2id = {w: i for i, (w, _) in enumerate(word_counter.most_common())}
        self._id2word = {i: w for w, i in self._word2id.items()}
        self._vocab_len = len(self._word2id)

        self._id_tokens = [self._word2id[w] for w in self._tokens]

        self._create_coocurrence_matrix()

        print("# of words: {}".format(len(self._tokens)))
        print("Vocabulary length: {}".format(self._vocab_len))

    def _create_coocurrence_matrix(self):
        cooc_mat = defaultdict(Counter)
        for i, w in enumerate(self._id_tokens):
            start_i = max(i - self._window_size, 0)
            end_i = min(i + self._window_size + 1, len(self._id_tokens))
            for j in range(start_i, end_i):
                if i != j:
                    c = self._id_tokens[j]
                    cooc_mat[w][c] += 1 / abs(j - i)

        self._i_idx = list()
        self._j_idx = list()
        self._xij = list()

        #Create indexes and x values tensors
        for w, cnt in cooc_mat.items():
            for c, v in cnt.items():
                self._i_idx.append(w)
                self._j_idx.append(c)
                self._xij.append(v)

        self._i_idx = torch.LongTensor(self._i_idx).cuda()
        self._j_idx = torch.LongTensor(self._j_idx).cuda()
        self._xij = torch.FloatTensor(self._xij).cuda()

    def get_batches(self, batch_size):
        #Generate random idx
        rand_ids = torch.LongTensor(np.random.choice(len(self._xij), len(self._xij), replace=False))

        for p in range(0, len(rand_ids), batch_size):
            batch_ids = rand_ids[p:p + batch_size]
            yield self._xij[batch_ids], self._i_idx[batch_ids], self._j_idx[batch_ids]

dataset = GloveDataset(open("text8").read(), 10000000)
```

```
# of words: 10000000
Vocabulary length: 189075
Wall time: 2min 8s
```

Here we create our GloVe model class. In its forward pass it computes the term w_i·w_j + b_i + b_j of the GloVe loss function described in the original paper:
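For reference, the GloVe objective from the original paper is:

```
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{T} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}
```

The forward pass computes the w_i·w̃_j + b_i + b̃_j part; the weighting f(X_ij) and the log X_ij target enter through the loss function defined later.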

```
EMBED_DIM = 300

class GloveModel(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(GloveModel, self).__init__()
        self.wi = nn.Embedding(num_embeddings, embedding_dim)
        self.wj = nn.Embedding(num_embeddings, embedding_dim)
        self.bi = nn.Embedding(num_embeddings, 1)
        self.bj = nn.Embedding(num_embeddings, 1)

        self.wi.weight.data.uniform_(-1, 1)
        self.wj.weight.data.uniform_(-1, 1)
        self.bi.weight.data.zero_()
        self.bj.weight.data.zero_()

    def forward(self, i_indices, j_indices):
        w_i = self.wi(i_indices)
        w_j = self.wj(j_indices)
        b_i = self.bi(i_indices).squeeze()
        b_j = self.bj(j_indices).squeeze()

        x = torch.sum(w_i * w_j, dim=1) + b_i + b_j

        return x

glove = GloveModel(dataset._vocab_len, EMBED_DIM)
glove.cuda()
```

We must define a function to compute the weighting term f(Xij) of the loss function, as described in the paper:
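For reference, the weighting function defined in the paper is:

```
f(x) =
\begin{cases}
(x / x_{max})^{\alpha} & \text{if } x < x_{max} \\
1 & \text{otherwise}
\end{cases}
```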

```
def weight_func(x, x_max, alpha):
    wx = (x/x_max)**alpha
    wx = torch.min(wx, torch.ones_like(wx))
    return wx.cuda()
```

The loss function described in the GloVe paper is a weighted mean squared error. PyTorch 1.0 doesn't ship an implementation of it, so we must write it ourselves. A good practice is to reuse any already-implemented piece of this function, so we take advantage of any optimizations it might have:

```
def wmse_loss(weights, inputs, targets):
    loss = weights * F.mse_loss(inputs, targets, reduction='none')
    return torch.mean(loss).cuda()
```

Although we are using a different configuration than the original paper (such as the dataset), we will use the same optimizer and learning rate it describes.

```
optimizer = optim.Adagrad(glove.parameters(), lr=0.05)
```

Now we can write our training loop. The ALPHA and X_MAX parameters are set according to the paper. We also print progress every 100 batches and save the model state at the end of each epoch.

```
N_EPOCHS = 100
BATCH_SIZE = 2048
X_MAX = 100
ALPHA = 0.75

n_batches = int(len(dataset._xij) / BATCH_SIZE)
loss_values = list()

for e in range(1, N_EPOCHS + 1):
    batch_i = 0

    for x_ij, i_idx, j_idx in dataset.get_batches(BATCH_SIZE):
        batch_i += 1

        optimizer.zero_grad()

        outputs = glove(i_idx, j_idx)
        weights_x = weight_func(x_ij, X_MAX, ALPHA)
        loss = wmse_loss(weights_x, outputs, torch.log(x_ij))

        loss.backward()
        optimizer.step()

        loss_values.append(loss.item())

        if batch_i % 100 == 0:
            print("Epoch: {}/{} \t Batch: {}/{} \t Loss: {}".format(e, N_EPOCHS, batch_i, n_batches, np.mean(loss_values[-20:])))

    print("Saving model...")
    torch.save(glove.state_dict(), "text8.pt")
```

```
Epoch: 1/100 Batch: 100/10726 Loss: 1.1235822647809983
Epoch: 1/100 Batch: 200/10726 Loss: 1.0464201807975768
Epoch: 1/100 Batch: 300/10726 Loss: 1.0292260229587555
Epoch: 1/100 Batch: 400/10726 Loss: 0.9683106660842895
Epoch: 1/100 Batch: 500/10726 Loss: 0.9407412618398666
Epoch: 1/100 Batch: 600/10726 Loss: 0.9253258764743805
Epoch: 1/100 Batch: 700/10726 Loss: 0.922967490553855
...
```

```
plt.plot(loss_values)
```

Here we sum the two embedding matrices (as recommended in the original paper) to improve results. We then plot the t-SNE projection of the top 300 words to validate our word embeddings.

```
emb_i = glove.wi.weight.cpu().data.numpy()
emb_j = glove.wj.weight.cpu().data.numpy()
emb = emb_i + emb_j

top_k = 300
tsne = TSNE(metric='cosine', random_state=123)
embed_tsne = tsne.fit_transform(emb[:top_k, :])

fig, ax = plt.subplots(figsize=(14, 14))
for idx in range(top_k):
    plt.scatter(*embed_tsne[idx, :], color='steelblue')
    plt.annotate(dataset._id2word[idx], (embed_tsne[idx, 0], embed_tsne[idx, 1]), alpha=0.7)
```

Checking the words that land close together, we can say that our model performs pretty well! It clusters the direction words north, south, west, east and even central. It also clusters words together with their plural forms, like system/systems and language/languages.

And that's it. I hope you enjoyed this implementation. If you have any questions or comments, please leave them below; I will be happy to answer!

Here is a great resource for understanding the skip-gram model.

We will divide this post into four parts:

- Loading and preparing dataset
- Creating dataset tuples
- Creating model
- Training it

For our task of creating word vectors we will use the movie plot descriptions from Wikipedia, available at https://www.kaggle.com/jrobischon/wikipedia-movie-plots . We will use the following code:

```
from string import punctuation
import pandas as pd
df = pd.read_csv("data/wiki_movie_plots_deduped.csv")
clear_punct_regex = "[" + punctuation + r"\d\r\n]"
corpus = df['Plot'].str.replace(clear_punct_regex, "", regex=True).str.lower()
corpus = " ".join(corpus)
open("corpus2.txt", "w", encoding="utf8").write(corpus)
```

First we import pandas for parsing the csv file, and then the punctuation variable, which holds common punctuation characters.

```
>> punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
```

We load the movie plot data with read_csv; build the regex pattern used to remove punctuation from the text, as well as the control characters \n (new line) and \r (carriage return, also used for new lines) and any digits (\d); apply the regex, replacing every match with an empty string; join all the movie plot rows into one long string; and finally write this cleaned string to a .txt file.
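To make the cleaning step concrete, here is the same pattern applied to a short made-up sentence (the sample string is just an illustration):

```
import re
from string import punctuation

clear_punct_regex = "[" + punctuation + r"\d\r\n]"

# Punctuation, digits and line breaks are removed, then the text is lowercased
cleaned = re.sub(clear_punct_regex, "", "Hello, World! 123\n").lower()
print(repr(cleaned))  # 'hello world '
```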

We are done with data loading and cleaning.

```
corpus = open("data/corpus.txt", encoding="utf8").readlines()
corpus = " ".join(corpus).replace("\n", "")
corpus = corpus.split(" ")
```

After creating the corpus file, we load it and remove the line termination symbols (\n). We then split it into a list of tokens.

```
from collections import Counter
vocab_cnt = Counter()
vocab_cnt.update(corpus)
vocab_cnt = Counter({w:c for w,c in vocab_cnt.items() if c > 5})
```

Next we count the number of occurrences of each word and remove those that occur five times or fewer.

```
import numpy as np
import random
vocab = set()
unigram_dist = list()
word2id = dict()
for i, (w, c) in enumerate(vocab_cnt.most_common()):
    vocab.add(w)
    unigram_dist.append(c)
    word2id[w] = i
unigram_dist = np.array(unigram_dist)
word_freq = unigram_dist / unigram_dist.sum()
#Generate word frequencies to use with negative sampling
w_freq_neg_samp = unigram_dist ** 0.75
w_freq_neg_samp /= w_freq_neg_samp.sum() #normalize
#Get words drop prob
w_drop_p = 1 - np.sqrt(0.00001/word_freq)
#Generate train corpus dropping common words
train_corpus = [w for w in corpus if w in vocab and random.random() > w_drop_p[word2id[w]]]
```

In the above code we do several things:

- Create a vocab variable to store all the words that appear in our corpus;
- Create a unigram_dist variable to accumulate the number of occurrences of each word;
- Create a word2id dict that will help us encode our words into integer ids;
- Create w_freq_neg_samp, a probability distribution for sampling words in the negative sampling step;
- Create w_drop_p, a per-word drop probability used to subsample very frequent words from our training data;
- Create our training data, discarding some words based on w_drop_p.
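To see how the subsampling formula behaves, here is a small numeric sketch with made-up word frequencies (the threshold 0.00001 is the same one used above):

```
import numpy as np

# Hypothetical relative frequencies: a very common word, a common word, a rare word
word_freq = np.array([0.05, 0.001, 0.00001])

w_drop_p = 1 - np.sqrt(0.00001 / word_freq)
print(w_drop_p)  # roughly [0.986, 0.9, 0.0] -- frequent words are dropped most often
```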

```
import torch

USE_CUDA = torch.cuda.is_available()

#Generate dataset
dataset = list()
window_size = 5
for i, w in enumerate(train_corpus):
    window_start = max(i - window_size, 0)
    window_end = i + window_size + 1  #+1 so the window is symmetric around position i
    for c in train_corpus[window_start:window_end]:
        if c != w:
            dataset.append((word2id[w], word2id[c]))

dataset = torch.LongTensor(dataset)
if USE_CUDA:
    dataset = dataset.cuda()
```

In the above snippet we create the tuples of word and context word that will be used to train our model. We convert them into torch tensors and move them to the GPU if one is available.
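To illustrate what these tuples look like, here is the same windowing logic on a tiny made-up token list (shown with the words themselves instead of ids, and using a position check rather than a word-identity check):

```
tokens = ["the", "dog", "chased", "the", "cat"]
window_size = 2

pairs = []
for i, w in enumerate(tokens):
    start = max(i - window_size, 0)
    # enumerate with an offset so j is the absolute position of the context word
    for j, c in enumerate(tokens[start:i + window_size + 1], start):
        if j != i:
            pairs.append((w, c))

print(pairs[:4])  # [('the', 'dog'), ('the', 'chased'), ('dog', 'the'), ('dog', 'chased')]
```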

```
import torch
from torch import nn, optim
import torch.nn.functional as F

VOCAB_SIZE = len(word2id)
EMBED_DIM = 128

class Word2Vec(nn.Module):
    def __init__(self, vocabulary_size, embedding_dimension, sparse_grad=False):
        super(Word2Vec, self).__init__()
        #Sparse gradients do not work with momentum
        self.embed_in = nn.Embedding(vocabulary_size, embedding_dimension, sparse=sparse_grad)
        self.embed_out = nn.Embedding(vocabulary_size, embedding_dimension, sparse=sparse_grad)

        self.embed_in.weight.data.uniform_(-1, 1)
        self.embed_out.weight.data.uniform_(-1, 1)

    def neg_samp_loss(self, in_idx, pos_out_idx, neg_out_idxs):
        emb_in = self.embed_in(in_idx)
        emb_out = self.embed_out(pos_out_idx)

        #Perform dot product between the two embeddings by element-wise mult
        pos_loss = torch.mul(emb_in, emb_out)
        pos_loss = torch.sum(pos_loss, dim=1) #and sum the row values
        pos_loss = F.logsigmoid(pos_loss)

        neg_emb_out = self.embed_out(neg_out_idxs)
        #Here we must expand dimension for the input embedding in order to perform
        #a batch matrix-matrix multiplication with the negative embeddings
        neg_loss = torch.bmm(-neg_emb_out, emb_in.unsqueeze(2)).squeeze()
        neg_loss = F.logsigmoid(neg_loss)
        neg_loss = torch.sum(neg_loss, dim=1)

        total_loss = torch.mean(pos_loss + neg_loss)
        return -total_loss

    def forward(self, indices):
        return self.embed_in(indices)

w2v = Word2Vec(VOCAB_SIZE, EMBED_DIM, False)
if USE_CUDA:
    w2v.cuda()
```

In the above class we define our pytorch model. It is composed of two embedding lookup tables with uniform weight initialization.

To train our embeddings we are going to use negative sampling. In the function neg_samp_loss we compute the negative-sampling objective: the positive term, log σ(u_c · v_w), is computed by the pos_loss steps, and the negative term, the sum of log σ(−u_n · v_w) over the sampled negative words, is computed by the neg_loss steps. The loss returned is the negated mean of their sum over the batch.
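The same objective can be checked with plain NumPy on made-up toy vectors (all values here are hypothetical):

```
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

v_w = np.array([0.5, -0.2])                   # input (center) embedding
u_pos = np.array([0.4, 0.1])                  # positive context embedding
u_negs = np.array([[0.3, 0.9], [-0.7, 0.2]])  # two negative samples

pos_term = np.log(sigmoid(v_w @ u_pos))                  # log sigma(u_pos . v_w)
neg_term = np.sum(np.log(sigmoid(-(u_negs @ v_w))))      # sum of log sigma(-u_neg . v_w)

loss = -(pos_term + neg_term)
print(loss > 0)  # the negative log-likelihood is a positive scalar
```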

```
def get_negative_samples(batch_size, n_samples):
    neg_samples = np.random.choice(len(vocab), size=(batch_size, n_samples), replace=False, p=w_freq_neg_samp)
    if USE_CUDA:
        return torch.LongTensor(neg_samples).cuda()
    return torch.LongTensor(neg_samples)
```

Here we define our function to generate negative targets to be used in our objective during training.

```
optimizer = optim.Adam(w2v.parameters(), lr=0.003)
```

Here we just define the optimizer that will perform our weight updates.

```
def get_batches(dataset, batch_size):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i+batch_size]
```

This function is used to generate our batches during the training loop.

```
import time

n_epochs = 5
n_neg_samples = 5
batch_size = 512

for epoch in range(n_epochs): # loop over the dataset multiple times
    loss_values = []
    start_t = time.time()
    for dp in get_batches(dataset, batch_size):
        optimizer.zero_grad() # zero the parameter gradients
        inputs, labels = dp[:, 0], dp[:, 1]
        loss = w2v.neg_samp_loss(inputs, labels, get_negative_samples(len(inputs), n_neg_samples))
        loss.backward()
        optimizer.step()
        loss_values.append(loss.item())
    elapsed_t = time.time() - start_t
    print("{}/{}\tLoss: {}\tElapsed time: {}".format(epoch + 1, n_epochs, np.mean(loss_values), elapsed_t))
print('Done')
```

```
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

data_viz_len = 300
viz_embedding = w2v.embed_in.weight.data.cpu().numpy()[:data_viz_len]

tsne = TSNE()
embed_tsne = tsne.fit_transform(viz_embedding)

#vocab is a set and cannot be sliced, so we iterate word2id
#and keep the words with the lowest (most frequent) ids
plt.figure(figsize=(16, 16))
for w, w_id in word2id.items():
    if w_id >= data_viz_len:
        continue
    plt.scatter(embed_tsne[w_id, 0], embed_tsne[w_id, 1])
    plt.annotate(w, (embed_tsne[w_id, 0], embed_tsne[w_id, 1]), alpha=0.7)
```

If you have any questions, please ask them below, I will be happy to answer.

Larger companies have already been using AI for some time, mainly because they can afford world-class scientists to get things done. But as the knowledge spreads, more and more non-scientists become familiar with the area and become able to make things happen.

Initiatives like OpenAI and DeepMind aim to keep advancing AI technology while keeping their studies and reports open for other people to learn from. This is very important for the world as a whole: the more people with knowledge in this area, the more regions, especially poor ones, get access to the technology and its amazing benefits.

Here are 8 ways you can make some impact and some profit working with AI (I will try to go from the easiest to the hardest):

To make things easy, a lot of companies create easy-to-deploy AI products, in the form of wizards or APIs (such as chatbots), that you can use without knowing how they work internally. But even these products may need the help of someone capable of setting things up: configuring them, feeding them information, or setting up a server. This is a good way for a non-AI person to get into the area.

Artificial intelligence has been around for years, and more and more people are getting interested in it every day. The fast advancements occurring in the field often create gaps in learning content (especially in non-English languages). There are some great companies helping to fill these gaps, like Udacity and Coursera (co-founded by the AI guru Andrew Ng), but there is always room for people willing to teach new things in different ways, using platforms like Udemy.

There are online data science competitions (like Kaggle) where people compete to create the model that performs best on a given task, like predicting house prices or classifying images. These competitions often involve companies' real problems, and besides being a great resource for education and research, some of them also offer rewards of thousands of dollars for the best models. Thus, it is another awesome way to earn some money working with AI.

You can create and sell AI models that help solve problems, just like people create and sell software and websites. You can create general models that work across a large range of things (such as an image classifier that says which animal is shown, or a model that says whether the sentiment of a phrase is negative or positive), or personalized ones that work in a more restricted domain (like a classifier of dog breeds, or a classifier that says whether or not people are talking about your products in a newspaper).

The scale cloud computing has reached has made it possible to deploy much more than a static website on online servers. Here you will be doing the same thing as above, but instead of selling the model to a company, you deploy your software on a cloud server and charge for its usage monthly or on demand. A lot of companies use this model, from not-so-big ones like imagga to giants such as IBM and Google.

You may use an existing service, or use modern data science and machine learning techniques, to develop a predictive analysis tool that helps you determine when the price of a stock (or a cryptocurrency) will rise or fall, so you can decide whether to buy or sell and make some money from it. Actually, together with high-frequency bots, this is how people who act in this field are working today.

You may also create or set up AI software to assist you in creating something else that has nothing to do with AI. Generative tools are a good example: the DreamCatcher project by Autodesk uses artificial intelligence to help engineers improve their designs. Also, the Magenta project by Google aims to create art, such as drawings and music, and may be an awesome tool for modern artists.

Probably the most obvious one: creating an end-user product with AI as one of its core values may be a great deal. Self-driving cars, machine translation and speech recognition, for example, are heavily dependent on AI advancements, but simpler things like chatbots, document analyzers, security systems based on image recognition, image describers for blind people or cleaning robots are also great to work with.

Artificial intelligence, despite all the advancements of recent years and the awesome products that giants and startups are creating, is just taking its first steps, and there is so much left to do and create. Joining this wave now may be a great opportunity to develop great things, earn money and contribute to society's progress.

Check the other parts: Part1 Part2 Part3

The code for this implementation is at https://github.com/iolucas/nlpython/blob/master/blog/sentiment-analysis-analysis/neural-networks.ipynb

We will use two machine learning libraries:

- scikit-learn to create onehot vectors from our text and split the dataset into train, test and validation;
- tensorflow to create the neural network and train it.

Our dataset is composed of movie **reviews** and **labels** telling whether the review is negative or positive. Let’s load the dataset:

The reviews file is a little big, so it comes zipped. Let's extract it with the zipfile module:

```
import zipfile
with zipfile.ZipFile("reviews.zip", 'r') as zip_ref:
    zip_ref.extractall(".")
```

Now that we have the **reviews.txt** and **labels.txt** files, we load them to the memory:

```
with open("reviews.txt") as f:
    reviews = f.read().split("\n")

with open("labels.txt") as f:
    labels = f.read().split("\n")

reviews_tokens = [review.split() for review in reviews]
```

Next we transform our review inputs into binary vectors with the help of the MultiLabelBinarizer class:

```
from sklearn.preprocessing import MultiLabelBinarizer
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)
```

After that we split the data into training and test set with the train_test_split function. We then split the test set in half to generate a validation set:

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.4, random_state=None)
split_point = int(len(X_test)/2)
X_valid, y_valid = X_test[split_point:], y_test[split_point:]
X_test, y_test = X_test[:split_point], y_test[:split_point]
```

We then define two functions: **label2bool**, to convert the string label into a binary vector of two elements, and **get_batch**, a generator that returns chunks of the dataset at each iteration:

```
def label2bool(labels):
    return [[1,0] if label == "positive" else [0,1] for label in labels]

def get_batch(X, y, batch_size):
    for batch_pos in range(0, len(X), batch_size):
        yield X[batch_pos:batch_pos+batch_size], y[batch_pos:batch_pos+batch_size]
```

Tensorflow connects expressions in structures called **graphs**. We first clear any existing graph, then get the vocabulary length and declare the placeholders that will be used to input our text data and labels:

```
import tensorflow as tf

tf.reset_default_graph()
vocab_len = len(onehot_enc.classes_)
inputs_ = tf.placeholder(dtype=tf.float32, shape=[None, vocab_len], name="inputs")
targets_ = tf.placeholder(dtype=tf.float32, shape=[None, 2], name="targets")
```

This post does not intend to be a tensorflow tutorial, for more details visit https://www.tensorflow.org/get_started/

We then create our neural network:

- **h1** is the hidden layer that receives the text word vectors as input;
- **logits** is the final layer that receives **h1** as input;
- **output** is the result of applying the sigmoid function to the **logits**;
- **loss** is the loss expression that calculates the current error of the neural network;
- **optimizer** is the expression that adjusts the weights of the neural network in order to reduce the loss;
- **correct_pred** and **accuracy** are used to calculate the current accuracy of the neural network, ranging from 0 to 1.

```
h1 = tf.layers.dense(inputs_, 500, activation=tf.nn.relu)
logits = tf.layers.dense(h1, 2, activation=None)
output = tf.nn.sigmoid(logits)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=targets_))
optimizer = tf.train.AdamOptimizer(0.001).minimize(loss)
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(targets_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')
```

We then train the network, periodically printing its current accuracy and loss:

```
epochs = 10
batch_size = 3000

sess = tf.Session()

# Initializing the variables
sess.run(tf.global_variables_initializer())

for epoch in range(epochs):
    for X_batch, y_batch in get_batch(onehot_enc.transform(X_train), label2bool(y_train), batch_size):
        loss_value, _ = sess.run([loss, optimizer], feed_dict={
            inputs_: X_batch,
            targets_: y_batch
        })
    print("Epoch: {} \t Training loss: {}".format(epoch, loss_value))

    acc = sess.run(accuracy, feed_dict={
        inputs_: onehot_enc.transform(X_valid),
        targets_: label2bool(y_valid)
    })
    print("Epoch: {} \t Validation Accuracy: {}".format(epoch, acc))

test_acc = sess.run(accuracy, feed_dict={
    inputs_: onehot_enc.transform(X_test),
    targets_: label2bool(y_test)
})
print("Test Accuracy: {}".format(test_acc))
```

With this network we got an accuracy of **90%**! With more data and using a bigger network we can improve this result even further!

Please leave any questions and comments below!

Check the other parts: Part1 Part2 Part3

The code for this implementation is at https://github.com/iolucas/nlpython/blob/master/blog/sentiment-analysis-analysis/svm.ipynb

This classifier works by trying to find a line that divides the dataset while leaving as large a margin as possible from the closest points, called **support vectors**. As per the figure below, line **A** has a larger margin than line **B**: the points divided by line A would have to travel much farther to cross the division than if the data were divided by **B**, so in this case we would choose line **A**.
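As a tiny sanity check of this idea, a LinearSVC fitted on two well-separated 1-D groups (made-up numbers) places the boundary between them:

```
from sklearn.svm import LinearSVC

# Hypothetical 1-D points: three "negative" on the left, three "positive" on the right
X = [[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]]
y = ["negative"] * 3 + ["positive"] * 3

clf = LinearSVC()
clf.fit(X, y)
print(clf.predict([[1.5], [9.5]]))  # ['negative' 'positive']
```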

For this task we will use scikit-learn, an open source machine learning library.

Our dataset is composed of movie **reviews** and **labels** telling whether the review is negative or positive. Let’s load the dataset:

The reviews file is a little big, so it is in zip format. Let's extract it:

```
import zipfile
with zipfile.ZipFile("reviews.zip", 'r') as zip_ref:
    zip_ref.extractall(".")
```

Now that we have the **reviews.txt** and **labels.txt** files, we load them to the memory:

```
with open("reviews.txt") as f:
    reviews = f.read().split("\n")

with open("labels.txt") as f:
    labels = f.read().split("\n")

reviews_tokens = [review.split() for review in reviews]
```

Next we transform our review inputs into binary vectors with the help of the MultiLabelBinarizer class:

```
from sklearn.preprocessing import MultiLabelBinarizer
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)
```

After that we split the data into training and test set with the train_test_split function:

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.25, random_state=None)
```

We then create our SVM classifier with the class LinearSVC and train it:

```
from sklearn.svm import LinearSVC
lsvm = LinearSVC()
lsvm.fit(onehot_enc.transform(X_train), y_train)
```

Training the model took about 2 seconds.

After training, we use the score function to check the performance of the classifier:

```
score = lsvm.score(onehot_enc.transform(X_test), y_test)
```

Computing the score took about 1 second only!

Running the classifier a few times, we get around **85%** accuracy, basically the same as the result of the naive bayes classifier.

If you have any questions or comments, leave them below!

Check the other parts: Part1 Part2 Part3

The code for this implementation is at https://github.com/iolucas/nlpython/blob/master/blog/sentiment-analysis-analysis/naive-bayes.ipynb

The Naive Bayes classifier uses Bayes' Theorem, which for our problem says that the probability of the **label** (positive or negative) given the **text** is equal to the probability of finding this **text** given the **label**, times the probability that the **label** occurs, everything divided by the probability of finding this **text**:
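Written out, Bayes' theorem for our problem reads:

```
P(\text{label} \mid \text{text}) = \frac{P(\text{text} \mid \text{label}) \, P(\text{label})}{P(\text{text})}
```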

Since the text is composed of words, we can say:

We want to compare the probabilities of the labels and choose the one with the higher probability. Since the term P(word1, word2, word3…) is the same for every label, we can remove it. Assuming that there is no dependence between words in the text (which can cause some errors, because some words only “work” together with others), we have:
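Written out, dropping the shared denominator and applying the independence assumption, the quantity we compare for each label is:

```
P(\text{label} \mid w_1, \dots, w_n) \propto P(\text{label}) \prod_{i=1}^{n} P(w_i \mid \text{label})
```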

So we are done! With a training set we can find every term of the equation, for example:

- **P(label=positive)** is the fraction of the training set that is **positive** text;
- **P(word1|label=negative)** is the fraction of **negative** texts in which **word1** appears.
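A tiny made-up example of estimating these terms from a training set (the four reviews are hypothetical):

```
# Four hypothetical labeled reviews
texts = [("good movie", "positive"), ("bad movie", "negative"),
         ("good fun", "positive"), ("bad plot", "negative")]

# P(label=positive): fraction of the training set labeled positive
p_positive = sum(1 for _, label in texts if label == "positive") / len(texts)

# P("bad" | label=negative): fraction of negative texts containing "bad"
neg_texts = [t for t, label in texts if label == "negative"]
p_bad_given_neg = sum(1 for t in neg_texts if "bad" in t.split()) / len(neg_texts)

print(p_positive, p_bad_given_neg)  # 0.5 1.0
```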

For this task we will use a famous open source machine learning library, the scikit-learn .

Our dataset is composed of movie **reviews** and **labels** telling whether the review is negative or positive. Let’s load the dataset:

The reviews file is a little big, so it is in zip format. Let's extract it:

```
import zipfile

with zipfile.ZipFile("reviews.zip", 'r') as zip_ref:
    zip_ref.extractall(".")
```

Now that we have the **reviews.txt** and **labels.txt** files, we load them to the memory:

```
with open("reviews.txt") as f:
    reviews = f.read().split("\n")

with open("labels.txt") as f:
    labels = f.read().split("\n")

reviews_tokens = [review.split() for review in reviews]
```

Next we transform our review inputs into binary vectors with the help of the MultiLabelBinarizer class:

```
from sklearn.preprocessing import MultiLabelBinarizer
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)
```

After that we split the data into training and test set with the train_test_split function:

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.25, random_state=None)
```

Next, we create a Naive Bayes classifier and train it on our data. We will use a Bernoulli Naive Bayes classifier, which is appropriate for feature vectors composed of binary data. We do this with the BernoulliNB class:

```
from sklearn.naive_bayes import BernoulliNB
bnbc = BernoulliNB(binarize=None)
bnbc.fit(onehot_enc.transform(X_train), y_train)
```

Training the model took about **1** second only!

After training, we use the **score** function to check the performance of the classifier:

```
score = bnbc.score(onehot_enc.transform(X_test), y_test)
```

Computing the score took about **0.4** seconds only!

Running the classifier a few times, we get around **85%** accuracy. Not bad for such a simple classifier.

If you have any questions or comments, please leave them below!

With matrices at the core of neural networks and deep learning, I will try to explain in this short text why these things are useful. Why do we need symbols for numbers?

Instead of using the symbols 1, 2, 3, and so on, we could count stuff using dots or any repeating pattern, like 1 = *, 2 = **, 3 = *** etc. It's obvious why this is not a good idea: with countable things on the order of hundreds or thousands, that wouldn't be suitable or reasonable.

We can understand functions roughly as rules that take one or more values and return another one, like the function x², which squares every value it takes. Functions are often denoted by f(x), so why do we use this notation instead of the rule itself? There are a few reasons, but one of them is similar to why we use symbols to represent numbers instead of repeating patterns: often the function is so big that writing and dealing with the giant thing every time would be time-consuming and impractical.

Working with matrices is a way to deal with a lot of numbers at the same time in a reduced space and practical way. Suppose we have a system of equations:

We can see that it has a repeating pattern: in every equation we have values multiplied by x, y and z. With matrices we have a way to write this system without repeating those variables, in a much more elegant way:
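As a made-up illustration of the idea (the numbers here are arbitrary), a system such as

```
\begin{aligned}
2x + 3y - z &= 5 \\
x - y + 4z &= 2 \\
3x + y + z &= 7
\end{aligned}
```

can be written compactly as

```
\begin{bmatrix} 2 & 3 & -1 \\ 1 & -1 & 4 \\ 3 & 1 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
=
\begin{bmatrix} 5 \\ 2 \\ 7 \end{bmatrix}
```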

With this, adding one more equation to the system is a matter of adding numbers to the first and the last matrix, without touching the **[x y z]** matrix, since those keep repeating. (Those who have done endless exercise lists of equation systems know how boring it is to keep writing **x, y, z** over and over and over, hehe)

Neural networks often use a huge number of values and operations, and writing down every one of them would be impractical, so we need a way to compact things as much as possible to make them manageable. Matrices help us with that.

Of course, these are not the only reasons to use function notations and matrices:

- With these we can write theorems and equations that generalize to any rule or any quantity of values.
- Another nice reason is that matrices are cpu/gpu friendly; computers take advantage of matrices to speed up processing their expressions.

A great resource to learn about matrices and linear algebra is this series of video from the youtube channel 3blue1brown:

If you have questions or comments, please leave them below!
