Sunday, December 2, 2018

Sentiment Prediction with RNN, Pytorch

Sentiment analysis is a system that helps people determine the sentiment of a piece of text, i.e. whether a sentence is positive or negative. In this post we will walk through one way to do sentiment analysis, step by step, from pre-processing the data all the way to using the trained model.

>>>SOURCE CODE<<<

1. Data pre-processing

The first thing to do is prepare the data so the model can learn from it.

a. Load data
To work with the data, we first need to load all of it into memory.
import numpy as np

# read data from text files
with open('data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('data/labels.txt', 'r') as f:
    labels = f.read()

b. Cleaning and tokenizing the data
Cleaning the data is a standard part of pre-processing for machine learning; it makes the data easier to work with. In the steps below we will:
- convert every word to lowercase
- remove punctuation
- split the text into individual reviews on '\n'
- collect every unique word in the dataset

from string import punctuation

print(punctuation)

# get rid of punctuation
reviews = reviews.lower() # lowercase, standardize
all_text = ''.join([c for c in reviews if c not in punctuation])

# split by new lines and spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)

# create a list of words
words = all_text.split()

c. Encode the data
At this stage we convert the dataset into a form the computer can work with. Computers only understand numbers and how to compute with them; they do not understand the meaning of human sentences, so we have to turn the data into numbers.

Each word is mapped to a unique integer, and each review is then converted into its numeric representation:
# feel free to use this import 
from collections import Counter

## Build a dictionary that maps words to integers
counters = Counter(words)
vocab = sorted(counters, key=counters.get, reverse=True)
vocab_to_int = {word:intgr for intgr, word in enumerate(vocab, 1) }
# print(vocab_to_int)
## use the dict to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])
# print(reviews_ints)
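
d. Encode the labels
The outlier-removal step below also needs encoded_labels. As a minimal sketch, assuming data/labels.txt stores one 'positive' or 'negative' label per review separated by '\n' (this layout is an assumption, it is not shown above), the labels can be encoded as 1 and 0:

# Assumption: labels.txt holds newline-separated 'positive'/'negative' strings,
# one per review. Encode them as 1 (positive) and 0 (negative).
labels_split = labels.split('\n')
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])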



e. Remove outliers
Poor training data, including outliers, will hurt the resulting model. In this case an outlier is a review whose word count is too short or too long, so we have to deal with these first. The steps here are:
- remove reviews of zero length
- remove the labels whose corresponding reviews have zero length

print('Number of reviews before removing outliers: ', len(reviews_ints))

## remove any reviews/labels with zero length from the reviews_ints list.
non_zero_index = [ii for ii, review in enumerate(reviews_ints) if len(review)!=0]

reviews_ints = [reviews_ints[ii] for ii in non_zero_index]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_index])

print('Number of reviews after removing outliers: ', len(reviews_ints))


f. Format the data
The reviews in the dataset will never all be the same length; some are very long and some are quite short. The zero-length outliers were removed in the previous step, but we still have to handle reviews that are very long. To do that we will:
- choose a sequence length (in this case 200 works well)
- truncate any review longer than the maximum sequence length
- left-pad with zeros any review shorter than the sequence length

def pad_features(reviews_ints, seq_length):
    ''' Return features of review_ints, where each review is padded with 0's 
        or truncated to the input seq_length.
    '''
    ## implement function
    
    features=np.zeros((len(reviews_ints), seq_length), dtype=int)
    
    for i, review in enumerate(reviews_ints):
#         print(i, -len(review))
        features[i, -len(review):] = np.array(review)[:seq_length]
        
    return features


# Test your implementation!

seq_length = 200

features = pad_features(reviews_ints, seq_length=seq_length)

## test statements - do not change - ##
assert len(features)==len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# print first 10 values of the first 30 reviews
print(features[:30,:10])

g. Split the data into training, validation, and test sets
We have reached the last pre-processing step: splitting the data into training, validation, and test sets. We will split the data as follows:
- training data: 80%
- validation data: 10%
- test data: 10%

Make sure the validation set is large enough to be meaningful during training. Too little validation data makes the validation loss noisy and uninformative.

split_frac = 0.8
split_frac_test = 0.5
## split data into training, validation, and test data (features and labels, x and y)
split_idx = int(len(features)*split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x)*split_frac_test)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape),
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))


2. Prepare for training

a. DataLoaders and batching
This step packages the data so the framework can read it properly. The steps are:
- choose the batch size you want (50 in this case)
- wrap the data in a TensorDataset (for PyTorch)
- wrap each resulting dataset in a DataLoader (for PyTorch)

import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 50

# make sure to SHUFFLE your training data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
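
As a quick sanity check (an extra snippet, not strictly required), we can grab one batch from train_loader and confirm that its shape is batch_size x seq_length:

# obtain one batch of training data and inspect its shape
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size())   # expected: torch.Size([50, 200])
print('Sample label size: ', sample_y.size())   # expected: torch.Size([50])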

b. Building the RNN architecture
Now we get to the core of the work: defining the architecture of the RNN we want. For advice on designing an architecture that produces a good model, Andrej Karpathy has some useful guidance; you can read it in his post here, or in the copy I made in another post on this blog.
The architecture used in this case is:
- Algorithm: LSTM
- Dropout: 0.3
- Fully connected layer with a sigmoid activation

# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')


import torch.nn as nn

class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        # dropout layer
        self.dropout = nn.Dropout(0.3)
        
        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)

        # embeddings and lstm_out
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
    
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)
        
        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] # get last batch of labels
        
        # return last sigmoid output and hidden state
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden

Next we instantiate the model with its hyperparameters (output size, hidden dimension, and so on):
# Instantiate the model w/ hyperparams
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding + our word tokens
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

SentimentRNN(
  (embedding): Embedding(74073, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)

c. Setting the training hyperparameters
The parameters we will configure are:

lr: Learning rate for our optimizer.
epochs: Number of times to iterate through the training dataset.
clip: The maximum gradient value to clip at (to prevent exploding gradients).

# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

# training params

epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
print_every = 100
clip=5 # gradient clipping



3. Training the model

a. Train

# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()
# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))


b. Testing

We can see how well the model performs by testing it on data it has never seen before.


# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(batch_size)

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, h = net(inputs, h)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

4. Using the model

# negative test review
test_review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.'



from string import punctuation

def tokenize_review(test_review):
    test_review = test_review.lower() # lowercase
    # get rid of punctuation
    test_text = ''.join([c for c in test_review if c not in punctuation])

    # splitting by spaces
    test_words = test_text.split()

    # tokens
    test_ints = []
    test_ints.append([vocab_to_int[word] for word in test_words])

    return test_ints

# test code and generate tokenized review
test_ints = tokenize_review(test_review_neg)
print(test_ints)
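
Note that tokenize_review assumes every word in the new review already appears in vocab_to_int; a word that never occurred in the training data would raise a KeyError. A minimal workaround (an extra helper, not part of the code above) is to simply skip unknown words:

# Hedged variant: drop any word that is not in the training vocabulary
def tokenize_review_safe(test_review):
    test_review = test_review.lower()  # lowercase
    test_text = ''.join([c for c in test_review if c not in punctuation])  # strip punctuation
    test_words = test_text.split()
    # keep only words the model has seen during training
    return [[vocab_to_int[word] for word in test_words if word in vocab_to_int]]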


# test sequence padding
seq_length=200
features = pad_features(test_ints, seq_length)

print(features)


# test conversion to tensor and pass into your model
feature_tensor = torch.from_numpy(features)
print(feature_tensor.size())

def predict(net, test_review, sequence_length=200):
    
    net.eval()
    
    # tokenize review
    test_ints = tokenize_review(test_review)
    
    # pad tokenized sequence
    seq_length=sequence_length
    features = pad_features(test_ints, seq_length)
    
    # convert to tensor to pass into your model
    feature_tensor = torch.from_numpy(features)
    
    batch_size = feature_tensor.size(0)
    
    # initialize hidden state
    h = net.init_hidden(batch_size)
    
    if(train_on_gpu):
        feature_tensor = feature_tensor.cuda()
    
    # get the output from the model
    output, h = net(feature_tensor, h)
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze()) 
    # printing output value, before rounding
    print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))
    
    # print custom response
    if(pred.item()==1):
        print("Positive review detected!")
    else:
        print("Negative review detected.")
        

# positive test review
test_review_pos = 'This movie had the best acting and the dialogue was so good. I loved it.'

# call function
seq_length=200 # good to use the length that was trained on

predict(net, test_review_neg, seq_length)
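
The positive example defined above can be checked the same way; with a reasonably trained model it should come out as positive:

predict(net, test_review_pos, seq_length)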






