Browse Source

Adding exercise 2

master
Matthias Neeracher 3 years ago
parent
commit
4ee07682fd
9 changed files with 97734 additions and 0 deletions
  1. +125
    -0
      NNML2/README.txt
  2. BIN
      NNML2/data.mat
  3. +28
    -0
      NNML2/display_nearest_words.m
  4. +86
    -0
      NNML2/fprop.m
  5. +27
    -0
      NNML2/load_data.m
  6. +37
    -0
      NNML2/predict_next_word.m
  7. +97162
    -0
      NNML2/raw_sentences.txt
  8. +244
    -0
      NNML2/train.m
  9. +25
    -0
      NNML2/word_distance.m

+ 125
- 0
NNML2/README.txt View File

@@ -0,0 +1,125 @@
#######################################
Neural Networks for Machine Learning
Programming Assignment 2
Learning word representations.
#######################################

In this assignment, you will design a neural net language model that will
learn to predict the next word, given previous three words.

The data set consists of 4-grams (A 4-gram is a sequence of 4 adjacent words
in a sentence). These 4-grams were extracted from a large collection of text.
The 4-grams are chosen so that all the words involved come
from a small vocabulary of 250 words. Note that for the purposes of this
assignment special characters such as commas, full-stops, parentheses etc
are also considered words. The training set consists of 372,550 4-grams. The
validation and test sets have 46,568 4-grams each.

### GETTING STARTED. ###
Look at the file raw_sentences.txt. It contains the raw sentences from which
these 4-grams were extracted. Take a look at the kind of sentences we are
dealing with here. They are fairly simple ones.

To load the data set, go to an octave terminal and cd to the directory where the
downloaded data is located. Type

> load data.mat

This will load a struct called 'data' with 4 fields in it.
You can see them by typing

> fieldnames(data)

'data.vocab' contains the vocabulary of 250 words. Training, validation and
test sets are in 'data.trainData', 'data.validData' and 'data.testData' respectively.
To see the list of words in the vocabulary, type -

> data.vocab

'data.trainData' is a matrix of 372550 X 4. This means there are 372550
training cases and 4 words per training case. Each entry is an integer that is
the index of a word in the vocabulary. So each row represents a sequence of 4
words. 'data.validData' and 'data.testData' are also similar. They contain
46,568 4-grams each. All three need to be separated into inputs and targets
and the training set needs to be split into mini-batches. The file load_data.m
provides code for doing that. To run it type:

>[train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);

This will load the data, separate it into inputs and target, and make
mini-batches of size 100 for the training set.

train.m implements the function that trains a neural net language model.
To run the training, execute the following -

> model = train(1);

This will train the model for one epoch (one pass through the training set).
Currently, the training is not implemented and the cross entropy will not
decrease. You have to fill in parts of the code in fprop.m and train.m.
Once the code is correctly filled-in, you will see that the cross entropy
starts decreasing. At this point, try changing the hyperparameters (number
of epochs, number of hidden units, learning rates, momentum, etc) and see
what effect that has on the training and validation cross entropy. The
questions in the assignment will ask you try out specific values of these.

The training method will output a 'model' (a struct containing weights, biases
and a list of words). Now it's time to play around with the learned model
and answer the questions in the assignment.

### DESCRIPTION OF THE NETWORK. ###
The network consists of an input layer, embedding layer, hidden layer and output
layer. The input layer consists of three word indices. The same
'word_embedding_weights' are used to map each index to a distributed feature
representation. These mapped features constitute the embedding layer. This layer
is connected to the hidden layer, which in turn is connected to the output
layer. The output layer is a softmax over the 250 words.

### THINGS YOU SEE WHEN THE MODEL IS TRAINING. ###
As the model trains it prints out some numbers that tell you how well the
training is going.
(1) The model shows the average per-case cross entropy (CE) obtained
on the training set. The average CE is computed every 100 mini-batches. The
average CE over the entire training set is reported at the end of every epoch.

(2) After every 1000 mini-batches of training, the model is run on the
validation set. Recall, that the validation set consists of data that is not
used for training. It is used to see how well the model does on unseen data. The
cross entropy on validation set is reported.

(3) At the end of training, the model is run both on the validation set and on
the test set and the cross entropy on both is reported.

You are welcome to change these numbers (100 and 1000) to see the CE's more
frequently if you want to.


### SOME USEFUL FUNCTIONS. ###
These functions are meant to be used for analyzing the model after the training
is done.
display_nearest_words.m : This method will display the words closest to a
given word in the word representation space.
word_distance.m : This method will compute the distance between two given
words.
predict_next_word.m : This method will produce some predictions for the next
word given 3 previous words.
Take a look at the documentation inside these functions to see how to use them.


### THINGS TO TRY. ###
Choose some words from the vocabulary and make a list. Find the words that
the model thinks are close to words in this list (for example, find the words
closest to 'companies', 'president', 'day', 'could', etc). Do the outputs make
sense ?

Pick three words from the vocabulary that go well together (for example,
'government of united', 'city of new', 'life in the', 'he is the' etc). Use
the model to predict the next word. Does the model give sensible predictions?

Which words would you expect to be closer together than others ? For example,
'he' should be closer to 'she' than to 'federal', or 'companies' should be
closer to 'business' than 'political'. Find the distances using the model.
Do the distances that the model predicts make sense ?

You are welcome to try other things with this model and post any interesting
observations on the forums!

BIN
NNML2/data.mat View File


+ 28
- 0
NNML2/display_nearest_words.m View File

@@ -0,0 +1,28 @@
function display_nearest_words(word, model, k)
% Shows the k-nearest words to the query word.
% Inputs:
% word: The query word as a string.
% model: Model returned by the training script.
% k: The number of nearest words to display.
% Example usage:
% display_nearest_words('school', model, 10);

word_embedding_weights = model.word_embedding_weights;
vocab = model.vocab;
id = strmatch(word, vocab, 'exact');
if ~any(id)
fprintf(1, 'Word ''%s\'' not in vocabulary.\n', word);
return;
end
% Compute distance to every other word.
vocab_size = size(vocab, 2);
word_rep = word_embedding_weights(id, :);
diff = word_embedding_weights - repmat(word_rep, vocab_size, 1);
distance = sqrt(sum(diff .* diff, 2));

% Sort by distance.
[d, order] = sort(distance);
order = order(2:k+1); % The nearest word is the query word itself, skip that.
for i = 1:k
fprintf('%s %.2f\n', vocab{order(i)}, distance(order(i)));
end

+ 86
- 0
NNML2/fprop.m View File

@@ -0,0 +1,86 @@
function [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
fprop(input_batch, word_embedding_weights, embed_to_hid_weights,...
hid_to_output_weights, hid_bias, output_bias)
% This method forward propagates through a neural network.
% Inputs:
% input_batch: The input data as a matrix of size numwords X batchsize where,
% numwords is the number of words, batchsize is the number of data points.
% So, if input_batch(i, j) = k then the ith word in data point j is word
% index k of the vocabulary.
%
% word_embedding_weights: Word embedding as a matrix of size
% vocab_size X numhid1, where vocab_size is the size of the vocabulary
% numhid1 is the dimensionality of the embedding space.
%
% embed_to_hid_weights: Weights between the word embedding layer and hidden
% layer as a matrix of soze numhid1*numwords X numhid2, numhid2 is the
% number of hidden units.
%
% hid_to_output_weights: Weights between the hidden layer and output softmax
% unit as a matrix of size numhid2 X vocab_size
%
% hid_bias: Bias of the hidden layer as a matrix of size numhid2 X 1.
%
% output_bias: Bias of the output layer as a matrix of size vocab_size X 1.
%
% Outputs:
% embedding_layer_state: State of units in the embedding layer as a matrix of
% size numhid1*numwords X batchsize
%
% hidden_layer_state: State of units in the hidden layer as a matrix of size
% numhid2 X batchsize
%
% output_layer_state: State of units in the output layer as a matrix of size
% vocab_size X batchsize
%

[numwords, batchsize] = size(input_batch);
[vocab_size, numhid1] = size(word_embedding_weights);
numhid2 = size(embed_to_hid_weights, 2);

%% COMPUTE STATE OF WORD EMBEDDING LAYER.
% Look up the inputs word indices in the word_embedding_weights matrix.
embedding_layer_state = reshape(...
word_embedding_weights(reshape(input_batch, 1, []),:)',...
numhid1 * numwords, []);

%% COMPUTE STATE OF HIDDEN LAYER.
% Compute inputs to hidden units.
inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
repmat(hid_bias, 1, batchsize);

% Apply logistic activation function.
% FILL IN CODE. Replace the line below by one of the options.
% hidden_layer_state = zeros(numhid2, batchsize);
hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units));
% Options
% (a) hidden_layer_state = 1 ./ (1 + exp(inputs_to_hidden_units));
% (b) hidden_layer_state = 1 ./ (1 - exp(-inputs_to_hidden_units));
% (c) hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units));
% (d) hidden_layer_state = -1 ./ (1 + exp(-inputs_to_hidden_units));

%% COMPUTE STATE OF OUTPUT LAYER.
% Compute inputs to softmax.
% FILL IN CODE. Replace the line below by one of the options.
% inputs_to_softmax = zeros(vocab_size, batchsize);
inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, 1, batchsize);
% Options
% (a) inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, 1, batchsize);
% (b) inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, batchsize, 1);
% (c) inputs_to_softmax = hidden_layer_state * hid_to_output_weights' + repmat(output_bias, 1, batchsize);
% (d) inputs_to_softmax = hid_to_output_weights * hidden_layer_state + repmat(output_bias, batchsize, 1);

% Subtract maximum.
% Remember that adding or subtracting the same constant from each input to a
% softmax unit does not affect the outputs. Here we are subtracting maximum to
% make all inputs <= 0. This prevents overflows when computing their
% exponents.
inputs_to_softmax = inputs_to_softmax...
- repmat(max(inputs_to_softmax), vocab_size, 1);

% Compute exp.
output_layer_state = exp(inputs_to_softmax);

% Normalize to get probability distribution.
output_layer_state = output_layer_state ./ repmat(...
sum(output_layer_state, 1), vocab_size, 1);

+ 27
- 0
NNML2/load_data.m View File

@@ -0,0 +1,27 @@
function [train_input, train_target, valid_input, valid_target, test_input, test_target, vocab] = load_data(N)
% This method loads the training, validation and test set.
% It also divides the training set into mini-batches.
% Inputs:
% N: Mini-batch size.
% Outputs:
% train_input: An array of size D X N X M, where
% D: number of input dimensions (in this case, 3).
% N: size of each mini-batch (in this case, 100).
% M: number of minibatches.
% train_target: An array of size 1 X N X M.
% valid_input: An array of size D X number of points in the validation set.
% test: An array of size D X number of points in the test set.
% vocab: Vocabulary containing index to word mapping.

load data.mat;
numdims = size(data.trainData, 1);
D = numdims - 1;
M = floor(size(data.trainData, 2) / N);
train_input = reshape(data.trainData(1:D, 1:N * M), D, N, M);
train_target = reshape(data.trainData(D + 1, 1:N * M), 1, N, M);
valid_input = data.validData(1:D, :);
valid_target = data.validData(D + 1, :);
test_input = data.testData(1:D, :);
test_target = data.testData(D + 1, :);
vocab = data.vocab;
end

+ 37
- 0
NNML2/predict_next_word.m View File

@@ -0,0 +1,37 @@
function predict_next_word(word1, word2, word3, model, k)
% Predicts the next word.
% Inputs:
% word1: The first word as a string.
% word2: The second word as a string.
% word3: The third word as a string.
% model: Model returned by the training script.
% k: The k most probable predictions are shown.
% Example usage:
% predict_next_word('john', 'might', 'be', model, 3);
% predict_next_word('life', 'in', 'new', model, 3);

word_embedding_weights = model.word_embedding_weights;
vocab = model.vocab;
id1 = strmatch(word1, vocab, 'exact');
id2 = strmatch(word2, vocab, 'exact');
id3 = strmatch(word3, vocab, 'exact');
if ~any(id1)
fprintf(1, 'Word ''%s\'' not in vocabulary.\n', word1);
return;
end
if ~any(id2)
fprintf(1, 'Word ''%s\'' not in vocabulary.\n', word2);
return;
end
if ~any(id3)
fprintf(1, 'Word ''%s\'' not in vocabulary.\n', word3);
return;
end
input = [id1; id2; id3];
[embedding_layer_state, hidden_layer_state, output_layer_state] = ...
fprop(input, model.word_embedding_weights, model.embed_to_hid_weights,...
model.hid_to_output_weights, model.hid_bias, model.output_bias);
[prob, indices] = sort(output_layer_state, 'descend');
for i = 1:k
fprintf(1, '%s %s %s %s Prob: %.5f\n', word1, word2, word3, vocab{indices(i)}, prob(i));
end

+ 97162
- 0
NNML2/raw_sentences.txt
File diff suppressed because it is too large
View File


+ 244
- 0
NNML2/train.m View File

@@ -0,0 +1,244 @@
% This function trains a neural network language model.
function [model] = train(epochs)
% Inputs:
% epochs: Number of epochs to run.
% Output:
% model: A struct containing the learned weights and biases and vocabulary.

if size(ver('Octave'),1)
OctaveMode = 1;
warning('error', 'Octave:broadcast');
start_time = time;
else
OctaveMode = 0;
start_time = clock;
end

% SET HYPERPARAMETERS HERE.
batchsize = 100; % Mini-batch size.
learning_rate = 0.1; % Learning rate; default = 0.1.
momentum = 0.9; % Momentum; default = 0.9.
numhid1 = 50; % Dimensionality of embedding space; default = 50.
numhid2 = 200; % Number of units in hidden layer; default = 200.
init_wt = 0.01; % Standard deviation of the normal distribution
% which is sampled to get the initial weights; default = 0.01

% VARIABLES FOR TRACKING TRAINING PROGRESS.
show_training_CE_after = 100;
show_validation_CE_after = 1000;

% LOAD DATA.
[train_input, train_target, valid_input, valid_target, ...
test_input, test_target, vocab] = load_data(batchsize);
[numwords, batchsize, numbatches] = size(train_input);
vocab_size = size(vocab, 2);

% INITIALIZE WEIGHTS AND BIASES.
word_embedding_weights = init_wt * randn(vocab_size, numhid1);
embed_to_hid_weights = init_wt * randn(numwords * numhid1, numhid2);
hid_to_output_weights = init_wt * randn(numhid2, vocab_size);
hid_bias = zeros(numhid2, 1);
output_bias = zeros(vocab_size, 1);

word_embedding_weights_delta = zeros(vocab_size, numhid1);
word_embedding_weights_gradient = zeros(vocab_size, numhid1);
embed_to_hid_weights_delta = zeros(numwords * numhid1, numhid2);
hid_to_output_weights_delta = zeros(numhid2, vocab_size);
hid_bias_delta = zeros(numhid2, 1);
output_bias_delta = zeros(vocab_size, 1);
expansion_matrix = eye(vocab_size);
count = 0;
tiny = exp(-30);
trainset_CE = 0;

% TRAIN.
for epoch = 1:epochs
fprintf(1, 'Epoch %d\n', epoch);
this_chunk_CE = 0;
trainset_CE = 0;
% LOOP OVER MINI-BATCHES.
for m = 1:numbatches
input_batch = train_input(:, :, m);
target_batch = train_target(:, :, m);

% FORWARD PROPAGATE.
% Compute the state of each layer in the network given the input batch
% and all weights and biases
[embedding_layer_state, hidden_layer_state, output_layer_state] = ...
fprop(input_batch, ...
word_embedding_weights, embed_to_hid_weights, ...
hid_to_output_weights, hid_bias, output_bias);

% COMPUTE DERIVATIVE.
%% Expand the target to a sparse 1-of-K vector.
expanded_target_batch = expansion_matrix(:, target_batch);
%% Compute derivative of cross-entropy loss function.
%%% vocab_size X batchsize
error_deriv = output_layer_state - expanded_target_batch;

% MEASURE LOSS FUNCTION.
CE = -sum(sum(...
expanded_target_batch .* log(output_layer_state + tiny))) / batchsize;
count = count + 1;
this_chunk_CE = this_chunk_CE + (CE - this_chunk_CE) / count;
trainset_CE = trainset_CE + (CE - trainset_CE) / m;
fprintf(1, '\rBatch %d Train CE %.3f', m, this_chunk_CE);
if mod(m, show_training_CE_after) == 0
fprintf(1, '\n');
count = 0;
this_chunk_CE = 0;
end
if OctaveMode
fflush(1);
end

% BACK PROPAGATE.
%% OUTPUT LAYER.
%%% numhid2 X vocab_size
hid_to_output_weights_gradient = hidden_layer_state * error_deriv';
%%% vocab_size
output_bias_gradient = sum(error_deriv, 2);
%%% numhid2 X batchsize
back_propagated_deriv_1 = (hid_to_output_weights * error_deriv) ...
.* hidden_layer_state .* (1 - hidden_layer_state);

%% HIDDEN LAYER.
% FILL IN CODE. Replace the line below by one of the options.
% embed_to_hid_weights_gradient = zeros(numhid1 * numwords, numhid2);
embed_to_hid_weights_gradient = embedding_layer_state * back_propagated_deriv_1';
% Options:
% (a) embed_to_hid_weights_gradient = back_propagated_deriv_1' * embedding_layer_state;
% (b) embed_to_hid_weights_gradient = embedding_layer_state * back_propagated_deriv_1';
% (c) embed_to_hid_weights_gradient = back_propagated_deriv_1;
% (d) embed_to_hid_weights_gradient = embedding_layer_state;

% FILL IN CODE. Replace the line below by one of the options.
% hid_bias_gradient = zeros(numhid2, 1);
hid_bias_gradient = sum(back_propagated_deriv_1, 2);
% Options
% (a) hid_bias_gradient = sum(back_propagated_deriv_1, 2);
% (b) hid_bias_gradient = sum(back_propagated_deriv_1, 1);
% (c) hid_bias_gradient = back_propagated_deriv_1;
% (d) hid_bias_gradient = back_propagated_deriv_1';

% FILL IN CODE. Replace the line below by one of the options.
back_propagated_deriv_2 = embed_to_hid_weights * back_propagated_deriv_1;
% Options
% (a) back_propagated_deriv_2 = embed_to_hid_weights * back_propagated_deriv_1;
% (b) back_propagated_deriv_2 = back_propagated_deriv_1 * embed_to_hid_weights;
% (c) back_propagated_deriv_2 = back_propagated_deriv_1' * embed_to_hid_weights;
% (d) back_propagated_deriv_2 = back_propagated_deriv_1 * embed_to_hid_weights';

word_embedding_weights_gradient(:) = 0;
%% EMBEDDING LAYER.
for w = 1:numwords
word_embedding_weights_gradient = word_embedding_weights_gradient + ...
expansion_matrix(:, input_batch(w, :)) * ...
(back_propagated_deriv_2(1 + (w - 1) * numhid1 : w * numhid1, :)');
end

% UPDATE WEIGHTS AND BIASES.
word_embedding_weights_delta = ...
momentum .* word_embedding_weights_delta + ...
word_embedding_weights_gradient ./ batchsize;
word_embedding_weights = word_embedding_weights...
- learning_rate * word_embedding_weights_delta;

embed_to_hid_weights_delta = ...
momentum .* embed_to_hid_weights_delta + ...
embed_to_hid_weights_gradient ./ batchsize;
embed_to_hid_weights = embed_to_hid_weights...
- learning_rate * embed_to_hid_weights_delta;

hid_to_output_weights_delta = ...
momentum .* hid_to_output_weights_delta + ...
hid_to_output_weights_gradient ./ batchsize;
hid_to_output_weights = hid_to_output_weights...
- learning_rate * hid_to_output_weights_delta;

hid_bias_delta = momentum .* hid_bias_delta + ...
hid_bias_gradient ./ batchsize;
hid_bias = hid_bias - learning_rate * hid_bias_delta;

output_bias_delta = momentum .* output_bias_delta + ...
output_bias_gradient ./ batchsize;
output_bias = output_bias - learning_rate * output_bias_delta;

% VALIDATE.
if mod(m, show_validation_CE_after) == 0
fprintf(1, '\rRunning validation ...');
if OctaveMode
fflush(1);
end
[embedding_layer_state, hidden_layer_state, output_layer_state] = ...
fprop(valid_input, word_embedding_weights, embed_to_hid_weights,...
hid_to_output_weights, hid_bias, output_bias);
datasetsize = size(valid_input, 2);
expanded_valid_target = expansion_matrix(:, valid_target);
CE = -sum(sum(...
expanded_valid_target .* log(output_layer_state + tiny))) /datasetsize;
fprintf(1, ' Validation CE %.3f\n', CE);
if OctaveMode
fflush(1);
end
end
end
fprintf(1, '\rAverage Training CE %.3f\n', trainset_CE);
end
fprintf(1, 'Finished Training.\n');
if OctaveMode
fflush(1);
end
fprintf(1, 'Final Training CE %.3f\n', trainset_CE);

% EVALUATE ON VALIDATION SET.
fprintf(1, '\rRunning validation ...');
if OctaveMode
fflush(1);
end
[embedding_layer_state, hidden_layer_state, output_layer_state] = ...
fprop(valid_input, word_embedding_weights, embed_to_hid_weights,...
hid_to_output_weights, hid_bias, output_bias);
datasetsize = size(valid_input, 2);
expanded_valid_target = expansion_matrix(:, valid_target);
CE = -sum(sum(...
expanded_valid_target .* log(output_layer_state + tiny))) / datasetsize;
fprintf(1, '\rFinal Validation CE %.3f\n', CE);
if OctaveMode
fflush(1);
end

% EVALUATE ON TEST SET.
fprintf(1, '\rRunning test ...');
if OctaveMode
fflush(1);
end
[embedding_layer_state, hidden_layer_state, output_layer_state] = ...
fprop(test_input, word_embedding_weights, embed_to_hid_weights,...
hid_to_output_weights, hid_bias, output_bias);
datasetsize = size(test_input, 2);
expanded_test_target = expansion_matrix(:, test_target);
CE = -sum(sum(...
expanded_test_target .* log(output_layer_state + tiny))) / datasetsize;
fprintf(1, '\rFinal Test CE %.3f\n', CE);
if OctaveMode
fflush(1);
end

model.word_embedding_weights = word_embedding_weights;
model.embed_to_hid_weights = embed_to_hid_weights;
model.hid_to_output_weights = hid_to_output_weights;
model.hid_bias = hid_bias;
model.output_bias = output_bias;
model.vocab = vocab;

% In MATLAB replace line below with 'end_time = clock;'
if OctaveMode
end_time = time;
diff = end_time - start_time;
else
end_time = clock;
diff = etime(end_time, start_time);
end
fprintf(1, 'Training took %.2f seconds\n', diff);
end

+ 25
- 0
NNML2/word_distance.m View File

@@ -0,0 +1,25 @@
function distance = word_distance(word1, word2, model)
% Shows the L2 distance between word1 and word2 in the word_embedding_weights.
% Inputs:
% word1: The first word as a string.
% word2: The second word as a string.
% model: Model returned by the training script.
% Example usage:
% word_distance('school', 'university', model);

word_embedding_weights = model.word_embedding_weights;
vocab = model.vocab;
id1 = strmatch(word1, vocab, 'exact');
id2 = strmatch(word2, vocab, 'exact');
if ~any(id1)
fprintf(1, 'Word ''%s\'' not in vocabulary.\n', word1);
return;
end
if ~any(id2)
fprintf(1, 'Word ''%s\'' not in vocabulary.\n', word2);
return;
end
word_rep1 = word_embedding_weights(id1, :);
word_rep2 = word_embedding_weights(id2, :);
diff = word_rep1 - word_rep2;
distance = sqrt(sum(diff .* diff));

Loading…
Cancel
Save