Adding exercise 2
This commit is contained in:
parent e71a204541
commit 4ee07682fd

NNML2/README.txt | 125 (new executable file)
@@ -0,0 +1,125 @@
#######################################
 Neural Networks for Machine Learning
 Programming Assignment 2
 Learning word representations.
#######################################

In this assignment, you will design a neural net language model that will
learn to predict the next word, given the previous three words.

The data set consists of 4-grams (a 4-gram is a sequence of 4 adjacent words
in a sentence). These 4-grams were extracted from a large collection of text.
The 4-grams are chosen so that all the words involved come
from a small vocabulary of 250 words. Note that for the purposes of this
assignment, special characters such as commas, full stops, parentheses, etc.
are also considered words. The training set consists of 372,550 4-grams. The
validation and test sets have 46,568 4-grams each.

### GETTING STARTED. ###
Look at the file raw_sentences.txt. It contains the raw sentences from which
these 4-grams were extracted. Take a look at the kind of sentences we are
dealing with here. They are fairly simple ones.

To load the data set, open an Octave terminal and cd to the directory where
the downloaded data is located. Type

> load data.mat

This will load a struct called 'data' with 4 fields in it.
You can see them by typing

> fieldnames(data)

'data.vocab' contains the vocabulary of 250 words. The training, validation
and test sets are in 'data.trainData', 'data.validData' and 'data.testData'
respectively. To see the list of words in the vocabulary, type

> data.vocab

'data.trainData' is a 4 X 372550 matrix (this is how load_data.m indexes it:
one column per training case). So there are 372550 training cases and 4 words
per training case. Each entry is an integer that is the index of a word in
the vocabulary, so each training case is a sequence of 4 words.
'data.validData' and 'data.testData' are similar; they contain 46,568 4-grams
each. All three need to be separated into inputs and targets, and the
training set needs to be split into mini-batches. The file load_data.m
provides code for doing that. To run it, type:

> [train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);

This will load the data, separate it into inputs and targets, and make
mini-batches of size 100 for the training set.
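
As a quick sanity check, you can look at the shapes of the returned arrays
(the sizes below are what you would expect given the counts above and a
mini-batch size of 100, since floor(372550 / 100) = 3725 mini-batches):

> size(train_x)   % 3 x 100 x 3725 (input words x batch size x mini-batches)
> size(train_t)   % 1 x 100 x 3725 (target word of each training case)
> size(valid_x)   % 3 x 46568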

train.m implements the function that trains a neural net language model.
To run the training, execute the following:

> model = train(1);

This will train the model for one epoch (one pass through the training set).
As handed out, the training is not implemented and the cross entropy will not
decrease. You have to fill in parts of the code in fprop.m and train.m.
Once the code is correctly filled in, you will see that the cross entropy
starts decreasing. At this point, try changing the hyperparameters (number
of epochs, number of hidden units, learning rates, momentum, etc.) and see
what effect that has on the training and validation cross entropy. The
questions in the assignment will ask you to try out specific values of these.
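
The hyperparameters live near the top of train.m; the defaults there are:

  batchsize = 100;      % Mini-batch size.
  learning_rate = 0.1;  % Learning rate.
  momentum = 0.9;       % Momentum.
  numhid1 = 50;         % Dimensionality of embedding space.
  numhid2 = 200;        % Number of units in the hidden layer.
  init_wt = 0.01;       % Standard deviation of the initial random weights.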

The training method will output a 'model' (a struct containing weights, biases
and a list of words). Now it's time to play around with the learned model
and answer the questions in the assignment.
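
For example, you can inspect the learned parameters (the sizes below follow
from the default hyperparameters in train.m):

> fieldnames(model)   % word_embedding_weights, embed_to_hid_weights,
                      % hid_to_output_weights, hid_bias, output_bias, vocab
> size(model.word_embedding_weights)   % 250 x 50 (vocab_size x numhid1)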

### DESCRIPTION OF THE NETWORK. ###
The network consists of an input layer, an embedding layer, a hidden layer
and an output layer. The input layer consists of three word indices. The same
'word_embedding_weights' are used to map each index to a distributed feature
representation. These mapped features constitute the embedding layer. This
layer is connected to the hidden layer, which in turn is connected to the
output layer. The output layer is a softmax over the 250 words.
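
In Octave-like pseudocode, the forward pass is a condensed version of what
fprop.m computes (z1, h, z2, y are shorthand for the layer states):

  % embedding: look up each of the 3 input words in word_embedding_weights
  % and stack the three feature vectors into embedding_layer_state.
  z1 = embed_to_hid_weights' * embedding_layer_state + repmat(hid_bias, 1, batchsize);
  h  = 1 ./ (1 + exp(-z1));         % logistic hidden units
  z2 = hid_to_output_weights' * h + repmat(output_bias, 1, batchsize);
  y  = exp(z2) ./ sum(exp(z2));     % softmax over the 250 words (fprop.m also
                                    % subtracts the max for numerical stability)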

### THINGS YOU SEE WHEN THE MODEL IS TRAINING. ###
As the model trains, it prints out some numbers that tell you how well the
training is going.
(1) The model shows the average per-case cross entropy (CE) obtained
on the training set. The average CE is computed every 100 mini-batches. The
average CE over the entire training set is reported at the end of every epoch.
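
Concretely, the per-case CE of a mini-batch is computed in train.m as

  CE = -sum(sum(expanded_target_batch .* log(output_layer_state + tiny))) / batchsize;

where 'tiny' (= exp(-30)) guards against taking log(0).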

(2) After every 1000 mini-batches of training, the model is run on the
validation set. Recall that the validation set consists of data that is not
used for training. It is used to see how well the model does on unseen data.
The cross entropy on the validation set is reported.

(3) At the end of training, the model is run both on the validation set and on
the test set and the cross entropy on both is reported.

You are welcome to change these numbers (100 and 1000) to see the CEs more
frequently if you want to.
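
They correspond to two variables near the top of train.m:

  show_training_CE_after = 100;
  show_validation_CE_after = 1000;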

### SOME USEFUL FUNCTIONS. ###
These functions are meant to be used for analyzing the model after the
training is done.

display_nearest_words.m : This method will display the words closest to a
given word in the word representation space.

word_distance.m : This method will compute the distance between two given
words.

predict_next_word.m : This method will produce some predictions for the next
word given 3 previous words.

Take a look at the documentation inside these functions to see how to use them.
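
For example (these calls are taken from the example usage in each file):

> display_nearest_words('school', model, 10);
> word_distance('school', 'university', model)
> predict_next_word('life', 'in', 'new', model, 3);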

### THINGS TO TRY. ###
Choose some words from the vocabulary and make a list. Find the words that
the model thinks are close to words in this list (for example, find the words
closest to 'companies', 'president', 'day', 'could', etc.). Do the outputs
make sense?

Pick three words from the vocabulary that go well together (for example,
'government of united', 'city of new', 'life in the', 'he is the', etc.). Use
the model to predict the next word. Does the model give sensible predictions?

Which words would you expect to be closer together than others? For example,
'he' should be closer to 'she' than to 'federal', or 'companies' should be
closer to 'business' than to 'political'. Find the distances using the model.
Do the distances that the model predicts make sense?

You are welcome to try other things with this model and post any interesting
observations on the forums!
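
For instance, a quick check of the last question (the exact numbers will
depend on your trained model):

> word_distance('he', 'she', model)
> word_distance('he', 'federal', model)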

NNML2/data.mat | BIN (new file)
Binary file not shown.

NNML2/display_nearest_words.m | 28 (new file)
@@ -0,0 +1,28 @@
function display_nearest_words(word, model, k)
% Shows the k-nearest words to the query word.
% Inputs:
%   word: The query word as a string.
%   model: Model returned by the training script.
%   k: The number of nearest words to display.
% Example usage:
%   display_nearest_words('school', model, 10);

word_embedding_weights = model.word_embedding_weights;
vocab = model.vocab;
id = strmatch(word, vocab, 'exact');
if ~any(id)
  fprintf(1, 'Word ''%s'' not in vocabulary.\n', word);
  return;
end
% Compute distance to every other word.
vocab_size = size(vocab, 2);
word_rep = word_embedding_weights(id, :);
diff = word_embedding_weights - repmat(word_rep, vocab_size, 1);
distance = sqrt(sum(diff .* diff, 2));

% Sort by distance.
[d, order] = sort(distance);
order = order(2:k+1);  % The nearest word is the query word itself; skip it.
for i = 1:k
  fprintf('%s %.2f\n', vocab{order(i)}, distance(order(i)));
end

NNML2/fprop.m | 86 (new file)
@@ -0,0 +1,86 @@
function [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
  fprop(input_batch, word_embedding_weights, embed_to_hid_weights, ...
        hid_to_output_weights, hid_bias, output_bias)
% This method forward propagates through a neural network.
% Inputs:
%   input_batch: The input data as a matrix of size numwords X batchsize, where
%     numwords is the number of words and batchsize is the number of data
%     points. So, if input_batch(i, j) = k then the ith word in data point j is
%     word index k of the vocabulary.
%
%   word_embedding_weights: Word embedding as a matrix of size
%     vocab_size X numhid1, where vocab_size is the size of the vocabulary and
%     numhid1 is the dimensionality of the embedding space.
%
%   embed_to_hid_weights: Weights between the word embedding layer and hidden
%     layer as a matrix of size numhid1*numwords X numhid2, where numhid2 is
%     the number of hidden units.
%
%   hid_to_output_weights: Weights between the hidden layer and output softmax
%     unit as a matrix of size numhid2 X vocab_size.
%
%   hid_bias: Bias of the hidden layer as a matrix of size numhid2 X 1.
%
%   output_bias: Bias of the output layer as a matrix of size vocab_size X 1.
%
% Outputs:
%   embedding_layer_state: State of units in the embedding layer as a matrix
%     of size numhid1*numwords X batchsize.
%
%   hidden_layer_state: State of units in the hidden layer as a matrix of size
%     numhid2 X batchsize.
%
%   output_layer_state: State of units in the output layer as a matrix of size
%     vocab_size X batchsize.
%

[numwords, batchsize] = size(input_batch);
[vocab_size, numhid1] = size(word_embedding_weights);
numhid2 = size(embed_to_hid_weights, 2);

%% COMPUTE STATE OF WORD EMBEDDING LAYER.
% Look up the input word indices in the word_embedding_weights matrix.
embedding_layer_state = reshape(...
  word_embedding_weights(reshape(input_batch, 1, []), :)', ...
  numhid1 * numwords, []);

%% COMPUTE STATE OF HIDDEN LAYER.
% Compute inputs to hidden units.
inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
  repmat(hid_bias, 1, batchsize);

% Apply logistic activation function.
% FILL IN CODE. Replace the line below by one of the options.
% hidden_layer_state = zeros(numhid2, batchsize);
hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units));
% Options
% (a) hidden_layer_state = 1 ./ (1 + exp(inputs_to_hidden_units));
% (b) hidden_layer_state = 1 ./ (1 - exp(-inputs_to_hidden_units));
% (c) hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units));
% (d) hidden_layer_state = -1 ./ (1 + exp(-inputs_to_hidden_units));

%% COMPUTE STATE OF OUTPUT LAYER.
% Compute inputs to softmax.
% FILL IN CODE. Replace the line below by one of the options.
% inputs_to_softmax = zeros(vocab_size, batchsize);
inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + ...
  repmat(output_bias, 1, batchsize);
% Options
% (a) inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, 1, batchsize);
% (b) inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, batchsize, 1);
% (c) inputs_to_softmax = hidden_layer_state * hid_to_output_weights' + repmat(output_bias, 1, batchsize);
% (d) inputs_to_softmax = hid_to_output_weights * hidden_layer_state + repmat(output_bias, batchsize, 1);

% Subtract maximum.
% Remember that adding or subtracting the same constant from each input to a
% softmax unit does not affect the outputs. Here we are subtracting the
% maximum to make all inputs <= 0. This prevents overflows when computing
% their exponents.
inputs_to_softmax = inputs_to_softmax ...
  - repmat(max(inputs_to_softmax), vocab_size, 1);

% Compute exp.
output_layer_state = exp(inputs_to_softmax);

% Normalize to get probability distribution.
output_layer_state = output_layer_state ./ repmat(...
  sum(output_layer_state, 1), vocab_size, 1);

NNML2/load_data.m | 27 (new file)
@@ -0,0 +1,27 @@
function [train_input, train_target, valid_input, valid_target, test_input, test_target, vocab] = load_data(N)
% This method loads the training, validation and test set.
% It also divides the training set into mini-batches.
% Inputs:
%   N: Mini-batch size.
% Outputs:
%   train_input: An array of size D X N X M, where
%     D: number of input dimensions (in this case, 3).
%     N: size of each mini-batch (in this case, 100).
%     M: number of mini-batches.
%   train_target: An array of size 1 X N X M.
%   valid_input: An array of size D X number of points in the validation set.
%   valid_target: An array of size 1 X number of points in the validation set.
%   test_input: An array of size D X number of points in the test set.
%   test_target: An array of size 1 X number of points in the test set.
%   vocab: Vocabulary containing index to word mapping.

load data.mat;
numdims = size(data.trainData, 1);
D = numdims - 1;  % The first D words of each 4-gram are the input; the last word is the target.
M = floor(size(data.trainData, 2) / N);
train_input = reshape(data.trainData(1:D, 1:N * M), D, N, M);
train_target = reshape(data.trainData(D + 1, 1:N * M), 1, N, M);
valid_input = data.validData(1:D, :);
valid_target = data.validData(D + 1, :);
test_input = data.testData(1:D, :);
test_target = data.testData(D + 1, :);
vocab = data.vocab;
end

NNML2/predict_next_word.m | 37 (new file)
@@ -0,0 +1,37 @@
function predict_next_word(word1, word2, word3, model, k)
% Predicts the next word.
% Inputs:
%   word1: The first word as a string.
%   word2: The second word as a string.
%   word3: The third word as a string.
%   model: Model returned by the training script.
%   k: The k most probable predictions are shown.
% Example usage:
%   predict_next_word('john', 'might', 'be', model, 3);
%   predict_next_word('life', 'in', 'new', model, 3);

word_embedding_weights = model.word_embedding_weights;
vocab = model.vocab;
id1 = strmatch(word1, vocab, 'exact');
id2 = strmatch(word2, vocab, 'exact');
id3 = strmatch(word3, vocab, 'exact');
if ~any(id1)
  fprintf(1, 'Word ''%s'' not in vocabulary.\n', word1);
  return;
end
if ~any(id2)
  fprintf(1, 'Word ''%s'' not in vocabulary.\n', word2);
  return;
end
if ~any(id3)
  fprintf(1, 'Word ''%s'' not in vocabulary.\n', word3);
  return;
end
input = [id1; id2; id3];
[embedding_layer_state, hidden_layer_state, output_layer_state] = ...
  fprop(input, model.word_embedding_weights, model.embed_to_hid_weights, ...
        model.hid_to_output_weights, model.hid_bias, model.output_bias);
[prob, indices] = sort(output_layer_state, 'descend');
for i = 1:k
  fprintf(1, '%s %s %s %s Prob: %.5f\n', word1, word2, word3, vocab{indices(i)}, prob(i));
end

NNML2/raw_sentences.txt | 97162 (new executable file)
File diff suppressed because it is too large.

NNML2/train.m | 244 (new file)
@@ -0,0 +1,244 @@
% This function trains a neural network language model.
function [model] = train(epochs)
% Inputs:
%   epochs: Number of epochs to run.
% Output:
%   model: A struct containing the learned weights and biases and vocabulary.

if size(ver('Octave'), 1)
  OctaveMode = 1;
  warning('error', 'Octave:broadcast');
  start_time = time;
else
  OctaveMode = 0;
  start_time = clock;
end

% SET HYPERPARAMETERS HERE.
batchsize = 100;       % Mini-batch size.
learning_rate = 0.1;   % Learning rate; default = 0.1.
momentum = 0.9;        % Momentum; default = 0.9.
numhid1 = 50;          % Dimensionality of embedding space; default = 50.
numhid2 = 200;         % Number of units in hidden layer; default = 200.
init_wt = 0.01;        % Standard deviation of the normal distribution
                       % which is sampled to get the initial weights; default = 0.01.

% VARIABLES FOR TRACKING TRAINING PROGRESS.
show_training_CE_after = 100;
show_validation_CE_after = 1000;

% LOAD DATA.
[train_input, train_target, valid_input, valid_target, ...
  test_input, test_target, vocab] = load_data(batchsize);
[numwords, batchsize, numbatches] = size(train_input);
vocab_size = size(vocab, 2);

% INITIALIZE WEIGHTS AND BIASES.
word_embedding_weights = init_wt * randn(vocab_size, numhid1);
embed_to_hid_weights = init_wt * randn(numwords * numhid1, numhid2);
hid_to_output_weights = init_wt * randn(numhid2, vocab_size);
hid_bias = zeros(numhid2, 1);
output_bias = zeros(vocab_size, 1);

word_embedding_weights_delta = zeros(vocab_size, numhid1);
word_embedding_weights_gradient = zeros(vocab_size, numhid1);
embed_to_hid_weights_delta = zeros(numwords * numhid1, numhid2);
hid_to_output_weights_delta = zeros(numhid2, vocab_size);
hid_bias_delta = zeros(numhid2, 1);
output_bias_delta = zeros(vocab_size, 1);
expansion_matrix = eye(vocab_size);
count = 0;
tiny = exp(-30);
trainset_CE = 0;

% TRAIN.
for epoch = 1:epochs
  fprintf(1, 'Epoch %d\n', epoch);
  this_chunk_CE = 0;
  trainset_CE = 0;
  % LOOP OVER MINI-BATCHES.
  for m = 1:numbatches
    input_batch = train_input(:, :, m);
    target_batch = train_target(:, :, m);

    % FORWARD PROPAGATE.
    % Compute the state of each layer in the network given the input batch
    % and all weights and biases.
    [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
      fprop(input_batch, ...
            word_embedding_weights, embed_to_hid_weights, ...
            hid_to_output_weights, hid_bias, output_bias);

    % COMPUTE DERIVATIVE.
    %% Expand the target to a sparse 1-of-K vector.
    expanded_target_batch = expansion_matrix(:, target_batch);
    %% Compute derivative of cross-entropy loss function.
    %%% vocab_size X batchsize
    error_deriv = output_layer_state - expanded_target_batch;

    % MEASURE LOSS FUNCTION.
    CE = -sum(sum(...
      expanded_target_batch .* log(output_layer_state + tiny))) / batchsize;
    count = count + 1;
    this_chunk_CE = this_chunk_CE + (CE - this_chunk_CE) / count;
    trainset_CE = trainset_CE + (CE - trainset_CE) / m;
    fprintf(1, '\rBatch %d Train CE %.3f', m, this_chunk_CE);
    if mod(m, show_training_CE_after) == 0
      fprintf(1, '\n');
      count = 0;
      this_chunk_CE = 0;
    end
    if OctaveMode
      fflush(1);
    end

    % BACK PROPAGATE.
    %% OUTPUT LAYER.
    %%% numhid2 X vocab_size
    hid_to_output_weights_gradient = hidden_layer_state * error_deriv';
    %%% vocab_size X 1
    output_bias_gradient = sum(error_deriv, 2);
    %%% numhid2 X batchsize
    back_propagated_deriv_1 = (hid_to_output_weights * error_deriv) ...
      .* hidden_layer_state .* (1 - hidden_layer_state);

    %% HIDDEN LAYER.
    % FILL IN CODE. Replace the line below by one of the options.
    % embed_to_hid_weights_gradient = zeros(numhid1 * numwords, numhid2);
    embed_to_hid_weights_gradient = embedding_layer_state * back_propagated_deriv_1';
    % Options:
    % (a) embed_to_hid_weights_gradient = back_propagated_deriv_1' * embedding_layer_state;
    % (b) embed_to_hid_weights_gradient = embedding_layer_state * back_propagated_deriv_1';
    % (c) embed_to_hid_weights_gradient = back_propagated_deriv_1;
    % (d) embed_to_hid_weights_gradient = embedding_layer_state;

    % FILL IN CODE. Replace the line below by one of the options.
    % hid_bias_gradient = zeros(numhid2, 1);
    hid_bias_gradient = sum(back_propagated_deriv_1, 2);
    % Options
    % (a) hid_bias_gradient = sum(back_propagated_deriv_1, 2);
    % (b) hid_bias_gradient = sum(back_propagated_deriv_1, 1);
    % (c) hid_bias_gradient = back_propagated_deriv_1;
    % (d) hid_bias_gradient = back_propagated_deriv_1';

    % FILL IN CODE. Replace the line below by one of the options.
    back_propagated_deriv_2 = embed_to_hid_weights * back_propagated_deriv_1;
    % Options
    % (a) back_propagated_deriv_2 = embed_to_hid_weights * back_propagated_deriv_1;
    % (b) back_propagated_deriv_2 = back_propagated_deriv_1 * embed_to_hid_weights;
    % (c) back_propagated_deriv_2 = back_propagated_deriv_1' * embed_to_hid_weights;
    % (d) back_propagated_deriv_2 = back_propagated_deriv_1 * embed_to_hid_weights';

    word_embedding_weights_gradient(:) = 0;
    %% EMBEDDING LAYER.
    for w = 1:numwords
      word_embedding_weights_gradient = word_embedding_weights_gradient + ...
        expansion_matrix(:, input_batch(w, :)) * ...
        (back_propagated_deriv_2(1 + (w - 1) * numhid1 : w * numhid1, :)');
    end

    % UPDATE WEIGHTS AND BIASES.
    word_embedding_weights_delta = ...
      momentum .* word_embedding_weights_delta + ...
      word_embedding_weights_gradient ./ batchsize;
    word_embedding_weights = word_embedding_weights ...
      - learning_rate * word_embedding_weights_delta;

    embed_to_hid_weights_delta = ...
      momentum .* embed_to_hid_weights_delta + ...
      embed_to_hid_weights_gradient ./ batchsize;
    embed_to_hid_weights = embed_to_hid_weights ...
      - learning_rate * embed_to_hid_weights_delta;

    hid_to_output_weights_delta = ...
      momentum .* hid_to_output_weights_delta + ...
      hid_to_output_weights_gradient ./ batchsize;
    hid_to_output_weights = hid_to_output_weights ...
      - learning_rate * hid_to_output_weights_delta;

    hid_bias_delta = momentum .* hid_bias_delta + ...
      hid_bias_gradient ./ batchsize;
    hid_bias = hid_bias - learning_rate * hid_bias_delta;

    output_bias_delta = momentum .* output_bias_delta + ...
      output_bias_gradient ./ batchsize;
    output_bias = output_bias - learning_rate * output_bias_delta;

    % VALIDATE.
    if mod(m, show_validation_CE_after) == 0
      fprintf(1, '\rRunning validation ...');
      if OctaveMode
        fflush(1);
      end
      [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
        fprop(valid_input, word_embedding_weights, embed_to_hid_weights, ...
              hid_to_output_weights, hid_bias, output_bias);
      datasetsize = size(valid_input, 2);
      expanded_valid_target = expansion_matrix(:, valid_target);
      CE = -sum(sum(...
        expanded_valid_target .* log(output_layer_state + tiny))) / datasetsize;
      fprintf(1, ' Validation CE %.3f\n', CE);
      if OctaveMode
        fflush(1);
      end
    end
  end
  fprintf(1, '\rAverage Training CE %.3f\n', trainset_CE);
end
fprintf(1, 'Finished Training.\n');
if OctaveMode
  fflush(1);
end
fprintf(1, 'Final Training CE %.3f\n', trainset_CE);

% EVALUATE ON VALIDATION SET.
fprintf(1, '\rRunning validation ...');
if OctaveMode
  fflush(1);
end
[embedding_layer_state, hidden_layer_state, output_layer_state] = ...
  fprop(valid_input, word_embedding_weights, embed_to_hid_weights, ...
        hid_to_output_weights, hid_bias, output_bias);
datasetsize = size(valid_input, 2);
expanded_valid_target = expansion_matrix(:, valid_target);
CE = -sum(sum(...
  expanded_valid_target .* log(output_layer_state + tiny))) / datasetsize;
fprintf(1, '\rFinal Validation CE %.3f\n', CE);
if OctaveMode
  fflush(1);
end

% EVALUATE ON TEST SET.
fprintf(1, '\rRunning test ...');
if OctaveMode
  fflush(1);
end
[embedding_layer_state, hidden_layer_state, output_layer_state] = ...
  fprop(test_input, word_embedding_weights, embed_to_hid_weights, ...
        hid_to_output_weights, hid_bias, output_bias);
datasetsize = size(test_input, 2);
expanded_test_target = expansion_matrix(:, test_target);
CE = -sum(sum(...
  expanded_test_target .* log(output_layer_state + tiny))) / datasetsize;
fprintf(1, '\rFinal Test CE %.3f\n', CE);
if OctaveMode
  fflush(1);
end

model.word_embedding_weights = word_embedding_weights;
model.embed_to_hid_weights = embed_to_hid_weights;
model.hid_to_output_weights = hid_to_output_weights;
model.hid_bias = hid_bias;
model.output_bias = output_bias;
model.vocab = vocab;

% Report elapsed time. Octave's time returns seconds directly;
% MATLAB uses clock/etime.
if OctaveMode
  end_time = time;
  diff = end_time - start_time;
else
  end_time = clock;
  diff = etime(end_time, start_time);
end
fprintf(1, 'Training took %.2f seconds\n', diff);
end

NNML2/word_distance.m | 25 (new file)
@@ -0,0 +1,25 @@
function distance = word_distance(word1, word2, model)
% Shows the L2 distance between word1 and word2 in the word_embedding_weights.
% Inputs:
%   word1: The first word as a string.
%   word2: The second word as a string.
%   model: Model returned by the training script.
% Example usage:
%   word_distance('school', 'university', model);

word_embedding_weights = model.word_embedding_weights;
vocab = model.vocab;
id1 = strmatch(word1, vocab, 'exact');
id2 = strmatch(word2, vocab, 'exact');
if ~any(id1)
  fprintf(1, 'Word ''%s'' not in vocabulary.\n', word1);
  return;
end
if ~any(id2)
  fprintf(1, 'Word ''%s'' not in vocabulary.\n', word2);
  return;
end
word_rep1 = word_embedding_weights(id1, :);
word_rep2 = word_embedding_weights(id2, :);
diff = word_rep1 - word_rep2;
distance = sqrt(sum(diff .* diff));