
Factor concatenation improvements and documentation #748

Merged
merged 37 commits into from
Sep 8, 2021

Conversation

kpu
Member

@kpu kpu commented Nov 1, 2020

Description

Adds options for handling factors in Marian including concatenation. This is primarily written by @pedrodiascoelho.

How to test

Marian continuous integration tests are passing. Unbabel has also been doing quality testing.

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

@emjotde
Member

emjotde commented Nov 2, 2020

The added doc is a nice touch, thank you for that. @pedrodiascoelho anything specific we should pay attention to? Is it worth working with @snukky to add regression/unit tests?

@emjotde
Member

emjotde commented Nov 2, 2020

And a first remark: can you please add comments in the GitHub interface where you added new functionality and where you made refactoring changes? I don't know this part of the code very well and will need a few hints.

@snukky I will give you an old production model with factors later this week to make sure things work and there are no regressions for our factored models. We will add that to the regression tests then.

@snukky
Member

snukky commented Nov 2, 2020

OK, creating a regression test for the existing code is a good idea.

Comment on lines 197 to 201
cli.add<int>("--factors-dim-emb",
    "Embedding dimension of the factors. Only used if concat is selected as factors combining form");
cli.add<std::string>("--factors-combine",
    "How to combine the factors and lemma embeddings. Options available: sum, concat",
    "sum");
Contributor

Since concatenation was implemented, two new options were created. --factors-combine controls whether the lemma and factor embeddings are combined by summing them (the default) or by concatenation.
If sum is chosen, the already implemented embedding from Frank is followed.
If concatenation is chosen, the dimension of the factor embedding must be specified with --factors-dim-emb.
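As a usage sketch, the two flags from the snippet above might be combined like this (the surrounding command line, the vocabulary argument, and the dimension value 8 are illustrative placeholders, not taken from this PR):

```shell
# Hypothetical invocation: combine lemma and factor embeddings by
# concatenation, giving the factor embeddings 8 dimensions.
# Everything other than the two --factors-* flags is a placeholder.
./marian ... --factors-combine concat --factors-dim-emb 8
```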

Comment on lines 202 to 204
cli.add<std::string>("--factor-predictor",
"Method to use when predicting target factors. Options: soft-transformer-layer, hard-transformer-layer, lemma-dependent-bias, re-embedding",
"soft-transformer-layer");
Contributor

For the target factors, since different decoding options were already implemented, a more user-friendly config variable was introduced to make it clearer which options are available.

Comment on lines +658 to +660
size_t FactoredVocab::getTotalFactorCount() const {
return factorVocabSize() - groupRanges_[0].second;
}
Contributor

Auxiliary function that returns the total number of factors (excluding lemmas) in a factored vocabulary.

Member

Please add this to the code as function documentation. Preferably use the doxygen syntax that we've established in the PR with documentation.

Member

See #788 #801 for example

Contributor

Comments added.

Comment on lines 456 to 462
if (opt<std::string>("factorsCombine") == "concat") {
  ABORT_IF(dimFactorEmb == 0, "Embedding: If concatenation is chosen to combine the factor embeddings, a factor embedding size should be specified.");
  int numberOfFactors = (int) factoredVocab_->getTotalFactorCount();
  dimVoc -= numberOfFactors;
  FactorEmbMatrix_ = graph_->param("factor_" + name, {numberOfFactors, dimFactorEmb}, initFunc, fixed);
  LOG_ONCE(info, "[embedding] Combining factors concatenation enabled");
}
Contributor

If concatenation is chosen, the factors are embedded with a matrix separate from the lemma embeddings.

Comment on lines 476 to 490
// Embeds a sequence of words (given as indices) that carry factor information. The matrices are concatenated.
/*private*/ Expr Embedding::embedWithConcat(const Words& data) const {
  auto graph = E_->graph();
  std::vector<IndexType> lemmaIndices;
  std::vector<float> factorIndices;
  factoredVocab_->lemmaAndFactorsIndexes(data, lemmaIndices, factorIndices);
  auto lemmaEmbs = rows(E_, lemmaIndices);
  int dimFactors = FactorEmbMatrix_->shape()[0];
  auto factEmbs = dot(graph->constant({(int) data.size(), dimFactors}, inits::fromVector(factorIndices), Type::float32), FactorEmbMatrix_);

  auto out = concatenate({lemmaEmbs, factEmbs}, -1);

  return out;
}

Contributor

To embed using concatenation, we first call the lemmaAndFactorsIndexes() function, which decodes each word index into its lemma and factor information. Lemmas and factors are then embedded separately. To embed the lemmas, we simply select the rows of the lemma embedding matrix (E_) that correspond to the lemma indices returned by lemmaAndFactorsIndexes(). For the factors, since a token can have more than one factor, we cannot simply select rows of the factor embedding matrix as we did for the lemmas. Instead, an array of 0's and 1's is returned, indicating for each token whether a given factor appears in it. This vector is turned into a sparse matrix that is multiplied by the factor embedding matrix to obtain the factor embeddings. The two resulting embeddings (lemmaEmbs and factEmbs) are then concatenated.
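The flow just described can be sketched outside Marian's expression graph with plain vectors (a minimal sketch; the names embedWithConcatSketch and Matrix are mine, not Marian API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal sketch of the concat path using plain std::vector instead of
// Marian expressions. Names here are illustrative only.
using Matrix = std::vector<std::vector<float>>;

// lemmaIndices: one lemma row index per token (as from lemmaAndFactorsIndexes)
// factorIndicators: tokens x numFactors matrix of 0/1 factor-usage entries
Matrix embedWithConcatSketch(const std::vector<std::size_t>& lemmaIndices,
                             const Matrix& factorIndicators,
                             const Matrix& lemmaEmb,    // lemmaVocab x dimLemma
                             const Matrix& factorEmb) { // numFactors x dimFactor
  const std::size_t dimFactor = factorEmb[0].size();
  const std::size_t numFactors = factorEmb.size();
  Matrix out;
  for (std::size_t t = 0; t < lemmaIndices.size(); ++t) {
    // lemma embedding: plain row lookup, like rows(E_, lemmaIndices)
    std::vector<float> row = lemmaEmb[lemmaIndices[t]];
    // factor embedding: 0/1 indicator row times factorEmb, i.e. the sum of
    // the embedding rows of the factors active for this token
    for (std::size_t d = 0; d < dimFactor; ++d) {
      float v = 0.f;
      for (std::size_t f = 0; f < numFactors; ++f)
        v += factorIndicators[t][f] * factorEmb[f][d];
      row.push_back(v); // concatenate along the embedding axis
    }
    out.push_back(row);
  }
  return out;
}
```

Each output row has dimLemma + dimFactor entries, mirroring the concatenate({lemmaEmbs, factEmbs}, -1) call above.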

Comment on lines 545 to 552
#if 0
auto batchMask = graph->constant({dimWidth, dimBatch, 1},
inits::fromVector(subBatch->mask()));
#else
// experimental: hide inline-fix source tokens from cross attention
auto batchMask = graph->constant({dimWidth, dimBatch, 1},
inits::fromVector(subBatch->crossMaskWithInlineFixSourceSuppressed()));
#else
auto batchMask = graph->constant({dimWidth, dimBatch, 1},
inits::fromVector(subBatch->mask()));
#endif
Contributor

Here only the order of the #if and #else branches of the macro was switched, mainly because the previous arrangement was throwing a couple of warnings when the #else branch was not used.

Comment on lines 288 to 289
auto Wk = graph_->param(prefix + "_Wk", {dimModel, dimModel}, inits::glorotUniform());
int dimKeys = keys->shape()[-1];
auto Wk = graph_->param(prefix + "_Wk", {dimKeys, dimModel}, inits::glorotUniform());
Contributor

The transformer attention weight matrices were being initialized with shape [dimModel x dimModel]. If concatenation is chosen and we are only using source factors, the embedding sizes of the encoder and decoder will differ, so in the enc-dec attention layer of the decoder we need to initialize the weight matrices for keys and values with whatever dimension the encoder outputs.
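The shape constraint behind this fix can be illustrated with a tiny helper (a sketch; matmulShape and the example dimensions are made up, not values from this PR):

```cpp
#include <cassert>
#include <utility>

// Shape sketch (illustrative, not Marian code): multiplying [batch x dimKeys]
// keys by a weight of shape [dimKeys x dimModel] yields [batch x dimModel],
// so the weight's first dimension must match the encoder's output dimension,
// not dimModel.
std::pair<int, int> matmulShape(std::pair<int, int> a, std::pair<int, int> b) {
  assert(a.second == b.first); // inner dimensions must agree
  return {a.first, b.second};
}
```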

Comment on lines 302 to 304
auto Wv = graph_->param(prefix + "_Wv", {dimModel, dimModel}, inits::glorotUniform());
int dimValues = values->shape()[-1];
auto Wv = graph_->param(prefix + "_Wv", {dimValues, dimModel}, inits::glorotUniform());
Contributor

Comment on lines 663 to 690
// Decodes the indexes of lemma and factor for each word and outputs that information separately.
// inputs:
//   - words: vector of words
// outputs:
//   - lemmaIndices: lemma index for each word
//   - factorIndices: factor usage information for each word (1 if the factor is used, 0 if not)
void FactoredVocab::lemmaAndFactorsIndexes(const Words& words, std::vector<IndexType>& lemmaIndices, std::vector<float>& factorIndices) const {
  lemmaIndices.reserve(words.size());
  factorIndices.reserve(words.size() * getTotalFactorCount());

  auto numGroups = getNumGroups();
  std::vector<size_t> lemmaAndFactorIndices;

  for (auto& word : words) {
    if (vocab_.contains(word.toWordIndex())) {
      word2factors(word, lemmaAndFactorIndices);
      lemmaIndices.push_back((IndexType) lemmaAndFactorIndices[0]);
      for (size_t g = 1; g < numGroups; g++) {
        auto factorIndex = lemmaAndFactorIndices[g];
        ABORT_IF(factorIndex == FACTOR_NOT_SPECIFIED, "Attempted to embed a word with a factor not specified");
        for (int i = 0; i < factorShape_[g] - 1; i++) {
          factorIndices.push_back((float) (factorIndex == i));
        }
      }
    }
  }
}

Contributor

Given a list with the word indices of a batch, this function returns two data structures containing separate information about lemma indices and factor usage.
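The inner expansion loop above can be sketched in isolation (the function name factorIndicators is mine; it mirrors the factorShape_[g] - 1 loop, which emits one indicator entry per possible factor value except the last):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the inner loop: a factor taking value factorIndex out of
// numValues possible values is expanded into numValues - 1 indicator
// entries, each 1.0 if the factor has that value and 0.0 otherwise.
std::vector<float> factorIndicators(std::size_t factorIndex, int numValues) {
  std::vector<float> out;
  for (int i = 0; i < numValues - 1; ++i)
    out.push_back(factorIndex == static_cast<std::size_t>(i) ? 1.f : 0.f);
  return out;
}
```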

@emjotde
Member

emjotde commented Nov 11, 2020

Hm, I wanted to assign @kpu as a reviewer, but since you submitted in @pedrodiascoelho's stead it seems GitHub doesn't let me do that. Do you have access to the reviewing interface at all?

@emjotde
Member

emjotde commented Nov 11, 2020

@ykim362 Can you share the files you shared with NVIDIA with @snukky for regression tests? That's an FS model, right?

@emjotde
Member

emjotde commented Nov 11, 2020

@pedrodiascoelho Thanks a lot for the comments; that will be helpful.

Member Author

@kpu kpu left a comment


Apparently I can review my own pull request.

@ykim362
Member

ykim362 commented Nov 11, 2020

@ykim362 Can you share the files you shared with NVIDIA with @snukky for regression tests? That's an FS model, right?

@emjotde sure. I will.

@pedrodiascoelho
Contributor

@pedrodiascoelho Thanks a lot for the comments; that will be helpful.

You're welcome @emjotde. I'll be glad and completely available to discuss any question or doubt regarding my implementation.

@pedrodiascoelho
Contributor

pedrodiascoelho commented May 11, 2021

Hello @snukky
Following our discussion, I added some logic to ensure backward compatibility with the older CLI when choosing the lemma-dependency method; could you check my work?
I have also updated the code to the most recent marian-dev.

@snukky
Member

snukky commented May 19, 2021

@pedrodiascoelho I confirm that after this fix: marian-cef#7, the factors code introduced in this PR is backward compatible, AFAICT.

Move backward compatibility checks for factors to config.cpp
@graemenail graemenail self-requested a review May 19, 2021 14:35
Member

@snukky snukky left a comment


Thanks @pedrodiascoelho! Please take a look at my comments, and feel free to resolve any conversations that you agree with and that you think won't need more of our attention after you fix them.

Member

@graemenail graemenail left a comment


Hi @pedrodiascoelho; I've left some comments - ping me if something isn't clear.

Also, I added a small comment on marian-nmt/marian-regression-tests#78

@@ -0,0 +1,208 @@
# Using marian with factors
Member

I think a new section in the sidebar makes sense. For now we can keep the doc/ directory flat, as there are just a few files, but put this file in the nav TOC under Models, or even Vocabularies, and move the graph documentation under Expression Graph or similar.

Comment on lines +658 to +660
size_t FactoredVocab::getTotalFactorCount() const {
return factorVocabSize() - groupRanges_[0].second;
}
Member

See #788 #801 for example

@snukky
Member

snukky commented Jun 18, 2021

All existing regression tests pass with marian-cef@cb44bbc, which is new factors + the most recent Marian master. There is one new test that fails, but this is tracked in marian-nmt/marian-regression-tests#78 (comment)

@pedrodiascoelho
Contributor

Hello @snukky! Just pinging you regarding this PR, as you asked me to do during the EU Marian project meeting. I've updated the test that was failing in marian-nmt/marian-regression-tests#78 (comment). Let me know if something else is missing, and thank you again for the help and review.

@snukky
Member

snukky commented Jun 22, 2021

Yes, thanks for the reminder. I updated your branch and tested it again with an internal model last week. All looks good to me. @emjotde wanted to take a look at the changes too as the factors code is critical for our models, so now we're waiting for his approval. I will keep reminding him.

@snukky snukky requested a review from emjotde June 22, 2021 08:55
@kpu
Member Author

kpu commented Jun 22, 2021

@emjotde wanted to take a look at the changes too as the factors code is critical for our models, so now we're waiting for his approval. I will keep reminding him.

https://www.youtube.com/watch?v=Hr_9b-Lt-pk

@pedrodiascoelho
Contributor

Hello @snukky. As requested, I'll leave here a summarized version of my Master's thesis, where you can check the results of the experiments comparing concatenation vs. sum (section 3.3).
Extended_Abstract_Pedro_Dias_Coelho_83719.pdf
Let me know if you need any further explanation or clarification.

@snukky snukky merged commit 4dd30b5 into marian-nmt:master Sep 8, 2021