Factor concatenation improvements and documentation #748
Conversation
Factors modification from Unbabel
The added doc is a nice touch, thank you for that. @pedrodiascoelho anything specific we should pay attention to? Is it worth working with @snukky to add regression/unit tests?
And a first remark: can you please add comments in the GitHub interface where you added new functionality and where you made refactoring changes? I don't know this part of the code very well and will need a few hints. @snukky I will give you an old production model with factors later this week to make sure things work and there are no regressions for our factored models. We will add that to the regression tests then.
OK, creating a regression test for the existing code is a good idea.
src/common/config_parser.cpp
Outdated
cli.add<int>("--factors-dim-emb",
    "Embedding dimension of the factors. Only used if concat is selected as factors combining form");
cli.add<std::string>("--factors-combine",
    "How to combine the factors and lemma embeddings. Options available: sum, concat",
    "sum");
Since concatenation was implemented, two new options were added. One, --factors-combine, controls whether the lemma and factor embeddings are combined by summing them (the default) or by concatenation.
If sum is chosen, it follows the embedding already implemented by Frank.
If concatenation is chosen, the dimension of the factor embeddings must be specified with --factors-dim-emb.
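For illustration, enabling concatenation on the command line would look roughly like this (the embedding dimension 8 is an arbitrary example; all other required options are elided):

./marian --factors-combine concat --factors-dim-emb 8 ...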
src/common/config_parser.cpp
Outdated
cli.add<std::string>("--factor-predictor",
    "Method to use when predicting target factors. Options: soft-transformer-layer, hard-transformer-layer, lemma-dependent-bias, re-embedding",
    "soft-transformer-layer");
For the target factors, since different decoding options were already implemented, a more user-friendly config option was introduced to make it clearer which options are available.
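For example, switching from the default to the hard variant would look like this (illustrative invocation; the option and its values are taken from the diff above):

./marian --factor-predictor hard-transformer-layer ...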
size_t FactoredVocab::getTotalFactorCount() const {
  return factorVocabSize() - groupRanges_[0].second;
}
Auxiliary function that returns the total number of factors (excluding lemmas) in a factored vocabulary.
Please add this to the code as function documentation. Preferably use the doxygen syntax that we've established in the PR with documentation.
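For instance, a doxygen comment along these lines would cover it (a sketch based on the description above, not necessarily the text that was committed):

/**
 * @brief Returns the total number of factors (excluding lemmas) in this factored vocabulary.
 *
 * Group 0 holds the lemmas, so the count is the factor vocabulary size
 * minus the end of the lemma group's index range.
 */
size_t FactoredVocab::getTotalFactorCount() const {
  return factorVocabSize() - groupRanges_[0].second;
}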
comments added
src/layers/generic.cpp
Outdated
if (opt<std::string>("factorsCombine") == "concat") {
  ABORT_IF(dimFactorEmb == 0, "Embedding: If concatenation is chosen to combine the factor embeddings, a factor embedding size should be specified.");
  int numberOfFactors = (int) factoredVocab_->getTotalFactorCount();
  dimVoc -= numberOfFactors;
  FactorEmbMatrix_ = graph_->param("factor_" + name, {numberOfFactors, dimFactorEmb}, initFunc, fixed);
  LOG_ONCE(info, "[embedding] Combining factors concatenation enabled");
}
If concatenation is chosen, the factors are embedded with a separate matrix from the lemma embeddings: the factor rows are dropped from the lemma embedding matrix (dimVoc -= numberOfFactors) and a dedicated numberOfFactors x dimFactorEmb parameter matrix is created for them.
src/layers/generic.cpp
Outdated
// Embeds a sequence of words (given as indices) that carry factor information; the lemma and factor embeddings are concatenated.
/*private*/ Expr Embedding::embedWithConcat(const Words& data) const {
  auto graph = E_->graph();
  std::vector<IndexType> lemmaIndices;
  std::vector<float> factorIndices;
  factoredVocab_->lemmaAndFactorsIndexes(data, lemmaIndices, factorIndices);
  auto lemmaEmbs = rows(E_, lemmaIndices);
  int dimFactors = FactorEmbMatrix_->shape()[0];
  auto factEmbs = dot(graph->constant({(int) data.size(), dimFactors}, inits::fromVector(factorIndices), Type::float32), FactorEmbMatrix_);

  auto out = concatenate({lemmaEmbs, factEmbs}, -1);

  return out;
}
To embed using concatenation, we first call lemmaAndFactorsIndexes(), which decodes each word index into its lemma and factor information. Lemmas and factors are then embedded separately. To embed the lemmas, we simply select the rows of the lemma embedding matrix (E_) that correspond to the lemma indices returned by lemmaAndFactorsIndexes(). For the factors, since a token can carry more than one factor, we cannot simply select rows of the factor embedding matrix as we did for the lemmas. Instead, the function returns an array of 0s and 1s indicating, for each token, whether each factor is present on it. This vector is turned into a (sparse) multi-hot matrix that is multiplied by the factor embedding matrix, which sums the embeddings of the factors present on each token. The two resulting embeddings (lemmaEmbs and factEmbs) are then concatenated.
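To make that dot product concrete, here is a minimal sketch in plain C++ (no Marian types; embedFactors and the flat row-major layout are illustrative, not the actual code): each token's 0/1 indicator row, multiplied by the factor embedding matrix, accumulates the embedding rows of the factors present on that token.

#include <cstddef>
#include <vector>

// indicators: numTokens x numFactors multi-hot rows (1 = factor active on the token)
// factorEmb:  numFactors x dimEmb embedding matrix, one row per factor
std::vector<float> embedFactors(const std::vector<float>& indicators,
                                const std::vector<float>& factorEmb,
                                size_t numTokens, size_t numFactors, size_t dimEmb) {
  std::vector<float> out(numTokens * dimEmb, 0.f); // numTokens x dimEmb result
  for (size_t t = 0; t < numTokens; ++t)
    for (size_t f = 0; f < numFactors; ++f)
      if (indicators[t * numFactors + f] != 0.f)   // factor f present on token t
        for (size_t d = 0; d < dimEmb; ++d)        // add its embedding row
          out[t * dimEmb + d] += factorEmb[f * dimEmb + d];
  return out;
}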
src/layers/generic.cpp
Outdated
 #if 0
-    auto batchMask = graph->constant({dimWidth, dimBatch, 1},
-                                     inits::fromVector(subBatch->mask()));
-#else
     // experimental: hide inline-fix source tokens from cross attention
     auto batchMask = graph->constant({dimWidth, dimBatch, 1},
                                      inits::fromVector(subBatch->crossMaskWithInlineFixSourceSuppressed()));
+#else
+    auto batchMask = graph->constant({dimWidth, dimBatch, 1},
+                                     inits::fromVector(subBatch->mask()));
 #endif
Here only the contents of the #if and #else branches of the macro were swapped, mainly because the previous arrangement was throwing a couple of warnings when the corresponding functionality was not used.
src/models/transformer.h
Outdated
-auto Wk = graph_->param(prefix + "_Wk", {dimModel, dimModel}, inits::glorotUniform());
+int dimKeys = keys->shape()[-1];
+auto Wk = graph_->param(prefix + "_Wk", {dimKeys, dimModel}, inits::glorotUniform());
The transformer attention weight matrices were being initialized with shape [dimModel x dimModel]. If concatenation is chosen and we are only using source factors, the embedding sizes of the encoder and the decoder will differ, so in the encoder-decoder attention layer of the decoder we need to initialize the weight matrices for the keys and values with whatever dimension the encoder outputs. For example (illustrative numbers), with dimModel = 512 and an encoder output of 512 + 16 = 528 dimensions after concatenating factor embeddings, Wk must be 528 x 512; hence dimKeys is read from keys->shape()[-1].
src/models/transformer.h
Outdated
-auto Wv = graph_->param(prefix + "_Wv", {dimModel, dimModel}, inits::glorotUniform());
+int dimValues = values->shape()[-1];
+auto Wv = graph_->param(prefix + "_Wv", {dimValues, dimModel}, inits::glorotUniform());
src/data/factored_vocab.cpp
Outdated
// Decodes the lemma and factor indices for each word and outputs that information separately.
// inputs:
//  - words = vector of words
// outputs:
//  - lemmaIndices: lemma index for each word
//  - factorIndices: factor usage information for each word (1 if the factor is used, 0 if not)
void FactoredVocab::lemmaAndFactorsIndexes(const Words& words, std::vector<IndexType>& lemmaIndices, std::vector<float>& factorIndices) const {
  lemmaIndices.reserve(words.size());
  factorIndices.reserve(words.size() * getTotalFactorCount());

  auto numGroups = getNumGroups();
  std::vector<size_t> lemmaAndFactorIndices;

  for (auto &word : words) {
    if (vocab_.contains(word.toWordIndex())) {
      word2factors(word, lemmaAndFactorIndices);
      lemmaIndices.push_back((IndexType) lemmaAndFactorIndices[0]);
      for (size_t g = 1; g < numGroups; g++) {
        auto factorIndex = lemmaAndFactorIndices[g];
        ABORT_IF(factorIndex == FACTOR_NOT_SPECIFIED, "Attempted to embed a word with a factor not specified");
        for (int i = 0; i < factorShape_[g] - 1; i++) {
          factorIndices.push_back((float) (factorIndex == i));
        }
      }
    }
  }
}
Given the word indices of a batch, this function returns two data structures holding, separately, the lemma indices and the factor usage information.
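As a hypothetical worked example (group sizes invented for illustration): with a single factor group of factorShape_[1] = 3, each word contributes one entry to lemmaIndices and two indicator entries to factorIndices, since the inner loop runs up to factorShape_[g] - 1. A word with lemma index 5 and factor index 1 in that group would therefore yield lemmaIndices = {5} and factorIndices = {0.0, 1.0}.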
Hm, I wanted to assign @kpu as a reviewer, but since you submitted in @pedrodiascoelho's stead it seems github doesn't let me do that. Do you have access to the reviewing interface at all?
@pedrodiascoelho Thanks a lot for the comments; that will be helpful.
Apparently I can review my own pull request.
You're welcome @emjotde. I'll be glad and completely available to discuss any questions or doubts regarding my implementation.
Hello @snukky
…o marian-cef-factors
…del.npz config is loaded
@pedrodiascoelho I confirm that after this fix: marian-cef#7, the factors code introduced in this PR is backward compatible, AFAICT.
Move backward compatibility checks for factors to config.cpp
Thanks @pedrodiascoelho! Please take a look at my comments, and feel free to resolve all conversations that you agree with and that you think don't need more of our attention after you fix them.
Hi @pedrodiascoelho; I've left some comments - ping me if something isn't clear.
Also, I added a small comment on marian-nmt/marian-regression-tests#78
@@ -0,0 +1,208 @@
# Using marian with factors |
I think a new section in the sidebar makes sense. For now we can keep the doc/ directory flat, as there are just a few files. But put this file in the nav TOC under Models, or even Vocabularies. Then move the graph documentation under Expression Graph or similar.
All existing regression tests pass with marian-cef@cb44bbc, which is new factors + the most recent Marian master. There is one new test that fails, but this is tracked in marian-nmt/marian-regression-tests#78 (comment)
Hello @snukky! Just pinging you regarding this PR, as you asked me to do during the EU Marian project meeting. I've updated the test that was failing in marian-nmt/marian-regression-tests#78 (comment). Let me know if anything else is missing, and thank you again for the help and the review.
Yes, thanks for the reminder. I updated your branch and tested it again with an internal model last week. All looks good to me. @emjotde wanted to take a look at the changes too, as the factors code is critical for our models, so now we're waiting for his approval. I will keep reminding him.
Hello @snukky. As requested, I'll leave here a summarized version of my Master's thesis, where you can check the results of the experiments comparing concatenation vs. sum (Section 3.3).
Description
Adds options for handling factors in Marian, including concatenation. This is primarily written by @pedrodiascoelho.
How to test
Marian continuous integration tests are passing. Unbabel has also been doing quality testing.
Checklist