Factor concatenation improvements and documentation #748
Conversation
Factors modification from Unbabel
The added doc is a nice touch, thank you for that. @pedrodiascoelho anything specific we should pay attention to? Is it worth working with @snukky to add regression/unit tests?
And a first remark: can you please add comments in the GitHub interface where you added new functionality and where you made refactoring changes? I don't know this part of the code very well and will need a few hints. @snukky I will give you an old production model with factors later this week to make sure things work and there are no regressions for our factored models. We will add that to the regression tests then.
OK, creating a regression test for the existing code is a good idea.
src/common/config_parser.cpp
Outdated
cli.add<int>("--factors-dim-emb",
    "Embedding dimension of the factors. Only used if concat is selected as factors combining form");
cli.add<std::string>("--factors-combine",
    "How to combine the factors and lemma embeddings. Options available: sum, concat",
    "sum");
Since concatenation was implemented, two new options were added. One, --factors-combine, controls whether the lemma and factor embeddings are combined by summing them (the default) or by concatenation.
If sum is chosen, it follows the embedding already implemented by Frank.
If concatenation is chosen, the dimension of the factor embeddings must be specified with --factors-dim-emb.
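For illustration, enabling concatenation on the command line would look roughly like this (the embedding dimension 8 is an arbitrary example; all other required options are elided):

./marian --factors-combine concat --factors-dim-emb 8 ...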
src/common/config_parser.cpp
Outdated
cli.add<std::string>("--factor-predictor",
    "Method to use when predicting target factors. Options: soft-transformer-layer, hard-transformer-layer, lemma-dependent-bias, re-embedding",
    "soft-transformer-layer");
For the target factors, since different decoding options were already implemented, a more user-friendly config option was introduced to make it clearer which options are available.
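For example, switching from the default to the hard variant would look like this (illustrative invocation; the option and its values are taken from the diff above):

./marian --factor-predictor hard-transformer-layer ...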
size_t FactoredVocab::getTotalFactorCount() const {
  return factorVocabSize() - groupRanges_[0].second;
}
Auxiliary function that returns the total number of factors (excluding lemmas) in a factored vocabulary.
Please add this to the code as function documentation. Preferably use the doxygen syntax that we've established in the PR with documentation.
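For instance, a doxygen comment along these lines would cover it (a sketch based on the description above, not necessarily the text that was committed):

/**
 * @brief Returns the total number of factors (excluding lemmas) in this factored vocabulary.
 *
 * Group 0 holds the lemmas, so the count is the factor vocabulary size
 * minus the end of the lemma group's index range.
 */
size_t FactoredVocab::getTotalFactorCount() const {
  return factorVocabSize() - groupRanges_[0].second;
}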
comments added
src/layers/generic.cpp
Outdated
if (opt<std::string>("factorsCombine") == "concat") {
  ABORT_IF(dimFactorEmb == 0, "Embedding: If concatenation is chosen to combine the factor embeddings, a factor embedding size should be specified.");
  int numberOfFactors = (int) factoredVocab_->getTotalFactorCount();
  dimVoc -= numberOfFactors;
  FactorEmbMatrix_ = graph_->param("factor_" + name, {numberOfFactors, dimFactorEmb}, initFunc, fixed);
  LOG_ONCE(info, "[embedding] Combining factors concatenation enabled");
}
If concatenation is chosen, the factors are embedded with a separate matrix from the lemma embeddings: the factor rows are dropped from the lemma embedding matrix (dimVoc -= numberOfFactors) and a dedicated numberOfFactors x dimFactorEmb parameter matrix is created for them.
src/layers/generic.cpp
Outdated
// Embeds a sequence of words (given as indices) that carry factor information; the lemma and factor embeddings are concatenated.
/*private*/ Expr Embedding::embedWithConcat(const Words& data) const {
  auto graph = E_->graph();
  std::vector<IndexType> lemmaIndices;
  std::vector<float> factorIndices;
  factoredVocab_->lemmaAndFactorsIndexes(data, lemmaIndices, factorIndices);
  auto lemmaEmbs = rows(E_, lemmaIndices);
  int dimFactors = FactorEmbMatrix_->shape()[0];
  auto factEmbs = dot(graph->constant({(int) data.size(), dimFactors}, inits::fromVector(factorIndices), Type::float32), FactorEmbMatrix_);

  auto out = concatenate({lemmaEmbs, factEmbs}, -1);

  return out;
}
To embed using concatenation, we first call lemmaAndFactorsIndexes(), which decodes each word index into its lemma and factor information. Lemmas and factors are then embedded separately. To embed the lemmas, we simply select the rows of the lemma embedding matrix (E_) that correspond to the lemma indices returned by lemmaAndFactorsIndexes(). For the factors, since a token can carry more than one factor, we cannot simply select rows of the factor embedding matrix as we did for the lemmas. Instead, the function returns an array of 0s and 1s indicating, for each token, whether each factor is present on it. This vector is turned into a (sparse) multi-hot matrix that is multiplied by the factor embedding matrix, which sums the embeddings of the factors present on each token. The two resulting embeddings (lemmaEmbs and factEmbs) are then concatenated.
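To make that dot product concrete, here is a minimal sketch in plain C++ (no Marian types; embedFactors and the flat row-major layout are illustrative, not the actual code): each token's 0/1 indicator row, multiplied by the factor embedding matrix, accumulates the embedding rows of the factors present on that token.

#include <cstddef>
#include <vector>

// indicators: numTokens x numFactors multi-hot rows (1 = factor active on the token)
// factorEmb:  numFactors x dimEmb embedding matrix, one row per factor
std::vector<float> embedFactors(const std::vector<float>& indicators,
                                const std::vector<float>& factorEmb,
                                size_t numTokens, size_t numFactors, size_t dimEmb) {
  std::vector<float> out(numTokens * dimEmb, 0.f); // numTokens x dimEmb result
  for (size_t t = 0; t < numTokens; ++t)
    for (size_t f = 0; f < numFactors; ++f)
      if (indicators[t * numFactors + f] != 0.f)   // factor f present on token t
        for (size_t d = 0; d < dimEmb; ++d)        // add its embedding row
          out[t * dimEmb + d] += factorEmb[f * dimEmb + d];
  return out;
}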
src/layers/generic.cpp
Outdated
 #if 0
-    auto batchMask = graph->constant({dimWidth, dimBatch, 1},
-                                     inits::fromVector(subBatch->mask()));
-#else
     // experimental: hide inline-fix source tokens from cross attention
     auto batchMask = graph->constant({dimWidth, dimBatch, 1},
                                      inits::fromVector(subBatch->crossMaskWithInlineFixSourceSuppressed()));
+#else
+    auto batchMask = graph->constant({dimWidth, dimBatch, 1},
+                                     inits::fromVector(subBatch->mask()));
 #endif
Here only the contents of the #if and #else branches of the macro were swapped, mainly because the previous arrangement was throwing a couple of warnings when the corresponding functionality was not used.
src/models/transformer.h
Outdated
-auto Wk = graph_->param(prefix + "_Wk", {dimModel, dimModel}, inits::glorotUniform());
+int dimKeys = keys->shape()[-1];
+auto Wk = graph_->param(prefix + "_Wk", {dimKeys, dimModel}, inits::glorotUniform());
The transformer attention weight matrices were being initialized with shape [dimModel x dimModel]. If concatenation is chosen and we are only using source factors, the embedding sizes of the encoder and the decoder will differ, so in the encoder-decoder attention layer of the decoder we need to initialize the weight matrices for the keys and values with whatever dimension the encoder outputs. For example (illustrative numbers), with dimModel = 512 and an encoder output of 512 + 16 = 528 dimensions after concatenating factor embeddings, Wk must be 528 x 512; hence dimKeys is read from keys->shape()[-1].
src/models/transformer.h
Outdated
-auto Wv = graph_->param(prefix + "_Wv", {dimModel, dimModel}, inits::glorotUniform());
+int dimValues = values->shape()[-1];
+auto Wv = graph_->param(prefix + "_Wv", {dimValues, dimModel}, inits::glorotUniform());
src/data/factored_vocab.cpp
Outdated
// Decodes the lemma and factor indices for each word and outputs that information separately.
// inputs:
//  - words = vector of words
// outputs:
//  - lemmaIndices: lemma index for each word
//  - factorIndices: factor usage information for each word (1 if the factor is used, 0 if not)
void FactoredVocab::lemmaAndFactorsIndexes(const Words& words, std::vector<IndexType>& lemmaIndices, std::vector<float>& factorIndices) const {
  lemmaIndices.reserve(words.size());
  factorIndices.reserve(words.size() * getTotalFactorCount());

  auto numGroups = getNumGroups();
  std::vector<size_t> lemmaAndFactorIndices;

  for (auto &word : words) {
    if (vocab_.contains(word.toWordIndex())) {
      word2factors(word, lemmaAndFactorIndices);
      lemmaIndices.push_back((IndexType) lemmaAndFactorIndices[0]);
      for (size_t g = 1; g < numGroups; g++) {
        auto factorIndex = lemmaAndFactorIndices[g];
        ABORT_IF(factorIndex == FACTOR_NOT_SPECIFIED, "Attempted to embed a word with a factor not specified");
        for (int i = 0; i < factorShape_[g] - 1; i++) {
          factorIndices.push_back((float) (factorIndex == i));
        }
      }
    }
  }
}
Given the word indices of a batch, this function returns two data structures holding, separately, the lemma indices and the factor usage information.
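As a hypothetical worked example (group sizes invented for illustration): with a single factor group of factorShape_[1] = 3, each word contributes one entry to lemmaIndices and two indicator entries to factorIndices, since the inner loop runs up to factorShape_[g] - 1. A word with lemma index 5 and factor index 1 in that group would therefore yield lemmaIndices = {5} and factorIndices = {0.0, 1.0}.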
Hm, I wanted to assign @kpu as a reviewer, but since you submitted in @pedrodiascoelho's stead it seems github doesn't let me do that. Do you have access to the reviewing interface at all?
@pedrodiascoelho Thanks a lot for the comments; that will be helpful.
Apparently I can review my own pull request.
You're welcome @emjotde. I'll be glad and completely available to discuss any questions or doubts regarding my implementation.
Hello @snukky
…o marian-cef-factors
…del.npz config is loaded
@pedrodiascoelho I confirm that after this fix: marian-cef#7, the factors code introduced in this PR is backward compatible, AFAICT.
Move backward compatibility checks for factors to config.cpp
Thanks @pedrodiascoelho! Please take a look at my comments, and feel free to resolve all conversations that you agree with and that you think don't need more of our attention after you fix them.
Hi @pedrodiascoelho; I've left some comments - ping me if something isn't clear.
Also, I added a small comment on marian-nmt/marian-regression-tests#78
@@ -0,0 +1,208 @@
# Using marian with factors |
I think a new section in the sidebar makes sense. For now we can keep the doc/ directory flat, as there are just a few files. But put this file in the nav TOC under Models, or even Vocabularies. Then move the graph documentation under Expression Graph or similar.
All existing regression tests pass with marian-cef@cb44bbc, which is new factors + the most recent Marian master. There is one new test that fails, but this is tracked in marian-nmt/marian-regression-tests#78 (comment)
Hello @snukky! Just pinging you regarding this PR, as you asked me to do during the EU Marian project meeting. I've updated the test that was failing in marian-nmt/marian-regression-tests#78 (comment). Let me know if anything else is missing, and thank you again for the help and the review.
Yes, thanks for the reminder. I updated your branch and tested it again with an internal model last week. All looks good to me. @emjotde wanted to take a look at the changes too, as the factors code is critical for our models, so now we're waiting for his approval. I will keep reminding him.
Hello @snukky. As requested, I'll leave here a summarized version of my Master's thesis, where you can check the results of the experiments comparing concatenation vs. sum (Section 3.3).
Description
Adds options for handling factors in Marian, including concatenation. This is primarily written by @pedrodiascoelho.
How to test
Marian continuous integration tests are passing. Unbabel has also been doing quality testing.
Checklist