Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary shortlist #856

Merged
merged 110 commits into from
Jul 11, 2021
Merged

Binary shortlist #856

merged 110 commits into from
Jul 11, 2021

Conversation

qianqianzhu
Copy link
Contributor

@qianqianzhu qianqianzhu commented Apr 15, 2021

Description

This PR adds a new binary shortlist generator which allows the decoder to load a lexical shortlist from a binary blob.
The new binary shortlist can greatly improve the lexical shortlist loading time (x4309 speed-up).
This feature has been integrated into the bergamot branch.

List of changes:

  • Add BinaryShortlistGenerator Class to load lexical shortlist from a binary blob.
    To use a binary shortlist, you should provide the file path and whether the file needs checking (optional; true by default).
    For example:
--shortlist lex.esen.bin false
  • Add a shortlist converter which can convert a text lexical shortlist to a binary shortlist. This converter is integrated into marian-conv with --shortlist,-s option
    Shortlist conversion usage example:
./marian-conv --shortlist lex.esen.s2t 100 100 0 \
               --dump lex.esen.bin \
               --vocabs vocab.esen.spm vocab.esen.spm
  • Update Marian decoder with an option of loading binary shortlist
  • Update hash.h for checksum calculation

Added dependencies: none

How to test

Manually tested new BinaryShortlistGenerator with LexicalShortlistGenerator. The results are the same. Also passed unit tests.
Regression tests are passed with gcc 8.4.0 and cuda 10.1.

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

…uicksand

This PR turns the LSH index and search into a set of operators that live in the expression graph. This makes creation etc. thread-safe (one index per graph) and allows to later implement GPU versions.

This allows to mmap the LSH as a Marian parameter since now we only need to turn the index into something that can be saved to disk using the existing tensors. This happens in marian_conv or the equivalent interface function in the Quicksand interface.
@emjotde
Copy link
Member

emjotde commented Jul 9, 2021

@hieuhoang can you look at the merge conflict in shortlist.cpp? I will look into marian_conv

@emjotde emjotde requested review from emjotde and hieuhoang July 9, 2021 22:16
@hieuhoang
Copy link
Collaborator

regression past, the usual tests failed:
Skipped:

  • tests/examples/mnist/test_mnist_ffnn.sh
  • tests/examples/iris/test_iris.sh
  • tests/examples/unit-tests/test_unit_tests.sh
  • tests/server/test_ende.sh
  • tests/server/test_ende_align.sh
  • tests/server/test_ende_batch32.sh
  • tests/server/test_ende_cpu.sh
  • tests/server/test_ende_with_empty_lines.sh
  • tests/decoder/shortlist/test_shortlist_server.sh
  • tests/interface/input-tsv/test_tsv_server.sh
  • tests/interface/input-tsv/test_tsv_server_dual_source.sh
  • tests/models/wngt19/test_model_base_fbgemm_packed16.sh
  • tests/models/wngt19/test_model_base_fbgemm_packed8.sh
  • tests/training/features/exp-smoothing/test_expsmooth_sync.sh
  • tests/training/multi-gpu/test_async_sgd_runs.sh
  • tests/training/multi-gpu/test_sync_sgd.sh
  • tests/training/restoring/optimizer/test_adam_params_async.sh
  • tests/training/restoring/optimizer/test_adam_params_sync.sh
  • tests/training/restoring/multi-gpu/test_adam_sync.sh
  • tests/training/restoring/multi-gpu/test_async.sh
  • tests/training/restoring/multi-gpu/test_sync.sh
  • tests/training/restoring/exp-smoothing/test_expsmooth_sync.sh
    Failed:
  • tests/decoder/intgemm/test_intgemm_16bit.sh
  • tests/decoder/intgemm/test_intgemm_16bit_avx2.sh
  • tests/decoder/intgemm/test_intgemm_16bit_sse2.sh
  • tests/decoder/intgemm/test_intgemm_8bit.sh
  • tests/decoder/intgemm/test_intgemm_8bit_avx2.sh
  • tests/decoder/intgemm/test_intgemm_8bit_ssse3.sh

Copy link
Collaborator

@hieuhoang hieuhoang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all good. Haven't tested the actual binary shortlist. You guys might wanna add a regression test to protect your investment

@emjotde
Copy link
Member

emjotde commented Jul 10, 2021

Hm, looks like Github didn't realize that we updated master too and now shows the merge as changes.

Copy link
Member

@emjotde emjotde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, looks good. Sorry for the weird github merge chaos here. Let us know if we broke anything on your side.

cli->add<std::vector<std::string>>("--vocabs,-V", "Vocabulary file, required for ONNX export");
cli->add<std::vector<std::string>>("--shortlist,-s", "Shortlist conversion: filePath firstNum bestNum threshold");
cli->add<std::string>("--dump,-d", "Binary shortlist dump path","lex.bin");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename this to dump-shortlist, dump is too generic.

Ptr<const data::ShortlistGenerator> binaryShortlistGenerator
= New<data::BinaryShortlistGenerator>(options, srcVocab, trgVocab, 0, 1, vocabPaths[0] == vocabPaths[1]);
binaryShortlistGenerator->dump(dumpPath);
LOG(info, "Finished");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly as above, mention that it's the dumping of the shortlist that finished, otherwise a bit too generic.

@@ -1,5 +1,8 @@
#include "data/shortlist.h"
#include "microsoft/shortlist/utils/ParameterTree.h"
#include "marian.h"
#include "layers/lsh.h"
#include <queue>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: we usually add an empty line between our and system headers.

@emjotde
Copy link
Member

emjotde commented Jul 10, 2021

@qianqianzhu after the option is renamed this is good to merge.

@emjotde emjotde merged commit 42f0b8b into master Jul 11, 2021
@qianqianzhu qianqianzhu deleted the binary_shortlist branch September 6, 2021 11:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants