Commit 24798ce

committed Mar 25, 2024
README, remove outdated license
1 parent d39b906 commit 24798ce

4 files changed: +39 -1756 lines changed

‎LICENSE.md

+1 -5
@@ -18,8 +18,4 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
-
-**Third Party Licenses**
-
-This project also contains a modified version of the MIT licensed library `tokengrams-rs`, Copyright (c) 2021 Shunsuke Kanda.
+SOFTWARE.

‎README.md

+38
@@ -0,0 +1,38 @@
+# Tokengrams
+This library allows you to efficiently compute $n$-gram statistics for pre-tokenized text corpora used to train large language models. It does this not by explicitly pre-computing the $n$-gram counts for fixed $n$, but by creating a [suffix array](https://en.wikipedia.org/wiki/Suffix_array) index which allows you to efficiently compute the count of an $n$-gram on the fly for any $n$.
+
+Our code also allows you to turn your suffix array index into an efficient $n$-gram language model, which can be used to generate text or compute the perplexity of a given text.
+
+The backend is written in Rust, and the Python bindings are generated using [PyO3](https://github.com/PyO3/pyo3).
+
+# Installation
+Currently you need to build and install from source using `maturin`. We plan to release wheels on PyPI soon.
+
+```bash
+pip install maturin
+maturin develop
+```
+
+# Usage
+```python
+from tokengrams import MemmapIndex
+
+# Create a new index from an on-disk corpus called `document.bin` and save it to
+# `pile.idx`
+index = MemmapIndex.build(
+    "/mnt/ssd-1/pile_preshuffled/standard/document.bin",
+    "/mnt/ssd-1/nora/pile.idx",
+)
+
+# Get the count of "hello world" in the corpus
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
+print(index.count(tokenizer.encode("hello world")))
+
+# You can now load the index from disk later using __init__
+index = MemmapIndex(
+    "/mnt/ssd-1/pile_preshuffled/standard/document.bin",
+    "/mnt/ssd-1/nora/pile.idx"
+)
+```
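
Note: the README above says a suffix array lets you count an $n$-gram on the fly for any $n$. As a rough illustration of that idea only (not the library's Rust implementation), here is a minimal pure-Python sketch. The helper names `build_suffix_array` and `count_ngram` are hypothetical, the corpus is a plain list of token IDs, and the construction is deliberately naive.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes of `tokens`, sorted lexicographically.

    Naive construction that compares whole suffixes; only suitable for toy inputs.
    """
    return sorted(range(len(tokens)), key=lambda start: tokens[start:])


def _bound(tokens, suffix_array, query, upper):
    """Binary search over suffixes truncated to len(query) tokens.

    Returns the first position whose truncated suffix is >= query (upper=False)
    or > query (upper=True). Truncation preserves the sorted order.
    """
    n = len(query)
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        prefix = tokens[start:start + n]
        if prefix < query or (upper and prefix == query):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_ngram(tokens, suffix_array, query):
    """Count occurrences of `query` (a list of token IDs of any length) in `tokens`."""
    query = list(query)
    first = _bound(tokens, suffix_array, query, upper=False)
    last = _bound(tokens, suffix_array, query, upper=True)
    return last - first


# Toy corpus of token IDs (made up for illustration)
tokens = [1, 2, 3, 1, 2, 4, 1, 2]
suffix_array = build_suffix_array(tokens)
print(count_ngram(tokens, suffix_array, [1, 2]))
print(count_ngram(tokens, suffix_array, [1, 2, 4]))
```

All suffixes that begin with the query sit in one contiguous run of the sorted array, so two binary searches find the run's boundaries and the count is their difference. The same index therefore answers queries of any length without precomputing counts for a fixed $n$.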

‎prototyping.ipynb

-1,730
This file was deleted.

‎src/table.rs

-21
@@ -1,31 +1,10 @@
-/* This code is almost entirely based on suffix from BurntSushi. The original
- * program was licensed under the MIT license. We have modified it for
- * for two reasons:
- *
- * 1. The original implementation used u32 indices to point into the
- * suffix array. This is smaller and fairly cache efficient, but here
- * in the Real World we have to work with Big Data and our datasets
- * are bigger than 2^32 bytes. So we have to work with u64 instead.
- *
- * 2. The original implementation had a utf8 interface. This is very
- * convenient if you're working with strings, but we are working with
- * byte arrays almost exclusively, and so just cut out the strings.
- *
- * When the comments below contradict these two statements, that's why.
- */
 extern crate utf16_literal;
 
 use rayon::prelude::*;
 use serde::{Deserialize, Serialize};
 use std::{fmt, ops::Deref, u64};
 
 /// A suffix table is a sequence of lexicographically sorted suffixes.
-///
-/// This is distinct from a suffix array in that it *only* contains
-/// suffix indices. It has no "enhanced" information like the inverse suffix
-/// table or least-common-prefix lengths (LCP array). This representation
-/// limits what you can do (and how fast), but it uses very little memory
-/// (4 bytes per character in the text).
 #[derive(Clone, Deserialize, Eq, PartialEq, Serialize)]
 pub struct SuffixTable<T = Box<[u16]>, U = Box<[u64]>> {
     text: T,
