
Reducing cloning time #65

Closed · charlesreid1 opened this issue Mar 23, 2018 · 8 comments

charlesreid1 (Member) commented Mar 23, 2018

Cloning the repo is taking a lot of time (5+ minutes).

Here's a list of the 40 largest objects in the GitHub repo (in order of increasing size):

$ git rev-list --all --objects | \
     sed -n $(git rev-list --objects --all | \
     cut -f1 -d' ' | \
     git cat-file --batch-check | \
     grep blob | \
     sort -n -k 3 | \
     tail -n40 | \
     while read hash type size; do
          echo -n "-e s/$hash/$size/p ";
     done) | \
     sort -n -k1
854346 workflows/assembly/spades_output_podar_metaG_50_quast_report/icarus_viewers/contig_size_viewer.html
854544 workflows/assembly/spades_output_podar_metaG_100_quast_report/icarus_viewers/contig_size_viewer.html
861286 workflows/assembly/megahit_output_podar_metaG_sub_10_quast_report/icarus_viewers/contig_size_viewer.html
861823 workflows/assembly/megahit_output_podar_metaG_sub_25_quast_report/icarus_viewers/contig_size_viewer.html
862217 workflows/assembly/megahit_output_podar_metaG_sub_50_quast_report/icarus_viewers/contig_size_viewer.html
862407 workflows/assembly/megahit_output_podar_metaG_100_quast_report/icarus_viewers/contig_size_viewer.html
1301928 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.tsv.gz
1343958 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.tsv.gz
1448417 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.err.gz
1838386 workflows/functional_inference/ResFinder.fasta
2744850 workflows/taxonomic_classification/sourmash/Taxonomic_classification_with_sourmash.ipynb
2833725 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.err.gz
2854395 workflows/taxonomic_classification/sourmash/Taxonomic_classification_with_sourmash.ipynb
2958599 workflows/taxonomic_classification/sourmash/Taxonomic_classification_with_sourmash.ipynb
3204044 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.tbl.gz
3674552 examples/data/mg_3.fna.gz
3756053 examples/data/mg_7.fna.gz
4120132 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.tbl.gz
4155453 examples/data/mg_1.fna.gz
4155453 examples/data/mg_5.fna.gz
4805518 examples/data/mg_6.fna.gz
4964025 workflows/functional_inference/srst2/srst2_output_SRR606249_subset50__SRR606249_subset50.ResFinder.pileup
5428570 examples/data/mg_2.fna.gz
5428570 examples/data/mg_4.fna.gz
6046025 examples/data/mg_8.fna.gz
13831838 examples/data/mg_1.fna
17322165 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.faa.gz
17733035 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.faa.gz
26409296 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.ffn.gz
27017067 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.ffn.gz
40342689 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.fna.gz
40637961 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.fsa.gz
44599134 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.gff.gz
48484167 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.fna.gz
48884831 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.fsa.gz
53736855 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.gff.gz
81016010 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.sqn.gz
84063255 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.gbk.gz
91417417 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.sqn.gz
93531380 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.gbk.gz
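
For a quick sanity check on the totals, something like the following should work (standard git plumbing plus awk; biggest_blobs.txt is a hypothetical file holding the listing above):

# overall packed size of the repository, human-readable
$ git count-objects -vH

# sum the first column (bytes) of the listing, assuming it was saved to
# a file named biggest_blobs.txt
$ awk '{ sum += $1 } END { printf "%.1f MB\n", sum / 1024 / 1024 }' biggest_blobs.txt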

A couple of thoughts on this problem:

  • git lfs is intended to circumvent this problem (although once a large file has been committed, the damage is done). If we add large data files to the repo in the future, we should use git lfs so the repo stores only a pointer to each file's location in the cloud (see the sketch after this list).
  • If we are okay with rewriting some of the repo's history, we can remove the large data files from it using a tool like git-forget-blob. This erases the data files and any blobs that referred to them, saving us some space.
  • If modifying history is forbidden, we could leave the inflated repo alone through the release of dahak version 1.0 and create a new repo for dahak 2.0 (or some variation on this idea).
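
A rough sketch of the git lfs route, assuming the bulk of the weight is in the .gz data files (the track patterns below are illustrative, not decided):

# one-time setup in a clone
$ git lfs install

# track the heavy data files; patterns are illustrative
$ git lfs track "examples/data/*.fna.gz" "workflows/**/*.gz"
$ git add .gitattributes
$ git commit -m "Track large data files with git-lfs"

Note that lfs only applies to files committed after the pattern is tracked; blobs already in history stay where they are, so a history rewrite (second bullet) would still be needed to shrink the existing clone.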

As for where these large files might be migrated:

  • a data repo on GitHub
  • an AWS/GC bucket
  • OSF file storage

It seems like the first option would be best, since these files were successfully added to a GitHub repo at some point.

charlesreid1 (Member Author)

The last few items are related to #53.

ctb (Contributor) commented Mar 23, 2018 via email

charlesreid1 (Member Author) commented Mar 27, 2018 via email

charlesreid1 (Member Author) commented Mar 27, 2018 via email

charlesreid1 (Member Author)

This is ready to go. Steps for performing a git-commit-ectomy: https://github.com/charlesreid1/git-commit-ectomy

There are a few key things to be aware of when using git-forget-blob, namely: (a) it requires GNU sed, so it won't work out of the box on a Mac, and (b) you have to git push --force or you'll end up with a duplicate version of every commit. Also voids warranties.
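
For reference, the general shape of the surgery, sketched here with plain git filter-branch rather than the git-forget-blob script itself (the .gbk.gz path is just one of the offenders from the listing above):

$ git filter-branch --force --index-filter \
      'git rm --cached --ignore-unmatch workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.gbk.gz' \
      --prune-empty --tag-name-filter cat -- --all

# expire the old refs and repack so the space is actually reclaimed
$ rm -rf .git/refs/original/
$ git reflog expire --expire=now --all
$ git gc --prune=now --aggressive

# without --force, the rewritten commits end up alongside the old ones
$ git push origin --force --all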

stephenturner

Looks scary.

FWIW, you can brew install gnu-sed --with-default-names on macOS. I never have any use for BSD's sed anyhow.

charlesreid1 self-assigned this Apr 17, 2018
charlesreid1 (Member Author)

This change should only affect contributors to the repo. Once the commits have been removed, contributors will need to clone a fresh copy of the repo (otherwise they might accidentally re-add the removed commits).

dahak contributors
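
Roughly what that looks like on the contributor side (the remote URL below is assumed; keep the old working copy around if it has unpushed work):

# set the old clone aside in case it has unpushed work
$ mv dahak dahak-old

# start fresh so only the rewritten history is present locally
$ git clone https://github.com/dahak-metagenomics/dahak.git   # assumed remote URL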

charlesreid1 (Member Author)

git-commit-ectomy: complete.
new repo size: 24 MB.
new clone time: 6 seconds.

⚡️⚡️⚡️ 💯 ⚡️⚡️⚡️
