
Reducing cloning time #65

Closed · charlesreid1 opened this issue Mar 23, 2018 · 8 comments

charlesreid1 (Member) commented Mar 23, 2018

Cloning the repo is taking a lot of time (5+ minutes).

Here's a list of the 40 largest objects in the GitHub repo (in order of increasing size):

$ git rev-list --all --objects | \
     sed -n $(git rev-list --objects --all | \
     cut -f1 -d' ' | \
     git cat-file --batch-check | \
     grep blob | \
     sort -n -k 3 | \
     tail -n40 | \
     while read hash type size; do
          echo -n "-e s/$hash/$size/p ";
     done) | \
     sort -n -k1
854346 workflows/assembly/spades_output_podar_metaG_50_quast_report/icarus_viewers/contig_size_viewer.html
854544 workflows/assembly/spades_output_podar_metaG_100_quast_report/icarus_viewers/contig_size_viewer.html
861286 workflows/assembly/megahit_output_podar_metaG_sub_10_quast_report/icarus_viewers/contig_size_viewer.html
861823 workflows/assembly/megahit_output_podar_metaG_sub_25_quast_report/icarus_viewers/contig_size_viewer.html
862217 workflows/assembly/megahit_output_podar_metaG_sub_50_quast_report/icarus_viewers/contig_size_viewer.html
862407 workflows/assembly/megahit_output_podar_metaG_100_quast_report/icarus_viewers/contig_size_viewer.html
1301928 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.tsv.gz
1343958 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.tsv.gz
1448417 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.err.gz
1838386 workflows/functional_inference/ResFinder.fasta
2744850 workflows/taxonomic_classification/sourmash/Taxonomic_classification_with_sourmash.ipynb
2833725 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.err.gz
2854395 workflows/taxonomic_classification/sourmash/Taxonomic_classification_with_sourmash.ipynb
2958599 workflows/taxonomic_classification/sourmash/Taxonomic_classification_with_sourmash.ipynb
3204044 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.tbl.gz
3674552 examples/data/mg_3.fna.gz
3756053 examples/data/mg_7.fna.gz
4120132 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.tbl.gz
4155453 examples/data/mg_1.fna.gz
4155453 examples/data/mg_5.fna.gz
4805518 examples/data/mg_6.fna.gz
4964025 workflows/functional_inference/srst2/srst2_output_SRR606249_subset50__SRR606249_subset50.ResFinder.pileup
5428570 examples/data/mg_2.fna.gz
5428570 examples/data/mg_4.fna.gz
6046025 examples/data/mg_8.fna.gz
13831838 examples/data/mg_1.fna
17322165 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.faa.gz
17733035 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.faa.gz
26409296 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.ffn.gz
27017067 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.ffn.gz
40342689 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.fna.gz
40637961 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.fsa.gz
44599134 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.gff.gz
48484167 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.fna.gz
48884831 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.fsa.gz
53736855 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.gff.gz
81016010 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.sqn.gz
84063255 workflows/functional_inference/prokka_annotation_megahit/podar_metaG_sub_10_megahit.gbk.gz
91417417 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.sqn.gz
93531380 workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.gbk.gz
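
For a quick sanity check on the totals, something like the following should work (standard git plumbing plus awk; biggest_blobs.txt is a hypothetical file holding the listing above):

# overall packed size of the repository, human-readable
$ git count-objects -vH

# sum the first column (bytes) of the listing, assuming it was saved to
# a file named biggest_blobs.txt
$ awk '{ sum += $1 } END { printf "%.1f MB\n", sum / 1024 / 1024 }' biggest_blobs.txt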

A couple of thoughts on this problem:

  • git lfs is intended to circumvent this problem (although once a large file has been committed, the damage is done). If we add large data files to the repo in the future, we should use git lfs so the repo stores only a pointer to each file's location in the cloud (see the sketch after this list).
  • If we are okay with rewriting some of the repo's history, we can remove the large data files from it using a tool like git-forget-blob. This erases the data files and any blobs that referred to them, saving us some space.
  • If modifying history is forbidden, we could leave the inflated repo alone through the release of dahak version 1.0 and create a new repo for dahak 2.0 (or some variation on this idea).
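
A rough sketch of the git lfs route, assuming the bulk of the weight is in the .gz data files (the track patterns below are illustrative, not decided):

# one-time setup in a clone
$ git lfs install

# track the heavy data files; patterns are illustrative
$ git lfs track "examples/data/*.fna.gz" "workflows/**/*.gz"
$ git add .gitattributes
$ git commit -m "Track large data files with git-lfs"

Note that lfs only applies to files committed after the pattern is tracked; blobs already in history stay where they are, so a history rewrite (second bullet) would still be needed to shrink the existing clone.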

As for where these large files might be migrated:

  • a data repo on GitHub
  • an AWS/GC bucket
  • OSF file storage

It seems like the first option would be best, since these files were successfully added to a GitHub repo at some point.

charlesreid1 (Member Author)

The last few items are related to #53.

ctb (Contributor) commented Mar 23, 2018 via email

charlesreid1 (Member Author) commented Mar 27, 2018 via email

charlesreid1 (Member Author) commented Mar 27, 2018 via email

charlesreid1 (Member Author)

This is ready to go. Steps for performing a git-commit-ectomy: https://github.com/charlesreid1/git-commit-ectomy

There are a few key things to be aware of when using git-forget-blob, namely: (a) it requires GNU sed, so it won't work out of the box on a Mac, and (b) you have to git push --force or you'll end up with a duplicate version of every commit. Also voids warranties.
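
For reference, the general shape of the surgery, sketched here with plain git filter-branch rather than the git-forget-blob script itself (the .gbk.gz path is just one of the offenders from the listing above):

$ git filter-branch --force --index-filter \
      'git rm --cached --ignore-unmatch workflows/functional_inference/prokka_annotation_spades/podar_metaG_sub_10_spades.gbk.gz' \
      --prune-empty --tag-name-filter cat -- --all

# expire the old refs and repack so the space is actually reclaimed
$ rm -rf .git/refs/original/
$ git reflog expire --expire=now --all
$ git gc --prune=now --aggressive

# without --force, the rewritten commits end up alongside the old ones
$ git push origin --force --all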

stephenturner

Looks scary.

FWIW, you can brew install gnu-sed --with-default-names on macOS. I never have any use for BSD's sed anyhow.

charlesreid1 self-assigned this Apr 17, 2018
charlesreid1 (Member Author)

This change should only affect contributors to the repo. Once the commits have been removed, contributors will need to clone a fresh copy of the repo (otherwise they might accidentally re-add the removed commits).

dahak contributors
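
Roughly what that looks like on the contributor side (the remote URL below is assumed; keep the old working copy around if it has unpushed work):

# set the old clone aside in case it has unpushed work
$ mv dahak dahak-old

# start fresh so only the rewritten history is present locally
$ git clone https://github.com/dahak-metagenomics/dahak.git   # assumed remote URL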

charlesreid1 (Member Author)

git-commit-ectomy: complete.
new repo size: 24 MB.
new clone time: 6 seconds.

⚡️⚡️⚡️ 💯 ⚡️⚡️⚡️
