[Feature Request] s5cmd command support for COPY Persistent Data Storage #2176

Closed
turian opened this issue Jul 5, 2023 · 12 comments

@turian
Contributor

turian commented Jul 5, 2023

Persistent Data Storage that is COPY'd can be quite slow to copy.

s5cmd is MUCH faster at uploading to and downloading from S3 / R2 than s3cmd and aws-cli. Like MUCH faster: "For uploads, s5cmd is 32x faster than s3cmd and 12x faster than aws-cli. For downloads, s5cmd can saturate a 40Gbps link (~4.3 GB/s), whereas s3cmd and aws-cli can only reach 85 MB/s and 375 MB/s respectively."

NOTE: s5cmd is only fast when it can parallelize into many downloads. It is of comparable speed if you are just downloading one big file.
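
To illustrate the note about parallelism, here is a minimal Python sketch. It is not s5cmd's implementation: `fetch_object` is a hypothetical stand-in that simulates per-object latency with `time.sleep`, and the 20 ms delay and worker count are assumptions chosen only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-in for one object download. The fixed delay models
# per-request latency and is an assumption, not a measurement.
def fetch_object(key: str, latency_s: float = 0.02) -> str:
    time.sleep(latency_s)
    return key


def download_serial(keys: list[str]) -> list[str]:
    # One request at a time: total time grows with the number of objects.
    return [fetch_object(k) for k in keys]


def download_parallel(keys: list[str], workers: int = 10) -> list[str]:
    # Fan requests out over a thread pool; map() preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_object, keys))


def timed(fn, *args):
    # Return (result, elapsed seconds) for a single call.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    many_small = [f"file-{i}" for i in range(20)]
    _, serial_s = timed(download_serial, many_small)
    _, parallel_s = timed(download_parallel, many_small)
    print(f"20 objects: serial {serial_s:.2f}s, parallel {parallel_s:.2f}s")

    one_big = ["big-file"]
    _, serial_s = timed(download_serial, one_big)
    _, parallel_s = timed(download_parallel, one_big)
    print(f"1 object:   serial {serial_s:.2f}s, parallel {parallel_s:.2f}s")
```

With a single object there is nothing to fan out, so both paths take about the same time, which is the situation the note describes.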

@romilbhardwaj
Collaborator

Thanks for the pointer @turian! We'll look into using s5cmd and integrate it if it improves perf without breaking user workflows.

cc @landscapepainter

@romilbhardwaj romilbhardwaj added this to the Storage milestone Jul 5, 2023
@romilbhardwaj romilbhardwaj added the enhancement, good first issue, nice to have, and investigate labels and removed the good first issue label Jul 5, 2023
@landscapepainter
Collaborator

I will go ahead and benchmark it for comparison and check whether it has all the functionality SkyPilot needs. Thanks @turian!

@turian
Contributor Author

turian commented Jul 6, 2023

@landscapepainter @romilbhardwaj Awesome! It's a nifty little tool, because usually the most painful step when you spin up a machine is getting the data onto disk from blob storage. Total dream come true.

I'm crossing my fingers it's an easy integration for the SkyPilot team.

@landscapepainter landscapepainter self-assigned this Jul 9, 2023
@landscapepainter
Collaborator

landscapepainter commented Jul 16, 2023

Confirmed that s5cmd supports a symlink exclusion flag, the --exclude option, and MD5 checksums for uploads. It also supports specifying credentials and endpoints with --profile/--credentials-file/--endpoint-url (for R2 support). However, it does not yet have an --include option, which is used in storage.py/get_file_sync_command for R2Store and S3Store.
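
For illustration, here is a rough Python sketch of how a sync command could be assembled from the options listed above. `build_s5cmd_sync_command` is a hypothetical helper, not the actual get_file_sync_command in storage.py, and placing the credential and endpoint options before the `sync` subcommand is my assumption about s5cmd's CLI layout that should be checked against its documentation. The bucket, profile, and endpoint values are placeholders.

```python
import shlex
from typing import Optional, Sequence


def build_s5cmd_sync_command(
    source: str,
    destination: str,
    excludes: Sequence[str] = (),
    profile: Optional[str] = None,
    credentials_file: Optional[str] = None,
    endpoint_url: Optional[str] = None,
) -> list[str]:
    """Hypothetical helper: assemble an `s5cmd sync` argument list."""
    cmd = ["s5cmd"]
    # Assumption: credential and endpoint options are global, so they are
    # placed before the subcommand.
    if profile:
        cmd += ["--profile", profile]
    if credentials_file:
        cmd += ["--credentials-file", credentials_file]
    if endpoint_url:
        cmd += ["--endpoint-url", endpoint_url]
    cmd.append("sync")
    # One --exclude per pattern; there is no --include counterpart here,
    # matching the gap noted in the comment above.
    for pattern in excludes:
        cmd += ["--exclude", pattern]
    cmd += [source, destination]
    return cmd


if __name__ == "__main__":
    cmd = build_s5cmd_sync_command(
        "data/",
        "s3://my-bucket/data/",          # placeholder bucket
        excludes=["*.git/*"],
        profile="r2",                    # placeholder profile name
        endpoint_url="https://example.invalid",  # placeholder endpoint
    )
    # shlex.join quotes each argument so the string is safe to paste in a shell.
    print(shlex.join(cmd))
```

Returning a list rather than a preformatted string keeps quoting concerns out of the helper; callers that need a shell string can use `shlex.join`.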

Ran a quick benchmark, and it is impressively quick compared to aws-cli.

Syncing 1,000 files of 1 MB each from local disk to S3 (medium-tier disk):

  • s5cmd sync: 7.53s
  • aws s3 sync with 10 threads (default): 20.73s

Syncing 6 files of 10 GB each from local disk to S3 (medium-tier disk):

  • s5cmd sync: 386.55s
  • aws s3 sync with 10 threads (default): 612.96s
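
To put the timings above in relative terms, here is a small Python sketch that computes the speedup of s5cmd over aws s3 sync for each workload. The numbers are copied from the benchmark; the dictionary layout and the `speedup` helper are just illustrative names.

```python
# Wall-clock timings in seconds, copied from the benchmark above.
BENCHMARKS = {
    "1000 x 1MB": {"s5cmd sync": 7.53, "aws s3 sync": 20.73},
    "6 x 10GB": {"s5cmd sync": 386.55, "aws s3 sync": 612.96},
}


def speedup(timings: dict[str, float]) -> float:
    # Ratio > 1 means s5cmd finished faster than aws s3 sync.
    return timings["aws s3 sync"] / timings["s5cmd sync"]


if __name__ == "__main__":
    for workload, timings in BENCHMARKS.items():
        print(f"{workload}: s5cmd is {speedup(timings):.2f}x faster")
```

This works out to roughly 2.75x for the many-small-files case and roughly 1.59x for the few-large-files case, so the gap is widest where there are many objects to transfer in parallel.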


We may need to do more checks before we actually implement it, but this seems promising! @romilbhardwaj @turian

Contributor

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Nov 14, 2023
Contributor

This issue was closed because it has been stalled for 10 days with no activity.

@github-actions github-actions bot closed this as not planned Nov 24, 2023
@romilbhardwaj
Collaborator

#2291 is fixing this

@romilbhardwaj romilbhardwaj reopened this Nov 24, 2023
@github-actions github-actions bot removed the Stale label Nov 25, 2023
@turian
Contributor Author

turian commented Nov 26, 2023

Amazing! Looking forward to seeing it land

Contributor

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Mar 26, 2024
Contributor

github-actions bot commented Apr 5, 2024

This issue was closed because it has been stalled for 10 days with no activity.

@github-actions github-actions bot closed this as not planned Apr 5, 2024
@github-actions github-actions bot removed the Stale label Apr 6, 2024
Contributor

github-actions bot commented Aug 5, 2024

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Aug 5, 2024
Contributor

This issue was closed because it has been stalled for 10 days with no activity.

@github-actions github-actions bot closed this as not planned Aug 16, 2024