Need better error handling when there is a failure to fetch a file #78
yeah we should handle this better. I'm not sure how, but if we could abort the whole download that's probably the best option... |
i googled a bit for some possible solutions here... https://stackoverflow.com/questions/15305203/what-to-do-with-errors-when-streaming-the-body-of-an-http-request - these two options seem like the best leads: signalling the failure in an HTTP trailer, or deliberately sending a malformed chunk.
the trailer seems pretty robust, but i'm not sure how to do that. a malformed chunk? not sure how to make rails do that either. if anybody else has any clues, let me know |
Hey, just an idea, it may be stupid, but maybe not, so... If zipline fails to fetch a file, or any other error occurs:
- log it
- if it was a failure to fetch a file, add the file URL to some kind of internal @skipped list
- if possible, do NOT add the failed file to the archive. If that's not possible, it's better to have a working archive with a few broken files than a wholly broken archive, especially if it's multiple GBs in size
- move on to the next item in the list
- in the end, add a text file named, say, "failures.log", listing the @skipped files with their corresponding HTTP errors

I don't know the internal mechanics of zipline, so if skipping is not possible, then write the closing sequence for the broken file (if needed/possible), write the "failures.log", and finish up with a proper zip header (well, footer, technically) - to allow the client to unarchive at least what has been fetched so far. Optionally, add a configuration/options/settings/initialiser to customise the name and content (using a proc) of the failures log file (or disable it completely). WDYT?... |
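(A minimal sketch of the behaviour proposed above, using ZipTricks::Streamer directly rather than zipline's internals; `fetch_in_chunks` is a made-up stand-in for whatever HTTP client does the fetching. Note that, as discussed further down in the thread, a bare rescue is not quite enough with zip_tricks because a half-written entry still lands in the central directory - this only illustrates the intent.)

```ruby
require "uri"
require "zip_tricks"

# Hypothetical helper: skip files that fail to fetch, remember them,
# and append a "failures.log" entry at the end of the archive.
def stream_archive(out_io, urls)
  skipped = []
  ZipTricks::Streamer.open(out_io) do |zip|
    urls.each do |url|
      begin
        zip.write_deflated_file(File.basename(URI(url).path)) do |sink|
          fetch_in_chunks(url) { |chunk| sink << chunk } # assumed helper, not a real API
        end
      rescue => e
        skipped << [url, e.message] # log and move on instead of aborting the whole download
      end
    end
    unless skipped.empty?
      zip.write_deflated_file("failures.log") do |sink|
        skipped.each { |url, message| sink << "#{url}: #{message}\n" }
      end
    end
  end
end
```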
this might be possible if the failure happened at the beginning. if it happened in the middle of the file, that data has already been streamed out to the user and it's not possible to walk it back, as far as i know. i think i'd rather abort than have a chance that the user thinks this file is valid.
|
that said, an errors.log would be better than nothing. i'll keep this in mind if we try the other options and can't get them working. |
But there's no need to walk back. Before switching to zipline, for many years we used the "classic" approach of downloading files to the worker machine, zipping them, and then sending the archive to the client. From that experience, the vast majority of errors related to remote files happen when trying to fetch a file - for example a 404, a 5XX, or some other error. Something going wrong while a remote file is in the middle of transfer - I can't say I have ever encountered that. So if it's possible to just log it and skip it - that would be best. But even in the case where we've fetched, say, 20 MB of a 25 MB file and the connection to the remote host went down or timed out mid-transfer - is it possible to tell the client "ok, that's the end of this file, here's the next one" and just jump to the next file in the list? Because it is still much more preferable to have a log + a corrupted 20 MB file + 1999 intact files, rather than a broken 15 GB archive that you can't even examine, let alone salvage. |
yeah actually that makes some sense to me... this gem is mostly for those giant downloads, so maybe just getting partial success is generally better. my worry is the user will not know that the 1 file in there is actually corrupt. |
Hence the error.log ;) Better one corrupt file than a 15 GB corrupt archive... |
The problem is that a ZIP file which has been "chopped off" at the end is likely not to unarchive well, because the list of files inside the ZIP comes at the end of the file. You can read the ZIP "straight-ahead", but in the presence of data descriptors this is likely to fail in some apps, and it is very inefficient. It can also be faulty - for example, if entries are themselves ZIP files. What you need to make this work is a resumable download, similar to what we did at WeTransfer back in the day. You might be able to put something together using raw zip_tricks and interval_response so that when you address the download again you will be able to resume it. It is not trivial to do by any means, but doable. An error.log would be a nice feature, albeit by the time your download gets interrupted you can no longer add anything to the response.
If your concern is the "upstream" (where you are pulling the file from) timing out on you or similar, you can pull data from it in chunks and resume where it interrupts. HTTP |
@julik I'm not talking about "chopping off" zip files, I'm talking about chopped-off files that are being requested. Or, which is actually more frequent, non-existent (404) files that zipline is fetching from the URL. If I specify a list of 500+ URLs (S3, for example) and one of them in the middle of the list does not exist, then the whole 50 GB archive is botched. And the frustration of downloading a 30 GB file which is suddenly broken, and then downloading it again only to find out that it's still broken, is immense. I'm not talking about chopping the zip file. I'm talking about at least skipping the failed archivable file. If the file being fetched throws an error, any error, then either skip writing it altogether or write what has been fetched so far (say, 50 MB of 150), if it's a partially fetched file, and move on to the next file. And at the very end, in the list of the zipped files, say "ok, this file is 50 MB in size, next". That way when the user downloads the archive, only 1 file in the whole archive will be "broken" (zero-sized, partial, or not present at all), not the whole archive itself. |
That should actually work if you surround your call with a rescue. Update - I was wrong, because even though ZipTricks will not finalize the file, it will still add it to the central directory - albeit in a "broken" form. You would need to muck about with the internals a bit:

```ruby
module SaferZipWrite
  # Removes a file that does not finish writing from the
  # @files array in the Streamer, so that the file will not
  # be present in the ZIP central directory. Some unarchivers
  # might then still be able to unarchive files which do complete writing.
  class StreamerWrapper < SimpleDelegator
    def write_deflated_file(...)
      files = __getobj__.instance_variable_get("@files")
      n_files_before_writing = files.size
      super
    rescue => e
      # do something with the exception
      files.pop if files.size > n_files_before_writing
    end

    def write_stored_file(...)
      files = __getobj__.instance_variable_get("@files")
      n_files_before_writing = files.size
      super
    rescue => e
      # do something with the exception
      files.pop if files.size > n_files_before_writing
    end
  end

  def write_file(streamer, file, name, options)
    super(StreamerWrapper.new(streamer), file, name, options)
  end
end

Zipline::ZipGenerator.prepend(SaferZipWrite)
```
|
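(For context, a minimal sketch of how the patch above might be wired up, assuming zipline's documented controller helper; the initializer path, model, and column names are made up.)

```ruby
# config/initializers/safer_zip_write.rb would hold the SaferZipWrite module
# above together with the Zipline::ZipGenerator.prepend call.

# Hypothetical controller - Upload, remote_url and filename are illustrative names.
class DownloadsController < ApplicationController
  include Zipline

  def batch
    files = Upload.where(batch_id: params[:batch_id]).map do |upload|
      [upload.remote_url, upload.filename]
    end
    # With SaferZipWrite prepended, an entry whose fetch raises mid-stream is
    # dropped from the central directory instead of corrupting the whole archive.
    zipline(files, "batch-#{params[:batch_id]}.zip")
  end
end
```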
Well... I actually did something similar: I've patched the [...] I'll try your suggestion, though, thanks! 👍 Maybe I don't have to download the whole file after all... Still, it would be nice to have this feature built-in ;) |
That's still much more preferable than a broken archive! 👍 😉 |
that is for @fringd to ponder 😆 The feature would be nice, indeed - I'd wager it would have to be added in zip_tricks though, and I haven't yet sorted the maintenance there. There could be a heuristic in zipline for choosing between stored and deflated modes, too |
PRs welcome of course. I'm assuming there isn't enough local storage for files generally though, so anything that waits for a whole file before continuing is a non-starter 😕
1. Maybe we could just abandon the data and remove the failed file from the list of central directory entries. Even include an "ERRORLOG.txt" file or something in the archive saying that one or more files failed to fetch. i'm not 100% sure, but i think it's valid to have a bunch of garbage data in the zip file that is never pointed to by the central directory entries.
2. Code that at least attempts to recover if the download fails would be welcome... trying to restart or resume and continue feeding data into the output stream in some way without abandoning the written data... or alternately abandon the already written data, again discarding the central directory entry, but this time adding a new one and starting over and hoping this works.
3. i think the cost of deflating files that don't benefit is negligible, so it's probably not worth the trouble to do any fancy extension heuristics
|
For those pondering the "heuristic" choice between using stored or deflated modes, here is something that could work nicely. It will monitor how well the input compresses over the first chunk of data (64 KiB in the sketch below - you could also pick a ratio or something), and buffer roughly that amount. Then it will pick the more efficient option and open a file for you in the Streamer.

```ruby
module HeuristicWrite
  class DeflateHeuristic
    BYTES_WRITTEN_THRESHOLD = 65 * 1024

    def initialize(filename, streamer)
      @filename = filename
      @streamer = streamer
      @buf = StringIO.new.binmode
      @deflater = ::Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -::Zlib::MAX_WBITS)
      @bytes_deflated = 0
      @winner = nil
    end

    def <<(bytes)
      if @winner
        @winner << bytes
      else
        @buf << bytes
        @deflater.deflate(bytes) { |chunk| @bytes_deflated += chunk.bytesize }
        decide if @buf.size > BYTES_WRITTEN_THRESHOLD
      end
      self
    end

    def close
      decide unless @winner
      @winner.close
    end

    private def decide
      # Finish and then close the deflater - it has likely buffered some data
      @bytes_deflated += @deflater.finish.bytesize until @deflater.finished?
      # If the deflated version is smaller than the stored one
      # - use deflate, otherwise stored
      ratio = @bytes_deflated / @buf.size.to_f
      if ratio < 0.75
        warn "using flate"
        @winner = @streamer.write_deflated_file(@filename)
      else
        warn "using stored"
        @winner = @streamer.write_stored_file(@filename)
      end
      @buf.rewind
      IO.copy_stream(@buf, @winner)
    end
  end

  def write_file(name)
    writable = DeflateHeuristic.new(name, self)
    yield(writable)
    writable.close
  end
end

ZipTricks::Streamer.include(HeuristicWrite)

io = StringIO.new
ZipTricks::Streamer.open(io) do |zip|
  zip.write_file("this will be stored.bin") do |io|
    90_000.times do
      io << Random.bytes(1024)
    end
  end
  zip.write_file("this will be deflated.bin") do |io|
    90_000.times do
      io << "many many delicious, compressible words"
    end
  end
end

warn io.size
```
|
This allows us to rescue during writes if, for example, the upstream server has a timeout. There is an interesting discussion here fringd/zipline#78
@Urist-McUristurister @suhastech @fringd Now that the zip_tricks situation has been resolved - maybe we should think about an API for this? I've added a feature to zip_kit which excludes the failed file from the ZIP central directory and removes it from the path set - so you can add the same file multiple times if it fails halfway. The issue is that you need to specify a retry policy - or have some code for it in some way. For example, you could have a retry policy that kicks in if one of the fetches fails. With "naked" zip_kit you could do that with something like https://github.com/kamui/retriable by doing

```ruby
Retriable.retriable(on: TimeoutError) do
  zip.write_file("my_stuff.bin") do |sink|
    HTTP.get(the_url) { |response_body| IO.copy_stream(response_body, sink) }
  end
end
```

The configuration for [...] |
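(For what it's worth, a sketch of a more explicit retry policy with the retriable gem; the option values and error classes are illustrative, `zip` and `the_url` come from the snippet above, and `fetch_in_chunks` is the same hypothetical HTTP helper as in the earlier sketch.)

```ruby
require "retriable"

# Hypothetical retry policy around a single ZIP entry - the numbers and error
# classes are illustrative, not zipline's or zip_kit's actual defaults.
RETRYABLE_ERRORS = [Timeout::Error, Errno::ECONNRESET, SocketError]

Retriable.retriable(on: RETRYABLE_ERRORS, tries: 3, base_interval: 1.0) do
  zip.write_file("my_stuff.bin") do |sink|
    fetch_in_chunks(the_url) { |chunk| sink << chunk } # stand-in for your HTTP client
  end
end
# If every try fails, the exception propagates; with the zip_kit behaviour
# described above, the half-written entry is excluded from the central
# directory, so the same path can be retried later or simply skipped and logged.
```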
if fetching a file over http fails after x/n bytes written it would be nice if we could try to resume using range requests to start the download at x bytes if supported |
i'm guessing most CDNs support range requests |
Storage backends like S3 do, but it is not guaranteed - also S3 will not respond to a HEAD, only to a GET, so you will need to do a "start request", and so on. I mean, all of these are doable; what I am wondering about is how to make it sufficiently generic so that it would be "good enough for most folks" - or, alternatively, how to provide an extension point for this. If you can resume an upstream download you can rescue inside of the [...] |
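(A rough sketch of what resuming with Range could look like; `sink` stands for the writable yielded by the streamer, the retried error classes are only examples, and the upstream has to honour Range requests for this to work.)

```ruby
require "net/http"
require "uri"

# Fetch `url` into `sink`, resuming with a Range request after transient failures.
# Purely illustrative - not zipline's actual code.
def fetch_with_resume(url, sink, max_attempts: 3)
  uri = URI(url)
  bytes_written = 0
  attempts = 0
  begin
    attempts += 1
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      request = Net::HTTP::Get.new(uri)
      # On a retry, ask only for the bytes we have not streamed out yet.
      request["Range"] = "bytes=#{bytes_written}-" if bytes_written > 0
      http.request(request) do |response|
        expected = bytes_written > 0 ? "206" : "200"
        raise "unexpected HTTP status #{response.code}" unless response.code == expected
        response.read_body do |chunk|
          sink << chunk
          bytes_written += chunk.bytesize
        end
      end
    end
  rescue Timeout::Error, Errno::ECONNRESET, EOFError, IOError
    retry if attempts < max_attempts
    raise
  end
end
```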
Okay, attempting to resume gracefully would be a nice to have feature request i think. we can attempt it later, just handling errors gracefully would be a good start for now |
Ok, I will need to do a bit of refactoring then. What would be the default - "give up" if a file fails and print to the Rails log? This would need a refactor for all the file types I think (since you need to match on storage-specific exceptions) |
i think the default would be to remove the file from the list of files written to the header at the end (the central directory), move on to the next file, and yeah, log it. that's at least a baseline we can improve from |
Aight |
Hoping progress is going well on this. Tried using zipline in our rails app to serve many wav files (up to 80) to clients quicker than we currently do and it works great! ...on local, but totally crashes when running in Heroku. 😅
|
Is that particular error on your side though, @strozzi1? There can often be timeouts where the downstream (the Heroku router, in this case) will close the connection on you if you are not writing fast enough, or if the client reading from Heroku is not reading fast enough. At the point where there is a Puma connection error writing to the client socket, reattempting to fetch the data you are proxying is likely not going to help. The "bad chunk" error looks odd though. I suspect there might be a different issue at play here:
This might need some investigation on your part, I'm afraid. |
@julik I was afraid you'd say that. I ended up removing some code on the client side that was slowing down ingestion of the server's data stream, and THAT DID get me a lot farther into the download - but then I hit a new error!
I'll keep digging around and see what I come up with, thanks for responding and giving me ideas to work with! 🙏 |
If I look at it in the Heroku docs here https://devcenter.heroku.com/articles/error-codes#h28-client-connection-idle I see that this likely means a slow HTTP client (if their usage of the term "client" matches the standard HTTP semantics). When I had a streaming proxy server (not using zipline but similar in purpose) I ended up ignoring these errors because they are, fundamentally, a "client disconnect" or "client sleep" - something that cannot be either controlled or recovered from on the sending (server) side. Re. mem bloat - sending data through the Rack interface should not be causing this. If I were you I would look at the ActiveRecord select (whether you have wide SQL rows that you fetch with SELECT *) and the HTTP retrieval, to see whether there are any buffers there that may be causing this. I do have a lot of experience debugging stuff like this, but I don't do it pro bono 😄 |
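(To illustrate the ActiveRecord side of that advice: a hedged sketch of keeping the per-row payload small while building the file list for zipline; the model and column names are made up.)

```ruby
# Hypothetical model/columns - the point is only "avoid SELECT * on wide rows"
# and "load rows in batches" while assembling the zipline file list.
def build_file_list(batch_id)
  files = []
  Upload
    .where(batch_id: batch_id)
    .select(:id, :remote_url, :filename)    # skip wide columns (blobs, serialized JSON, ...)
    .find_each(batch_size: 500) do |upload| # fetch in batches instead of all at once
      files << [upload.remote_url, upload.filename]
    end
  files
end
```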
By "the client" - do you mean the HTTP client config in the server that runs zipline or the client itself? If you have a custom client that calls your Heroku app and you are twiddling with its code - this is where I would start stripping the "clever code" that manages the HTTP download to be honest. |
@julik, By "client" I meant client-side (JS/React), as in the front-end where a user triggers the download. I removed code that ran everytime a chunk was ingested thinking if I sped up front-end ingestion speed it'd prevent timeout error, but it turns out that this wasn't a fix at all, just luck that I made it further before it failed during testing. The only server code I have is the CarrierWave files zipline example from the readme. I can't find anything in any config files, no middleware that is causing this, at least not obviously, though this is an big/old stack with many skeletons in the closet.. I think I may have spun my wheels enough on this without success. Thanks for giving me ideas to persue, but I don't think I can get zipline to work on Heroku. Sorry to clog up the thread, I really thought this issue was related 😅 |
We stumbled upon a tricky bug where no matter what, the ZIP produced by zipline was corrupt. Tracking it down was quite difficult.
It seems that the download of one of the S3 files failed here:
https://github.com/fringd/zipline/blob/master/lib/zipline/zip_generator.rb#L96
Possible solutions: