Optimise processing of line art #905
Thanks a lot for this detailed performance analysis. We are definitely going to do further work on improving performance. I'm not sure if we'd be able to reduce the number of objects (there might be millions of objects in a PDF even if it is just 1 MB), but we do plan to implement thread-safety in the next release, getting rid of the static containers. We'll also check if we can parallelize checks within a single document, and we'll inspect whether we can optimize the code for the most common checks. It would be great if you could also send us your benchmark documents.
Duplicate of #896
We have implemented caching of all data structures related to fonts. This significantly improves performance on large text documents with a limited number of fonts. Would you please check the latest dev version? It would be great to know its performance on your model test files. Note also that the first validation job (both in the GUI and in a CLI batch job) includes some time (up to a second) for compiling all JavaScript expressions. So, if you need a clean time for a single file, please rerun the validation from the GUI or include the file in a batch job when executing the CLI.
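For readers embedding veraPDF as a library, here is a minimal timing sketch of the same idea: validate the file once as a warm-up (which absorbs the one-time JavaScript compilation cost) and measure only the second run. It assumes the library API as shown in the veraPDF examples; class and package names may differ between releases.

```java
import java.io.FileInputStream;

import org.verapdf.gf.foundry.VeraGreenfieldFoundryProvider;
import org.verapdf.pdfa.Foundries;
import org.verapdf.pdfa.PDFAParser;
import org.verapdf.pdfa.PDFAValidator;
import org.verapdf.pdfa.flavours.PDFAFlavour;

public class WarmTiming {

    public static void main(String[] args) throws Exception {
        VeraGreenfieldFoundryProvider.initialise();
        validateOnce(args[0]);            // warm-up run, includes JavaScript compilation
        long start = System.nanoTime();
        validateOnce(args[0]);            // measured "clean" run
        System.out.printf("warm validation: %d ms%n",
                (System.nanoTime() - start) / 1_000_000);
    }

    private static void validateOnce(String path) throws Exception {
        try (FileInputStream in = new FileInputStream(path);
             PDFAParser parser = Foundries.defaultInstance()
                     .createParser(in, PDFAFlavour.PDFA_2_B)) {
            PDFAValidator validator = Foundries.defaultInstance()
                    .createValidator(PDFAFlavour.PDFA_2_B, false);
            validator.validate(parser);
        }
    }
}
```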
BTW, veraPDF needs about 5 minutes to validate my test document for the 8388607 indirect objects limit (6.1.13 of ISO 19005-2); font caching doesn't help in that case, as there are no fonts. My own PDF/A validator needs about 1 second. I'm too lazy to profile veraPDF...
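For context, a stress-test file of this kind can be produced synthetically. The following is a hypothetical sketch (the object layout, the filler dictionary and the file name are illustrative assumptions, not the structure of the document discussed in this thread) that writes a minimal PDF with one object more than the 8,388,607 limit:

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class ManyObjectsPdf {

    public static void main(String[] args) throws IOException {
        int n = 8_388_608;                 // one object over the PDF/A-2 limit
        long[] offsets = new long[n + 1];  // byte offset of every indirect object
        long pos = 0;

        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("many-objects.pdf"))) {
            pos += write(out, "%PDF-1.7\n");
            for (int i = 1; i <= n; i++) {
                offsets[i] = pos;
                String body;
                if (i == 1) {
                    body = "<< /Type /Catalog /Pages 2 0 R >>";
                } else if (i == 2) {
                    body = "<< /Type /Pages /Kids [3 0 R] /Count 1 >>";
                } else if (i == 3) {
                    body = "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>";
                } else {
                    body = "<< /Dummy true >>";   // filler object
                }
                pos += write(out, i + " 0 obj\n" + body + "\nendobj\n");
            }
            // Cross-reference table: each entry must be exactly 20 bytes long.
            long xrefPos = pos;
            write(out, "xref\n0 " + (n + 1) + "\n0000000000 65535 f \n");
            for (int i = 1; i <= n; i++) {
                write(out, String.format("%010d 00000 n \n", offsets[i]));
            }
            write(out, "trailer\n<< /Size " + (n + 1) + " /Root 1 0 R >>\n"
                    + "startxref\n" + xrefPos + "\n%%EOF\n");
        }
    }

    private static long write(OutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        out.write(bytes);
        return bytes.length;
    }
}
```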
@a20god it would be great if you could share this test file with us. We'd be happy to profile veraPDF and optimize it further.
The file is probably too big (35 MB) to attach here. Do you have any preferred alternative?
You can send the file (or place the file in the cloud and send me the link) directly to [email protected]
Here's the version with 8388608 indirect objects:
Here's another one with 8388608 indirect objects:
Is there any update on the mentioned issue? We have been experiencing the same complications with files containing complex vector graphics, which drive the application's memory usage to ridiculous proportions due to the mentioned number of objects held in memory. This makes the validator completely unreliable and unusable for us, since we can't use a library that can cause a system crash in an application that requires high availability. In case there's been no progress on solving this problem, and if it's only triggered by a subset of the assertions, is there any workaround that allows us to avoid triggering those specific validations while keeping the general validations?
We are constantly working on improving memory use. Are you using the latest stable release (1.12)? Would you be able to provide test files where you experience these issues?
@bdoubrov the files provided by @a20god and the assessment by @simnd describe the problem exactly, which is why I chose to respond to this ticket. Were you able to get to the issue described here? I can imagine that unless the problem of high memory use, caused by all objects being held in memory during validation, is solved, the problem will persist no matter how much is tweaked in optimizations elsewhere. This does not seem to be a ticket that gets "solved in time due to potentially related performance tweaks". In addition, the files I experienced it with, which caused a complete system crash due to memory allocation issues, contain classified information and cannot be shared.
The files provided by @a20god require 2.5 GB of JVM memory, which is indeed not exceptional for files of this size. At the moment veraPDF does keep in memory any array of objects linked from some other object, but not more. That is, for the test file of @a20god it would indeed create an array of 8388608 indirect objects in memory. When it comes to page contents, veraPDF keeps in memory only the objects from a single page and, moreover, tries to cache all text objects. I can imagine the issues in case the files contain a lot of line art (vector graphics). This is exactly why I was asking for test documents in order to reproduce the issue. Anyway, we'll run another optimization cycle before the next release and will try to reduce memory use even further. Real-world test files would still help us a lot.
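Until memory use is reduced, one possible mitigation for the availability concern raised above is to isolate validation in a separate process, so that a pathological file cannot take down the host application. The following is a minimal sketch under stated assumptions: the verapdf launcher is on the PATH, --flavour selects PDF/A-2b, and the timeout value and exit-code handling are illustrative rather than the CLI's documented contract.

```java
import java.io.File;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class IsolatedValidation {

    /** Runs veraPDF on a single file in a child process; returns false if it had to be killed. */
    public static boolean validateIsolated(File pdf) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "verapdf", "--flavour", "2b", pdf.getAbsolutePath());   // assumed launcher and flags
        pb.redirectErrorStream(true);
        pb.redirectOutput(new File(pdf.getName() + ".report.xml"));     // keep the report for later parsing

        Process process = pb.start();
        // If the child process hangs or exhausts its memory on a pathological file,
        // kill it and treat the document as "not validated" instead of crashing the host.
        if (!process.waitFor(5, TimeUnit.MINUTES)) {
            process.destroyForcibly();
            return false;
        }
        // Exit-code semantics are an assumption here; in practice, parse the report
        // written above to decide whether the file is compliant.
        return process.exitValue() == 0;
    }
}
```

The child process's heap can additionally be capped (for instance by adjusting the launcher's Java options), so that memory pressure stays bounded per document.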
@bdoubrov Thank you for this response. I understand the complication. As I mentioned, sharing the file is not an option, but to illustrate the situation: the file contained a blueprint for a building. You can imagine the amount of vector data included in such a file. Since we have no way of enforcing that such images are not included as vectors, we're in a difficult position due to the aforementioned problem. I don't know if it is an option to let the validator skip deep SVG analysis when it comes to this ISO requirement, or whether that analysis is part of the requirement, but if you could skip it, or make skipping SVGs optional, that would immediately solve the problem on our end.
While we have improved performance and memory usage, do we have any intention of supporting the skipping of SVG processing? Keeping this open until we at least answer that question.
I think there is still some room for optimization in the case of a large number of line-art objects on the page. Please keep it open for now.
Dear veraPDF team,
We've been evaluating veraPDF (Release 1.8, VeraGreenfieldFoundryProvider) as our tool of choice for validating PDF documents, mainly against PDF/A-2b compliance. We're committed to certain performance requirements which we definitely can't reach with this library. For example, we have a reference document with complex content and a size of only 1 MB. By requirement, its validation must not take longer than 3500 milliseconds, but with veraPDF it actually takes about 12000 ms. We've been looking at your code base and found some issues I'd like to address:
- BaseValidator performs millions of assertions (Rhino) sequentially in one single Thread; it feels like concurrency is necessary here (see the sketch after this list).
- Static containers (StaticContainers) hold global state, which is also bad for threading.
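As a generic illustration of that suggestion (this is not veraPDF code; CheckResult and the check suppliers are hypothetical placeholders, and the static state mentioned above would first have to become per-task state for such a change to be safe), independent checks could be evaluated on a fixed thread pool instead of one single thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

public class ParallelChecks {

    /** Hypothetical result of a single validation rule. */
    record CheckResult(String ruleId, boolean passed) {}

    /** Evaluates independent checks concurrently instead of sequentially. */
    static List<CheckResult> runChecks(List<Supplier<CheckResult>> checks)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Callable<CheckResult>> tasks = new ArrayList<>();
            for (Supplier<CheckResult> check : checks) {
                tasks.add(check::get);                  // each rule becomes its own task
            }
            List<CheckResult> results = new ArrayList<>();
            for (Future<CheckResult> future : pool.invokeAll(tasks)) {
                results.add(future.get());              // collect results in submission order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```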
Profiling
We've also run the validation with JProfiler to see what's happening on the JVM while veraPDF is running. It's obvious that the validation creates a very large number of objects and requires quite a lot of heap memory during runtime. Even though the document is only 1 MB, millions of objects occupying hundreds of MB on the heap are created during validation.
Another example is a ~4 MB document with simple text on ~3500 pages; validating it takes about 8 minutes, and the profiling results are even more critical.
Are you aware of these issues? Isn't veraPDF meant to be used for larger documents?
Thanks in advance!
Simon