Optimise processing of line art #905

Closed
simnd opened this issue Aug 23, 2017 · 16 comments · Fixed by #1060 or #1063
Labels
bug A product defect that needs fixing P1 High priority issues to be scheduled in the upcoming release

Comments

@simnd

simnd commented Aug 23, 2017

Dear veraPDF team,

We've been evaluating veraPDF (Release 1.8, VeraGreenfieldFoundryProvider) as our tool of choice for validating PDF documents, mainly against PDF/A-2b. We're committed to certain performance requirements which we definitely can't reach with this library. For example, we have a reference document with complex content that is only 1 MB in size. By requirement, validation must not take longer than 3500 ms, but with veraPDF it actually takes about 12000 ms.

We've been looking at your code base and found some issues I'd like to address:

  • the validation, which takes place in BaseValidator, performs millions of assertions (Rhino) sequentially in a single thread; it feels like concurrency is necessary here.
  • the code isn't thread-safe, which makes it difficult to implement threading
  • state is managed in static classes (StaticContainers), which is also bad for threading
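
To illustrate the last two points, here is a minimal sketch of per-job state instead of static state. `ValidationContext` and `ContextDemo` are hypothetical names and not part of veraPDF; the "validation" is simulated by a counter. The only point is that when each job owns its own context instance, several documents can be processed concurrently without sharing mutable state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical per-job state: each validation job owns its own instance,
// so nothing is shared between threads (unlike a static container).
final class ValidationContext {
    private final String documentName;
    private int assertionCount;

    ValidationContext(String documentName) {
        this.documentName = documentName;
    }

    void recordAssertion() {
        assertionCount++; // safe: only the owning job touches this instance
    }

    String summary() {
        return documentName + ":" + assertionCount;
    }
}

final class ContextDemo {
    // Runs one simulated validation job per document on a small thread pool.
    static List<String> validateAll(List<String> documents, int assertionsPerDocument) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String name : documents) {
                Callable<String> job = () -> {
                    ValidationContext ctx = new ValidationContext(name);
                    for (int i = 0; i < assertionsPerDocument; i++) {
                        ctx.recordAssertion();
                    }
                    return ctx.summary();
                };
                futures.add(pool.submit(job));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // results come back in submission order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while validating", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("validation job failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With a static counter shared by all jobs, the same loop would need synchronization or would produce wrong totals; with one context per job it needs neither.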

Profiling

We've also run the validation under JProfiler to see what's happening on the JVM while veraPDF is running. It's obvious that the validation creates a very large number of objects and requires a lot of heap memory at runtime. Even though the document is only 1 MB, millions of objects occupying hundreds of MB on the heap are created during validation.
[Image: heap_1mb]

Another example is a ~4 MB document with simple text on ~3500 pages. Validation takes about 8 minutes, and the profiling results are even more critical.
[Image: profiler_memory]

Are you aware of these issues? Isn't veraPDF meant to be used for larger documents?

Thanks in advance!
Simon

@bdoubrov
Contributor

Thanks a lot for this detailed performance analysis. We are definitely going to do further work on improving the performance. I'm not sure if we'd be able to reduce the number of objects (there might be millions of objects in a PDF even if it is just 1 MB), but we do plan to implement thread safety in the next release by getting rid of static containers.

We'll also check if we can parallelize checks within a single document. We'll also inspect if we can optimize the code for most common checks.
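
As a rough idea of what parallelizing checks within a single document could look like, here is a sketch using a parallel stream. `ParallelChecks` and `countFailures` are hypothetical names, not veraPDF code, and the sketch assumes the check is a stateless function of a single item, which is exactly the property that static containers make hard to guarantee.

```java
import java.util.List;
import java.util.function.Predicate;

final class ParallelChecks {
    // Counts the items that fail the check. The predicate must be stateless
    // and must not touch shared mutable data, otherwise running it in
    // parallel is not safe.
    static <T> long countFailures(List<T> items, Predicate<T> check) {
        return items.parallelStream()
                .filter(item -> !check.test(item))
                .count();
    }
}
```

Whether this pays off depends on how expensive a single check is; for very cheap checks the coordination overhead can outweigh the gain.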

It would be great, if you could also send us your benchmark documents.

@bdoubrov
Contributor

Duplicate of #896

@bdoubrov bdoubrov marked this as a duplicate of #896 Aug 25, 2017
@bdoubrov
Contributor

bdoubrov commented Sep 6, 2017

We have implemented caching of all data structures related to fonts. This significantly increases performance on large text files with a limited number of fonts.
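
For readers unfamiliar with the idea, the following is a generic sketch of such a cache, not the actual veraPDF implementation. `FontCache`, the string key, and the loader function are assumptions made for illustration: each distinct font key is parsed once, and later lookups reuse the stored result.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Generic cache keyed by a font identifier (hypothetical; illustration only).
final class FontCache<V> {
    private final Map<String, V> cache = new ConcurrentHashMap<>();
    private final Function<String, V> loader;
    private final AtomicInteger loads = new AtomicInteger();

    FontCache(Function<String, V> loader) {
        this.loader = loader;
    }

    // Returns the cached value, invoking the loader only on the first request per key.
    V get(String fontKey) {
        return cache.computeIfAbsent(fontKey, key -> {
            loads.incrementAndGet();
            return loader.apply(key);
        });
    }

    // Number of times the loader actually ran.
    int loadCount() {
        return loads.get();
    }
}
```

On a document with thousands of pages but only a handful of fonts, the loader runs a handful of times instead of once per use, which is where the gain on large text files would come from.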

Would you please check the latest dev version? It would be great to know its performance on your model test files.

Note also that in case of the first validation job (both in the GUI and in the CLI batch job) there is also some time (up to a second) taken by compiling all JavaScript expressions. So, if you need a clean time on a single file, please rerun validation from the GUI or include it into a batch job when executing the CLI.
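
A simple way to measure a "clean" time in your own harness is to run the job once untimed and only time the following runs. The sketch below assumes the validation is wrapped in a `Runnable`; `WarmTiming` is a hypothetical helper and not part of veraPDF.

```java
final class WarmTiming {
    // Runs the task once untimed (warm-up), then returns the average
    // duration in nanoseconds over the given number of timed runs.
    static long averageNanosAfterWarmUp(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        task.run(); // warm-up: one-time costs such as expression compilation land here
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            total += System.nanoTime() - start;
        }
        return total / runs;
    }
}
```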

@a20god

a20god commented Sep 6, 2017

BTW, veraPDF needs about 5 minutes for validating my test document for the 8388607 indirect objects limit (6.1.13 of ISO 19005-2), font caching doesn't help in that case as there are no fonts. My own PDF/A validator needs about 1 second. I'm too lazy for profiling veraPDF...
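
For context, the limit itself is a plain count comparison (8388607 is 2^23 - 1). The sketch below shows only that comparison; `IndirectObjectLimit` is a hypothetical name, and it says nothing about how either validator actually counts the objects.

```java
final class IndirectObjectLimit {
    // Limit cited above (6.1.13 of ISO 19005-2): 8388607 = 2^23 - 1.
    static final long MAX_INDIRECT_OBJECTS = 8_388_607L;

    // True if the number of indirect objects does not exceed the limit.
    static boolean withinLimit(long indirectObjectCount) {
        if (indirectObjectCount < 0) {
            throw new IllegalArgumentException("count must not be negative");
        }
        return indirectObjectCount <= MAX_INDIRECT_OBJECTS;
    }
}
```

A file with 8388608 indirect objects is therefore the smallest one that fails this check.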

@bdoubrov
Contributor

bdoubrov commented Sep 6, 2017

@a20god would be great if you could share this test file with us. We'd be happy to profile veraPDF and further optimize it.

@a20god

a20god commented Sep 6, 2017

The file is probably too big (35 MB) for attaching it here. Do you have any preferred alternative?

@bdoubrov
Contributor

bdoubrov commented Sep 6, 2017

You can send the file (or place the file in the cloud and send me the link) directly to [email protected]

@a20god

a20god commented Sep 6, 2017

Here's the version with 8388608 indirect objects:
https://www.mediafire.com/file/4myw01o3kb442c4/6.1.13-07-fail-1.pdf
Note that you have to increase the Java heap size.

@a20god

a20god commented Sep 7, 2017

Here's another one with 8388608 indirect objects:
https://www.mediafire.com/file/po5b9x8n8ycp66s/6.1.13-07-fail-2.zip
6.1.13-07-fail-1.pdf uses an ObjStm, this one doesn't. Validation of this document with veraPDF takes about 9 minutes.

@cmkramer

Is there any update on this issue? We have been experiencing the same complications with files containing complex vector graphics, which cause the application's memory usage to grow to ridiculous proportions due to the number of objects held in memory.

This makes the validator completely unreliable and unusable to us since we can't use a library that can cause a system crash on an application that requires high availability.

In case there's been no progress on solving this problem, and if it is only triggered by a subset of the assertions, is there any workaround that allows us to avoid triggering those specific validations while keeping the general ones?

@bdoubrov
Contributor

We are constantly working on improving memory use. Are you using the latest stable release (1.12)? Would you be able to provide test files where you experience issues?

@cmkramer

@bdoubrov the files provided by @a20god and the assessment by @simnd describe the problem exactly, which is why I chose to respond to this ticket. Were you able to get to the issue described here? I can imagine that unless the problem with the high memory use due to all objects being held in memory during validation is solved, the problem will persist, no matter how much is tweaked on optimizations elsewhere.

This does not seem to be a ticket that is "solved in time due to potentially related performance tweaks".

In addition, the respective files I experienced it with, and which caused a complete system crash due to memory allocation issues, contain classified information and cannot be shared.

@bdoubrov
Contributor

The files provided by @a20god require 2.5 GB of JVM memory, which is indeed not exceptional for files of this size.
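
If you want to verify up front that the JVM has been given enough heap (the maximum is normally set with the standard `-Xmx` option, for example `-Xmx4g`), a check along these lines can help. `HeapCheck` is a hypothetical helper, and reading the figure above as 2.5 GiB is an assumption; the value reported by the runtime can also differ slightly from the `-Xmx` setting.

```java
final class HeapCheck {
    // Converts gibibytes to bytes (1 GiB = 1024^3 bytes).
    static long gibToBytes(double gib) {
        return (long) (gib * 1024L * 1024L * 1024L);
    }

    // True if the maximum heap the JVM will attempt to use is at least requiredBytes.
    static boolean hasAtLeast(long requiredBytes) {
        return Runtime.getRuntime().maxMemory() >= requiredBytes;
    }

    public static void main(String[] args) {
        long needed = gibToBytes(2.5); // assumed reading of the figure quoted above
        long available = Runtime.getRuntime().maxMemory();
        System.out.println("max heap bytes: " + available);
        System.out.println("enough for " + needed + " bytes: " + hasAtLeast(needed));
    }
}
```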

At the moment veraPDF does indeed keep in memory the array of objects, linked from some other object, but not more. That is, in the test file of @a20god it would indeed create an array of 8388608 indirect objects in memory. When it comes to page contents, veraPDF keeps in memory only the objects from a single page and moreover tries to cache all text objects.
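
The page-at-a-time behaviour described here can be pictured with the following sketch. It is not veraPDF code: `PageAtATime` and the `loadPage` function are hypothetical, and page objects are modelled as plain strings. Peak memory is then driven by the largest single page rather than by the whole document.

```java
import java.util.List;
import java.util.function.IntFunction;

final class PageAtATime {
    // Loads and processes one page at a time and returns the largest
    // number of page objects that were held at once.
    static int peakObjectsHeld(int pageCount, IntFunction<List<String>> loadPage) {
        int peak = 0;
        for (int page = 0; page < pageCount; page++) {
            // Only the current page's objects are referenced here.
            List<String> objects = loadPage.apply(page);
            peak = Math.max(peak, objects.size());
            // After this iteration nothing in this method refers to the list,
            // so it can be garbage-collected if the loader does not retain it.
        }
        return peak;
    }
}
```

This also shows why one page with a very large amount of line art can still be a problem: the peak is set by that page alone.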

I can imagine the issues in case the files contain a lot of line art (vector graphics). This is exactly why I was asking for test documents in order to reproduce the issue.

Anyway, we'll run another optimization cycle before the next release and will try to reduce the memory use even further. Real world test files would still help us a lot.

@cmkramer

cmkramer commented Oct 31, 2018

@bdoubrov Thank you for this response. I understand the complication. Like I mentioned, sharing the file is not an option, but to illustrate the situation: the file contained a blueprint for a building. You can imagine the amount of vectors included in that file. Since we have no way of enforcing that such images are not included as vectors we're in a big puzzle due to the aforementioned problem.

I don't know if it is an option to let the validator skip deep analysis of SVGs when it comes to this ISO requirement, or whether that analysis is part of the requirement, but if you could skip it, or make skipping SVGs optional, that would immediately solve the problem on our end.

@ghost ghost added bug A product defect that needs fixing P1 High priority issues to be scheduled in the upcoming release labels Jan 3, 2019
@ghost ghost added this to the v1.14-m4 milestone Jan 3, 2019
@carlwilson
Contributor

While we have improved performance and memory usage, do we have any intention of supporting the skipping of SVG processing? Keeping this open until we at least answer the question.

@carlwilson carlwilson removed this from the v1.14-m4 milestone Aug 22, 2019
@bdoubrov
Contributor

I think there is still some room for optimization in the case of a large number of line art objects on the page. Please keep it open for now.

@carlwilson carlwilson changed the title Perfomance and Memory issues Optimise processing of line art Oct 24, 2019
@ghost ghost added this to the 1.16 milestone Oct 24, 2019
@ghost ghost removed the duplicate label Oct 24, 2019
@ghost ghost unassigned BezrukovM Oct 24, 2019
BezrukovM added a commit that referenced this issue Feb 14, 2020
BezrukovM added a commit that referenced this issue Feb 14, 2020
@carlwilson carlwilson removed this from the 1.20 milestone Feb 10, 2021