Optimise processing of line art #905

simnd · 2017-08-23T10:14:59Z

Dear veraPDF team,

We've been evaluating veraPDF (Release 1.8, VeraGreenfieldFoundryProvider) as our tool of choice for validating PDF documents, mainly against PDF/A-2b compliance. We're committed to certain performance requirements which we definitely can't meet with this library. For example, we have a reference document with complex content that is only 1 MB in size. By requirement, validation must not take longer than 3500 ms, but it actually takes about 12000 ms with veraPDF.

We've been looking at your code base and found some issues I'd like to address:

- The validation, which takes place in BaseValidator, performs millions of assertions (Rhino) sequentially in a single thread; it feels like concurrency is necessary here.
- The code isn't thread-safe, which makes it difficult to implement threading.
- State is managed in static classes (StaticContainers), which is also bad for threading.
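To illustrate the last two points, here is a minimal sketch of the general technique of moving state out of static fields into a per-job object, so that several documents can be processed on separate threads without sharing mutable state. It is not veraPDF code: `JobContext`, `PerJobStateDemo`, the rule id `rule-x` and the placeholder "check" are all hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical per-job state holder. It stands in for state that would
// otherwise live in static fields shared by every validation run.
final class JobContext {
    private final String documentName;
    private final Map<String, Integer> failedChecksByRule = new HashMap<>();

    JobContext(String documentName) {
        this.documentName = documentName;
    }

    void recordFailure(String ruleId) {
        failedChecksByRule.merge(ruleId, 1, Integer::sum);
    }

    int failures(String ruleId) {
        return failedChecksByRule.getOrDefault(ruleId, 0);
    }

    String documentName() {
        return documentName;
    }
}

final class PerJobStateDemo {
    // Runs one placeholder "validation" per document. Each job owns its
    // context, so no synchronisation on shared state is needed.
    static List<JobContext> validateAll(List<String> documents, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<JobContext>> futures = new ArrayList<>();
            for (String doc : documents) {
                Callable<JobContext> job = () -> {
                    JobContext ctx = new JobContext(doc);
                    // Placeholder for real checks: one failure per character of the name.
                    for (int i = 0; i < doc.length(); i++) {
                        ctx.recordFailure("rule-x");
                    }
                    return ctx;
                };
                futures.add(pool.submit(job));
            }
            List<JobContext> results = new ArrayList<>();
            for (Future<JobContext> future : futures) {
                results.add(future.get()); // results keep submission order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        for (JobContext ctx : validateAll(Arrays.asList("a.pdf", "report.pdf"), 2)) {
            System.out.println(ctx.documentName() + ": " + ctx.failures("rule-x"));
        }
    }
}
```

This only shows parallelism across documents. Parallelising the checks within a single document would need a different design, since the checks would then share one document model.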

Profiling

We've also run the validation under JProfiler to see what's happening on the JVM when veraPDF is running. It's obvious that the validation creates a great many objects and requires quite a lot of heap memory during runtime. Even though the document is only 1 MB, millions of objects occupying hundreds of MB on the heap are created during validation.

Another example is a ~4 MB document with simple text on ~3500 pages. Validation takes about 8 minutes, and the profiling shows even more critical results.

Are you aware of these issues? Isn't veraPDF meant to be used for larger documents?

Thanks in advance!
Simon


bdoubrov · 2017-08-23T10:58:20Z

Thanks a lot for this detailed performance analysis. We are definitely going to do further work on improving the performance. I'm not sure if we'd be able to reduce the number of objects (there might be millions of objects in a PDF even if it is just 1 MB), but we do plan to implement thread safety in the next release by getting rid of static containers.

We'll also check if we can parallelize checks within a single document. We'll also inspect if we can optimize the code for most common checks.

It would be great if you could also send us your benchmark documents.

bdoubrov · 2017-08-25T07:32:11Z

Duplicate of #896

bdoubrov · 2017-09-06T13:20:57Z

We have implemented caching of all data structures related to fonts. This significantly increases performance on large text files with a limited number of fonts.
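As a rough illustration of why such caching helps on documents with many pages but few fonts, here is a minimal sketch of a load-once cache. It is not the veraPDF implementation: `FontCacheSketch`, the string key and the loader function are assumptions made for this example only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical cache: builds the expensive structure for each key once
// and returns the stored instance on every later request.
final class FontCacheSketch<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> loader; // must not return null
    private int loads = 0;

    FontCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key); // the expensive step, e.g. parsing a font program
            loads++;
            cache.put(key, value);
        }
        return value;
    }

    // Number of times the loader actually ran.
    int loads() {
        return loads;
    }

    public static void main(String[] args) {
        FontCacheSketch<String, Integer> cache = new FontCacheSketch<>(String::length);
        String[] fonts = {"Helvetica", "Courier"};
        // Simulate 3500 pages that alternate between two fonts.
        for (int page = 0; page < 3500; page++) {
            cache.get(fonts[page % fonts.length]);
        }
        System.out.println("loader calls: " + cache.loads());
    }
}
```

With two distinct fonts the loader runs twice regardless of the page count, which is the effect described above. This sketch is not thread-safe; a concurrent variant would need something like `ConcurrentHashMap`.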

Would you please check the latest dev version? It would be great to know its performance on your model test files.

Note also that in case of the first validation job (both in the GUI and in the CLI batch job) there is also some time (up to a second) taken by compiling all JavaScript expressions. So, if you need a clean time on a single file, please rerun validation from the GUI or include it into a batch job when executing the CLI.
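A minimal sketch of how one might measure a "clean" time in one's own harness: run the job once untimed so one-off start-up costs are paid, then time the following runs. The names `WarmTimingSketch` and `timeAfterWarmup` are hypothetical, and the `Runnable` is only a placeholder for whatever call starts the validation in your setup.

```java
final class WarmTimingSketch {
    // Runs the task `warmups` times without timing it, then runs it
    // `measured` times and returns each measured duration in milliseconds.
    static long[] timeAfterWarmup(Runnable task, int warmups, int measured) {
        for (int i = 0; i < warmups; i++) {
            task.run(); // pays one-off costs such as expression compilation
        }
        long[] millis = new long[measured];
        for (int i = 0; i < measured; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return millis;
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for a validation call.
        Runnable placeholderJob = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i % 10);
            }
            if (sb.length() != 100_000) {
                throw new IllegalStateException("unexpected length");
            }
        };
        for (long ms : timeAfterWarmup(placeholderJob, 1, 3)) {
            System.out.println(ms + " ms");
        }
    }
}
```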

a20god · 2017-09-06T13:26:42Z

BTW, veraPDF needs about 5 minutes for validating my test document for the 8388607 indirect objects limit (6.1.13 of ISO 19005-2), font caching doesn't help in that case as there are no fonts. My own PDF/A validator needs about 1 second. I'm too lazy for profiling veraPDF...

bdoubrov · 2017-09-06T13:31:33Z

@a20god would be great if you could share this test file with us. We'd be happy to profile veraPDF and further optimize it.

a20god · 2017-09-06T14:36:37Z

The file is probably too big (35 MB) for attaching it here. Do you have any preferred alternative?

bdoubrov · 2017-09-06T14:40:35Z

You can send the file (or place the file in the cloud and send me the link) directly to [email protected]

a20god · 2017-09-06T15:07:44Z

Here's the version with 8388608 indirect objects:
https://www.mediafire.com/file/4myw01o3kb442c4/6.1.13-07-fail-1.pdf
Note that you have to increase the Java heap size.

a20god · 2017-09-07T08:36:09Z

Here's another one with 8388608 indirect objects:
https://www.mediafire.com/file/po5b9x8n8ycp66s/6.1.13-07-fail-2.zip
6.1.13-07-fail-1.pdf uses an ObjStm, this one doesn't. Validation of this document with veraPDF takes about 9 minutes.

cmkramer · 2018-10-22T14:09:53Z

Is there any update on this issue? We have been experiencing the same problems with files containing complex vector graphics, which drive the application's memory usage to ridiculous proportions because of the number of objects held in memory.

This makes the validator completely unreliable and unusable to us since we can't use a library that can cause a system crash on an application that requires high availability.

In case there's been no progress on solving this problem, and if it is only triggered by a subset of the assertions, is there any workaround that allows us to avoid triggering those specific validations while keeping the general validations?

bdoubrov · 2018-10-23T07:36:51Z

We constantly work on the improvements of the memory use. Are you using the latest stable release (1.12)? Would you be able to provide test files where you experience issues?

cmkramer · 2018-10-23T11:08:20Z

@bdoubrov the files provided by @a20god and the assessment by @simnd describe the problem exactly, which is why I chose to respond to this ticket. Were you able to get to the issue described here? I can imagine that unless the problem with the high memory use due to all objects being held in memory during validation is solved, the problem will persist, no matter how much is tweaked on optimizations elsewhere.

This does not seem to be a ticket that is "solved in time due to potentially related performance tweaks".

In addition, the respective files I experienced it with, and which caused a complete system crash due to memory allocation issues, contain classified information and cannot be shared.

bdoubrov · 2018-10-23T17:12:01Z

The files provided by @a20god require 2.5 GB of JVM memory, which is indeed not exceptional for files of this size.

At the moment veraPDF does indeed keep in memory the array of objects, linked from some other object, but not more. That is, in the test file of @a20god it would indeed create an array of 8388608 indirect objects in memory. When it comes to page contents, veraPDF keeps in memory only the objects from a single page and moreover tries to cache all text objects.

I can imagine the issues in case the files contain a lot of line art (vector graphics). This is exactly why I was asking for test documents in order to reproduce the issue.

Anyway, we'll run another optimization cycle before the next release and will try to reduce the memory use even further. Real world test files would still help us a lot.

cmkramer · 2018-10-31T15:55:58Z

@bdoubrov Thank you for this response. I understand the complication. Like I mentioned, sharing the file is not an option, but to illustrate the situation: the file contained a blueprint for a building. You can imagine the amount of vectors included in that file. Since we have no way of enforcing that such images are not included as vectors we're in a big puzzle due to the aforementioned problem.

I don't know if it is an option to let the validator skip deep SVG analysis when it comes to this ISO requirement, or if that analysis is part of the requirement, but if you could skip it, or make skipping SVGs optional, that would immediately solve the problem on our end.

carlwilson · 2019-08-22T10:00:59Z

While we have improved performance and memory usage, do we have any intention of supporting the skipping of SVG processing? Keeping this open until we at least answer the question.

bdoubrov · 2019-08-22T11:50:00Z

I think there is still some room for optimization in the case of a large number of line art objects on a page. Please keep it open for now.


bdoubrov marked this as a duplicate of #896 Aug 25, 2017

bdoubrov added the duplicate label Aug 25, 2017

ghost added the bug (A product defect that needs fixing) and P1 (High priority issues to be scheduled in the upcoming release) labels Jan 3, 2019

ghost added this to the v1.14-m4 milestone Jan 3, 2019

bdoubrov mentioned this issue Jan 3, 2019

Java heapspace errors #922

Closed

BezrukovM mentioned this issue May 27, 2019

Performance upgrade veraPDF/veraPDF-parser#375

Merged

carlwilson removed this from the v1.14-m4 milestone Aug 22, 2019

carlwilson changed the title ~~Perfomance and Memory issues~~ Optimise processing of line art Oct 24, 2019

ghost added this to the 1.16 milestone Oct 24, 2019

ghost removed the duplicate label Oct 24, 2019

carlwilson assigned BezrukovM Oct 24, 2019

ghost unassigned BezrukovM Oct 24, 2019

RomaPrograms mentioned this issue Feb 13, 2020

Fixed issue: #905. #1060

Merged

BezrukovM closed this as completed in #1060 Feb 14, 2020

BezrukovM added a commit that referenced this issue Feb 14, 2020

Merge pull request #1060 from RomaPrograms/overflow-memory-todo

9e3ebb3

Fixed issue: #905.

RomaPrograms mentioned this issue Feb 14, 2020

Fixed issue: #905. #1063

Merged

BezrukovM added a commit that referenced this issue Feb 14, 2020

Merge pull request #1063 from RomaPrograms/overflow-memory-todo

5b962e9

Fixed issue: #905.

carlwilson removed this from the 1.20 milestone Feb 10, 2021
