PDF/A-1 6.1.5 and tab vs. space #892

Closed
a20god opened this issue Aug 11, 2017 · 6 comments
Assignees
Labels
feature New functionality to be developed P2 Medium priority issues to be scheduled in a future release

Comments

@a20god

a20god commented Aug 11, 2017

Dev Effort

3D

Description

This is not a veraPDF issue; I think veraPDF behaves correctly. But the behavior is a bit unexpected, so it may be of interest to the Validation TWG.

Look at these two documents:
6.1.5-pass-03.pdf
6.1.5-fail-14.pdf
Both have a TAB (U+0009) in the Metadata property and a space (U+0020) in the Info dictionary.

The difference comes from one property being an XML attribute, the other being an XML element. For the former, Attribute-Value Normalization is applied, unlike for the latter. For the Info entries which correspond to XML attributes in Metadata, it's impossible to have a TAB in the value without violating 6.1.5.
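
The attribute vs. element difference described above can be reproduced with any conforming XML parser. The following is a minimal sketch using Python's standard library; the fragments are not real XMP packets, and the element and attribute names (`Description`, `Producer`, `title`) are placeholders chosen only for illustration.

```python
import xml.etree.ElementTree as ET

TAB = "\t"

# Illustrative fragments only (not real XMP): the same value with a literal
# TAB, stored once as an attribute and once as element content.
as_attribute = f'<Description Producer="Acme{TAB}Writer"/>'
as_element = f"<Description><title>Acme{TAB}Writer</title></Description>"

# Attribute-Value Normalization replaces a literal TAB with a space
# while the attribute is parsed.
attr_value = ET.fromstring(as_attribute).get("Producer")

# Element content is not subject to that normalization, so the TAB survives.
elem_value = ET.fromstring(as_element).find("title").text

print(repr(attr_value))
print(repr(elem_value))
```

Under these assumptions, the attribute value comes back with U+0020 where the TAB was written, while the element text still contains U+0009.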

Well, after finding out that ModDate "2017-08-09" does not match xmp:ModifyDate "2017-08-09", this isn't that big a surprise.

The question is: Is this really the behavior intended by the ISO 19005 committee? That is, did they choose attribute vs. element based on whether Attribute-Value Normalization is to be applied or not? Or is this just some unfortunate side effect?

@bdoubrov bdoubrov self-assigned this Aug 14, 2017
@bdoubrov
Contributor

bdoubrov commented Aug 14, 2017

The ISO 19005-1 specification says:

The value of the document information dictionary entries and their analogous XMP properties shall be equivalent. For properties that map from the PDF text string type to the XMP Text type, value equivalence shall be on a character-by-character basis, independent of encoding, comparing the numeric ISO/IEC 10646-1 code points for the characters.

I guess the specification uses the word "equivalent" and not equal exactly because direct comparison of essentially different data types is not possible. The following explanation about the text case is yet another hint.

The normalization of TAB to SPACE is part of the XML specification (https://www.w3.org/TR/2004/REC-xml-20040204/#AVNormalize). So, I believe that in this case a TAB in XMP values is equivalent to a space in the values of Info dictionary keys.

@a20god
Author

a20god commented Aug 14, 2017

I tried to make veraPDF with --fixmetadata create a Metadata property (stored in an attribute) containing a TAB, but I found out that --fixmetadata is a misnomer: it sets the Info dictionary from Metadata rather than vice versa.

@bdoubrov
Contributor

Well, this is by design -- veraPDF assumes that XMP values take precedence over Info dictionary. But this is not the first time we get a request to make it an option: whether to sync XMP package from the Info dictionary or vice versa. So, we'll raise the priority of this one.

@ghost ghost added feature New functionality to be developed P3 Low priority bugs labels Jan 3, 2019
@ghost ghost added this to the v1.14-m4 milestone Jan 3, 2019
@carlwilson carlwilson removed this from the v1.14-m4 milestone Aug 22, 2019
@carlwilson
Contributor

This still feels like an open issue and would probably require a specific option to control. It's interesting from a preservation POV also.

@bdoubrov
Contributor

Yes, this is still open. We can raise the priority and include it in the next release.

@ghost ghost added P2 Medium priority issues to be scheduled in a future release and removed P3 Low priority bugs labels Oct 24, 2019
@ghost ghost added this to the 1.16 milestone Oct 24, 2019
@bdoubrov
Contributor

TAB and SPACE are two different (Unicode) characters. As Info strings and the corresponding XMP strings are compared character by character on their Unicode code points, the use of SPACE in XMP and TAB in the Info dictionary string will result in a validation error.
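
A code-point comparison of this kind can be sketched as follows. This is only an illustration, not veraPDF's implementation: the helper names `code_points` and `values_equivalent` are hypothetical, and the sample values are made up. The Info value is decoded from UTF-16BE with a byte order mark and the XMP values from UTF-8, to show that the comparison does not depend on the encoding.

```python
def code_points(text: str) -> list:
    """Return the numeric Unicode code points of a decoded string."""
    return [ord(ch) for ch in text]


def values_equivalent(info_value: str, xmp_value: str) -> bool:
    """Hypothetical helper: character-by-character code point comparison."""
    return code_points(info_value) == code_points(xmp_value)


# Made-up sample values. The Info string is stored as UTF-16BE with a BOM,
# the XMP strings as UTF-8; all are decoded before being compared.
info_value = (b"\xfe\xff" + "PDF A".encode("utf-16-be")).decode("utf-16")
xmp_with_space = "PDF A".encode("utf-8").decode("utf-8")
xmp_with_tab = "PDF\tA".encode("utf-8").decode("utf-8")

print(values_equivalent(info_value, xmp_with_space))
print(values_equivalent(info_value, xmp_with_tab))
```

The first comparison succeeds despite the different encodings. The second fails because U+0009 and U+0020 are different code points.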

@bdoubrov bdoubrov removed the question label Dec 13, 2019
@bdoubrov bdoubrov closed this as completed Feb 7, 2020
@carlwilson carlwilson removed this from the 1.20 milestone Feb 10, 2021