draft support for winansi encoding #2

Open
wants to merge 5 commits into
base: trunk

@davidcarlisle davidcarlisle commented Feb 15, 2025

@zauguin apologies in advance for any horrors I've committed on your code...

I marked this as draft as there are some parts I'm not happy with, notably having to carry the encoding through by hand with a new enc field in the ctx.current_font table. But as far as I can tell it does work: if a font has no toUnicode mapping outside the font data, but does have a known encoding (currently just "WinAnsi"), then that mapping gets applied.

This means that it shows text in more cases (only PDFUA-Ref-2-02_Invoice.xml in the pdf ref suite sadly) but Tagged-PDF-Best-Practice-Guide.pdf does better, as well as some other tests.

No pressure to merge this, but it's checked in so I don't lose it.
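
The fallback described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this branch: the `enc` field is the one mentioned above, while `tounicode`, `known_encodings` and `decode` are hypothetical names, and the table holds just three WinAnsi entries.

```lua
-- Hypothetical sketch: every name except `enc` is an assumption,
-- not taken from the branch.

-- A tiny excerpt of a WinAnsi byte -> UTF-8 table (three entries only).
local winansi = {
  ['\x80'] = '\u{20AC}', -- EURO SIGN
  ['\x93'] = '\u{201C}', -- LEFT DOUBLE QUOTATION MARK
  ['\x94'] = '\u{201D}', -- RIGHT DOUBLE QUOTATION MARK
}

-- Encodings we know how to map; currently just WinAnsi.
local known_encodings = { WinAnsi = winansi }

-- Decode a string of single-byte character codes for a font record.
-- An explicit mapping (font.tounicode, hypothetical) wins if present;
-- otherwise fall back to the table selected by font.enc;
-- otherwise return nil, meaning no text can be recovered.
local function decode(font, bytes)
  local map = font.tounicode or known_encodings[font.enc]
  if not map then return nil end
  -- Bytes missing from the map are passed through unchanged.
  return (bytes:gsub('.', function(b) return map[b] or b end))
end
```

For example, `decode({ enc = 'WinAnsi' }, '\x93Hi\x94')` yields the text wrapped in curly quotes, while a font with neither a mapping nor a known encoding yields `nil`.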

@davidcarlisle davidcarlisle marked this pull request as ready for review February 16, 2025 10:04
@davidcarlisle
Member Author

I removed the draft status. It's still only partial: in particular it ignores any /Differences array and it doesn't handle the other pre-defined encodings such as MacRoman. However, this branch makes no difference to any of the latex-derived examples (which have a toUnicode map), and it does much better on the other test files such as the pdf/ua-1 suite and tagged examples from pdfa.org, which use WinAnsi quite a bit.

So for now at least I've switched https://texlive.net/showtags to use this branch

['\xFD'] = '\u{00FD}',
['\xFE'] = '\u{00FE}',
['\xFF'] = '\u{00FF}'
}

A0 to AF is in fact the identity, so we could delete those entries and add that range to the range using the utf8.char function below (not sure which is simpler/faster).
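
A minimal sketch of the loop-based alternative, for comparison with the explicit entries. The names `winansi` and `add_identity_range` are hypothetical stand-ins rather than the identifiers used in the branch, and only two of the non-identity entries are shown. It assumes the codes in the range should be treated as plain Latin-1 code points.

```lua
-- Sketch under assumptions: `winansi` stands in for the mapping table
-- in the diff; only two explicit (non-identity) entries are shown.
local winansi = {
  ['\x80'] = '\u{20AC}', -- EURO SIGN
  ['\x99'] = '\u{2122}', -- TRADE MARK SIGN
}

-- Fill an identity range: the byte with value n maps to code point U+00nn.
-- string.char builds the one-byte key, utf8.char the UTF-8 encoded value.
local function add_identity_range(map, first, last)
  for n = first, last do
    map[string.char(n)] = utf8.char(n)
  end
  return map
end

-- The range named in the comment above.
add_identity_range(winansi, 0xA0, 0xAF)
```

The entries visible in the diff (FD to FF) are also identity mappings, so the upper bound could presumably be raised if the rest of the block follows the same pattern. The loop is shorter, while the explicit entries keep the whole table readable in one place; which is faster is not measured here.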
