draft support for winansi encoding #2

davidcarlisle · 2025-02-15T10:48:40Z

@zauguin apologies in advance for any horrors I've committed on your code...

I marked this as draft as there are some parts I'm not happy with, notably having to carry the encoding through by hand with a new enc field in the ctx.current_font table. But as far as I can tell it does work: if a font has no toUnicode mapping outside the font data but does have a known encoding (currently just "WinAnsi"), then that mapping gets applied.
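For illustration only, the fallback described above might look roughly like the following. This is a minimal sketch, not the code in this branch: apart from the `enc` field, every name here (`known_encodings`, `decode_string`, `font.tounicode`) is hypothetical, the table holds only three sample WinAnsi entries, and it assumes Lua 5.3 for the `\u{...}` escapes.

```lua
-- Hypothetical sketch; only the `enc` field name comes from the PR description.
-- A tiny excerpt of a WinAnsi byte -> UTF-8 table (three sample entries only).
local winansi = {
  ['\x80'] = '\u{20AC}', -- euro sign
  ['\x93'] = '\u{201C}', -- left double quotation mark
  ['\x94'] = '\u{201D}', -- right double quotation mark
}

-- Encodings we know how to decode, keyed by the font's `enc` field.
local known_encodings = { WinAnsi = winansi }

-- Decode a PDF string for a font: prefer an explicit toUnicode mapping,
-- otherwise fall back to a known encoding, otherwise give up (nil).
local function decode_string(font, bytes)
  if font.tounicode then
    return font.tounicode(bytes)
  end
  local map = known_encodings[font.enc]
  if not map then
    return nil
  end
  -- Parentheses drop gsub's second return value (the match count).
  return (bytes:gsub('.', function(ch)
    if ch:byte() < 0x80 then
      return ch -- ASCII bytes are passed through unchanged in this sketch
    end
    return map[ch] or '\u{FFFD}' -- unmapped byte: replacement character
  end))
end
```

With this sketch, `decode_string({ enc = 'WinAnsi' }, '\x93hi\x94')` yields the text wrapped in curly quotes, while a font with neither a toUnicode mapping nor a known encoding yields `nil`.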

This means that text is shown in more cases. In the pdf ref suite that is sadly only PDFUA-Ref-2-02_Invoice.xml, but Tagged-PDF-Best-Practice-Guide.pdf does better, as do some other tests.

No pressure to merge this, but it's checked in so I don't lose it.

davidcarlisle · 2025-02-16T10:09:31Z

I removed the draft status. It's still only partial: in particular it ignores any /Differences array, and it doesn't handle the other pre-defined encodings such as MacRoman. However, this branch makes no difference to any of the latex-derived examples (which have a toUnicode map), and it does much better on the other test files, such as the pdf/ua-1 suite and the tagged examples from pdfa.org, which use WinAnsi quite a bit.

So for now at least I've switched https://texlive.net/showtags to use this branch.

davidcarlisle · 2025-02-18T12:03:33Z

show_pdf_tags/decode.lua

+  ['\xFD'] = '\u{00FD}',
+  ['\xFE'] = '\u{00FE}',
+  ['\xFF'] = '\u{00FF}'
+}


A0 to AF is in fact the identity, so we could delete those entries and add that range to the one handled by the utf8.char function below (not sure which is simpler/faster).
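As a rough sketch of the two options being weighed, assuming Lua 5.3 (for `utf8.char` and `\u{...}`): the names `explicit` and `decode_byte` are hypothetical and not taken from decode.lua, and the range is the A0 to AF one mentioned above.

```lua
-- Hypothetical sketch; names are illustrative, not taken from decode.lua.

-- Option 1: keep explicit table entries for every byte in 0xA0..0xAF.
local explicit = {}
for b = 0xA0, 0xAF do
  explicit[string.char(b)] = utf8.char(b)
end

-- Option 2: leave those bytes out of the table and treat the range as
-- the identity at lookup time.
local function decode_byte(table_part, ch)
  local mapped = table_part[ch]
  if mapped then
    return mapped
  end
  local b = string.byte(ch)
  if b >= 0xA0 and b <= 0xAF then
    return utf8.char(b) -- byte value equals the Unicode code point
  end
  return nil -- not handled by this sketch
end
```

Both options give the same result for bytes in the range. The table lookup is a single hash access, while the range check trades a little arithmetic for a shorter table.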

draft support for winansi encoding

cd62c8b

davidcarlisle requested a review from zauguin February 15, 2025 10:48

avoid double utf8 encoding

6107f26

davidcarlisle marked this pull request as ready for review February 16, 2025 10:04

davidcarlisle commented Feb 18, 2025


davidcarlisle added 3 commits February 18, 2025 23:22

macroman, just in decode for now

aabf056

update for new verbatim tagging

295cb6d

phoneme, phoneticAlphabet and (in schema) revision

6b982e2
