Module: Ucode::Glyphs::EmbeddedFonts::ToUnicode
- Defined in:
- lib/ucode/glyphs/embedded_fonts/tounicode.rb
Overview
Parses a PDF ToUnicode CMap stream into a `cid => codepoint` Hash.
PDF ToUnicode CMaps (Adobe Technical Note #5014) use a small PostScript-like syntax with three constructs that matter to us:
* `N begincodespacerange ... endcodespacerange`: declares the valid code space. We ignore this; we just take whatever the bfchar/bfrange entries hand us.
* `N beginbfchar ... endbfchar`: one-to-one cid → unicode mappings, one pair per line: `<cid_hex> <uni_hex>`.
* `N beginbfrange ... endbfrange`: range mappings. Two forms:
  * `<lo> <hi> <start>`: cids lo..hi map to consecutive codepoints starting at `start`.
  * `<lo> <hi> [<u1> <u2> ... <un>]`: explicit per-cid mapping within the range.
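As a rough illustration of the constructs listed above, the following sketch scans a small hand-written CMap fragment. The module name `ToUnicodeSketch`, its regular expressions, and the sample CMap are assumptions made for this example; they are not the module's actual `scan_bfchar`/`scan_bfrange` implementation. Targets are reduced to their first UTF-16 code unit, so surrogate pairs are not handled here.

```ruby
# Hypothetical sketch of bfchar/bfrange scanning; not the library's code.
module ToUnicodeSketch
  # Bodies of every "N begin<kind> ... end<kind>" section in the text.
  def self.sections(text, kind)
    text.scan(/begin#{kind}(.*?)end#{kind}/m).flatten
  end

  # First UTF-16 code unit of a hex target string (BMP only in this sketch).
  def self.first_unit(hex)
    hex[0, 4].hex
  end

  def self.parse(text)
    map = {}

    # bfchar: one "<cid> <uni>" pair per entry.
    sections(text, "bfchar").each do |body|
      body.scan(/<(\h+)>\s*<(\h+)>/) do |cid, uni|
        map[cid.hex] = first_unit(uni)
      end
    end

    # bfrange: "<lo> <hi> <start>" or "<lo> <hi> [<u1> ... <un>]".
    sections(text, "bfrange").each do |body|
      body.scan(/<(\h+)>\s*<(\h+)>\s*(?:<(\h+)>|\[(.*?)\])/m) do |lo, hi, start, list|
        cids = (lo.hex..hi.hex).to_a
        if start
          # Consecutive codepoints starting at the start value.
          cids.each_with_index { |cid, i| map[cid] = first_unit(start) + i }
        else
          # Explicit per-cid targets; cids without a target are skipped.
          targets = list.scan(/<(\h+)>/).flatten
          cids.zip(targets).each { |cid, uni| map[cid] = first_unit(uni) if uni }
        end
      end
    end

    map.freeze
  end
end

# Made-up sample input. The codespacerange section is present but ignored.
SAMPLE = <<~CMAP
  1 begincodespacerange
  <0000> <FFFF>
  endcodespacerange
  2 beginbfchar
  <0003> <0020>
  <0011> <002E>
  endbfchar
  2 beginbfrange
  <0024> <0026> <0041>
  <0030> <0031> [<00E9> <00660069>]
  endbfrange
CMAP

MAP = ToUnicodeSketch.parse(SAMPLE)
```

In this sample, cids 0x24..0x26 map to `A`, `B`, `C`, and the last array entry `<00660069>` (an "fi" sequence) contributes only its first codepoint, `f`.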
The Unicode target string may encode a single codepoint (4 hex digits for a BMP codepoint, 8 for an astral codepoint encoded as a UTF-16 surrogate pair) or a sequence of codepoints (used for ligatures). For our purposes, attributing one Code Charts glyph to one codepoint, we take the first codepoint of the target string and ignore the rest.
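The "first codepoint" rule can be sketched as follows. The module `TargetSketch` and its method `first_codepoint` are hypothetical names introduced for illustration, and the sketch assumes the target is a hex string of whole UTF-16BE code units (a multiple of 4 hex digits); it is not necessarily how the module decodes targets.

```ruby
# Hypothetical helper; illustrates taking the first codepoint of a target.
module TargetSketch
  HIGH_SURROGATES = (0xD800..0xDBFF)
  LOW_SURROGATES  = (0xDC00..0xDFFF)

  # Returns the first Unicode codepoint encoded in the hex string,
  # or nil when the string holds no complete 4-digit code unit.
  def self.first_codepoint(hex)
    hi, lo = hex.scan(/\h{4}/).map(&:hex)
    return nil if hi.nil?

    if HIGH_SURROGATES.cover?(hi) && lo && LOW_SURROGATES.cover?(lo)
      # Combine a surrogate pair into one astral codepoint.
      0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    else
      # BMP codepoint; any following code units are ignored.
      hi
    end
  end
end
```

With this sketch, `"0041"` yields U+0041, `"D83DDE00"` yields U+1F600, and the ligature target `"00660069"` yields only U+0066.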
Class Method Summary
- .parse(cmap_text) ⇒ Hash{Integer=>Integer}
Frozen cid → codepoint map.
Class Method Details
.parse(cmap_text) ⇒ Hash{Integer=>Integer}
Returns a frozen cid → codepoint map.
# File 'lib/ucode/glyphs/embedded_fonts/tounicode.rb', line 31
def self.parse(cmap_text)
  result = {}
  scan_bfchar(cmap_text, result)
  scan_bfrange(cmap_text, result)
  result.freeze
end