Module: Ucode::Glyphs::EmbeddedFonts::ToUnicode

Defined in:
lib/ucode/glyphs/embedded_fonts/tounicode.rb

Overview

Parses a PDF ToUnicode CMap stream into a `cid => codepoint` Hash.

PDF ToUnicode CMaps (Adobe Technical Note #5014) use a small PostScript-like syntax with three constructs that matter to us:

* `N begincodespacerange ... endcodespacerange` — declares the
  valid code space. We ignore this; we just take whatever the
  bfchar/bfrange entries hand us.
* `N beginbfchar ... endbfchar` — one-to-one cid → unicode
  mappings, one pair per line: `<cid_hex> <uni_hex>`.
* `N beginbfrange ... endbfrange` — range mappings. Two forms:
    * `<lo> <hi> <start>` — cids lo..hi map to consecutive
      codepoints starting at `start`.
    * `<lo> <hi> [<u1> <u2> ... <un>]` — explicit per-cid
      mapping within the range.
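
The three constructs above can be sketched in a few lines of Ruby. This is a minimal illustration, not the module's own `scan_bfchar`/`scan_bfrange` code (which is not shown here): `sketch_cmap_entries` and the sample CMap are hypothetical, the targets are kept as raw hex strings, and the consecutive bfrange form is expanded by incrementing the whole hex value, which is an assumed simplification.

```ruby
# Hypothetical helper for illustration only; not part of Ucode.
# Returns { cid (Integer) => raw target hex (String) }.
def sketch_cmap_entries(cmap_text)
  entries = {}

  # bfchar blocks: each entry is `<cid_hex> <uni_hex>`.
  cmap_text.scan(/beginbfchar(.*?)endbfchar/m) do |(body)|
    body.scan(/<(\h+)>\s*<(\h+)>/) do |cid_hex, uni_hex|
      entries[cid_hex.hex] = uni_hex.upcase
    end
  end

  # bfrange blocks: `<lo> <hi> <start>` or `<lo> <hi> [<u1> ... <un>]`.
  cmap_text.scan(/beginbfrange(.*?)endbfrange/m) do |(body)|
    body.scan(/<(\h+)>\s*<(\h+)>\s*(?:<(\h+)>|\[(.*?)\])/m) do |lo, hi, start, list|
      cids = (lo.hex..hi.hex).to_a
      if start
        # Consecutive form: bump the target by one for each cid in the range.
        cids.each_with_index do |cid, i|
          entries[cid] = format("%0#{start.length}X", start.hex + i)
        end
      else
        # Array form: pair each cid with the explicit target at the same index.
        targets = list.scan(/<(\h+)>/).flatten
        cids.zip(targets).each { |cid, t| entries[cid] = t.upcase if t }
      end
    end
  end

  entries
end

# Made-up sample stream; the codespacerange block is deliberately ignored.
sample = <<~CMAP
  1 begincodespacerange
  <0000> <FFFF>
  endcodespacerange
  2 beginbfchar
  <0003> <0020>
  <0004> <0041>
  endbfchar
  2 beginbfrange
  <0010> <0012> <0061>
  <0020> <0021> [<00E9> <00660069>]
  endbfrange
CMAP

entries = sketch_cmap_entries(sample)
```

With this sample, cids 0x10..0x12 receive the targets `0061`, `0062`, `0063`, while cids 0x20 and 0x21 receive the two explicit array targets.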

The Unicode target string is UTF-16BE hex. It may encode one codepoint (4 hex digits for a BMP codepoint, 8 for an astral codepoint encoded as a UTF-16 surrogate pair) or a sequence of several codepoints, as used for ligatures. For our purposes (attributing one Code Charts glyph to one codepoint), we take the first codepoint of the target string and ignore the rest.
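
The "first codepoint" rule can be sketched as follows. `first_codepoint` is a hypothetical helper written for this illustration, not the module's actual decoder. It assumes the target is a whole number of 4-hex-digit UTF-16 code units and returns `nil` for anything shorter.

```ruby
# Hypothetical helper for illustration only; not part of Ucode.
# Returns the first codepoint of a ToUnicode target hex string, or nil.
def first_codepoint(uni_hex)
  # Split the UTF-16BE target into 16-bit code units (4 hex digits each).
  units = uni_hex.scan(/\h{4}/).map(&:hex)
  return nil if units.empty?

  hi, lo = units
  if (0xD800..0xDBFF).cover?(hi) && lo && (0xDC00..0xDFFF).cover?(lo)
    # Surrogate pair: combine both units into one astral codepoint.
    0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
  else
    # BMP codepoint; any following units (e.g. a ligature tail) are ignored.
    hi
  end
end

bmp      = first_codepoint("0041")      # U+0041
astral   = first_codepoint("D835DC00")  # U+1D400, from a surrogate pair
ligature = first_codepoint("00660069")  # U+0066; the trailing U+0069 is dropped
```

Dropping the tail of a ligature target loses information, which is acceptable here only because each glyph is attributed to exactly one codepoint.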

Class Method Summary

Class Method Details

.parse(cmap_text) ⇒ Hash{Integer=>Integer}

Returns frozen cid → codepoint map.

Parameters:

  • cmap_text (String)

    raw decoded CMap stream text

Returns:

  • (Hash{Integer=>Integer})

    frozen cid → codepoint map



# File 'lib/ucode/glyphs/embedded_fonts/tounicode.rb', line 31

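def self.parse(cmap_text)
  result = {}
  # Collect the one-to-one bfchar entries, then the bfrange entries,
  # into the same Hash.
  scan_bfchar(cmap_text, result)
  scan_bfrange(cmap_text, result)
  # Freeze so callers cannot mutate the returned map.
  result.freeze
end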