Class: Ucode::Glyphs::EmbeddedFonts::ContentStreamCorrelator

Inherits:
Object
Defined in:
lib/ucode/glyphs/embedded_fonts/content_stream_correlator.rb

Overview

Pillar 2 fallback: build a `codepoint => gid` map for a Type0 font whose PDF object graph has no `/ToUnicode` CMap stream.

The Code Charts draw every chart cell as a `<use>` element that references the font's GID via an `href` of the form `#font_<font_obj_id>_<gid>`. The chart also prints the row and column codepoint labels using one or more "label" fonts (small Latin glyphs) that show the hex codepoint as text. By clustering the labels positionally (Y-bucket for the row, X-bucket for the column) we recover the codepoint each cluster represents, then match each cluster to the specimen glyph at the same Y/X position.
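The clustering and matching described above can be sketched as follows. This is an illustration under stated assumptions, not the class's actual implementation: `Label`, `Specimen`, `bucket`, `label_clusters` and `match` are hypothetical names, each label is assumed to already carry the hex digit it draws, and "same position" is simplified to "nearest specimen in the same column bucket".

```ruby
# Hypothetical records. `char` is the hex digit a label glyph draws, which
# this sketch assumes is already known for every label <use>.
Label    = Struct.new(:char, :x, :y)
Specimen = Struct.new(:gid, :x, :y)

# Quantize a coordinate so that nearby positions land in the same bucket.
def bucket(value, size)
  (value / size).floor
end

# Group label glyphs by (row bucket, column bucket), then read each group
# left to right as one hex codepoint.
def label_clusters(labels, y_bucket:, x_bucket:)
  labels
    .group_by { |l| [bucket(l.y, y_bucket), bucket(l.x, x_bucket)] }
    .map do |(_row, col), group|
      text = group.sort_by(&:x).map(&:char).join
      { codepoint: text.to_i(16), col: col, y: group.sum(&:y) / group.size }
    end
end

# Pair every cluster with the closest specimen glyph in the same column
# bucket (a simplification of the positional match).
def match(clusters, specimens, x_bucket:)
  by_col = specimens.group_by { |s| bucket(s.x, x_bucket) }
  clusters.each_with_object({}) do |cluster, map|
    nearest = by_col.fetch(cluster[:col], []).min_by { |s| (s.y - cluster[:y]).abs }
    map[cluster[:codepoint]] = nearest.gid if nearest
  end
end
```

A smaller Y bucket separates rows that sit close together, while the X bucket only has to be narrower than one chart column, which is why the two sizes differ so much in scale.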

The algorithm generalizes the Tai Yo correlator that was tested against `data/pdfs/U1E6C0.pdf` (50/52 specimen codepoints matched, with the two missing being layout edge cases). The bucket sizes are configurable because some blocks use a tighter grid than others.

Inputs are deliberately pure: a string of SVG markup plus a Config. The catalog is responsible for sourcing the SVG (by rendering the relevant PDF page(s) via `mutool draw -F svg`) and for knowing which font_obj_ids are label fonts and which are specimen fonts on that page. That keeps this class trivially testable with synthetic SVG fixtures.

Defined Under Namespace

Classes: Config, Use

Constant Summary

DEFAULT_Y_BUCKET = 1.5
DEFAULT_X_BUCKET = 50.0

Instance Method Summary

Constructor Details

#initialize(config) ⇒ ContentStreamCorrelator

Returns a new instance of ContentStreamCorrelator.

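Parameters:

  • config (Config)

    supplies `y_bucket` and `x_bucket`; each falls back to its default constant when unset.
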
# File 'lib/ucode/glyphs/embedded_fonts/content_stream_correlator.rb', line 65

def initialize(config)
  @config = config
  # Fall back to the class defaults when the Config leaves a bucket size unset.
  @y_bucket = config.y_bucket || DEFAULT_Y_BUCKET
  @x_bucket = config.x_bucket || DEFAULT_X_BUCKET
end
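
The fallback in the constructor can be illustrated in isolation. This is a sketch only: `SketchConfig` and `effective_buckets` are hypothetical stand-ins, and the real nested Config class may carry more fields than the two bucket sizes read above.

```ruby
# Hypothetical stand-in for the nested Config class; only the two bucket
# fields are taken from the constructor shown above.
SketchConfig = Struct.new(:y_bucket, :x_bucket, keyword_init: true)

# Same values as the documented constants.
DEFAULT_Y_BUCKET = 1.5
DEFAULT_X_BUCKET = 50.0

# Mirror the constructor's `||` fallback: an unset (nil) field takes the default.
def effective_buckets(config)
  [config.y_bucket || DEFAULT_Y_BUCKET, config.x_bucket || DEFAULT_X_BUCKET]
end

defaults = effective_buckets(SketchConfig.new)                 # both defaults apply
tighter  = effective_buckets(SketchConfig.new(y_bucket: 1.0))  # tighter rows, default columns
```

A block with a tighter grid would pass a smaller `y_bucket` so that adjacent rows do not merge into one cluster.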

Instance Method Details

#correlate(svg) ⇒ Hash{Integer=>Integer}

Returns codepoint => gid. Empty if no clusters could be matched.

Parameters:

  • svg (String)

    rendered PDF page(s) as SVG markup. May contain multiple `<svg>` documents concatenated (one per page); the regex scan handles either case.

Returns:

  • (Hash{Integer=>Integer})

    codepoint => gid. Empty if no clusters could be matched.



# File 'lib/ucode/glyphs/embedded_fonts/content_stream_correlator.rb', line 76

def correlate(svg)
  uses = parse_uses(svg)
  # No <use> elements means there is nothing to correlate.
  return {} if uses.empty?

  partition_and_map(uses)
end
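
The body of `parse_uses` is not shown here. As a rough sketch of what a regex scan over `<use>` elements might look like, the following assumes that each element carries an `href` (or `xlink:href`) of the form `#font_<font_obj_id>_<gid>` with decimal ids, plus plain numeric `x` and `y` attributes. `UseRef` and `scan_uses` are hypothetical names, and the attribute layout emitted by a given renderer version may differ.

```ruby
# Hypothetical value object; the real class nests its own `Use`.
UseRef = Struct.new(:font_obj_id, :gid, :x, :y)

# Assumption: decimal ids in the href and plain numeric x/y attributes.
USE_TAG = /<use\b[^>]*>/
HREF    = /href="#font_(\d+)_(\d+)"/
X_ATTR  = /\sx="(-?\d+(?:\.\d+)?)"/
Y_ATTR  = /\sy="(-?\d+(?:\.\d+)?)"/

# Scan every <use> tag in the string. Because the scan never parses the
# markup as a single XML document, several concatenated <svg> documents
# are handled the same way as one.
def scan_uses(svg)
  svg.scan(USE_TAG).filter_map do |tag|
    href = HREF.match(tag)
    x = X_ATTR.match(tag)
    y = Y_ATTR.match(tag)
    next unless href && x && y # skip <use> elements that are not font glyphs

    UseRef.new(href[1].to_i, href[2].to_i, x[1].to_f, y[1].to_f)
  end
end
```

Since the input is just a string, a test fixture can be a few hand-written `<use>` tags rather than a rendered PDF page.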