Class: Ucode::Glyphs::GridDetector

Inherits:
Object
Defined in:
lib/ucode/glyphs/grid_detector.rb

Overview

Detects the chart grid in a Code Charts PDF page rendered to SVG.

The SVG that pdftocairo / pdf2svg / dvisvgm produces from a PDF page contains every visible element (title, block name, row labels, codepoint digits, and the actual character glyphs) as positioned `<use>` references into a `<defs>` block of named glyph outlines. The character cells we want to extract correspond to glyphs whose bounding box is larger than that of any label or digit glyph on the page, because the chart's character samples are drawn at a larger size than any of the surrounding text.
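
The size-based selection described above can be illustrated with plain data. This is a minimal sketch, not the library's implementation: it assumes bounding boxes have already been estimated and are passed as a hash of glyph id to `[width, height]` in points. The `char_sized_ids` helper and the sample ids are hypothetical; the 8 pt default is the one named in the algorithm description.

```ruby
# Hypothetical helper: returns the ids of glyphs whose width AND height
# both exceed the threshold (in points).
def char_sized_ids(bboxes, threshold: 8.0)
  bboxes.select { |_id, (w, h)| w > threshold && h > threshold }.keys
end

# Illustrative data only; real ids and sizes depend on the SVG renderer.
sample = {
  "glyph0-1" => [4.2, 6.0],   # digit in a codepoint label
  "glyph1-3" => [9.5, 12.1],  # character sample
  "glyph1-4" => [10.0, 7.5],  # wide but short, so rejected
  "glyph2-0" => [11.3, 13.0]  # character sample from a second font
}
p char_sized_ids(sample)
```

Requiring both dimensions to pass keeps wide-but-short marks such as rules or dashes in label text out of the result.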

Algorithm:

1. Walk `<defs>`, estimate each glyph's bbox via `PathBbox`.
2. Classify a glyph as "character-sized" when its width and
   height both exceed `CharSizeThreshold` (default 8 pt).
   This excludes title, row-label, and digit glyphs while
   keeping every actual character sample — including pages
   where the chart mixes multiple character fonts (e.g. the
   Basic Latin page uses one font for punctuation/digits and
   another for letters).
3. Collect every `<use>` that references a character-sized
   glyph; these are the cell origins.
4. Cluster the Y values of those uses into rows, and within
   each row cluster the X values into columns.
5. Drop rows whose column count diverges from the modal value
   (these are footer/header artifacts, not chart rows).
6. Return a `Grid` value object anchored at the top-left cell
   with uniform column/row pitches derived from the median
   spacing between adjacent clusters.
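
Steps 4 to 6 can be sketched in plain Ruby. The following is an illustration under stated assumptions, not the detector's actual code: cell origins are given as `[x, y]` pairs, clustering is a simple gap-based split, and the `tolerance` value and all method names (`cluster_1d`, `median_pitch`, `grid_rows`) are hypothetical.

```ruby
# Groups 1-D coordinates into clusters: a new cluster starts whenever the
# gap to the previous sorted value exceeds `tolerance` (assumed, in pt).
def cluster_1d(values, tolerance: 2.0)
  values.sort.slice_when { |a, b| b - a > tolerance }.to_a
end

def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Median distance between the centers of adjacent clusters, or nil when
# there are fewer than two clusters.
def median_pitch(clusters)
  centers = clusters.map { |c| c.sum / c.size.to_f }
  return nil if centers.size < 2

  median(centers.each_cons(2).map { |a, b| b - a })
end

# Clusters points into rows by Y, then each row into columns by X, and
# keeps only rows whose column count equals the most common (modal) count.
def grid_rows(points, tolerance: 2.0)
  by_y = points.sort_by { |_x, y| y }
  rows = by_y.slice_when { |a, b| b[1] - a[1] > tolerance }.map do |row|
    cluster_1d(row.map(&:first), tolerance: tolerance)
  end
  modal = rows.map(&:size).tally.max_by { |_count, freq| freq }&.first
  rows.select { |cols| cols.size == modal }
end

# Three rows of four cells plus a single stray footer glyph.
points = [100, 120, 140].product([10, 30, 50, 70]).map { |y, x| [x, y] }
points << [40, 200]
rows = grid_rows(points)
puts rows.size                # the footer row is dropped
puts median_pitch(rows.first) # column pitch
```

Using the median rather than the mean for the pitch means a single oversized gap, such as the distance to a footer, does not skew the result.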

The detector is pure (no I/O): it takes a parsed Nokogiri document and returns a `Grid`, or nil when no character grid is found.

Defined Under Namespace

Classes: UsePosition

Class Method Summary

  • .detect(doc, block_first_cp:) ⇒ Ucode::Glyphs::Grid?

    Returns nil if no character grid could be detected.

Class Method Details

.detect(doc, block_first_cp:) ⇒ Ucode::Glyphs::Grid?

Returns nil if no character grid could be detected.

Parameters:

  • doc (Nokogiri::XML::Document)
  • block_first_cp (Integer)

    first codepoint of the block; stored on the Grid so callers can map codepoint ↔ cell.

Returns:

  • (Ucode::Glyphs::Grid, nil)

    the detected grid, or nil if no character grid could be detected.

# File 'lib/ucode/glyphs/grid_detector.rb', line 53

def detect(doc, block_first_cp:)
  uses = collect_uses(doc)
  return nil if uses.empty?

  char_glyph_ids = char_sized_glyph_ids(doc)
  return nil if char_glyph_ids.empty?

  cell_uses = uses.select { |u| char_glyph_ids.include?(u.glyph_id) }
  return nil if cell_uses.empty?

  build_grid(cell_uses, block_first_cp)
end
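
The codepoint ↔ cell mapping that `block_first_cp` enables can be sketched as follows. This assumes the grid follows the usual Code Charts layout, where each column holds 16 consecutive codepoints and the last hex digit selects the row. The `cell_for` and `codepoint_for` helpers are hypothetical and are not part of `Grid`'s documented API.

```ruby
# Maps a codepoint to [column, row], both zero-based, relative to the
# first codepoint of the block. Assumes `rows` codepoints per column.
def cell_for(codepoint, block_first_cp:, rows: 16)
  (codepoint - block_first_cp).divmod(rows)
end

# Inverse mapping: [column, row] back to a codepoint.
def codepoint_for(column, row, block_first_cp:, rows: 16)
  block_first_cp + column * rows + row
end

column, row = cell_for(0x41, block_first_cp: 0x0000)
puts "U+0041 is in column #{column}, row #{row}"
```

Combined with the anchor cell and the column/row pitches, such a cell index would be enough to compute where on the page a given codepoint's glyph should appear.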