Class: Ucode::Glyphs::MonolithPageMap
- Inherits: Object
- Defined in: lib/ucode/glyphs/monolith_page_map.rb
Overview
Maps a Unicode block’s first codepoint to its page range inside the monolith `CodeCharts.pdf` by parsing the PDF’s bookmark outline and matching each bookmark title to a Block.name from `Blocks.txt`.
Each chart cluster printed by the Unicode Consortium is a single bookmark entry:
BookmarkTitle: Greek and Coptic
BookmarkLevel: 1
BookmarkPageNumber: 415
The cluster title usually equals a Block.name verbatim, but a few clusters carry a heading that prepends “C0 Controls and ” / “C1 Controls and ” to the block name. We resolve both forms.
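The class's own resolver is not shown on this page, so the following is only a minimal sketch of how both title forms could be resolved. The names `CONTROL_PREFIXES`, `resolve_title`, and `SAMPLE_NAMES` are hypothetical, and the sample hash stands in for the name-to-first-codepoint table built from `Blocks.txt`.

```ruby
# Hypothetical sketch: try the bookmark title verbatim first, then retry
# with a known "Cx Controls and " prefix removed.
CONTROL_PREFIXES = ["C0 Controls and ", "C1 Controls and "].freeze

def resolve_title(title, name_to_first_cp)
  return nil if title.nil?
  return name_to_first_cp[title] if name_to_first_cp.key?(title)

  CONTROL_PREFIXES.each do |prefix|
    next unless title.start_with?(prefix)

    stripped = title.delete_prefix(prefix)
    return name_to_first_cp[stripped] if name_to_first_cp.key?(stripped)
  end
  nil # title does not correspond to any known block
end

# Illustrative subset of block names mapped to their first codepoints.
SAMPLE_NAMES = {
  "Basic Latin" => 0x0000,
  "Latin-1 Supplement" => 0x0080,
  "Greek and Coptic" => 0x0370
}.freeze

resolve_title("Greek and Coptic", SAMPLE_NAMES)            # verbatim match
resolve_title("C0 Controls and Basic Latin", SAMPLE_NAMES) # prefixed match
```

Trying the verbatim title first keeps the common case to a single hash lookup and only falls back to prefix stripping for the few prefixed headings.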
A cluster’s end page is one page before the next cluster’s start page; the last cluster ends on the PDF’s last page.
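As a worked example of the end-page rule, here is a standalone sketch that uses a hypothetical `Entry` struct in place of MapEntry. The start pages 10 and 14 are made up for illustration; 415 and the 3,156-page total come from the overview.

```ruby
# Hypothetical stand-in for MapEntry, with only the fields the rule needs.
Entry = Struct.new(:first_cp, :start_page, :end_page, keyword_init: true)

def attach_end_pages_sketch(entries, total_pages = nil)
  sorted = entries.sort_by(&:start_page)
  # Each entry ends one page before the next entry starts.
  sorted.each_cons(2) { |cur, nxt| cur.end_page = nxt.start_page - 1 }
  # The last entry runs to the end of the PDF (nil if the total is unknown).
  sorted.last.end_page = total_pages unless sorted.empty?
  sorted
end

entries = [
  Entry.new(first_cp: 0x0370, start_page: 415),
  Entry.new(first_cp: 0x0000, start_page: 10),
  Entry.new(first_cp: 0x0080, start_page: 14)
]

ranges = attach_end_pages_sketch(entries, 3156)
ranges.each do |e|
  puts format("U+%04X: pages %d..%d", e.first_cp, e.start_page, e.end_page)
end
```

With these inputs the three ranges come out as 10..13, 14..414, and 415..3156.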
The map is cached as JSON at `data/codecharts_page_map.json` so we don’t re-scan the 3,156-page monolith on every run.
Defined Under Namespace
Classes: MapEntry
Class Method Summary
- .attach_end_pages(entries, total_pages = nil) ⇒ Array<MapEntry>
  Pure: attach end_pages by sorting entries and assigning each entry’s end to one page before the next entry’s start.
- .build(monolith_path:, blocks:) ⇒ Hash{Integer => MapEntry}
  Build the map by parsing the monolith’s outline and matching each bookmark title to a Block.
- .dump_bookmarks(monolith_path) ⇒ Object
  I/O helper (impure): run `pdftk <monolith_path> dump_data` and return its output, or an empty string on failure.
- .load(monolith_path:, blocks:, cache_path: nil) ⇒ Hash{Integer => MapEntry}
  Load from cache, or build and cache.
- .page_count(monolith_path) ⇒ Object
  I/O helper (impure): read the page count from `pdfinfo` output; nil on failure.
- .parse_bookmarks(dump, name_to_first_cp) ⇒ Array<MapEntry>
  Pure: parse a `pdftk dump_data` string into a list of MapEntry rows (without end_pages).
- .range_for(map, block_first_cp) ⇒ MapEntry?
  Look up a block’s page range by its first cp.
Class Method Details
.attach_end_pages(entries, total_pages = nil) ⇒ Array<MapEntry>
Pure: attach end_pages by sorting entries and assigning each entry’s end to one page before the next entry’s start.
# File 'lib/ucode/glyphs/monolith_page_map.rb', line 96

def attach_end_pages(entries, total_pages = nil)
  sorted = entries.sort_by(&:start_page)
  sorted.each_with_index do |entry, i|
    next_entry = sorted[i + 1]
    entry.end_page = next_entry ? next_entry.start_page - 1 : total_pages
  end
  sorted
end
.build(monolith_path:, blocks:) ⇒ Hash{Integer => MapEntry}
Build the map by parsing the monolith’s outline and matching each bookmark title to a Block.
# File 'lib/ucode/glyphs/monolith_page_map.rb', line 53

def build(monolith_path:, blocks:)
  name_to_first_cp = blocks.each_with_object({}) do |b, h|
    h[b.name] = b.range_first
  end
  total_pages = page_count(monolith_path)
  entries = parse_bookmarks(dump_bookmarks(monolith_path), name_to_first_cp)
  attach_end_pages(entries, total_pages)
  entries.each_with_object({}) do |e, h|
    h[e.first_cp] = e
  end
end
.dump_bookmarks(monolith_path) ⇒ Object
I/O helper (impure): runs `pdftk <monolith_path> dump_data` and returns the combined stdout and stderr output, or an empty string if the command exits unsuccessfully.
# File 'lib/ucode/glyphs/monolith_page_map.rb', line 131

def dump_bookmarks(monolith_path)
  out, status = Open3.capture2e("pdftk", monolith_path.to_s, "dump_data")
  return "" unless status.success?

  out
end
.load(monolith_path:, blocks:, cache_path: nil) ⇒ Hash{Integer => MapEntry}
Load from cache, or build and cache.
# File 'lib/ucode/glyphs/monolith_page_map.rb', line 110

def load(monolith_path:, blocks:, cache_path: nil)
  cache = cache_path && Pathname.new(cache_path)
  if cache&.exist?
    return load_from_json(cache.read)
  end

  map = build(monolith_path: monolith_path, blocks: blocks)
  write_cache(map, cache) if cache
  map
end
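The load-or-build flow can be illustrated on its own. The sketch below is not the class's implementation: `load_or_build` is a hypothetical helper, the cache layout (a JSON object keyed by codepoint) is an assumption because `load_from_json` and `write_cache` are not shown here, and the end page 420 is invented sample data.

```ruby
require "json"
require "pathname"
require "tmpdir"

# Hypothetical helper: return the cached map if the file exists, otherwise
# call the block to build it and write the result to the cache.
def load_or_build(cache_path)
  cache = cache_path && Pathname.new(cache_path)
  if cache&.exist?
    # JSON object keys are always strings, so restore the Integer keys.
    return JSON.parse(cache.read).transform_keys(&:to_i)
  end

  map = yield
  cache&.write(JSON.generate(map))
  map
end

build_calls = 0
first, second = Dir.mktmpdir do |dir|
  path = File.join(dir, "page_map.json")
  a = load_or_build(path) do
    build_calls += 1
    { 0x0370 => { "start_page" => 415, "end_page" => 420 } }
  end
  # Second call finds the cache file, so its block never runs.
  b = load_or_build(path) do
    build_calls += 1
    {}
  end
  [a, b]
end
```

After both calls `build_calls` is 1, and the second result is read back from the cache rather than rebuilt. Any JSON-backed cache of this map has to convert the keys back to integers on load, since the lookup key is an Integer codepoint.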
.page_count(monolith_path) ⇒ Object
I/O helper (impure): runs `pdfinfo` on the monolith and returns the value of its `Pages:` line as an Integer, or nil if the command fails or the line is missing.
# File 'lib/ucode/glyphs/monolith_page_map.rb', line 138

def page_count(monolith_path)
  out, status = Open3.capture2e("pdfinfo", monolith_path.to_s)
  return nil unless status.success?

  match = out.match(/^Pages:\s+(\d+)/)
  match ? match[1].to_i : nil
end
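The parsing step can be exercised without running `pdfinfo`. In this small sketch `pages_from_pdfinfo` is a hypothetical name for the pure half of the method, and `SAMPLE_PDFINFO` is hand-written text in the general shape of `pdfinfo` output, not captured from a real run.

```ruby
# Hypothetical pure helper: extract the page count from pdfinfo-style text.
def pages_from_pdfinfo(output)
  match = output.match(/^Pages:\s+(\d+)/)
  match ? match[1].to_i : nil
end

# Hand-written sample; real pdfinfo output has more fields.
SAMPLE_PDFINFO = <<~TEXT
  Producer:       example
  Pages:          3156
  Encrypted:      no
TEXT

pages_from_pdfinfo(SAMPLE_PDFINFO)  # Integer page count
pages_from_pdfinfo("no page line")  # nil when the line is absent
```

Keeping the regex match separate from the subprocess call makes the nil-on-missing-line behavior easy to unit test.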
.parse_bookmarks(dump, name_to_first_cp) ⇒ Array<MapEntry>
Pure: parse a `pdftk dump_data` string into a list of MapEntry rows (without end_pages). Exposed for unit tests and any caller that already has the dump cached.
# File 'lib/ucode/glyphs/monolith_page_map.rb', line 72

def parse_bookmarks(dump, name_to_first_cp)
  entries = []
  current_title = nil
  dump.each_line do |line|
    case line
    when BookmarkTitleRegex
      current_title = Regexp.last_match(1).strip
    when BookmarkPageRegex
      page = Regexp.last_match(1).to_i
      cp = resolve_first_cp(current_title, name_to_first_cp)
      entries << MapEntry.new(first_cp: cp, start_page: page) if cp
      current_title = nil
    end
  end
  entries.sort_by(&:start_page)
end
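Because the method is pure, it can be tried on a literal dump string. The sketch below is a standalone approximation: `TITLE_RE` and `PAGE_RE` are assumed patterns (the real BookmarkTitleRegex and BookmarkPageRegex are not shown on this page), rows are plain hashes instead of MapEntry, titles are matched verbatim only, and the second bookmark is invented to show an unmatched title being skipped.

```ruby
# Assumed patterns for the two pdftk dump_data lines of interest.
TITLE_RE = /\ABookmarkTitle:\s*(.*)$/
PAGE_RE  = /\ABookmarkPageNumber:\s*(\d+)/

def parse_bookmarks_sketch(dump, name_to_first_cp)
  rows = []
  title = nil
  dump.each_line do |line|
    case line
    when TITLE_RE
      title = Regexp.last_match(1).strip
    when PAGE_RE
      page = Regexp.last_match(1).to_i
      cp = name_to_first_cp[title]
      # Bookmarks whose title is not a known block name are dropped.
      rows << { first_cp: cp, start_page: page } if cp
      title = nil
    end
  end
  rows.sort_by { |row| row[:start_page] }
end

SAMPLE_DUMP = <<~TEXT
  BookmarkBegin
  BookmarkTitle: Greek and Coptic
  BookmarkLevel: 1
  BookmarkPageNumber: 415
  BookmarkBegin
  BookmarkTitle: Not A Block
  BookmarkLevel: 1
  BookmarkPageNumber: 9
TEXT

rows = parse_bookmarks_sketch(SAMPLE_DUMP, { "Greek and Coptic" => 0x0370 })
```

Here `rows` holds a single row for U+0370 starting at page 415; the unmatched bookmark produces nothing.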
.range_for(map, block_first_cp) ⇒ MapEntry?
Look up a block’s page range by its first cp.
# File 'lib/ucode/glyphs/monolith_page_map.rb', line 125

def range_for(map, block_first_cp)
  map[block_first_cp]
end