Module: Pikuri::Extractor::HTML
- Defined in:
- lib/pikuri/extractor/html.rb
Overview
HTML → Markdown extractor.
Matched by content-type only (text/html / application/xhtml+xml), deliberately with no byte sniff. The web path always has the header; for local files a sniff would route Workspace::Read of an .html source file through readability extraction, when a developer reading an HTML file wants the source. Local HTML stays on the Passthrough arm until a consumer genuinely needs otherwise.
Always renders both views of the page when available:
- JSON-LD section. Any <script type="application/ld+json"> node whose @type matches a substantive schema.org content type (Product, Article, Recipe, …) is rendered as a header: title, metadata bullets (brand, SKU, price, rating, author, published), and the articleBody/description copy when present.
- Readability section. The page is run through Readability + reverse_markdown, with a <main>/<article> fallback for pages whose content sits mostly outside <p> tags.
The two sections are concatenated with a horizontal rule, so the LLM gets both the structured metadata and the rendered body and can pick whichever is more useful for the task. This trades some duplication (when a publisher embeds the article body in JSON-LD AND in HTML) for fewer type-based heuristics on which branch should win: the earlier “is this Article’s description a teaser or the real body?” carve-out is no longer needed because both end up in the output regardless.
Constant Summary
- CONTENT_TYPES =
Returns content-types this extractor claims.
%w[text/html application/xhtml+xml].freeze
- INTERESTING_TYPES =
Returns schema.org @type values that we treat as “the primary entity of this page” when picking a JSON-LD node to render. Order does not matter — the first matching node wins. Skips noise nodes (Organization, BreadcrumbList, WebSite, …) that ship on most pages but carry no page content.
%w[ Product Article NewsArticle BlogPosting Recipe Event Book Movie ].freeze
- READABILITY_TAGS =
Returns HTML tags preserved by the readability pass. Anything outside this list is stripped before Markdown conversion.
%w[ h1 h2 h3 h4 h5 h6 p div span ul ol li blockquote pre code a img strong em b i br hr table thead tbody tr td th ].freeze
- READABILITY_ATTRS =
Returns HTML attributes preserved by the readability pass; everything else (class, id, style, data-*) is dropped before Markdown conversion.
%w[href src alt title].freeze
- MAIN_FALLBACK_RATIO =
Returns minimum <main>/<article> to Readability text-length ratio that triggers the semantic-container fallback in readability_to_markdown. Picked low enough to catch the failure mode (Readability collapsing a page that uses divs/lists instead of <p>, e.g. vaadin.com/company, ~5x) but high enough that pages where both produce comparable output keep Readability’s noise filtering.
2.0
- MAIN_FALLBACK_MIN_CHARS =
Returns minimum text length the <main>/<article> container must hold before the fallback in readability_to_markdown can fire. Below this, the ratio comparison is dominated by noise and we’d swap on tiny pages where Readability is doing the right thing.
500
Class Method Summary
- .extract(io) ⇒ String
Render the HTML document behind io as Markdown by emitting both the JSON-LD section (when an interesting node is present) and the readability / <main> section, joined by a horizontal rule.
- .jsonld_section(html) ⇒ String?
Pick the first JSON-LD node whose @type matches one of INTERESTING_TYPES and render it as Markdown.
- .jsonld_to_markdown(node) ⇒ String
Render a single JSON-LD node as Markdown: a top-level title from name/headline, a bullet list of common useful fields (brand, SKU, price, rating, author, published date, …), the body copy, and the lead image.
- .kind ⇒ Symbol
Page#kind tag.
- .matches?(sample:, content_type:) ⇒ Boolean
- .parse_jsonld(html) ⇒ Array<Hash>
Collect every JSON-LD payload embedded in html, flattening @graph wrappers so callers see one flat array of schema.org nodes.
- .readability_to_markdown(html) ⇒ String
Run Readability over html to isolate the main content node, then convert that to Markdown via reverse_markdown.
Class Method Details
.extract(io) ⇒ String
Render the HTML document behind io as Markdown by emitting both the JSON-LD section (when an interesting node is present) and the readability / <main> section, joined by a horizontal rule. Either section may be missing — pages with no JSON-LD return only the readability output, and a malformed page with no extractable body returns only the JSON-LD render.
# File 'lib/pikuri/extractor/html.rb', line 104
def self.extract(io)
  html = io.read
  sections = [jsonld_section(html), readability_to_markdown(html)]
  sections.reject! { |s| s.nil? || s.strip.empty? }
  sections.join("\n\n---\n\n")
end
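The join step can be tried in isolation. The following is a minimal sketch, not the module's code: `join_sections` is a hypothetical helper that takes already-rendered section strings in place of the real `jsonld_section` and `readability_to_markdown` calls.

```ruby
# Hypothetical helper (not part of Pikuri): mirrors the reject-and-join
# step of .extract on pre-rendered section strings.
def join_sections(*sections)
  sections
    .reject { |s| s.nil? || s.strip.empty? } # drop missing or blank sections
    .join("\n\n---\n\n")                     # horizontal rule between the rest
end

both      = join_sections("# Widget\n\n- **SKU:** 123", "# Widget page\n\nBody text")
body_only = join_sections(nil, "# Plain page\n\nBody text")

puts both
puts body_only
```

When only one section survives the filter, no rule is emitted, so a page without JSON-LD yields exactly the readability output.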
.jsonld_section(html) ⇒ String?
Pick the first JSON-LD node whose @type matches one of INTERESTING_TYPES and render it as Markdown. Returns nil when no such node exists, in which case extract emits only the readability section.
No content-field gating: a node carrying just name/author/datePublished still renders (as a metadata-only header), because the readability pass independently produces the page body. That is the trade-off that lets us drop the type-based “is this teaser or article copy?” heuristics: duplication is acceptable when both views are available, and the LLM can pick whichever it needs.
# File 'lib/pikuri/extractor/html.rb', line 127
def self.jsonld_section(html)
  node = parse_jsonld(html).find do |n|
    Array(n['@type']).any? { |t| INTERESTING_TYPES.include?(t) }
  end
  node ? jsonld_to_markdown(node) : nil
end
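The selection rule can be illustrated on hand-written nodes. This sketch copies INTERESTING_TYPES from the constant summary; the sample nodes are invented for illustration. `Array(...)` is what lets a node declare @type either as a single string or as an array of strings.

```ruby
INTERESTING_TYPES = %w[Product Article NewsArticle BlogPosting Recipe Event Book Movie].freeze

# Invented sample nodes, in document order.
nodes = [
  { '@type' => 'Organization', 'name' => 'Example Corp' },  # noise node, skipped
  { '@type' => %w[Thing Product], 'name' => 'Widget' },     # array-valued @type
  { '@type' => 'Article', 'headline' => 'Later match' }     # never reached
]

picked = nodes.find do |n|
  Array(n['@type']).any? { |t| INTERESTING_TYPES.include?(t) }
end

puts picked['name']
```

The first matching node in document order wins, regardless of where its type sits in INTERESTING_TYPES.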
.jsonld_to_markdown(node) ⇒ String
Render a single JSON-LD node as Markdown: a top-level title from name/headline, a bullet list of common useful fields (brand, SKU, price, rating, author, published date, …), the body copy, and the lead image.
When the node carries articleBody (the full publisher-supplied article text), that wins over description — the description is typically a lede teaser and would just repeat the article’s opening lines.
# File 'lib/pikuri/extractor/html.rb', line 172
def self.jsonld_to_markdown(node)
  out = +''
  name = node['name'] || node['headline']
  out << "# #{name}\n\n" if name

  offer = first_obj(node['offers'])
  rating = first_obj(node['aggregateRating'])
  brand = first_obj_or_string(node['brand'])
  author = first_obj_or_string(node['author'])
  brand_name = brand.is_a?(Hash) ? brand['name'] : brand
  author_name = author.is_a?(Hash) ? author['name'] : author

  fields = {
    'Brand' => brand_name,
    'SKU' => node['sku'],
    'GTIN' => node['gtin13'] || node['gtin'],
    'Price' => [offer['price'], offer['priceCurrency']].compact.join(' '),
    'Availability' => offer['availability'],
    'Rating' => rating['ratingValue'],
    'Reviews' => rating['reviewCount'],
    'Author' => author_name,
    'Published' => node['datePublished']
  }.reject { |_, v| v.nil? || v.to_s.strip.empty? }

  unless fields.empty?
    fields.each { |k, v| out << "- **#{k}:** #{v}\n" }
    out << "\n"
  end

  if (body = node['articleBody'] || node['description'])
    out << "#{body}\n\n"
  end

  if (img = node['image'])
    img = img.first if img.is_a?(Array)
    img = img['url'] if img.is_a?(Hash)
    out << "![](#{img})\n\n" if img
  end

  out
end
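A reduced version of the renderer shows the output shape. This is a sketch under assumptions: `render_node` is a hypothetical stand-in that handles only three fields and skips the `first_obj` / `first_obj_or_string` helpers, whose definitions are not shown on this page.

```ruby
# Hypothetical, reduced stand-in for .jsonld_to_markdown.
def render_node(node)
  out = +''
  title = node['name'] || node['headline']
  out << "# #{title}\n\n" if title

  author = node['author']
  fields = {
    'SKU' => node['sku'],
    'Author' => author.is_a?(Hash) ? author['name'] : author,
    'Published' => node['datePublished']
  }.reject { |_, v| v.nil? || v.to_s.strip.empty? } # absent fields produce no bullet

  fields.each { |k, v| out << "- **#{k}:** #{v}\n" }
  out << "\n" unless fields.empty?

  # articleBody takes precedence over the (usually shorter) description.
  body = node['articleBody'] || node['description']
  out << "#{body}\n\n" if body
  out
end

markdown = render_node(
  'headline' => 'Hello',
  'author' => { 'name' => 'Ada' },
  'datePublished' => '2024-01-01',
  'description' => 'Teaser',
  'articleBody' => 'Full text'
)
puts markdown
```

The missing SKU leaves no empty bullet behind, and the description is dropped because articleBody is present.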
.kind ⇒ Symbol
Returns Page#kind tag.
# File 'lib/pikuri/extractor/html.rb', line 83
def self.kind
  :html
end
.matches?(sample:, content_type:) ⇒ Boolean
# File 'lib/pikuri/extractor/html.rb', line 91
def self.matches?(sample:, content_type:)
  CONTENT_TYPES.include?(content_type)
end
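Because the check is a plain membership test, it is an exact string comparison. The sketch below copies the constant and method body into a standalone script to show what that implies; whether callers strip parameters such as charset before calling is not stated here and is an open assumption.

```ruby
CONTENT_TYPES = %w[text/html application/xhtml+xml].freeze

# Standalone copy of the predicate; `sample` is accepted but unused,
# matching the "no byte sniff" design described in the overview.
def matches?(sample:, content_type:)
  CONTENT_TYPES.include?(content_type)
end

plain      = matches?(sample: '', content_type: 'text/html')
xhtml      = matches?(sample: '', content_type: 'application/xhtml+xml')
with_param = matches?(sample: '', content_type: 'text/html; charset=utf-8')
missing    = matches?(sample: '<html>', content_type: nil)

p [plain, xhtml, with_param, missing]
```

A header value that still carries `; charset=...` does not match in this sketch, so the bare media type presumably has to be isolated upstream.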
.parse_jsonld(html) ⇒ Array<Hash>
Collect every JSON-LD payload embedded in html, flattening @graph wrappers so callers see one flat array of schema.org nodes. Malformed JSON blocks are silently skipped — sites frequently ship broken JSON-LD and we only need at least one parseable block.
# File 'lib/pikuri/extractor/html.rb', line 142
def self.parse_jsonld(html)
  doc = Nokogiri::HTML(html)
  blobs = doc.css('script[type="application/ld+json"]').map(&:text)
  blobs.flat_map do |raw|
    parsed = begin
      JSON.parse(raw)
    rescue JSON::ParserError
      nil
    end
    next [] unless parsed

    nodes = parsed.is_a?(Array) ? parsed : [parsed]
    nodes.flat_map { |n| n['@graph'].is_a?(Array) ? n['@graph'] : [n] }
  end
end
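The flattening logic does not depend on Nokogiri, so it can be exercised directly on raw script bodies. In this sketch `flatten_jsonld` is a hypothetical name for the block passed to `flat_map` above, and the three blobs are invented samples: a @graph wrapper, a broken payload, and a top-level array.

```ruby
require 'json'

# Hypothetical extraction of the flat_map body from .parse_jsonld,
# operating on raw JSON strings instead of <script> nodes.
def flatten_jsonld(blobs)
  blobs.flat_map do |raw|
    parsed = begin
      JSON.parse(raw)
    rescue JSON::ParserError
      nil # malformed block: skip silently
    end
    next [] unless parsed

    nodes = parsed.is_a?(Array) ? parsed : [parsed]
    nodes.flat_map { |n| n['@graph'].is_a?(Array) ? n['@graph'] : [n] }
  end
end

blobs = [
  '{"@context":"https://schema.org","@graph":[{"@type":"WebSite"},{"@type":"Article","headline":"Hi"}]}',
  '{broken',
  '[{"@type":"Product","name":"Widget"}]'
]

nodes = flatten_jsonld(blobs)
p nodes.map { |n| n['@type'] }
```

The @graph wrapper itself disappears from the result; only its member nodes remain, alongside the nodes from the top-level array.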
.readability_to_markdown(html) ⇒ String
Run Readability over html to isolate the main content node, then convert that to Markdown via reverse_markdown. The page <title> is rendered as a top-level heading.
When the page uses semantic HTML5 (<main> or <article>) but leaves most of its content outside <p> tags (divs, lists, spans), Readability’s paragraph-density scoring collapses the extraction to a sliver of the page. In that case we render the <main>/<article> container directly. The fallback only fires when the container holds substantially more text than Readability picked up (see MAIN_FALLBACK_RATIO / MAIN_FALLBACK_MIN_CHARS); on pages where both agree we keep Readability so its noise filtering still strips nav/ads/etc.
# File 'lib/pikuri/extractor/html.rb', line 231
def self.readability_to_markdown(html)
  rdoc = Readability::Document.new(
    html,
    tags: READABILITY_TAGS,
    attributes: READABILITY_ATTRS,
    remove_empty_nodes: true
  )
  readability_html = rdoc.content
  title = rdoc.title
  body_html = main_fallback_html(html, readability_html) || readability_html
  body = ReverseMarkdown.convert(body_html, unknown_tags: :bypass, github_flavored: true)

  out = +''
  out << "# #{title.strip}\n\n" if title && !title.strip.empty?
  out << body
  out
end
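The source of `main_fallback_html` is not shown on this page, so the following is only a guess at the decision it makes, based on the two constants' descriptions. `use_main_fallback?` is a hypothetical predicate; the exact comparison operators and the way text length is measured are assumptions.

```ruby
MAIN_FALLBACK_RATIO = 2.0
MAIN_FALLBACK_MIN_CHARS = 500

# Hypothetical predicate: should the <main>/<article> container replace
# Readability's output? Inputs are plain-text lengths in characters.
def use_main_fallback?(container_len, readability_len)
  # Tiny containers: the ratio is dominated by noise, keep Readability.
  return false if container_len < MAIN_FALLBACK_MIN_CHARS

  # Assumed comparison: container must hold at least RATIO times the text.
  container_len >= readability_len * MAIN_FALLBACK_RATIO
end

collapsed  = use_main_fallback?(5000, 1000) # ~5x, the failure mode described above
comparable = use_main_fallback?(1200, 1000) # both agree, keep Readability
tiny       = use_main_fallback?(300, 50)    # below the minimum length

p [collapsed, comparable, tiny]
```

Under these assumptions only the first case switches to the container, which matches the intent of catching collapsed extractions while leaving comparable and tiny pages on the Readability path.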