Module: Pikuri::Extractor::HTML

Defined in:
lib/pikuri/extractor/html.rb

Overview

HTML → Markdown extractor.

Matched by content-type only (text/html / application/xhtml+xml), deliberately without a byte sniff. The web path always has the header; for local files, a sniff would route Workspace::Read of an .html source file through readability extraction, when a developer reading an HTML file wants the source. Local HTML stays on the Passthrough arm until a consumer genuinely needs otherwise.

Always renders both views of the page when available:

  1. JSON-LD section. Any <script type="application/ld+json"> node whose @type matches a substantive schema.org content type (Product, Article, Recipe, …) is rendered as a header: title, metadata bullets (brand, SKU, price, rating, author, published), and the articleBody/description copy when present.

  2. Readability section. The page is run through Readability + reverse_markdown, with a <main>/<article> fallback for pages whose content sits mostly outside <p> tags.

The two sections are concatenated with a horizontal rule, so the LLM gets both the structured metadata and the rendered body and can pick whichever is more useful for the task. This trades some duplication (when a publisher embeds the article body in JSON-LD AND in HTML) for fewer type-based heuristics on which branch should win: the earlier “is this Article’s description a teaser or the real body?” carve-out is no longer needed because both end up in the output regardless.

Constant Summary

CONTENT_TYPES =

Returns content-types this extractor claims.

Returns:

  • (Array<String>)

%w[text/html application/xhtml+xml].freeze
INTERESTING_TYPES =

Returns schema.org @type values that we treat as “the primary entity of this page” when picking a JSON-LD node to render. Order within this list does not matter; the first matching node on the page wins. Skips noise nodes (Organization, BreadcrumbList, WebSite, …) that ship on most pages but carry no page content.

Returns:

  • (Array<String>)

%w[
  Product Article NewsArticle BlogPosting Recipe Event Book Movie
].freeze
READABILITY_TAGS =

Returns HTML tags preserved by the readability pass. Anything outside this list is stripped before Markdown conversion.

Returns:

  • (Array<String>)

%w[
  h1 h2 h3 h4 h5 h6 p div span ul ol li blockquote pre code a img
  strong em b i br hr table thead tbody tr td th
].freeze
READABILITY_ATTRS =

Returns HTML attributes preserved by the readability pass; everything else (class, id, style, data-*) is dropped before Markdown conversion.

Returns:

  • (Array<String>)

%w[href src alt title].freeze
MAIN_FALLBACK_RATIO =

Returns minimum <main>/<article> to Readability text-length ratio that triggers the semantic-container fallback in readability_to_markdown. Picked low enough to catch the failure mode (Readability collapsing a page that uses divs/lists instead of <p> — e.g. vaadin.com/company, ~5x) but high enough that pages where both produce comparable output keep Readability’s noise filtering.

Returns:

  • (Float)

2.0
MAIN_FALLBACK_MIN_CHARS =

Returns minimum text length the <main>/<article> container must hold before the fallback in readability_to_markdown can fire. Below this, the ratio comparison is dominated by noise and we’d swap on tiny pages where Readability is doing the right thing.

Returns:

  • (Integer)

500

Class Method Details

.extract(io) ⇒ String

Render the HTML document behind io as Markdown by emitting both the JSON-LD section (when an interesting node is present) and the readability / <main> section, joined by a horizontal rule. Either section may be missing — pages with no JSON-LD return only the readability output, and a malformed page with no extractable body returns only the JSON-LD render.

Parameters:

  • io (IO, StringIO)

    IO over the HTML document.

Returns:

  • (String)

    Markdown representation



# File 'lib/pikuri/extractor/html.rb', line 104

def self.extract(io)
  html = io.read
  sections = [jsonld_section(html), readability_to_markdown(html)]
  sections.reject! { |s| s.nil? || s.strip.empty? }
  sections.join("\n\n---\n\n")
end
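
The joining step can be tried in isolation. The following is a minimal sketch, assuming a hypothetical stand-in join_sections that takes the two rendered sections as plain strings (the real method builds them from the HTML); it shows that a missing or blank section drops out without leaving a stray rule.

```ruby
# Stand-in for the joining step of .extract: drop nil/blank sections, then
# separate whatever remains with a Markdown horizontal rule.
def join_sections(*sections)
  sections
    .reject { |s| s.nil? || s.strip.empty? }
    .join("\n\n---\n\n")
end

# Both sections present: JSON-LD header first, readability body second.
both = join_sections("# Widget\n\n- **SKU:** W-1", "# Widget page\n\nBody text.")

# No JSON-LD on the page: only the readability output survives, no rule.
only_body = join_sections(nil, "# Plain page\n\nBody text.")

# Neither section usable: the result is an empty string.
nothing = join_sections(nil, "  \n")

puts both
```

Under this reading, a caller that needs to tell "no content" apart from "some content" could check the result for emptiness.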

.jsonld_section(html) ⇒ String?

Pick the first JSON-LD node whose @type matches one of INTERESTING_TYPES and render it as Markdown. Returns nil when no such node exists, in which case extract emits only the readability section.

No content-field gating: a node carrying just name/author/datePublished still renders (as a metadata-only header), because the readability pass independently produces the page body. That is the trade-off that lets us drop the type-based “is this teaser or article copy?” heuristics: duplication is acceptable when both views are available, and the LLM can pick whichever it needs.

Parameters:

  • html (String)

    HTML document body

Returns:

  • (String, nil)

    Markdown render of the picked JSON-LD node, or nil when nothing matched



# File 'lib/pikuri/extractor/html.rb', line 127

def self.jsonld_section(html)
  node = parse_jsonld(html).find do |n|
    Array(n['@type']).any? { |t| INTERESTING_TYPES.include?(t) }
  end
  node ? jsonld_to_markdown(node) : nil
end
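
To illustrate the selection rule, here is a small sketch over hand-written nodes. The node hashes are hypothetical sample data standing in for what parse_jsonld would return; the constant is copied from the summary above.

```ruby
INTERESTING_TYPES = %w[
  Product Article NewsArticle BlogPosting Recipe Event Book Movie
].freeze

# Hypothetical input: nodes in the order they appear on the page.
nodes = [
  { '@type' => 'Organization', 'name' => 'Example Corp' },
  { '@type' => %w[Thing Product], 'name' => 'Example Widget' },
  { '@type' => 'Article', 'headline' => 'Launch post' }
]

# Array() lets a String @type and an Array @type go through the same check;
# a node without @type yields an empty array and never matches.
picked = nodes.find do |n|
  Array(n['@type']).any? { |t| INTERESTING_TYPES.include?(t) }
end

puts picked['name'] # prints "Example Widget"
```

The Organization node is skipped as noise, and the Product node wins over the later Article only because it comes first on the page.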

.jsonld_to_markdown(node) ⇒ String

Render a single JSON-LD node as Markdown: a top-level title from name/headline, a bullet list of common useful fields (brand, SKU, price, rating, author, published date, …), the body copy, and the lead image.

When the node carries articleBody (the full publisher-supplied article text), that wins over description — the description is typically a lede teaser and would just repeat the article’s opening lines.

Parameters:

  • node (Hash)

    JSON-LD node to render

Returns:

  • (String)

    Markdown representation



# File 'lib/pikuri/extractor/html.rb', line 172

def self.jsonld_to_markdown(node)
  out = +''
  name = node['name'] || node['headline']
  out << "# #{name}\n\n" if name

  offer  = first_obj(node['offers'])
  rating = first_obj(node['aggregateRating'])
  brand  = first_obj_or_string(node['brand'])
  author = first_obj_or_string(node['author'])

  brand_name  = brand.is_a?(Hash)  ? brand['name']  : brand
  author_name = author.is_a?(Hash) ? author['name'] : author

  fields = {
    'Brand'        => brand_name,
    'SKU'          => node['sku'],
    'GTIN'         => node['gtin13'] || node['gtin'],
    'Price'        => [offer['price'], offer['priceCurrency']].compact.join(' '),
    'Availability' => offer['availability'],
    'Rating'       => rating['ratingValue'],
    'Reviews'      => rating['reviewCount'],
    'Author'       => author_name,
    'Published'    => node['datePublished']
  }.reject { |_, v| v.nil? || v.to_s.strip.empty? }

  unless fields.empty?
    fields.each { |k, v| out << "- **#{k}:** #{v}\n" }
    out << "\n"
  end

  if (body = node['articleBody'] || node['description'])
    out << "#{body}\n\n"
  end

  if (img = node['image'])
    img = img.first if img.is_a?(Array)
    img = img['url'] if img.is_a?(Hash)
    out << "![image](#{img})\n\n" if img
  end

  out
end
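
The body and image handling at the end of the method can be exercised without the first_obj helpers, which are not shown on this page. The sketch below copies just that tail into a hypothetical body_and_image function and feeds it a made-up node, to show articleBody taking precedence and an Array-of-Hash image being unwrapped.

```ruby
# Stand-in for the tail of jsonld_to_markdown: body copy, then lead image.
def body_and_image(node)
  out = +''

  # articleBody wins over description when both are present.
  if (body = node['articleBody'] || node['description'])
    out << "#{body}\n\n"
  end

  # image may be a String, a Hash with a url key, or an Array of either.
  if (img = node['image'])
    img = img.first if img.is_a?(Array)
    img = img['url'] if img.is_a?(Hash)
    out << "![image](#{img})\n\n" if img
  end

  out
end

# Hypothetical sample node, not taken from a real page.
node = {
  'description' => 'Short teaser.',
  'articleBody' => 'Full article text.',
  'image' => [{ '@type' => 'ImageObject', 'url' => 'https://example.com/lead.png' }]
}

puts body_and_image(node)
```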

.kind ⇒ Symbol

Returns Page#kind tag.

Returns:

  • (Symbol)

    Page#kind tag



# File 'lib/pikuri/extractor/html.rb', line 83

def self.kind
  :html
end

.matches?(sample:, content_type:) ⇒ Boolean

Parameters:

  • sample (String)

    leading bytes of the content (unused; see the no-sniff rationale in the module doc).

  • content_type (String, nil)

    normalized content-type.

Returns:

  • (Boolean)


# File 'lib/pikuri/extractor/html.rb', line 91

def self.matches?(sample:, content_type:)
  CONTENT_TYPES.include?(content_type)
end
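
Because the matcher is a plain membership test, it relies on the caller passing an already normalized content-type. The sketch below shows one way that could look; normalize_content_type and html_match? are hypothetical names introduced for illustration, and the normalization rule (strip parameters, downcase) is an assumption, not something this page documents.

```ruby
# Stand-in copy of the constant documented above.
CONTENT_TYPES = %w[text/html application/xhtml+xml].freeze

# Hypothetical helper (an assumption, not shown in the source): reduce a raw
# Content-Type header to its bare, lowercased media type.
def normalize_content_type(header)
  return nil if header.nil?

  media_type = header.split(';', 2).first.to_s.strip.downcase
  media_type.empty? ? nil : media_type
end

# Mirrors the documented matcher: the sample bytes are ignored on purpose.
def html_match?(sample:, content_type:)
  CONTENT_TYPES.include?(content_type)
end

# A web response with a charset parameter matches after normalization.
html_match?(sample: '', content_type: normalize_content_type('Text/HTML; charset=utf-8'))

# A local file with HTML-looking bytes but no content-type does not match,
# because there is no byte sniff.
html_match?(sample: '<!DOCTYPE html>', content_type: nil)
```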

.parse_jsonld(html) ⇒ Array<Hash>

Collect every JSON-LD payload embedded in html, flattening @graph wrappers so callers see one flat array of schema.org nodes. Malformed JSON blocks are silently skipped — sites frequently ship broken JSON-LD and we only need at least one parseable block.

Parameters:

  • html (String)

    HTML document body

Returns:

  • (Array<Hash>)

    parsed JSON-LD nodes; possibly empty



# File 'lib/pikuri/extractor/html.rb', line 142

def self.parse_jsonld(html)
  doc = Nokogiri::HTML(html)
  blobs = doc.css('script[type="application/ld+json"]').map(&:text)

  blobs.flat_map do |raw|
    parsed = begin
      JSON.parse(raw)
    rescue JSON::ParserError
      nil
    end
    next [] unless parsed

    nodes = parsed.is_a?(Array) ? parsed : [parsed]
    nodes.flat_map { |n| n['@graph'].is_a?(Array) ? n['@graph'] : [n] }
  end
end
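
The flattening behaviour can be seen without Nokogiri by starting from the script bodies directly. This sketch uses only the json standard library and three hypothetical payloads: an @graph wrapper, a top-level array, and a truncated block that fails to parse.

```ruby
require 'json'

# Hypothetical script bodies, as they might be pulled from <script> tags.
blobs = [
  '{"@graph": [{"@type": "WebSite"}, {"@type": "Article", "headline": "Hi"}]}',
  '[{"@type": "BreadcrumbList"}]',
  '{"@type": "Product", "name": ' # truncated, so JSON.parse raises
]

nodes = blobs.flat_map do |raw|
  parsed = begin
    JSON.parse(raw)
  rescue JSON::ParserError
    nil # broken block: skip it and keep going
  end
  next [] unless parsed

  # Normalize to an array, then unwrap any @graph containers one level.
  top = parsed.is_a?(Array) ? parsed : [parsed]
  top.flat_map { |n| n['@graph'].is_a?(Array) ? n['@graph'] : [n] }
end

puts nodes.map { |n| n['@type'] }.inspect
```

The broken third payload contributes nothing, while the first two yield three nodes in page order.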

.readability_to_markdown(html) ⇒ String

Run Readability over html to isolate the main content node, then convert that to Markdown via reverse_markdown. The page <title> is rendered as a top-level heading.

When the page uses semantic HTML5 (<main> or <article>) but leaves most of its content outside <p> tags (divs, lists, spans), Readability’s paragraph-density scoring collapses the extraction to a sliver of the page. In that case we render the <main>/<article> container directly. The fallback only fires when the container holds substantially more text than Readability picked up (see MAIN_FALLBACK_RATIO / MAIN_FALLBACK_MIN_CHARS); on pages where both agree we keep Readability so its noise filtering still strips nav/ads/etc.

Parameters:

  • html (String)

    HTML document body

Returns:

  • (String)

    Markdown representation



# File 'lib/pikuri/extractor/html.rb', line 231

def self.readability_to_markdown(html)
  rdoc = Readability::Document.new(
    html,
    tags: READABILITY_TAGS,
    attributes: READABILITY_ATTRS,
    remove_empty_nodes: true
  )
  readability_html = rdoc.content
  title = rdoc.title

  body_html = main_fallback_html(html, readability_html) || readability_html
  body = ReverseMarkdown.convert(body_html, unknown_tags: :bypass, github_flavored: true)

  out = +''
  out << "# #{title.strip}\n\n" if title && !title.strip.empty?
  out << body
  out
end
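
The source of main_fallback_html is not shown on this page, so the following is only a sketch of how the two thresholds might combine. prefer_container? is a hypothetical name, and the exact comparison operators are assumptions; the constants are copied from the summary above.

```ruby
MAIN_FALLBACK_RATIO = 2.0
MAIN_FALLBACK_MIN_CHARS = 500

# Hypothetical predicate: decide whether the <main>/<article> text should
# replace the Readability extraction, given both text lengths in characters.
def prefer_container?(container_chars, readability_chars)
  # Tiny containers never trigger the swap, whatever the ratio says.
  return false if container_chars < MAIN_FALLBACK_MIN_CHARS

  container_chars >= readability_chars * MAIN_FALLBACK_RATIO
end

prefer_container?(5_000, 1_000) # container holds 5x the text: fallback fires
prefer_container?(1_200, 1_000) # comparable output: keep Readability
prefer_container?(300, 50)      # ratio is 6x, but the container is too small
```

The minimum-size guard comes first so that a large ratio on a very short page cannot override Readability.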