Module: Pikuri::Tool::Scraper::HTML
- Defined in:
- lib/pikuri/tool/scraper/html.rb
Overview
HTML → Markdown extractor used by Simple.visit when the fetched response carries an HTML content-type.
Always renders both views of the page when available:
- JSON-LD section. Any <script type="application/ld+json"> node whose @type matches a substantive schema.org content type (Product, Article, Recipe, …) is rendered as a header: title, metadata bullets (brand, SKU, price, rating, author, published), and the articleBody/description copy when present.
- Readability section. The page is run through Readability + reverse_markdown, with a <main>/<article> fallback for pages whose content sits mostly outside <p> tags.
The two sections are joined with a horizontal rule, so the LLM gets both the structured metadata and the rendered body and can pick whichever is more useful for the task. This trades some duplication (when a publisher embeds the article body in both JSON-LD and HTML) for fewer type-based heuristics about which branch should win: the earlier “is this Article’s description a teaser or the real body?” carve-out is no longer needed because both end up in the output regardless.
Pure parser — no I/O. HTML.extract takes an HTML string and returns Markdown, so tests can drive it against fixture HTML without a network round-trip.
Constant Summary
- INTERESTING_TYPES =
Returns schema.org @type values that we treat as “the primary entity of this page” when picking a JSON-LD node to render. The order of this list does not matter: the first matching node on the page wins. Skips noise nodes (Organization, BreadcrumbList, WebSite, …) that ship on most pages but carry no page content.
%w[ Product Article NewsArticle BlogPosting Recipe Event Book Movie ].freeze
- READABILITY_TAGS =
Returns HTML tags preserved by the readability pass. Anything outside this list is stripped before Markdown conversion.
%w[ h1 h2 h3 h4 h5 h6 p div span ul ol li blockquote pre code a img strong em b i br hr table thead tbody tr td th ].freeze
- READABILITY_ATTRS =
Returns HTML attributes preserved by the readability pass; everything else (class, id, style, data-*) is dropped before Markdown conversion.
%w[href src alt title].freeze
- MAIN_FALLBACK_RATIO =
Returns minimum <main>/<article> to Readability text-length ratio that triggers the semantic-container fallback in readability_to_markdown. Picked low enough to catch the failure mode (Readability collapsing a page that uses divs/lists instead of <p>, e.g. vaadin.com/company, ~5x) but high enough that pages where both produce comparable output keep Readability’s noise filtering.
2.0
- MAIN_FALLBACK_MIN_CHARS =
Returns minimum text length the <main>/<article> container must hold before the fallback in readability_to_markdown can fire. Below this, the ratio comparison is dominated by noise and we’d swap on tiny pages where Readability is doing the right thing.
500
Class Method Summary
- .extract(html) ⇒ String
Render html as Markdown by emitting both the JSON-LD section (when an interesting node is present) and the readability / <main> section, joined by a horizontal rule.
- .jsonld_section(html) ⇒ String?
Pick the first JSON-LD node whose @type matches one of INTERESTING_TYPES and render it as Markdown.
- .jsonld_to_markdown(node) ⇒ String
Render a single JSON-LD node as Markdown: a top-level title from name/headline, a bullet list of common useful fields (brand, SKU, price, rating, author, published date, …), the body copy, and the lead image.
- .parse_jsonld(html) ⇒ Array<Hash>
Collect every JSON-LD payload embedded in html, flattening @graph wrappers so callers see one flat array of schema.org nodes.
- .readability_to_markdown(html) ⇒ String
Run Readability over html to isolate the main content node, then convert that to Markdown via reverse_markdown.
Class Method Details
.extract(html) ⇒ String
Render html as Markdown by emitting both the JSON-LD section (when an interesting node is present) and the readability / <main> section, joined by a horizontal rule. Either section may be missing — pages with no JSON-LD return only the readability output, and a malformed page with no extractable body returns only the JSON-LD render.
    # File 'lib/pikuri/tool/scraper/html.rb', line 86
    def self.extract(html)
      sections = [jsonld_section(html), readability_to_markdown(html)]
      sections.reject! { |s| s.nil? || s.strip.empty? }
      sections.join("\n\n---\n\n")
    end
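The reject-then-join step can be tried in isolation. The sketch below is a stand-alone illustration rather than the module's own code: `join_sections` is a hypothetical helper that takes the two already-rendered sections as strings, so it runs without Nokogiri or Readability.

```ruby
# Hypothetical stand-alone helper mirroring the reject/join step of extract.
# It receives the two rendered sections directly instead of calling the parsers.
def join_sections(jsonld_md, readability_md)
  sections = [jsonld_md, readability_md]
  # Drop sections that are missing (nil) or contain only whitespace.
  sections = sections.reject { |s| s.nil? || s.strip.empty? }
  # Separate the surviving sections with a Markdown horizontal rule.
  sections.join("\n\n---\n\n")
end

both      = join_sections("# Widget\n\n- **SKU:** W-1", "# Widget page\n\nBody text.")
only_body = join_sections(nil, "# Plain page\n\nBody text.")
nothing   = join_sections(nil, "   ")
```

With a single surviving section no rule is emitted, and with none the result is an empty string, so a caller of this sketch would have to handle the empty case itself.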
.jsonld_section(html) ⇒ String?
Pick the first JSON-LD node whose @type matches one of INTERESTING_TYPES and render it as Markdown. Returns nil when no such node exists, in which case extract emits only the readability section.
No content-field gating: a node carrying just name/author/datePublished still renders (as a metadata-only header), because the readability pass independently produces the page body. That is the trade-off that lets us drop the type-based “is this teaser or article copy?” heuristics: duplication is acceptable when both views are available, and the LLM can pick whichever it needs.
    # File 'lib/pikuri/tool/scraper/html.rb', line 108
    def self.jsonld_section(html)
      node = parse_jsonld(html).find do |n|
        Array(n['@type']).any? { |t| INTERESTING_TYPES.include?(t) }
      end
      node ? jsonld_to_markdown(node) : nil
    end
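A small illustration of the selection rule, using made-up sample nodes shaped like the output of parse_jsonld. The node contents are assumptions for the example; the constant is copied from the summary above.

```ruby
# Copied from the constant documented above.
INTERESTING_TYPES = %w[Product Article NewsArticle BlogPosting Recipe Event Book Movie].freeze

# Hypothetical sample nodes, shaped like the output of parse_jsonld.
nodes = [
  { '@type' => 'Organization', 'name' => 'Example Corp' },   # noise node, skipped
  { '@type' => %w[WebPage Article], 'headline' => 'Hello' }, # @type given as an Array
  { '@type' => 'Product', 'name' => 'Widget' }               # also matches, but comes later
]

# Array() normalises @type, which may be a String or an Array of Strings.
picked = nodes.find do |n|
  Array(n['@type']).any? { |t| INTERESTING_TYPES.include?(t) }
end

puts picked['headline'] # prints "Hello"
```

The second node wins because it is the first match in document order, even though only one of its two types is in the list.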
.jsonld_to_markdown(node) ⇒ String
Render a single JSON-LD node as Markdown: a top-level title from name/headline, a bullet list of common useful fields (brand, SKU, price, rating, author, published date, …), the body copy, and the lead image.
When the node carries articleBody (the full publisher-supplied article text), that wins over description — the description is typically a lede teaser and would just repeat the article’s opening lines.
    # File 'lib/pikuri/tool/scraper/html.rb', line 153
    def self.jsonld_to_markdown(node)
      out = +''
      name = node['name'] || node['headline']
      out << "# #{name}\n\n" if name

      # The local names rating, author and author_name were lost from the
      # extracted listing and are reconstructed here from how the values are used.
      offer = first_obj(node['offers'])
      rating = first_obj(node['aggregateRating'])
      brand = first_obj_or_string(node['brand'])
      author = first_obj_or_string(node['author'])
      brand_name = brand.is_a?(Hash) ? brand['name'] : brand
      author_name = author.is_a?(Hash) ? author['name'] : author

      # Metadata bullets; blank values are dropped before rendering.
      fields = {
        'Brand' => brand_name,
        'SKU' => node['sku'],
        'GTIN' => node['gtin13'] || node['gtin'],
        'Price' => [offer['price'], offer['priceCurrency']].compact.join(' '),
        'Availability' => offer['availability'],
        'Rating' => rating['ratingValue'],
        'Reviews' => rating['reviewCount'],
        'Author' => author_name,
        'Published' => node['datePublished']
      }.reject { |_, v| v.nil? || v.to_s.strip.empty? }

      unless fields.empty?
        fields.each { |k, v| out << "- **#{k}:** #{v}\n" }
        out << "\n"
      end

      # articleBody (full text) wins over description (usually a teaser).
      if (body = node['articleBody'] || node['description'])
        out << "#{body}\n\n"
      end

      if (img = node['image'])
        img = img.first if img.is_a?(Array)
        img = img['url'] if img.is_a?(Hash)
        # Assumed form: the extracted listing kept only "\n\n" here, so the
        # Markdown image syntax is reconstructed.
        out << "![](#{img})\n\n" if img
      end

      out
    end
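The listing calls first_obj and first_obj_or_string, whose definitions are not shown on this page. The sketch below is only a guess at their behaviour. It assumes that first_obj returns an empty Hash when nothing usable is present (the listing indexes offer without a nil check) and that first_obj_or_string also accepts a bare String such as "brand": "Acme".

```ruby
# Hypothetical sketches of the helpers used above; the real definitions
# are not shown on this page and may differ.

# Returns value itself when it is a Hash, the first Hash inside it when it is
# an Array, or an empty Hash so that callers can index it without a nil check.
def first_obj(value)
  candidate = value.is_a?(Array) ? value.find { |v| v.is_a?(Hash) } : value
  candidate.is_a?(Hash) ? candidate : {}
end

# Like first_obj, but also passes through a bare String.
# Returns nil when nothing usable is present.
def first_obj_or_string(value)
  candidate = value.is_a?(Array) ? value.first : value
  candidate if candidate.is_a?(Hash) || candidate.is_a?(String)
end

offer   = first_obj([{ 'price' => '9.99', 'priceCurrency' => 'EUR' }])
price   = [offer['price'], offer['priceCurrency']].compact.join(' ')
missing = first_obj(nil)
brand   = first_obj_or_string('Acme')
```

Under these assumptions a node without offers yields an empty Price string, which the reject step in the listing then removes from the bullet list.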
.parse_jsonld(html) ⇒ Array<Hash>
Collect every JSON-LD payload embedded in html, flattening @graph wrappers so callers see one flat array of schema.org nodes. Malformed JSON blocks are silently skipped — sites frequently ship broken JSON-LD and we only need at least one parseable block.
    # File 'lib/pikuri/tool/scraper/html.rb', line 123
    def self.parse_jsonld(html)
      doc = Nokogiri::HTML(html)
      blobs = doc.css('script[type="application/ld+json"]').map(&:text)
      blobs.flat_map do |raw|
        parsed = begin
          JSON.parse(raw)
        rescue JSON::ParserError
          nil
        end
        next [] unless parsed

        nodes = parsed.is_a?(Array) ? parsed : [parsed]
        nodes.flat_map { |n| n['@graph'].is_a?(Array) ? n['@graph'] : [n] }
      end
    end
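To see the flattening and the skip-on-error behaviour without Nokogiri, the sketch below applies the same block to hand-written script bodies. The three payloads are made up for illustration, and only the standard json library is used.

```ruby
require 'json'

# Hypothetical raw <script> bodies, as the CSS selector would return them.
blobs = [
  '{"@context":"https://schema.org","@graph":[{"@type":"WebSite"},{"@type":"Article","headline":"Hi"}]}',
  '[{"@type":"Product","name":"Widget"}]',
  '{"@type": "Broken",' # malformed on purpose
]

nodes = blobs.flat_map do |raw|
  parsed = begin
    JSON.parse(raw)
  rescue JSON::ParserError
    nil # malformed block: skip it silently
  end
  next [] unless parsed

  # A payload may be a single object or a top-level array of objects.
  list = parsed.is_a?(Array) ? parsed : [parsed]
  # Unwrap @graph containers so every schema.org node sits at the top level.
  list.flat_map { |n| n['@graph'].is_a?(Array) ? n['@graph'] : [n] }
end

types = nodes.map { |n| n['@type'] }
# types is ["WebSite", "Article", "Product"]
```

The @context wrapper of the first payload is discarded along with the @graph container, and the broken third payload contributes nothing.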
.readability_to_markdown(html) ⇒ String
Run Readability over html to isolate the main content node, then convert that to Markdown via reverse_markdown. The page <title> is rendered as a top-level heading.
When the page uses semantic HTML5 (<main> or <article>) but leaves most of its content outside <p> tags (divs, lists, spans), Readability’s paragraph-density scoring collapses the extraction to a sliver of the page. In that case we render the <main>/<article> container directly. The fallback only fires when the container holds substantially more text than Readability picked up (see MAIN_FALLBACK_RATIO / MAIN_FALLBACK_MIN_CHARS); on pages where both agree we keep Readability so its noise filtering still strips nav/ads/etc.
    # File 'lib/pikuri/tool/scraper/html.rb', line 212
    def self.readability_to_markdown(html)
      rdoc = Readability::Document.new(
        html,
        tags: READABILITY_TAGS,
        attributes: READABILITY_ATTRS,
        remove_empty_nodes: true
      )
      readability_html = rdoc.content
      title = rdoc.title

      body_html = main_fallback_html(html, readability_html) || readability_html
      body = ReverseMarkdown.convert(body_html, unknown_tags: :bypass, github_flavored: true)

      out = +''
      out << "# #{title.strip}\n\n" if title && !title.strip.empty?
      out << body
      out
    end
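main_fallback_html is not shown on this page. The sketch below illustrates only the decision it presumably makes, given the two documented constants. The predicate name `use_main_fallback?`, the inclusive comparison, and the handling of an empty Readability result are assumptions.

```ruby
# Values copied from the constants documented above.
MAIN_FALLBACK_RATIO = 2.0
MAIN_FALLBACK_MIN_CHARS = 500

# Hypothetical predicate; the real main_fallback_html is not shown here.
# container_len   - text length of the <main>/<article> container
# readability_len - text length of Readability's extraction
def use_main_fallback?(container_len, readability_len)
  # Tiny containers never trigger the swap, whatever the ratio.
  return false if container_len < MAIN_FALLBACK_MIN_CHARS
  # Assumption: an empty Readability result always loses to a large container.
  return true if readability_len.zero?

  container_len.to_f / readability_len >= MAIN_FALLBACK_RATIO
end

collapsed  = use_main_fallback?(5000, 1000) # ratio 5.0, container large enough
comparable = use_main_fallback?(1200, 1000) # ratio 1.2, keep Readability
tiny       = use_main_fallback?(300, 50)    # ratio 6.0, but under 500 chars
```

Here only `collapsed` is true: the second page stays with Readability because both extractions are of comparable size, and the third is blocked by the minimum-length guard despite its high ratio.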