Class: Html2rss::AutoSource::Scraper::Html::ClassClustering

Inherits:
Object
Defined in:
lib/html2rss/auto_source/scraper/html/class_clustering.rb

Overview

ClassClustering clusters DOM elements on anchorless pages by their class lists, then scores the candidate groups to find the best list of content cards or articles.

Constant Summary

LAYOUT_TAG_NAMES = Set['div', 'section', 'article'].freeze

Node tags considered layout containers.

EXCLUDED_TAGS = Set['html', 'body', 'nav', 'footer', 'header', 'svg', 'script', 'style'].freeze

HTML/layout tags excluded from candidate nodes.
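
As a rough illustration of how two tag sets like these can act as filters, the sketch below checks tag names against them. The helpers `candidate_tag?` and `layout_tag?` are hypothetical names introduced here, not part of the documented API, and the class may apply the constants differently.

```ruby
require 'set'

LAYOUT_TAG_NAMES = Set['div', 'section', 'article'].freeze
EXCLUDED_TAGS = Set['html', 'body', 'nav', 'footer', 'header', 'svg', 'script', 'style'].freeze

# Hypothetical helper: a tag may become a candidate unless it is excluded.
def candidate_tag?(tag_name)
  !EXCLUDED_TAGS.include?(tag_name.to_s.downcase)
end

# Hypothetical helper: a tag counts as a layout container if it is in the layout set.
def layout_tag?(tag_name)
  LAYOUT_TAG_NAMES.include?(tag_name.to_s.downcase)
end

candidate_tag?('li')    # => true
candidate_tag?('nav')   # => false
layout_tag?('section')  # => true
```

Set membership checks run in constant time on average, which matters when every element of a large document is tested.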

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(parsed_body, minimum_selector_frequency:) ⇒ ClassClustering

Returns a new instance of ClassClustering.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed HTML document

  • minimum_selector_frequency (Integer)

    minimum frequency for class groups


# File 'lib/html2rss/auto_source/scraper/html/class_clustering.rb', line 31

def initialize(parsed_body, minimum_selector_frequency:)
  @parsed_body = parsed_body
  @minimum_frequency = minimum_selector_frequency
  @text_words = {}.compare_by_identity
  @has_date = {}.compare_by_identity
end
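
The two hashes created with `compare_by_identity` key their entries on object identity rather than on `eql?`/`hash`. The sketch below shows the effect with a hypothetical `WordCountCache` class and a `Card` struct standing in for DOM nodes. What the real `@text_words` and `@has_date` hashes store is an assumption here; the names suggest per-node memoization.

```ruby
# Struct instances are equal by value, so two cards with the same text
# would collide as keys in an ordinary Hash.
Card = Struct.new(:text)

# Hypothetical cache, assuming per-object memoization of word counts.
class WordCountCache
  def initialize
    @text_words = {}.compare_by_identity
  end

  # Computes the word count once per object, then reuses the stored value.
  def words(card)
    @text_words[card] ||= card.text.split.size
  end

  def size
    @text_words.size
  end
end

cache = WordCountCache.new
first = Card.new('Hello world')
second = Card.new('Hello world') # same value, different object

cache.words(first)
cache.words(first)  # served from the cache
cache.words(second) # cached separately, since identity differs
puts cache.size     # prints 2
```

With an ordinary `{}` the two cards would share a single entry, because `Struct` defines value-based `eql?` and `hash`.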

Class Method Details

.call(parsed_body, minimum_selector_frequency:) ⇒ Array<Nokogiri::XML::Node>

Clusters elements in parsed_body and returns the best set of content card nodes.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed HTML document

  • minimum_selector_frequency (Integer)

    minimum frequency for class groups

Returns:

  • (Array<Nokogiri::XML::Node>)

    candidate nodes of the top-scoring class group



# File 'lib/html2rss/auto_source/scraper/html/class_clustering.rb', line 24

def call(parsed_body, minimum_selector_frequency:)
  new(parsed_body, minimum_selector_frequency:).call
end
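
The class-level `.call` is a thin entry point: it builds an instance and immediately invokes the instance-level `#call`. The sketch below reproduces that shape with a hypothetical `LongWordFinder` class that is unrelated to html2rss. It passes the keyword argument explicitly so that it also runs on Ruby versions older than 3.1, where the shorthand `minimum_selector_frequency:` form used above is unavailable.

```ruby
# Hypothetical class showing the same "class method delegates to instance" shape.
class LongWordFinder
  # Entry point: callers never need to touch .new directly.
  def self.call(text, minimum_length:)
    new(text, minimum_length: minimum_length).call
  end

  def initialize(text, minimum_length:)
    @text = text
    @minimum_length = minimum_length
  end

  # Does the actual work using the state captured by the constructor.
  def call
    @text.split.select { |word| word.length >= @minimum_length }
  end
end

LongWordFinder.call('cluster the DOM by class list', minimum_length: 5)
# => ["cluster", "class"]
```

Keeping state in an instance lets private helper methods share it without threading arguments through every call.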

Instance Method Details

#call ⇒ Array<Nokogiri::XML::Node>

Returns:

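  • (Array<Nokogiri::XML::Node>)

    candidate nodes of the top-scoring class group, or an empty array when no candidate groups are found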


# File 'lib/html2rss/auto_source/scraper/html/class_clustering.rb', line 39

def call
  candidate_groups = collect_candidate_groups
  return [] if candidate_groups.empty?

  non_containers = filter_containers(candidate_groups)
  final_groups = filter_1_to_1_overlap(non_containers)

  select_best_group(final_groups)
end
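
The private helpers called by `#call` are not shown on this page, so the following is only a simplified sketch of a group-then-score pipeline, not the library's implementation. It uses a plain `Element` struct instead of Nokogiri nodes, skips the container and 1-to-1 overlap filters (which need a real DOM tree), and assumes a total-word-count score. The names `group_by_class_list` and `pick_best_group` are hypothetical.

```ruby
require 'set'

# Stand-in for a DOM element; the real class operates on Nokogiri nodes.
Element = Struct.new(:tag, :classes, :text)

SKIPPED_TAGS = Set['html', 'body', 'nav', 'footer', 'header', 'svg', 'script', 'style'].freeze

# Assumed step: group elements by their (order-independent) class list and
# keep only the groups that occur at least minimum_frequency times.
def group_by_class_list(elements, minimum_frequency)
  elements
    .reject { |el| SKIPPED_TAGS.include?(el.tag) || el.classes.empty? }
    .group_by { |el| el.classes.sort }
    .select { |_classes, group| group.size >= minimum_frequency }
end

# Assumed scoring: prefer the group whose members carry the most words overall.
def pick_best_group(groups)
  return [] if groups.empty?

  groups.values.max_by { |group| group.sum { |el| el.text.split.size } }
end

elements = [
  Element.new('div', %w[card teaser], 'First headline with a short summary'),
  Element.new('div', %w[teaser card], 'Second headline with another summary'),
  Element.new('div', %w[card teaser], 'Third headline'),
  Element.new('span', %w[badge], 'New'),
  Element.new('span', %w[badge], 'Hot'),
  Element.new('span', %w[badge], 'Top'),
  Element.new('nav', %w[card teaser], 'Home About Contact')
]

best = pick_best_group(group_by_class_list(elements, 3))
puts best.map(&:text)
```

In this sample the three `card teaser` divs win over the `badge` spans because they contain more text, and the `nav` element is dropped before grouping.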