Class: Html2rss::AutoSource::Scraper::Html::ClassClustering
- Inherits: Object
  - Object
  - Html2rss::AutoSource::Scraper::Html::ClassClustering
- Defined in: lib/html2rss/auto_source/scraper/html/class_clustering.rb
Overview
ClassClustering clusters DOM elements on anchorless pages by their class lists, then scores the candidate groups to find the best list of content cards or articles.
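As a rough illustration of the idea described above, the following sketch groups elements by their sorted class list and keeps only groups that occur often enough. It is not the library's implementation: the `Element` struct, `cluster_by_class_list`, `best_group`, and the word-count score are hypothetical stand-ins, and plain structs replace Nokogiri nodes so the snippet runs without dependencies.

```ruby
require 'set'

# Hypothetical stand-in for a parsed DOM node (the real class works on Nokogiri nodes).
Element = Struct.new(:tag, :classes, :text)

# Same value as the EXCLUDED_TAGS constant documented on this page.
EXCLUDED = Set['html', 'body', 'nav', 'footer', 'header', 'svg', 'script', 'style'].freeze

# Groups elements by tag plus sorted class list, dropping excluded tags,
# class-less elements, and groups smaller than minimum_frequency.
def cluster_by_class_list(elements, minimum_frequency:)
  elements
    .reject { |el| EXCLUDED.include?(el.tag) || el.classes.empty? }
    .group_by { |el| [el.tag, el.classes.sort] }
    .select { |_key, group| group.size >= minimum_frequency }
end

# Assumed scoring heuristic: total number of words across the group.
def best_group(groups)
  groups.values.max_by { |group| group.sum { |el| el.text.split.size } } || []
end

elements = [
  Element.new('div', %w[card post], 'First article with a longer teaser text'),
  Element.new('div', %w[post card], 'Second article with a longer teaser text'),
  Element.new('span', %w[tag], 'news'),
  Element.new('span', %w[tag], 'tech'),
  Element.new('nav', %w[menu], 'Home About'),
  Element.new('nav', %w[menu], 'Home About')
]

groups = cluster_by_class_list(elements, minimum_frequency: 2)
best = best_group(groups)
```

Sorting the class list makes `card post` and `post card` land in the same group. The `nav` elements never become candidates because their tag is excluded.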
Constant Summary
- LAYOUT_TAG_NAMES = Set['div', 'section', 'article'].freeze
  Node tags considered layout containers.
- EXCLUDED_TAGS = Set['html', 'body', 'nav', 'footer', 'header', 'svg', 'script', 'style'].freeze
  HTML/layout tags excluded from candidate nodes.
Class Method Summary
- .call(parsed_body, minimum_selector_frequency:) ⇒ Array<Nokogiri::XML::Node>
  Clusters elements in parsed_body and returns the best set of content card nodes.
Instance Method Summary
- #call ⇒ Array<Nokogiri::XML::Node>
- #initialize(parsed_body, minimum_selector_frequency:) ⇒ ClassClustering (constructor)
  A new instance of ClassClustering.
Constructor Details
#initialize(parsed_body, minimum_selector_frequency:) ⇒ ClassClustering
Returns a new instance of ClassClustering.
    # File 'lib/html2rss/auto_source/scraper/html/class_clustering.rb', line 31
    def initialize(parsed_body, minimum_selector_frequency:)
      @parsed_body = parsed_body
      @minimum_frequency = minimum_selector_frequency
      @text_words = {}.compare_by_identity
      @has_date = {}.compare_by_identity
    end
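The two hashes created in the constructor use `Hash#compare_by_identity`, so keys are matched by object identity rather than by `eql?`/`hash`. The sketch below shows why that matters for per-node lookups. The `FakeNode` struct and the `word_count` helper are hypothetical, and the sketch assumes these hashes serve as per-node caches, which the source does not state explicitly.

```ruby
# Hypothetical node type: two structs with equal fields are equal by value.
FakeNode = Struct.new(:name, :text)

a = FakeNode.new('div', 'same text')
b = FakeNode.new('div', 'same text')

default_hash = {}
identity_hash = {}.compare_by_identity

default_hash[a] = :first
default_hash[b] = :second   # overwrites the entry: a and b are equal by value

identity_hash[a] = :first
identity_hash[b] = :second  # separate entry: a and b are distinct objects

# Hypothetical memoizing helper keyed by node identity.
def word_count(cache, node)
  cache[node] ||= node.text.split.size
end

cache = {}.compare_by_identity
word_count(cache, a)
word_count(cache, b)
```

With identity comparison, two distinct nodes that happen to look alike keep separate entries, and lookups do not need to compute a value-based hash of the node.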
Class Method Details
.call(parsed_body, minimum_selector_frequency:) ⇒ Array<Nokogiri::XML::Node>
Clusters elements in parsed_body and returns the best set of content card nodes.
    # File 'lib/html2rss/auto_source/scraper/html/class_clustering.rb', line 24
    def call(parsed_body, minimum_selector_frequency:)
      new(parsed_body, minimum_selector_frequency:).call
    end
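The class-level `.call` builds an instance and delegates to `#call`, using the keyword shorthand available since Ruby 3.1 (`minimum_selector_frequency:` with no value passes the local variable of the same name). The sketch below reproduces that pattern on a hypothetical `WordCounter` class that is not part of html2rss.

```ruby
# Hypothetical service object mirroring the .call -> new(...).call pattern.
class WordCounter
  def self.call(text, minimum_length:)
    # Ruby 3.1+ shorthand: equivalent to new(text, minimum_length: minimum_length)
    new(text, minimum_length:).call
  end

  def initialize(text, minimum_length:)
    @text = text
    @minimum_length = minimum_length
  end

  # Counts words with at least @minimum_length characters.
  def call
    @text.split.count { |word| word.length >= @minimum_length }
  end
end

count = WordCounter.call('a list of content cards', minimum_length: 4)
```

Callers get a one-line entry point, while the instance keeps per-call state in instance variables.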
Instance Method Details
#call ⇒ Array<Nokogiri::XML::Node>
    # File 'lib/html2rss/auto_source/scraper/html/class_clustering.rb', line 39
    def call
      candidate_groups = collect_candidate_groups
      return [] if candidate_groups.empty?

      non_containers = filter_containers(candidate_groups)
      final_groups = filter_1_to_1_overlap(non_containers)
      select_best_group(final_groups)
    end