Class: Html2rss::CategoryExtractor
- Inherits:
- Object
- Object
- Html2rss::CategoryExtractor
- Defined in:
- lib/html2rss/category_extractor.rb
Overview
CategoryExtractor extracts categories from HTML elements whose CSS class names contain common category-related terms, or which carry a matching attribute such as data-category.
Constant Summary
- CATEGORY_TERMS =
Common category-related terms to look for in class names
%w[category tag topic section label theme subject].freeze
- CATEGORY_SELECTORS =
CSS selectors to find elements with category-related class names or data attributes
CATEGORY_TERMS.flat_map do |term| ["[class*=\"#{term}\"]", "[data-#{term}]", "[#{term}]"] end.freeze
- CATEGORY_ATTR_PATTERN =
Regex pattern for matching category-related attribute names
/#{CATEGORY_TERMS.join('|')}/i
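To make the three constants concrete, the sketch below repeats their definitions as listed above and prints what they evaluate to. The sample strings passed to `match?` are made up for illustration.

```ruby
# Definitions copied from the constant summary above.
CATEGORY_TERMS = %w[category tag topic section label theme subject].freeze

CATEGORY_SELECTORS = CATEGORY_TERMS.flat_map do |term|
  ["[class*=\"#{term}\"]", "[data-#{term}]", "[#{term}]"]
end.freeze

CATEGORY_ATTR_PATTERN = /#{CATEGORY_TERMS.join('|')}/i

# Three selectors are generated per term: class substring, data attribute, bare attribute.
p CATEGORY_SELECTORS.first(3) # the three selectors generated for "category"
p CATEGORY_SELECTORS.size     # 7 terms * 3 selectors each

# The pattern is case-insensitive and unanchored, so it matches substrings.
p CATEGORY_ATTR_PATTERN.match?('data-Topic') # true
p CATEGORY_ATTR_PATTERN.match?('stage')      # true, because "stage" contains "tag"
p CATEGORY_ATTR_PATTERN.match?('href')       # false
```

Because the pattern is unanchored, any name that merely contains one of the terms also matches, which is worth keeping in mind when reading the extraction methods.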
Class Method Summary
- .call(article_tag) ⇒ Array<String>
Extracts categories from the given article tag by looking for elements with class names containing common category-related terms.
- .extract_all_categories(article_tag) ⇒ Set<String>
Optimized single DOM traversal that extracts all category types.
- .extract_element_data_categories!(categories, element) ⇒ void
Extracts categories from data attributes of a single element.
- .extract_text_categories!(categories, element) ⇒ void
Extracts text-based categories from elements, splitting content into discrete values.
Class Method Details
.call(article_tag) ⇒ Array<String>
Extracts categories from the given article tag by looking for elements with class names containing common category-related terms.
    # File 'lib/html2rss/category_extractor.rb', line 25
    def self.call(article_tag)
      return [] unless article_tag

      # Single optimized traversal that extracts all category types
      extract_all_categories(article_tag)
        .map(&:strip)
        .reject(&:empty?)
    end
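The clean-up step that .call applies to the extracted set can be tried in isolation. In this minimal sketch, `raw` is a hand-written stand-in for the Set that extract_all_categories would return, and its values are invented for illustration.

```ruby
require 'set'

# Stand-in for the Set returned by extract_all_categories (made-up values).
raw = Set['  News ', 'Tech', '   ']

# Same chain as in .call: trim whitespace, then drop blank entries.
cleaned = raw.map(&:strip).reject(&:empty?)

p cleaned       # ["News", "Tech"]
p cleaned.class # Array, since Set#map returns an Array
```

This is why .call is documented as returning Array<String> while extract_all_categories returns Set<String>.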
.extract_all_categories(article_tag) ⇒ Set<String>
Optimized single DOM traversal that extracts all category types.
    # File 'lib/html2rss/category_extractor.rb', line 39
    def self.extract_all_categories(article_tag)
      Set.new.tap do |categories|
        article_tag.css(CATEGORY_SELECTORS.join(',')).each do |element|
          # Extract text categories from elements with category-related class names
          extract_text_categories!(categories, element) if element['class']&.match?(CATEGORY_ATTR_PATTERN)

          # Extract data categories from all elements
          extract_element_data_categories!(categories, element)
        end
      end
    end
.extract_element_data_categories!(categories, element) ⇒ void
This method returns an undefined value.
Extracts categories from data attributes of a single element.
    # File 'lib/html2rss/category_extractor.rb', line 57
    def self.extract_element_data_categories!(categories, element)
      element.attributes.each_value do |attr|
        next unless attr.name.match?(CATEGORY_ATTR_PATTERN)

        value = attr.value&.strip
        categories.add(value) if value && !value.empty?
      end
    end
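The attribute filter can be exercised without a real DOM. The sketch below is an approximation under stated assumptions: `FakeAttr`, `FakeElement`, `ATTR_PATTERN`, and `collect_attribute_categories!` are hypothetical names introduced here, standing in for the parsed attribute objects, the element, CATEGORY_ATTR_PATTERN, and the method above.

```ruby
require 'set'

# Hypothetical stand-ins for a parsed attribute and element (not library classes).
FakeAttr = Struct.new(:name, :value)
FakeElement = Struct.new(:attributes)

# Written-out equivalent of CATEGORY_ATTR_PATTERN.
ATTR_PATTERN = /category|tag|topic|section|label|theme|subject/i

# Same logic as extract_element_data_categories!, as a free-standing method.
def collect_attribute_categories!(categories, element)
  element.attributes.each_value do |attr|
    next unless attr.name.match?(ATTR_PATTERN)

    value = attr.value&.strip
    categories.add(value) if value && !value.empty?
  end
end

attrs = {
  'class' => FakeAttr.new('class', 'post'),                      # name has no term: skipped
  'data-category' => FakeAttr.new('data-category', ' Science '), # kept, stripped
  'data-topic' => FakeAttr.new('data-topic', ''),                # empty value: skipped
  'aria-label' => FakeAttr.new('aria-label', 'Read more')        # name contains "label": kept
}

categories = Set.new
collect_attribute_categories!(categories, FakeElement.new(attrs))
p categories.to_a # ["Science", "Read more"]
```

The filter tests the attribute name against the pattern, so it is not limited to data-* attributes: any attribute whose name contains a term contributes its value.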
.extract_text_categories!(categories, element) ⇒ void
This method returns an undefined value.
Extracts text-based categories from elements, splitting content into discrete values.
    # File 'lib/html2rss/category_extractor.rb', line 72
    def self.extract_text_categories!(categories, element)
      if element.name == 'a'
        add_text_to_categories!(categories, element)
        return
      end

      anchors = element.css('a')
      if anchors.any?
        anchors.each { |node| add_text_to_categories!(categories, node) }
      else
        extract_split_text_categories!(categories, element)
      end
    end
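The fallback branch hands elements without anchors to extract_split_text_categories!, whose body is not shown on this page. The sketch below only illustrates the general idea of splitting text into discrete values. The helper name `split_category_text` and the delimiter set (comma, semicolon, pipe, newline) are assumptions, not the library's implementation.

```ruby
require 'set'

# Hypothetical helper: the delimiters chosen here are an assumption for illustration.
def split_category_text(text)
  text.split(/[,;|\n]+/).map(&:strip).reject(&:empty?)
end

categories = Set.new
split_category_text("News, Tech | Science\n").each { |value| categories.add(value) }

p categories.to_a # ["News", "Tech", "Science"]
```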