Class: Html2rss::CategoryExtractor

Inherits:
Object
Defined in:
lib/html2rss/category_extractor.rb

Overview

CategoryExtractor extracts categories from HTML elements whose class names or attribute names (including data attributes) contain common category-related terms.

Constant Summary

CATEGORY_TERMS =

Common category-related terms to look for in class names

%w[category tag topic section label theme subject].freeze
CATEGORY_SELECTORS =

CSS selectors to find elements with category-related class names or data attributes

CATEGORY_TERMS.flat_map do |term|
  ["[class*=\"#{term}\"]", "[data-#{term}]", "[#{term}]"]
end.freeze
CATEGORY_ATTR_PATTERN =

Regex pattern for matching category-related attribute names

/#{CATEGORY_TERMS.join('|')}/i
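
The three constants build on each other, so it can help to see what they expand to. The following is a minimal sketch that rebuilds them under local names (TERMS, SELECTORS and PATTERN are illustrative names, not part of html2rss) so that it runs without the gem. Note that the pattern is unanchored, so it matches any name that merely contains one of the terms.

```ruby
# Local copies of the documented constants; the names are illustrative only.
TERMS = %w[category tag topic section label theme subject].freeze

# Each term yields three selectors: class substring, data attribute, bare attribute.
SELECTORS = TERMS.flat_map do |term|
  ["[class*=\"#{term}\"]", "[data-#{term}]", "[#{term}]"]
end.freeze

# Case-insensitive alternation over all terms.
PATTERN = /#{TERMS.join('|')}/i

puts SELECTORS.first(3)             # the three selectors generated for "category"
puts SELECTORS.length               # 7 terms * 3 selectors each = 21

puts PATTERN.match?('data-topic')   # true
puts PATTERN.match?('Data-Tag')     # true, matching is case-insensitive
puts PATTERN.match?('href')         # false
puts PATTERN.match?('stage')        # true, because "stage" contains "tag"
```

Because the match is a substring match, unrelated names such as "stage" also qualify; callers that need stricter matching would have to filter the results themselves.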

Class Method Details

.call(article_tag) ⇒ Array<String>

Extracts categories from the given article tag by looking for elements whose class names or attribute names contain common category-related terms.

Parameters:

  • article_tag (Nokogiri::XML::Element)

    The article element to extract categories from

Returns:

  • (Array<String>)

    Array of category strings, empty if none found



# File 'lib/html2rss/category_extractor.rb', line 25

def self.call(article_tag)
  return [] unless article_tag

  # Single optimized traversal that extracts all category types
  extract_all_categories(article_tag)
    .map(&:strip)
    .reject(&:empty?)
end
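
The post-processing in .call can be tried in isolation. This is a minimal sketch, assuming a hand-built Set in place of the result of extract_all_categories; clean_categories is a hypothetical helper name used only for illustration.

```ruby
require 'set'

# Hypothetical helper mirroring the tail of .call: strip each value,
# then drop anything that is empty after stripping.
def clean_categories(raw)
  return [] unless raw # mirrors the nil guard at the top of .call

  raw.map(&:strip).reject(&:empty?)
end

# Stand-in for the Set returned by extract_all_categories.
raw = Set.new(['  News ', 'Tech', '   '])

p clean_categories(raw) # => ["News", "Tech"]
p clean_categories(nil) # => []
```

Set#map returns an Array, which is consistent with the documented Array<String> return type of .call.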

.extract_all_categories(article_tag) ⇒ Set<String>

Optimized single DOM traversal that extracts all category types.

Parameters:

  • article_tag (Nokogiri::XML::Element)

    The article element

Returns:

  • (Set<String>)

    Set of category strings



# File 'lib/html2rss/category_extractor.rb', line 39

def self.extract_all_categories(article_tag)
  Set.new.tap do |categories|
    article_tag.css(CATEGORY_SELECTORS.join(',')).each do |element|
      # Extract text categories from elements with category-related class names
      extract_text_categories!(categories, element) if element['class']&.match?(CATEGORY_ATTR_PATTERN)

      # Extract data categories from all elements
      extract_element_data_categories!(categories, element)
    end
  end
end

.extract_element_data_categories!(categories, element) ⇒ void

This method returns an undefined value.

Extracts categories from data attributes of a single element.

Parameters:

  • categories (Set<String>)

    Accumulator set

  • element (Nokogiri::XML::Element)

    Element whose attributes may contain category values



# File 'lib/html2rss/category_extractor.rb', line 57

def self.extract_element_data_categories!(categories, element)
  element.attributes.each_value do |attr|
    next unless attr.name.match?(CATEGORY_ATTR_PATTERN)

    value = attr.value&.strip
    categories.add(value) if value && !value.empty?
  end
end
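
The attribute filter above can be exercised without Nokogiri. The sketch below is an illustration under assumptions: FakeAttr and FakeElement are hypothetical stand-ins that expose only the name, value and attributes methods the documented code calls, and ATTR_PATTERN is a local copy of CATEGORY_ATTR_PATTERN.

```ruby
require 'set'

# Local copy of the documented pattern; the constant name is illustrative.
ATTR_PATTERN = /category|tag|topic|section|label|theme|subject/i

# Hypothetical stand-ins for a Nokogiri attribute and element.
FakeAttr = Struct.new(:name, :value)
FakeElement = Struct.new(:attributes)

# Same logic as the documented method, applied to the stand-ins.
def collect_data_categories(categories, element)
  element.attributes.each_value do |attr|
    next unless attr.name.match?(ATTR_PATTERN)

    value = attr.value&.strip
    categories.add(value) if value && !value.empty?
  end
end

attrs = {
  'data-category' => FakeAttr.new('data-category', ' Sports '),
  'href' => FakeAttr.new('href', '/sports'),   # name does not match, skipped
  'data-tag' => FakeAttr.new('data-tag', '   ') # blank after strip, skipped
}

categories = Set.new
collect_data_categories(categories, FakeElement.new(attrs))
p categories.to_a # => ["Sports"]
```

Only the attribute name is tested against the pattern; the value is taken as-is after stripping whitespace.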

.extract_text_categories!(categories, element) ⇒ void

This method returns an undefined value.

Extracts text-based categories from elements, splitting content into discrete values.

Parameters:

  • categories (Set<String>)

    Accumulator set

  • element (Nokogiri::XML::Element)

    metadata element whose text may contain delimiters



# File 'lib/html2rss/category_extractor.rb', line 72

def self.extract_text_categories!(categories, element)
  if element.name == 'a'
    add_text_to_categories!(categories, element)
    return
  end

  anchors = element.css('a')

  if anchors.any?
    anchors.each { |node| add_text_to_categories!(categories, node) }
  else
    extract_split_text_categories!(categories, element)
  end
end
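
The branching in extract_text_categories! can be summarized as a three-way decision. The helpers it delegates to (add_text_to_categories! and extract_split_text_categories!) are not shown in this section, so the sketch below only reports which branch would be taken. FakeNode and text_strategy are hypothetical names, and the anchors field stands in for the result of element.css('a').

```ruby
# Hypothetical stand-in: name is the tag name, anchors the nested <a> nodes.
FakeNode = Struct.new(:name, :anchors)

# Reports which branch the documented method would take for an element.
def text_strategy(element)
  return :own_text if element.name == 'a' # the element is itself a link

  # Prefer nested links; fall back to splitting the element's text.
  element.anchors.any? ? :anchor_texts : :split_text
end

link = FakeNode.new('a', [])
list = FakeNode.new('ul', [link, link])
span = FakeNode.new('span', [])

p text_strategy(link) # => :own_text
p text_strategy(list) # => :anchor_texts
p text_strategy(span) # => :split_text
```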