Class: Html2rss::Selectors

Inherits:
Object
  • Object
Includes:
Enumerable
Defined in:
lib/html2rss/selectors.rb,
lib/html2rss/selectors/config.rb,
lib/html2rss/selectors/extractors.rb,
lib/html2rss/selectors/extractors/href.rb,
lib/html2rss/selectors/extractors/html.rb,
lib/html2rss/selectors/extractors/text.rb,
lib/html2rss/selectors/post_processors.rb,
lib/html2rss/selectors/extractors/static.rb,
lib/html2rss/selectors/extractors/attribute.rb,
lib/html2rss/selectors/post_processors/base.rb,
lib/html2rss/selectors/post_processors/gsub.rb,
lib/html2rss/selectors/object_to_xml_converter.rb,
lib/html2rss/selectors/post_processors/template.rb,
lib/html2rss/selectors/post_processors/parse_uri.rb,
lib/html2rss/selectors/post_processors/substring.rb,
lib/html2rss/selectors/post_processors/parse_time.rb,
lib/html2rss/selectors/post_processors/sanitize_html.rb,
lib/html2rss/selectors/post_processors/html_to_markdown.rb,
lib/html2rss/selectors/post_processors/markdown_to_html.rb,
lib/html2rss/selectors/post_processors/html_transformers/wrap_img_in_a.rb,
lib/html2rss/selectors/post_processors/html_transformers/transform_urls_to_absolute_ones.rb

Overview

This scraper extracts articles from an HTML page using the CSS selectors defined in the feed config.

It supports the traditional feed configs that html2rss originally provided, ensuring compatibility with existing setups.

It can also convert JSON into XML, which extends it beyond HTML pages.

Defined Under Namespace

Modules: Extractors, PostProcessors
Classes: Config, Context, InvalidSelectorName, ObjectToXmlConverter

Constant Summary

DEFAULT_CONFIG =

Default selectors options merged into user configuration.

{ items: { enhance: true } }.freeze
ITEMS_SELECTOR_KEY =

Selector key that points to the root list of article nodes.

:items
ITEM_TAGS =

Supported RSS item attributes extractable through selectors.

%i[title url description author comments published_at guid enclosure categories].freeze
SPECIAL_ATTRIBUTES =

Item attributes that require dedicated extraction logic.

Set[:guid, :enclosure, :categories].freeze
RENAMED_ATTRIBUTES =

Mapping of new attribute names to their legacy names for backward compatibility.

{ published_at: %i[updated pubDate] }.freeze
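
As a rough illustration of how these constants could interact with a feed config, the sketch below merges a user-supplied selectors hash over the default and moves a legacy attribute name to its new name. The two constants are re-declared locally so the snippet runs without the gem. `deep_merge` and `normalize_renamed` are hypothetical helpers written for this example; the actual merging and renaming logic in html2rss is not shown on this page and may differ.

```ruby
# Local copies of the documented constants so the sketch runs without html2rss.
DEFAULT_CONFIG = { items: { enhance: true } }.freeze
RENAMED_ATTRIBUTES = { published_at: %i[updated pubDate] }.freeze

# Hypothetical helper: recursively merge `overrides` over `defaults`.
def deep_merge(defaults, overrides)
  defaults.merge(overrides) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Hypothetical helper: move values stored under legacy names to the new name.
def normalize_renamed(selectors)
  result = selectors.dup
  RENAMED_ATTRIBUTES.each do |new_name, old_names|
    old_names.each do |old_name|
      next unless result.key?(old_name)

      value = result.delete(old_name)
      result[new_name] = value unless result.key?(new_name)
    end
  end
  result
end

# Example feed config (selector strings are made up for illustration).
user_selectors = {
  items: { selector: 'article.post', order: 'reverse' },
  title: { selector: 'h2' },
  pubDate: { selector: 'time' }
}

selectors = normalize_renamed(deep_merge(DEFAULT_CONFIG, user_selectors))
```

Under these assumptions, `selectors[:items]` keeps the default `enhance: true` next to the user's `selector` and `order`, and the legacy `pubDate` entry ends up under `published_at`.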

Instance Method Summary

Constructor Details

#initialize(response, selectors:, time_zone:) ⇒ Selectors

Initializes a new Selectors instance.

Parameters:

  • response (RequestService::Response)

    The response object.

  • selectors (Hash)

    A hash of CSS selectors.

  • time_zone (String)

    Time zone string used for date parsing.



# File 'lib/html2rss/selectors.rb', line 42

def initialize(response, selectors:, time_zone:)
  @response = response
  @url = response.url
  @selectors = selectors
  @time_zone = time_zone

  prepare_selectors!
  @rss_item_attributes = @selectors.keys & Html2rss::RssBuilder::Article::PROVIDED_KEYS
end

Instance Method Details

#articles ⇒ Array<Html2rss::RssBuilder::Article>

Returns the articles extracted from the response. The result is memoized, and its order is reversed when the items selector config sets order to 'reverse'.

Returns:

  • (Array<Html2rss::RssBuilder::Article>)



# File 'lib/html2rss/selectors.rb', line 57

def articles
  @articles ||= @selectors.dig(ITEMS_SELECTOR_KEY, :order) == 'reverse' ? to_a.tap(&:reverse!) : to_a
end
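
The memoized ordering can be tried in isolation with a small stand-in class. This is a sketch only: `OrderedList` and its `order:` keyword are hypothetical and replace the real selectors hash lookup.

```ruby
# Hypothetical stand-in for the scraper; `items` replaces scraped articles.
class OrderedList
  include Enumerable

  def initialize(items, order: nil)
    @items = items
    @order = order
  end

  def each(&block)
    return enum_for(:each) unless block_given?

    @items.each(&block)
  end

  # Builds the array once; reverses it in place only when order is 'reverse'.
  def articles
    @articles ||= @order == 'reverse' ? to_a.tap(&:reverse!) : to_a
  end
end

forward_list = OrderedList.new(%w[a b c])
reverse_list = OrderedList.new(%w[a b c], order: 'reverse')

forward = forward_list.articles
reversed = reverse_list.articles
```

Because `to_a` returns a fresh array, the in-place `reverse!` does not touch the source collection, and repeated calls to `articles` return the same array object.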

#each {|article| ... } ⇒ Enumerator

Iterates over each scraped article.

Yields:

  • (article)

    Gives each article as an Html2rss::RssBuilder::Article.

Returns:

  • (Enumerator)

    An enumerator if no block is given.



# File 'lib/html2rss/selectors.rb', line 66

def each(&)
  return enum_for(:each) unless block_given?

  enhance = enhance?

  parsed_body.css(items_selector).each do |item|
    article_hash = extract_article(item, response)

    enhance_article_hash(article_hash, item, response.url) if enhance

    yield Html2rss::RssBuilder::Article.new(**article_hash, scraper: self.class)
  end
end
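
The `enum_for(:each) unless block_given?` guard is what lets callers use the scraper either with a block or as an external Enumerator. A minimal sketch of that pattern, using a hypothetical `TitleFeed` class that yields plain hashes instead of `Html2rss::RssBuilder::Article` objects:

```ruby
# Hypothetical class; it yields hashes rather than real Article objects.
class TitleFeed
  include Enumerable

  def initialize(titles)
    @titles = titles
  end

  def each
    return enum_for(:each) unless block_given?

    @titles.each { |title| yield({ title: title }) }
  end
end

feed = TitleFeed.new(['First post', 'Second post'])

enumerator = feed.each                            # no block: an Enumerator
first_article = enumerator.next                   # pull a single element
titles = feed.map { |article| article[:title] }   # Enumerable method built on #each
```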

#enhance? ⇒ Boolean

Returns whether to enhance the article hash with auto_source’s semantic HTML extraction.

Returns:

  • (Boolean)

    whether to enhance the article hash with auto_source’s semantic HTML extraction.



# File 'lib/html2rss/selectors.rb', line 86

def enhance? = !!@selectors.dig(ITEMS_SELECTOR_KEY, :enhance)

#enhance_article_hash(article_hash, article_tag, base_url = @url) ⇒ Hash

Enhances the article hash using semantic HTML extraction. Only adds keys that are missing from the original hash.

Parameters:

  • article_hash (Hash)

    The original article hash.

  • article_tag (Nokogiri::XML::Element)

    HTML element to extract additional info from.

  • base_url (String, Html2rss::Url) (defaults to: @url)

    base URL for normalization during enhancement

Returns:

  • (Hash)

    The enhanced article hash.



# File 'lib/html2rss/selectors.rb', line 106

def enhance_article_hash(article_hash, article_tag, base_url = @url)
  selected_anchor = HtmlExtractor.main_anchor_for(article_tag)
  return article_hash unless selected_anchor

  extracted = HtmlExtractor.new(article_tag, base_url:, selected_anchor:).call
  return article_hash unless extracted

  extracted.each_with_object(article_hash) do |(key, value), hash|
    next if value.nil? || (hash.key?(key) && hash[key])

    hash[key] = value
  end
end
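
The merge rule in the final loop can be exercised on plain hashes. In this sketch `fill_missing` is a hypothetical helper that copies only that loop; the `HtmlExtractor` calls are left out, and the sample data is invented.

```ruby
# Hypothetical helper mirroring the merge rule: copy a value only when it is
# non-nil and the target holds no truthy value under that key.
def fill_missing(article_hash, extracted)
  extracted.each_with_object(article_hash) do |(key, value), hash|
    next if value.nil? || (hash.key?(key) && hash[key])

    hash[key] = value
  end
end

article = { title: 'From selectors', description: nil }
extracted = {
  title: 'From semantic HTML',
  description: 'Teaser text',
  image: nil,
  url: 'https://example.com/post'
}

result = fill_missing(article, extracted)
```

Here `title` keeps the selector value, `description` is filled because the existing value is nil, `image` is skipped because the extracted value is nil, and `url` is added. The hash is modified in place, so `result` and `article` are the same object.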

#extract_article(item, page_response = response) ⇒ Hash

Extracts an article hash for a given item element.

Parameters:

  • item (Nokogiri::XML::Element)

    The element to extract from.

  • page_response (RequestService::Response) (defaults to: response)

    response used for selector extraction context

Returns:

  • (Hash)

    Hash of attributes for the article.



# File 'lib/html2rss/selectors.rb', line 94

def extract_article(item, page_response = response)
  @rss_item_attributes.to_h { |key| [key, select(key, item, base_url: page_response.url)] }.compact
end
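
The shape of this method (build key/value pairs, then drop nils) can be sketched without Nokogiri. Everything below is a stand-in: `provided_keys` is a hypothetical substitute for `Html2rss::RssBuilder::Article::PROVIDED_KEYS`, whose real contents are not listed on this page, and `fake_select` replaces `#select` with a plain hash lookup.

```ruby
# Hypothetical stand-in for PROVIDED_KEYS; the real list lives in html2rss.
provided_keys = %i[title url description author]

# Example selectors hash; only keys that are also provided keys are extracted.
selectors = {
  items: { selector: 'article' },
  title: { selector: 'h2' },
  url: { selector: 'a' },
  author: { selector: '.byline' }
}
rss_item_attributes = selectors.keys & provided_keys

# Hypothetical replacement for #select: returns nil when nothing matches.
fake_item = { title: 'Hello', url: '/hello' }
fake_select = ->(key, item) { item[key] }

article_hash = rss_item_attributes.to_h { |key| [key, fake_select.call(key, fake_item)] }.compact
```

The intersection keeps the order of `selectors.keys` and drops `:items`, and `compact` removes `:author` because the lookup returned nil for it.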

#items_selector ⇒ String

Returns the CSS selector for the items.

Returns:

  • (String)

    the CSS selector for the items



# File 'lib/html2rss/selectors.rb', line 83

def items_selector = @selectors.dig(ITEMS_SELECTOR_KEY, :selector)

#select(name, item, base_url: @url) ⇒ Object, Array<Object>

Selects the value for a given attribute from an HTML element.

Parameters:

  • name (Symbol, String)

    Name of the attribute.

  • item (Nokogiri::XML::Element)

    The HTML element to process.

  • base_url (String, Html2rss::Url) (defaults to: @url)

    base URL for relative extraction values

Returns:

  • (Object, Array<Object>)

    The selected value(s).

Raises:

  • (InvalidSelectorName)

    If name is the reserved items key (:items).



# File 'lib/html2rss/selectors.rb', line 128

def select(name, item, base_url: @url)
  name = name.to_sym

  raise InvalidSelectorName, "Attribute selector '#{name}' is reserved for items." if name == ITEMS_SELECTOR_KEY

  selector_key, config = selector_config_for(name)

  if SPECIAL_ATTRIBUTES.member?(selector_key)
    select_special(selector_key, item:, config:, base_url:)
  else
    select_regular(selector_key, item:, config:, base_url:)
  end
end
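
The dispatch above can be sketched as a small routing function. This is a simplification under stated assumptions: `SelectSketch` and `route` are hypothetical names, the constants are local copies, and the `selector_config_for` lookup (which may map a name to a different selector key) is skipped, so the name is used as the key directly.

```ruby
require 'set'

# Hypothetical module; constants are local copies of the documented ones.
module SelectSketch
  class InvalidSelectorName < StandardError; end

  ITEMS_SELECTOR_KEY = :items
  SPECIAL_ATTRIBUTES = Set[:guid, :enclosure, :categories].freeze

  # Returns a label for the branch that would handle `name`.
  def self.route(name)
    name = name.to_sym
    raise InvalidSelectorName, "Attribute selector '#{name}' is reserved for items." if name == ITEMS_SELECTOR_KEY

    SPECIAL_ATTRIBUTES.member?(name) ? :special : :regular
  end
end

regular_branch = SelectSketch.route('title')   # strings are symbolized first
special_branch = SelectSketch.route(:guid)

begin
  SelectSketch.route(:items)
rescue SelectSketch::InvalidSelectorName => e
  reserved_message = e.message
end
```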