Class: Html2rss::Selectors

Inherits: Object

Includes: Enumerable

Defined in: lib/html2rss/selectors.rb,
lib/html2rss/selectors/config.rb,
lib/html2rss/selectors/extractors.rb,
lib/html2rss/selectors/extractors/href.rb,
lib/html2rss/selectors/extractors/html.rb,
lib/html2rss/selectors/extractors/text.rb,
lib/html2rss/selectors/post_processors.rb,
lib/html2rss/selectors/extractors/static.rb,
lib/html2rss/selectors/extractors/attribute.rb,
lib/html2rss/selectors/post_processors/base.rb,
lib/html2rss/selectors/post_processors/gsub.rb,
lib/html2rss/selectors/object_to_xml_converter.rb,
lib/html2rss/selectors/post_processors/template.rb,
lib/html2rss/selectors/post_processors/parse_uri.rb,
lib/html2rss/selectors/post_processors/substring.rb,
lib/html2rss/selectors/post_processors/parse_time.rb,
lib/html2rss/selectors/post_processors/sanitize_html.rb,
lib/html2rss/selectors/post_processors/html_to_markdown.rb,
lib/html2rss/selectors/post_processors/markdown_to_html.rb,
lib/html2rss/selectors/post_processors/html_transformers/wrap_img_in_a.rb,
lib/html2rss/selectors/post_processors/html_transformers/transform_urls_to_absolute_ones.rb

Overview

This scraper extracts articles from an HTML page using the CSS selectors defined in the feed config.

It supports the traditional feed configs that html2rss originally provided, so existing setups keep working.

It can also convert JSON into XML, a capability unique to this scraper.

Defined Under Namespace

Modules: Extractors, PostProcessors

Classes: Config, Context, InvalidSelectorName, ObjectToXmlConverter

Constant Summary

DEFAULT_CONFIG = { items: { enhance: true } }.freeze

Default selectors options merged into user configuration.

ITEMS_SELECTOR_KEY = :items

Selector key that points to the root list of article nodes.

ITEM_TAGS = %i[title url description author comments published_at guid enclosure categories].freeze

Supported RSS item attributes extractable through selectors.

SPECIAL_ATTRIBUTES = Set[:guid, :enclosure, :categories].freeze

Item attributes that require dedicated extraction logic.

RENAMED_ATTRIBUTES = { published_at: %i[updated pubDate] }.freeze

Mapping of new attribute names to their legacy names for backward compatibility.
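As an illustration of how these constants might be applied, the sketch below merges DEFAULT_CONFIG under a user-supplied selectors hash and maps legacy attribute names to their current ones. The helper names `deep_merge` and `rename_legacy_keys` and the sample config are hypothetical; the library's own preparation logic is not shown on this page and may differ.

```ruby
DEFAULT_CONFIG = { items: { enhance: true } }.freeze
RENAMED_ATTRIBUTES = { published_at: %i[updated pubDate] }.freeze

# Hypothetical helper: recursively merge user options over the defaults.
def deep_merge(defaults, overrides)
  defaults.merge(overrides) do |_key, default_value, override_value|
    if default_value.is_a?(Hash) && override_value.is_a?(Hash)
      deep_merge(default_value, override_value)
    else
      override_value
    end
  end
end

# Hypothetical helper: replace legacy keys with their current names.
def rename_legacy_keys(selectors)
  selectors.to_h do |key, value|
    current_name, = RENAMED_ATTRIBUTES.find { |_current, legacy| legacy.include?(key) }
    [current_name || key, value]
  end
end

# Sample user config (assumed shape) that still uses the legacy :pubDate key.
user_config = { items: { selector: 'article' }, pubDate: { selector: 'time' } }
prepared = rename_legacy_keys(deep_merge(DEFAULT_CONFIG, user_config))
# prepared[:items]        => { enhance: true, selector: 'article' }
# prepared[:published_at] => { selector: 'time' }
```

A user-supplied `enhance: false` under `:items` would win over the default in this sketch, because override values take precedence at every level.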

Instance Method Summary

#articles ⇒ Array<Html2rss::RssBuilder::Article>

Returns articles extracted from the response.

#each {|article| ... } ⇒ Enumerator

Iterates over each scraped article.

#enhance? ⇒ Boolean

Whether to enhance the article hash with auto_source’s semantic HTML extraction.

#enhance_article_hash(article_hash, article_tag, base_url = @url) ⇒ Hash

Enhances the article hash using semantic HTML extraction.

#extract_article(item, page_response = response) ⇒ Hash

Extracts an article hash for a given item element.

#initialize(response, selectors:, time_zone:) ⇒ Selectors constructor

Initializes a new Selectors instance.

#items_selector ⇒ String

Returns the CSS selector for the items.

#select(name, item, base_url: @url) ⇒ Object⁺

Selects the value for a given attribute from an HTML element.

Constructor Details

#initialize(response, selectors:, time_zone:) ⇒ `Selectors`

Initializes a new Selectors instance.

Parameters:

response (RequestService::Response) —

The response object.
selectors (Hash) —

A hash of CSS selectors.
time_zone (String) —

Time zone string used for date parsing.

# File 'lib/html2rss/selectors.rb', line 42

def initialize(response, selectors:, time_zone:)
  @response = response
  @url = response.url
  @selectors = selectors
  @time_zone = time_zone

  prepare_selectors!
  @rss_item_attributes = @selectors.keys & Html2rss::RssBuilder::Article::PROVIDED_KEYS
end
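
The last line of the constructor keeps only those selector names that correspond to article attributes. A minimal sketch of that intersection follows; the `PROVIDED_KEYS` values here are an assumed stand-in for Html2rss::RssBuilder::Article::PROVIDED_KEYS, whose real contents are not listed on this page, and the sample selectors are hypothetical.

```ruby
# Assumed stand-in for Html2rss::RssBuilder::Article::PROVIDED_KEYS.
PROVIDED_KEYS = %i[title url description published_at].freeze

# Hypothetical selectors hash as it might look after preparation.
selectors = {
  items: { selector: 'article' },     # structural key, not an article attribute
  title: { selector: 'h2' },
  url: { selector: 'a' },
  teaser_length: { selector: '.n' }   # made-up key the article does not know
}

# Array#& keeps the order of the left operand and drops every other key.
rss_item_attributes = selectors.keys & PROVIDED_KEYS
# => [:title, :url]
```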

Instance Method Details

#articles ⇒ `Array<Html2rss::RssBuilder::Article>`

Returns articles extracted from the response. The order is reversed when the items config sets order to 'reverse'.

Returns:

(Array<Html2rss::RssBuilder::Article>)




# File 'lib/html2rss/selectors.rb', line 57

def articles
  @articles ||= @selectors.dig(ITEMS_SELECTOR_KEY, :order) == 'reverse' ? to_a.tap(&:reverse!) : to_a
end

#each {|article| ... } ⇒ `Enumerator`

Iterates over each scraped article.

Yields:

(article) —

Gives each article as an Html2rss::RssBuilder::Article.

Returns:

(Enumerator) —

An enumerator if no block is given.

# File 'lib/html2rss/selectors.rb', line 66

def each(&)
  return enum_for(:each) unless block_given?

  enhance = enhance?

  parsed_body.css(items_selector).each do |item|
    article_hash = extract_article(item, response)

    enhance_article_hash(article_hash, item, response.url) if enhance

    yield Html2rss::RssBuilder::Article.new(**article_hash, scraper: self.class)
  end
end
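
The each and articles methods above follow the usual Enumerable pattern: each returns an Enumerator when called without a block, and articles materializes the list once, reversing it on request. The sketch below reproduces that pattern with plain strings in place of parsed HTML. `ArticleList` and its constructor arguments are hypothetical and not part of html2rss.

```ruby
# Hypothetical stand-in for Selectors that iterates over plain strings.
class ArticleList
  include Enumerable

  def initialize(rows, order: nil)
    @rows = rows
    @order = order
  end

  # Yields each transformed row, or returns an Enumerator without a block.
  def each
    return enum_for(:each) unless block_given?

    @rows.each { |row| yield row.upcase }
  end

  # Builds the array once; reverses it in place when order is 'reverse'.
  def articles
    @articles ||= @order == 'reverse' ? to_a.tap(&:reverse!) : to_a
  end
end

list = ArticleList.new(%w[first second third], order: 'reverse')
list.articles     # => ["THIRD", "SECOND", "FIRST"]
list.each.first   # => "FIRST" (each itself keeps source order)
list.map(&:size)  # Enumerable methods are available through each
```

Because `to_a` comes from Enumerable and builds a fresh array, reversing it in place does not affect the underlying rows.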

#enhance? ⇒ `Boolean`

Returns whether to enhance the article hash with auto_source’s semantic HTML extraction.

Returns:

(Boolean) —

true when the :enhance option under :items is truthy, false otherwise.

# File 'lib/html2rss/selectors.rb', line 86

def enhance? = !!@selectors.dig(ITEMS_SELECTOR_KEY, :enhance)

#enhance_article_hash(article_hash, article_tag, base_url = @url) ⇒ `Hash`

Enhances the article hash using semantic HTML extraction. Only adds keys that are missing from the original hash.

Parameters:

article_hash (Hash) —

The original article hash.
article_tag (Nokogiri::XML::Element) —

HTML element to extract additional info from.
base_url (String, Html2rss::Url) (defaults to: @url) —

Base URL used for normalization during enhancement.

Returns:

(Hash) —

The enhanced article hash.

# File 'lib/html2rss/selectors.rb', line 106

def enhance_article_hash(article_hash, article_tag, base_url = @url)
  selected_anchor = HtmlExtractor.main_anchor_for(article_tag)
  return article_hash unless selected_anchor

  extracted = HtmlExtractor.new(article_tag, base_url:, selected_anchor:).call
  return article_hash unless extracted

  extracted.each_with_object(article_hash) do |(key, value), hash|
    next if value.nil? || (hash.key?(key) && hash[key])

    hash[key] = value
  end
end
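
The merge rule in the loop above can be tried without Nokogiri or HtmlExtractor. In this sketch `fill_missing` is a hypothetical name, and both hashes are made-up sample data standing in for the selector result and the semantic extraction result.

```ruby
# Copies a value only when the target has no truthy value for that key.
def fill_missing(article_hash, extracted)
  extracted.each_with_object(article_hash) do |(key, value), hash|
    next if value.nil? || (hash.key?(key) && hash[key])

    hash[key] = value
  end
end

article = { title: 'Selector title', description: nil }
extracted = {
  title: 'Semantic title',        # ignored: the article already has a title
  description: 'Teaser text',     # used: the existing value is nil
  url: 'https://example.com/a',   # used: the key is missing
  image: nil                      # ignored: nil values are never copied
}

result = fill_missing(article, extracted)
# => { title: 'Selector title', description: 'Teaser text', url: 'https://example.com/a' }
```

Note that the hash is updated in place and the same object is returned, so callers that ignore the return value still see the added keys.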

#extract_article(item, page_response = response) ⇒ `Hash`

Extracts an article hash for a given item element.

Parameters:

item (Nokogiri::XML::Element) —

The element to extract from.
page_response (RequestService::Response) (defaults to: response) —

Response whose URL serves as the base URL for selector extraction.

Returns:

(Hash) —

Hash of attributes for the article.




# File 'lib/html2rss/selectors.rb', line 94

def extract_article(item, page_response = response)
  @rss_item_attributes.to_h { |key| [key, select(key, item, base_url: page_response.url)] }.compact
end
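
The one-line body above builds a hash from the attribute list and then drops attributes for which nothing was selected. A minimal sketch of that idiom, where the `selected_values` lookup is a hypothetical stand-in for the call to select:

```ruby
attributes = %i[title url author]

# Stand-in for select(key, item, base_url: ...); nil means nothing was found.
selected_values = { title: 'Hello', url: 'https://example.com/hello', author: nil }

# Array#to_h with a block maps each key to a [key, value] pair;
# Hash#compact then removes the pairs whose value is nil.
article_hash = attributes.to_h { |key| [key, selected_values[key]] }.compact
# => { title: 'Hello', url: 'https://example.com/hello' }
```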

#items_selector ⇒ `String`

Returns the CSS selector for the items.

Returns:

(String) —

the CSS selector for the items

# File 'lib/html2rss/selectors.rb', line 83

def items_selector = @selectors.dig(ITEMS_SELECTOR_KEY, :selector)
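
Both items_selector and enhance? read nested options with Hash#dig, which returns nil instead of raising when a level is missing. The sketch below shows the same lookups as plain functions that take the hash as an argument (the real methods read @selectors); the sample config values are hypothetical.

```ruby
ITEMS_SELECTOR_KEY = :items

# Returns the configured CSS selector, or nil when none is set.
def items_selector(selectors)
  selectors.dig(ITEMS_SELECTOR_KEY, :selector)
end

# The double negation coerces nil and false to false, anything else to true.
def enhance?(selectors)
  !!selectors.dig(ITEMS_SELECTOR_KEY, :enhance)
end

config = { items: { selector: 'ul.posts > li', enhance: true } }
items_selector(config)    # => "ul.posts > li"
enhance?(config)          # => true
enhance?({ items: {} })   # => false
enhance?({})              # => false
```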

#select(name, item, base_url: @url) ⇒ `Object`⁺

Selects the value for a given attribute from an HTML element.

Parameters:

name (Symbol, String) —

Name of the attribute.
item (Nokogiri::XML::Element) —

The HTML element to process.
base_url (String, Html2rss::Url) (defaults to: @url) —

Base URL used to resolve relative values during extraction.

Returns:

(Object, Array<Object>) —

The selected value(s).

Raises:

(InvalidSelectorName) —

If the attribute name is invalid or not defined.

# File 'lib/html2rss/selectors.rb', line 128

def select(name, item, base_url: @url)
  name = name.to_sym

  raise InvalidSelectorName, "Attribute selector '#{name}' is reserved for items." if name == ITEMS_SELECTOR_KEY

  selector_key, config = selector_config_for(name)

  if SPECIAL_ATTRIBUTES.member?(selector_key)
    select_special(selector_key, item:, config:, base_url:)
  else
    select_regular(selector_key, item:, config:, base_url:)
  end
end
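
The control flow of select can be sketched in isolation: normalize the name, reject the reserved items key, then route to special or regular handling by set membership. In this sketch `route` is a hypothetical function that returns a symbol instead of extracting anything, and the lookup done by selector_config_for (not shown on this page) is left out.

```ruby
require 'set'

# Local stand-in for Html2rss::Selectors::InvalidSelectorName.
class InvalidSelectorName < StandardError; end

ITEMS_SELECTOR_KEY = :items
SPECIAL_ATTRIBUTES = Set[:guid, :enclosure, :categories].freeze

# Returns which extraction path a selector name would take.
def route(name)
  name = name.to_sym

  raise InvalidSelectorName, "Attribute selector '#{name}' is reserved for items." if name == ITEMS_SELECTOR_KEY

  SPECIAL_ATTRIBUTES.member?(name) ? :special : :regular
end

route('guid')   # => :special
route(:title)   # => :regular

begin
  route('items')
rescue InvalidSelectorName => e
  e.message     # => "Attribute selector 'items' is reserved for items."
end
```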
