Class: Html2rss::Selectors
- Inherits:
Object
- Object
- Html2rss::Selectors
- Includes:
- Enumerable
- Defined in:
- lib/html2rss/selectors.rb,
lib/html2rss/selectors/config.rb,
lib/html2rss/selectors/extractors.rb,
lib/html2rss/selectors/extractors/href.rb,
lib/html2rss/selectors/extractors/html.rb,
lib/html2rss/selectors/extractors/text.rb,
lib/html2rss/selectors/post_processors.rb,
lib/html2rss/selectors/extractors/static.rb,
lib/html2rss/selectors/extractors/attribute.rb,
lib/html2rss/selectors/post_processors/base.rb,
lib/html2rss/selectors/post_processors/gsub.rb,
lib/html2rss/selectors/object_to_xml_converter.rb,
lib/html2rss/selectors/post_processors/template.rb,
lib/html2rss/selectors/post_processors/parse_uri.rb,
lib/html2rss/selectors/post_processors/substring.rb,
lib/html2rss/selectors/post_processors/parse_time.rb,
lib/html2rss/selectors/post_processors/sanitize_html.rb,
lib/html2rss/selectors/post_processors/html_to_markdown.rb,
lib/html2rss/selectors/post_processors/markdown_to_html.rb,
lib/html2rss/selectors/post_processors/html_transformers/wrap_img_in_a.rb,
lib/html2rss/selectors/post_processors/html_transformers/transform_urls_to_absolute_ones.rb
Overview
This scraper extracts articles from an HTML page using the CSS selectors defined in the feed config.
It supports the traditional feed configs that html2rss originally provided, so existing setups keep working.
It can also convert JSON into XML (see ObjectToXmlConverter).
Defined Under Namespace
Modules: Extractors, PostProcessors
Classes: Config, Context, InvalidSelectorName, ObjectToXmlConverter
Constant Summary
- DEFAULT_CONFIG =
Default selector options merged into the user configuration.
{ items: { enhance: true } }.freeze
- ITEMS_SELECTOR_KEY =
Selector key that points to the root list of article nodes.
:items
- ITEM_TAGS =
Supported RSS item attributes extractable through selectors.
%i[title url description author comments published_at guid enclosure categories].freeze
- SPECIAL_ATTRIBUTES =
Item attributes that require dedicated extraction logic.
Set[:guid, :enclosure, :categories].freeze
- RENAMED_ATTRIBUTES =
Mapping of new attribute names to their legacy names for backward compatibility.
{ published_at: %i[updated pubDate] }.freeze
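A mapping such as RENAMED_ATTRIBUTES can be used to rewrite legacy selector keys before extraction. The following is a minimal sketch under stated assumptions: `normalize_legacy_keys` is a hypothetical helper, not part of html2rss, and the rule that an explicit new-style key wins is an assumption.

```ruby
# Same shape as the RENAMED_ATTRIBUTES constant documented above.
RENAMED_ATTRIBUTES = { published_at: %i[updated pubDate] }.freeze

# Hypothetical helper (not html2rss API): returns a copy of the selectors
# hash in which legacy keys are renamed to their current names.
def normalize_legacy_keys(selectors)
  result = selectors.dup
  RENAMED_ATTRIBUTES.each do |new_name, legacy_names|
    legacy_names.each do |legacy_name|
      next unless result.key?(legacy_name)

      value = result.delete(legacy_name)
      # Assumption: an explicitly configured new-style key takes precedence.
      result[new_name] = value unless result.key?(new_name)
    end
  end
  result
end

config = { title: { selector: 'h2' }, pubDate: { selector: 'time' } }
normalized = normalize_legacy_keys(config)
```

Here `normalized` carries the `time` selector under `:published_at`, while `config` itself is left untouched.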
Instance Method Summary
- #articles ⇒ Array<Html2rss::RssBuilder::Article>
Returns articles extracted from the response.
- #each {|article| ... } ⇒ Enumerator
Iterates over each scraped article.
- #enhance? ⇒ Boolean
Whether to enhance the article hash with auto_source’s semantic HTML extraction.
- #enhance_article_hash(article_hash, article_tag, base_url = @url) ⇒ Hash
Enhances the article hash using semantic HTML extraction.
- #extract_article(item, page_response = response) ⇒ Hash
Extracts an article hash for a given item element.
- #initialize(response, selectors:, time_zone:) ⇒ Selectors
constructor
Initializes a new Selectors instance.
- #items_selector ⇒ String
Returns the CSS selector for the items.
- #select(name, item, base_url: @url) ⇒ Object+
Selects the value for a given attribute from an HTML element.
Constructor Details
#initialize(response, selectors:, time_zone:) ⇒ Selectors
Initializes a new Selectors instance.
# File 'lib/html2rss/selectors.rb', line 42
def initialize(response, selectors:, time_zone:)
  @response = response
  @url = response.url
  @selectors = selectors
  @time_zone = time_zone

  prepare_selectors!
  @rss_item_attributes = @selectors.keys & Html2rss::RssBuilder::Article::PROVIDED_KEYS
end
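The last assignment in the constructor keeps only those selector names that are also RSS item attributes. A small sketch of that intersection, assuming a stand-in `PROVIDED_KEYS` list (the real contents of `Html2rss::RssBuilder::Article::PROVIDED_KEYS` are not shown on this page) and an illustrative selectors hash:

```ruby
# Stand-in for Html2rss::RssBuilder::Article::PROVIDED_KEYS; values are assumed.
PROVIDED_KEYS = %i[title url description author published_at].freeze

# Illustrative feed config; the :teaser selector is a made-up helper selector.
selectors = {
  items: { selector: 'article' },
  title: { selector: 'h2' },
  url: { selector: 'a', extractor: 'href' },
  teaser: { selector: 'p.teaser' }
}

# Array#& keeps the elements of the left operand that also occur in the right one,
# in the order of the left operand.
rss_item_attributes = selectors.keys & PROVIDED_KEYS
```

In this sketch `:items` and `:teaser` drop out, so only `:title` and `:url` would later be extracted per item.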
Instance Method Details
#articles ⇒ Array<Html2rss::RssBuilder::Article>
Returns the articles extracted from the response, memoized after the first call. The order is reversed when the items config sets order to 'reverse'.
# File 'lib/html2rss/selectors.rb', line 57
def articles
  @articles ||= @selectors.dig(ITEMS_SELECTOR_KEY, :order) == 'reverse' ? to_a.tap(&:reverse!) : to_a
end
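The memoize-and-reverse pattern used by #articles can be tried in isolation. The `Listing` class below is an illustrative stand-in, not html2rss code; only the `||=` expression mirrors the method above.

```ruby
# Illustrative stand-in for a scraper; not html2rss code.
class Listing
  include Enumerable

  def initialize(rows, order: nil)
    @rows = rows
    @order = order
  end

  def each(&block)
    return enum_for(:each) unless block_given?

    @rows.each(&block)
  end

  # Built once and reused afterwards. Enumerable#to_a returns a fresh array,
  # so the in-place reverse! does not touch @rows.
  def entries_in_order
    @entries_in_order ||= @order == 'reverse' ? to_a.tap(&:reverse!) : to_a
  end
end

rows = %w[newest middle oldest]
reversed = Listing.new(rows, order: 'reverse').entries_in_order
default = Listing.new(rows).entries_in_order
```

Because `||=` binds more loosely than the ternary, the whole conditional expression is what gets memoized.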
#each {|article| ... } ⇒ Enumerator
Iterates over each scraped article.
# File 'lib/html2rss/selectors.rb', line 66
def each(&)
  return enum_for(:each) unless block_given?

  enhance = enhance?

  parsed_body.css(items_selector).each do |item|
    article_hash = extract_article(item, response)

    enhance_article_hash(article_hash, item, response.url) if enhance

    yield Html2rss::RssBuilder::Article.new(**article_hash, scraper: self.class)
  end
end
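Since the class includes Enumerable and #each returns an Enumerator when called without a block, the usual collection methods work on a scraper instance. A minimal sketch of that contract, using a hypothetical `Feed` class in place of Html2rss::Selectors:

```ruby
# Hypothetical class illustrating the each/enum_for contract; not html2rss code.
class Feed
  include Enumerable

  def initialize(rows)
    @rows = rows
  end

  # Yields one hash per row; returns an Enumerator when no block is given.
  def each
    return enum_for(:each) unless block_given?

    @rows.each { |row| yield({ title: row.strip }) }
  end
end

feed = Feed.new(['  First ', 'Second'])
titles = feed.map { |article| article[:title] } # provided by Enumerable
enumerator = feed.each                          # no block: an Enumerator
first_article = enumerator.next
```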
#enhance? ⇒ Boolean
Returns whether to enhance the article hash with auto_source’s semantic HTML extraction.
# File 'lib/html2rss/selectors.rb', line 86
def enhance? = !!@selectors.dig(ITEMS_SELECTOR_KEY, :enhance)
#enhance_article_hash(article_hash, article_tag, base_url = @url) ⇒ Hash
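Enhances the article hash using semantic HTML extraction. Only fills keys that are missing from the original hash or hold a nil or false value; existing truthy values are kept. The hash is modified in place and returned.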
# File 'lib/html2rss/selectors.rb', line 106
def enhance_article_hash(article_hash, article_tag, base_url = @url)
  selected_anchor = HtmlExtractor.main_anchor_for(article_tag)
  return article_hash unless selected_anchor

  extracted = HtmlExtractor.new(article_tag, base_url:, selected_anchor:).call
  return article_hash unless extracted

  extracted.each_with_object(article_hash) do |(key, value), hash|
    next if value.nil? || (hash.key?(key) && hash[key])

    hash[key] = value
  end
end
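The merge rule at the end of the method can be tried on plain hashes. In this sketch `fill_missing` is a hypothetical name that wraps the same `each_with_object` loop; the sample values are made up.

```ruby
# Hypothetical wrapper around the merge loop shown above.
def fill_missing(article_hash, extracted)
  extracted.each_with_object(article_hash) do |(key, value), hash|
    # Skip nil candidates and keys that already hold a truthy value.
    next if value.nil? || (hash.key?(key) && hash[key])

    hash[key] = value
  end
end

article = { title: 'From selectors', description: nil }
extracted = {
  title: 'From markup',           # ignored: title is already set
  description: 'Teaser',          # used: the existing value is nil
  url: 'https://example.com/a',   # used: the key is missing
  image: nil                      # ignored: nil is never copied
}

result = fill_missing(article, extracted)
```

The return value is the same object as `article`, since `each_with_object` hands back the memo it was given.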
#extract_article(item, page_response = response) ⇒ Hash
Extracts an article hash for a given item element.
# File 'lib/html2rss/selectors.rb', line 94
def extract_article(item, page_response = response)
  @rss_item_attributes.to_h { |key| [key, select(key, item, base_url: page_response.url)] }.compact
end
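The `to_h { ... }.compact` idiom builds the hash and drops attributes whose selector produced nothing. A sketch with made-up per-attribute results standing in for the calls to #select:

```ruby
attributes = %i[title url author]

# Hypothetical per-attribute results; :author is absent, so the lookup yields nil.
values = { title: 'Hello', url: 'https://example.com/1' }

# Array#to_h with a block maps each element to a [key, value] pair;
# Hash#compact then removes the pairs whose value is nil.
article_hash = attributes.to_h { |key| [key, values[key]] }.compact
```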
#items_selector ⇒ String
Returns the CSS selector for the items.
# File 'lib/html2rss/selectors.rb', line 83
def items_selector = @selectors.dig(ITEMS_SELECTOR_KEY, :selector)
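Both #items_selector and #enhance? read nested options with `Hash#dig`, which returns nil instead of raising when an intermediate key is missing. A short sketch with an illustrative selectors hash:

```ruby
# Illustrative config; the selector strings are made up.
selectors = {
  items: { selector: 'article.post', enhance: true },
  title: { selector: 'h2' }
}
empty = {}

items_selector = selectors.dig(:items, :selector)
enhance = !!selectors.dig(:items, :enhance)

# With no :items entry, dig returns nil and the double negation turns it into false.
missing_selector = empty.dig(:items, :selector)
missing_enhance = !!empty.dig(:items, :enhance)
```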
#select(name, item, base_url: @url) ⇒ Object+
Selects the value for a given attribute from an HTML element.
# File 'lib/html2rss/selectors.rb', line 128
def select(name, item, base_url: @url)
  name = name.to_sym

  raise InvalidSelectorName, "Attribute selector '#{name}' is reserved for items." if name == ITEMS_SELECTOR_KEY

  selector_key, config = selector_config_for(name)

  if SPECIAL_ATTRIBUTES.member?(selector_key)
    select_special(selector_key, item:, config:, base_url:)
  else
    select_regular(selector_key, item:, config:, base_url:)
  end
end
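The guard and the special/regular branching can be sketched without the rest of the class. Here `dispatch` is a hypothetical function that only reports which branch would run; the constants and the error class are local stand-ins with the values documented above.

```ruby
require 'set'

# Local stand-ins mirroring the documented constants and error class.
class InvalidSelectorName < StandardError; end

ITEMS_SELECTOR_KEY = :items
SPECIAL_ATTRIBUTES = Set[:guid, :enclosure, :categories].freeze

# Hypothetical: returns the name of the branch instead of extracting a value.
def dispatch(name)
  name = name.to_sym
  raise InvalidSelectorName, "Attribute selector '#{name}' is reserved for items." if name == ITEMS_SELECTOR_KEY

  SPECIAL_ATTRIBUTES.member?(name) ? :special : :regular
end

guid_branch = dispatch('guid')   # string names are symbolized first
title_branch = dispatch(:title)

reserved_error =
  begin
    dispatch(:items)
    nil
  rescue InvalidSelectorName => e
    e.message
  end
```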