Class: Html2rss::AutoSource::Scraper::WordpressApi
- Inherits: Object
  - Object
  - Html2rss::AutoSource::Scraper::WordpressApi
- Includes: Enumerable
- Defined in:
  - lib/html2rss/auto_source/scraper/wordpress_api.rb
  - lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb
  - lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb
Overview
Scrapes WordPress sites through their REST API instead of parsing article HTML.
Defined Under Namespace
Classes: PageScope, PostsEndpoint
Constant Summary
- API_LINK_SELECTOR =
  Selector for WordPress API discovery link tags.
  'link[rel="https://api.w.org/"][href]'
- CANONICAL_LINK_SELECTOR =
  Selector for canonical link tags used for scope normalization.
  'link[rel="canonical"][href]'
- POSTS_FIELDS =
  Fields requested from the WordPress posts endpoint.
  %w[id title excerpt content link date categories].freeze
- POSTS_QUERY_DEFAULTS =
  Baseline query sent to WordPress posts API follow-ups.
  { '_fields' => POSTS_FIELDS.join(','), 'per_page' => '100' }.freeze
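As a rough illustration of how the query defaults above could end up in a request URL, the sketch below merges extra parameters over `POSTS_QUERY_DEFAULTS` and encodes the result with the standard library. The `posts_url` helper and the `https://example.com/wp-json/wp/v2/posts` address are assumptions made for this example; the page does not show how `PostsEndpoint` actually builds its requests.

```ruby
require 'uri'

# Constants copied from the summary above.
POSTS_FIELDS = %w[id title excerpt content link date categories].freeze
POSTS_QUERY_DEFAULTS = { '_fields' => POSTS_FIELDS.join(','), 'per_page' => '100' }.freeze

# Hypothetical helper (not part of html2rss): merges caller-supplied
# parameters over the defaults and attaches the encoded query to the URL.
def posts_url(endpoint, extra = {})
  uri = URI.parse(endpoint)
  uri.query = URI.encode_www_form(POSTS_QUERY_DEFAULTS.merge(extra))
  uri.to_s
end

# Assumed endpoint URL, used only for illustration.
puts posts_url('https://example.com/wp-json/wp/v2/posts', 'categories' => '7')
```

`URI.encode_www_form` percent-encodes the commas in `_fields` as `%2C`, so the printed query string looks different from the hash literal but decodes back to the same values.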
Class Method Summary
- .articles?(parsed_body) ⇒ Boolean
  Whether the page advertises a WordPress REST API endpoint.
- .options_key ⇒ Symbol
  Scraper config key.
Instance Method Summary
- #each {|article| ... } ⇒ Enumerator, void
  Yields article hashes from the WordPress posts API.
- #initialize(parsed_body, url:, request_session: nil, **_opts) ⇒ void (constructor)
Constructor Details
#initialize(parsed_body, url:, request_session: nil, **_opts) ⇒ void
    # File 'lib/html2rss/auto_source/scraper/wordpress_api.rb', line 43
    def initialize(parsed_body, url:, request_session: nil, **_opts)
      @parsed_body = parsed_body
      @url = Html2rss::Url.from_absolute(url)
      @request_session = request_session
      @page_scope = PageScope.from(parsed_body:, url: @url)
    end
Class Method Details
.articles?(parsed_body) ⇒ Boolean
Returns whether the page advertises a WordPress REST API endpoint.
    # File 'lib/html2rss/auto_source/scraper/wordpress_api.rb', line 30
    def self.articles?(parsed_body)
      return false unless parsed_body

      !parsed_body.at_css(API_LINK_SELECTOR).nil?
    end
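The predicate above can be exercised without any HTML parser by passing an object that responds to `at_css`. The sketch below does that with a hypothetical `StubDocument` class. It assumes the real `parsed_body` is a parsed HTML document (the `at_css` call suggests Nokogiri, but this page does not say so), and it rewrites the logic as a free function purely for illustration.

```ruby
API_LINK_SELECTOR = 'link[rel="https://api.w.org/"][href]'

# Hypothetical stand-in for a parsed HTML document. It answers #at_css by
# looking the selector up in a hash, which is enough to drive the predicate.
class StubDocument
  def initialize(matches)
    @matches = matches
  end

  def at_css(selector)
    @matches[selector]
  end
end

# Same logic as the .articles? listing above, written as a free function.
def articles?(parsed_body)
  return false unless parsed_body

  !parsed_body.at_css(API_LINK_SELECTOR).nil?
end

wordpress_page = StubDocument.new({ API_LINK_SELECTOR => :link_node })
plain_page = StubDocument.new({})

p articles?(wordpress_page) # true
p articles?(plain_page)     # false
p articles?(nil)            # false
```

The leading `return false unless parsed_body` guard is what keeps a missing document from raising `NoMethodError` on `nil`.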
.options_key ⇒ Symbol
Returns the scraper config key.
    # File 'lib/html2rss/auto_source/scraper/wordpress_api.rb', line 25
    def self.options_key = :wordpress_api
Instance Method Details
#each {|article| ... } ⇒ Enumerator, void
Yields article hashes from the WordPress posts API.
    # File 'lib/html2rss/auto_source/scraper/wordpress_api.rb', line 55
    def each
      return enum_for(:each) unless block_given?
      return unless (posts = fetch_posts)

      posts.filter_map { article_from(_1) }.each { yield(_1) }
    end
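The `#each` listing above combines three common Ruby idioms: returning an `Enumerator` when no block is given, bailing out when nothing was fetched, and dropping `nil` results with `filter_map`. The sketch below reproduces that shape in a hypothetical `StubScraper` class. Its `fetch_posts` and `article_from` bodies are assumptions for illustration only, since the real private helpers are not shown on this page.

```ruby
# Hypothetical stand-in; the real scraper is not reproduced here.
class StubScraper
  include Enumerable

  def initialize(posts)
    @posts = posts
  end

  # Same control flow as the #each listing above.
  def each
    return enum_for(:each) unless block_given?
    return unless (posts = fetch_posts)

    posts.filter_map { article_from(_1) }.each { yield(_1) }
  end

  private

  # Assumed behavior: nil signals that no posts could be fetched.
  def fetch_posts
    @posts
  end

  # Assumed behavior: posts without a link are skipped; the nil return
  # value is dropped by filter_map.
  def article_from(post)
    return unless post[:link]

    { title: post[:title], url: post[:link] }
  end
end

scraper = StubScraper.new([
  { title: 'First', link: 'https://example.com/first' },
  { title: 'Draft' }
])

p scraper.each.class         # Enumerator
p scraper.map { _1[:title] } # ["First"]
p StubScraper.new(nil).to_a  # []
```

Because the class includes `Enumerable`, defining `each` alone is enough to make `map`, `to_a`, `first`, and the rest available to callers.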