Class: Html2rss::AutoSource::Scraper::LinkHeuristics::PathClassifier
- Inherits:
- Object
- Object
- Html2rss::AutoSource::Scraper::LinkHeuristics::PathClassifier
- Defined in:
- lib/html2rss/auto_source/scraper/link_heuristics.rb
Overview
Classifies normalized destination path segments for scoring.
Constant Summary
- SEGMENT_SETS =
Segment groups used to classify article, taxonomy, utility, and vanity routes.
{
  content: %w[
    article articles blog blogs changelog changelogs insight insights launch launches
    news post posts release releases story stories update updates
    artikel beitrag beitraege nachrichten neuigkeiten aktuelles
    articulo articulos noticia noticias entrada entradas publicacion publicaciones
    actualite actualites nouvelle nouvelles
    teaser teasers card cards
  ].to_set.freeze,
  utility: %w[
    about account archive archives author authors category categories comment comments
    contact feedback help login logout newsletter newsletters notification notifications
    preference preferences profile register search settings share signup subscribe
    tag tags topic topics feed feeds comment-feed comments-feed recommended for-you
    privacy terms cookie cookies join member members membership plus premium plans
    pricing user users
    kategorie kategorien schlagwort schlagworte thema themen autor autoren archiv
    ueber-uns ueber ueberuns profil kontakt impressum suche hilfe anmelden registrieren
    konto registrierung anmeldung abonnieren abo datenschutz nutzungsbedingungen agb
    categoria categorias etiqueta etiquetas tema temas autores archivos sobre-nosotros
    sobre quienes-somos buscar busqueda ayuda entrar ingresar registrarse registro
    cuenta suscribirse boletin privacidad condiciones
    categorie etiquette etiquettes sujet sujets theme themes auteur auteurs a-propos
    apropos recherche rechercher aide connexion s-inscrire sinscrire inscription compte
    s-abonner saboner lettre-information confidentialite mentions-legales cgu
    menu sidebar widget social modal popup banner promo ad ads related
    recommendation recommendations pagination pager
  ].to_set.freeze,
  high_confidence_junk: %w[
    about account archive archives author authors category categories comment comments
    contact cookie cookies feedback feed feeds help login logout notification
    notifications preference preferences privacy profile register search settings share
    signup subscribe tag tags terms topic topics comment-feed comments-feed user users
    kategorie kategorien schlagwort schlagworte thema themen autor autoren archiv
    ueber-uns ueber ueberuns profil kontakt impressum suche hilfe anmelden registrieren
    konto registrierung anmeldung abonnieren abo datenschutz nutzungsbedingungen agb
    categoria categorias etiqueta etiquetas tema temas autores archivos sobre-nosotros
    sobre quienes-somos buscar busqueda ayuda entrar ingresar registrarse registro
    cuenta suscribirse boletin privacidad condiciones
    categorie etiquette etiquettes sujet sujets theme themes auteur auteurs a-propos
    apropos recherche rechercher aide connexion s-inscrire sinscrire inscription compte
    s-abonner saboner lettre-information confidentialite mentions-legales cgu
    menu sidebar widget social modal popup banner promo ad ads related
    recommendation recommendations pagination pager
  ].to_set.freeze,
  taxonomy: %w[
    category categories tag tags topic topics
    kategorie kategorien schlagwort schlagworte thema themen
    categoria categorias etiqueta etiquetas tema temas
    categorie etiquette etiquettes sujet sujets theme themes
  ].to_set.freeze,
  vanity: %w[
    join membership plus premium pricing plans subscribe signup
    abonnieren abo suscribirse boletin s-abonner saboner
  ].to_set.freeze,
  deep_post_context: %w[
    press newsroom presse pressemitteilungen prensa
  ].to_set.freeze
}.freeze
- YEARISH_SEGMENT =
Path segment that begins with a year-like publishing marker.
/\A\d{4,}[\w-]*\z/
- POST_SLUG_SEGMENT =
Hyphenated slug shape common to article permalinks.
/\A[a-z0-9]+(?:-[a-z0-9]+){2,}\z/i
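The two patterns are easier to read against concrete segments. The sketch below copies both regular expressions from the constants above into local variables; the sample segments are hypothetical and not taken from the library's tests.

```ruby
# Patterns copied verbatim from YEARISH_SEGMENT and POST_SLUG_SEGMENT.
yearish_segment = /\A\d{4,}[\w-]*\z/
post_slug_segment = /\A[a-z0-9]+(?:-[a-z0-9]+){2,}\z/i

# Hypothetical sample path segments.
samples = %w[2024 2024-05 123 news hello-world my-first-post How-To-Ship]

samples.each do |segment|
  puts format('%-14s yearish=%-5s post_slug=%s',
              segment,
              yearish_segment.match?(segment),
              post_slug_segment.match?(segment))
end
```

Note that `\d{4,}` accepts four or more leading digits, so the pattern is year-like rather than a strict year check, and that a slug needs at least three hyphen-separated parts: `hello-world` does not qualify, while `my-first-post` does.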
Instance Attribute Summary
- #segments ⇒ Object (readonly)
  Returns the value of attribute segments.
Instance Method Summary
- #confidence_attributes ⇒ Hash
  High-confidence noise classification attributes.
- #content_path? ⇒ Boolean
  True when the route has article-like path evidence.
- #deep_utility_context_route? ⇒ Boolean
  True when the leading segments are all utility chrome.
- #destination_attributes ⇒ Hash
  Destination attributes consumed by DestinationFacts.
- #initialize(segments) ⇒ PathClassifier (constructor)
  A new instance of PathClassifier.
- #route_attributes ⇒ Hash
  Baseline path classification attributes.
- #shallow? ⇒ Boolean
  True when the route is too shallow to strongly indicate an article.
- #shallow_high_confidence_route? ⇒ Boolean
  True when the route is shallow and contains high-confidence noise.
- #strong_post_suffix? ⇒ Boolean
  True when the final path segment looks like a post slug.
- #taxonomy_path? ⇒ Boolean
  True when the route points at taxonomy/listing chrome.
- #utility_only_route? ⇒ Boolean
  True when every path segment is utility chrome.
- #utility_path? ⇒ Boolean
  True when the route includes utility/navigation evidence.
- #vanity_path? ⇒ Boolean
  True when the route points at conversion or account chrome.
Constructor Details
#initialize(segments) ⇒ PathClassifier
Returns a new instance of PathClassifier.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 205
def initialize(segments)
  @segments = segments
end
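The constructor takes segments that are already normalized; how the library produces them is not shown on this page. As a rough sketch only, assuming normalization means splitting the URL path on `/`, dropping empty parts, and lowercasing, a hypothetical helper (`path_segments` is not part of html2rss) could look like this:

```ruby
require 'uri'

# Hypothetical helper: the library's actual normalization is an assumption here.
def path_segments(url)
  URI.parse(url).path.split('/').reject(&:empty?).map(&:downcase)
end

segments = path_segments('https://example.com/Blog/2024/my-first-post/')
p segments # => ["blog", "2024", "my-first-post"]

# The resulting array is the kind of value the constructor expects:
#   Html2rss::AutoSource::Scraper::LinkHeuristics::PathClassifier.new(segments)
```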
Instance Attribute Details
#segments ⇒ Object (readonly)
Returns the value of attribute segments.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 130
def segments
  @segments
end
Instance Method Details
#confidence_attributes ⇒ Hash
Returns high-confidence noise classification attributes.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 228
def confidence_attributes
  ConfidenceClassifier.new(self).attributes
end
#content_path? ⇒ Boolean
Returns true when the route has article-like path evidence.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 233
def content_path?
  @content_path ||= SEGMENT_SETS.fetch(:content).intersect?(segments.to_set) ||
                    yearish_content_context?
end
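The set-overlap half of this check can be tried in isolation. The sketch below uses an abbreviated stand-in for `SEGMENT_SETS.fetch(:content)` and a hypothetical lambda, `content_overlap`; it leaves out `yearish_content_context?`, whose body is not shown on this page.

```ruby
require 'set'

# Abbreviated stand-in for SEGMENT_SETS.fetch(:content); the real set is larger.
content_segments = %w[article blog news post release story].to_set

# Hypothetical lambda mirroring only the Set#intersect? part of #content_path?.
content_overlap = ->(segments) { content_segments.intersect?(segments.to_set) }

p content_overlap.call(%w[blog my-first-post]) # => true
p content_overlap.call(%w[about team])         # => false
p content_overlap.call(%w[blogging tips])      # => false
```

The last call shows that membership is tested on whole segments: `blogging` does not count as `blog`.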
#deep_utility_context_route? ⇒ Boolean
Returns true when the leading segments are all utility chrome.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 284
def deep_utility_context_route?
  LeadingSegments.new(segments).all_junk?
end
#destination_attributes ⇒ Hash
Returns destination attributes consumed by DestinationFacts.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 210
def destination_attributes
  route_attributes.merge(confidence_attributes)
end
#route_attributes ⇒ Hash
Returns baseline path classification attributes.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 215
def route_attributes
  {
    segments:,
    content_path: content_path?,
    utility_path: utility_path?,
    taxonomy_path: taxonomy_path?,
    vanity_path: vanity_path?,
    shallow: shallow?,
    strong_post_suffix: strong_post_suffix?
  }
end
#shallow? ⇒ Boolean
Returns true when the route is too shallow to strongly indicate an article.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 254
def shallow?
  segment_count = segments.size
  junk_segments = SEGMENT_SETS.fetch(:high_confidence_junk)

  segment_count <= 1 || (segment_count == 2 && junk_segments.include?(segments.last))
end
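The shallowness rule has two branches: at most one segment, or exactly two segments ending in a junk segment. A standalone sketch of the same condition, using an abbreviated stand-in for the `:high_confidence_junk` set and a hypothetical lambda named `shallow`:

```ruby
require 'set'

# Abbreviated stand-in for SEGMENT_SETS.fetch(:high_confidence_junk).
junk_segments = %w[about tags category].to_set

# Hypothetical lambda with the same condition as #shallow?.
shallow = lambda do |segments|
  count = segments.size
  count <= 1 || (count == 2 && junk_segments.include?(segments.last))
end

p shallow.call([])                      # => true  (site root)
p shallow.call(%w[blog])                # => true  (single segment)
p shallow.call(%w[blog tags])           # => true  (two segments, junk last)
p shallow.call(%w[blog my-first-post])  # => false (two segments, non-junk last)
p shallow.call(%w[blog 2024 tags])      # => false (three segments)
```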
#shallow_high_confidence_route? ⇒ Boolean
Returns true when the route is shallow and contains high-confidence noise.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 274
def shallow_high_confidence_route?
  junk_segments = SEGMENT_SETS.fetch(:high_confidence_junk)
  vanity_segments = SEGMENT_SETS.fetch(:vanity)

  shallow? && segments.any? do |segment|
    junk_segments.include?(segment) || vanity_segments.include?(segment)
  end
end
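This predicate combines the shallowness rule with a membership test against two sets. The following sketch reproduces that combination with abbreviated stand-in sets and hypothetical lambdas (`shallow`, `shallow_high_confidence`); it illustrates the logic and is not the library's code.

```ruby
require 'set'

# Abbreviated stand-ins for the :high_confidence_junk and :vanity sets.
junk_segments = %w[about tags category].to_set
vanity_segments = %w[pricing premium].to_set

# Same condition as #shallow?.
shallow = lambda do |segments|
  segments.size <= 1 || (segments.size == 2 && junk_segments.include?(segments.last))
end

# Shallow AND at least one junk or vanity segment.
shallow_high_confidence = lambda do |segments|
  shallow.call(segments) && segments.any? do |segment|
    junk_segments.include?(segment) || vanity_segments.include?(segment)
  end
end

p shallow_high_confidence.call(%w[pricing])        # => true
p shallow_high_confidence.call(%w[blog])           # => false
p shallow_high_confidence.call(%w[blog tags])      # => true
p shallow_high_confidence.call(%w[blog 2024 tags]) # => false
```

A single non-junk segment such as `blog` is shallow but carries no noise evidence, so it is not flagged; a three-segment route is never flagged because it is not shallow.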
#strong_post_suffix? ⇒ Boolean
Returns true when the final path segment looks like a post slug.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 262
def strong_post_suffix?
  PostSuffixClassifier.new(segments).strong?
end
#taxonomy_path? ⇒ Boolean
Returns true when the route points at taxonomy/listing chrome.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 249
def taxonomy_path?
  @taxonomy_path ||= SEGMENT_SETS.fetch(:taxonomy).intersect?(segments.to_set)
end
#utility_only_route? ⇒ Boolean
Returns true when every path segment is utility chrome.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 267
def utility_only_route?
  junk_segments = SEGMENT_SETS.fetch(:high_confidence_junk)

  segments.all? { |segment| junk_segments.include?(segment) }
end
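Because the check is built on `Array#all?`, an empty segment list satisfies it vacuously. The sketch below shows this with an abbreviated stand-in for the `:high_confidence_junk` set and a hypothetical lambda named `utility_only`.

```ruby
require 'set'

# Abbreviated stand-in for SEGMENT_SETS.fetch(:high_confidence_junk).
junk_segments = %w[about contact tags].to_set

# Hypothetical lambda with the same condition as #utility_only_route?.
utility_only = ->(segments) { segments.all? { |segment| junk_segments.include?(segment) } }

p utility_only.call(%w[about contact])       # => true
p utility_only.call(%w[about my-first-post]) # => false
p utility_only.call([])                      # => true (all? on an empty array)
```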
#utility_path? ⇒ Boolean
Returns true when the route includes utility/navigation evidence.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 239
def utility_path?
  @utility_path ||= SEGMENT_SETS.fetch(:utility).intersect?(segments.to_set)
end
#vanity_path? ⇒ Boolean
Returns true when the route points at conversion or account chrome.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 244
def vanity_path?
  @vanity_path ||= SEGMENT_SETS.fetch(:vanity).intersect?(segments.to_set)
end
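The `||=` memoization used by the path predicates on this page only caches truthy results: when the expression evaluates to `false`, Ruby evaluates it again on the next call. The result is still correct, only recomputed. A minimal illustration with a hypothetical class (`MemoDemo` is not part of html2rss):

```ruby
# Hypothetical class demonstrating ||= with a false result.
class MemoDemo
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def flag?
    @flag ||= compute
  end

  private

  # Counts invocations and always returns false.
  def compute
    @calls += 1
    false
  end
end

demo = MemoDemo.new
2.times { demo.flag? }
p demo.calls # => 2
```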