Class: Html2rss::AutoSource::Scraper::LinkHeuristics::PathClassifier
- Inherits:
- Object
- Defined in:
- lib/html2rss/auto_source/scraper/link_heuristics.rb
Overview
Classifies normalized destination path segments for scoring.
Constant Summary
- SEGMENT_SETS =
Segment groups used to classify article, taxonomy, utility, and vanity routes.
{
  content: %w[
    article articles blog blogs changelog changelogs insight insights launch launches
    news post posts release releases story stories update updates
    artikel beitrag beitraege nachrichten neuigkeiten aktuelles
    articulo articulos noticia noticias entrada entradas publicacion publicaciones
    actualite actualites nouvelle nouvelles
    teaser teasers card cards
  ].to_set.freeze,
  utility: %w[
    about account archive archives author authors category categories comment comments
    contact feedback help login logout newsletter newsletters notification notifications
    preference preferences profile register search settings share signup subscribe
    tag tags topic topics feed feeds comment-feed comments-feed recommended for-you
    privacy terms cookie cookies join member members membership plus premium plans pricing
    user users
    kategorie kategorien schlagwort schlagworte thema themen autor autoren archiv
    ueber-uns ueber ueberuns profil kontakt impressum suche hilfe anmelden registrieren
    konto registrierung anmeldung abonnieren abo datenschutz nutzungsbedingungen agb
    categoria categorias etiqueta etiquetas tema temas autores archivos
    sobre-nosotros sobre quienes-somos buscar busqueda ayuda entrar ingresar
    registrarse registro cuenta suscribirse boletin privacidad condiciones
    categorie etiquette etiquettes sujet sujets theme themes auteur auteurs
    a-propos apropos recherche rechercher aide connexion s-inscrire sinscrire
    inscription compte s-abonner saboner lettre-information confidentialite
    mentions-legales cgu
    menu sidebar widget social modal popup banner promo ad ads related
    recommendation recommendations pagination pager
  ].to_set.freeze,
  high_confidence_junk: %w[
    about account archive archives author authors category categories comment comments
    contact cookie cookies feedback feed feeds help login logout
    notification notifications preference preferences privacy profile register search
    settings share signup subscribe tag tags terms topic topics
    comment-feed comments-feed user users
    kategorie kategorien schlagwort schlagworte thema themen autor autoren archiv
    ueber-uns ueber ueberuns profil kontakt impressum suche hilfe anmelden registrieren
    konto registrierung anmeldung abonnieren abo datenschutz nutzungsbedingungen agb
    categoria categorias etiqueta etiquetas tema temas autores archivos
    sobre-nosotros sobre quienes-somos buscar busqueda ayuda entrar ingresar
    registrarse registro cuenta suscribirse boletin privacidad condiciones
    categorie etiquette etiquettes sujet sujets theme themes auteur auteurs
    a-propos apropos recherche rechercher aide connexion s-inscrire sinscrire
    inscription compte s-abonner saboner lettre-information confidentialite
    mentions-legales cgu
    menu sidebar widget social modal popup banner promo ad ads related
    recommendation recommendations pagination pager
  ].to_set.freeze,
  taxonomy: %w[
    category categories tag tags topic topics
    kategorie kategorien schlagwort schlagworte thema themen
    categoria categorias etiqueta etiquetas tema temas
    categorie etiquette etiquettes sujet sujets theme themes
  ].to_set.freeze,
  vanity: %w[
    join membership plus premium pricing plans subscribe signup
    abonnieren abo suscribirse boletin s-abonner saboner
  ].to_set.freeze,
  deep_post_context: %w[
    press newsroom presse pressemitteilungen prensa
  ].to_set.freeze
}.freeze
- YEARISH_SEGMENT =
Path segment that begins with a year-like publishing marker.
/\A\d{4,}[\w-]*\z/
- POST_SLUG_SEGMENT =
Hyphenated slug shape common to article permalinks.
/\A[a-z0-9]+(?:-[a-z0-9]+){2,}\z/i
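The two patterns can be tried in isolation. The sketch below copies them into local variables so it runs without loading html2rss. The sample segments are hypothetical and chosen only to show the edges of each pattern.

```ruby
# Local copies of the two patterns listed above, so the snippet runs on its own.
yearish_segment = /\A\d{4,}[\w-]*\z/
post_slug_segment = /\A[a-z0-9]+(?:-[a-z0-9]+){2,}\z/i

# Hypothetical sample segments, chosen only for illustration.
samples = %w[2024 2024-05-01 20240501-launch 202 my-first-post Release-Notes-V2 two-words news]

# Map each segment to the result of both pattern checks.
results = samples.to_h { |segment|
  [segment, { yearish: yearish_segment.match?(segment),
              post_slug: post_slug_segment.match?(segment) }]
}

results.each { |segment, flags| puts format('%-18s %p', segment, flags) }
```

A date-shaped segment such as `2024-05-01` satisfies both patterns, while `two-words` fails the slug pattern because the slug shape requires at least two hyphenated continuations after the first word.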
Instance Attribute Summary
-
#segments ⇒ Object
readonly
Instance Method Summary
-
#content_path? ⇒ Boolean
True when the route has article-like path evidence.
-
#deep_utility_context_route? ⇒ Boolean
True when the leading segments are all utility chrome.
-
#initialize(segments) ⇒ PathClassifier
constructor
A new instance of PathClassifier.
-
#junk_path? ⇒ Boolean
True when the route is taxonomy, utility-only, deep utility context, or shallow high-confidence noise, and is not an excluded content route.
-
#shallow? ⇒ Boolean
True when the route is too shallow to strongly indicate an article.
-
#shallow_high_confidence_route? ⇒ Boolean
True when the route is shallow and contains high-confidence noise.
-
#strong_post_suffix? ⇒ Boolean
True when the final path segment looks like a post slug.
-
#taxonomy_path? ⇒ Boolean
True when the route points at taxonomy/listing chrome.
-
#utility_destination? ⇒ Boolean
True when the route points at conversion or account chrome.
-
#utility_only_route? ⇒ Boolean
True when every path segment is utility chrome.
-
#utility_path? ⇒ Boolean
True when the route includes utility/navigation evidence.
-
#vanity_path? ⇒ Boolean
True when the route points at conversion or account chrome.
Constructor Details
#initialize(segments) ⇒ PathClassifier
Returns a new instance of PathClassifier.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 213
def initialize(segments)
  @segments = segments
end
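The constructor expects path segments that are already normalized. This page does not show how the library normalizes them. As a rough sketch only, a caller could derive comparable segments from a URL as below. `path_segments` is a hypothetical helper, not part of html2rss, and lowercasing is an assumption based on the lowercase entries in SEGMENT_SETS.

```ruby
require 'uri'

# Hypothetical helper (not part of html2rss): splits a URL's path into
# lowercase, non-empty segments. The library's real normalization may differ.
def path_segments(url)
  URI.parse(url).path.split('/').reject(&:empty?).map(&:downcase)
end

segments = path_segments('https://example.com/Blog/2024/my-first-post/')
root_segments = path_segments('https://example.com/')

puts segments.inspect
puts root_segments.inspect
```

With html2rss loaded, an array of this shape is what would be handed to `PathClassifier.new(segments)`.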
Instance Attribute Details
#segments ⇒ Object (readonly)
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 138
def segments
  @segments
end
Instance Method Details
#content_path? ⇒ Boolean
Returns true when the route has article-like path evidence.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 218
def content_path?
  @content_path ||= segments.any? { |s| SEGMENT_SETS[:content].include?(s) } ||
                    yearish_content_context?
end
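Several predicates on this page cache their result with `||=`. In Ruby, `||=` reassigns whenever the current value is `nil` or `false`, so only truthy results are actually cached and a `false` result is recomputed on each call. The return value is unaffected. The sketch below demonstrates the language behavior with counters. The variable names are illustrative and unrelated to the library.

```ruby
# Count how often the right-hand side runs when the computed value is false.
false_calls = 0
false_cache = nil
compute_false = lambda do
  false_calls += 1
  false
end
3.times { false_cache ||= compute_false.call }

# Same experiment with a truthy value: the right-hand side runs only once.
true_calls = 0
true_cache = nil
compute_true = lambda do
  true_calls += 1
  true
end
3.times { true_cache ||= compute_true.call }

puts "false result computed #{false_calls} times"
puts "true result computed #{true_calls} time(s)"
```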
#deep_utility_context_route? ⇒ Boolean
Returns true when the leading segments are all utility chrome.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 271
def deep_utility_context_route?
  all_junk?(segments.size - 1)
end
#junk_path? ⇒ Boolean
Returns true when the route is taxonomy, utility-only, deep utility context, or shallow high-confidence noise, and is not an excluded content route.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 276
def junk_path?
  return false if excluded_content_route?

  taxonomy_path? ||
    utility_only_route? ||
    deep_utility_context_route? ||
    shallow_high_confidence_route?
end
#shallow? ⇒ Boolean
Returns true when the route is too shallow to strongly indicate an article.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 239
def shallow?
  segment_count = segments.size
  junk_segments = SEGMENT_SETS.fetch(:high_confidence_junk)

  segment_count <= 1 || (segment_count == 2 && junk_segments.include?(segments.last))
end
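The shallow rule has two cases: at most one segment, or exactly two segments where the last one is a high-confidence junk word. The standalone sketch below restates that rule with a small illustrative subset of the junk set. The lambda and the sample routes are assumptions for demonstration, not library code.

```ruby
require 'set'

# Small illustrative subset of SEGMENT_SETS[:high_confidence_junk]; the real set is larger.
junk = Set.new(%w[about login tag tags search])

# Standalone restatement of the #shallow? rule shown above.
shallow = lambda do |segments|
  segments.size <= 1 || (segments.size == 2 && junk.include?(segments.last))
end

# Hypothetical routes paired with the result of the rule.
routes = [[], %w[blog], %w[blog tags], %w[blog my-first-post], %w[tags ruby], %w[blog 2024 tags]]
outcomes = routes.to_h { |segments| [segments, shallow.call(segments)] }

outcomes.each { |segments, result| puts "#{segments.inspect} shallow? #{result}" }
```

Only the last segment is checked in the two-segment case, so `tags/ruby` is not shallow under this rule even though its first segment is a junk word.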
#shallow_high_confidence_route? ⇒ Boolean
Returns true when the route is shallow and contains high-confidence noise.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 261
def shallow_high_confidence_route?
  junk_segments = SEGMENT_SETS.fetch(:high_confidence_junk)
  vanity_segments = SEGMENT_SETS.fetch(:vanity)

  shallow? && segments.any? do |segment|
    junk_segments.include?(segment) || vanity_segments.include?(segment)
  end
end
#strong_post_suffix? ⇒ Boolean
Returns true when the final path segment looks like a post slug.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 247
def strong_post_suffix?
  @strong_post_suffix ||= segments.any? &&
                          included_last_segment? &&
                          trusted_post_context?(segments.size - 1)
end
#taxonomy_path? ⇒ Boolean
Returns true when the route points at taxonomy/listing chrome.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 234
def taxonomy_path?
  @taxonomy_path ||= segments.any? { |s| SEGMENT_SETS[:taxonomy].include?(s) }
end
#utility_destination? ⇒ Boolean
Returns true when the route points at conversion or account chrome.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 286
def utility_destination?
  return false if excluded_content_route?

  vanity_path? || utility_route?
end
#utility_only_route? ⇒ Boolean
Returns true when every path segment is utility chrome.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 254
def utility_only_route?
  junk_segments = SEGMENT_SETS.fetch(:high_confidence_junk)

  segments.all? { |segment| junk_segments.include?(segment) }
end
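Because the check uses `all?`, an empty segment list satisfies it vacuously. This is standard Ruby behavior for `Array#all?`. Whether an empty path ever reaches the classifier is not stated on this page. The sketch below restates the check with an illustrative subset of the junk set. The lambda name is hypothetical.

```ruby
require 'set'

# Illustrative subset of SEGMENT_SETS[:high_confidence_junk]; the real set is larger.
junk = Set.new(%w[about login search tag tags])

# Standalone restatement of the #utility_only_route? check shown above.
utility_only = ->(segments) { segments.all? { |segment| junk.include?(segment) } }

login_only = utility_only.call(%w[login])       # every segment is junk
mixed      = utility_only.call(%w[blog about])  # "blog" is not in the junk subset
empty      = utility_only.call([])              # vacuously true for an empty list

puts "login only: #{login_only}, mixed: #{mixed}, empty: #{empty}"
```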
#utility_path? ⇒ Boolean
Returns true when the route includes utility/navigation evidence.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 224
def utility_path?
  @utility_path ||= segments.any? { |s| SEGMENT_SETS[:utility].include?(s) }
end
#vanity_path? ⇒ Boolean
Returns true when the route points at conversion or account chrome.
# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 229
def vanity_path?
  @vanity_path ||= segments.any? { |s| SEGMENT_SETS[:vanity].include?(s) }
end