Class: Html2rss::AutoSource::Scraper::LinkHeuristics::PathClassifier

Inherits:
Object
  • Object
Defined in:
lib/html2rss/auto_source/scraper/link_heuristics.rb

Overview

Classifies normalized destination path segments for scoring.

Constant Summary

SEGMENT_SETS =

Segment groups used to classify article, taxonomy, utility, and vanity routes.

{
  content: %w[
    article articles blog blogs changelog changelogs insight insights
    launch launches news post posts release releases story stories update updates
    artikel beitrag beitraege nachrichten neuigkeiten aktuelles
    articulo articulos noticia noticias entrada entradas publicacion publicaciones
    actualite actualites nouvelle nouvelles
    teaser teasers card cards
  ].to_set.freeze,
  utility: %w[
    about account archive archives author authors category categories comment comments
    contact feedback help login logout newsletter newsletters notification notifications
    preference preferences profile register search settings share signup subscribe
    tag tags topic topics
    feed feeds comment-feed comments-feed
    recommended
    for-you
    privacy terms cookie cookies
    join member members membership plus premium plans pricing user users
    kategorie kategorien schlagwort schlagworte thema themen autor autoren archiv
    ueber-uns ueber ueberuns profil kontakt impressum suche hilfe anmelden registrieren
    konto registrierung anmeldung abonnieren abo datenschutz nutzungsbedingungen agb
    categoria categorias etiqueta etiquetas tema temas autores archivos
    sobre-nosotros sobre quienes-somos buscar busqueda ayuda entrar ingresar
    registrarse registro cuenta suscribirse boletin privacidad condiciones
    categorie etiquette etiquettes sujet sujets theme themes auteur auteurs
    a-propos apropos recherche rechercher aide connexion s-inscrire
    sinscrire inscription compte s-abonner saboner lettre-information confidentialite mentions-legales cgu
    menu sidebar widget social modal popup banner promo ad ads
    related recommendation recommendations pagination pager
  ].to_set.freeze,
  high_confidence_junk: %w[
    about account archive archives author authors category categories comment comments
    contact cookie cookies feedback feed feeds help login logout notification notifications
    preference preferences privacy profile register search settings share signup subscribe
    tag tags terms topic topics comment-feed comments-feed user users
    kategorie kategorien schlagwort schlagworte thema themen autor autoren archiv
    ueber-uns ueber ueberuns profil kontakt impressum suche hilfe anmelden registrieren
    konto registrierung anmeldung abonnieren abo datenschutz nutzungsbedingungen agb
    categoria categorias etiqueta etiquetas tema temas autores archivos
    sobre-nosotros sobre quienes-somos buscar busqueda ayuda entrar ingresar
    registrarse registro cuenta suscribirse boletin privacidad condiciones
    categorie etiquette etiquettes sujet sujets theme themes auteur auteurs
    a-propos apropos recherche rechercher aide connexion s-inscrire
    sinscrire inscription compte s-abonner saboner lettre-information confidentialite mentions-legales cgu
    menu sidebar widget social modal popup banner promo ad ads
    related recommendation recommendations pagination pager
  ].to_set.freeze,
  taxonomy: %w[
    category categories tag tags topic topics
    kategorie kategorien schlagwort schlagworte thema themen
    categoria categorias etiqueta etiquetas tema temas
    categorie etiquette etiquettes sujet sujets theme themes
  ].to_set.freeze,
  vanity: %w[
    join membership plus premium pricing plans subscribe signup
    abonnieren abo
    suscribirse boletin
    s-abonner saboner
  ].to_set.freeze,
  deep_post_context: %w[
    press newsroom
    presse pressemitteilungen
    prensa
  ].to_set.freeze
}.freeze
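
To illustrate how such segment groups can be queried, the sketch below uses a heavily trimmed copy of the table. `DEMO_SETS` and `groups_for` are hypothetical names introduced for this example and are not part of html2rss.

```ruby
require "set"

# Trimmed, illustrative copy of a few groups; NOT the full library table.
DEMO_SETS = {
  content: %w[article blog news post].to_set.freeze,
  taxonomy: %w[category tag topic].to_set.freeze,
  vanity: %w[pricing premium subscribe].to_set.freeze
}.freeze

# Hypothetical helper: names of every demo group that contains the segment.
def groups_for(segment)
  DEMO_SETS.select { |_name, set| set.include?(segment) }.keys
end

p groups_for("blog")    # => [:content]
p groups_for("tag")     # => [:taxonomy]
p groups_for("my-post") # => []
```

In the full table a segment can belong to more than one group; `subscribe`, for example, is listed under `utility`, `high_confidence_junk`, and `vanity`.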
YEARISH_SEGMENT =

Path segment that begins with a year-like publishing marker (a run of four or more digits).

/\A\d{4,}[\w-]*\z/
POST_SLUG_SEGMENT =

Hyphenated slug shape common to article permalinks: at least three hyphen-separated alphanumeric tokens, matched case-insensitively.

/\A[a-z0-9]+(?:-[a-z0-9]+){2,}\z/i
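
As a quick check of what the two patterns accept, the following sketch copies them verbatim and tests a few sample segments. The sample strings are made up for illustration.

```ruby
# Patterns copied from the constants above.
YEARISH_SEGMENT = /\A\d{4,}[\w-]*\z/
POST_SLUG_SEGMENT = /\A[a-z0-9]+(?:-[a-z0-9]+){2,}\z/i

# Starts with at least four digits, then optional word characters or hyphens.
p YEARISH_SEGMENT.match?("2024")            # => true
p YEARISH_SEGMENT.match?("2024-05-release") # => true
p YEARISH_SEGMENT.match?("v2024")           # => false

# Needs at least three hyphen-separated alphanumeric tokens.
p POST_SLUG_SEGMENT.match?("how-to-build-feeds") # => true
p POST_SLUG_SEGMENT.match?("Release-Notes-2024") # => true
p POST_SLUG_SEGMENT.match?("release-notes")      # => false
```

Because `\d{4,}` accepts any run of four or more digits, purely numeric identifiers such as `12345` also match `YEARISH_SEGMENT`.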

Constructor Details

#initialize(segments) ⇒ PathClassifier

Returns a new instance of PathClassifier.

Parameters:

  • segments (Array<String>)

    normalized URL path segments



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 213

def initialize(segments)
  @segments = segments
end
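
The page does not show how callers normalize a URL into segments. The sketch below is one plausible approach, assuming normalization means lowercasing the path and dropping empty segments; `demo_segments` is a hypothetical helper, not a library method.

```ruby
require "uri"

# Hypothetical helper: an assumed normalization, not the library's own.
def demo_segments(url)
  URI.parse(url).path.to_s.downcase.split("/").reject(&:empty?)
end

p demo_segments("https://example.com/Blog/2024/My-First-Post/")
# => ["blog", "2024", "my-first-post"]
```

An array like this is the kind of value the `segments` parameter expects, for example `PathClassifier.new(%w[blog 2024 my-first-post])`.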

Instance Attribute Details

#segments ⇒ Object (readonly)

Returns the normalized URL path segments passed to the constructor.



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 138

def segments
  @segments
end

Instance Method Details

#content_path? ⇒ Boolean

Returns true when the route has article-like path evidence.

Returns:

  • (Boolean)

    true when the route has article-like path evidence



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 218

def content_path?
  @content_path ||= segments.any? { |s| SEGMENT_SETS[:content].include?(s) } ||
                    yearish_content_context?
end
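
One general Ruby detail about the `||=` idiom used here: it caches only truthy results, so a predicate that evaluates to `false` is recomputed on every call. The stand-in class below (`DemoMemo`, hypothetical and unrelated to html2rss) counts evaluations to show the effect.

```ruby
# Hypothetical stand-in that counts how often the memoized value is computed.
class DemoMemo
  attr_reader :calls

  def initialize(result)
    @result = result
    @calls = 0
  end

  # Same memoization shape as the predicate above.
  def flag?
    @flag ||= compute
  end

  private

  def compute
    @calls += 1
    @result
  end
end

truthy = DemoMemo.new(true)
2.times { truthy.flag? }
p truthy.calls # => 1

falsy = DemoMemo.new(false)
2.times { falsy.flag? }
p falsy.calls # => 2
```

The returned value is still correct in both cases; only the amount of repeated work differs.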

#deep_utility_context_route? ⇒ Boolean

Returns true when the leading segments are all utility chrome.

Returns:

  • (Boolean)

    true when the leading segments are all utility chrome



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 271

def deep_utility_context_route?
  all_junk?(segments.size - 1)
end

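#junk_path? ⇒ Boolean

Returns true when the route is junk: a taxonomy path, a utility-only route, a deep utility-context route, or a shallow high-confidence route. Always false for excluded content routes.

Returns:

  • (Boolean)

    true when the route is classified as junk and is not an excluded content route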



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 276

def junk_path?
  return false if excluded_content_route?

  taxonomy_path? ||
    utility_only_route? ||
    deep_utility_context_route? ||
    shallow_high_confidence_route?
end

#shallow? ⇒ Boolean

Returns true when the route is too shallow to strongly indicate an article.

Returns:

  • (Boolean)

    true when the route is too shallow to strongly indicate an article



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 239

def shallow?
  segment_count = segments.size
  junk_segments = SEGMENT_SETS.fetch(:high_confidence_junk)

  segment_count <= 1 || (segment_count == 2 && junk_segments.include?(segments.last))
end
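
The depth rule can be tried in isolation with the sketch below. `DEMO_JUNK` is a trimmed stand-in for `SEGMENT_SETS[:high_confidence_junk]` and `demo_shallow?` is a hypothetical free-standing copy of the logic, written only for illustration.

```ruby
require "set"

# Trimmed stand-in for SEGMENT_SETS[:high_confidence_junk].
DEMO_JUNK = %w[about login search tag].to_set.freeze

# Mirrors the shallow? logic above for a plain array of segments.
def demo_shallow?(segments)
  count = segments.size
  count <= 1 || (count == 2 && DEMO_JUNK.include?(segments.last))
end

p demo_shallow?([])                          # => true
p demo_shallow?(%w[blog])                    # => true
p demo_shallow?(%w[blog search])             # => true
p demo_shallow?(%w[blog my-first-post])      # => false
p demo_shallow?(%w[blog 2024 my-first-post]) # => false
```

A two-segment route counts as shallow only when its last segment is a junk term; routes with three or more segments are never shallow.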

#shallow_high_confidence_route? ⇒ Boolean

Returns true when the route is shallow and contains high-confidence noise.

Returns:

  • (Boolean)

    true when the route is shallow and contains high-confidence noise



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 261

def shallow_high_confidence_route?
  junk_segments = SEGMENT_SETS.fetch(:high_confidence_junk)
  vanity_segments = SEGMENT_SETS.fetch(:vanity)

  shallow? && segments.any? do |segment|
    junk_segments.include?(segment) || vanity_segments.include?(segment)
  end
end

#strong_post_suffix? ⇒ Boolean

Returns true when the final path segment looks like a post slug.

Returns:

  • (Boolean)

    true when the final path segment looks like a post slug



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 247

def strong_post_suffix?
  @strong_post_suffix ||= segments.any? &&
                          included_last_segment? &&
                          trusted_post_context?(segments.size - 1)
end

#taxonomy_path? ⇒ Boolean

Returns true when the route points at taxonomy/listing chrome.

Returns:

  • (Boolean)

    true when the route points at taxonomy/listing chrome



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 234

def taxonomy_path?
  @taxonomy_path ||= segments.any? { |s| SEGMENT_SETS[:taxonomy].include?(s) }
end

#utility_destination? ⇒ Boolean

Returns true when the route points at conversion or account chrome.

Returns:

  • (Boolean)

    true when the route points at conversion or account chrome



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 286

def utility_destination?
  return false if excluded_content_route?

  vanity_path? || utility_route?
end

#utility_only_route? ⇒ Boolean

Returns true when every path segment is utility chrome.

Returns:

  • (Boolean)

    true when every path segment is utility chrome



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 254

def utility_only_route?
  junk_segments = SEGMENT_SETS.fetch(:high_confidence_junk)

  segments.all? { |segment| junk_segments.include?(segment) }
end
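
A small sketch of the `all?` check, using a trimmed stand-in set (`DEMO_JUNK_SEGMENTS`) and a hypothetical free-standing method (`demo_utility_only?`). It also shows that `all?` on an empty array is vacuously true; whether the library guards against empty paths elsewhere is not shown on this page.

```ruby
require "set"

# Trimmed stand-in for SEGMENT_SETS[:high_confidence_junk].
DEMO_JUNK_SEGMENTS = %w[about contact login].to_set.freeze

# Mirrors the utility_only_route? logic above for a plain array of segments.
def demo_utility_only?(segments)
  segments.all? { |segment| DEMO_JUNK_SEGMENTS.include?(segment) }
end

p demo_utility_only?(%w[about contact]) # => true
p demo_utility_only?(%w[about team])    # => false
p demo_utility_only?([])                # => true
```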

#utility_path? ⇒ Boolean

Returns true when the route includes utility/navigation evidence.

Returns:

  • (Boolean)

    true when the route includes utility/navigation evidence



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 224

def utility_path?
  @utility_path ||= segments.any? { |s| SEGMENT_SETS[:utility].include?(s) }
end

#vanity_path? ⇒ Boolean

Returns true when any path segment is a vanity term such as membership, pricing, or subscribe.

Returns:

  • (Boolean)

    true when any path segment is a vanity term



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 229

def vanity_path?
  @vanity_path ||= segments.any? { |s| SEGMENT_SETS[:vanity].include?(s) }
end