Class: Html2rss::AutoSource::Scraper::LinkHeuristics::PathClassifier

Inherits:
Object
Defined in:
lib/html2rss/auto_source/scraper/link_heuristics.rb

Overview

Classifies normalized destination path segments for scoring.

Constant Summary

SEGMENT_SETS =

Segment groups used to classify article, taxonomy, utility, and vanity routes.

{
  content: %w[
    article articles blog blogs changelog changelogs insight insights
    launch launches news post posts release releases story stories update updates
    artikel beitrag beitraege nachrichten neuigkeiten aktuelles
    articulo articulos noticia noticias entrada entradas publicacion publicaciones
    actualite actualites nouvelle nouvelles
    teaser teasers card cards
  ].to_set.freeze,
  utility: %w[
    about account archive archives author authors category categories comment comments
    contact feedback help login logout newsletter newsletters notification notifications
    preference preferences profile register search settings share signup subscribe
    tag tags topic topics
    feed feeds comment-feed comments-feed
    recommended
    for-you
    privacy terms cookie cookies
    join member members membership plus premium plans pricing user users
    kategorie kategorien schlagwort schlagworte thema themen autor autoren archiv
    ueber-uns ueber ueberuns profil kontakt impressum suche hilfe anmelden registrieren
    konto registrierung anmeldung abonnieren abo datenschutz nutzungsbedingungen agb
    categoria categorias etiqueta etiquetas tema temas autores archivos
    sobre-nosotros sobre quienes-somos buscar busqueda ayuda entrar ingresar
    registrarse registro cuenta suscribirse boletin privacidad condiciones
    categorie etiquette etiquettes sujet sujets theme themes auteur auteurs
    a-propos apropos recherche rechercher aide connexion s-inscrire
    sinscrire inscription compte s-abonner saboner lettre-information confidentialite mentions-legales cgu
    menu sidebar widget social modal popup banner promo ad ads
    related recommendation recommendations pagination pager
  ].to_set.freeze,
  high_confidence_junk: %w[
    about account archive archives author authors category categories comment comments
    contact cookie cookies feedback feed feeds help login logout notification notifications
    preference preferences privacy profile register search settings share signup subscribe
    tag tags terms topic topics comment-feed comments-feed user users
    kategorie kategorien schlagwort schlagworte thema themen autor autoren archiv
    ueber-uns ueber ueberuns profil kontakt impressum suche hilfe anmelden registrieren
    konto registrierung anmeldung abonnieren abo datenschutz nutzungsbedingungen agb
    categoria categorias etiqueta etiquetas tema temas autores archivos
    sobre-nosotros sobre quienes-somos buscar busqueda ayuda entrar ingresar
    registrarse registro cuenta suscribirse boletin privacidad condiciones
    categorie etiquette etiquettes sujet sujets theme themes auteur auteurs
    a-propos apropos recherche rechercher aide connexion s-inscrire
    sinscrire inscription compte s-abonner saboner lettre-information confidentialite mentions-legales cgu
    menu sidebar widget social modal popup banner promo ad ads
    related recommendation recommendations pagination pager
  ].to_set.freeze,
  taxonomy: %w[
    category categories tag tags topic topics
    kategorie kategorien schlagwort schlagworte thema themen
    categoria categorias etiqueta etiquetas tema temas
    categorie etiquette etiquettes sujet sujets theme themes
  ].to_set.freeze,
  vanity: %w[
    join membership plus premium pricing plans subscribe signup
    abonnieren abo
    suscribirse boletin
    s-abonner saboner
  ].to_set.freeze,
  deep_post_context: %w[
    press newsroom
    presse pressemitteilungen
    prensa
  ].to_set.freeze
}.freeze
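
The groups are plain sets, so classifying a path reduces to set intersection. A minimal sketch, assuming abbreviated copies of two groups; `CONTENT_SEGMENTS`, `TAXONOMY_SEGMENTS`, and `matching_groups` are illustrative names and not part of the library:

```ruby
require 'set'

# Abbreviated copies of two groups from SEGMENT_SETS, for illustration only.
CONTENT_SEGMENTS  = %w[article articles blog news post posts].to_set.freeze
TAXONOMY_SEGMENTS = %w[category categories tag tags topic topics].to_set.freeze

# Hypothetical helper: names every group that shares a segment with the path.
def matching_groups(segments)
  groups = { content: CONTENT_SEGMENTS, taxonomy: TAXONOMY_SEGMENTS }
  groups.select { |_name, set| set.intersect?(segments.to_set) }.keys
end

matching_groups(%w[blog tags ruby]) # => [:content, :taxonomy]
matching_groups(%w[pricing])        # => []
```

A single path can fall into several groups at once, which is why the class exposes one predicate per group rather than a single label.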
YEARISH_SEGMENT =

Path segment that begins with a year-like publishing marker.

/\A\d{4,}[\w-]*\z/
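
To make the pattern concrete, the sketch below runs it against a few sample segments. The sample strings are made up for illustration:

```ruby
# Same pattern as YEARISH_SEGMENT above, redefined so the snippet runs on its own.
YEARISH_SEGMENT = /\A\d{4,}[\w-]*\z/

samples = %w[2024 2024-05 20240517-launch 123456 123 v2024]
yearish = samples.to_h { |segment| [segment, YEARISH_SEGMENT.match?(segment)] }

yearish['2024']            # => true
yearish['2024-05']         # => true
yearish['20240517-launch'] # => true
yearish['123456']          # => true  (any run of four or more leading digits)
yearish['123']             # => false (fewer than four digits)
yearish['v2024']           # => false (must start with digits)
```

Because the pattern only requires four or more leading digits, long numeric IDs match as well as calendar years.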
POST_SLUG_SEGMENT =

Hyphenated slug shape common to article permalinks.

/\A[a-z0-9]+(?:-[a-z0-9]+){2,}\z/i
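
The `{2,}` quantifier means a segment needs at least three hyphen-separated tokens to count as a slug. A short sketch with made-up sample segments:

```ruby
# Same pattern as POST_SLUG_SEGMENT above, redefined so the snippet runs on its own.
POST_SLUG_SEGMENT = /\A[a-z0-9]+(?:-[a-z0-9]+){2,}\z/i

samples = %w[my-first-post Release-Notes-2024 hello-world my_first_post about]
slug_like = samples.to_h { |segment| [segment, POST_SLUG_SEGMENT.match?(segment)] }

slug_like['my-first-post']      # => true
slug_like['Release-Notes-2024'] # => true  (the /i flag ignores case)
slug_like['hello-world']        # => false (only two tokens)
slug_like['my_first_post']      # => false (underscores are not separators)
slug_like['about']              # => false
```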

Constructor Details

#initialize(segments) ⇒ PathClassifier

Returns a new instance of PathClassifier.

Parameters:

  • segments (Array<String>)

    normalized URL path segments



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 205

def initialize(segments)
  @segments = segments
end
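
The constructor expects segments that are already normalized. How the library normalizes them is not shown on this page; the sketch below is one plausible way to derive such an array from a URL, and `path_segments` is a hypothetical helper:

```ruby
require 'uri'

# Hypothetical normalizer (an assumption, not the library's implementation):
# splits the URL path on "/", drops empty parts, and lowercases each segment.
def path_segments(url)
  URI.parse(url).path.split('/').reject(&:empty?).map(&:downcase)
end

path_segments('https://example.com/Blog/2024/My-First-Post/')
# => ["blog", "2024", "my-first-post"]
path_segments('https://example.com')
# => []
```

An array like the first result is the kind of input `PathClassifier.new` is documented to accept.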

Instance Attribute Details

#segments ⇒ Object (readonly)

Returns the value of attribute segments.



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 130

def segments
  @segments
end

Instance Method Details

#confidence_attributes ⇒ Hash

Returns high-confidence noise classification attributes.

Returns:

  • (Hash)

    high-confidence noise classification attributes



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 228

def confidence_attributes
  ConfidenceClassifier.new(self).attributes
end

#content_path? ⇒ Boolean

Returns true when the route has article-like path evidence.

Returns:

  • (Boolean)

    true when the route has article-like path evidence



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 233

def content_path?
  @content_path ||= SEGMENT_SETS.fetch(:content).intersect?(segments.to_set) ||
                    yearish_content_context?
end
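
One property of the `||=` idiom used here (and in the other memoized predicates on this page) is that only truthy results are cached: a `false` result is recomputed on every call. The sketch below demonstrates this with a stand-in class; `Probe` and its methods are illustrative and not part of the library:

```ruby
# Stand-in class that counts how often the memoized computation runs.
class Probe
  attr_reader :calls

  def initialize(result)
    @result = result
    @calls = 0
  end

  # Same memoization shape as content_path?.
  def flag?
    @flag ||= compute
  end

  private

  def compute
    @calls += 1
    @result
  end
end

hit = Probe.new(true)
3.times { hit.flag? }
hit.calls # => 1 (truthy result is cached)

miss = Probe.new(false)
3.times { miss.flag? }
miss.calls # => 3 (false is recomputed each time)
```

The returned value is the same either way; the difference only affects how much work repeated calls do.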

#deep_utility_context_route? ⇒ Boolean

Returns true when the leading segments are all utility chrome.

Returns:

  • (Boolean)

    true when the leading segments are all utility chrome



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 284

def deep_utility_context_route?
  LeadingSegments.new(segments).all_junk?
end

#destination_attributes ⇒ Hash

Returns destination attributes consumed by DestinationFacts.

Returns:

  • (Hash)

    destination attributes consumed by DestinationFacts



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 210

def destination_attributes
  route_attributes.merge(confidence_attributes)
end
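
`Hash#merge` returns a new hash, and when both hashes share a key the argument's value wins. The sketch below uses placeholder hashes to show that behavior; the keys are hypothetical, since the keys produced by `ConfidenceClassifier` are not listed on this page:

```ruby
# Placeholder hashes standing in for route_attributes and confidence_attributes.
route_like      = { route_only: 1, shared: :from_route }
confidence_like = { shared: :from_confidence, confidence_only: 2 }

merged = route_like.merge(confidence_like)
# => { route_only: 1, shared: :from_confidence, confidence_only: 2 }

route_like[:shared] # => :from_route (the receiver is left unchanged)
```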

#route_attributes ⇒ Hash

Returns baseline path classification attributes.

Returns:

  • (Hash)

    baseline path classification attributes



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 215

def route_attributes
  {
    segments:,
    content_path: content_path?,
    utility_path: utility_path?,
    taxonomy_path: taxonomy_path?,
    vanity_path: vanity_path?,
    shallow: shallow?,
    strong_post_suffix: strong_post_suffix?
  }
end

#shallow? ⇒ Boolean

Returns true when the route is too shallow to strongly indicate an article.

Returns:

  • (Boolean)

    true when the route is too shallow to strongly indicate an article



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 254

def shallow?
  segment_count = segments.size
  junk_segments = SEGMENT_SETS.fetch(:high_confidence_junk)

  segment_count <= 1 || (segment_count == 2 && junk_segments.include?(segments.last))
end
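
The rule has two branches: zero or one segment is always shallow, and two segments are shallow only when the last one is junk. A standalone sketch of the same logic, assuming an abbreviated junk set (`JUNK_SEGMENTS` is an illustrative name):

```ruby
require 'set'

# Abbreviated stand-in for SEGMENT_SETS[:high_confidence_junk].
JUNK_SEGMENTS = %w[about login tag tags].to_set.freeze

# Mirrors the body of shallow? above, taking the segments as an argument.
def shallow?(segments)
  count = segments.size
  count <= 1 || (count == 2 && JUNK_SEGMENTS.include?(segments.last))
end

shallow?([])                     # => true
shallow?(%w[news])               # => true
shallow?(%w[blog tags])          # => true  (two segments, last is junk)
shallow?(%w[tags ruby])          # => false (only the last segment is checked)
shallow?(%w[blog my-first-post]) # => false
shallow?(%w[blog 2024 tags])     # => false (three segments are never shallow)
```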

#shallow_high_confidence_route? ⇒ Boolean

Returns true when the route is shallow and contains high-confidence noise.

Returns:

  • (Boolean)

    true when the route is shallow and contains high-confidence noise



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 274

def shallow_high_confidence_route?
  junk_segments = SEGMENT_SETS.fetch(:high_confidence_junk)
  vanity_segments = SEGMENT_SETS.fetch(:vanity)

  shallow? && segments.any? do |segment|
    junk_segments.include?(segment) || vanity_segments.include?(segment)
  end
end
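
This predicate combines the shallowness check with membership in either the junk or the vanity set. A standalone sketch under the same assumptions as before (abbreviated sets, illustrative constant names):

```ruby
require 'set'

# Abbreviated stand-ins for the :high_confidence_junk and :vanity groups.
JUNK_SEGMENTS   = %w[about login tag tags].to_set.freeze
VANITY_SEGMENTS = %w[pricing premium subscribe].to_set.freeze

def shallow?(segments)
  count = segments.size
  count <= 1 || (count == 2 && JUNK_SEGMENTS.include?(segments.last))
end

# Mirrors shallow_high_confidence_route? above.
def shallow_high_confidence_route?(segments)
  shallow?(segments) &&
    segments.any? { |s| JUNK_SEGMENTS.include?(s) || VANITY_SEGMENTS.include?(s) }
end

shallow_high_confidence_route?(%w[pricing])         # => true  (shallow and vanity)
shallow_high_confidence_route?(%w[blog tags])       # => true  (shallow and junk)
shallow_high_confidence_route?(%w[news])            # => false (shallow, but no noise)
shallow_high_confidence_route?(%w[docs about team]) # => false (not shallow)
shallow_high_confidence_route?([])                  # => false (nothing to match)
```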

#strong_post_suffix? ⇒ Boolean

Returns true when the final path segment looks like a post slug.

Returns:

  • (Boolean)

    true when the final path segment looks like a post slug



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 262

def strong_post_suffix?
  PostSuffixClassifier.new(segments).strong?
end

#taxonomy_path? ⇒ Boolean

Returns true when the route points at taxonomy/listing chrome.

Returns:

  • (Boolean)

    true when the route points at taxonomy/listing chrome



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 249

def taxonomy_path?
  @taxonomy_path ||= SEGMENT_SETS.fetch(:taxonomy).intersect?(segments.to_set)
end

#utility_only_route? ⇒ Boolean

Returns true when every path segment is utility chrome.

Returns:

  • (Boolean)

    true when every path segment is utility chrome



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 267

def utility_only_route?
  junk_segments = SEGMENT_SETS.fetch(:high_confidence_junk)

  segments.all? { |segment| junk_segments.include?(segment) }
end
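
Because `Enumerable#all?` is vacuously true for an empty collection, this check also returns true for a path with no segments, such as a site root. A standalone sketch, assuming an abbreviated junk set (`JUNK_SEGMENTS` is an illustrative name):

```ruby
require 'set'

# Abbreviated stand-in for SEGMENT_SETS[:high_confidence_junk].
JUNK_SEGMENTS = %w[about search tag tags].to_set.freeze

# Mirrors utility_only_route? above, taking the segments as an argument.
def utility_only_route?(segments)
  segments.all? { |segment| JUNK_SEGMENTS.include?(segment) }
end

utility_only_route?(%w[search tags]) # => true
utility_only_route?(%w[tags ruby])   # => false
utility_only_route?([])              # => true (all? on an empty array)
```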

#utility_path? ⇒ Boolean

Returns true when the route includes utility/navigation evidence.

Returns:

  • (Boolean)

    true when the route includes utility/navigation evidence



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 239

def utility_path?
  @utility_path ||= SEGMENT_SETS.fetch(:utility).intersect?(segments.to_set)
end

#vanity_path? ⇒ Boolean

Returns true when the route points at conversion or account chrome.

Returns:

  • (Boolean)

    true when the route points at conversion or account chrome



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 244

def vanity_path?
  @vanity_path ||= SEGMENT_SETS.fetch(:vanity).intersect?(segments.to_set)
end