Class: Html2rss::AutoSource::Scraper::LinkHeuristics::LeadingSegments

Inherits:
Object
  • Object
Defined in:
lib/html2rss/auto_source/scraper/link_heuristics.rb

Overview

Classifies the route context formed by the URL path segments that precede the final segment.

Constructor Details

#initialize(segments) ⇒ LeadingSegments

Returns a new instance of LeadingSegments.

Parameters:

  • segments (Array<String>)

    normalized URL path segments; the final segment is discarded and only the leading segments are kept



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 348

def initialize(segments)
  @segments = segments[0...-1]
end
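
The slice `segments[0...-1]` keeps every segment except the last one. A minimal sketch of that behavior in plain Ruby, independent of the library; the helper name `leading_segments` and the sample paths are illustrative assumptions, not part of html2rss:

```ruby
# Hypothetical helper that mirrors the constructor's slice.
def leading_segments(segments)
  segments[0...-1]
end

leading_segments(%w[blog 2024 my-post]) # => ["blog", "2024"]
leading_segments(%w[about])             # => []
leading_segments([])                    # => []
```

A single-segment path therefore has no leading segments at all, which matters for the predicates documented below.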

Instance Method Details

#all_junk? ⇒ Boolean

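Returns true when there is at least one leading segment and every leading segment is utility chrome, that is, a member of the :high_confidence_junk segment set.

Returns:

  • (Boolean)

    true when the leading segments are non-empty and every one is utility chrome; false otherwise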



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 353

def all_junk?
  junk_segments = PathClassifier::SEGMENT_SETS.fetch(:high_confidence_junk)

  @segments.any? && @segments.all? { |segment| junk_segments.include?(segment) }
end
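
The `any? &&` guard matters because `Array#all?` returns true for an empty array. A rough, free-standing sketch of the same check follows; `JUNK_SEGMENTS` and its members are assumed stand-ins, since the real contents of `PathClassifier::SEGMENT_SETS` are not shown here:

```ruby
require 'set'

# Assumed stand-in for PathClassifier::SEGMENT_SETS.fetch(:high_confidence_junk).
JUNK_SEGMENTS = Set['login', 'cart', 'account'].freeze

# Hypothetical free-standing version of the predicate.
def all_junk?(leading_segments, junk_segments = JUNK_SEGMENTS)
  leading_segments.any? &&
    leading_segments.all? { |segment| junk_segments.include?(segment) }
end

all_junk?(%w[account login]) # => true
all_junk?(%w[account blog])  # => false
all_junk?([])                # => false (all? alone would return true)
```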

#trusted_post_context? ⇒ Boolean

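Returns true when at least one leading segment provides article context: it is a member of the :content or :deep_post_context segment set, or it matches PathClassifier::YEARISH_SEGMENT.

Returns:

  • (Boolean)

    true when any leading segment provides article context; false otherwise, including when there are no leading segments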



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 360

def trusted_post_context?
  content_segments = PathClassifier::SEGMENT_SETS.fetch(:content)
  context_segments = PathClassifier::SEGMENT_SETS.fetch(:deep_post_context)

  @segments.any? do |segment|
    content_segments.include?(segment) ||
      segment.match?(PathClassifier::YEARISH_SEGMENT) ||
      context_segments.include?(segment)
  end
end
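
To illustrate how a single matching segment is enough, here is a minimal sketch with stand-in constants. The set members and the year pattern are assumptions made for this example; the actual values of `PathClassifier::SEGMENT_SETS` and `PathClassifier::YEARISH_SEGMENT` are not shown in this document:

```ruby
require 'set'

# Assumed stand-ins for the PathClassifier constants.
CONTENT_SEGMENTS = Set['blog', 'news', 'articles'].freeze
DEEP_POST_CONTEXT_SEGMENTS = Set['posts', 'stories'].freeze
YEARISH_SEGMENT = /\A(?:19|20)\d{2}\z/ # assumed: a four-digit year

# Hypothetical free-standing version of the predicate.
def trusted_post_context?(leading_segments)
  leading_segments.any? do |segment|
    CONTENT_SEGMENTS.include?(segment) ||
      segment.match?(YEARISH_SEGMENT) ||
      DEEP_POST_CONTEXT_SEGMENTS.include?(segment)
  end
end

trusted_post_context?(%w[blog])         # => true  (content segment)
trusted_post_context?(%w[archive 2023]) # => true  (year-like segment)
trusted_post_context?(%w[account])      # => false
trusted_post_context?([])               # => false
```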