Class: Html2rss::AutoSource::Scraper::LinkHeuristics::TextClassifier

Inherits:
Object
Defined in:
lib/html2rss/auto_source/scraper/link_heuristics.rb

Overview

Classifies visible anchor text for utility and recommendation chrome.

Constant Summary

UTILITY_PREFIX_PATTERN =

Prefix labels that usually identify navigation or subscription links.

/
  \A\s*(
    # English
    view\s+all|see\s+all|all\s+news|subscribe|newsletter|comment\s+feed|comments\s+feed|join|premium|plus|
    # German
    alle\s+anzeigen|alle\s+news|abonnieren|newsletter|kommentar\s+feed|mitmachen|
    # Spanish
    ver\s+todos|ver\s+todo|todas\s+las\s+noticias|suscribirse|bolet(i|í)n|comentarios\s+feed|unirse|
    # French
    voir\s+tout|voir\s+tous|toutes\s+les\s+nouvelles|s['’]abonner|flux\s+de\s+commentaires|rejoindre
  )\b
/ix
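The structure of this pattern can be illustrated with a reduced version. The sketch below uses a trimmed, English-only stand-in (the name SAMPLE_PREFIX_PATTERN is hypothetical and not part of the library) to show how `\A\s*` anchors the match to the start of the text after optional whitespace, how `/i` ignores case, and how the trailing `\b` rejects longer words that merely start with a label.

```ruby
# Trimmed, English-only stand-in for UTILITY_PREFIX_PATTERN.
# SAMPLE_PREFIX_PATTERN is a hypothetical name used only for this sketch.
SAMPLE_PREFIX_PATTERN = /
  \A\s*(
    view\s+all|see\s+all|subscribe|newsletter|join|premium|plus
  )\b
/ix

samples = [
  "  View all posts",                   # leading whitespace and mixed case are accepted
  "SUBSCRIBE to updates",               # label followed by more words still matches
  "Subscribers react to price change",  # "subscribe" is not followed by a word boundary
  "Why readers join",                   # label is not at the start of the text
  "Joining forces"                      # "join" is not followed by a word boundary
]

# Map each sample to whether the stand-in pattern matches it.
results = samples.to_h { |text| [text, text.match?(SAMPLE_PREFIX_PATTERN)] }

results.each { |text, hit| puts format("%-40s %s", text.inspect, hit) }
```

Only the first two samples match. The remaining three fail either because of the `\A` anchor or because of the `\b` boundary.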
UTILITY_PATTERN =

Short labels that usually identify non-article navigation links.

/
  \A\s*(
    # English
    about|contact|comments?|join|log\s+in|login|member(ship)?|
    plus|premium|pricing|recommended(\s+for\s+you)?|
    see\s+all|share|sign\s+up|signup|subscribe|view\s+all|
    # German
    (ue|ü)ber(\s+uns)?|kontakt|kommentare?|mitmachen|anmelden|login|
    mitglied(schaft)?|empfohlen(\s+f(ue|ü)r\s+dich)?|alle\s+anzeigen|
    teilen|registrieren|abonnieren|newsletter|
    # Spanish
    sobre(\s+nosotros)?|contacto|comentarios?|unirse|iniciar\s+sesion|
    login|miembro|membres(i|í)a|recomendado(\s+para\s+ti)?|ver\s+todo|
    compartir|registrarse|suscribirse|bolet(i|í)n|
    # French
    (a|à)\s+propos|(a|à)propos|contact|commentaires?|rejoindre|
    se\s+connecter|login|membre|abonnement|recommand(e|é)(\s+pour\s+vous)?|
    voir\s+tout|partager|s['’]inscrire|s['’]abonner|newsletter
  )\b
/ix
RECOMMENDED_PATTERN =

Labels that identify recommendation chrome.

/
  \A\s*(
    recommended(\s+for\s+you)?|
    empfohlen(\s+f(ue|ü)r\s+dich)?|
    recomendado(\s+para\s+ti)?|
    recommand(e|é)(\s+pour\s+vous)?
  )\b
/ix

Instance Method Summary

Instance Method Details

#recommended?(text) ⇒ Boolean

Returns true when text identifies recommendation chrome.

Parameters:

  • text (String, #to_s)

    visible anchor text

Returns:

  • (Boolean)

    true when text identifies recommendation chrome



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 124

def recommended?(text) = text.to_s.match?(RECOMMENDED_PATTERN)
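A minimal usage sketch follows. To stay self-contained it does not load html2rss; instead it copies the recommendation pattern shown on this page into a hypothetical stand-in class (RecommendedSketch), so the results illustrate the documented pattern rather than a verified run against the library.

```ruby
# Hypothetical stand-in class; the pattern and predicate are copied from this page.
class RecommendedSketch
  RECOMMENDED_PATTERN = /
    \A\s*(
      recommended(\s+for\s+you)?|
      empfohlen(\s+f(ue|ü)r\s+dich)?|
      recomendado(\s+para\s+ti)?|
      recommand(e|é)(\s+pour\s+vous)?
    )\b
  /ix

  # Same behavior as the documented one-liner, written in the long form.
  def recommended?(text)
    text.to_s.match?(RECOMMENDED_PATTERN)
  end
end

sketch = RecommendedSketch.new

checks = {
  "Recommended for you"      => sketch.recommended?("Recommended for you"),
  "Empfohlen fuer dich"      => sketch.recommended?("Empfohlen fuer dich"),
  "Recomendado para ti"      => sketch.recommended?("Recomendado para ti"),
  "Recommande pour vous"     => sketch.recommended?("Recommande pour vous"),
  "Recommended reading list" => sketch.recommended?("Recommended reading list"),
  "Top recommended stories"  => sketch.recommended?("Top recommended stories"),
  "nil"                      => sketch.recommended?(nil) # nil.to_s is "", which never matches
}

checks.each { |label, hit| puts "#{label}: #{hit}" }
```

The first five entries are true and the last two are false. "Recommended reading list" matches because the pattern is anchored only at the start, so any text that begins with a recommendation label is classified as recommendation chrome.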

#utility?(text) ⇒ Boolean

Returns true when text matches a utility label.

Parameters:

  • text (String, #to_s)

    visible anchor text

Returns:

  • (Boolean)

    true when text matches a utility label



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 116

def utility?(text) = text.to_s.match?(UTILITY_PATTERN)
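Two properties of this predicate are easy to miss: the argument is coerced with `to_s`, and UTILITY_PATTERN as shown above has a start anchor but no end anchor. The sketch below demonstrates both with a trimmed, English-only stand-in pattern inside a hypothetical class (UtilitySketch); it is an illustration under those assumptions, not the library's code.

```ruby
# Hypothetical stand-in with a trimmed, English-only subset of UTILITY_PATTERN.
class UtilitySketch
  UTILITY_PATTERN = /
    \A\s*(
      about|contact|comments?|log\s+in|login|share|sign\s+up|signup|subscribe
    )\b
  /ix

  # Long form of the documented one-liner.
  def utility?(text)
    text.to_s.match?(UTILITY_PATTERN)
  end
end

sketch = UtilitySketch.new

outcomes = {
  symbol:       sketch.utility?(:login),                    # Symbol is coerced to "login"
  nil_value:    sketch.utility?(nil),                       # nil is coerced to ""
  short_label:  sketch.utility?("Contact"),
  long_text:    sketch.utility?("Contact lenses recalled"), # starts with a label, so it matches
  mid_sentence: sketch.utility?("Our contact page")         # label is not at the start
}

outcomes.each { |name, hit| puts "#{name}: #{hit}" }
```

Because longer text that merely begins with a label also matches, a caller that wants only short labels would need an additional check, such as a length limit. Whether html2rss applies one elsewhere is not stated on this page.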

#utility_prefix?(text) ⇒ Boolean

Returns true when text begins with a utility label.

Parameters:

  • text (String, #to_s)

    visible anchor text

Returns:

  • (Boolean)

    true when text begins with a utility label



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 120

def utility_prefix?(text) = text.to_s.match?(UTILITY_PREFIX_PATTERN)
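The three predicates could be combined to drop chrome links from a list of candidates. The sketch below is an assumption about how a caller might do this; it does not reproduce the scraper's actual logic. The class name FilterSketch, the trimmed English-only patterns, and the sample links are all hypothetical.

```ruby
# Hypothetical stand-in; each pattern is a trimmed, English-only subset.
class FilterSketch
  UTILITY_PREFIX_PATTERN = /\A\s*(view\s+all|see\s+all|subscribe|newsletter)\b/i
  UTILITY_PATTERN        = /\A\s*(about|contact|login|share|sign\s+up)\b/i
  RECOMMENDED_PATTERN    = /\A\s*(recommended(\s+for\s+you)?)\b/i

  def utility?(text)
    text.to_s.match?(UTILITY_PATTERN)
  end

  def utility_prefix?(text)
    text.to_s.match?(UTILITY_PREFIX_PATTERN)
  end

  def recommended?(text)
    text.to_s.match?(RECOMMENDED_PATTERN)
  end
end

# Made-up anchor texts and hrefs for illustration only.
links = [
  { text: "City council approves new budget", href: "/news/budget" },
  { text: "View all stories",                 href: "/news" },
  { text: "Recommended for you",              href: "/recs" },
  { text: "Login",                            href: "/login" },
  { text: "Storm closes coastal highway",     href: "/news/storm" }
]

classifier = FilterSketch.new

# Keep only links whose anchor text matches none of the three predicates.
article_links = links.reject do |link|
  text = link[:text]
  classifier.utility?(text) || classifier.utility_prefix?(text) || classifier.recommended?(text)
end

kept = article_links.map { |link| link[:href] }
puts kept
```

With these sample links, only "/news/budget" and "/news/storm" remain.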