Class: Html2rss::AutoSource::Scraper::LinkHeuristics::TextClassifier
- Inherits:
- Object
- Defined in:
- lib/html2rss/auto_source/scraper/link_heuristics.rb
Overview
Classifies visible anchor text for utility and recommendation chrome.
Constant Summary
- UTILITY_PREFIX_PATTERN =
Prefix labels that usually identify navigation or subscription links.
    /
      \A\s*(
        # English
        view\s+all|see\s+all|all\s+news|subscribe|newsletter|comment\s+feed|comments\s+feed|join|premium|plus|
        # German
        alle\s+anzeigen|alle\s+news|abonnieren|newsletter|kommentar\s+feed|mitmachen|
        # Spanish
        ver\s+todos|ver\s+todo|todas\s+las\s+noticias|suscribirse|bolet(i|í)n|comentarios\s+feed|unirse|
        # French
        voir\s+tout|voir\s+tous|toutes\s+les\s+nouvelles|s['’]abonner|flux\s+de\s+commentaires|rejoindre
      )\b
    /ix
- UTILITY_PATTERN =
Short labels that usually identify non-article navigation links.
    /
      \A\s*(
        # English
        about|contact|comments?|join|log\s+in|login|member(ship)?|
        plus|premium|pricing|recommended(\s+for\s+you)?|
        see\s+all|share|sign\s+up|signup|subscribe|view\s+all|
        # German
        (ue|ü)ber(\s+uns)?|kontakt|kommentare?|mitmachen|anmelden|login|
        mitglied(schaft)?|empfohlen(\s+f(ue|ü)r\s+dich)?|alle\s+anzeigen|
        teilen|registrieren|abonnieren|newsletter|
        # Spanish
        sobre(\s+nosotros)?|contacto|comentarios?|unirse|iniciar\s+sesion|
        login|miembro|membres(i|í)a|recomendado(\s+para\s+ti)?|ver\s+todo|
        compartir|registrarse|suscribirse|bolet(i|í)n|
        # French
        (a|à)\s+propos|(a|à)propos|contact|commentaires?|rejoindre|
        se\s+connecter|login|membre|abonnement|recommand(e|é)(\s+pour\s+vous)?|
        voir\s+tout|partager|s['’]inscrire|s['’]abonner|newsletter
      )\b
    /ix
- RECOMMENDED_PATTERN =
Labels for recommendation chrome rather than source articles.
    /
      \A\s*(
        recommended(\s+for\s+you)?|
        empfohlen(\s+f(ue|ü)r\s+dich)?|
        recomendado(\s+para\s+ti)?|
        recommand(e|é)(\s+pour\s+vous)?
      )\b
    /ix
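All three constants share one shape: `\A\s*` anchors the match at the start of the text (after optional whitespace), the alternation lists the labels, and `\b` requires a word boundary after the label. The sketch below copies RECOMMENDED_PATTERN from the listing and runs it against a few anchor texts; the sample strings are invented for illustration and do not come from the library.

```ruby
# Copied from the constant listing; /x ignores the layout whitespace,
# /i makes the match case-insensitive.
RECOMMENDED_PATTERN = /
  \A\s*(
    recommended(\s+for\s+you)?|
    empfohlen(\s+f(ue|ü)r\s+dich)?|
    recomendado(\s+para\s+ti)?|
    recommand(e|é)(\s+pour\s+vous)?
  )\b
/ix

# Hypothetical anchor texts, chosen to exercise the anchor and the boundary.
samples = [
  "Recommended for you",            # label at the start
  "  EMPFOHLEN für dich",           # leading whitespace, different case
  "Recomendado para ti",            # Spanish alternative
  "Recommender systems explained",  # shares a stem but is not the label
  "Highly recommended"              # label present, but not at the start
]

samples.each do |text|
  puts format("%-32s %s", text.inspect, text.match?(RECOMMENDED_PATTERN))
end
```

The first three samples match and the last two do not. Because the pattern is anchored only at the start, any text that merely begins with one of the labels also matches.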
Instance Method Summary
- #recommended?(text) ⇒ Boolean
True when text identifies recommendation chrome.
- #utility?(text) ⇒ Boolean
True when text matches a utility label.
- #utility_prefix?(text) ⇒ Boolean
True when text begins with a utility label.
Instance Method Details
#recommended?(text) ⇒ Boolean
Returns true when text identifies recommendation chrome.
    # File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 124
    def recommended?(text) = text.to_s.match?(RECOMMENDED_PATTERN)
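The `to_s` call means the predicate accepts any object, including `nil`, without raising. A minimal sketch of that coercion follows. `CoercionSketch` is a hypothetical stand-in for the real class, and it uses only the English alternative of RECOMMENDED_PATTERN so that it runs without the gem.

```ruby
# Hypothetical stand-in for TextClassifier#recommended?; not the library class.
module CoercionSketch
  # English-only subset of RECOMMENDED_PATTERN (an assumption made for brevity).
  PATTERN = /\A\s*(recommended(\s+for\s+you)?)\b/i

  def self.recommended?(text)
    # Same shape as the documented method: coerce first, then test.
    text.to_s.match?(PATTERN)
  end
end

p CoercionSketch.recommended?(nil)                     # nil.to_s is "", so false
p CoercionSketch.recommended?(:recommended)            # Symbol#to_s gives "recommended", so true
p CoercionSketch.recommended?("  Recommended for you") # leading whitespace is allowed, so true
p CoercionSketch.recommended?(42)                      # "42" is not a label, so false
```

Callers can therefore pass the result of a text extraction directly, even when it may be missing.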
#utility?(text) ⇒ Boolean
Returns true when text matches a utility label.
    # File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 116
    def utility?(text) = text.to_s.match?(UTILITY_PATTERN)
#utility_prefix?(text) ⇒ Boolean
Returns true when text begins with a utility label.
    # File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 120
    def utility_prefix?(text) = text.to_s.match?(UTILITY_PREFIX_PATTERN)
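Both `utility?` and `utility_prefix?` use a pattern anchored at the start of the text, so in practice they differ by their label lists rather than by where they match. The sketch below illustrates this with the English alternatives only. `UtilitySketch` is a hypothetical stand-in, the trimmed patterns are an assumption made to keep the example short, and the sample texts are invented.

```ruby
# Hypothetical stand-in for the two utility predicates; not the library class.
module UtilitySketch
  # English alternatives of UTILITY_PREFIX_PATTERN, copied from the listing.
  UTILITY_PREFIX_PATTERN = /
    \A\s*(
      view\s+all|see\s+all|all\s+news|subscribe|newsletter|
      comment\s+feed|comments\s+feed|join|premium|plus
    )\b
  /ix

  # English alternatives of UTILITY_PATTERN, copied from the listing.
  UTILITY_PATTERN = /
    \A\s*(
      about|contact|comments?|join|log\s+in|login|member(ship)?|
      plus|premium|pricing|recommended(\s+for\s+you)?|
      see\s+all|share|sign\s+up|signup|subscribe|view\s+all
    )\b
  /ix

  def self.utility?(text)
    text.to_s.match?(UTILITY_PATTERN)
  end

  def self.utility_prefix?(text)
    text.to_s.match?(UTILITY_PREFIX_PATTERN)
  end
end

[
  "About",                       # only in the utility list
  "All news",                    # only in the prefix list
  "View all",                    # in both lists
  "Joint statement released",    # "join" is not followed by a word boundary
  "Contact tracing app launches" # begins with a utility label
].each do |text|
  puts format("%-32s utility?=%-5s utility_prefix?=%s",
              text.inspect,
              UtilitySketch.utility?(text),
              UtilitySketch.utility_prefix?(text))
end
```

The last sample shows that the pattern itself does not limit text length: a headline that happens to begin with a label still matches `utility?`. The constant's description speaks of short labels, so the calling code may apply further checks, but that is not shown in this section.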