Readability

A Ruby port of Mozilla's Readability.js. It extracts readable article content from HTML pages, like Firefox Reader View.

Passes all 130 Mozilla test fixtures.

Installation

Add this line to your application's Gemfile:

gem "readability-rb"

Quick Start

result = Readability.parse(html, url: "https://example.com/article")

result.title        # article title
result.       # author name
result.content      # cleaned HTML content
result.text_content # plain text content
result.excerpt      # short summary
result.length       # text content length
result.site_name    # site name
result.published_time # publication date
result.dir          # text direction
result.lang         # language

Usage

Parse an article

require "net/http"

# Fetch the raw HTML, then pass the page URL so relative links can be resolved.
html = Net::HTTP.get(URI("https://example.com/article"))
result = Readability.parse(html, url: "https://example.com/article")

puts result.title
puts result.content

Returns a Readability::Result or nil if parsing fails.
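
Because a failed parse yields nil, callers usually need to branch before touching the result. A minimal sketch of that pattern, assuming only the accessors listed under Quick Start; article_summary is a hypothetical helper, not part of the gem:

```ruby
# Hypothetical helper (not part of readability-rb): convert a parse result,
# or nil when parsing failed, into a plain Hash.
def article_summary(result)
  return { readable: false } if result.nil?

  {
    readable: true,
    title: result.title,
    excerpt: result.excerpt,
    length: result.length
  }
end

# Intended use, assuming the gem is loaded:
#   summary = article_summary(Readability.parse(html, url: url))
```

Returning a Hash with an explicit readable flag keeps the nil check in one place instead of spreading it across every caller.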

Check if a page is readable

if Readability.readerable?(html)
  result = Readability.parse(html)
end

readerable? also accepts optional min_score and min_content_length keyword arguments.

Readability.readerable?(html, min_score: 30, min_content_length: 200)
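
The check and the parse can be combined into one call. The sketch below is one way to do that; parse_if_readable and its parser parameter are hypothetical names introduced for illustration, and the only gem calls it relies on are readerable? and parse as shown above:

```ruby
# Hypothetical helper (not part of readability-rb): run the readability check
# first and only parse when it passes.
# `parser` is any object responding to readerable? and parse. It defaults to
# the Readability module and is injectable so the helper can be tested alone.
def parse_if_readable(html, url: nil, parser: Readability, **check_options)
  return nil unless parser.readerable?(html, **check_options)

  parser.parse(html, url: url)
end

# Intended use, assuming the gem is loaded:
#   result = parse_if_readable(html, url: "https://example.com/article",
#                              min_score: 30, min_content_length: 200)
```

Any extra keyword arguments are forwarded to readerable? unchanged, so the thresholds stay under the caller's control.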

Use the lower-level API

Pass a Nokogiri document directly.

require "nokogiri"

doc = Nokogiri::HTML5(html)
result = Readability::Document.new(doc, url: "https://example.com").parse

Custom serializer

Replace the default HTML serializer.

result = Readability.parse(html, serializer: ->(el) { el.to_html })
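
A serializer can also post-process the markup. The following is a sketch under the assumption that the element passed in responds to inner_html, as Nokogiri nodes do; COMPACT_SERIALIZER is a hypothetical name:

```ruby
# Illustrative serializer: remove whitespace between adjacent tags in the
# extracted HTML. Assumes `element` responds to inner_html (true of Nokogiri nodes).
COMPACT_SERIALIZER = lambda do |element|
  element.inner_html.gsub(/>\s+</, "><").strip
end

# Intended use, assuming the gem is loaded:
#   result = Readability.parse(html, serializer: COMPACT_SERIALIZER)
```

Note that stripping whitespace between tags can change how adjacent inline elements render, so this suits storage or diffing more than display.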

Options

| Option | Description | Default |
|---|---|---|
| url | Base URL for resolving relative links | nil |
| max_elems_to_parse | Max elements before aborting (0 = no limit) | 0 |
| nb_top_candidates | Number of top candidates to consider | 5 |
| char_threshold | Min characters for a successful parse | 500 |
| classes_to_preserve | CSS classes to keep on elements | [] |
| keep_classes | Preserve all CSS classes | false |
| disable_json_ld | Skip JSON-LD metadata extraction | false |
| allowed_video_regex | Regex for allowed video embed URLs | built-in |
| link_density_modifier | Adjust link density calculation | 0 |
| serializer | Lambda to serialize the content element | inner_html |
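
When the same settings are reused across many calls, it can help to build the options Hash in one place. This is a sketch that assumes the options are passed to Readability.parse as keyword arguments, as url and serializer are in the examples above; build_parse_options and KNOWN_OPTIONS are hypothetical names:

```ruby
# Option names copied from the table above; the helper itself is hypothetical.
KNOWN_OPTIONS = %i[
  url max_elems_to_parse nb_top_candidates char_threshold classes_to_preserve
  keep_classes disable_json_ld allowed_video_regex link_density_modifier serializer
].freeze

# Merge per-call overrides into shared defaults, rejecting misspelled names early.
def build_parse_options(defaults, **overrides)
  options = defaults.merge(overrides)
  unknown = options.keys - KNOWN_OPTIONS
  raise ArgumentError, "unknown option(s): #{unknown.join(', ')}" unless unknown.empty?

  options
end

# Intended use, assuming options are passed to parse as keyword arguments:
#   defaults = { char_threshold: 200, keep_classes: true }
#   result = Readability.parse(html, **build_parse_options(defaults, url: page_url))
```

Validating names up front turns a typo such as char_treshold into an immediate ArgumentError raised by the helper itself.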

Contributing

Everyone is encouraged to help improve this project.

License

Apache 2.0