philiprehberger-sanitize_html


HTML sanitizer with configurable allow lists, security profiles, and URL/CSS sanitization for safely rendering user content.

Requirements

  • Ruby >= 3.1

Installation

Add to your Gemfile:

gem "philiprehberger-sanitize_html"

Or install directly:

gem install philiprehberger-sanitize_html

Usage

require "philiprehberger/sanitize_html"

# Clean HTML with default allowed tags
safe = Philiprehberger::SanitizeHtml.clean('<p>Hello <script>alert("xss")</script></p>')
# => "<p>Hello </p>"

Custom Allow Lists

Philiprehberger::SanitizeHtml.clean(
  '<div class="box"><span>text</span></div>',
  tags: %w[div span],
  attributes: { 'div' => %w[class] }
)
# => '<div class="box"><span>text</span></div>'

Security Profiles

# :strict - removes all tags
Philiprehberger::SanitizeHtml.clean('<p>Hello <b>world</b></p>', profile: :strict)
# => "Hello world"

# :moderate - basic formatting (p, br, strong, em, b, i, u, lists, blockquote)
Philiprehberger::SanitizeHtml.clean('<p>Hello <b>world</b></p>', profile: :moderate)
# => "<p>Hello <b>world</b></p>"

# :permissive - most safe tags (formatting, links, images, tables, divs, spans)
Philiprehberger::SanitizeHtml.clean('<div><table><tr><td>cell</td></tr></table></div>', profile: :permissive)
# => "<div><table><tr><td>cell</td></tr></table></div>"

# :markdown - code, links, formatting, headings, tables
Philiprehberger::SanitizeHtml.clean('<pre><code>puts "hi"</code></pre>', profile: :markdown)
# => '<pre><code>puts "hi"</code></pre>'

URL Protocol Sanitization

# Default: allows http, https, mailto
Philiprehberger::SanitizeHtml.clean('<a href="javascript:alert(1)">click</a>')
# => "<a>click</a>"

# Custom allowed protocols
Philiprehberger::SanitizeHtml.clean(
  '<a href="ftp://files.example.com/doc.pdf">download</a>',
  allowed_protocols: %w[http https ftp]
)
# => '<a href="ftp://files.example.com/doc.pdf">download</a>'
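
For intuition, protocol allow-listing can be sketched with the standard library alone. The helper below (`protocol_allowed?`) is hypothetical and is not the gem's implementation. It assumes that a URL without a scheme is relative and therefore allowed, and that whitespace and control characters are removed before the scheme is read, since browsers ignore them inside a scheme.

```ruby
# Hypothetical helper, not part of philiprehberger-sanitize_html.
ALLOWED_PROTOCOLS = %w[http https mailto].freeze

def protocol_allowed?(url, allowed = ALLOWED_PROTOCOLS)
  # Drop whitespace and control characters so "java\tscript:" cannot slip through.
  compact = url.to_s.gsub(/[[:space:][:cntrl:]]/, "")
  scheme = compact[/\A([a-zA-Z][a-zA-Z0-9+.\-]*):/, 1]
  return true if scheme.nil? # assumption: scheme-less URLs are relative and safe

  allowed.include?(scheme.downcase)
end

protocol_allowed?("https://example.com")                       # => true
protocol_allowed?("java\tscript:alert(1)")                     # => false
protocol_allowed?("ftp://files.example.com", %w[http https ftp]) # => true
```

Comparing against an allow list rather than a deny list means unknown schemes such as `vbscript:` are rejected by default.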

Data URI Filtering

# Allow specific MIME types for data: URIs
Philiprehberger::SanitizeHtml.clean(
  '<a href="data:image/png;base64,abc123">image</a>',
  allowed_data_mimes: ['image/png', 'image/jpeg']
)
# => '<a href="data:image/png;base64,abc123">image</a>'
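
The MIME check on a data: URI can be illustrated in a few lines. This is a sketch under stated assumptions, not the gem's code: `data_uri_allowed?` is a hypothetical name, and the sketch assumes the MIME type is whatever sits between `data:` and the first `;` or `,`.

```ruby
# Hypothetical helper, not part of philiprehberger-sanitize_html.
def data_uri_allowed?(uri, allowed_mimes)
  # A data URI has the shape data:[<mime>][;params],<payload>
  match = uri.to_s.strip.match(/\Adata:([^;,]*)[;,]/i)
  return false unless match

  allowed_mimes.map(&:downcase).include?(match[1].downcase)
end

data_uri_allowed?("data:image/png;base64,abc123", %w[image/png image/jpeg]) # => true
data_uri_allowed?("data:text/html,<script>alert(1)</script>", %w[image/png]) # => false
```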

CSS Sanitization

# Safe CSS properties are preserved, dangerous ones are stripped
Philiprehberger::SanitizeHtml.clean(
  '<p style="color: red; expression(alert(1))">text</p>',
  tags: %w[p],
  attributes: { 'p' => %w[style] }
)
# => '<p style="color: red">text</p>'
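
One way to picture the filtering above is as a per-declaration allow-list check. The sketch below is illustrative only: `filter_style` and `SAFE_PROPERTIES` are hypothetical names, the property list is a made-up subset, and the gem's actual `SAFE_CSS_PROPERTIES` and parsing rules may differ.

```ruby
# Hypothetical allow list and helper; the gem's SAFE_CSS_PROPERTIES may differ.
SAFE_PROPERTIES = %w[color background-color font-size font-weight text-align].freeze

def filter_style(style, safe = SAFE_PROPERTIES)
  kept = style.to_s.split(";").filter_map do |declaration|
    property, value = declaration.split(":", 2).map(&:strip)
    # Skip fragments that are not "property: value" pairs, e.g. a bare expression(...).
    next if property.nil? || value.nil? || value.empty?
    next unless safe.include?(property.downcase)
    # Reject values that can execute script or fetch remote resources.
    next if value.match?(/expression\s*\(|url\s*\(|javascript:/i)

    "#{property.downcase}: #{value}"
  end
  kept.join("; ")
end

filter_style("color: red; expression(alert(1))")
# => "color: red"
```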

Callback Hooks

# Custom tag processing with on_tag callback
result = Philiprehberger::SanitizeHtml.clean(
  '<a href="http://example.com">link</a>',
  on_tag: ->(tag, attrs) {
    attrs['rel'] = 'nofollow' if tag == 'a'
    attrs
  }
)

# Return nil from callback to remove a tag
result = Philiprehberger::SanitizeHtml.clean(
  '<p>Keep</p><strong>Remove</strong>',
  on_tag: ->(tag, _attrs) { tag == 'strong' ? nil : {} }
)
# => "<p>Keep</p>"

Length Limits

# html is any input String; the length check runs before sanitization
Philiprehberger::SanitizeHtml.clean(html, max_length: 10_000)
# raises Philiprehberger::SanitizeHtml::Error when html.length > 10_000

Forced Link rel

Philiprehberger::SanitizeHtml.clean(html, link_rel: 'nofollow noopener')
# every <a> in the output has rel="nofollow noopener"

Strip All Tags

Philiprehberger::SanitizeHtml.strip('<p>Hello <strong>world</strong></p>')
# => "Hello world"

Plain Text Extraction

# strip_tags removes all HTML and decodes entities for indexing or previews
Philiprehberger::SanitizeHtml.strip_tags('<p>Tom &amp; Jerry</p>')
# => "Tom & Jerry"

# script and style content is removed entirely, matching browser behavior
Philiprehberger::SanitizeHtml.strip_tags('Hi<script>alert(1)</script> there')
# => "Hi there"

# The :text_only profile is equivalent to strip_tags
Philiprehberger::SanitizeHtml.clean('<b>hi</b>', profile: :text_only)
# => "hi"

Escape HTML

Philiprehberger::SanitizeHtml.escape('<p>Hello</p>')
# => "&lt;p&gt;Hello&lt;/p&gt;"
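
For comparison, Ruby's standard library offers `CGI.escapeHTML`, which entity-encodes `&`, `<`, `>`, `"`, and `'`. The snippet below shows only the stdlib behavior; whether the gem's `escape` encodes quotes in exactly the same way is not stated in this README, so treat any equivalence as an assumption.

```ruby
require "cgi"

# Standard-library escaping, shown for comparison only.
CGI.escapeHTML('<p>Hello</p>')
# => "&lt;p&gt;Hello&lt;/p&gt;"

CGI.escapeHTML(%q(Tom & "Jerry's"))
# => "Tom &amp; &quot;Jerry&#39;s&quot;"
```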

Sanitize a Single URL

Philiprehberger::SanitizeHtml.sanitize_url('https://example.com')
# => "https://example.com"

Philiprehberger::SanitizeHtml.sanitize_url('javascript:alert(1)')
# => nil

Philiprehberger::SanitizeHtml.sanitize_url('ftp://files.example.com', allowed_protocols: %w[ftp])
# => "ftp://files.example.com"

API

| Method / Constant | Description |
|---|---|
| `.clean(html, tags:, attributes:, profile:, allowed_protocols:, allowed_data_mimes:, on_tag:, max_length:, link_rel:)` | Sanitize HTML keeping only allowed tags and attributes, with optional security profile, URL sanitization, data URI filtering, callback hooks, input length limit, and forced `<a>` rel attribute |
| `.strip(html, max_length:)` | Remove all HTML tags, returning plain text (with entity normalization). Raises `Error` when input exceeds `max_length` |
| `.strip_tags(html, max_length:)` | Convert HTML to plain text by removing all tags (including script/style content) and decoding entities; returns `""` for nil or empty input. Raises `Error` when input exceeds `max_length` |
| `.escape(html, max_length:)` | Entity-encode all HTML special characters. Raises `Error` when input exceeds `max_length` |
| `max_length:` | Optional positive Integer accepted by `clean`/`strip`/`strip_tags`/`escape`; raises `SanitizeHtml::Error` when the input string length exceeds the limit (check happens before sanitization) |
| `link_rel:` | Optional String accepted by `clean` (e.g. `'nofollow noopener'`); when set, every emitted `<a>` tag has its rel attribute force-set to this value, bypassing attribute filtering |
| `.sanitize_url(url, allowed_protocols:, allowed_data_mimes:)` | Validate a single URL; returns the stripped URL when safe or nil for disallowed protocols |
| `DEFAULT_ALLOWED_TAGS` | Frozen array of tag names allowed by default (p, br, strong, em, b, i, u, a, ul, ol, li, blockquote, code, pre, h1-h6) |
| `DEFAULT_ALLOWED_ATTRIBUTES` | Frozen hash of attributes allowed per tag (a => href, title; img => src, alt) |
| `DEFAULT_ALLOWED_PROTOCOLS` | Frozen array of allowed URL protocols (http, https, mailto) |
| `DEFAULT_ALLOWED_DATA_MIMES` | Frozen empty array of allowed data URI MIME types (none by default) |
| `SAFE_CSS_PROPERTIES` | Frozen array of CSS property names considered safe for style attributes |
| `PROFILES` | Frozen hash of predefined security profiles (:strict, :moderate, :permissive, :markdown, :text_only) |
| `DANGEROUS_TAGS` | Frozen array of tags always removed with their content (script, style, iframe) |
| `EVENT_ATTRIBUTE_PATTERN` | Regex matching event-handler attributes (e.g. onclick, onload) that are always stripped |
| `Error` | Base error class for the module (`Philiprehberger::SanitizeHtml::Error`) |

Development

bundle install
bundle exec rspec
bundle exec rubocop

Support

If you find this project useful:

⭐ Star the repo

🐛 Report issues

💡 Suggest features

❤️ Sponsor development

🌐 All Open Source Projects

💻 GitHub Profile

🔗 LinkedIn Profile

License

MIT