Class: Markbridge::Renderers::Discourse::MarkdownEscaper

Inherits: Object

Defined in: lib/markbridge/renderers/discourse/markdown_escaper.rb

Overview

Escapes text to prevent interpretation as Markdown formatting.

Design principles:

No false negatives: all potentially special sequences MUST be escaped
False positives OK: over-escaping is acceptable for safety
Autolinks preserved: <https://…>, <mailto:…>, and <email@domain> remain functional
HTML escaped: tags, processing instructions, and SGML declarations are neutralized
Performance: minimal allocations, byte-level processing, early returns
Discourse-compatible: handles ndash conversion, unlimited ordered list numbers

Optimized for Ruby 3.3+ with YJIT. Key optimizations:

Fast path returns original string for plain text (no allocations)
Pre-allocated result buffers with estimated capacity
Byte-level processing for inline escaping (YJIT-friendly tight loops)
Simplified escaping rules: [ breaks links, so ] doesn’t need escaping

Examples:

Basic escaping

escaper = Markbridge::Renderers::Discourse::MarkdownEscaper.new
escaper.escape("# Heading")      # => "\\# Heading"
escaper.escape("*emphasis*")     # => "\\*emphasis\\*"
escaper.escape("foo -- bar")     # => "foo \\-\\- bar"

HTML is escaped

escaper.escape("<div>content</div>")  # => "\\<div>content\\</div>"
escaper.escape("<?php echo 1; ?>")    # => "\\<?php echo 1; ?>"

Constant Summary

MAYBE_SPECIAL = Fast-path check: any character that might need escaping. Only includes characters we actually escape (removed ], {, }, ^). > is needed for blockquote detection at line start.

/[\\`*_\[#+\-.!<>&|~=>)]/

MAYBE_INDENTED_CODE = Check for indented code on any line. Matches: 4+ spaces, tab, or space+tab combinations that reach column 4+.

/(?:^|\n)(?: {4}|\t| {1,3}\t)/
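As a rough illustration of how the two quick-check patterns above could gate a fast path, the sketch below copies them into a hypothetical module and hands back the original string when neither matches. `FastPathSketch`, `plain_text?`, and `escape_with_fast_path` are illustrative names, not part of Markbridge, and the real `#escape` may combine these checks differently.

```ruby
# Hypothetical sketch: both patterns are copied from the constant listing above.
module FastPathSketch
  MAYBE_SPECIAL = /[\\`*_\[#+\-.!<>&|~=>)]/
  MAYBE_INDENTED_CODE = /(?:^|\n)(?: {4}|\t| {1,3}\t)/

  # True when neither quick check fires, i.e. nothing could need escaping.
  def self.plain_text?(text)
    !text.match?(MAYBE_SPECIAL) && !text.match?(MAYBE_INDENTED_CODE)
  end

  # Returns the very same String object for plain text (no allocation);
  # otherwise yields to a caller-supplied slow path (an assumption here).
  def self.escape_with_fast_path(text)
    return text if plain_text?(text)

    yield text
  end
end
```

Note that `.` is part of MAYBE_SPECIAL, so ordinary sentences ending in a period would still take the slow path in this sketch.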

ATX_HEADING = Block-level patterns

/\A\#{1,6}(?=[ \t]|$)/

BLOCK_QUOTE =

/\A>/

BULLET_LIST = List markers followed by space, tab, or end of line

/\A[-+*](?=[ \t]|$)/

ORDERED_LIST =

/\A(\d+)([.)])(?=[ \t])/

THEMATIC_BREAK_DASH =

/\A(?:-[ \t]*){3,}$/

THEMATIC_BREAK_STAR =

/\A(?:\*[ \t]*){3,}$/

THEMATIC_BREAK_UNDERSCORE =

/\A(?:_[ \t]*){3,}$/

FENCED_CODE_BACKTICK =

/\A`{3,}[^`]*$/

FENCED_CODE_TILDE =

/\A~{3,}/

SETEXT_UNDERLINE_EQUALS =

/\A=+[ \t]*$/

SETEXT_UNDERLINE_DASH =

/\A-+[ \t]*$/

INDENTED_CODE = Indented code: 4+ spaces, tab at start, or space+tab reaching column 4+

/\A(?: {4}|\t| {1,3}\t)/
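To make the block-level patterns above easier to compare, here is a small sketch that tries them in order against a single line and reports the first one that matches. The module name `BlockSketch`, the `kind_of` helper, and the ordering (thematic breaks before list markers) are assumptions made for illustration; the library's own dispatch order is not shown in this page.

```ruby
# Hypothetical sketch: patterns copied from the block-level constants above.
module BlockSketch
  PATTERNS = {
    thematic_break_dash: /\A(?:-[ \t]*){3,}$/,
    thematic_break_star: /\A(?:\*[ \t]*){3,}$/,
    thematic_break_underscore: /\A(?:_[ \t]*){3,}$/,
    atx_heading: /\A\#{1,6}(?=[ \t]|$)/,
    block_quote: /\A>/,
    bullet_list: /\A[-+*](?=[ \t]|$)/,
    ordered_list: /\A(\d+)([.)])(?=[ \t])/,
    fenced_code_backtick: /\A`{3,}[^`]*$/,
    fenced_code_tilde: /\A~{3,}/,
    setext_underline_equals: /\A=+[ \t]*$/,
    setext_underline_dash: /\A-+[ \t]*$/,
    indented_code: /\A(?: {4}|\t| {1,3}\t)/
  }.freeze

  # Returns the name of the first pattern matching the line, or nil.
  def self.kind_of(line)
    PATTERNS.each { |name, pattern| return name if pattern.match?(line) }
    nil
  end
end

BlockSketch.kind_of("# Heading")  # => :atx_heading
BlockSketch.kind_of("#hashtag")   # => nil (no space after the #)
BlockSketch.kind_of("1.5 meters") # => nil (digit, not whitespace, after the dot)
```

The lookaheads are what keep `#hashtag` and `1.5 meters` from being treated as block constructs.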

INLINE_SPECIAL = Inline quick-check pattern (includes < for HTML tag escaping)

/[\\*_`\[!|<&~-]/

ENTITY_REF = Entity reference pattern (we escape these to prevent conversion)

/\A&(?:\#[xX][0-9a-fA-F]{1,6}|\#[0-9]{1,7}|[a-zA-Z][a-zA-Z0-9]{0,31});/
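The entity pattern is anchored with `\A`, so it is presumably applied to the text starting at an `&`. A minimal sketch of that check follows; `ENTITY_REF_SKETCH` is a copy of the constant above and `entity_at` is a hypothetical helper, not a Markbridge method.

```ruby
# Hypothetical sketch: ENTITY_REF copied from the constant listing above.
ENTITY_REF_SKETCH = /\A&(?:\#[xX][0-9a-fA-F]{1,6}|\#[0-9]{1,7}|[a-zA-Z][a-zA-Z0-9]{0,31});/

# Returns the entity reference that starts at character index `offset`,
# or nil when the text there is not a complete entity reference.
def entity_at(text, offset)
  fragment = text[offset..] || ""
  fragment[ENTITY_REF_SKETCH]
end

entity_at("AT&amp;T", 2) # => "&amp;"
entity_at("AT&T", 2)     # => nil (no terminating semicolon)
```

Only complete references (named, decimal, or hexadecimal, each ending in `;`) match, so a bare ampersand as in `AT&T` is left alone by this check.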

HTML_ATTR = HTML tag pattern (we escape these, but NOT autolinks). Handles quoted attributes, which can contain > characters. Attribute patterns: name="value" | name='value' | name=value | name

/(?:\s+[a-zA-Z_:][a-zA-Z0-9_.:-]*(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+))?)/

HTML_TAG =

%r{\A</?[a-zA-Z][a-zA-Z0-9-]*#{HTML_ATTR}*\s*/?>}

AUTOLINK = Autolink pattern: we pass these through entirely unchanged. Matches <http://…>, <https://…>, <mailto:…>, and email addresses.

%r{\A<(?:https?://|mailto:)[^>\s]*>|\A<[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*>}i

HTML_TAG_START = Match HTML-like constructs that need escaping:
Processing instructions: <?php, <?xml, etc.
SGML declarations: <!DOCTYPE, <!ELEMENT, <![CDATA[, <!--, etc.
Incomplete/multi-line HTML tags: <div followed by attributes on next line
Custom elements: <my-component>, <responsive-image>
The (?:[\s/]|$) ensures we don’t match comparisons like “a < b”

%r{\A<(?:[?!]|/?\s*[a-zA-Z][a-zA-Z0-9-]*(?:[\s/]|$))}
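The interplay between AUTOLINK, HTML_TAG, and HTML_TAG_START can be sketched as a small classifier for text that starts at a `<`. The patterns are copied from the constants above (with `#` written as `\#` inside the email character class, a defensive escape that does not change what it matches); the module `AngleSketch`, the `classify` helper, and the check order (autolink first) are assumptions for illustration only.

```ruby
# Hypothetical sketch: patterns copied from the constant listing above.
module AngleSketch
  HTML_ATTR = /(?:\s+[a-zA-Z_:][a-zA-Z0-9_.:-]*(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+))?)/
  HTML_TAG = %r{\A</?[a-zA-Z][a-zA-Z0-9-]*#{HTML_ATTR}*\s*/?>}
  AUTOLINK = %r{\A<(?:https?://|mailto:)[^>\s]*>|\A<[a-zA-Z0-9.!\#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*>}i
  HTML_TAG_START = %r{\A<(?:[?!]|/?\s*[a-zA-Z][a-zA-Z0-9-]*(?:[\s/]|$))}

  # Classifies a fragment beginning with "<".
  # :autolink  -> would be passed through unchanged
  # :html_tag  -> complete tag, the "<" would be escaped
  # :html_like -> processing instruction, declaration, or unfinished tag
  # :literal   -> nothing HTML-like starts here
  def self.classify(fragment)
    return :autolink if AUTOLINK.match?(fragment)
    return :html_tag if HTML_TAG.match?(fragment)
    return :html_like if HTML_TAG_START.match?(fragment)

    :literal
  end
end

AngleSketch.classify("<https://example.com>") # => :autolink
AngleSketch.classify("<div>content</div>")    # => :html_tag
AngleSketch.classify("<?php echo 1; ?>")      # => :html_like
AngleSketch.classify("<3")                    # => :literal
```

Checking autolinks before tags is what lets `<user@example.com>` survive while `<div>` is neutralized in this sketch.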

Byte constants for inline processing (each name is listed with the character it represents):

BACKSLASH = \

BANG = !

HASH = #

AMP = &

STAR = *

PLUS = +

DASH = -

LT = <

EQUALS = =

GT = >

BRACKET_OPEN = [

UNDERSCORE = _

BACKTICK = `

PIPE = |

TILDE = ~

SPACE = (space)

TAB = (tab)

DIGIT_0 = 0

DIGIT_9 = 9
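The byte-level approach mentioned in the overview can be illustrated with a deliberately simplified loop. This is only a sketch under stated assumptions: the constants are taken to be integer byte values (the listing does not show their definitions), only five characters are handled, and `escape_inline_sketch` is a hypothetical name. The real inline escaper applies context-dependent rules that are not reproduced here.

```ruby
# Assumed integer byte values for a handful of the constants listed above.
BACKSLASH = 92    # \
STAR = 42         # *
UNDERSCORE = 95   # _
BACKTICK = 96     # `
BRACKET_OPEN = 91 # [

ESCAPED_BYTES = [BACKSLASH, STAR, UNDERSCORE, BACKTICK, BRACKET_OPEN].freeze

# Hypothetical helper: prefixes each listed byte with a backslash.
def escape_inline_sketch(text)
  # Fast path: hand back the original object when nothing needs escaping.
  return text unless text.match?(/[\\*_`\[]/)

  # Binary buffer so that appending an Integer adds exactly one byte.
  buffer = String.new(capacity: text.bytesize + 8, encoding: Encoding::BINARY)
  text.each_byte do |byte|
    buffer << BACKSLASH if ESCAPED_BYTES.include?(byte)
    buffer << byte
  end
  buffer.force_encoding(text.encoding)
end

escape_inline_sketch("*emphasis*") # => "\\*emphasis\\*"
```

Working on bytes is safe for UTF-8 input because every escaped character is ASCII, and ASCII byte values never occur inside a multi-byte UTF-8 sequence.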

Instance Method Summary

#escape(text) ⇒ String

Escapes markdown special characters in the given text.

#initialize(escape_hard_line_breaks: false) ⇒ MarkdownEscaper (constructor)

A new instance of MarkdownEscaper.

Constructor Details

#initialize(escape_hard_line_breaks: false) ⇒ `MarkdownEscaper`

Returns a new instance of MarkdownEscaper.

Parameters:

escape_hard_line_breaks (Boolean) (defaults to: false) —

when true, strip trailing spaces before newlines to prevent CommonMark hard line breaks (<br/>). Defaults to false because Discourse has trailing-space hard line breaks disabled by default.

# File 'lib/markbridge/renderers/discourse/markdown_escaper.rb', line 37

def initialize(escape_hard_line_breaks: false)
  @escape_hard_line_breaks = escape_hard_line_breaks
  @inline_content = nil
  @inline_result = nil
  @inline_len = 0
end
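In CommonMark, two or more spaces at the end of a line followed by a newline produce a hard line break. The effect of `escape_hard_line_breaks: true`, as described in the parameter documentation above, can be approximated by the sketch below. `strip_trailing_spaces_before_newlines` is a hypothetical helper written for illustration, not the library's implementation.

```ruby
# Hypothetical helper: removes runs of spaces that sit directly before a
# newline, so no trailing-space hard line break can be formed.
def strip_trailing_spaces_before_newlines(text)
  text.gsub(/ +(?=\n)/, "")
end

strip_trailing_spaces_before_newlines("line one  \nline two")
# => "line one\nline two"
```

Spaces at the very end of the text (with no newline after them) are left untouched in this sketch, since they cannot start a hard line break.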

Instance Method Details

#escape(text) ⇒ `String`

Note:

Multi-line HTML tags and blocks are handled by escaping the opening <

Escapes markdown special characters in the given text.

Handles both block-level constructs (headings, lists, code blocks, HTML blocks) and inline formatting (emphasis, code spans, links, inline HTML). Autolinks (<https://…>, <email@domain>) are intentionally preserved.