Class: Markbridge::Renderers::Discourse::MarkdownEscaper

Inherits:
Object
  • Object
Defined in:
lib/markbridge/renderers/discourse/markdown_escaper.rb

Overview

Escapes text to prevent interpretation as Markdown formatting.

Design principles:

  • No false negatives: all potentially special sequences MUST be escaped

  • False positives OK: over-escaping is acceptable for safety

  • Autolinks preserved: <https://…>, <mailto:…>, and <email@domain> remain functional

  • HTML escaped: tags, processing instructions, and SGML declarations are neutralized

  • Performance: minimal allocations, byte-level processing, early returns

  • Discourse-compatible: handles ndash conversion, unlimited ordered list numbers

Optimized for Ruby 3.3+ with YJIT. Key optimizations:

  • Fast path returns original string for plain text (no allocations)

  • Pre-allocated result buffers with estimated capacity

  • Byte-level processing for inline escaping (YJIT-friendly tight loops)

  • Simplified escaping rules: [ breaks links, so ] doesn’t need escaping

Examples:

Basic escaping

escaper = Markbridge::Renderers::Discourse::MarkdownEscaper.new
escaper.escape("# Heading")      # => "\\# Heading"
escaper.escape("*emphasis*")     # => "\\*emphasis\\*"
escaper.escape("foo -- bar")     # => "foo \\-\\- bar"

HTML is escaped

escaper.escape("<div>content</div>")  # => "\\<div>content\\</div>"
escaper.escape("<?php echo 1; ?>")    # => "\\<?php echo 1; ?>"

Constant Summary

MAYBE_SPECIAL =

Fast-path check: matches any character that might need escaping. Only includes characters we actually escape (], {, }, and ^ were removed). > is needed for blockquote detection at line start.

/[\\`*_\[#+\-.!<>&|~=>)]/
MAYBE_INDENTED_CODE =

Check for indented code on any line. Matches: 4+ spaces, a tab, or space+tab combinations that reach column 4+.

/(?:^|\n)(?: {4}|\t| {1,3}\t)/
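To make the "column 4+" rule concrete, here is a minimal sketch that copies the pattern above and probes it with a few inputs. The `indented_code?` helper is hypothetical and exists only for this illustration; it is not part of the library.

```ruby
# Pattern copied from MAYBE_INDENTED_CODE above.
MAYBE_INDENTED_CODE = /(?:^|\n)(?: {4}|\t| {1,3}\t)/

# Hypothetical helper (not a library method): true when any line of the
# text starts with whitespace that would open an indented code block.
def indented_code?(text)
  MAYBE_INDENTED_CODE.match?(text)
end

indented_code?("    four spaces")   # => true
indented_code?("\ttab")             # => true
indented_code?(" \tspace then tab") # => true  (the tab jumps to column 4)
indented_code?("   three spaces")   # => false
indented_code?("a    b")            # => false (spaces are mid-line)
indented_code?("first\n    second") # => true  (second line is indented)
```

Because the pattern is not anchored with `\A`, it can be run once over the whole input rather than once per line.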
ATX_HEADING =

Block-level patterns

/\A\#{1,6}(?=[ \t]|$)/
BLOCK_QUOTE =
/\A>/
BULLET_LIST =

List markers followed by space, tab, or end of line

/\A[-+*](?=[ \t]|$)/
ORDERED_LIST =
/\A(\d+)([.)])(?=[ \t])/
THEMATIC_BREAK_DASH =
/\A(?:-[ \t]*){3,}$/
THEMATIC_BREAK_STAR =
/\A(?:\*[ \t]*){3,}$/
THEMATIC_BREAK_UNDERSCORE =
/\A(?:_[ \t]*){3,}$/
FENCED_CODE_BACKTICK =
/\A`{3,}[^`]*$/
FENCED_CODE_TILDE =
/\A~{3,}/
SETEXT_UNDERLINE_EQUALS =
/\A=+[ \t]*$/
SETEXT_UNDERLINE_DASH =
/\A-+[ \t]*$/
INDENTED_CODE =

Indented code: 4+ spaces, tab at start, or space+tab reaching column 4+

/\A(?: {4}|\t| {1,3}\t)/
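The lookaheads in the block-level patterns are what separate real Markdown constructs from look-alikes such as "#hashtag" or "1.5". The sketch below copies four of the patterns above and classifies single lines. The `block_kind` helper, the `BLOCK_PATTERNS` table, and the order of the checks are assumptions made for this example; the page does not show how the library applies these patterns.

```ruby
# Patterns copied from the constants above.
ATX_HEADING = /\A\#{1,6}(?=[ \t]|$)/
BULLET_LIST = /\A[-+*](?=[ \t]|$)/
ORDERED_LIST = /\A(\d+)([.)])(?=[ \t])/
THEMATIC_BREAK_DASH = /\A(?:-[ \t]*){3,}$/

# Hypothetical lookup table; thematic breaks come first so that "- - -"
# is not reported as a bullet list.
BLOCK_PATTERNS = {
  thematic_break: THEMATIC_BREAK_DASH,
  atx_heading: ATX_HEADING,
  bullet_list: BULLET_LIST,
  ordered_list: ORDERED_LIST
}.freeze

# Hypothetical helper: returns the first matching kind, or nil.
def block_kind(line)
  BLOCK_PATTERNS.each { |kind, pattern| return kind if pattern.match?(line) }
  nil
end

block_kind("# Title")     # => :atx_heading
block_kind("#hashtag")    # => nil (no space after the #)
block_kind("- item")      # => :bullet_list
block_kind("- - -")       # => :thematic_break
block_kind("1. one")      # => :ordered_list
block_kind("1.5 litres")  # => nil (the marker is not followed by a space)
```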
INLINE_SPECIAL =

Inline quick-check pattern (includes < for HTML tag escaping)

/[\\*_`\[!|<&~-]/
ENTITY_REF =

Entity reference pattern (we escape these to prevent conversion)

/\A&(?:\#[xX][0-9a-fA-F]{1,6}|\#[0-9]{1,7}|[a-zA-Z][a-zA-Z0-9]{0,31});/
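As an illustration of what ENTITY_REF accepts, the following sketch copies the pattern and extracts a leading entity reference. The `entity_at_start` helper is hypothetical; the sketch only shows matching and makes no claim about how the library rewrites a matched entity.

```ruby
# Pattern copied from ENTITY_REF above.
ENTITY_REF = /\A&(?:\#[xX][0-9a-fA-F]{1,6}|\#[0-9]{1,7}|[a-zA-Z][a-zA-Z0-9]{0,31});/

# Hypothetical helper: returns the entity reference at the start of the
# text, or nil when the text does not start with one.
def entity_at_start(text)
  text[ENTITY_REF]
end

entity_at_start("&amp; more")  # => "&amp;"      (named entity)
entity_at_start("&#169;!")     # => "&#169;"     (decimal)
entity_at_start("&#x1F600;")   # => "&#x1F600;"  (hexadecimal)
entity_at_start("& co;")       # => nil (a bare ampersand)
entity_at_start("&amp more")   # => nil (missing semicolon)
```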
HTML_ATTR =

HTML tag pattern (we escape these, but NOT autolinks). Handles quoted attributes, which can contain > characters. Attribute patterns: name="value" | name='value' | name=value | name

/(?:\s+[a-zA-Z_:][a-zA-Z0-9_.:-]*(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+))?)/
HTML_TAG =
%r{\A</?[a-zA-Z][a-zA-Z0-9-]*#{HTML_ATTR}*\s*/?>}
%r{\A<(?:https?://|mailto:)[^>\s]*>|\A<[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*>}i
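The quoted-attribute handling is easiest to see with a value that contains a `>`. The sketch below copies HTML_ATTR and HTML_TAG and shows which prefix of the input is treated as one tag. It also checks that HTML_TAG on its own does not match autolink syntax; how the library orders its tag and autolink checks is not shown on this page, so the sketch makes no assumption about that. The `leading_tag` helper is hypothetical.

```ruby
# Patterns copied from HTML_ATTR and HTML_TAG above.
HTML_ATTR = /(?:\s+[a-zA-Z_:][a-zA-Z0-9_.:-]*(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+))?)/
HTML_TAG = %r{\A</?[a-zA-Z][a-zA-Z0-9-]*#{HTML_ATTR}*\s*/?>}

# Hypothetical helper: returns the complete tag at the start of the text,
# or nil when the text does not start with one.
def leading_tag(text)
  text[HTML_TAG]
end

leading_tag('<div class="a>b">rest')  # => '<div class="a>b">'
leading_tag("<br/>")                  # => "<br/>"
leading_tag("</span> x")              # => "</span>"
leading_tag("<https://example.com>")  # => nil (autolink, not a tag)
leading_tag("<user@example.com>")     # => nil (email autolink)
```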
HTML_TAG_START =

Match HTML-like constructs that need escaping:

  • Processing instructions: <?php, <?xml, etc.

  • SGML declarations: <!DOCTYPE, <!ELEMENT, <![CDATA[, <!--, etc.

  • Incomplete/multi-line HTML tags: <div followed by attributes on next line

  • Custom elements: <my-component>, <responsive-image>

The (?:[\s/]|$) ensures we don't match comparisons like "a < b"

%r{\A<(?:[?!]|/?\s*[a-zA-Z][a-zA-Z0-9-]*(?:[\s/]|$))}
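A few probes show which openings HTML_TAG_START accepts. This is a minimal sketch that copies the pattern above; the `html_like_start?` helper is hypothetical, and the inputs are chosen for illustration rather than taken from the library's tests.

```ruby
# Pattern copied from HTML_TAG_START above.
HTML_TAG_START = %r{\A<(?:[?!]|/?\s*[a-zA-Z][a-zA-Z0-9-]*(?:[\s/]|$))}

# Hypothetical helper: true when the text opens with an HTML-like
# construct that the pattern recognizes.
def html_like_start?(text)
  HTML_TAG_START.match?(text)
end

html_like_start?("<?php echo 1; ?>")      # => true  (processing instruction)
html_like_start?("<!DOCTYPE html>")       # => true  (SGML declaration)
html_like_start?("<![CDATA[x]]>")         # => true
html_like_start?("<div\n  class=\"x\">")  # => true  (tag continues on next line)
html_like_start?("<3 hearts")             # => false (no tag name)
html_like_start?("<= 5")                  # => false
html_like_start?("<b>")                   # => false (complete tags are HTML_TAG's job)
```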
BACKSLASH =

Byte constants for inline processing

92  # \
BANG =
33  # !
HASH =
35  # #
AMP =
38  # &
STAR =
42  # *
PLUS =
43  # +
DASH =
45  # -
LT =
60  # <
EQUALS =
61
GT =
62  # >
BRACKET_OPEN =
91  # [
UNDERSCORE =
95  # _
BACKTICK =
96
PIPE =
124  # |
TILDE =
126  # ~
SPACE =
32
TAB =
9
DIGIT_0 =
48
DIGIT_9 =
57
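The byte constants suggest a loop that compares integers instead of one-character strings. The following is a minimal sketch of that general technique, not the library's implementation: the `escape_bytes` helper, the `ESCAPED_BYTES` set, and the choice of which bytes to escape are all assumptions made for this example.

```ruby
# Byte values taken from the constants above.
BACKSLASH = 92
STAR = 42
UNDERSCORE = 95
BACKTICK = 96

# Hypothetical, deliberately small set of bytes to escape.
ESCAPED_BYTES = [BACKSLASH, STAR, UNDERSCORE, BACKTICK].freeze

# Hypothetical helper: copies the input byte by byte into a pre-sized
# binary buffer, inserting a backslash before each byte in ESCAPED_BYTES.
def escape_bytes(text)
  result = String.new(capacity: text.bytesize + 8) # binary (ASCII-8BIT) buffer
  text.each_byte do |byte|
    result << BACKSLASH if ESCAPED_BYTES.include?(byte)
    result << byte
  end
  # Restore the original encoding once all bytes are copied.
  result.force_encoding(text.encoding)
end

escape_bytes("a*b_c")     # => "a\\*b\\_c"
escape_bytes("café *x*")  # => "café \\*x\\*"
```

All escaped bytes are ASCII, and in UTF-8 no byte of a multi-byte character falls in the ASCII range, so scanning bytes cannot split or misread a non-ASCII character.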

Instance Method Summary

Constructor Details

#initialize(escape_hard_line_breaks: false) ⇒ MarkdownEscaper

Returns a new instance of MarkdownEscaper.

Parameters:

  • escape_hard_line_breaks (Boolean) (defaults to: false)

    When true, strips runs of two or more trailing spaces before a newline to prevent CommonMark hard line breaks (<br/>). Defaults to false because Discourse has trailing-space hard line breaks disabled by default.



# File 'lib/markbridge/renderers/discourse/markdown_escaper.rb', line 37

def initialize(escape_hard_line_breaks: false)
  @escape_hard_line_breaks = escape_hard_line_breaks
  @inline_content = nil
  @inline_result = nil
  @inline_len = 0
end
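The effect of escape_hard_line_breaks can be reproduced in isolation. The sketch below reuses the substitution from the #escape source listed on this page; the `strip_hard_breaks` helper name is hypothetical.

```ruby
# Hypothetical helper wrapping the substitution that #escape performs when
# escape_hard_line_breaks is true: two or more spaces before a newline are
# dropped, so the newline no longer forms a hard line break.
def strip_hard_breaks(text)
  text.include?("  \n") ? text.gsub(/  +\n/, "\n") : text
end

strip_hard_breaks("line one  \nline two")  # => "line one\nline two"
strip_hard_breaks("a    \nb")              # => "a\nb"
strip_hard_breaks("line one \nline two")   # => "line one \nline two" (single space kept)
```

The `include?` guard avoids running `gsub`, and allocating a new string, when the text contains no candidate sequence.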

Instance Method Details

#escape(text) ⇒ String

Note:

Multi-line HTML tags and blocks are handled by escaping the opening <.

Escapes markdown special characters in the given text.

Handles both block-level constructs (headings, lists, code blocks, HTML blocks) and inline formatting (emphasis, code spans, links, inline HTML). Autolinks (<https://…>, <email@domain>) are intentionally preserved.

Parameters:

  • text (String, nil)

    the text to escape

Returns:

  • (String)

    the escaped text, or empty string if input is nil



# File 'lib/markbridge/renderers/discourse/markdown_escaper.rb', line 124

def escape(text)
  return "".freeze if text.nil?
  return text if text.empty?

  # Neutralize hard line breaks (trailing 2+ spaces before newline)
  text = text.gsub(/  +\n/, "\n") if @escape_hard_line_breaks && text.include?("  \n")

  return text unless MAYBE_SPECIAL.match?(text) || MAYBE_INDENTED_CODE.match?(text)

  escape_text(text)
end
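The guard clauses above mean that plain text comes back as the very same String object. The sketch below mirrors those guards with the two patterns copied from the constant summary. The `escape_with_fast_path` method is hypothetical, and the block stands in for escape_text, whose body is not shown on this page.

```ruby
# Patterns copied from MAYBE_SPECIAL and MAYBE_INDENTED_CODE.
MAYBE_SPECIAL = /[\\`*_\[#+\-.!<>&|~=>)]/
MAYBE_INDENTED_CODE = /(?:^|\n)(?: {4}|\t| {1,3}\t)/

# Hypothetical stand-in for #escape: same guard order, with the slow path
# supplied by the caller as a block.
def escape_with_fast_path(text)
  return "" if text.nil?
  return text if text.empty?
  return text unless MAYBE_SPECIAL.match?(text) || MAYBE_INDENTED_CODE.match?(text)

  yield text
end

plain = "nothing special here"
escape_with_fast_path(plain) { |t| "[slow path] #{t}" }.equal?(plain)
# => true (same object, the block never runs)

escape_with_fast_path(nil) { |t| "[slow path] #{t}" }
# => ""

escape_with_fast_path("*bold*") { |t| "[slow path] #{t}" }
# => "[slow path] *bold*"
```

One consequence of returning the original object is that callers should not mutate the result in place if they still need the input.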