Class: RedQuilt::Inline::Lexer

Inherits: Object

Defined in: lib/red_quilt/inline/lexer.rb

Overview

Scans a byte range of the document source and emits inline tokens into a caller-owned Tokens storage.

The lexer never copies the source string; all positions are absolute byte offsets into @source. The caller is responsible for clearing the Tokens storage between invocations if it is being reused.
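The caller-owned storage contract can be sketched with a toy stand-in. Everything in it is hypothetical: `ToyLexer`, the `[type, start, end]` triples, the plain Array standing in for the Tokens storage, and the end-exclusive reading of the byte range are assumptions made for illustration, not RedQuilt's implementation.

```ruby
# Hypothetical stand-in for illustration only; not RedQuilt's lexer.
class ToyLexer
  SPECIAL = /[*_`\[\]!<&\\\n~]/

  def initialize(source)
    # On a binary copy, String#index returns byte offsets.
    @source_b = source.b
  end

  # Appends [type, start_byte, end_byte] triples for the range
  # start_byte...end_byte (assumed end-exclusive) to the caller's array.
  def lex_into(tokens, start_byte, end_byte)
    pos = start_byte
    while pos < end_byte
      hit = @source_b.index(SPECIAL, pos)
      hit = end_byte if hit.nil? || hit > end_byte
      if hit > pos
        tokens << [:text, pos, hit]       # plain-text run
        pos = hit
      else
        tokens << [:special, pos, pos + 1] # one special byte
        pos += 1
      end
    end
    tokens
  end
end

lexer  = ToyLexer.new("a*b\ncd")
tokens = []

lexer.lex_into(tokens, 0, 3)  # first line, "a*b"
first = tokens.dup

tokens.clear                  # the caller clears the storage before reuse
lexer.lex_into(tokens, 4, 6)  # second line, "cd"
second = tokens.dup           # offsets are still absolute: [[:text, 4, 6]]
```

Without the `tokens.clear` call, the second invocation would append to the tokens of the first, which is why the contract puts clearing on the caller.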

Constant Summary

SPECIAL_BYTES = Bytes whose appearance ends a TEXT run. Anything not in this set is plain text content. Newline is included so LINE_ENDING gets its own token.

begin
  a = Array.new(256, false)
  # *, _, `, [, ], !, <, &, \, \n, ~ (GFM strikethrough)
  [0x2A, 0x5F, 0x60, 0x5B, 0x5D, 0x21, 0x3C, 0x26, 0x5C, 0x0A, 0x7E].each { |b| a[b] = true }
  a.freeze
end

SPECIAL_BYTE_RE = Same set as SPECIAL_BYTES, for String#byteindex to jump over long plain-text runs at C speed.

/[*_`\[\]!<&\\\n~]/
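A short sketch of how the table and the regex relate, using the two constants copied from the listing above. The sample string and the variable names are made up for illustration. `String#index` on a binary copy is used for the jump because it returns byte offsets there, the same result `String#byteindex` (Ruby 3.2+) gives on a binary string.

```ruby
SPECIAL_BYTES = begin
  a = Array.new(256, false)
  [0x2A, 0x5F, 0x60, 0x5B, 0x5D, 0x21, 0x3C, 0x26, 0x5C, 0x0A, 0x7E].each { |b| a[b] = true }
  a.freeze
end
SPECIAL_BYTE_RE = /[*_`\[\]!<&\\\n~]/

# The table and the regex are meant to describe the same set, so they
# should agree on every one of the 256 possible byte values.
mismatches = (0..255).reject { |b| SPECIAL_BYTES[b] == SPECIAL_BYTE_RE.match?(b.chr) }

# Jump over a plain-text run in one call instead of testing byte by byte.
source   = "h\u00E9llo *world*"   # "é" occupies two bytes
source_b = source.b               # binary view: one char per byte
run_end  = source_b.index(SPECIAL_BYTE_RE, 0)
run      = source.byteslice(0, run_end)

# The byte that ended the run is flagged in the lookup table.
ended_by_special = SPECIAL_BYTES[source.getbyte(run_end)]
```

The table suits per-byte dispatch once the scanner is sitting on a byte; the regex suits skipping ahead to the next byte worth dispatching on.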

URI_AUTOLINK_RE = Anchored regexes for StringScanner#scan (still used by scan_angle / scan_amp). StringScanner anchors at the current pos, so no `\G` is needed. The URI autolink pattern rejects every ASCII control character (U+0000-U+001F, U+007F) plus space (U+0020); CommonMark 6.5 forbids ASCII control characters, space, <, and > inside an autolink.

/<([A-Za-z][A-Za-z0-9+.-]{1,31}:[^<>\u0000-\u0020\u007F]*)>/
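A minimal sketch of the anchoring behaviour, with the regex copied from the constant above. The helper `autolink_target` and the sample strings are hypothetical and are not part of the lexer.

```ruby
require "strscan"

URI_AUTOLINK_RE = /<([A-Za-z][A-Za-z0-9+.-]{1,31}:[^<>\u0000-\u0020\u007F]*)>/

# Hypothetical helper: returns the captured URI if an autolink starts
# exactly at byte offset pos, otherwise nil.
def autolink_target(text, pos = 0)
  ss = StringScanner.new(text)
  ss.pos = pos
  ss.scan(URI_AUTOLINK_RE) ? ss[1] : nil
end

text = "see <https://example.com/a?b=c> now"

at_bracket  = autolink_target(text, 4)        # scanner sits on "<"
at_start    = autolink_target(text, 0)        # nil: scan does not search ahead
with_space  = autolink_target("<http://a b>") # nil: space is rejected
short_schem = autolink_target("<a:b>")        # nil: scheme needs 2 to 32 characters
```

Because `scan` only matches at the current position, the unanchored literal behaves as if it began with `\G`; a match further along the string is ignored until the scanner gets there.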

EMAIL_AUTOLINK_RE =

/<([a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*)>/

HTML_OPEN_TAG_RE = CommonMark spec 6.6 “Raw HTML” has six forms: open tag, closing tag, HTML comment, processing instruction, declaration, CDATA section. Attribute values are allowed to span lines. HTML tag separators are restricted to space/tab/CR/LF per spec; `\s` would also match form feed (U+000C) and vertical tab (U+000B), which CommonMark disallows.

%r{<[A-Za-z][A-Za-z0-9-]*(?:[ \t\r\n]+[A-Za-z_:][A-Za-z0-9_.:-]*(?:[ \t\r\n]*=[ \t\r\n]*(?:"[^"]*"|'[^']*'|[^ \t\r\n"'=<>`]+))?)*[ \t\r\n]*/?>}

HTML_CLOSING_TAG_RE =

%r{</[A-Za-z][A-Za-z0-9-]*[ \t\r\n]*>}
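The separator restriction can be checked with a small comparison. `HTML_OPEN_TAG_RE` is copied from above; `LOOSE_OPEN_TAG_RE` is a hypothetical, simplified variant that uses `\s` and exists only to show the difference, and `scans?` is an illustrative helper.

```ruby
require "strscan"

HTML_OPEN_TAG_RE = %r{<[A-Za-z][A-Za-z0-9-]*(?:[ \t\r\n]+[A-Za-z_:][A-Za-z0-9_.:-]*(?:[ \t\r\n]*=[ \t\r\n]*(?:"[^"]*"|'[^']*'|[^ \t\r\n"'=<>`]+))?)*[ \t\r\n]*/?>}

# Hypothetical looser pattern for comparison only (attribute values omitted).
LOOSE_OPEN_TAG_RE = %r{<[A-Za-z][A-Za-z0-9-]*(?:\s+[A-Za-z_:][A-Za-z0-9_.:-]*)*\s*/?>}

# Illustrative helper: does the pattern match at the start of text?
def scans?(re, text)
  !StringScanner.new(text).scan(re).nil?
end

multi_line = "<a href=\"x\"\ntitle='two\nlines'>" # newline separator, value spanning lines
form_feed  = "<a\fhref>"                          # form feed used as a separator

strict_multi_line = scans?(HTML_OPEN_TAG_RE, multi_line)  # accepted
strict_form_feed  = scans?(HTML_OPEN_TAG_RE, form_feed)   # rejected
loose_form_feed   = scans?(LOOSE_OPEN_TAG_RE, form_feed)  # \s lets it through
```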

HTML_COMMENT_RE = Comment: `<!-->`, `<!--->`, or `<!-- text -->` where text doesn’t start with `>` or `->`, end with `-`, or contain `--`.

%r{<!-->|<!--->|<!--(?!>)(?!->)[\s\S]*?(?<!-)-->}
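A few probes against the pattern, copied from the constant above, show which branch handles which input. The helper `scan_comment` and the sample strings are illustrative only.

```ruby
require "strscan"

HTML_COMMENT_RE = %r{<!-->|<!--->|<!--(?!>)(?!->)[\s\S]*?(?<!-)-->}

# Illustrative helper: returns the matched comment text or nil.
def scan_comment(text)
  StringScanner.new(text).scan(HTML_COMMENT_RE)
end

empty_short  = scan_comment("<!-->")           # first alternative
empty_dash   = scan_comment("<!--->")          # second alternative
ordinary     = scan_comment("<!-- hi -->")     # general form
trailing     = scan_comment("<!-- a--->")      # text ends with "-": no match
interior     = scan_comment("<!-- a -- b -->") # interior "--"
```

In this sketch the last probe still matches: the pattern on its own rules out a text that ends with `-` (via the lookbehind) but not an interior `--`. Whether that case is checked elsewhere in the lexer is not shown in this section.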

HTML_PROC_INST_RE =

%r{<\?[\s\S]*?\?>}

HTML_DECLARATION_RE =

%r{<![A-Za-z][^>]*>}

HTML_CDATA_RE =

%r{<!\[CDATA\[[\s\S]*?\]\]>}
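To see the six raw-HTML forms side by side, the sketch below tries each pattern in turn at a `<`. The regexes are copied from the constants above; the `RAW_HTML_FORMS` table, the `raw_html_form` helper, and the order of attempts are assumptions for illustration and do not describe how `scan_angle` is actually written.

```ruby
require "strscan"

# Hypothetical lookup table built from the six documented patterns.
RAW_HTML_FORMS = {
  open_tag:    %r{<[A-Za-z][A-Za-z0-9-]*(?:[ \t\r\n]+[A-Za-z_:][A-Za-z0-9_.:-]*(?:[ \t\r\n]*=[ \t\r\n]*(?:"[^"]*"|'[^']*'|[^ \t\r\n"'=<>`]+))?)*[ \t\r\n]*/?>},
  closing_tag: %r{</[A-Za-z][A-Za-z0-9-]*[ \t\r\n]*>},
  comment:     %r{<!-->|<!--->|<!--(?!>)(?!->)[\s\S]*?(?<!-)-->},
  proc_inst:   %r{<\?[\s\S]*?\?>},
  declaration: %r{<![A-Za-z][^>]*>},
  cdata:       %r{<!\[CDATA\[[\s\S]*?\]\]>}
}.freeze

# Hypothetical helper: names the first form that matches at the start
# of text, or returns nil when none does.
def raw_html_form(text)
  ss = StringScanner.new(text)
  RAW_HTML_FORMS.each { |name, re| return name if ss.scan(re) }
  nil
end

forms = [
  "<br/>", "</p >", "<!-- c -->", "<?php echo 1; ?>",
  "<!DOCTYPE html>", "<![CDATA[x]]>", "< a>"
].map { |t| raw_html_form(t) }
```

The last input, `< a>`, matches none of the forms because every pattern requires a specific character immediately after `<`.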

Instance Method Summary

#initialize(source) ⇒ Lexer constructor

Entity regex and decoder live on the enclosing Inline module so the same digit-count caps and U+FFFD replacement apply across the lexer, the inline builder, and the reference-definition parser.

#lex_into(tokens, start_byte, end_byte) ⇒ Object

Scans @source and emits tokens.

Constructor Details

#initialize(source) ⇒ `Lexer`

Entity regex and decoder live on the enclosing Inline module so the same digit-count caps and U+FFFD replacement apply across the lexer, the inline builder, and the reference-definition parser. See lib/red_quilt/inline/html_entities.rb.

# File 'lib/red_quilt/inline/lexer.rb', line 57

def initialize(source)
  @source = source
  # A binary-encoded view for String#byteindex hot paths (byteindex
  # on a UTF-8 string raises when the offset falls inside a
  # multibyte sequence; binary treats every byte as its own char).
  @source_b = source.b
  @ss = StringScanner.new(source)
end
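The reason for keeping a binary view next to the UTF-8 source can be shown in a few lines. The sample string and variable names are illustrative; the `String#byteindex` part is guarded because that method only exists from Ruby 3.2 on.

```ruby
source   = "\u00E9*"   # "é*": é is two bytes, so "*" sits at byte offset 2
source_b = source.b    # same bytes, binary encoding

# String#index counts characters on the UTF-8 string, bytes on the binary copy.
char_index = source.index("*")    # 1
byte_index = source_b.index("*")  # 2

if source.respond_to?(:byteindex) # Ruby 3.2+
  # Offset 1 falls inside the two-byte "é".
  utf8_outcome =
    begin
      source.byteindex("*", 1)
    rescue IndexError
      :index_error
    end
  # The binary view treats every byte as its own character, so any offset is valid.
  binary_outcome = source_b.byteindex("*", 1)
end
```

Since the lexer works in absolute byte offsets, an arbitrary resume position may land mid-character; searching the binary view sidesteps the boundary check.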

Instance Method Details

#lex_into(tokens, start_byte, end_byte) ⇒ `Object`

Scans @source from start_byte to end_byte and emits tokens into the given tokens storage. Returns the tokens object that was passed in.

# File 'lib/red_quilt/inline/lexer.rb', line 68

def lex_into(tokens, start_byte, end_byte)
  @ss.pos = start_byte
  @start = start_byte
  @end = end_byte
  scan(tokens)
  tokens
end