Class: RedQuilt::Inline::Lexer

Inherits:
Object
Defined in:
lib/red_quilt/inline/lexer.rb

Overview

Scans a byte range of the document source and emits inline tokens into a caller-owned Tokens storage.

The lexer never copies the source string; all positions are absolute byte offsets into @source. The caller is responsible for clearing the Tokens storage between invocations if it is being reused.

Constant Summary

SPECIAL_BYTES =

Bytes whose appearance ends a TEXT run. Anything not in this set is plain text content. Newline is included so LINE_ENDING gets its own token.

begin
  a = Array.new(256, false)
  # *, _, `, [, ], !, <, &, \, \n, ~ (GFM strikethrough)
  [0x2A, 0x5F, 0x60, 0x5B, 0x5D, 0x21, 0x3C, 0x26, 0x5C, 0x0A, 0x7E].each { |b| a[b] = true }
  a.freeze
end
SPECIAL_BYTE_RE =

Same set as SPECIAL_BYTES, for String#byteindex to jump over long plain-text runs at C speed.

/[*_`\[\]!<&\\\n~]/
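
The relationship between the two constants can be checked directly. The sketch below copies both definitions into a standalone script, confirms that the table and the regex agree on all 256 byte values, and uses String#byteindex (Ruby 3.2 or later) to find where a plain-text run ends. TABLES_AGREE and text_run_end are hypothetical names used for illustration only; they are not part of the documented API.

```ruby
# Standalone copies of the two constants documented above.
SPECIAL_BYTES = begin
  a = Array.new(256, false)
  # *, _, `, [, ], !, <, &, \, \n, ~
  [0x2A, 0x5F, 0x60, 0x5B, 0x5D, 0x21, 0x3C, 0x26, 0x5C, 0x0A, 0x7E].each { |b| a[b] = true }
  a.freeze
end
SPECIAL_BYTE_RE = /[*_`\[\]!<&\\\n~]/

# The lookup table and the regex should classify every byte identically.
TABLES_AGREE = (0..255).all? { |b| SPECIAL_BYTES[b] == SPECIAL_BYTE_RE.match?(b.chr) }

# Hypothetical helper: byte offset at which the TEXT run starting at `pos` ends.
# `source_b` must be binary-encoded so that any byte offset is a valid search start.
def text_run_end(source_b, pos, limit)
  hit = source_b.byteindex(SPECIAL_BYTE_RE, pos)
  (hit.nil? || hit > limit) ? limit : hit
end

source_b = "héllo *world*".b
run_end = text_run_end(source_b, 0, source_b.bytesize)

puts TABLES_AGREE
puts run_end
puts source_b.byteslice(0, run_end).force_encoding(Encoding::UTF_8)
```

The table suits per-byte dispatch once the scanner is sitting on a byte, while the regex lets a single byteindex call skip an arbitrarily long run of ordinary text.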
/<([A-Za-z][A-Za-z0-9+.-]{1,31}:[^<>\u0000-\u0020\u007F]*)>/
/<([a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*)>/
HTML_OPEN_TAG_RE =

CommonMark spec 6.6 “Raw HTML” defines six forms: open tag, closing tag, HTML comment, processing instruction, declaration, and CDATA section. Attribute values are allowed to span lines. HTML tag separators are restricted to space, tab, CR, and LF per the spec; `\s` would also match form feed (U+000C) and vertical tab (U+000B), which CommonMark disallows.

%r{<[A-Za-z][A-Za-z0-9-]*(?:[ \t\r\n]+[A-Za-z_:][A-Za-z0-9_.:-]*(?:[ \t\r\n]*=[ \t\r\n]*(?:"[^"]*"|'[^']*'|[^ \t\r\n"'=<>`]+))?)*[ \t\r\n]*/?>}
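
To illustrate why the pattern spells out the whitespace class instead of using `\s`, the following sketch copies the open-tag regex into a standalone script and tries a few inputs. The helper whole_match? is a hypothetical name introduced here; how the lexer itself applies the pattern is not shown in this page.

```ruby
# Copied from the constant documented above.
HTML_OPEN_TAG_RE = %r{<[A-Za-z][A-Za-z0-9-]*(?:[ \t\r\n]+[A-Za-z_:][A-Za-z0-9_.:-]*(?:[ \t\r\n]*=[ \t\r\n]*(?:"[^"]*"|'[^']*'|[^ \t\r\n"'=<>`]+))?)*[ \t\r\n]*/?>}

# Hypothetical helper: true when the pattern matches the entire string.
def whole_match?(re, str)
  m = re.match(str)
  !m.nil? && m[0] == str
end

# Ruby's \s accepts form feed; the explicit class used in the pattern does not.
puts "\f".match?(/\s/)
puts "\f".match?(/[ \t\r\n]/)

# An ordinary tag, and a tag whose quoted attribute value spans two lines.
puts whole_match?(HTML_OPEN_TAG_RE, %(<a href="x">))
puts whole_match?(HTML_OPEN_TAG_RE, %(<a title="two\nlines">))

# A form feed between the tag name and the attribute is not a valid separator.
puts whole_match?(HTML_OPEN_TAG_RE, %(<a\fhref="x">))
```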
HTML_CLOSING_TAG_RE =
%r{</[A-Za-z][A-Za-z0-9-]*[ \t\r\n]*>}
HTML_COMMENT_RE =

Comment: `<!-->`, `<!--->`, or `<!-- text -->` where text doesn't start with `>` or `->`, end with `-`, or contain `--`.

%r{<!-->|<!--->|<!--(?!>)(?!->)[\s\S]*?(?<!-)-->}
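
The comment rules are easier to follow against concrete inputs. This sketch copies the pattern into a standalone script and prints what it matches for a handful of strings; comment_match is a hypothetical helper name. One observation from running it: the pattern as written enforces the start and end conditions, but it does not appear to reject a `--` in the middle of the comment text.

```ruby
# Copied from the constant documented above.
HTML_COMMENT_RE = %r{<!-->|<!--->|<!--(?!>)(?!->)[\s\S]*?(?<!-)-->}

# Hypothetical helper: the matched text, or nil when nothing matches.
def comment_match(str)
  m = HTML_COMMENT_RE.match(str)
  m && m[0]
end

# The two degenerate forms listed first in the alternation.
p comment_match("<!-->")
p comment_match("<!--->")

# An ordinary comment, and one whose text spans a line break.
p comment_match("<!-- text -->")
p comment_match("<!-- a\nb -->")

# Text ending with "-" is rejected by the lookbehind before the closing "-->".
p comment_match("<!-- text --->")

# An interior "--" still matches with this pattern.
p comment_match("<!-- a -- b -->")
```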
HTML_PROC_INST_RE =
%r{<\?[\s\S]*?\?>}
HTML_DECLARATION_RE =
%r{<![A-Za-z][^>]*>}
HTML_CDATA_RE =
%r{<!\[CDATA\[[\s\S]*?\]\]>}

Instance Method Summary

Constructor Details

#initialize(source) ⇒ Lexer

Entity regex and decoder live on the enclosing Inline module so the same digit-count caps and U+FFFD replacement apply across the lexer, the inline builder, and the reference-definition parser. See lib/red_quilt/inline/html_entities.rb.



# File 'lib/red_quilt/inline/lexer.rb', line 57

def initialize(source)
  @source = source
  # A binary-encoded view for String#byteindex hot paths (byteindex
  # on a UTF-8 string raises when the offset falls inside a
  # multibyte sequence; binary treats every byte as its own char).
  @source_b = source.b
  @ss = StringScanner.new(source)
end
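
The comment about the binary view can be reproduced in isolation. The sketch below is a minimal demonstration, assuming Ruby 3.2 or later for String#byteindex; search_from is a hypothetical helper that returns :index_error instead of letting the exception propagate.

```ruby
# Hypothetical helper: search for "*" starting at a byte offset, reporting
# the IndexError that byteindex raises for an offset inside a character.
def search_from(str, offset)
  str.byteindex("*", offset)
rescue IndexError
  :index_error
end

utf8 = "é*"       # "é" occupies bytes 0..1, "*" sits at byte offset 2
binary = utf8.b   # same bytes, tagged as binary (ASCII-8BIT)

# Offset 1 is the second byte of "é": invalid for UTF-8, fine for binary.
p search_from(utf8, 1)
p search_from(binary, 1)

# Offset 2 is a character boundary in both encodings.
p search_from(utf8, 2)
p search_from(binary, 2)
```

Because the lexer works with absolute byte offsets supplied by the caller, searching the binary view avoids having to prove that every offset lands on a character boundary.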

Instance Method Details

#lex_into(tokens, start_byte, end_byte) ⇒ Object

Scans the byte range of @source from start_byte to end_byte and emits tokens into the given tokens storage. Returns the tokens object that was passed in.



# File 'lib/red_quilt/inline/lexer.rb', line 68

def lex_into(tokens, start_byte, end_byte)
  @ss.pos = start_byte
  @start = start_byte
  @end = end_byte
  scan(tokens)
  tokens
end
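
The calling convention (caller-owned storage, absolute byte offsets, the same object returned) can be sketched with a stand-in. MiniLexer below is hypothetical and is not RedQuilt's implementation: it uses a plain Array in place of the Tokens storage, emits only two invented token kinds, and treats end_byte as exclusive, which is an assumption.

```ruby
# Hypothetical stand-in illustrating the lex_into contract described above.
class MiniLexer
  SPECIAL_RE = /[*_`\[\]!<&\\\n~]/

  def initialize(source)
    @source_b = source.b # binary view so any byte offset is searchable
  end

  # Appends [kind, start_byte, end_byte] triples to `tokens` and returns it.
  def lex_into(tokens, start_byte, end_byte)
    pos = start_byte
    while pos < end_byte
      hit = @source_b.byteindex(SPECIAL_RE, pos)
      hit = end_byte if hit.nil? || hit > end_byte
      if hit > pos
        tokens << [:text, pos, hit]        # plain run up to the next special byte
        pos = hit
      else
        tokens << [:special, pos, pos + 1] # one special byte
        pos += 1
      end
    end
    tokens
  end
end

lexer = MiniLexer.new("a*b\nc*d")
tokens = []

lexer.lex_into(tokens, 0, 3) # first line
p tokens

tokens.clear                 # the caller resets reused storage
lexer.lex_into(tokens, 4, 7) # second line; offsets stay absolute
p tokens
```

Skipping the clear call would leave the first line's tokens in place and append the second line's after them, which is why the overview makes clearing the caller's responsibility.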