Class: RedQuilt::Inline::Lexer
- Inherits: Object
  - Object
  - RedQuilt::Inline::Lexer
- Defined in: lib/red_quilt/inline/lexer.rb
Overview
Scans a byte range of the document source and emits inline tokens into a caller-owned Tokens storage.
The lexer never copies the source string; all positions are absolute byte offsets into @source. The caller is responsible for clearing the Tokens storage between invocations if it is being reused.
Constant Summary
- SPECIAL_BYTES =
Bytes whose appearance ends a TEXT run. Anything not in this set is plain text content. Newline is included so LINE_ENDING gets its own token.
begin
  a = Array.new(256, false)
  # *, _, `, [, ], !, <, &, \, \n, ~ (GFM strikethrough)
  [0x2A, 0x5F, 0x60, 0x5B, 0x5D, 0x21, 0x3C, 0x26, 0x5C, 0x0A, 0x7E].each { |b| a[b] = true }
  a.freeze
end
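A minimal sketch of how a 256-entry boolean table like this can be used to find the end of a TEXT run. The constant name `SPECIAL` and the helper `text_run_end` are hypothetical, introduced only for illustration; the lexer's actual scan loop is not shown on this page.

```ruby
# Same table as the listing above, under a hypothetical name.
SPECIAL = begin
  a = Array.new(256, false)
  [0x2A, 0x5F, 0x60, 0x5B, 0x5D, 0x21, 0x3C, 0x26, 0x5C, 0x0A, 0x7E].each { |b| a[b] = true }
  a.freeze
end

# Hypothetical helper: returns the byte offset at which the plain-text run
# starting at `from` ends (the first special byte, or the end of the string).
def text_run_end(source, from)
  pos = from
  limit = source.bytesize
  pos += 1 while pos < limit && !SPECIAL[source.getbyte(pos)]
  pos
end

src = "h\u00E9llo *world*"   # "é" is two bytes in UTF-8
stop = text_run_end(src, 0)  # byte offset of the first "*"
run = src.byteslice(0, stop) # the plain-text run before it
```

Indexing by `getbyte` keeps the check a single array lookup per byte and never splits on multibyte characters, because every byte of a UTF-8 multibyte sequence is 0x80 or above and none of the special bytes are.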
- SPECIAL_BYTE_RE =
Same set as SPECIAL_BYTES, for String#byteindex to jump over long plain-text runs at C speed.
/[*_`\[\]!<&\\\n~]/
- URI_AUTOLINK_RE =
Anchored regexes for StringScanner#scan (still used by scan_angle / scan_amp). StringScanner anchors at the current pos, so no `\G` is needed.
URI autolink rejects every ASCII control char (U+0000-U+001F, U+007F) plus space (U+0020); CommonMark 6.5 forbids ASCII control characters, space, <, or >.
/<([A-Za-z][A-Za-z0-9+.-]{1,31}:[^<>\u0000-\u0020\u007F]*)>/
- EMAIL_AUTOLINK_RE =
/<([a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*)>/
- HTML_OPEN_TAG_RE =
CommonMark spec 6.6 "Raw HTML" defines six forms: open tag, closing tag, HTML comment, processing instruction, declaration, CDATA section. Attribute values are allowed to span lines. HTML tag separators are restricted to space/tab/CR/LF per spec; `\s` would also match form feed (U+000C) and vertical tab (U+000B), which CommonMark disallows.
%r{<[A-Za-z][A-Za-z0-9-]*(?:[ \t\r\n]+[A-Za-z_:][A-Za-z0-9_.:-]*(?:[ \t\r\n]*=[ \t\r\n]*(?:"[^"]*"|'[^']*'|[^ \t\r\n"'=<>`]+))?)*[ \t\r\n]*/?>}
- HTML_CLOSING_TAG_RE =
%r{</[A-Za-z][A-Za-z0-9-]*[ \t\r\n]*>}
- HTML_COMMENT_RE =
Comment: `<!-->`, `<!--->`, or `<!-- text -->` where text doesn't start with `>` or `->`, end with `-`, or contain `--`.
%r{<!-->|<!--->|<!--(?!>)(?!->)[\s\S]*?(?<!-)-->}
- HTML_PROC_INST_RE =
%r{<\?[\s\S]*?\?>}
- HTML_DECLARATION_RE =
%r{<![A-Za-z][^>]*>}
- HTML_CDATA_RE =
%r{<!\[CDATA\[[\s\S]*?\]\]>}
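The following sketch illustrates two points made in the constant descriptions above: `StringScanner#scan` only matches at the current position, and the explicit `[ \t\r\n]` separator class is stricter than `\s`. The two regexes are copied from the listing; the sample input and the `loose` variant are assumptions made up for contrast.

```ruby
require "strscan"

URI_AUTOLINK_RE = /<([A-Za-z][A-Za-z0-9+.-]{1,31}:[^<>\u0000-\u0020\u007F]*)>/
HTML_CLOSING_TAG_RE = %r{</[A-Za-z][A-Za-z0-9-]*[ \t\r\n]*>}

ss = StringScanner.new("see <https://example.com/a?b=1> now")

# At pos 0 the next byte is "s", so the anchored scan fails without
# searching ahead for the "<" at byte 4.
ss.pos = 0
miss = ss.scan(URI_AUTOLINK_RE)

# At pos 4 the scanner sits on "<" and the autolink matches.
ss.pos = 4
hit = ss.scan(URI_AUTOLINK_RE) # whole match, angle brackets included
uri = ss[1]                    # first capture group: the bare URI

# Hypothetical variant using \s, shown only to contrast with the strict class.
loose = %r{</[A-Za-z][A-Za-z0-9-]*\s*>}
strict_accepts_ff = HTML_CLOSING_TAG_RE.match?("</p\f>")
loose_accepts_ff = loose.match?("</p\f>")
```

Because the scanner is positioned explicitly before each `scan`, the patterns need no leading anchor of their own.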
Instance Method Summary
-
#initialize(source) ⇒ Lexer
constructor
Entity regex and decoder live on the enclosing Inline module so the same digit-count caps and U+FFFD replacement apply across the lexer, the inline builder, and the reference-definition parser.
-
#lex_into(tokens, start_byte, end_byte) ⇒ Object
Scans @source and emits tokens.
Constructor Details
#initialize(source) ⇒ Lexer
Entity regex and decoder live on the enclosing Inline module so the same digit-count caps and U+FFFD replacement apply across the lexer, the inline builder, and the reference-definition parser. See lib/red_quilt/inline/html_entities.rb.
# File 'lib/red_quilt/inline/lexer.rb', line 57
def initialize(source)
  @source = source
  # A binary-encoded view for String#byteindex hot paths (byteindex
  # on a UTF-8 string raises when the offset falls inside a
  # multibyte sequence; binary treats every byte as its own char).
  @source_b = source.b
  @ss = StringScanner.new(source)
end
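To make the comment about the binary view concrete, here is a small sketch of searching for the next special byte from an arbitrary byte offset. The helper `next_special` is hypothetical, and the fallback to `String#index` is an assumption added so the sketch also runs on Rubies older than 3.2, where `String#byteindex` is not available; on a binary string the two give the same offsets.

```ruby
SPECIAL_BYTE_RE = /[*_`\[\]!<&\\\n~]/

source = "na\u00EFve *text*" # "ï" occupies bytes 2 and 3
binary = source.b            # same bytes, ASCII-8BIT encoding, new String object

# Hypothetical helper: byte offset of the next special byte at or after
# `offset`, or nil when the rest of the string is plain text.
def next_special(binary, offset)
  if binary.respond_to?(:byteindex)
    binary.byteindex(SPECIAL_BYTE_RE, offset)
  else
    # On a binary string every byte is one character, so the character
    # index returned here is also a byte offset.
    binary.index(SPECIAL_BYTE_RE, offset)
  end
end

# Offset 3 is the second byte of "ï": not a character boundary in the
# UTF-8 string, but a valid position in the binary view.
hit = next_special(binary, 3)
none = next_special("plain".b, 0)
```

The offsets returned this way index the original string's bytes as well, since `String#b` changes only the encoding tag of the copy, not its contents.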
Instance Method Details
#lex_into(tokens, start_byte, end_byte) ⇒ Object
Scans @source and emits tokens. Returns the tokens object that was passed in.
# File 'lib/red_quilt/inline/lexer.rb', line 68
def lex_into(tokens, start_byte, end_byte)
  @ss.pos = start_byte
  @start = start_byte
  @end = end_byte
  scan(tokens)
  tokens
end
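The calling convention (caller-owned storage, absolute byte offsets, the same object returned) can be sketched with a toy stand-in. `ToyLexer` below is not RedQuilt's lexer: it recognises only `*`, uses a plain Array in place of the Tokens storage, emits hypothetical `[type, start, end]` triples, and treats `end_byte` as exclusive, all of which are assumptions made for illustration.

```ruby
# Toy stand-in with the same method shape as Lexer#lex_into.
class ToyLexer
  STAR = 0x2A

  def initialize(source)
    @source = source # kept by reference; never copied or sliced
  end

  # Appends [type, start_byte, end_byte] triples for the byte range
  # start_byte...end_byte to `tokens` and returns that same object.
  def lex_into(tokens, start_byte, end_byte)
    pos = start_byte
    while pos < end_byte
      from = pos
      if @source.getbyte(pos) == STAR
        pos += 1
        tokens << [:star, from, pos]
      else
        pos += 1 while pos < end_byte && @source.getbyte(pos) != STAR
        tokens << [:text, from, pos]
      end
    end
    tokens
  end
end

source = "> quote *em* tail"
lexer = ToyLexer.new(source)

tokens = []                            # caller-owned storage
first = lexer.lex_into(tokens, 2, 12)  # bytes 2...12: "quote *em*"
first_snapshot = first.dup

tokens.clear                           # caller clears before reuse
second = lexer.lex_into(tokens, 13, source.bytesize)
```

Every offset in the emitted triples is relative to the whole source, not to `start_byte`, so a consumer can slice the original string with `byteslice` directly.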