Class: SmarterJSON::Parser

Inherits: Object


Includes: Bytes

Defined in: lib/smarter_json/parser.rb

Overview

Hand-rolled, single-pass finite-state-machine (FSM) parser, organized in four layers.

Layer 1: strict JSON (RFC 8259).

Layer 2: JSON5 additions: line/block comments, trailing comma, unquoted ECMAScript identifier keys, single-quoted strings, hex numbers, leading/trailing decimal points, Infinity/NaN, explicit + sign, \-line-continuation inside strings.

Layer 3: HJSON-inspired additions: #/comment-marker rule, triple-quoted strings, quoteless single-line strings, implicit root object, newline-as-separator, broader unquoted keys, recognized-literals-win.

Layer 4: smarter_json additions: UTF-8 BOM skip, smart/curly quotes, Python literals (True/False/None) and undefined, underscores in numeric literals, and encoding validation (SmarterJSON::EncodingError).

Constant Summary

NOT_NUMERIC =

Object.new

HEX_RE =

/\A[-+]?0[xX][0-9a-fA-F_]+\z/.freeze
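
To make the accepted shapes concrete, the sketch below copies HEX_RE and checks a few candidate tokens. The helper `hex_to_integer` is hypothetical and not part of SmarterJSON; how the parser actually converts a matched token is not shown on this page.

```ruby
# Pattern copied from the constant above.
HEX_RE = /\A[-+]?0[xX][0-9a-fA-F_]+\z/.freeze

# Hypothetical helper (an assumption, not the library's code):
# drop the underscores, then let Integer() handle the sign and 0x prefix.
# Returns nil when the token is not a hex literal or has no hex digits left.
def hex_to_integer(token)
  return nil unless HEX_RE.match?(token)

  Integer(token.delete("_"), 16, exception: false)
end

%w[0xFF -0x1f 0xDEAD_BEEF 0x xFF 0xG1].each do |token|
  puts "#{token.ljust(12)} #{hex_to_integer(token).inspect}"
end
```

Note that the character class admits underscores anywhere after the prefix, so a token such as "0x_" matches the pattern even though it carries no hex digit; the sketch returns nil for it.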

DEC_RE = The mantissa must carry at least one digit (an integer part, or a leading-dot fraction), so a bare exponent like “-e695881” is NOT a number; it falls through to a quoteless string, matching the C path. The trailing exponent stays optional.

/\A[-+]?(?:(?:0|[1-9][0-9_]*)(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/.freeze
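
The rule described above can be checked directly against the pattern. This sketch copies DEC_RE and classifies a handful of tokens; `decimal_token?` is a hypothetical wrapper introduced only for the example.

```ruby
# Pattern copied from the constant above.
DEC_RE = /\A[-+]?(?:(?:0|[1-9][0-9_]*)(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/.freeze

# Hypothetical wrapper: true when the whole token is a decimal literal.
def decimal_token?(token)
  DEC_RE.match?(token)
end

accepted = %w[0 42 -1.5 .5 5. 5.e3 1_000 +7 1e10]
rejected = %w[-e695881 e5 01 1e .]

accepted.each { |t| puts "#{t.ljust(10)} number: #{decimal_token?(t)}" }
rejected.each { |t| puts "#{t.ljust(10)} number: #{decimal_token?(t)}" }
```

Every token in `accepted` has a digit in its mantissa; every token in `rejected` either lacks one, has a leading zero followed by more digits, or has an exponent marker with no digits after it.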

NEEDS_DECIMAL_FIXUP = Matches a decimal that BigDecimal() would reject as-is: a leading dot (“.5”) or a dot not followed by a digit (“5.”, “5.e3”). It matches if and only if normalize_for_bigdecimal would change the string, so when it does not match, normalization is skipped.

/\A[+-]?\.|\.(?:[eE]|\z)/.freeze
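
The "matches if and only if normalization would change the string" property can be illustrated with a stand-in normalizer. `normalize_sketch` below is an assumption written for this example; the real normalize_for_bigdecimal is not shown on this page and may differ.

```ruby
# Pattern copied from the constant above.
NEEDS_DECIMAL_FIXUP = /\A[+-]?\.|\.(?:[eE]|\z)/.freeze

# Hypothetical stand-in for normalize_for_bigdecimal (an assumption):
# add a zero before a leading dot and after a dot that has no digit behind it.
def normalize_sketch(token)
  token
    .sub(/\A([+-]?)\./) { "#{Regexp.last_match(1)}0." } # ".5"  -> "0.5"
    .sub(/\.(?=[eE]|\z)/, ".0")                         # "5."  -> "5.0"
end

%w[.5 -.5 5. 5.e3 1.5 42 1.5e3].each do |token|
  changed = normalize_sketch(token) != token
  flagged = NEEDS_DECIMAL_FIXUP.match?(token)
  puts "#{token.ljust(6)} -> #{normalize_sketch(token).ljust(7)} changed=#{changed} flagged=#{flagged}"
end
```

For each sample, `changed` and `flagged` agree, which is what lets a caller test the pattern first and skip the string rewrite in the common case.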

BYTEINDEX_AVAILABLE = parse_string scans to the next closing-quote-or-backslash. byteindex (Ruby 3.2+, MRI) does that jump at C speed; the getbyte loop in scan_string_delimiter is the portable fallback (JRuby / TruffleRuby / older MRI). Both find the same byte.

"".respond_to?(:byteindex)

DQUOTE_OR_BACKSLASH =

/["\\]/.freeze

SQUOTE_OR_BACKSLASH =

/['\\]/.freeze
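
A minimal sketch of the two scanning strategies described for BYTEINDEX_AVAILABLE, limited to the double-quote case. The method names `scan_with_getbyte` and `next_delimiter` are hypothetical; the library's own parse_string and scan_string_delimiter are not shown on this page.

```ruby
DQUOTE_OR_BACKSLASH = /["\\]/.freeze
BYTEINDEX_AVAILABLE = "".respond_to?(:byteindex)

# Portable fallback: walk the bytes until a double quote (0x22) or a
# backslash (0x5C). Neither value can occur inside a multi-byte UTF-8
# sequence, so a byte-wise scan is safe on valid UTF-8 input.
def scan_with_getbyte(input, pos)
  while pos < input.bytesize
    byte = input.getbyte(pos)
    return pos if byte == 0x22 || byte == 0x5C

    pos += 1
  end
  nil
end

# Returns the byte offset of the next delimiter at or after pos, or nil.
# pos is assumed to sit on a character boundary.
def next_delimiter(input, pos)
  if BYTEINDEX_AVAILABLE
    input.byteindex(DQUOTE_OR_BACKSLASH, pos)
  else
    scan_with_getbyte(input, pos)
  end
end

input = "caf\u00E9\"x\\y" # bytes: c a f (2-byte e-acute) " x \ y
puts next_delimiter(input, 0).inspect
puts next_delimiter(input, 6).inspect
puts next_delimiter(input, 8).inspect
```

Both paths return byte offsets rather than character indexes, which is why the quote after the two-byte "é" is reported at offset 5, not 4.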

QL_BREAK = scan_quoteless_run’s fast path jumps (in C) to the first structural terminator (‘,’ ‘}’ ‘]’ ‘{’ ‘[’) OR any whitespace ([[:space:]] covers ASCII + Unicode space, incl. LF/CR which also terminate). Stopping at a terminator/EOF means the run had no interior whitespace, so there’s nothing to trim and no comment marker can apply.

/[,{}\[\]]|[[:space:]]/.freeze
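
To see where a quoteless run would stop, the sketch below applies QL_BREAK to a few inputs. It uses String#index (a character index) for portability, which is an assumption of this example; the description above implies the parser works with byte offsets. `quoteless_break` is a hypothetical name.

```ruby
# Pattern copied from the constant above.
QL_BREAK = /[,{}\[\]]|[[:space:]]/.freeze

# Hypothetical helper: character index of the first structural terminator
# or whitespace at or after pos, or nil when the run reaches end of input.
def quoteless_break(input, pos = 0)
  input.index(QL_BREAK, pos)
end

["abc,def", "hello world", "plain", "a\u00A0b", "value}\n", "a:b"].each do |input|
  puts "#{input.inspect.ljust(16)} #{quoteless_break(input).inspect}"
end
```

A nil result means the run had no terminator and no whitespace at all. Note that a colon is not in the set, and that a no-break space (U+00A0) counts as whitespace because [[:space:]] is Unicode-aware.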

DEFAULT_OPTIONS = The defaults live centrally in SmarterJSON::Options (lib/smarter_json/options.rb).

Options::DEFAULT_OPTIONS

Constants included from Bytes

Bytes::BACKSLASH, Bytes::COLON, Bytes::COMMA, Bytes::CR, Bytes::DOLLAR, Bytes::DOT, Bytes::DQUOTE, Bytes::HASH, Bytes::LBRACE, Bytes::LBRACKET, Bytes::LF, Bytes::LOWER_E, Bytes::LOWER_F, Bytes::LOWER_N, Bytes::LOWER_T, Bytes::LOWER_U, Bytes::LOWER_X, Bytes::MINUS, Bytes::NINE, Bytes::PLUS, Bytes::RBRACE, Bytes::RBRACKET, Bytes::SLASH, Bytes::SPACE, Bytes::SQUOTE, Bytes::STAR, Bytes::TAB, Bytes::UNDERSCORE, Bytes::UPPER_E, Bytes::UPPER_F, Bytes::UPPER_I, Bytes::UPPER_N, Bytes::UPPER_T, Bytes::UPPER_X, Bytes::ZERO

Instance Method Summary

#each_value ⇒ Object

Yield each top-level value until EOF (JSONL / NDJSON / concatenated / whitespace-separated).
#initialize(input, options = {}) ⇒ Parser constructor

A new instance of Parser.
#parse ⇒ Object

No block: auto-detect the document count for free (the same “is there trailing content?” check that used to raise).

Constructor Details

#initialize(input, options = {}) ⇒ `Parser`

Returns a new instance of Parser.

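Raises:

(ArgumentError) if input is not a String.

(EncodingError) if the input is not a valid byte sequence in its encoding (checked after the optional :encoding override is applied).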

# File 'lib/smarter_json/parser.rb', line 709

def initialize(input, options = {})
  raise ArgumentError, "input must be a String" unless input.is_a?(String)

  opts = DEFAULT_OPTIONS.merge(options)
  @symbolize_keys  = opts[:symbolize_keys]
  @duplicate_key   = opts[:duplicate_key]
  @decimal_precision = opts[:decimal_precision]
  @on_warning = opts[:on_warning]
  # store_member only needs the (per-member) Hash#key? duplicate lookup when a
  # repeat would change behavior: a warning must fire, or :first_wins must keep the
  # first. With the default (:last_wins, no handler) a duplicate just overwrites,
  # which `hash[k] = value` already does — so skip the lookup entirely.
  @check_duplicates = !@on_warning.nil? || @duplicate_key == :first_wins

  encoding = opts[:encoding]
  @input = encoding ? input.dup.force_encoding(encoding) : input
  raise EncodingError, "invalid byte sequence for #{@input.encoding.name}" unless @input.valid_encoding?

  @bytesize = @input.bytesize
  # Skip a UTF-8 BOM (EF BB BF) at the start of input.
  @pos = @input.getbyte(0) == 0xEF && @input.getbyte(1) == 0xBB && @input.getbyte(2) == 0xBF ? 3 : 0
end
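
Two decisions in the constructor are easy to try in isolation: the BOM offset and the duplicate-key shortcut. The sketch below is a self-contained illustration under stated assumptions. `start_offset` mirrors the BOM test in the listing; `store_member_sketch` and its warning message are hypothetical, since the real store_member is not shown on this page.

```ruby
# Byte offset at which parsing starts: 3 when the input opens with a
# UTF-8 BOM (EF BB BF), otherwise 0. Same test as in the listing above.
def start_offset(input)
  bom = input.getbyte(0) == 0xEF && input.getbyte(1) == 0xBB && input.getbyte(2) == 0xBF
  bom ? 3 : 0
end

# Hypothetical stand-in for store_member (an assumption, not library code).
# The Hash#key? lookup only runs when a duplicate would change behavior.
def store_member_sketch(hash, key, value, duplicate_key: :last_wins, on_warning: nil)
  check_duplicates = !on_warning.nil? || duplicate_key == :first_wins
  if check_duplicates && hash.key?(key)
    on_warning&.call("duplicate key #{key.inspect}") # message format is made up
    return hash if duplicate_key == :first_wins
  end
  hash[key] = value # with the defaults, a repeat simply overwrites
  hash
end

puts start_offset("\uFEFF{}") # BOM present
puts start_offset("{}")       # no BOM

warnings = []
h = {}
store_member_sketch(h, "a", 1, on_warning: ->(msg) { warnings << msg })
store_member_sketch(h, "a", 2, on_warning: ->(msg) { warnings << msg })
puts h.inspect
puts warnings.inspect
```

With the default policy and no handler, `check_duplicates` is false and each member costs a single hash write, which is the saving the comment in the constructor describes.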

Instance Method Details

#each_value ⇒ `Object`

Yield each top-level value until EOF (JSONL / NDJSON / concatenated / whitespace-separated). Used by the block form of SmarterJSON.process.

# File 'lib/smarter_json/parser.rb', line 753

def each_value
  count = 0
  until eof?
    skip_document_separators
    break if eof?

    value = parse_document
    enforce_scalar_boundary(value)
    yield value
    count += 1
  end
  count
end
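
The loop shape of each_value (skip separators, stop at EOF, parse one document, yield it, count it) can be tried without the library. The class below is a toy stand-in and an assumption of this example: its "documents" are bare integers separated by whitespace, and `ToyIntegerStream` is not part of SmarterJSON.

```ruby
require "strscan"

# Toy stand-in for the parser: each document is a bare integer.
class ToyIntegerStream
  def initialize(input)
    @scanner = StringScanner.new(input)
  end

  # Yields each top-level value until EOF and returns how many were yielded.
  def each_value
    count = 0
    until @scanner.eos?
      @scanner.skip(/\s+/) # plays the role of skip_document_separators
      break if @scanner.eos?

      token = @scanner.scan(/-?\d+/) # plays the role of parse_document
      raise ArgumentError, "unexpected input at byte #{@scanner.pos}" if token.nil?

      yield token.to_i
      count += 1
    end
    count
  end
end

values = []
count = ToyIntegerStream.new("1\n2 3\n\n").each_value { |v| values << v }
puts values.inspect
puts count
```

The second `eos?` check after skipping separators is what keeps trailing whitespace or blank lines from being treated as one more document.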

#parse ⇒ `Object`

No block: auto-detect the document count for free (the same “is there trailing content?” check that used to raise).

0 documents -> nil

1 document -> the value itself (single-document path, no Array allocated)

2+ documents (NDJSON / JSONL / concatenated / whitespace-separated) -> an Array of every value

Commas do NOT separate documents (only whitespace / newline / concatenation do), so a bracketless comma list still raises in parse_document.

# File 'lib/smarter_json/parser.rb', line 738

def parse
  results = []
  until eof?
    skip_document_separators
    break if eof?

    value = parse_document
    enforce_scalar_boundary(value)
    results << value
  end
  results
end
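
The description of #parse maps the document count to a return value, while the listing itself returns the collected Array. The sketch below only illustrates that described mapping, assuming it is applied to the collected values at some point; where that happens is not shown on this page. `unwrap_documents` is a hypothetical name, and unlike the described single-document path it does start from an Array.

```ruby
# Hypothetical helper (an assumption, not library code): turn the list of
# parsed documents into the return value described for #parse.
def unwrap_documents(results)
  case results.size
  when 0 then nil           # empty input
  when 1 then results.first # single document: the value itself
  else results              # two or more documents: every value
  end
end

puts unwrap_documents([]).inspect
puts unwrap_documents([{ "a" => 1 }]).inspect
puts unwrap_documents([1, 2]).inspect
puts unwrap_documents([[1, 2]]).inspect
```

One consequence of this contract is worth keeping in mind: a single document that is itself an array and a stream of several scalar documents can produce the same return value, so a caller who needs to tell them apart is better served by each_value.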