Module: SmarterJSON::Framer

Includes: Bytes

Defined in: lib/smarter_json/parser.rb

Constant Summary

CHUNK_SIZE = 16 * 1024

Constants included from Bytes

Bytes::BACKSLASH, Bytes::COLON, Bytes::COMMA, Bytes::CR, Bytes::DOLLAR, Bytes::DOT, Bytes::DQUOTE, Bytes::HASH, Bytes::LBRACE, Bytes::LBRACKET, Bytes::LF, Bytes::LOWER_E, Bytes::LOWER_F, Bytes::LOWER_N, Bytes::LOWER_T, Bytes::LOWER_U, Bytes::LOWER_X, Bytes::MINUS, Bytes::NINE, Bytes::PLUS, Bytes::RBRACE, Bytes::RBRACKET, Bytes::SLASH, Bytes::SPACE, Bytes::SQUOTE, Bytes::STAR, Bytes::TAB, Bytes::UNDERSCORE, Bytes::UPPER_E, Bytes::UPPER_F, Bytes::UPPER_I, Bytes::UPPER_N, Bytes::UPPER_T, Bytes::UPPER_X, Bytes::ZERO

Class Method Summary

.block_comment_start?(buffer, scan) ⇒ Boolean
.defer_for_split_marker?(buffer, scan, b, mode, doc_start) ⇒ Boolean
True when b is the lead byte of a multi-byte marker but the rest of that marker has not been read into the buffer yet, so we cannot decide what it is.
.each_document(io) {|buffer| ... } ⇒ Object
.each_document_transcoded(io, conv, first_chunk) {|buffer| ... } ⇒ Object
Like each_document, but the IO's raw bytes are in conv's source encoding (UTF-16 / UTF-32 / Shift_JIS / ...): each chunk is transcoded to a UTF-8 view and framed there, so the byte-level splitter works.
.finish_transcode(conv) ⇒ Object
Flush the converter at end of stream.
.line_comment_start?(buffer, scan) ⇒ Boolean
.preceded_by_ws_or_start?(buffer, scan) ⇒ Boolean
.read_chunk(io) ⇒ Object
.scan_buffer(buffer, scan, doc_start, stack, mode) ⇒ Object
.separators_only?(buffer) ⇒ Boolean
.transcode_chunk(conv, raw) ⇒ Object
Push one raw chunk through the converter, returning the UTF-8 produced so far.
.whitespace_byte?(b) ⇒ Boolean

Class Method Details

.block_comment_start?(buffer, scan) ⇒ `Boolean`

Returns:

(Boolean)



# File 'lib/smarter_json/parser.rb', line 655

def block_comment_start?(buffer, scan)
  buffer.getbyte(scan) == SLASH && buffer.getbyte(scan + 1) == STAR && preceded_by_ws_or_start?(buffer, scan)
end

.defer_for_split_marker?(buffer, scan, b, mode, doc_start) ⇒ `Boolean`

True when b is the lead byte of a multi-byte marker but the rest of that marker has not been read into the buffer yet, so we cannot decide what it is. // and /* need 2 bytes; ''' (and a closing ''') needs 3; a closing */ needs 2. Backslash escapes and single-byte delimiters never need this.

Returns:

(Boolean)

# File 'lib/smarter_json/parser.rb', line 592

def defer_for_split_marker?(buffer, scan, b, mode, doc_start)
  avail = buffer.bytesize - scan
  case mode
  when :block_comment
    b == STAR && avail < 2
  when :triple
    b == SQUOTE && avail < 3
  when nil
    if doc_start.nil?
      b == SLASH && avail < 2
    else
      (b == SLASH && avail < 2) || (b == SQUOTE && avail < 3)
    end
  else
    false
  end
end
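
The deferral rule can be tried in isolation. The following is a minimal sketch, not part of the library: it copies the decision table from the listing above into a top-level method and assumes the Bytes constants are plain ASCII codes (the `SLASH`, `STAR` and `SQUOTE` definitions here are stand-ins). It shows a chunk boundary falling between `/` and `*`.

```ruby
# Stand-ins for the Bytes constants, assumed to be ASCII codes.
SLASH  = "/".ord
STAR   = "*".ord
SQUOTE = "'".ord

# Decision table copied from the listing above.
def defer_for_split_marker?(buffer, scan, b, mode, doc_start)
  avail = buffer.bytesize - scan
  case mode
  when :block_comment then b == STAR && avail < 2
  when :triple        then b == SQUOTE && avail < 3
  when nil
    if doc_start.nil?
      b == SLASH && avail < 2
    else
      (b == SLASH && avail < 2) || (b == SQUOTE && avail < 3)
    end
  else
    false
  end
end

# The chunk ended right after "/", so only the lead byte is buffered.
partial = +"{\"a\": 1} /"
scan = partial.bytesize - 1
deferred_before = defer_for_split_marker?(partial, scan, partial.getbyte(scan), nil, nil)

# After the next chunk is appended the marker is complete at the same position.
partial << "* note */"
deferred_after = defer_for_split_marker?(partial, scan, partial.getbyte(scan), nil, nil)

# Inside a double-quoted string no marker is ever deferred.
deferred_in_string = defer_for_split_marker?(+"/", 0, SLASH, :double, 0)
```

Here `deferred_before` is true while `deferred_after` and `deferred_in_string` are false, so the scanner would pause on the lone `/` and resume from the same offset once more input has arrived.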

.each_document(io) {|buffer| ... } ⇒ `Object`

Yields:

(buffer)

# File 'lib/smarter_json/parser.rb', line 396

def each_document(io)
  buffer = +""
  scan = 0
  doc_start = nil
  stack = []
  mode = nil

  while (chunk = read_chunk(io))
    buffer << chunk
    loop do
      emitted, buffer, scan, doc_start, stack, mode = scan_buffer(buffer, scan, doc_start, stack, mode)
      break unless emitted

      yield emitted
    end
  end

  yield buffer unless separators_only?(buffer)
end
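
To illustrate the framing idea behind each_document, here is a deliberately reduced sketch. It is not the library's implementation: the hypothetical `frame_documents` helper works on a whole in-memory string, tracks only bracket depth and double-quoted strings, and ignores comments, single and triple quotes, and chunked reads, all of which the real method handles through scan_buffer.

```ruby
# Hypothetical, simplified framer: splits concatenated top-level
# objects/arrays by tracking bracket depth and double-quoted strings.
def frame_documents(text)
  docs = []
  depth = 0
  doc_start = nil
  in_string = false
  i = 0
  while i < text.bytesize
    b = text.getbyte(i)
    if in_string
      if b == 92 # backslash: skip the escaped byte as well
        i += 2
        next
      end
      in_string = false if b == 34 # closing double quote
    elsif b == 34 && depth.positive?
      in_string = true
    elsif b == 123 || b == 91 # "{" or "["
      doc_start = i if depth.zero?
      depth += 1
    elsif (b == 125 || b == 93) && depth.positive? # "}" or "]"
      depth -= 1
      if depth.zero?
        docs << text.byteslice(doc_start, i - doc_start + 1)
        doc_start = nil
      end
    end
    i += 1
  end
  docs
end

# Brackets inside strings, including after an escaped quote, do not end a document.
docs = frame_documents(%({"a": "}"} [1, 2]\n{"b": {"c": "\\"]"}}))
```

With this input `docs` holds three strings, one per top-level document. The real method differs in that it yields each document as soon as it closes and keeps only the unfinished remainder in its buffer.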

.each_document_transcoded(io, conv, first_chunk) {|buffer| ... } ⇒ `Object`

Like each_document, but the IO's raw bytes are in conv's source encoding (UTF-16 / UTF-32 / Shift_JIS / ...): each chunk is transcoded to a UTF-8 view and framed there, so the byte-level splitter works. first_chunk is the already-read first raw chunk (the caller sniffs a BOM from it). Memory stays bounded by one document, like each_document.

Yields:

(buffer)

# File 'lib/smarter_json/parser.rb', line 420

def each_document_transcoded(io, conv, first_chunk)
  buffer = +""
  scan = 0
  doc_start = nil
  stack = []
  mode = nil

  raw = first_chunk
  while raw
    chunk = transcode_chunk(conv, raw)
    unless chunk.empty?
      buffer << chunk
      loop do
        emitted, buffer, scan, doc_start, stack, mode = scan_buffer(buffer, scan, doc_start, stack, mode)
        break unless emitted

        yield emitted
      end
    end
    raw = read_chunk(io)
  end

  finish_transcode(conv) # truncated / invalid trailing bytes -> SmarterJSON::EncodingError

  yield buffer unless separators_only?(buffer)
end

.finish_transcode(conv) ⇒ `Object`

Flush the converter at end of stream. A held incomplete multibyte sequence means the input was truncated mid-character — surface it the same way an invalid encoding is surfaced.

Raises:

(SmarterJSON::EncodingError)

# File 'lib/smarter_json/parser.rb', line 462

def finish_transcode(conv)
  return if conv.nil?

  status = conv.primitive_convert("".b, +"")
  raise SmarterJSON::EncodingError, "invalid byte sequence in stream" unless status == :finished
end

.line_comment_start?(buffer, scan) ⇒ `Boolean`

Returns:

(Boolean)

# File 'lib/smarter_json/parser.rb', line 648

def line_comment_start?(buffer, scan)
  b = buffer.getbyte(scan)
  return preceded_by_ws_or_start?(buffer, scan) if b == HASH

  b == SLASH && buffer.getbyte(scan + 1) == SLASH && preceded_by_ws_or_start?(buffer, scan)
end

.preceded_by_ws_or_start?(buffer, scan) ⇒ `Boolean`

Returns:

(Boolean)

# File 'lib/smarter_json/parser.rb', line 659

def preceded_by_ws_or_start?(buffer, scan)
  return true if scan.zero?

  prev = buffer.getbyte(scan - 1)
  whitespace_byte?(prev)
end
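
The effect of the preceded_by_ws_or_start? guard is easiest to see on a marker that appears in the middle of a token. The sketch below copies the three predicates from the listings on this page into top-level methods; the constant values are assumed ASCII codes standing in for the Bytes module.

```ruby
# Stand-ins for the Bytes constants, assumed to be ASCII codes.
HASH  = "#".ord
SLASH = "/".ord
SPACE = " ".ord
TAB   = 9
CR    = 13

def whitespace_byte?(b)
  b == SPACE || (b && b >= TAB && b <= CR)
end

def preceded_by_ws_or_start?(buffer, scan)
  return true if scan.zero?

  whitespace_byte?(buffer.getbyte(scan - 1))
end

def line_comment_start?(buffer, scan)
  b = buffer.getbyte(scan)
  return preceded_by_ws_or_start?(buffer, scan) if b == HASH

  b == SLASH && buffer.getbyte(scan + 1) == SLASH && preceded_by_ws_or_start?(buffer, scan)
end

url   = "[http://example.com]"
noted = "[1, 2] // trailing note"

in_url  = line_comment_start?(url, url.index("//"))     # "//" follows ":"
in_note = line_comment_start?(noted, noted.index("//")) # "//" follows a space
at_zero = line_comment_start?("# header", 0)            # "#" at buffer start
glued   = line_comment_start?("a#b", 1)                 # "#" follows a letter
```

Only `in_note` and `at_zero` are truthy: a `//` or `#` that directly follows a non-whitespace byte is treated as ordinary content rather than the start of a comment.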

.read_chunk(io) ⇒ `Object`

# File 'lib/smarter_json/parser.rb', line 469

def read_chunk(io)
  if io.respond_to?(:readpartial)
    io.readpartial(CHUNK_SIZE)
  else
    io.read(CHUNK_SIZE)
  end
rescue EOFError
  nil
end

.scan_buffer(buffer, scan, doc_start, stack, mode) ⇒ `Object`

# File 'lib/smarter_json/parser.rb', line 479

def scan_buffer(buffer, scan, doc_start, stack, mode)
  while scan < buffer.bytesize
    b = buffer.getbyte(scan)
    # A multi-byte marker (// /* ''' */) whose lead byte is here but whose
    # remaining bytes have not arrived yet must not be guessed at — advancing
    # past the lead byte would misread the brace/quote that follows it once the
    # next chunk lands. Stop and let each_document append more input, then resume
    # from this same position. At true EOF the leftover is parsed whole instead.
    break if defer_for_split_marker?(buffer, scan, b, mode, doc_start)

    if mode == :double
      if b == BACKSLASH
        scan += 2
      elsif b == DQUOTE
        mode = nil
        scan += 1
      else
        scan += 1
      end
    elsif mode == :single
      if b == BACKSLASH
        scan += 2
      elsif b == SQUOTE
        mode = nil
        scan += 1
      else
        scan += 1
      end
    elsif mode == :triple
      if buffer.byteslice(scan, 3) == "'''"
        mode = nil
        scan += 3
      else
        scan += 1
      end
    elsif mode == :line_comment
      if [LF, CR].include?(b)
        mode = nil
      else
        scan += 1
        next
      end
    elsif mode == :block_comment
      if buffer.byteslice(scan, 2) == '*/'
        mode = nil
        scan += 2
      else
        scan += 1
      end
    elsif doc_start.nil?
      if whitespace_byte?(b)
        scan += 1
      elsif line_comment_start?(buffer, scan)
        mode = :line_comment
        scan += buffer.getbyte(scan) == HASH ? 1 : 2
      elsif block_comment_start?(buffer, scan)
        mode = :block_comment
        scan += 2
      elsif [LBRACE, LBRACKET].include?(b)
        doc_start = scan
        stack << b
        scan += 1
      else
        scan = buffer.bytesize
      end
    else
      if mode.nil? && line_comment_start?(buffer, scan)
        mode = :line_comment
        scan += buffer.getbyte(scan) == HASH ? 1 : 2
      elsif mode.nil? && block_comment_start?(buffer, scan)
        mode = :block_comment
        scan += 2
      elsif b == DQUOTE
        mode = :double
        scan += 1
      elsif buffer.byteslice(scan, 3) == "'''"
        mode = :triple
        scan += 3
      elsif b == SQUOTE
        mode = :single
        scan += 1
      elsif [LBRACE, LBRACKET].include?(b)
        stack << b
        scan += 1
      elsif b == RBRACE
        stack.pop if stack.last == LBRACE
        scan += 1
        if stack.empty?
          doc = buffer.byteslice(doc_start, scan - doc_start)
          buffer = buffer.byteslice(scan..-1) || +""
          return [doc, buffer, 0, nil, [], nil]
        end
      elsif b == RBRACKET
        stack.pop if stack.last == LBRACKET
        scan += 1
        if stack.empty?
          doc = buffer.byteslice(doc_start, scan - doc_start)
          buffer = buffer.byteslice(scan..-1) || +""
          return [doc, buffer, 0, nil, [], nil]
        end
      else
        scan += 1
      end
    end
  end

  [nil, buffer, scan, doc_start, stack, mode]
end

.separators_only?(buffer) ⇒ `Boolean`

Returns:

(Boolean)

# File 'lib/smarter_json/parser.rb', line 610

def separators_only?(buffer)
  scan = 0
  mode = nil
  while scan < buffer.bytesize
    b = buffer.getbyte(scan)
    if mode == :line_comment
      if [LF, CR].include?(b)
        mode = nil
      else
        scan += 1
        next
      end
    elsif mode == :block_comment
      if buffer.byteslice(scan, 2) == '*/'
        mode = nil
        scan += 2
      else
        scan += 1
      end
    elsif whitespace_byte?(b)
      scan += 1
    elsif line_comment_start?(buffer, scan)
      mode = :line_comment
      scan += buffer.getbyte(scan) == HASH ? 1 : 2
    elsif block_comment_start?(buffer, scan)
      mode = :block_comment
      scan += 2
    else
      return false
    end
  end
  true
end

.transcode_chunk(conv, raw) ⇒ `Object`

Push one raw chunk through the converter, returning the UTF-8 produced so far. An incomplete trailing multibyte sequence is held inside the converter until the next chunk; invalid bytes raise SmarterJSON::EncodingError (matching the whole-buffer to_utf8_copy).

Raises:

(SmarterJSON::EncodingError)

# File 'lib/smarter_json/parser.rb', line 450

def transcode_chunk(conv, raw)
  return raw.dup.force_encoding(Encoding::UTF_8) if conv.nil? # raw bytes are already UTF-8

  out = +""
  status = conv.primitive_convert(raw.dup, out, nil, nil, partial_input: true)
  raise SmarterJSON::EncodingError, "invalid byte sequence in stream" if status == :invalid_byte_sequence

  out
end
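
The chunk-by-chunk behaviour described above can be reproduced with Ruby's standard Encoding::Converter alone. The following is a small sketch under stated assumptions: the two-chunk split and the variable names are illustrative, and errors are raised as plain RuntimeError rather than SmarterJSON::EncodingError so that the example runs without the library.

```ruby
# "hi" in UTF-16LE is the four bytes 68 00 69 00. The chunks below split the
# second character in half to mimic a read boundary inside a character.
conv = Encoding::Converter.new("UTF-16LE", "UTF-8")
out = +""
["h\x00i".b, "\x00".b].each do |raw|
  piece = +""
  # partial_input: true tells the converter that more bytes may follow, so an
  # incomplete trailing sequence is held instead of being reported.
  status = conv.primitive_convert(raw.dup, piece, nil, nil, partial_input: true)
  raise "invalid byte sequence in stream" if status == :invalid_byte_sequence

  out << piece
end
# Final flush without partial_input, as finish_transcode does.
final_status = conv.primitive_convert("".b, +"")

# The same stream cut short after three bytes: the held byte never completes.
truncated = Encoding::Converter.new("UTF-16LE", "UTF-8")
truncated.primitive_convert("h\x00i".b, +"", nil, nil, partial_input: true)
truncated_status = truncated.primitive_convert("".b, +"")
```

After the loop `out` contains the UTF-8 text `hi` and the flush reports `:finished`. For the truncated stream the flush returns something other than `:finished`, which is the condition finish_transcode turns into an error.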

.whitespace_byte?(b) ⇒ `Boolean`

Returns:

(Boolean)



# File 'lib/smarter_json/parser.rb', line 644

def whitespace_byte?(b)
  b == SPACE || (b && b >= TAB && b <= CR)
end
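
The range check accepts more than the four JSON whitespace bytes. As a quick illustration, assuming the Bytes constants are the usual ASCII codes (the values below are stand-ins, not taken from the library), the predicate can be run over the ASCII range:

```ruby
# Stand-ins for the Bytes constants, assumed to be ASCII codes.
SPACE = 32
TAB   = 9
CR    = 13

def whitespace_byte?(b)
  b == SPACE || (b && b >= TAB && b <= CR)
end

# Every 7-bit byte value the predicate accepts.
accepted = (0..127).select { |b| whitespace_byte?(b) }

# String#getbyte returns nil past the end of the buffer; the `b &&` guard
# keeps the comparison from raising in that case.
past_end = whitespace_byte?("".getbyte(0))
```

Under these assumptions `accepted` is the bytes 9 through 13 (tab, line feed, vertical tab, form feed, carriage return) plus 32 (space), and `past_end` is falsy.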
