Module: SmarterJSON::Framer
- Includes: Bytes
- Defined in: lib/smarter_json/parser.rb
Constant Summary
- CHUNK_SIZE = 16 * 1024
Constants included from Bytes
Bytes::BACKSLASH, Bytes::COLON, Bytes::COMMA, Bytes::CR, Bytes::DOLLAR, Bytes::DOT, Bytes::DQUOTE, Bytes::HASH, Bytes::LBRACE, Bytes::LBRACKET, Bytes::LF, Bytes::LOWER_E, Bytes::LOWER_F, Bytes::LOWER_N, Bytes::LOWER_T, Bytes::LOWER_U, Bytes::LOWER_X, Bytes::MINUS, Bytes::NINE, Bytes::PLUS, Bytes::RBRACE, Bytes::RBRACKET, Bytes::SLASH, Bytes::SPACE, Bytes::SQUOTE, Bytes::STAR, Bytes::TAB, Bytes::UNDERSCORE, Bytes::UPPER_E, Bytes::UPPER_F, Bytes::UPPER_I, Bytes::UPPER_N, Bytes::UPPER_T, Bytes::UPPER_X, Bytes::ZERO
Class Method Summary
- .block_comment_start?(buffer, scan) ⇒ Boolean
- .defer_for_split_marker?(buffer, scan, b, mode, doc_start) ⇒ Boolean
  True when b is the lead byte of a multi-byte marker but the rest of that marker has not been read into the buffer yet, so we cannot decide what it is.
- .each_document(io) {|buffer| ... } ⇒ Object
- .each_document_transcoded(io, conv, first_chunk) {|buffer| ... } ⇒ Object
  Like each_document, but the IO's raw bytes are in conv's source encoding (UTF-16 / UTF-32 / Shift_JIS / ...): each chunk is transcoded to a UTF-8 view and framed there, so the byte-level splitter works.
- .finish_transcode(conv) ⇒ Object
  Flush the converter at end of stream.
- .line_comment_start?(buffer, scan) ⇒ Boolean
- .preceded_by_ws_or_start?(buffer, scan) ⇒ Boolean
- .read_chunk(io) ⇒ Object
- .scan_buffer(buffer, scan, doc_start, stack, mode) ⇒ Object
- .separators_only?(buffer) ⇒ Boolean
- .transcode_chunk(conv, raw) ⇒ Object
  Push one raw chunk through the converter, returning the UTF-8 produced so far.
- .whitespace_byte?(b) ⇒ Boolean
Class Method Details
.block_comment_start?(buffer, scan) ⇒ Boolean
# File 'lib/smarter_json/parser.rb', line 655
def block_comment_start?(buffer, scan)
  buffer.getbyte(scan) == SLASH && buffer.getbyte(scan + 1) == STAR && preceded_by_ws_or_start?(buffer, scan)
end
.defer_for_split_marker?(buffer, scan, b, mode, doc_start) ⇒ Boolean
True when b is the lead byte of a multi-byte marker but the rest of that
marker has not been read into the buffer yet, so we cannot decide what it is.
// and /* need 2 bytes; ''' (and a closing ''') needs 3; a closing
*/ needs 2. Backslash escapes and single-byte delimiters never need this.
# File 'lib/smarter_json/parser.rb', line 592
def defer_for_split_marker?(buffer, scan, b, mode, doc_start)
  avail = buffer.bytesize - scan
  case mode
  when :block_comment
    b == STAR && avail < 2
  when :triple
    b == SQUOTE && avail < 3
  when nil
    if doc_start.nil?
      b == SLASH && avail < 2
    else
      (b == SLASH && avail < 2) || (b == SQUOTE && avail < 3)
    end
  else
    false
  end
end
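The deferral rule can be tried in isolation. The following is a minimal sketch, not the library's code: the byte constants are re-declared locally on the assumption that Bytes uses plain ASCII values, and `needs_more_input?` and its `inside_document` flag are hypothetical names standing in for the method above and its `doc_start` argument.

```ruby
# Standalone sketch. SLASH/STAR/SQUOTE are assumed to equal the ASCII values
# defined in Bytes; needs_more_input? is a hypothetical stand-in name.
SLASH  = "/".ord
STAR   = "*".ord
SQUOTE = "'".ord

def needs_more_input?(buffer, scan, mode, inside_document)
  b = buffer.getbyte(scan)
  avail = buffer.bytesize - scan # bytes available from scan to the buffer end
  case mode
  when :block_comment then b == STAR && avail < 2   # might be the start of */
  when :triple        then b == SQUOTE && avail < 3 # might be a closing '''
  when nil
    return true if b == SLASH && avail < 2          # might be // or /*
    inside_document && b == SQUOTE && avail < 3     # might be an opening '''
  else
    false # inside "..." or '...' or a line comment: single bytes decide
  end
end

# A chunk that ends on "/" cannot be classified until the next byte arrives.
tail_is_slash = needs_more_input?('{"a": 1 /', 8, nil, true)
```

Returning true here corresponds to the `break` in scan_buffer: the scanner stops without advancing and resumes at the same position once more input has been appended.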
.each_document(io) {|buffer| ... } ⇒ Object
# File 'lib/smarter_json/parser.rb', line 396
def each_document(io)
  buffer = +""
  scan = 0
  doc_start = nil
  stack = []
  mode = nil

  while (chunk = read_chunk(io))
    buffer << chunk

    loop do
      emitted, buffer, scan, doc_start, stack, mode = scan_buffer(buffer, scan, doc_start, stack, mode)
      break unless emitted
      yield emitted
    end
  end

  yield buffer unless separators_only?(buffer)
end
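The read-append-drain pattern used by each_document can be illustrated with a much smaller framer. This is a simplified sketch under stated assumptions, not the library's implementation: `frame_documents` and its `chunk_size` keyword are hypothetical, and only nesting depth and double-quoted strings are tracked.

```ruby
require "stringio"

# Simplified stand-in for the framing loop. It tracks only nesting depth and
# double-quoted strings (no comments, single quotes, or triple quotes), and it
# ignores any unterminated leftover at EOF. frame_documents and its tiny
# default chunk size are hypothetical, chosen for illustration.
def frame_documents(io, chunk_size: 4)
  docs = []
  buffer = +""
  scan = 0
  depth = 0
  in_string = false
  doc_start = nil

  while (chunk = io.read(chunk_size))
    buffer << chunk
    while scan < buffer.bytesize
      b = buffer.getbyte(scan)
      if in_string
        if b == 92 # backslash: skip the escaped byte, even if it arrives later
          scan += 2
          next
        end
        in_string = false if b == 34 # closing double quote
      elsif b == 34
        in_string = true
      elsif b == 123 || b == 91 # { or [
        doc_start ||= scan
        depth += 1
      elsif (b == 125 || b == 93) && depth.positive? # } or ]
        depth -= 1
        if depth.zero?
          # Emit the finished document and keep only the unread remainder.
          docs << buffer.byteslice(doc_start, scan + 1 - doc_start)
          buffer = buffer.byteslice((scan + 1)..-1) || +""
          scan = 0
          doc_start = nil
          next
        end
      end
      scan += 1
    end
  end
  docs
end

docs = frame_documents(StringIO.new('{"a":"}"}[1,[2]] {"b":2}'))
```

Because the scan position and nesting state survive between chunks, the result does not depend on where chunk boundaries fall, and the buffer never holds more than the current document plus one chunk.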
.each_document_transcoded(io, conv, first_chunk) {|buffer| ... } ⇒ Object
Like each_document, but the IO's raw bytes are in conv's source encoding (UTF-16 /
UTF-32 / Shift_JIS / ...): each chunk is transcoded to a UTF-8 view and framed there, so
the byte-level splitter works. first_chunk is the already-read first raw chunk (the
caller sniffs a BOM from it). Memory stays bounded by one document, like each_document.
# File 'lib/smarter_json/parser.rb', line 420
def each_document_transcoded(io, conv, first_chunk)
  buffer = +""
  scan = 0
  doc_start = nil
  stack = []
  mode = nil
  raw = first_chunk

  while raw
    chunk = transcode_chunk(conv, raw)
    unless chunk.empty?
      buffer << chunk
      loop do
        emitted, buffer, scan, doc_start, stack, mode = scan_buffer(buffer, scan, doc_start, stack, mode)
        break unless emitted
        yield emitted
      end
    end
    raw = read_chunk(io)
  end

  finish_transcode(conv) # truncated / invalid trailing bytes -> SmarterJSON::EncodingError
  yield buffer unless separators_only?(buffer)
end
.finish_transcode(conv) ⇒ Object
Flush the converter at end of stream. A held incomplete multibyte sequence means the input was truncated mid-character — surface it the same way an invalid encoding is surfaced.
# File 'lib/smarter_json/parser.rb', line 462
def finish_transcode(conv)
  return if conv.nil?

  status = conv.primitive_convert("".b, +"")
  raise SmarterJSON::EncodingError, "invalid byte sequence in stream" unless status == :finished
end
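To see why the final flush catches truncated input, consider a UTF-16LE stream that stops halfway through a code unit. The sketch below uses only Ruby's standard Encoding::Converter; the sample bytes and variable names are illustrative assumptions.

```ruby
# "{" followed by the first byte only of a two-byte UTF-16LE code unit.
truncated = "{\x00\xE9".b

conv = Encoding::Converter.new("UTF-16LE", "UTF-8")
produced = +""
# With partial_input: true the dangling byte is held, not reported as an error.
conv.primitive_convert(truncated, produced, nil, nil, partial_input: true)

# Flushing without partial_input asks the converter to finish; because a byte
# is still held, the status is something other than :finished.
flush_status = conv.primitive_convert("".b, +"")
stream_was_truncated = flush_status != :finished
```

In the method above, that non-:finished status is what gets turned into SmarterJSON::EncodingError.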
.line_comment_start?(buffer, scan) ⇒ Boolean
# File 'lib/smarter_json/parser.rb', line 648
def line_comment_start?(buffer, scan)
  b = buffer.getbyte(scan)
  return preceded_by_ws_or_start?(buffer, scan) if b == HASH

  b == SLASH && buffer.getbyte(scan + 1) == SLASH && preceded_by_ws_or_start?(buffer, scan)
end
.preceded_by_ws_or_start?(buffer, scan) ⇒ Boolean
# File 'lib/smarter_json/parser.rb', line 659
def preceded_by_ws_or_start?(buffer, scan)
  return true if scan.zero?

  prev = buffer.getbyte(scan - 1)
  whitespace_byte?(prev)
end
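The three comment predicates share one guard: a marker only counts when it sits at the start of the buffer or directly after whitespace. A minimal standalone sketch of that rule follows. `comment_marker_at?` is a hypothetical helper, and the whitespace set is an assumption, since the body of whitespace_byte? is not shown on this page.

```ruby
# Assumed whitespace set (space, tab, LF, CR); the real whitespace_byte? may differ.
WS_BYTES = [" ", "\t", "\n", "\r"].map(&:ord).freeze

# Hypothetical helper combining the #, // and /* checks behind one guard.
def comment_marker_at?(buffer, scan)
  return false unless scan.zero? || WS_BYTES.include?(buffer.getbyte(scan - 1))

  b = buffer.getbyte(scan)
  return true if b == "#".ord

  b == "/".ord && ["/".ord, "*".ord].include?(buffer.getbyte(scan + 1))
end

after_space = comment_marker_at?("[1] // note", 4) # "//" follows a space
inside_url  = comment_marker_at?("http://x", 5)    # "//" follows ":"
```

The guard keeps sequences such as the `//` in an unquoted `http://x` value from being treated as the start of a comment.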
.read_chunk(io) ⇒ Object
# File 'lib/smarter_json/parser.rb', line 469
def read_chunk(io)
  if io.respond_to?(:readpartial)
    io.readpartial(CHUNK_SIZE)
  else
    io.read(CHUNK_SIZE)
  end
rescue EOFError
  nil
end
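Both read paths end the same way for the caller: a nil return signals end of input, whether it comes from `read` returning nil or from `readpartial` raising EOFError. The sketch below shows this with StringIO and a tiny chunk size. `SMALL_CHUNK`, `next_chunk`, and `ReadOnlySource` are hypothetical names used only for illustration.

```ruby
require "stringio"

SMALL_CHUNK = 4 # hypothetical; the module itself uses CHUNK_SIZE (16 KiB)

# Same shape as read_chunk: prefer readpartial, fall back to read,
# and normalise end-of-file to nil.
def next_chunk(io)
  io.respond_to?(:readpartial) ? io.readpartial(SMALL_CHUNK) : io.read(SMALL_CHUNK)
rescue EOFError
  nil
end

# A source that offers only #read, to exercise the fallback branch.
class ReadOnlySource
  def initialize(text)
    @io = StringIO.new(text)
  end

  def read(length)
    @io.read(length)
  end
end

def drain(io)
  chunks = []
  while (chunk = next_chunk(io))
    chunks << chunk
  end
  chunks
end

via_readpartial = drain(StringIO.new("abcdefghij"))
via_read        = drain(ReadOnlySource.new("abcdefghij"))
```

On a pipe or socket, `readpartial` returns as soon as any data is available instead of waiting for a full chunk, which is a likely reason for preferring it when the IO supports it.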
.scan_buffer(buffer, scan, doc_start, stack, mode) ⇒ Object
# File 'lib/smarter_json/parser.rb', line 479
def scan_buffer(buffer, scan, doc_start, stack, mode)
  while scan < buffer.bytesize
    b = buffer.getbyte(scan)

    # A multi-byte marker (// /* ''' */) whose lead byte is here but whose
    # remaining bytes have not arrived yet must not be guessed at: advancing
    # past the lead byte would misread the brace/quote that follows it once the
    # next chunk lands. Stop and let each_document append more input, then resume
    # from this same position. At true EOF the leftover is parsed whole instead.
    break if defer_for_split_marker?(buffer, scan, b, mode, doc_start)

    if mode == :double
      if b == BACKSLASH
        scan += 2
      elsif b == DQUOTE
        mode = nil
        scan += 1
      else
        scan += 1
      end
    elsif mode == :single
      if b == BACKSLASH
        scan += 2
      elsif b == SQUOTE
        mode = nil
        scan += 1
      else
        scan += 1
      end
    elsif mode == :triple
      if buffer.byteslice(scan, 3) == "'''"
        mode = nil
        scan += 3
      else
        scan += 1
      end
    elsif mode == :line_comment
      if [LF, CR].include?(b)
        mode = nil
      else
        scan += 1
        next
      end
    elsif mode == :block_comment
      if buffer.byteslice(scan, 2) == '*/'
        mode = nil
        scan += 2
      else
        scan += 1
      end
    elsif doc_start.nil?
      if whitespace_byte?(b)
        scan += 1
      elsif line_comment_start?(buffer, scan)
        mode = :line_comment
        scan += buffer.getbyte(scan) == HASH ? 1 : 2
      elsif block_comment_start?(buffer, scan)
        mode = :block_comment
        scan += 2
      elsif [LBRACE, LBRACKET].include?(b)
        doc_start = scan
        stack << b
        scan += 1
      else
        scan = buffer.bytesize
      end
    else
      if mode.nil? && line_comment_start?(buffer, scan)
        mode = :line_comment
        scan += buffer.getbyte(scan) == HASH ? 1 : 2
      elsif mode.nil? && block_comment_start?(buffer, scan)
        mode = :block_comment
        scan += 2
      elsif b == DQUOTE
        mode = :double
        scan += 1
      elsif buffer.byteslice(scan, 3) == "'''"
        mode = :triple
        scan += 3
      elsif b == SQUOTE
        mode = :single
        scan += 1
      elsif [LBRACE, LBRACKET].include?(b)
        stack << b
        scan += 1
      elsif b == RBRACE
        stack.pop if stack.last == LBRACE
        scan += 1
        if stack.empty?
          doc = buffer.byteslice(doc_start, scan - doc_start)
          buffer = buffer.byteslice(scan..-1) || +""
          return [doc, buffer, 0, nil, [], nil]
        end
      elsif b == RBRACKET
        stack.pop if stack.last == LBRACKET
        scan += 1
        if stack.empty?
          doc = buffer.byteslice(doc_start, scan - doc_start)
          buffer = buffer.byteslice(scan..-1) || +""
          return [doc, buffer, 0, nil, [], nil]
        end
      else
        scan += 1
      end
    end
  end

  [nil, buffer, scan, doc_start, stack, mode]
end
.separators_only?(buffer) ⇒ Boolean
# File 'lib/smarter_json/parser.rb', line 610
def separators_only?(buffer)
  scan = 0
  mode = nil

  while scan < buffer.bytesize
    b = buffer.getbyte(scan)

    if mode == :line_comment
      if [LF, CR].include?(b)
        mode = nil
      else
        scan += 1
        next
      end
    elsif mode == :block_comment
      if buffer.byteslice(scan, 2) == '*/'
        mode = nil
        scan += 2
      else
        scan += 1
      end
    elsif whitespace_byte?(b)
      scan += 1
    elsif line_comment_start?(buffer, scan)
      mode = :line_comment
      scan += buffer.getbyte(scan) == HASH ? 1 : 2
    elsif block_comment_start?(buffer, scan)
      mode = :block_comment
      scan += 2
    else
      return false
    end
  end

  true
end
.transcode_chunk(conv, raw) ⇒ Object
Push one raw chunk through the converter, returning the UTF-8 produced so far. An incomplete trailing multibyte sequence is held inside the converter until the next chunk; invalid bytes raise SmarterJSON::EncodingError (matching the whole-buffer to_utf8_copy).
# File 'lib/smarter_json/parser.rb', line 450
def transcode_chunk(conv, raw)
  return raw.dup.force_encoding(Encoding::UTF_8) if conv.nil? # raw bytes are already UTF-8

  out = +""
  status = conv.primitive_convert(raw.dup, out, nil, nil, partial_input: true)
  raise SmarterJSON::EncodingError, "invalid byte sequence in stream" if status == :invalid_byte_sequence

  out
end
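The "held inside the converter" behaviour can be observed directly with Ruby's Encoding::Converter. The sketch below splits a UTF-16LE input in the middle of a character and feeds the halves separately. `feed` is a hypothetical helper mirroring the call shape used above, and the sample text is an assumption.

```ruby
# Hypothetical helper with the same call shape as transcode_chunk (minus the
# error handling): returns the UTF-8 produced for this chunk plus the status.
def feed(conv, raw)
  out = +""
  status = conv.primitive_convert(raw.dup, out, nil, nil, partial_input: true)
  [out, status]
end

conv = Encoding::Converter.new("UTF-16LE", "UTF-8")
raw = "{\u00E9}".encode("UTF-16LE").b # bytes: 7B 00 E9 00 7D 00

# Split after the third byte, in the middle of the two-byte unit for U+00E9.
out1, status1 = feed(conv, raw.byteslice(0, 3))
out2, status2 = feed(conv, raw.byteslice(3, 3))

utf8_bytes = out1.b + out2.b
flush_status = conv.primitive_convert("".b, +"")
```

The first call emits only the bytes of complete characters; the dangling byte is carried over and completed by the second call, so the framer always scans whole UTF-8 characters regardless of where the raw chunk boundary fell.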