Class: Pikuri::Extractor::Page

Inherits:
Data
  • Object
Defined in:
lib/pikuri/extractor.rb

Overview

One windowed slice of a document, returned by extract_paged. The caller turns this into an observation; this struct carries everything a trailer needs without the caller re-reading the document.

Fields

  • lines: Array<String>, the collected window. Lines are already truncated individually (marked with PAGE_LINE_TRUNCATION_MARKER) but not line-numbered; numbering is presentation the caller adds. For a PDF the array includes the “— Page N —” marker lines that pikuri-pdf’s extractor emits, which count toward limit and the byte cap like any other line.

  • start_line: the 1-indexed line number of lines.first (i.e. the offset the caller asked for). lines.last is at start_line + lines.length - 1.

  • total_lines: total line count of the document when known, else nil. Known when the read reached EOF, when the format was extracted in full (no extract_lines, e.g. HTML), or when the lazy stream is cheap enough to count to the end (plain text). nil when a lazy stream stopped early: the byte cap fired, or a PDF filled the window before its last page (counting the rest would mean parsing every page, defeating the laziness).

  • more: true if content remains past this window (the caller should offer offset = start_line + lines.length).

  • byte_capped: true if the byte cap (not the line limit) was the stopping criterion.

  • kind: the matched extractor’s kind tag (:text / :pdf / :html); lets the caller word format-specific trailers and the empty-document message.
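
The line arithmetic in the field list can be sketched as follows. This is a minimal illustration, not pikuri's own code: PageSketch is a hypothetical stand-in that only mirrors the six documented fields, and numbered and next_offset are made-up helper names.

```ruby
# Hypothetical stand-in for Pikuri::Extractor::Page (same six fields).
# Data.define needs Ruby 3.2+; older Rubies fall back to a keyword Struct.
fields = %i[lines start_line total_lines more byte_capped kind]
PageSketch =
  if defined?(Data) && Data.respond_to?(:define)
    Data.define(*fields)
  else
    Struct.new(*fields, keyword_init: true)
  end

# Prefix each line with its 1-indexed document line number.
# Numbering is presentation, so it lives on the caller's side.
def numbered(page)
  page.lines.each_with_index.map do |line, i|
    format('%6d  %s', page.start_line + i, line)
  end
end

# Offset to request for the next window, or nil when nothing remains.
def next_offset(page)
  page.more ? page.start_line + page.lines.length : nil
end

page = PageSketch.new(
  lines: %w[alpha beta gamma], start_line: 41, total_lines: nil,
  more: true, byte_capped: false, kind: :text
)

puts numbered(page)
puts "next offset: #{next_offset(page)}"
```

Here the window covers lines 41 through 43, so the next request would start at offset 44.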

An empty document yields lines: [], total_lines: 0; an offset past EOF yields lines: [] with total_lines set to the real (non-zero) count — the caller distinguishes the two.
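
One way a caller might turn these fields into a trailer, including the empty versus past-EOF distinction, is sketched below. TrailerPage is a hypothetical stand-in for the real class, trailer is an illustrative helper, and the message wording is invented; the sketch also assumes that start_line equals the requested offset when the offset is past EOF.

```ruby
# Hypothetical stand-in for Pikuri::Extractor::Page (same six fields).
# Data.define needs Ruby 3.2+; older Rubies fall back to a keyword Struct.
fields = %i[lines start_line total_lines more byte_capped kind]
TrailerPage =
  if defined?(Data) && Data.respond_to?(:define)
    Data.define(*fields)
  else
    Struct.new(*fields, keyword_init: true)
  end

# Build a one-line trailer from the page alone, without re-reading the document.
def trailer(page)
  if page.lines.empty?
    # Empty document: no lines and a known total of zero.
    return "(empty #{page.kind} document)" if page.total_lines == 0

    # Otherwise the offset was past EOF and total_lines holds the real count.
    return "(offset #{page.start_line} is past the end; " \
           "document has #{page.total_lines} lines)"
  end

  last = page.start_line + page.lines.length - 1
  of_total = page.total_lines ? " of #{page.total_lines}" : ''
  text = "lines #{page.start_line}-#{last}#{of_total}"
  text += ' (stopped at byte cap)' if page.byte_capped
  text += "; continue with offset #{last + 1}" if page.more
  text
end

empty = TrailerPage.new(lines: [], start_line: 1, total_lines: 0,
                        more: false, byte_capped: false, kind: :pdf)
past  = TrailerPage.new(lines: [], start_line: 500, total_lines: 120,
                        more: false, byte_capped: false, kind: :text)
part  = TrailerPage.new(lines: %w[a b], start_line: 10, total_lines: nil,
                        more: true, byte_capped: true, kind: :pdf)

puts trailer(empty)
puts trailer(past)
puts trailer(part)
```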

Instance Attribute Summary

Instance Attribute Details

#byte_capped ⇒ Object (readonly)

Returns the value of attribute byte_capped

Returns:

  • (Object)

    the current value of byte_capped



# File 'lib/pikuri/extractor.rb', line 146

def byte_capped
  @byte_capped
end

#kind ⇒ Object (readonly)

Returns the value of attribute kind

Returns:

  • (Object)

    the current value of kind



# File 'lib/pikuri/extractor.rb', line 146

def kind
  @kind
end

#lines ⇒ Object (readonly)

Returns the value of attribute lines

Returns:

  • (Object)

    the current value of lines



# File 'lib/pikuri/extractor.rb', line 146

def lines
  @lines
end

#more ⇒ Object (readonly)

Returns the value of attribute more

Returns:

  • (Object)

    the current value of more



# File 'lib/pikuri/extractor.rb', line 146

def more
  @more
end

#start_line ⇒ Object (readonly)

Returns the value of attribute start_line

Returns:

  • (Object)

    the current value of start_line



# File 'lib/pikuri/extractor.rb', line 146

def start_line
  @start_line
end

#total_lines ⇒ Object (readonly)

Returns the value of attribute total_lines

Returns:

  • (Object)

    the current value of total_lines



# File 'lib/pikuri/extractor.rb', line 146

def total_lines
  @total_lines
end