Class: Archaeo::WarcReader

Inherits:
Object
Defined in:
lib/archaeo/warc_support.rb

Overview

Reads WARC (Web ARChive) format files (.warc, .warc.gz).

Parses WARC 1.0 records and yields WarcRecord value objects containing headers and body content.

Constant Summary

WARC_VERSION = "WARC/1.0"
CRLF = "\r\n"
HEADER_END = "\r\n\r\n"

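The three constants describe how a WARC 1.0 record is framed: a version line, CRLF-separated named fields, and a blank line (HEADER_END) before the body. The following is a minimal sketch of that framing, not the reader's actual parsing code. The helper name split_warc_record is hypothetical, and the sketch assumes the whole record is already in memory as a String.

```ruby
# Constants as listed in the Constant Summary.
WARC_VERSION = "WARC/1.0"
CRLF = "\r\n"
HEADER_END = "\r\n\r\n"

# Hypothetical helper (not part of Archaeo::WarcReader): splits one raw
# record into a headers Hash and a body String.
def split_warc_record(raw)
  # Everything before the first blank line is the header block.
  header_block, body = raw.split(HEADER_END, 2)
  version, *header_lines = header_block.split(CRLF)
  raise ArgumentError, "unsupported version: #{version.inspect}" unless version == WARC_VERSION

  # Each header line has the form "Name: value".
  headers = header_lines.to_h do |line|
    name, value = line.split(":", 2)
    [name.strip, value.to_s.strip]
  end

  # Content-Length gives the body size in bytes; the trailing CRLF CRLF
  # that terminates the record is not part of the body.
  length = Integer(headers.fetch("Content-Length"))
  [headers, body.to_s.byteslice(0, length)]
end

raw = [
  "WARC/1.0",
  "WARC-Type: resource",
  "WARC-Target-URI: http://example.com/",
  "Content-Length: 5",
].join(CRLF) + HEADER_END + "hello" + CRLF + CRLF

headers, body = split_warc_record(raw)
```

Slicing the body by Content-Length rather than searching for the next blank line matters because a record body may itself contain CRLF CRLF sequences, for example an embedded HTTP response.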
Instance Method Summary

Constructor Details

#initialize ⇒ WarcReader

Returns a new instance of WarcReader.



# File 'lib/archaeo/warc_support.rb', line 17

def initialize
  @record_count = 0
end

Instance Method Details

#read(path, &block) ⇒ Object



# File 'lib/archaeo/warc_support.rb', line 21

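# Opens the WARC file at +path+, yields each parsed record to the block,
# and closes the underlying IO even if parsing raises.
def read(path, &block)
  io = open_warc(path)
  read_records_from_io(io, &block)
ensure
  io&.close
end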
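
The open_warc helper is not shown on this page. Since the class accepts both .warc and .warc.gz files, one plausible shape for it is to pick the IO class by file extension. The sketch below is an assumption along those lines, using only Ruby's standard library; open_warc_sketch and first_line are hypothetical names, and the real helper may behave differently.

```ruby
require "zlib"
require "tmpdir"

# Hypothetical stand-in for the private open_warc helper.
def open_warc_sketch(path)
  if path.to_s.end_with?(".gz")
    Zlib::GzipReader.open(path) # without a block, returns the GzipReader
  else
    File.open(path, "rb")       # binary mode: record bodies are raw bytes
  end
end

# Mirrors the open / use / ensure-close pattern of #read.
def first_line(path)
  io = open_warc_sketch(path)
  io.gets
ensure
  io&.close
end

# Write a plain and a gzipped sample file, then read both back.
lines = Dir.mktmpdir do |dir|
  plain = File.join(dir, "sample.warc")
  gzipped = File.join(dir, "sample.warc.gz")
  File.binwrite(plain, "WARC/1.0\r\n")
  Zlib::GzipWriter.open(gzipped) { |gz| gz.write("WARC/1.0\r\n") }

  [first_line(plain), first_line(gzipped)]
end
```

Both calls return the same version line, so the record-parsing code can stay unaware of compression. Note that .warc.gz files are often written with one gzip member per record, and a single Zlib::GzipReader stops at the end of the first member; this sketch does not handle that case.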

#read_records(path) ⇒ Object



# File 'lib/archaeo/warc_support.rb', line 28

# Eagerly collects every record yielded by #read into an Array and returns it.
def read_records(path)
  records = []
  read(path) { |record| records << record }
  records
end
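
Because #read_records holds every record in memory at once, #read with a block is likely the better fit for large archives. The sketch below illustrates the difference with a hypothetical stand-in class, SketchReader, that has the same read / read_records shape but yields plain strings instead of WarcRecord objects and ignores the path.

```ruby
# Hypothetical stand-in with the same method shape as the reader above.
class SketchReader
  def initialize(items)
    @items = items
  end

  # Streams items one at a time to the caller's block.
  def read(_path)
    @items.each { |item| yield item }
  end

  # Same collect-via-block pattern as #read_records.
  def read_records(path)
    records = []
    read(path) { |record| records << record }
    records
  end
end

reader = SketchReader.new(%w[warcinfo response resource])

# Streaming: count records without retaining any of them.
count = 0
reader.read("ignored.warc") { |_record| count += 1 }

# Eager: all records are materialized in one Array.
all = reader.read_records("ignored.warc")
```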