Class: Rsssf::Page

Inherits: Object

Includes: Utils

Defined in: lib/rsssf/page.rb,
lib/rsssf/page-meta.rb,
lib/rsssf/page-find_schedule.rb

Overview

Note:

An RSSSF page may contain many leagues and cups, each with tables, schedules (rounds), notes, etc.

An RSSSF page MUST be plain text (.txt); UTF-8 character encoding is assumed.

Constant Summary

OPT_REF = lets you check for an optional ref, e.g. ‹§fin›. todo/fix - change to OPT_REF_RE - make it a regex; a regex embedded in a regex will use regex.source automatically (no need to escape)!!

%q{
   (?: [ ]*
     ‹§ (?<ref> [^›]+?) ›
   )?
}

HX_RE =

%r{          ## negative lookahead
         ##   do NOT match  =-=
         ##   do NOT match  ===========  (without any heading text!!)
         ##     e.g.
         ##       Fall season
         ##       ===========

        (?! ^[ ]* (?:    =-=
                     |  ={1,} [ ]* $
                   )
         )

         ^
        [ ]*

      (?<marker> ={1,6})
         [ ]*
      (?<text> .+?)
         #{OPT_REF}
         [ ]*
$}x
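
The following is a minimal sketch of how HX_RE behaves on a small page, not part of the library itself. It uses local copies of OPT_REF and HX_RE (as the lowercase locals opt_ref and hx_re, with comments trimmed) so that it runs without the rsssf gem; the sample text is made up for illustration.

```ruby
# Local copies of OPT_REF and HX_RE (comments trimmed); the lowercase
# locals are stand-ins for the constants documented above.
opt_ref = %q{
   (?: [ ]*
     ‹§ (?<ref> [^›]+?) ›
   )?
}

hx_re = %r{
        (?! ^[ ]* (?:    =-=
                     |  ={1,} [ ]* $
                   )
         )
         ^
        [ ]*
      (?<marker> ={1,6})
         [ ]*
      (?<text> .+?)
         #{opt_ref}
         [ ]*
$}x

# Made-up sample page: three real headings, one "underline" (=======)
# and one separator (=-=-=-=) that must NOT count as headings.
sample = <<~TXT
  = Germany 1992/93

  == Bundesliga ‹§bl›

  Round 1
  =======

  =-=-=-=

  === Final Table
TXT

# scan returns one [marker, text, ref] triple per heading;
# ref is nil when no ‹§...› suffix is present.
headings = sample.scan(hx_re)

# Same formatting as used for the table of contents: "<marker> <text>".
toc = headings.map { |marker, text, _ref| "#{marker} #{text}" }
puts toc
```

The underline and separator lines are skipped by the negative lookahead, and the optional ref is captured separately from the heading text.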

HTML_COMMENT_HEADER_RE = note - \A - start of string; the comment must start the .txt document!!!

%r{  \A
         [ \n]*  ## leading spaces and blank lines
    <!--
         [ \n]*
       (?<text> .+?)
         [ \n]*
     -->
}imx

HEADER_RE = note - starts at

%r{          ## negative lookahead
         ##   do NOT match  =-=
         ##   do NOT match  ===========  (without any heading text!!)
         ##     e.g.
         ##       Fall season
         ##       ===========

        (?! ^[ ]* (?:    =-=
                     |  ={1,} [ ]* $
                   )
         )

         ^
        [ ]*
      (?<marker> ={1,6})
         [ ]*
      (?<text> .+?)
         #{OPT_REF}
         [ ]*
$}x

Constants included from Utils

Utils::YEAR_FROM_NAME_RE

Instance Attribute Summary

#txt ⇒ Object

use text alias too (for txt) - why? why not?

#url ⇒ Object

source url.

Class Method Summary

.parse_meta(txt) ⇒ Object
.read_cache(url) ⇒ Object

use read_cache /web/html or such - why? why not?

.read_txt(path) ⇒ Object

use read_txt.

Instance Method Summary

#_build_toc(txt) ⇒ Object
#_find_schedule(header:, strict: false) ⇒ Object
#_scan_headings ⇒ Object

change to outline - why? why not?

#_split_sections(txt, level: 2) ⇒ Object
#_walk_sections(txt, header:, depth:, strict: false) ⇒ Object
#build_stat ⇒ Object
#find_schedule!(header:) ⇒ Object

make header required - yes; change to build_schedule - why? why not???; add level: 2 or such - why? why not?

#initialize(txt) ⇒ Page (constructor)

A new instance of Page.

#parse_meta(txt) ⇒ Object
#save(path) ⇒ Object

Methods included from Utils

#archive_dir_for_season, #year_from_file, #year_from_name

Constructor Details

#initialize(txt) ⇒ `Page`

Returns a new instance of Page.

# File 'lib/rsssf/page.rb', line 56

def initialize( txt )
  @txt   = txt
  @url   = nil
end

Instance Attribute Details

#txt ⇒ `Object`

use text alias too (for txt) - why? why not?




# File 'lib/rsssf/page.rb', line 52

def txt
  @txt
end

#url ⇒ `Object`

source url




# File 'lib/rsssf/page.rb', line 53

def url
  @url
end

Class Method Details

.parse_meta(txt) ⇒ `Object`

# File 'lib/rsssf/page-meta.rb', line 40

def self.parse_meta( txt )
     meta = {}
     m = HTML_COMMENT_HEADER_RE.match( txt )
     if m
        text = m[:text]
        text.each_line do |line|
            line = line.strip

            ## note - allow "inline" blank lines and comment lines (starting w/ #)
            next if line.empty?  || line.start_with?('#')

            ## split line on first colon (:) (only)
            ##   note - limit split to two pieces!!!
            key, value = line.split( /[ ]*:[ ]*/, 2)
            ## use a symbol (not string) as key - why? why not?
            meta[ key.to_sym ] = value
        end
        meta
     else
        nil ## no meta data (comment header) found
     end
end
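
As a rough illustration of the parsing steps above, the sketch below runs the same logic on a made-up page. It uses a local copy of HTML_COMMENT_HEADER_RE (as the lowercase local header_comment_re) so that it runs without the rsssf gem; the title value is invented, and the source URL is the example URL quoted in the build_stat listing.

```ruby
# Local copy of HTML_COMMENT_HEADER_RE (lowercase local as a stand-in).
header_comment_re = %r{  \A
         [ \n]*
    <!--
         [ \n]*
       (?<text> .+?)
         [ \n]*
     -->
}imx

# Made-up page: the comment header MUST be the first thing in the text.
txt = <<~TXT
  <!--
  title:   Germany 1989
  # comment lines and blank lines are skipped

  source:  http://rsssf.org/tabled/duit89.html
  -->

  = Germany 1989
TXT

meta = {}
m = header_comment_re.match(txt)
if m
  m[:text].each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?('#')

    # Limit of 2: only the first colon splits, so the colon
    # inside "http://..." stays part of the value.
    key, value = line.split(/[ ]*:[ ]*/, 2)
    meta[key.to_sym] = value
  end
end

pp meta
```

Splitting with a limit of two is what keeps URL values intact; without the limit, the value would be cut at the colon after "http".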

.read_cache(url) ⇒ `Object`

use read_cache /web/html or such - why? why not?

# File 'lib/rsssf/page.rb', line 30

def self.read_cache( url )  ### use read_cache /web/html or such - why? why not?
  html = Webcache.read( url )

  puts "html:"
  pp html[0..400]

  txt = PageConverter.convert( html, url: url )

  new( txt )
end

.read_txt(path) ⇒ `Object`

use read_txt

# File 'lib/rsssf/page.rb', line 43

def self.read_txt( path )  ## use read_txt
    # note: always assume sources (already) converted from html to txt!!!!
  txt = read_text( path )
  new( txt )
end

Instance Method Details

#_build_toc(txt) ⇒ `Object`

# File 'lib/rsssf/page.rb', line 106

def _build_toc( txt )

     hx =  txt.scan( HX_RE )

     toc = []
       hx.each do |marker,text,ref|
          toc <<  "#{marker} #{text}"
       end
     toc
end

#_find_schedule(header:, strict: false) ⇒ `Object`

# File 'lib/rsssf/page-find_schedule.rb', line 76

def _find_schedule( header:, strict: false )
    ## make sure header is an array
    header = [header]    if header.is_a?( String )

    txt = _walk_sections( @txt, header: header,
                                depth:  0,
                                strict: strict )

    if txt
        ## wrap in schedule class - why? why not?
        schedule = Schedule.new( txt )
        schedule
    else
       nil
    end
end

#_scan_headings ⇒ `Object`

change to outline - why? why not?

# File 'lib/rsssf/page.rb', line 102

def _scan_headings() txt.scan( HX_RE ); end

#_split_sections(txt, level: 2) ⇒ `Object`

# File 'lib/rsssf/page-find_schedule.rb', line 42

def _split_sections( txt, level: 2 )

  sections = {}
  current  = nil

  txt.each_line do |line|
    if m=HEADER_RE.match( line )
        header_level  = m[:marker].size
        header_text   = m[:text]
        if header_level == level
           current = String.new
           sections[ header_text ] = current
           next
        end
    end

    current << line    if current
  end

  sections
end

#_walk_sections(txt, header:, depth:, strict: false) ⇒ `Object`

# File 'lib/rsssf/page-find_schedule.rb', line 94

def _walk_sections( txt, header:,
                         depth:,
                         strict: false )

   query      =  header[depth]
   query_next =  header[depth+1]

   ## note - start at level 2
   sections = _split_sections( txt, level: depth+2 )

   txt = sections[ query ]
   if txt
       if query_next
         txt = _walk_sections( txt, header: header,
                                    depth: depth+1,
                                    strict: strict )
         txt
       else
         txt
       end
   else
      if strict
        ## note - raise if not found (strict mode)!!!
        raise ArgumentError, "section with header >#{query}< not found; sections incl. #{sections.keys}"
      else
        ## note - return nil if not found
        nil
      end
   end
end
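
The section walk can be sketched in a self-contained form as follows. This is an approximation under stated assumptions, not the library code: header_re is a simplified stand-in for HEADER_RE (no ‹§ref› suffix, no underline exclusion), split_sections and walk are hypothetical lambdas mirroring _split_sections and _walk_sections, and the sample page is made up.

```ruby
# Simplified stand-in for HEADER_RE (assumption: no ref suffix handling).
header_re = /^[ ]*(?<marker>={1,6})[ ]*(?<text>.+?)[ ]*$/

# Collect the body text under every heading of exactly the given level.
# Deeper headings stay inside the body so they can be split again later.
split_sections = lambda do |txt, level|
  sections = {}
  current  = nil
  txt.each_line do |line|
    m = header_re.match(line)
    if m && m[:marker].size == level
      current = +''
      sections[m[:text]] = current
      next
    end
    current << line if current
  end
  sections
end

# header[0] is looked up among "==" headings, header[1] among "===", etc.
walk = lambda do |txt, header, depth, strict|
  sections = split_sections.call(txt, depth + 2)
  query    = header[depth]
  body     = sections[query]
  if body.nil?
    raise ArgumentError, "section >#{query}< not found; sections incl. #{sections.keys}" if strict
    return nil
  end
  header[depth + 1] ? walk.call(body, header, depth + 1, strict) : body
end

page = <<~TXT
  = Germany 1992/93

  == Bundesliga

  === Round 1
  Team A 2-0 Team B

  === Round 2
  Team B 1-1 Team A

  == Cup

  === Final
  Team A 1-0 Team B
TXT

round2  = walk.call(page, ['Bundesliga', 'Round 2'], 0, false)
missing = walk.call(page, ['Bundesliga', 'Round 9'], 0, false)  # nil, not strict

strict_error =
  begin
    walk.call(page, ['Oberliga'], 0, true)   # strict: raises instead of nil
    nil
  rescue ArgumentError => e
    e.message
  end

puts round2
puts strict_error
```

Each level of the header path narrows the text to one section body, which is then split again one heading level deeper.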

#build_stat ⇒ `Object`

# File 'lib/rsssf/page.rb', line 132

def build_stat
  title        = nil
  source       = nil
  authors      = nil
  last_updated = nil

  meta = parse_meta( @txt ) || {}

  title        = meta[:title]
  source       = meta[:source]
  authors      = meta[:author] || meta[:authors]   ## note - check for author & authors !!!
  last_updated = meta[:updated]


  puts "*** !!! missing source"        if source.nil?
  puts "*** !!! missing author(s)"     if authors.nil?
  puts "**  !!! missing last updated"  if last_updated.nil?


  ## get year from source (url)
  ###   move (for reuse) to  year_from_url  in utils - why? why not?
  url_path  = URI.parse( source ).path
  basename  = File.basename( url_path, File.extname( url_path ) )  ## e.g. duit92.txt or duit92.html => duit92
  puts "   basename=>#{basename}<"
  year      = year_from_name( basename )


  sections = _build_toc( txt )



  rec = PageStat.new
  rec.source       = source         # e.g. http://rsssf.org/tabled/duit89.html   -- use source_url - why?? why not??
  rec.year         = year       ## note: in 2021/22  - year is always end_year, that is, 2022
  rec.title        = title
  rec.authors      = authors
  rec.last_updated = last_updated
  rec.line_count   = @txt.lines.count    ### or @txt.each_line.count
  rec.char_count   = @txt.size          ## note - size/length is true char count (@txt.bytesize is byte count!!)
  rec.sections     = sections

  rec
end
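
Two steps in build_stat can be tried in isolation using only the Ruby standard library: deriving the basename from the source URL, and the difference between character and byte counts. The sketch below uses the example URL from the listing above; the two-line sample text is made up.

```ruby
require 'uri'

# Basename from the source URL, as done before calling year_from_name.
source   = 'http://rsssf.org/tabled/duit89.html'
url_path = URI.parse(source).path                           # "/tabled/duit89.html"
basename = File.basename(url_path, File.extname(url_path))  # "duit89"

# Made-up text with two umlauts (each takes two bytes in UTF-8).
txt = "1. FC Köln\nBayern München\n"

line_count = txt.lines.count   # number of lines
char_count = txt.size          # characters, not bytes
byte_count = txt.bytesize      # two larger than char_count here

puts basename
puts "lines=#{line_count} chars=#{char_count} bytes=#{byte_count}"
```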

#find_schedule!(header:) ⇒ `Object`

make header required - yes

change to build_schedule - why? why not???
 add level: 2 or such - why? why not?




# File 'lib/rsssf/page-find_schedule.rb', line 71

def find_schedule!( header: )
    _find_schedule( header: header, strict: true )
end

#parse_meta(txt) ⇒ `Object`

# File 'lib/rsssf/page-meta.rb', line 62

def parse_meta( txt ) self.class.parse_meta( txt ); end

#save(path) ⇒ `Object`




# File 'lib/rsssf/page.rb', line 177

def save( path )
  write_text( path, @txt )
end
