Class: Iev::Scraper::PageParser

Inherits:
Object
Defined in:
lib/iev/scraper/page_parser.rb

Overview

Parses an Electropedia HTML page into a concept data hash.

The Electropedia HTML structure is a table with rows for each language:

  • Language row: <div align="center"><font color="#800080">en</font></div>

  • Term cell: term text in the third <td>

  • Definition row: next row’s third <td> (if present)

  • Empty/separator rows with <hr> or spacer images
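
The row layout above can be illustrated with a small, self-contained sketch. Everything here is an assumption for illustration: the sample fragment is hypothetical (real Electropedia markup may differ in detail), the names `SAMPLE_HTML`, `LANG_MARKER`, `table_rows` and `extract_entries` are not part of PageParser, and plain regular expressions stand in for whatever document object PageParser actually receives as `doc`.

```ruby
# Hypothetical fragment shaped like the structure described above.
SAMPLE_HTML = <<~HTML
  <table>
    <tr>
      <td><div align="center"><font color="#800080">en</font></div></td>
      <td></td>
      <td>example term</td>
    </tr>
    <tr>
      <td></td>
      <td></td>
      <td>example definition</td>
    </tr>
    <tr><td colspan="3"><hr></td></tr>
    <tr>
      <td><div align="center"><font color="#800080">de</font></div></td>
      <td></td>
      <td>Beispielbegriff</td>
    </tr>
  </table>
HTML

# Matches the purple <font> marker that identifies a language row.
LANG_MARKER = %r{<font color="#800080">([a-z]{2})</font>}

# Split the table into rows, then each row into its <td> cell contents.
def table_rows(html)
  html.scan(%r{<tr>(.*?)</tr>}m).map do |(row)|
    row.scan(%r{<td[^>]*>(.*?)</td>}m).map { |(cell)| cell.strip }
  end
end

# A language row yields the term (third cell). The row directly after it
# yields the definition, unless that row is itself a language row or has
# no third cell (for example an <hr> separator row).
def extract_entries(html)
  rows = table_rows(html)
  rows.each_with_index.filter_map do |cells, i|
    lang = cells[0].to_s[LANG_MARKER, 1]
    next unless lang

    following = rows[i + 1] || []
    definition = following[2] unless following[0].to_s.match?(LANG_MARKER)
    { lang: lang, term: cells[2], definition: definition.to_s.empty? ? nil : definition }
  end
end
```

Calling `extract_entries(SAMPLE_HTML)` on this fragment gives one entry per language row; the `de` entry has a `nil` definition because no definition row follows it. Regular expressions are brittle on real-world HTML, so this only demonstrates the row-walking idea, not a production approach.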

Constant Summary

LANG_CODE_MAP =

Maps Electropedia's two-letter HTML language codes to three-letter ISO 639-2/3 codes.

{
  "en" => "eng",
  "fr" => "fra",
  "ar" => "ara",
  "de" => "deu",
  "es" => "spa",
  "it" => "ita",
  "ko" => "kor",
  "ja" => "jpn",
  "pl" => "pol",
  "pt" => "por",
  "sr" => "srp",
  "sv" => "swe",
  "zh" => "zho",
  "nl" => "nld",
  "fi" => "fin",
  "cs" => "ces",
  "no" => "nor",
  "ru" => "rus",
  "sl" => "slv",
  "sk" => "slk",
}.freeze
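
As a minimal sketch of how such a map might be consulted: the helper `to_iso639` below is hypothetical and not part of PageParser, and returning `nil` for unmapped codes is an assumption, since the source does not show how PageParser handles unknown languages. The hash is an excerpt of the constant above, repeated so the snippet runs on its own.

```ruby
# Excerpt of the mapping shown above, repeated so the snippet is standalone.
LANG_CODE_MAP = {
  "en" => "eng",
  "fr" => "fra",
  "de" => "deu",
  "zh" => "zho",
}.freeze

# Hypothetical helper (not part of PageParser): normalizes the code taken
# from the HTML and returns the three-letter code, or nil if it is unmapped.
def to_iso639(html_code)
  LANG_CODE_MAP[html_code.to_s.strip.downcase]
end
```

Because the hash is frozen, an attempt to add or change an entry at runtime raises `FrozenError`, which keeps the mapping stable across callers.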

Instance Method Summary

Constructor Details

#initialize(doc, code) ⇒ PageParser

Returns a new instance of PageParser.



# File 'lib/iev/scraper/page_parser.rb', line 37

def initialize(doc, code)
  @doc = doc
  @code = code
end

Instance Method Details

#parse ⇒ Object

Returns the concept data hash, or nil when find_iev_ref returns a falsy value.



# File 'lib/iev/scraper/page_parser.rb', line 42

def parse
  return nil unless find_iev_ref

  {
    "id" => @code,
    "data" => {
      "identifier" => @code,
      "localized_concepts" => localized_concepts,
    },
  }
end
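
Since #parse can return either nil or a nested hash, callers need to handle both cases. The sketch below shows one way to do that. `FakePageParser` and `concept_identifier` are hypothetical stand-ins: the fake class only mimics the return shape of the listing above and does no HTML parsing, the empty hash under "localized_concepts" is a placeholder because the real value's structure is not shown here, and "103-01-02" is just an example code value.

```ruby
# Hypothetical stand-in that mimics the return shape of #parse above.
class FakePageParser
  def initialize(found, code)
    @found = found # stands in for the result of find_iev_ref
    @code = code
  end

  def parse
    return nil unless @found

    {
      "id" => @code,
      "data" => {
        "identifier" => @code,
        "localized_concepts" => {}, # placeholder; real structure not shown
      },
    }
  end
end

# Hypothetical caller: returns the identifier, or nil when parsing
# produced nothing.
def concept_identifier(parser)
  result = parser.parse
  return nil if result.nil?

  result.dig("data", "identifier")
end
```

With this stand-in, `concept_identifier(FakePageParser.new(true, "103-01-02"))` returns the code that was passed in, while a parser constructed with `false` yields `nil`.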