Class: Iev::Scraper::PageParser
- Inherits:
- Object (inheritance chain: Object, Iev::Scraper::PageParser)
- Defined in:
- lib/iev/scraper/page_parser.rb
Overview
Parses an Electropedia HTML page into a concept data hash.
The Electropedia HTML structure is a table with rows for each language:
- Language row: <div align="center"><font color="#800080">en</font></div>
- Term cell: term text in the third <td>
- Definition row: next row’s third <td> (if present)
- Empty/separator rows with <hr> or spacer images
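The row layout described above can be sketched without an HTML parser by modelling each table row as a small hash. This is only an illustration of the pairing logic, not the class's actual implementation: the method name `pair_terms_with_definitions`, the keys `:lang`, `:third_cell`, and `:separator`, and the sample rows are all hypothetical.

```ruby
# Hypothetical model: each table row is a hash whose values were already
# extracted from the <td> cells. HTML parsing itself is out of scope here.
def pair_terms_with_definitions(rows)
  entries = []
  rows.each_with_index do |row, i|
    next unless row[:lang] # only language rows start an entry

    following = rows[i + 1]
    # A definition row is the next row, provided it is neither another
    # language row nor a separator row.
    has_definition = following && following[:lang].nil? && !following[:separator]

    entries << {
      lang: row[:lang],
      term: row[:third_cell],
      definition: has_definition ? following[:third_cell] : nil,
    }
  end
  entries
end

# Made-up sample rows, for illustration only.
rows = [
  { lang: "en", third_cell: "example term" },
  { lang: nil,  third_cell: "example definition" },
  { separator: true },
  { lang: "fr", third_cell: "terme d'exemple" },
  { separator: true },
]

pair_terms_with_definitions(rows).each { |entry| p entry }
```

In this sketch the English row picks up a definition from the row that follows it, while the French row, followed directly by a separator, gets a `nil` definition.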
Constant Summary
- LANG_CODE_MAP =
  Map Electropedia HTML language codes to ISO 639-2/3 three-char codes.
    {
      "en" => "eng",
      "fr" => "fra",
      "ar" => "ara",
      "de" => "deu",
      "es" => "spa",
      "it" => "ita",
      "ko" => "kor",
      "ja" => "jpn",
      "pl" => "pol",
      "pt" => "por",
      "sr" => "srp",
      "sv" => "swe",
      "zh" => "zho",
      "nl" => "nld",
      "fi" => "fin",
      "cs" => "ces",
      "no" => "nor",
      "ru" => "rus",
      "sl" => "slv",
      "sk" => "slk",
    }.freeze
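As a minimal sketch of how such a map might be consulted, the snippet below copies a three-entry subset of the constant so it runs on its own. The helper `to_iso_code` is hypothetical and not part of the documented API, and the whitespace and case normalization it performs is an assumption rather than documented behavior.

```ruby
# Subset of the map shown above, copied so the sketch is standalone.
LANG_CODE_MAP = {
  "en" => "eng",
  "fr" => "fra",
  "de" => "deu",
}.freeze

# Hypothetical helper: returns the three-letter code, or nil when the
# two-letter code is not in the map. Trimming and downcasing the input
# is an assumption made for this sketch.
def to_iso_code(html_code)
  LANG_CODE_MAP[html_code.to_s.strip.downcase]
end

puts to_iso_code("en")   # prints eng
puts to_iso_code(" FR ") # prints fra
p to_iso_code("xx")      # prints nil
```

Returning `nil` for unknown codes leaves the caller to decide whether to skip the row or report it.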
Instance Method Summary
- #initialize(doc, code) ⇒ PageParser (constructor)
  A new instance of PageParser.
- #parse ⇒ Object
Constructor Details
#initialize(doc, code) ⇒ PageParser
Returns a new instance of PageParser.
    # File 'lib/iev/scraper/page_parser.rb', line 37
    def initialize(doc, code)
      @doc = doc
      @code = code
    end
Instance Method Details
#parse ⇒ Object
    # File 'lib/iev/scraper/page_parser.rb', line 42
    def parse
      return nil unless find_iev_ref

      {
        "id" => @code,
        "data" => {
          "identifier" => @code,
          "localized_concepts" => localized_concepts,
        },
      }
    end
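Because #parse returns nil when find_iev_ref fails, callers need to handle both outcomes. The following sketch shows one way to read the result safely. The hash is a hand-built stand-in that mirrors the outer keys in the source above; the code "103-01-02" is a placeholder, the contents of "localized_concepts" are an assumption, and `concept_identifier` is a hypothetical helper.

```ruby
# Hand-built stand-in for a PageParser#parse result. Only the outer keys
# come from the source; the values are placeholders for illustration.
result = {
  "id" => "103-01-02",
  "data" => {
    "identifier" => "103-01-02",
    "localized_concepts" => {}, # actual shape not shown in the source
  },
}

# Hypothetical helper: extracts the identifier, tolerating a nil result
# (which #parse returns when no IEV reference is found).
def concept_identifier(result)
  return nil if result.nil?

  result.dig("data", "identifier")
end

puts concept_identifier(result) # prints 103-01-02
p concept_identifier(nil)       # prints nil
```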