Class: Iev::Scraper

Inherits:
Object
Defined in:
lib/iev/scraper.rb,
lib/iev/scraper/page_parser.rb

Overview

Scrapes IEV term data from Electropedia (electropedia.org).

Electropedia sits behind AWS WAF, which issues a challenge that requires JavaScript execution, so a headless Chrome browser (driven through Ferrum) is used to pass it.

Examples:

scraper = Iev::Scraper.new
concept = scraper.fetch_concept("103-01-02")
doc = scraper.fetch_page("103-01-02")

Defined Under Namespace

Classes: PageParser

Constant Summary

BASE_URL =
"https://www.electropedia.org/iev/iev.nsf/" \
"display?openform&ievref="
USER_AGENT_PROFILES =

Pool of realistic Chrome User-Agent strings with matching platform hints. Rotated per request to reduce fingerprinting by AWS WAF.

[
  {
    user_agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " \
                "AppleWebKit/537.36 (KHTML, like Gecko) " \
                "Chrome/131.0.0.0 Safari/537.36",
    platform: '"macOS"',
    chrome_version: "131",
  },
  {
    user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " \
                "AppleWebKit/537.36 (KHTML, like Gecko) " \
                "Chrome/130.0.0.0 Safari/537.36",
    platform: '"Windows"',
    chrome_version: "130",
  },
  {
    user_agent: "Mozilla/5.0 (X11; Linux x86_64) " \
                "AppleWebKit/537.36 (KHTML, like Gecko) " \
                "Chrome/131.0.0.0 Safari/537.36",
    platform: '"Linux"',
    chrome_version: "131",
  },
  {
    user_agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " \
                "AppleWebKit/537.36 (KHTML, like Gecko) " \
                "Chrome/129.0.0.0 Safari/537.36",
    platform: '"macOS"',
    chrome_version: "129",
  },
  {
    user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " \
                "AppleWebKit/537.36 (KHTML, like Gecko) " \
                "Chrome/131.0.0.0 Safari/537.36",
    platform: '"Windows"',
    chrome_version: "131",
  },
].freeze
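
The private random_headers helper that consumes this pool is not shown on this page. The following is only a sketch of how one profile might be turned into request headers whose client hints agree with the User-Agent string. The names headers_for and SAMPLE_PROFILES are hypothetical, and the choice of the Sec-CH-UA and Sec-CH-UA-Platform headers is an assumption suggested by the pre-quoted platform values, not something the source confirms.

```ruby
# Hypothetical two-entry pool shaped like USER_AGENT_PROFILES.
SAMPLE_PROFILES = [
  {
    user_agent: "Mozilla/5.0 (X11; Linux x86_64) " \
                "AppleWebKit/537.36 (KHTML, like Gecko) " \
                "Chrome/131.0.0.0 Safari/537.36",
    platform: '"Linux"',
    chrome_version: "131",
  },
  {
    user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " \
                "AppleWebKit/537.36 (KHTML, like Gecko) " \
                "Chrome/130.0.0.0 Safari/537.36",
    platform: '"Windows"',
    chrome_version: "130",
  },
].freeze

# Build a header hash in which the client hints match the User-Agent.
# Illustrative only: the gem's real helper may send different headers.
def headers_for(profile)
  version = profile[:chrome_version]
  {
    "User-Agent" => profile[:user_agent],
    "Sec-CH-UA" => %("Chromium";v="#{version}", "Google Chrome";v="#{version}"),
    "Sec-CH-UA-Platform" => profile[:platform],
  }
end

# Pick a profile at random, as a per-request rotation would.
headers = headers_for(SAMPLE_PROFILES.sample)
```

Keeping the version and platform in the same record as the User-Agent string means a rotation cannot pair, say, a Windows User-Agent with a macOS platform hint.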

Instance Method Summary

Constructor Details

#initialize(browser_opts: {}) ⇒ Scraper

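Returns a new instance of Scraper. Any browser_opts are stored and passed through to Ferrum::Browser.new on each call to #fetch_page, where they override the built-in defaults.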



# File 'lib/iev/scraper.rb', line 57

def initialize(browser_opts: {})
  @browser_opts = browser_opts
end
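
Because #fetch_page splats the stored options after its own defaults, caller-supplied keys win. The sketch below reproduces that merge with plain hashes, so no browser is launched. DEFAULTS and merged_options are hypothetical names that mirror the literal options in #fetch_page, and the "no-sandbox" entry is just an example value.

```ruby
# Mirrors the literal options passed to Ferrum::Browser.new in #fetch_page.
DEFAULTS = {
  headless: "new",
  timeout: 30,
  window_size: [1366, 768],
  browser_options: { "disable-blink-features" => "AutomationControlled" },
}.freeze

# Later keys in a hash literal replace earlier ones, so browser_opts wins.
def merged_options(browser_opts)
  { **DEFAULTS, **browser_opts }
end

opts = merged_options({ timeout: 60, browser_options: { "no-sandbox" => nil } })

opts[:timeout]         # caller value replaces the default of 30
opts[:window_size]     # untouched default
opts[:browser_options] # replaced wholesale: the merge is shallow
```

Note that the merge is shallow: passing a browser_options hash replaces the default one entirely, so a caller who still wants the "disable-blink-features" flag has to include it again.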

Instance Method Details

#fetch_concept(code) ⇒ Object

Fetch and parse concept data for an IEV code. Returns a hash with concept data, or nil if no page could be retrieved for the code.



# File 'lib/iev/scraper.rb', line 99

def fetch_concept(code)
  doc = fetch_page(code)
  return nil unless doc

  PageParser.new(doc, code).parse
end
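
Since #fetch_concept returns nil when the page could not be fetched, a caller may want to retry a few times before giving up. The following is a minimal sketch of one way to do that. fetch_with_retries and StubScraper are hypothetical and not part of the gem; the stub stands in for Iev::Scraper so the example runs without Chrome.

```ruby
# Hypothetical helper: retry a nil result a few times with a growing pause.
def fetch_with_retries(scraper, code, attempts: 3, base_delay: 1.0)
  attempts.times do |i|
    concept = scraper.fetch_concept(code)
    return concept if concept

    # Sleep only between attempts, not after the last one.
    sleep(base_delay * (i + 1)) if i < attempts - 1
  end
  nil
end

# Stand-in that returns nil twice and then a hash.
class StubScraper
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def fetch_concept(code)
    @calls += 1
    @calls < 3 ? nil : { code: code }
  end
end

stub = StubScraper.new
result = fetch_with_retries(stub, "103-01-02", base_delay: 0)
```

With a real scraper, each retry launches a fresh browser, so a modest attempt count and a non-zero delay are probably sensible.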

#fetch_page(code) ⇒ Object

Fetch the Electropedia page HTML for a given IEV code. Returns a Nokogiri document, or nil if AWS WAF blocked the request or the browser raised a Ferrum error.



# File 'lib/iev/scraper.rb', line 63

def fetch_page(code)
  require "ferrum"
  require "nokogiri"

  url = "#{BASE_URL}#{code}"
  browser = Ferrum::Browser.new(
    headless: "new",
    timeout: 30,
    window_size: [1366, 768],
    browser_options: {
      "disable-blink-features" => "AutomationControlled",
    },
    **@browser_opts,
  )

  browser.headers.set(random_headers)
  browser.go_to(url)
  browser.network.wait_for_idle(timeout: 15)
  html = browser.body

  # Check if we got a real page or a WAF block
  if html.include?("403 ERROR") || html.include?("Request blocked")
    warn "IEV Scraper: AWS WAF blocked request for #{code}"
    return nil
  end

  Nokogiri::HTML(html)
rescue Ferrum::Error, Ferrum::BrowserError => e
  warn "IEV Scraper error for #{code}: #{e.message}"
  nil
ensure
  browser&.quit
end
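
The WAF check in #fetch_page is a plain substring test on the rendered HTML. Pulled out into a predicate, it can be exercised without a browser. This is only a sketch: waf_blocked? and WAF_MARKERS are hypothetical names, and the two marker strings are the ones used in the method above.

```ruby
# Marker strings taken from the check in #fetch_page.
WAF_MARKERS = ["403 ERROR", "Request blocked"].freeze

# True when the HTML contains any of the known block-page markers.
def waf_blocked?(html)
  WAF_MARKERS.any? { |marker| html.include?(marker) }
end

blocked = waf_blocked?("<html><h1>403 ERROR</h1></html>")
allowed = waf_blocked?("<html><body>IEV ref 103-01-02</body></html>")
```

A substring test like this can misfire if a legitimate page happens to contain one of the markers, so it is best treated as a heuristic.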