Class: Iev::Scraper
- Inherits: Object
  - Object
  - Iev::Scraper
- Defined in: lib/iev/scraper.rb, lib/iev/scraper/page_parser.rb
Overview
Scrapes IEV term data from Electropedia (electropedia.org).
Electropedia sits behind AWS WAF, which requires JavaScript execution, so a headless browser (Chrome, driven by Ferrum) is used to pass the challenge.
Defined Under Namespace
Classes: PageParser
Constant Summary
- BASE_URL =
    "https://www.electropedia.org/iev/iev.nsf/" \
      "display?openform&ievref="
- USER_AGENT_PROFILES =
Pool of realistic Chrome User-Agent strings with matching platform hints. Rotated per request to reduce fingerprinting by AWS WAF.
    [
      {
        user_agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " \
                    "AppleWebKit/537.36 (KHTML, like Gecko) " \
                    "Chrome/131.0.0.0 Safari/537.36",
        platform: '"macOS"',
        chrome_version: "131",
      },
      {
        user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " \
                    "AppleWebKit/537.36 (KHTML, like Gecko) " \
                    "Chrome/130.0.0.0 Safari/537.36",
        platform: '"Windows"',
        chrome_version: "130",
      },
      {
        user_agent: "Mozilla/5.0 (X11; Linux x86_64) " \
                    "AppleWebKit/537.36 (KHTML, like Gecko) " \
                    "Chrome/131.0.0.0 Safari/537.36",
        platform: '"Linux"',
        chrome_version: "131",
      },
      {
        user_agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " \
                    "AppleWebKit/537.36 (KHTML, like Gecko) " \
                    "Chrome/129.0.0.0 Safari/537.36",
        platform: '"macOS"',
        chrome_version: "129",
      },
      {
        user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " \
                    "AppleWebKit/537.36 (KHTML, like Gecko) " \
                    "Chrome/131.0.0.0 Safari/537.36",
        platform: '"Windows"',
        chrome_version: "131",
      },
    ].freeze
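Each profile's `platform` value is already wrapped in double quotes, which is the form Chrome sends in the `Sec-CH-UA-Platform` client-hint header. The `random_headers` helper that `#fetch_page` calls is not shown on this page, so the following is only a sketch of how one profile could be turned into a header hash. The method name `headers_for`, the simplified `Sec-CH-UA` value, and the one-entry `PROFILES` sample are assumptions, not the gem's code.

```ruby
# Hypothetical sample with the same shape as USER_AGENT_PROFILES.
PROFILES = [
  {
    user_agent: "Mozilla/5.0 (X11; Linux x86_64) " \
                "AppleWebKit/537.36 (KHTML, like Gecko) " \
                "Chrome/131.0.0.0 Safari/537.36",
    platform: '"Linux"',
    chrome_version: "131",
  },
].freeze

# Build request headers from one profile (assumed helper, not part of the gem).
# The Sec-CH-UA brand list is simplified; real Chrome sends additional brands.
def headers_for(profile)
  version = profile[:chrome_version]
  {
    "User-Agent" => profile[:user_agent],
    "Sec-CH-UA" => %("Chromium";v="#{version}", "Google Chrome";v="#{version}"),
    "Sec-CH-UA-Platform" => profile[:platform],
    "Sec-CH-UA-Mobile" => "?0",
  }
end

# Pick a profile at random, as the constant's description suggests.
headers = headers_for(PROFILES.sample)
```

Keeping the User-Agent string and the client hints in one record means the version and platform cannot drift apart between headers.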
Instance Method Summary
-
#fetch_concept(code) ⇒ Object
Fetch and parse concept data for an IEV code.
-
#fetch_page(code) ⇒ Object
Fetch the Electropedia page HTML for a given IEV code.
-
#initialize(browser_opts: {}) ⇒ Scraper
constructor
A new instance of Scraper.
Constructor Details
#initialize(browser_opts: {}) ⇒ Scraper
Returns a new instance of Scraper.
    # File 'lib/iev/scraper.rb', line 57
    def initialize(browser_opts: {})
      @browser_opts = browser_opts
    end
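`#fetch_page` passes `**@browser_opts` after its own keyword arguments, so any key supplied through `browser_opts` replaces the built-in value. A minimal sketch of that merge order in plain Ruby follows; `DEFAULTS` and `effective_options` are illustrative names, and the default values are copied from the `#fetch_page` source on this page.

```ruby
# Defaults as they appear in the #fetch_page source.
DEFAULTS = {
  headless: "new",
  timeout: 30,
  window_size: [1366, 768],
  browser_options: { "disable-blink-features" => "AutomationControlled" },
}.freeze

# Hypothetical helper: later keys win, mirroring
# `Ferrum::Browser.new(<defaults>, **@browser_opts)`.
def effective_options(browser_opts)
  { **DEFAULTS, **browser_opts }
end

opts = effective_options(timeout: 60, browser_options: { "no-sandbox" => nil })
opts[:timeout]          # the caller's 60 replaces the default 30
opts[:browser_options]  # replaced wholesale: the default flag is gone
```

The override is shallow. Passing a `browser_options` hash drops the default `disable-blink-features` flag unless the caller includes it again.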
Instance Method Details
#fetch_concept(code) ⇒ Object
Fetch and parse concept data for an IEV code. Returns a hash with concept data or nil if not found.
    # File 'lib/iev/scraper.rb', line 99
    def fetch_concept(code)
      doc = fetch_page(code)
      return nil unless doc

      PageParser.new(doc, code).parse
    end
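Because `#fetch_concept` returns `nil` rather than raising, callers that process many codes need to handle misses explicitly. The sketch below shows one way to do that. `collect_concepts` and `FakeScraper` are hypothetical, and the `{ code: ... }` hash is a placeholder, since the real shape of the parsed concept hash is not documented here.

```ruby
# Hypothetical helper: fetch several IEV codes and separate hits from misses.
# `scraper` is any object responding to #fetch_concept(code), e.g. Iev::Scraper.
def collect_concepts(scraper, codes)
  found = {}
  missing = []
  codes.each do |code|
    concept = scraper.fetch_concept(code)
    if concept.nil?
      missing << code
    else
      found[code] = concept
    end
  end
  [found, missing]
end

# Stand-in so the sketch runs without Chrome or network access.
class FakeScraper
  def fetch_concept(code)
    code == "103-01-02" ? { code: code } : nil
  end
end

found, missing = collect_concepts(FakeScraper.new, ["103-01-02", "999-99-99"])
```

With a real `Iev::Scraper.new` in place of `FakeScraper.new`, `missing` would collect the codes that were not found, were blocked by the WAF, or hit a browser error.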
#fetch_page(code) ⇒ Object
Fetch the Electropedia page HTML for a given IEV code. Returns a Nokogiri document, or nil if AWS WAF blocks the request or the browser raises an error.
    # File 'lib/iev/scraper.rb', line 63
    def fetch_page(code)
      require "ferrum"
      require "nokogiri"

      url = "#{BASE_URL}#{code}"
      browser = Ferrum::Browser.new(
        headless: "new",
        timeout: 30,
        window_size: [1366, 768],
        browser_options: {
          "disable-blink-features" => "AutomationControlled",
        },
        **@browser_opts,
      )

      browser.headers.set(random_headers)
      browser.go_to(url)
      browser.network.wait_for_idle(timeout: 15)
      html = browser.body

      # Check if we got a real page or a WAF block
      if html.include?("403 ERROR") || html.include?("Request blocked")
        warn "IEV Scraper: AWS WAF blocked request for #{code}"
        return nil
      end

      Nokogiri::HTML(html)
    rescue Ferrum::Error, Ferrum::BrowserError => e
      warn "IEV Scraper error for #{code}: #{e.message}"
      nil
    ensure
      browser&.quit
    end
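The block check in `#fetch_page` is a plain substring test, and a blocked request surfaces only as `nil`. The sketch below pulls the check into a predicate that can be tested against saved HTML, and adds a retry wrapper with exponential backoff. `WAF_MARKERS`, `waf_blocked?`, `fetch_with_retries`, and its parameters are assumed names for illustration. The gem itself does not retry.

```ruby
# Marker strings taken from the #fetch_page source.
WAF_MARKERS = ["403 ERROR", "Request blocked"].freeze

# True when the HTML looks like an AWS WAF block page.
def waf_blocked?(html)
  WAF_MARKERS.any? { |marker| html.include?(marker) }
end

# Hypothetical wrapper: call the block until it returns non-nil,
# sleeping base_delay * 2**attempt seconds between attempts.
def fetch_with_retries(attempts: 3, base_delay: 1.0)
  attempts.times do |i|
    result = yield
    return result unless result.nil?

    sleep(base_delay * (2**i)) if i < attempts - 1
  end
  nil
end

# Demonstration with a stub that succeeds on the second call.
calls = 0
doc = fetch_with_retries(attempts: 3, base_delay: 0) do
  calls += 1
  calls < 2 ? nil : "<html>ok</html>"
end
```

In real use the block would be something like `{ scraper.fetch_page(code) }`. Each attempt launches a fresh browser, so a small `attempts` value keeps the cost bounded.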