Module: Relaton::Gb::GbScraper
- Extended by: Scraper
- Defined in: lib/relaton/gb/gb_scraper.rb
Overview
National standard scraper.
Constant Summary
- SEARCH_URL = "https://openstd.samr.gov.cn/bzgk/gb/std_list"
- DOC_URL = "http://openstd.samr.gov.cn/bzgk/gb/newGbInfo?hcno="
Constants included from Scraper
Scraper::STAGES
Class Method Summary
Methods included from Scraper
create_org_name, get_contributors, get_docid, get_status, get_titles, scrapped_data
Class Method Details
.agent ⇒ Object
# File 'lib/relaton/gb/gb_scraper.rb', line 35
def agent
  @agent ||= Mechanize.new
end
.scrape_doc(hit) ⇒ RelatonGb::GbBibliographicItem
# File 'lib/relaton/gb/gb_scraper.rb', line 41
def scrape_doc(hit)
  src = DOC_URL + hit.pid
  doc = agent.get src
  ItemData.new(**scrapped_data(doc, src, hit))
rescue Mechanize::Error => e
  raise Relaton::RequestError, e.message
end
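
The pattern in scrape_doc (build the document URL from the hit's pid, fetch it, and re-raise transport failures as a request error) can be tried offline with the minimal sketch below. It is not the library's code: TransportError and RequestError are hypothetical stand-ins for Mechanize::Error and Relaton::RequestError, fetch_doc is an illustrative name, and the fetcher argument is an assumed injection point that replaces agent.get so no network access is needed.

```ruby
# Hypothetical stand-ins for Mechanize::Error and Relaton::RequestError.
class TransportError < StandardError; end
class RequestError < StandardError; end

# Same value as the DOC_URL constant documented above.
DOC_URL = "http://openstd.samr.gov.cn/bzgk/gb/newGbInfo?hcno="

# Build the document URL from a pid and fetch it.
# `fetcher` is any callable that takes a URL (assumption: it replaces agent.get).
# Transport-level failures are re-raised as RequestError with the same message.
def fetch_doc(pid, fetcher)
  src = DOC_URL + pid
  fetcher.call(src)
rescue TransportError => e
  raise RequestError, e.message
end

# A fetcher that succeeds and one that fails, both made up for illustration.
echo_fetcher    = ->(url) { url }
failing_fetcher = ->(_url) { raise TransportError, "connection refused" }

puts fetch_doc("ABC123", echo_fetcher)

begin
  fetch_doc("ABC123", failing_fetcher)
rescue RequestError => e
  puts "request failed: #{e.message}"
end
```

Callers then only need to rescue one error class, regardless of which HTTP client sits underneath.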
.scrape_page(text) ⇒ RelatonGb::HitCollection
# File 'lib/relaton/gb/gb_scraper.rb', line 18
def scrape_page(text)
  doc = agent.get("#{SEARCH_URL}?p.p2=#{CGI.escape(text)}")
  hits = doc.xpath(
    "//table[contains(@class, 'result_list')]/tbody[2]/tr",
  ).map do |h|
    ref = h.at "./td[2]/a"
    pid = ref[:onclick].match(/[0-9A-F]+/).to_s
    status = h.at("./td[7]").text.strip
    rdate = h.at("./td[8]").text.strip
    Hit.new pid: pid, docref: ref.text, scraper: self,
            release_date: rdate, status: status
  end
  HitCollection.new hits.sort_by(&:release_date).reverse
rescue Mechanize::Error => e
  raise Relaton::RequestError, e.message
end
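
Two steps of scrape_page can be illustrated without a live page: extracting the pid from the onclick attribute with /[0-9A-F]+/, and ordering hits newest first. The sketch below is not the library's code. SampleHit is a hypothetical stand-in for Hit, extract_pid and build_hits are illustrative names, and every row value (including the showInfo('...') shape of onclick and the ISO-style date strings) is made up for illustration.

```ruby
# Hypothetical stand-in for the Hit class; only the fields used here.
SampleHit = Struct.new(:pid, :docref, :release_date, :status, keyword_init: true)

# Return the first run of digits and uppercase A-F in the onclick value,
# or an empty string when there is none (nil.to_s is "").
def extract_pid(onclick)
  onclick.match(/[0-9A-F]+/).to_s
end

# Made-up rows; the onclick shape and date format are assumptions.
SAMPLE_ROWS = [
  { onclick: "showInfo('0A1B2C3D4E5F');", docref: "SAMPLE 1-2001", rdate: "2001-05-01", status: "current" },
  { onclick: "showInfo('FFEE00112233');", docref: "SAMPLE 2-2020", rdate: "2020-03-06", status: "current" },
  { onclick: "showInfo('99AA88BB77CC');", docref: "SAMPLE 3-2011", rdate: "2011-12-30", status: "withdrawn" },
].freeze

# Map rows to hits, then sort by release date, newest first.
# String comparison matches date order only for zero-padded YYYY-MM-DD values.
def build_hits(rows)
  hits = rows.map do |row|
    SampleHit.new(pid: extract_pid(row[:onclick]), docref: row[:docref],
                  release_date: row[:rdate], status: row[:status])
  end
  hits.sort_by(&:release_date).reverse
end

build_hits(SAMPLE_ROWS).each do |hit|
  puts "#{hit.release_date} #{hit.pid} #{hit.docref} (#{hit.status})"
end
```

Because the pattern only accepts uppercase hex, an onclick value with a lowercase identifier would yield a partial or empty pid. That limitation applies to the regular expression in the listing above as well.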