Class: FAIRChampionHarvester::Core
- Inherits: Object
- Defined in: lib/harvester.rb
Constant Summary
- @@distillerknown = {}
  Global hash keyed by the SHA-256 of message bodies: records whether each body has been seen before (true/false).
Class Method Summary
- .convertToURL(guid) ⇒ Object
- .deep_dive_properties(myHash, property = nil, props = []) ⇒ Array<Array>
  Recursively collects **every key-value pair** from a nested Hash structure as [key, value] arrays.
- .deep_dive_values(myHash, value = nil, vals = []) ⇒ Array
  Recursively collects **all non-Hash values** (leaf values) from a nested Hash structure.
- .fetch(guid:, headers: FAIRChampionHarvester::Utils::AcceptHeader, meta: nil) ⇒ Object
  We will try to retrieve Turtle whenever possible.
- .figure_out_type(head) ⇒ Object
- .head(url, headers = FAIRChampionHarvester::Utils::AcceptHeader) ⇒ Object
  Returns the response headers after following all redirects, etc.
- .parse_html(meta, body) ⇒ Object
- .parse_json(meta, body) ⇒ Object
- .parse_link_body_headers(url, body) ⇒ Object
- .parse_link_http_headers(headers) ⇒ Object
- .parse_rdf(meta, body, format = nil) ⇒ Object
- .parse_text(meta, body) ⇒ Object
- .parse_xml(meta, body) ⇒ Object
- .resolve(url, headers = FAIRChampionHarvester::Utils::AcceptHeader) ⇒ Object
  Returns the URI that results from all redirects, etc.
- .resolveit(guid) ⇒ Object
- .simplefetch(url, headers = FAIRChampionHarvester::Utils::AcceptHeader, _meta = nil) ⇒ Object
  We will try to retrieve Turtle whenever possible.
- .typeit(guid) ⇒ Object
Class Method Details
.convertToURL(guid) ⇒ Object
# File 'lib/harvester.rb', line 50

```ruby
def self.convertToURL(guid)
  FAIRChampionHarvester::Utils::GUID_TYPES.each do |pair|
    k, regex = pair
    if k == "inchi" and regex.match(guid)
      return "inchi", "https://pubchem.ncbi.nlm.nih.gov/rest/rdf/inchikey/#{guid}"
    elsif k == "handle1" and regex.match(guid)
      return "handle", "http://hdl.handle.net/#{guid}"
    elsif k == "handle2" and regex.match(guid)
      return "handle", "http://hdl.handle.net/#{guid}"
    elsif k == "uri" and regex.match(guid)
      return "uri", guid
    elsif k == "doi" and regex.match(guid)
      return "doi", "https://doi.org/#{guid}"
    elsif k == "ark_url" and regex.match(guid)
      return "ark_url", guid
    elsif k == "ark" and regex.match(guid)
      return "ark", "https://n2t.net/#{guid}"
    end
  end
  [nil, nil]
end
```
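The dispatch above can be sketched as a standalone snippet. The `GUID_PATTERNS` table, its regexes, and the helper name `convert_to_url` below are simplified, hypothetical stand-ins for `FAIRChampionHarvester::Utils::GUID_TYPES`, used only to make the example runnable without the gem:

```ruby
# Simplified stand-in for Utils::GUID_TYPES (hypothetical patterns, not the gem's).
GUID_PATTERNS = {
  "doi" => %r{\A10\.\d{4,9}/\S+\z},
  "uri" => %r{\Ahttps?://}
}.freeze

# Same shape as convertToURL: returns [type, resolvable_url] or [nil, nil].
def convert_to_url(guid)
  GUID_PATTERNS.each do |kind, regex|
    next unless regex.match(guid)
    return ["doi", "https://doi.org/#{guid}"] if kind == "doi"
    return ["uri", guid] if kind == "uri"
  end
  [nil, nil]
end

convert_to_url("10.5281/zenodo.1234") # => ["doi", "https://doi.org/10.5281/zenodo.1234"]
convert_to_url("not-a-guid")          # => [nil, nil]
```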
.deep_dive_properties(myHash, property = nil, props = []) ⇒ Array<Array>
Recursively collects **every key-value pair** from a nested Hash structure as [key, value] arrays.
Traverses the entire nested hash in depth-first order and records every key-value pair encountered — including pairs where the value is itself a Hash.
Note: The `property` parameter is currently **not used** (dead code). Both branches of the conditional do the same thing, so every pair is collected regardless of `property`.
# File 'lib/harvester.rb', line 354

```ruby
def self.deep_dive_properties(myHash, property = nil, props = [])
  return props unless myHash.is_a?(Hash)

  myHash.each_pair do |key, value|
    # The conditional is redundant; both branches are identical.
    # This is very likely a bug or unfinished implementation.
    props << if property && property == key
               [key, value]
             else
               [key, value]
             end
    if value.is_a?(Hash)
      # $stderr.puts "key: #{key} recursing..." # uncomment for debugging
      deep_dive_properties(value, property, props)
    end
  end
  props
end
```
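A gem-independent sketch of the same depth-first traversal (the unused `property` parameter is dropped here), showing that Hash-valued pairs are recorded and then descended into:

```ruby
# Minimal standalone version of the deep_dive_properties traversal.
def deep_dive_properties(my_hash, props = [])
  return props unless my_hash.is_a?(Hash)

  my_hash.each_pair do |key, value|
    props << [key, value]                                   # every pair is recorded...
    deep_dive_properties(value, props) if value.is_a?(Hash) # ...including Hash values, which are then descended into
  end
  props
end

nested = { "name" => "x", "meta" => { "id" => 1 } }
deep_dive_properties(nested)
# => [["name", "x"], ["meta", {"id"=>1}], ["id", 1]]
```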
.deep_dive_values(myHash, value = nil, vals = []) ⇒ Array
Recursively collects **all non-Hash values** (leaf values) from a nested Hash structure.
Traverses the hash in depth-first order and gathers every value that is not itself a Hash into a flat array. Keys are completely ignored.
# File 'lib/harvester.rb', line 309

```ruby
def self.deep_dive_values(myHash, value = nil, vals = [])
  myHash.each_pair do |_key, value|
    if value.is_a?(Hash)
      # $stderr.puts "key: #{_key} recursing..." # uncomment for debugging
      deep_dive_values(value, value, vals)
    else
      vals << value
    end
  end
  vals
end
```
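The leaf-collection behavior can be seen with a gem-independent sketch. Note that Arrays count as leaves, since only Hashes are descended into:

```ruby
# Minimal standalone version of the deep_dive_values traversal.
def deep_dive_values(my_hash, vals = [])
  my_hash.each_pair do |_key, value|
    if value.is_a?(Hash)
      deep_dive_values(value, vals)
    else
      vals << value # leaf: anything that is not a Hash (Arrays included)
    end
  end
  vals
end

nested = { a: 1, b: { c: 2, d: { e: 3 } }, f: [4, 5] }
deep_dive_values(nested) # => [1, 2, 3, [4, 5]]
```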
.fetch(guid:, headers: FAIRChampionHarvester::Utils::AcceptHeader, meta: nil) ⇒ Object
we will try to retrieve turtle whenever possible
# File 'lib/harvester.rb', line 403

```ruby
def self.fetch(guid:, headers: FAIRChampionHarvester::Utils::AcceptHeader, meta: nil)
  # we will try to retrieve turtle whenever possible
  head, body, finalURI = FAIRChampionHarvester::Cache.checkCache(guid, headers)
  return false if head and head == "ERROR"

  meta.finalURI |= [finalURI] if meta && finalURI
  warn meta.finalURI.inspect if meta
  if head and body
    warn "Retrieved from cache, returning data to code"
    return [head, body]
  end

  warn "In fetch routine now. "
  begin
    warn "executing call over the Web to #{guid}"
    response = HTTP.headers(headers).follow.get(guid.to_s)
    if response.status.success?
      final_url = response.uri.to_s
      meta.finalURI |= [final_url] if meta
      warn "There was a response to the call #{guid}"
      FAIRChampionHarvester::Cache.writeToCache(guid, headers, response.headers, response.body.to_s, response.uri.to_s)
      [response.headers, response.body.to_s] # return headers and body
    else
      # Handle HTTP error status codes (4xx, 5xx, etc.)
      warn "HTTP Error #{response.status} for #{guid}"
      warn "Final URL: #{response.uri}" if response.uri
      FAIRChampionHarvester::Cache.writeErrorToCache(guid, headers)
      meta.comments << "WARN: HTTP error #{response.status} encountered when trying to resolve #{guid}\n" if meta
      false
    end
  rescue HTTP::Error => e
    # This catches network errors, timeouts, connection failures, DNS errors, etc.
    warn "HTTP Request Failed for #{guid}: #{e.message}"
    FAIRChampionHarvester::Cache.writeErrorToCache(guid, headers)
    meta.comments << "WARN: HTTP error #{e.message} encountered when trying to resolve #{guid}\n" if meta
    false
  rescue StandardError => e
    # Catch any other unexpected errors
    warn "Unexpected error while fetching #{guid}: #{e.class} - #{e.message}"
    warn e.backtrace.first(5).join("\n") if ENV["DEBUG"]
    FAIRChampionHarvester::Cache.writeErrorToCache(guid, headers)
    meta.comments << "WARN: HTTP error #{e.message} encountered when trying to resolve #{guid}\n" if meta
    false
  end
  # (an earlier RestClient-based implementation and its rescue blocks remain
  # commented out in the source file)
end
```
.figure_out_type(head) ⇒ Object
# File 'lib/harvester.rb', line 375

```ruby
def self.figure_out_type(head)
  type = head[:content_type]
  if type.nil?
    warn "\n\nSTRANGE - headers had no content-type\n\n"
    return nil, nil
  end

  type.match(%r{([\w+.]+/[\w+.]+):?;?}im)
  type = ::Regexp.last_match(1)
  # $stderr.puts "\n\nsearching for #{type}\n\n"
  FAIRChampionHarvester::Utils::RDF_FORMATS.each do |parser, types|
    return parser, type if types.include? type
  end
  FAIRChampionHarvester::Utils::JSON_FORMATS.each do |parser, types|
    return parser, type if types.include? type
  end
  FAIRChampionHarvester::Utils::TEXT_FORMATS.each do |parser, types|
    return parser, type if types.include? type
  end
  FAIRChampionHarvester::Utils::XML_FORMATS.each do |parser, types|
    return parser, type if types.include? type
  end
  FAIRChampionHarvester::Utils::HTML_FORMATS.each do |parser, types|
    return parser, type if types.include? type
  end
  [nil, nil]
end
```
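The content-type extraction step can be exercised on its own. The helper name `bare_media_type` is invented for illustration, but the regex is the one used above; it keeps only the bare media type and drops charset and other parameters:

```ruby
# The same media-type regex used by figure_out_type.
TYPE_RE = %r{([\w+.]+/[\w+.]+):?;?}im

def bare_media_type(content_type)
  content_type.match(TYPE_RE)
  ::Regexp.last_match(1)
end

bare_media_type("text/turtle; charset=utf-8")    # => "text/turtle"
bare_media_type("application/ld+json;profile=x") # => "application/ld+json"
```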
.head(url, headers = FAIRChampionHarvester::Utils::AcceptHeader) ⇒ Object
this returns the response headers after following all redirects, etc.
# File 'lib/harvester.rb', line 510

```ruby
def self.head(url, headers = FAIRChampionHarvester::Utils::AcceptHeader)
  response = RestClient::Request.execute({
                                           method: :head,
                                           url: url.to_s,
                                           headers: headers
                                         })
  response.headers
rescue RestClient::ExceptionWithResponse => e
  warn e.response
  false # we return false and check for it with an "if" in the calling code
rescue RestClient::Exception => e
  warn e.response
  false
rescue Exception => e
  warn e
  false
end
```
.parse_html(meta, body) ⇒ Object
# File 'lib/harvester.rb', line 104

```ruby
def self.parse_html(meta, body)
  # just use extruct and distiller instead
end
```
.parse_json(meta, body) ⇒ Object
# File 'lib/harvester.rb', line 93

```ruby
def self.parse_json(meta, body)
  hash = JSON.parse(body)
  meta.hash.merge!(hash)
  meta.hash
end
```
.parse_link_body_headers(url, body) ⇒ Object
# File 'lib/harvester.rb', line 225

```ruby
def self.parse_link_body_headers(url, body)
  # Parse the HTML body (Nokogiri is tolerant of malformed HTML)
  doc = Nokogiri::HTML(body)

  # Focus on <link> tags inside <head> that carry rel="alternate" AND a type attribute
  # (the equivalent of MetaInspector's head_links)
  link_nodes = doc.css('head link[rel="alternate"][type]')

  # Flatten the allowed media-type lists once, for efficiency;
  # uniq avoids duplicates if any lists overlap
  allowed_types = [
    FAIRChampionHarvester::Utils::RDF_FORMATS.values,
    FAIRChampionHarvester::Utils::XML_FORMATS.values,
    FAIRChampionHarvester::Utils::JSON_FORMATS.values
  ].flatten.uniq

  # Filter and extract hrefs
  urls = link_nodes.filter_map do |link|
    type = link["type"]&.strip
    next unless type && allowed_types.include?(type)

    href = link["href"]&.strip
    href if href && !href.empty?
  end

  # Make relative URLs absolute (MetaInspector usually does this)
  base_uri = begin
    URI.parse(url)
  rescue StandardError
    nil
  end
  if base_uri
    urls.map! do |href|
      URI.join(base_uri, href).to_s
    rescue StandardError
      href
    end
  end

  warn "\n\nGOT BODY LINKS #{urls}\n\n"
  urls
end
```
.parse_link_http_headers(headers) ⇒ Object
# File 'lib/harvester.rb', line 193

```ruby
def self.parse_link_http_headers(headers)
  # we can be sure that a Link header is a URL
  # code adapted from https://gist.github.com/thesowah/0ca5e1b4b3c61bfe8e13 with a few tweaks
  links = headers[:link]
  return [] unless links

  parts = links.split(",")
  urls = []
  # Parse each part into a named link
  parts.each do |part|
    section = part.split(";")
    next unless section[0]

    url = section[0][/<(.*)>/, 1]
    next unless section[1]

    type = ""
    section[1..].each do |s|
      type = s[/rel="?(\w+)"?/, 1]
      break if type
    end
    next unless type
    # "meta" link relations are emitted by old versions of Virtuoso LDP;
    # they are not in the link-relations standard
    next unless %w[meta alternate].include?(type.downcase)

    urls << url
  end
  urls
end
```
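A gem-free sketch of the same Link-header logic (the helper name `alternate_links` is invented for illustration; the `<...>` and `rel=` regexes are the ones above). Note that splitting naively on `,` would break if a target URL itself contained a comma:

```ruby
# Extract URLs from a raw Link header, keeping only rel="alternate"/"meta" links.
def alternate_links(link_header)
  urls = []
  link_header.split(",").each do |part|
    section = part.split(";")
    url = section[0] && section[0][/<(.*)>/, 1]
    next unless url && section[1]

    rel = section[1..].filter_map { |s| s[/rel="?(\w+)"?/, 1] }.first
    next unless rel && %w[meta alternate].include?(rel.downcase)

    urls << url
  end
  urls
end

header = '<http://example.org/meta.ttl>; rel="alternate"; type="text/turtle", ' \
         '<http://example.org/page2>; rel="next"'
alternate_links(header) # => ["http://example.org/meta.ttl"]
```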
.parse_rdf(meta, body, format = nil) ⇒ Object
# File 'lib/harvester.rb', line 108

```ruby
def self.parse_rdf(meta, body, format = nil)
  unless body
    meta.comments << "CRITICAL: The response message body component appears to have no content.\n"
    return
  end
  unless body.match(/\w/)
    meta.comments << "CRITICAL: The response message body component appears to have no content.\n"
    return
  end

  warn "\n\n\nSANITY CHECK \n\n#{body[0..300]}\n\n"
  graph = FAIRChampionHarvester::Cache.checkRDFCache(body)
  if graph.size > 0
    warn "\n\n\n unmarshalling graph from cache\n\n"
    warn "\n\ngraph size #{graph.size} #{graph.inspect}\n\n"
    meta.merge_rdf(graph.to_a)
    return
  end

  formattype = nil
  warn "\n\n\ndeclared format #{format}\n\n"
  if format.nil?
    formattype = RDF::Format.for({ sample: body[0..3000] })
    warn "\n\n\ndetected format #{formattype}\n\n"
  else
    warn "\n\n\ntesting declared format #{format}\n\n"
    formattype = RDF::Format.for(content_type: format)
    warn "\n\n\nfound format #{formattype}\n\n"
  end
  warn "\n\n\nfinal format #{formattype}\n\n"

  unless formattype
    meta.comments << "CRITICAL: Unable to find an RDF reader type that matches the content that was returned from resolution. Here is a sample #{body[0..100]} Please send your GUID to the dev team so we can investigate!\n"
    return
  end

  meta.comments << "INFO: The response message body component appears to contain #{formattype}.\n"

  reader = ""
  begin
    reader = formattype.reader.new(body)
  rescue StandardError
    meta.comments << "WARN: Though linked data was found, it failed to parse. This likely indicates some syntax error in the data. As a result, no metadata will be extracted from this message.\n"
    return
  end

  begin
    if reader.size == 0
      meta.comments << "WARN: Though linked data was found, it failed to parse. This likely indicates some syntax error in the data. As a result, no metadata will be extracted from this message.\n"
      return
    end

    # the reader's rewind method isn't working here, so re-read from scratch;
    # this is safe because parse errors were already caught above
    reader = formattype.reader.new(body)
    warn "WRITING TO CACHE"
    FAIRChampionHarvester::Cache.writeRDFCache(reader, body) # write to the special RDF graph cache
    warn "WRITING DONE"
    reader = formattype.reader.new(body)
    warn "RE-READING DONE"
    meta.merge_rdf(reader.to_a)
    warn "MERGE DONE"
  rescue RDF::ReaderError => e
    meta.comments << "CRITICAL: The Linked Data was malformed and caused the parser to crash with error message: #{e.message} || (sample of what was parsed: #{body[0..300].delete("\n")})\n"
    warn "CRITICAL: The Linked Data was malformed and caused the parser to crash with error message: #{e.message} || (sample of what was parsed: #{body[0..300].delete("\n")})\n"
    nil
  rescue Exception => e
    meta.comments << "CRITICAL: An unknown error occurred while parsing the (apparent) Linked Data (sample of what was parsed: #{body[0..300].delete("\n")}). Moving on...\n"
    warn "\n\nCRITICAL: #{e.inspect} An unknown error occurred while parsing the (apparent) Linked Data (full body: #{body}). Moving on...\n\n"
    nil
  end
end
```
.parse_text(meta, body) ⇒ Object
# File 'lib/harvester.rb', line 86

```ruby
def self.parse_text(meta, body)
  meta.comments << "WARN: Plain Text cannot be mapped to any parser. No structured metadata found.\n"
  meta.comments << "INFO: Using Apache Tika to attempt to extract metadata from plaintext.\n"
  FAIRChampionHarvester::Tika.do_tika(meta, body)
end
```
.parse_xml(meta, body) ⇒ Object
# File 'lib/harvester.rb', line 186

```ruby
def self.parse_xml(meta, body)
  hash = XmlSimple.xml_in(body)
  meta.comments << "INFO: The XML is being converted into a simple hash structure.\n"
  meta.hash.merge hash # NOTE: merge (unlike merge!) returns a new Hash and does not update meta.hash
  meta.hash
end
```
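Note that `parse_json` uses `merge!` while `parse_xml` uses plain `merge`. The difference is observable in core Ruby: `Hash#merge` returns a new Hash and leaves the receiver untouched, while `Hash#merge!` updates it in place:

```ruby
h = { "a" => 1 }
h.merge({ "b" => 2 }) # returns {"a"=>1, "b"=>2}, but...
h                     # => {"a"=>1} (receiver unchanged)

h.merge!({ "b" => 2 })
h                     # => {"a"=>1, "b"=>2} (updated in place)
```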
.resolve(url, headers = FAIRChampionHarvester::Utils::AcceptHeader) ⇒ Object
this returns the URI that results from all redirects, etc.
# File 'lib/harvester.rb', line 535

```ruby
def self.resolve(url, headers = FAIRChampionHarvester::Utils::AcceptHeader)
  response = RestClient::Request.execute({
                                           method: :head,
                                           url: url.to_s,
                                           headers: headers
                                         })
  response.request.url
rescue RestClient::ExceptionWithResponse => e
  warn e.response
  false # we return false and check for it with an "if" in the calling code
rescue RestClient::Exception => e
  warn e.response
  false
rescue Exception => e
  warn e
  false
end
```
.resolveit(guid) ⇒ Object
# File 'lib/harvester.rb', line 19

```ruby
def self.resolveit(guid)
  # if meta = FAIRChampionHarvester::Utils::retrieveMetaObject(guid)
  #   return meta
  # end
  meta = FAIRChampionHarvester::MetadataObject.new
  FAIRChampionHarvester::Utils::GUID_TYPES.each do |pair| # meta object gets updated in each case
    k, regex = pair
    if k == "inchi" and regex.match(guid)
      FAIRChampionHarvester::INCHI.resolve_inchi(guid, meta)
    elsif k == "handle1" and regex.match(guid)
      FAIRChampionHarvester::Handle.resolve_handle(guid, meta)
    elsif k == "handle2" and regex.match(guid)
      FAIRChampionHarvester::Handle.resolve_handle(guid, meta)
    elsif k == "uri" and regex.match(guid)
      FAIRChampionHarvester::Uri.resolve_uri(guid, meta)
    elsif k == "doi" and regex.match(guid)
      FAIRChampionHarvester::DOI.resolve_doi(guid, meta)
    end
  end

  if meta.comments.empty? # didn't match any of the types, so no comments were added
    meta.guidtype = "unknown"
    meta.comments << "CRITICAL: The guid '#{guid}' did not correspond to any known GUID format. Tested #{FAIRChampionHarvester::Utils::GUID_TYPES.keys}. Halting.\n"
  end
  meta.comments << "INFO: END OF HARVESTING\n"
  # FAIRChampionHarvester::Utils::cacheMetaObject(meta, guid)
  meta
end
```
.simplefetch(url, headers = FAIRChampionHarvester::Utils::AcceptHeader, _meta = nil) ⇒ Object
we will try to retrieve turtle whenever possible
# File 'lib/harvester.rb', line 479

```ruby
def self.simplefetch(url, headers = FAIRChampionHarvester::Utils::AcceptHeader, _meta = nil)
  # we will try to retrieve turtle whenever possible
  # head = FAIRChampionHarvester::Utils::head(url, headers)
  # $stderr.puts "content length " + head[:content_length].to_s
  # if head[:content_length] and head[:content_length].to_f > 300000 and meta
  #   meta.comments << "WARN: The size of the content at #{url} reports itself to be >300kb. This service will not download something so large. This does not mean that the content is not FAIR, only that this service will not test it. Sorry!\n"
  #   return false
  # end
  response = HTTP.headers(headers).follow.get(url.to_s)
  if response.status.success?
    [response.headers, response.body.to_s] # return headers and body
  else
    # Handle HTTP error status codes (4xx, 5xx, etc.)
    warn "HTTP Error #{response.status} for #{url}"
    warn "Final URL: #{response.uri}" if response.uri
    false
  end
rescue HTTP::Error => e
  # This catches network errors, timeouts, connection failures, DNS errors, etc.
  warn "HTTP Request Failed for #{url}: #{e.message}"
  false
rescue StandardError => e
  # Catch any other unexpected errors
  warn "Unexpected error while fetching #{url}: #{e.class} - #{e.message}"
  false
end
```
.typeit(guid) ⇒ Object
# File 'lib/harvester.rb', line 72

```ruby
def self.typeit(guid)
  FAIRChampionHarvester::Utils::GUID_TYPES.each do |pair|
    type, regex = pair
    return type if regex.match(guid)
  end
  false
end
```
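In miniature: the first matching pattern wins, and `false` is returned when nothing matches. The patterns below are simplified, hypothetical stand-ins for `Utils::GUID_TYPES`, used only so the sketch runs without the gem:

```ruby
# Simplified stand-in for Utils::GUID_TYPES.
PATTERNS = { "doi" => %r{\A10\.\d{4,9}/}, "uri" => %r{\Ahttps?://} }.freeze

def type_of(guid)
  PATTERNS.each { |type, regex| return type if regex.match(guid) }
  false
end

type_of("10.1000/xyz")  # => "doi"
type_of("gopher://old") # => false
```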