Class: SearchSolrTools::Harvesters::Base
- Inherits: Object
- Includes: SSTLogger
- Defined in: lib/search_solr_tools/harvesters/base.rb
Overview
Base class for Solr harvesters.
Direct Known Subclasses
Constant Summary
- DELETE_DOCUMENTS_RATIO = 0.1
- XML_CONTENT_TYPE = 'text/xml; charset=utf-8'
- JSON_CONTENT_TYPE = 'application/json; charset=utf-8'
Constants included from SSTLogger
Instance Attribute Summary
- #environment ⇒ Object
  Returns the value of attribute environment.
Instance Method Summary
- #create_new_solr_add_doc ⇒ Object
  Returns a Nokogiri XML document with content '<?xml version="1.0"?><add/>'.
- #create_new_solr_add_doc_with_child(child) ⇒ Object
  Returns a Nokogiri XML document with content '<?xml version="1.0"?><add> <child /> </add>'.
- #delete_old_documents(timestamp, constraints, solr_core, force: false) ⇒ Object
- #doc_valid?(doc) ⇒ Boolean
  Make sure that Solr is able to accept this doc in a POST.
- #encode_data_provider_url(url) ⇒ Object
  Some data providers require encoding (such as URI.encode), while others barf on encoding.
- #get_results(request_url, metadata_path, content_type = 'application/xml') ⇒ Object
  Get results from an endpoint specified in the request_url.
- #get_serialized_doc(doc, content_type) ⇒ Object
- #harvest_and_delete(harvest_method, delete_constraints, solr_core = SolrEnvironments[@environment][:collection_name]) ⇒ Object
- #initialize(env = 'development', die_on_failure: false) ⇒ Base
  Constructor. Returns a new instance of Base.
- #insert_solr_doc(doc, content_type = XML_CONTENT_TYPE, core = SolrEnvironments[@environment][:collection_name]) ⇒ Object
  TODO: Need to return a specific type of failure: bad record content identified and no ingest attempted; Solr tries to ingest document and fails (bad content not detected prior to ingest); Solr cannot insert document for reasons other than the document structure and content.
- #insert_solr_docs(docs, content_type = XML_CONTENT_TYPE, core = SolrEnvironments[@environment][:collection_name]) ⇒ Object
  Update Solr with an array of Nokogiri XML documents and report the number of successfully added documents.
- #ping_solr(core = SolrEnvironments[@environment][:collection_name]) ⇒ Object
  Ping the Solr instance to ensure that it's running.
- #ping_source ⇒ Object
  This should be overridden by child classes to implement the ability to "ping" the data center.
- #remove_documents(solr, delete_query, constraints, force, numfound) ⇒ Object
- #sanitize_data_centers_constraints(query_string) ⇒ Object
- #solr_url ⇒ Object
- #valid_solr_spatial_coverage?(spatial_coverages) ⇒ Boolean
  spatial_coverages is an array with length 4: [North, East, South, West].
Methods included from SSTLogger
Constructor Details
#initialize(env = 'development', die_on_failure: false) ⇒ Base
Returns a new instance of Base.
# File 'lib/search_solr_tools/harvesters/base.rb', line 27
def initialize(env = 'development', die_on_failure: false)
  @environment = env
  @die_on_failure = die_on_failure
end
Instance Attribute Details
#environment ⇒ Object
Returns the value of attribute environment.
# File 'lib/search_solr_tools/harvesters/base.rb', line 21
def environment
  @environment
end
Instance Method Details
#create_new_solr_add_doc ⇒ Object
Returns a Nokogiri XML document with content '<?xml version="1.0"?><add/>'.
# File 'lib/search_solr_tools/harvesters/base.rb', line 214
def create_new_solr_add_doc
  doc = Nokogiri::XML::Document.new
  doc.root = Nokogiri::XML::Node.new('add', doc)
  doc
end
#create_new_solr_add_doc_with_child(child) ⇒ Object
Returns a Nokogiri XML document with content '<?xml version="1.0"?><add> <child /> </add>'.
# File 'lib/search_solr_tools/harvesters/base.rb', line 222
def create_new_solr_add_doc_with_child(child)
  doc = create_new_solr_add_doc
  doc.root.add_child(child)
  doc
end
#delete_old_documents(timestamp, constraints, solr_core, force: false) ⇒ Object
# File 'lib/search_solr_tools/harvesters/base.rb', line 81
def delete_old_documents(timestamp, constraints, solr_core, force: false)
  constraints = sanitize_data_centers_constraints(constraints)
  delete_query = "last_update:[* TO #{timestamp}] AND #{constraints}"
  full_solr_url = "#{solr_url}/#{solr_core}"

  faraday_connection = Faraday.new(url: full_solr_url, ssl: { verify: false }) do |conn|
    conn.request :url_encoded
    conn.adapter Faraday.default_adapter
  end

  solr = RSolr.connect(faraday_connection, url: full_solr_url)
  unchanged_count = (solr.get 'select', params: { wt: :ruby, q: delete_query, rows: 0 })['response']['numFound'].to_i
  if unchanged_count.zero?
    logger.info "All documents were updated after #{timestamp}, nothing to delete"
  else
    logger.info "Begin removing documents older than #{timestamp}"
    remove_documents(solr, delete_query, constraints, force, unchanged_count)
  end
end
#doc_valid?(doc) ⇒ Boolean
Make sure that Solr is able to accept this doc in a POST.
# File 'lib/search_solr_tools/harvesters/base.rb', line 229
def doc_valid?(doc)
  spatial_coverages = doc.xpath(".//field[@name='spatial_coverages']").first
  return true if spatial_coverages.nil?

  spatial_coverages = spatial_coverages.text.split

  # We've only seen the failure with 4 spatial coverage values
  return true if spatial_coverages.size < 4

  valid_solr_spatial_coverage?(spatial_coverages)
end
#encode_data_provider_url(url) ⇒ Object
Some data providers require encoding (such as URI.encode), while others barf on encoding. The default is to just return url; override this in the subclass if special encoding is needed.
# File 'lib/search_solr_tools/harvesters/base.rb', line 41
def encode_data_provider_url(url)
  url
end
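The override pattern described above might look roughly like the following. This is a minimal sketch, not code from the gem: `StandInBase` stands in for `SearchSolrTools::Harvesters::Base` so the snippet runs on its own, and `ExampleEncodingHarvester` and the example.org URL are hypothetical.

```ruby
# Stand-in for SearchSolrTools::Harvesters::Base (assumption, so the sketch
# runs without the gem). The default implementation returns the URL untouched.
class StandInBase
  def encode_data_provider_url(url)
    url
  end
end

# Hypothetical subclass for a provider that rejects raw spaces in URLs.
class ExampleEncodingHarvester < StandInBase
  def encode_data_provider_url(url)
    # Percent-encode spaces only; everything else is passed through as-is.
    url.gsub(' ', '%20')
  end
end

raw_url = 'https://example.org/search?q=sea ice'
puts StandInBase.new.encode_data_provider_url(raw_url)
puts ExampleEncodingHarvester.new.encode_data_provider_url(raw_url)
```

Because get_results calls encode_data_provider_url before opening the request, a subclass only needs to override this one method to change how its request URLs are encoded.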
#get_results(request_url, metadata_path, content_type = 'application/xml') ⇒ Object
Get results from an endpoint specified in the request_url.
# File 'lib/search_solr_tools/harvesters/base.rb', line 186
def get_results(request_url, metadata_path, content_type = 'application/xml')
  timeout = 300
  retries_left = 3

  request_url = encode_data_provider_url(request_url)

  begin
    logger.debug "Request: #{request_url}"
    response = URI.parse(request_url).open(read_timeout: timeout, 'Content-Type' => content_type)
  rescue OpenURI::HTTPError, Timeout::Error, Errno::ETIMEDOUT => e
    retries_left -= 1
    logger.error "## REQUEST FAILED ## #{e.class} ## Retrying #{retries_left} more times..."

    retry if retries_left.positive?

    # TODO: Do we really need this "die_on_failure" anymore? The empty return
    # will cause the "No Documents" error to be thrown in the harvester class
    # now, so it will pretty much always "die on failure"
    raise e if @die_on_failure

    return
  end
  doc = Nokogiri.XML(response)
  doc.xpath(metadata_path, Helpers::IsoNamespaces.namespaces(doc))
end
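The retry logic in get_results can be isolated as a small sketch. `fetch_with_retries` is a hypothetical helper, not part of the gem, and it rescues only the timeout errors (the original also rescues OpenURI::HTTPError). It is meant to show that `retries_left = 3` yields three attempts in total before the method gives up and returns nil.

```ruby
require 'timeout'

# Hypothetical helper mirroring the begin/rescue/retry shape of get_results.
def fetch_with_retries(retries_left: 3)
  begin
    yield
  rescue Timeout::Error, Errno::ETIMEDOUT
    retries_left -= 1
    # `retry` re-runs the begin block from the top.
    retry if retries_left.positive?

    # Out of retries: fall through with nil, like the bare `return` above.
    nil
  end
end

attempts = 0
result = fetch_with_retries do
  attempts += 1
  raise Timeout::Error, 'simulated timeout'
end

puts "attempts: #{attempts}"
puts "result: #{result.inspect}"
```

In the real method the nil result is returned only when die_on_failure is false; otherwise the last exception is re-raised.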
#get_serialized_doc(doc, content_type) ⇒ Object
# File 'lib/search_solr_tools/harvesters/base.rb', line 175
def get_serialized_doc(doc, content_type)
  if content_type.eql?(XML_CONTENT_TYPE)
    doc.respond_to?(:to_xml) ? doc.to_xml : doc
  elsif content_type.eql?(JSON_CONTENT_TYPE)
    MultiJson.dump(doc)
  else
    doc
  end
end
#harvest_and_delete(harvest_method, delete_constraints, solr_core = SolrEnvironments[@environment][:collection_name]) ⇒ Object
# File 'lib/search_solr_tools/harvesters/base.rb', line 72
def harvest_and_delete(harvest_method, delete_constraints, solr_core = SolrEnvironments[@environment][:collection_name])
  start_time = Time.now.utc.iso8601

  harvest_status = harvest_method.call
  delete_old_documents start_time, delete_constraints, solr_core

  harvest_status
end
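As a rough illustration of the call order, the sketch below replaces Solr with a stub. `StandInHarvester`, `harvest_example_into_solr`, and the constraint string are all hypothetical, and `delete_old_documents` only records its arguments here. The point is that the start timestamp is captured before the harvest runs, and the harvest's own status is what gets returned.

```ruby
require 'time'

# Stand-in harvester (assumption): same control flow as harvest_and_delete,
# but delete_old_documents is stubbed so nothing talks to Solr.
class StandInHarvester
  attr_reader :delete_call

  def harvest_and_delete(harvest_method, delete_constraints)
    start_time = Time.now.utc.iso8601

    harvest_status = harvest_method.call
    delete_old_documents(start_time, delete_constraints)

    harvest_status
  end

  # Stub: record what would have been deleted.
  def delete_old_documents(timestamp, constraints)
    @delete_call = { timestamp: timestamp, constraints: constraints }
  end

  # Hypothetical harvest step; a real subclass would fetch and insert documents.
  def harvest_example_into_solr
    :harvest_ok
  end
end

harvester = StandInHarvester.new
status = harvester.harvest_and_delete(
  harvester.method(:harvest_example_into_solr),
  'data_centers:"Example Data Center"'
)
puts status.inspect
puts harvester.delete_call[:constraints]
```

Anything that responds to `call` works as harvest_method, so a lambda could be passed instead of a Method object.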
#insert_solr_doc(doc, content_type = XML_CONTENT_TYPE, core = SolrEnvironments[@environment][:collection_name]) ⇒ Object
TODO: Need to return a specific type of failure:
- Bad record content identified and no ingest attempted
- Solr tries to ingest document and fails (bad content not detected prior to ingest)
- Solr cannot insert document for reasons other than the document structure and content.
# File 'lib/search_solr_tools/harvesters/base.rb', line 145
def insert_solr_doc(doc, content_type = XML_CONTENT_TYPE, core = SolrEnvironments[@environment][:collection_name])
  url = solr_url + "/#{core}/update?commit=true"
  status = Helpers::HarvestStatus::INGEST_OK

  # Some of the docs will cause Solr to crash - CPU goes to 195% with `top` and it
  # doesn't seem to recover.
  return Helpers::HarvestStatus::INGEST_ERR_INVALID_DOC if content_type == XML_CONTENT_TYPE && !doc_valid?(doc)

  doc_serialized = get_serialized_doc(doc, content_type)

  # Some docs will cause solr to time out during the POST
  begin
    RestClient::Request.execute(
      method: :post,
      url: url,
      payload: doc_serialized,
      headers: { content_type: },
      verify_ssl: OpenSSL::SSL::VERIFY_NONE
    ) do |response, _request, _result|
      success = (200..299).include?(response.code)
      unless success
        logger.error "Error for #{doc_serialized}\n\n response: #{response.body}"
        status = Helpers::HarvestStatus::INGEST_ERR_SOLR_ERROR
      end
    end
  rescue StandardError => e
    # TODO: Need to provide more detail re: this failure so we know whether to
    # exit the job with a status != 0
    logger.error "Rest exception while POSTing to Solr: #{e}, for doc: #{doc_serialized}"
    status = Helpers::HarvestStatus::INGEST_ERR_SOLR_ERROR
  end
  status
end
#insert_solr_docs(docs, content_type = XML_CONTENT_TYPE, core = SolrEnvironments[@environment][:collection_name]) ⇒ Object
Update Solr with an array of Nokogiri XML documents and report the number of successfully added documents.
# File 'lib/search_solr_tools/harvesters/base.rb', line 124
def insert_solr_docs(docs, content_type = XML_CONTENT_TYPE, core = SolrEnvironments[@environment][:collection_name])
  success = 0
  failure = 0

  status = Helpers::HarvestStatus.new

  docs.each do |doc|
    doc_status = insert_solr_doc(doc, content_type, core)
    status.record_status doc_status
    doc_status == Helpers::HarvestStatus::INGEST_OK ? success += 1 : failure += 1
  end
  # Pluralize "document" unless the count is exactly one
  logger.info "#{success} document#{'s' unless success == 1} successfully added to Solr."
  logger.info "#{failure} document#{'s' unless failure == 1} not added to Solr."

  status
end
#ping_solr(core = SolrEnvironments[@environment][:collection_name]) ⇒ Object
Ping the Solr instance to ensure that it's running. The ping query is specified to manually check the title, as it's possible there is no "default" query in the Solr instance.
# File 'lib/search_solr_tools/harvesters/base.rb', line 48
def ping_solr(core = SolrEnvironments[@environment][:collection_name])
  url = solr_url + "/#{core}/admin/ping?df=title"
  success = false

  # The ping request may fail or time out if Solr is unavailable
  begin
    RestClient::Request.execute(method: :get, url: url, verify_ssl: OpenSSL::SSL::VERIFY_NONE) do |response, _request, _result|
      success = (200..299).include?(response.code)
      logger.error "Error in ping request: #{response.body}" unless success
    end
  rescue StandardError => e
    logger.error "Rest exception while pinging Solr at #{url}: #{e}"
  end
  success
end
#ping_source ⇒ Object
This should be overridden by child classes to implement the ability to "ping" the data center. Returns true if the ping is successful (or, as in this default, no ping method was defined).
# File 'lib/search_solr_tools/harvesters/base.rb', line 67
def ping_source
  logger.info 'Harvester does not have ping method defined, assuming true'
  true
end
#remove_documents(solr, delete_query, constraints, force, numfound) ⇒ Object
# File 'lib/search_solr_tools/harvesters/base.rb', line 110
def remove_documents(solr, delete_query, constraints, force, numfound)
  all_response_count = (solr.get 'select', params: { wt: :ruby, q: constraints, rows: 0 })['response']['numFound']
  if force || (numfound / all_response_count.to_f < DELETE_DOCUMENTS_RATIO)
    logger.info "Deleting #{numfound} documents for #{constraints}"
    solr.delete_by_query delete_query
    solr.commit
  else
    logger.info "Failed to delete records older than current harvest start because they exceeded #{DELETE_DOCUMENTS_RATIO} of the total records for this data center."
    logger.info "\tTotal records: #{all_response_count}"
    logger.info "\tNon-updated records: #{numfound}"
  end
end
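The safety check in remove_documents is a single ratio comparison, which the following sketch pulls out for illustration. `deletion_allowed?` is a hypothetical helper, not a method of the class; the constant value is taken from the Constant Summary.

```ruby
DELETE_DOCUMENTS_RATIO = 0.1

# Hypothetical helper isolating the condition used by remove_documents:
# delete only when forced, or when the stale documents make up strictly
# less than DELETE_DOCUMENTS_RATIO of all documents matching the constraints.
def deletion_allowed?(numfound, all_response_count, force: false)
  force || (numfound / all_response_count.to_f < DELETE_DOCUMENTS_RATIO)
end

puts deletion_allowed?(5, 100)               # 5% stale
puts deletion_allowed?(10, 100)              # exactly 10% stale
puts deletion_allowed?(50, 100)              # 50% stale
puts deletion_allowed?(50, 100, force: true) # 50% stale, forced
```

The comparison is strict, so a stale share of exactly 10% is not deleted unless force is set. The guard appears intended to stop a partial or failed harvest from wiping out most of a data center's records.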
#sanitize_data_centers_constraints(query_string) ⇒ Object
# File 'lib/search_solr_tools/harvesters/base.rb', line 102
def sanitize_data_centers_constraints(query_string)
  # Remove lucene special characters, preserve the query parameter and compress whitespace
  query_string = query_string.gsub(/[:&|!~\-(){}\[\]^*?+]+/, ' ')
  query_string = query_string.gsub('data_centers ', 'data_centers:')
  query_string = query_string.gsub('source ', 'source:')
  query_string.squeeze(' ').strip
end
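To see what the sanitizer does to a constraint string, the method body can be copied into a standalone script, as in the sketch below. The constraint strings are made-up examples, not values from the gem.

```ruby
# Standalone copy of the method body shown above, so it runs without the gem.
def sanitize_data_centers_constraints(query_string)
  # Replace each run of Lucene special characters with a single space
  query_string = query_string.gsub(/[:&|!~\-(){}\[\]^*?+]+/, ' ')
  # Restore the colon for the two field names the harvesters rely on
  query_string = query_string.gsub('data_centers ', 'data_centers:')
  query_string = query_string.gsub('source ', 'source:')
  # Collapse repeated spaces and trim the ends
  query_string.squeeze(' ').strip
end

# A well-formed constraint passes through unchanged.
puts sanitize_data_centers_constraints('data_centers:"Example Data Center" AND source:"EXAMPLE"')
# => data_centers:"Example Data Center" AND source:"EXAMPLE"

# A match-everything clause is stripped, leaving only the operator behind.
puts sanitize_data_centers_constraints('data_centers:"Example Data Center" OR *:*')
# => data_centers:"Example Data Center" OR
```

Hyphens are in the stripped character set too, so a hyphen inside a quoted value would be replaced by a space.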
#solr_url ⇒ Object
# File 'lib/search_solr_tools/harvesters/base.rb', line 32
def solr_url
  env = SolrEnvironments[@environment]
  "https://#{env[:host]}/#{env[:collection_path]}"
end
#valid_solr_spatial_coverage?(spatial_coverages) ⇒ Boolean
spatial_coverages is an array with length 4:
- North, East, South, West
# File 'lib/search_solr_tools/harvesters/base.rb', line 243
def valid_solr_spatial_coverage?(spatial_coverages)
  north, east, south, west = spatial_coverages

  polar_point = (north == south) && (north.to_f.abs == 90)

  (east == west) || !polar_point
end
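A few sample inputs make the rule easier to read. The sketch below copies the method body so it runs standalone. The coverage values are invented, and they are strings because doc_valid? produces them by splitting the field text.

```ruby
# Standalone copy of the method body shown above.
def valid_solr_spatial_coverage?(spatial_coverages)
  north, east, south, west = spatial_coverages

  # A "polar point" has zero latitude extent and sits exactly on a pole.
  polar_point = (north == south) && (north.to_f.abs == 90)

  # Such a coverage is only accepted when it also has zero longitude extent.
  (east == west) || !polar_point
end

# Order is [North, East, South, West]
puts valid_solr_spatial_coverage?(%w[90 180 90 -180]) # pole, spans longitudes
puts valid_solr_spatial_coverage?(%w[90 0 90 0])      # pole, single point
puts valid_solr_spatial_coverage?(%w[80 180 60 -180]) # ordinary bounding box
```

The first case is the only one rejected: a line along a pole that spans a range of longitudes. Note that north and south are compared as strings, so '90' and '90.0' would not count as equal.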