Class: SearchSolrTools::Harvesters::Base

Inherits: Object

Includes: SSTLogger

Defined in: lib/search_solr_tools/harvesters/base.rb

Overview

Base class for Solr harvesters.

Direct Known Subclasses

AutoSuggest, NsidcJson

Constant Summary

DELETE_DOCUMENTS_RATIO = 0.1

XML_CONTENT_TYPE = 'text/xml; charset=utf-8'

JSON_CONTENT_TYPE = 'application/json; charset=utf-8'

Constants included from SSTLogger

SSTLogger::LOG_LEVELS

Instance Attribute Summary

#environment ⇒ Object

Returns the value of attribute environment.

Instance Method Summary

#create_new_solr_add_doc ⇒ Object

Returns a Nokogiri XML document with content '<?xml version="1.0"?><add/>'.
#create_new_solr_add_doc_with_child(child) ⇒ Object

Returns a Nokogiri XML document with content '<?xml version="1.0"?><add> <child /> </add>'.
#delete_old_documents(timestamp, constraints, solr_core, force: false) ⇒ Object
#doc_valid?(doc) ⇒ Boolean

Make sure that Solr is able to accept this doc in a POST.
#encode_data_provider_url(url) ⇒ Object

Some data providers require encoding (such as URI.encode), while others barf on encoding.
#get_results(request_url, metadata_path, content_type = 'application/xml') ⇒ Object

Get results from an end point specified in the request_url.
#get_serialized_doc(doc, content_type) ⇒ Object
#harvest_and_delete(harvest_method, delete_constraints, solr_core = SolrEnvironments[@environment][:collection_name]) ⇒ Object
#initialize(env = 'development', die_on_failure: false) ⇒ Base constructor

A new instance of Base.
#insert_solr_doc(doc, content_type = XML_CONTENT_TYPE, core = SolrEnvironments[@environment][:collection_name]) ⇒ Object

TODO: Need to return a specific type of failure: bad record content identified and no ingest attempted; Solr tries to ingest the document and fails (bad content not detected prior to ingest); or Solr cannot insert the document for reasons other than its structure and content.
#insert_solr_docs(docs, content_type = XML_CONTENT_TYPE, core = SolrEnvironments[@environment][:collection_name]) ⇒ Object

Update Solr with an array of Nokogiri xml documents, report number of successfully added documents.
#ping_solr(core = SolrEnvironments[@environment][:collection_name]) ⇒ Object

Ping the Solr instance to ensure that it’s running.
#ping_source ⇒ Object

This should be overridden by child classes to implement the ability to “ping” the data center.
#remove_documents(solr, delete_query, constraints, force, numfound) ⇒ Object
#sanitize_data_centers_constraints(query_string) ⇒ Object
#solr_url ⇒ Object
#valid_solr_spatial_coverage?(spatial_coverages) ⇒ Boolean

spatial_coverages is an array with length 4: [North, East, South, West].

Methods included from SSTLogger

#logger, logger

Constructor Details

#initialize(env = 'development', die_on_failure: false) ⇒ `Base`

Returns a new instance of Base.

# File 'lib/search_solr_tools/harvesters/base.rb', line 27

def initialize(env = 'development', die_on_failure: false)
  @environment = env
  @die_on_failure = die_on_failure
end

Instance Attribute Details

#environment ⇒ `Object`

Returns the value of attribute environment.




# File 'lib/search_solr_tools/harvesters/base.rb', line 21

def environment
  @environment
end

Instance Method Details

#create_new_solr_add_doc ⇒ `Object`

Returns a Nokogiri XML document with content '<?xml version="1.0"?><add/>'

# File 'lib/search_solr_tools/harvesters/base.rb', line 214

def create_new_solr_add_doc
  doc = Nokogiri::XML::Document.new
  doc.root = Nokogiri::XML::Node.new('add', doc)
  doc
end

#create_new_solr_add_doc_with_child(child) ⇒ `Object`

Returns a Nokogiri XML document with content '<?xml version="1.0"?><add> <child /> </add>'

# File 'lib/search_solr_tools/harvesters/base.rb', line 222

def create_new_solr_add_doc_with_child(child)
  doc = create_new_solr_add_doc
  doc.root.add_child(child)
  doc
end

#delete_old_documents(timestamp, constraints, solr_core, force: false) ⇒ `Object`

# File 'lib/search_solr_tools/harvesters/base.rb', line 81

def delete_old_documents(timestamp, constraints, solr_core, force: false)
  constraints = sanitize_data_centers_constraints(constraints)
  delete_query = "last_update:[* TO #{timestamp}] AND #{constraints}"
  full_solr_url = "#{solr_url}/#{solr_core}"

  faraday_connection = Faraday.new(url: full_solr_url, ssl: { verify: false }) do |conn|
    conn.request :url_encoded
    conn.adapter Faraday.default_adapter
  end

  solr = RSolr.connect(faraday_connection, url: full_solr_url)

  unchanged_count = (solr.get 'select', params: { wt: :ruby, q: delete_query, rows: 0 })['response']['numFound'].to_i
  if unchanged_count.zero?
    logger.info "All documents were updated after #{timestamp}, nothing to delete"
  else
    logger.info "Begin removing documents older than #{timestamp}"
    remove_documents(solr, delete_query, constraints, force, unchanged_count)
  end
end

#doc_valid?(doc) ⇒ `Boolean`

Make sure that Solr is able to accept this doc in a POST

Returns:

(Boolean)

# File 'lib/search_solr_tools/harvesters/base.rb', line 229

def doc_valid?(doc)
  spatial_coverages = doc.xpath(".//field[@name='spatial_coverages']").first
  return true if spatial_coverages.nil?

  spatial_coverages = spatial_coverages.text.split

  # We've only seen the failure with 4 spatial coverage values
  return true if spatial_coverages.size < 4

  valid_solr_spatial_coverage?(spatial_coverages)
end

#encode_data_provider_url(url) ⇒ `Object`

Some data providers require encoding (such as URI.encode), while others barf on encoding. The default is to just return url, override this in the subclass if special encoding is needed.




# File 'lib/search_solr_tools/harvesters/base.rb', line 41

def encode_data_provider_url(url)
  url
end
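
The override hook described above can be sketched as follows. This is a minimal illustration only: `BaseSketch` and `SpaceEncodingHarvester` are hypothetical stand-ins rather than classes from the gem, and percent-encoding spaces is just one example of the "special encoding" a provider might need.

```ruby
# Stand-in for the real Base class; only the hook under discussion is shown.
class BaseSketch
  # Default behaviour: hand the URL back untouched.
  def encode_data_provider_url(url)
    url
  end
end

# Hypothetical subclass for a provider that rejects raw spaces in URLs.
class SpaceEncodingHarvester < BaseSketch
  def encode_data_provider_url(url)
    url.gsub(' ', '%20')
  end
end

url = 'https://example.org/api?q=sea ice'
puts BaseSketch.new.encode_data_provider_url(url)
puts SpaceEncodingHarvester.new.encode_data_provider_url(url)
```

Since get_results passes every request URL through this method, a subclass that overrides it changes the encoding for all of its requests.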

#get_results(request_url, metadata_path, content_type = 'application/xml') ⇒ `Object`

Get results from an end point specified in the request_url

# File 'lib/search_solr_tools/harvesters/base.rb', line 186

def get_results(request_url, metadata_path, content_type = 'application/xml')
  timeout = 300
  retries_left = 3

  request_url = encode_data_provider_url(request_url)

  begin
    logger.debug "Request: #{request_url}"
    response = URI.parse(request_url).open(read_timeout: timeout, 'Content-Type' => content_type)
  rescue OpenURI::HTTPError, Timeout::Error, Errno::ETIMEDOUT => e
    retries_left -= 1
    logger.error "## REQUEST FAILED ## #{e.class} ## Retrying #{retries_left} more times..."

    retry if retries_left.positive?

    # TODO: Do we really need this "die_on_failure" anymore?  The empty return
    #  will cause the "No Documents" error to be thrown in the harvester class
    #  now, so it will pretty much always "die on failure"
    raise e if @die_on_failure

    return
  end
  doc = Nokogiri.XML(response)
  doc.xpath(metadata_path, Helpers::IsoNamespaces.namespaces(doc))
end
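
The retry handling in get_results can be tried out in isolation. The sketch below assumes a hypothetical helper, `fetch_with_retries`, that reproduces only the counter logic; the exception classes are stand-ins for the HTTP and timeout errors rescued in the real method.

```ruby
require 'timeout'

# Hypothetical helper mirroring the counter-based retry in get_results.
# Returns the block's value, or nil once every attempt has failed
# (unless die_on_failure is set, in which case the last error is re-raised).
def fetch_with_retries(attempts: 3, die_on_failure: false)
  retries_left = attempts
  begin
    yield
  rescue IOError, Timeout::Error => e
    retries_left -= 1
    retry if retries_left.positive?
    raise e if die_on_failure

    nil
  end
end

calls = 0
result = fetch_with_retries do
  calls += 1
  raise IOError, 'flaky endpoint' if calls < 3

  'payload'
end
puts "#{result} after #{calls} calls"
```

With a starting count of 3, the request is attempted at most three times in total, because the counter is decremented before the retry check.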

#get_serialized_doc(doc, content_type) ⇒ `Object`

# File 'lib/search_solr_tools/harvesters/base.rb', line 175

def get_serialized_doc(doc, content_type)
  if content_type.eql?(XML_CONTENT_TYPE)
    doc.respond_to?(:to_xml) ? doc.to_xml : doc
  elsif content_type.eql?(JSON_CONTENT_TYPE)
    MultiJson.dump(doc)
  else
    doc
  end
end
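
The content-type dispatch above can be exercised without Nokogiri or MultiJson. In this sketch the stdlib `JSON` module is swapped in for MultiJson, and `FakeXmlDoc` and `serialize_sketch` are hypothetical names introduced only for illustration.

```ruby
require 'json'

XML_CONTENT_TYPE = 'text/xml; charset=utf-8'
JSON_CONTENT_TYPE = 'application/json; charset=utf-8'

# Same branching as get_serialized_doc, with the stdlib JSON module
# swapped in for MultiJson so the sketch has no gem dependencies.
def serialize_sketch(doc, content_type)
  case content_type
  when XML_CONTENT_TYPE then doc.respond_to?(:to_xml) ? doc.to_xml : doc
  when JSON_CONTENT_TYPE then JSON.generate(doc)
  else doc
  end
end

# Hypothetical stand-in for a Nokogiri document: anything with #to_xml.
FakeXmlDoc = Struct.new(:body) do
  def to_xml
    "<add>#{body}</add>"
  end
end

puts serialize_sketch(FakeXmlDoc.new('<doc/>'), XML_CONTENT_TYPE)
puts serialize_sketch({ 'title' => 'Sea Ice' }, JSON_CONTENT_TYPE)
puts serialize_sketch('already serialized', 'text/plain')
```

An XML payload that is already a string passes through untouched, which is what the `respond_to?(:to_xml)` check allows for.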

#harvest_and_delete(harvest_method, delete_constraints, solr_core = SolrEnvironments[@environment][:collection_name]) ⇒ `Object`

# File 'lib/search_solr_tools/harvesters/base.rb', line 72

def harvest_and_delete(harvest_method, delete_constraints, solr_core = SolrEnvironments[@environment][:collection_name])
  start_time = Time.now.utc.iso8601

  harvest_status = harvest_method.call
  delete_old_documents start_time, delete_constraints, solr_core

  harvest_status
end
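
The ordering in harvest_and_delete (timestamp first, then harvest, then delete) can be seen in a stripped-down sketch. `FlowSketch`, its `harvest` method, and the `'sketch_core'` default are hypothetical; the Solr delete is replaced by recording the arguments it would receive.

```ruby
require 'time'

# Stand-in that copies only the control flow of harvest_and_delete;
# the Solr delete is replaced by recording what would have been requested.
class FlowSketch
  attr_reader :events

  def initialize
    @events = []
  end

  def harvest_and_delete(harvest_method, delete_constraints, solr_core = 'sketch_core')
    start_time = Time.now.utc.iso8601 # captured before any document is touched

    harvest_status = harvest_method.call
    delete_old_documents(start_time, delete_constraints, solr_core)

    harvest_status
  end

  def delete_old_documents(timestamp, constraints, solr_core)
    @events << [:delete, timestamp, constraints, solr_core]
  end

  # Hypothetical harvest step; a real harvester would fetch and insert documents.
  def harvest
    @events << [:harvest]
    :harvest_ok
  end
end

sketch = FlowSketch.new
status = sketch.harvest_and_delete(sketch.method(:harvest), 'data_centers:"Example Center"')
puts status
puts sketch.events.map(&:first).inspect
```

Any object that responds to `call` works as `harvest_method`, so a lambda can be passed instead of a `Method` object. Assuming `last_update` reflects ingest time, documents re-inserted during the harvest fall outside the `[* TO start_time]` range used by the delete query.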

#insert_solr_doc(doc, content_type = XML_CONTENT_TYPE, core = SolrEnvironments[@environment][:collection_name]) ⇒ `Object`

TODO: Need to return a specific type of failure:

- Bad record content identified and no ingest attempted
- Solr tries to ingest document and fails (bad content not detected prior to ingest)
- Solr cannot insert document for reasons other than the document structure and content.

# File 'lib/search_solr_tools/harvesters/base.rb', line 145

def insert_solr_doc(doc, content_type = XML_CONTENT_TYPE, core = SolrEnvironments[@environment][:collection_name])
  url = solr_url + "/#{core}/update?commit=true"
  status = Helpers::HarvestStatus::INGEST_OK

  # Some of the docs will cause Solr to crash - CPU goes to 195% with `top` and it
  # doesn't seem to recover.
  return Helpers::HarvestStatus::INGEST_ERR_INVALID_DOC if content_type == XML_CONTENT_TYPE && !doc_valid?(doc)

  doc_serialized = get_serialized_doc(doc, content_type)

  # Some docs will cause solr to time out during the POST
  begin
    RestClient::Request.execute(
      method: :post, url: url, payload: doc_serialized, headers: { content_type: }, verify_ssl: OpenSSL::SSL::VERIFY_NONE
    ) do |response, _request, _result|
      success = (200..299).include?(response.code)
      unless success
        logger.error "Error for #{doc_serialized}\n\n response: #{response.body}"
        status = Helpers::HarvestStatus::INGEST_ERR_SOLR_ERROR
      end
    end
  rescue StandardError => e
    # TODO: Need to provide more detail re: this failure so we know whether to
    #  exit the job with a status != 0
    logger.error "Rest exception while POSTing to Solr: #{e}, for doc: #{doc_serialized}"
    status = Helpers::HarvestStatus::INGEST_ERR_SOLR_ERROR
  end
  status
end

#insert_solr_docs(docs, content_type = XML_CONTENT_TYPE, core = SolrEnvironments[@environment][:collection_name]) ⇒ `Object`

Update Solr with an array of Nokogiri xml documents, report number of successfully added documents

# File 'lib/search_solr_tools/harvesters/base.rb', line 124

def insert_solr_docs(docs, content_type = XML_CONTENT_TYPE, core = SolrEnvironments[@environment][:collection_name])
  success = 0
  failure = 0

  status = Helpers::HarvestStatus.new

  docs.each do |doc|
    doc_status = insert_solr_doc(doc, content_type, core)
    status.record_status doc_status
    doc_status == Helpers::HarvestStatus::INGEST_OK ? success += 1 : failure += 1
  end
  logger.info "#{success} document#{'s' unless success == 1} successfully added to Solr."
  logger.info "#{failure} document#{'s' unless failure == 1} not added to Solr."

  status
end
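
As an illustration of the per-document tally, the sketch below builds the two summary lines from a list of statuses, pluralising "document" whenever the count is not 1. The symbols are stand-ins for the constants on `Helpers::HarvestStatus`, and `summarize` is a hypothetical helper.

```ruby
# Stand-in status codes; the gem uses constants on Helpers::HarvestStatus.
INGEST_OK = :ingest_ok
INGEST_ERR_INVALID_DOC = :invalid_doc
INGEST_ERR_SOLR_ERROR = :solr_error

# Builds the two summary lines from a list of per-document statuses.
def summarize(doc_statuses)
  success = doc_statuses.count(INGEST_OK)
  failure = doc_statuses.size - success
  [
    "#{success} document#{'s' unless success == 1} successfully added to Solr.",
    "#{failure} document#{'s' unless failure == 1} not added to Solr."
  ]
end

puts summarize([INGEST_OK, INGEST_ERR_SOLR_ERROR, INGEST_OK, INGEST_ERR_INVALID_DOC])
```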

#ping_solr(core = SolrEnvironments[@environment][:collection_name]) ⇒ `Object`

Ping the Solr instance to ensure that it's running. The ping request sets the default field to title explicitly (df=title), as it's possible there is no "default" query in the Solr instance.

# File 'lib/search_solr_tools/harvesters/base.rb', line 48

def ping_solr(core = SolrEnvironments[@environment][:collection_name])
  url = solr_url + "/#{core}/admin/ping?df=title"
  success = false

  # Solr may be unreachable or time out; any exception is logged and treated as a failed ping
  begin
    RestClient::Request.execute(method: :get, url: url, verify_ssl: OpenSSL::SSL::VERIFY_NONE) do |response, _request, _result|
      success = (200..299).include?(response.code)
      logger.error "Error in ping request: #{response.body}" unless success
    end
  rescue StandardError => e
    logger.error "Rest exception while pinging Solr at #{url}: #{e}"
  end
  success
end

#ping_source ⇒ `Object`

This should be overridden by child classes to implement the ability to “ping” the data center. Returns true if the ping is successful (or, as in this default, no ping method was defined)

# File 'lib/search_solr_tools/harvesters/base.rb', line 67

def ping_source
  logger.info 'Harvester does not have ping method defined, assuming true'
  true
end

#remove_documents(solr, delete_query, constraints, force, numfound) ⇒ `Object`

# File 'lib/search_solr_tools/harvesters/base.rb', line 110

def remove_documents(solr, delete_query, constraints, force, numfound)
  all_response_count = (solr.get 'select', params: { wt: :ruby, q: constraints, rows: 0 })['response']['numFound']
  if force || (numfound / all_response_count.to_f < DELETE_DOCUMENTS_RATIO)
    logger.info "Deleting #{numfound} documents for #{constraints}"
    solr.delete_by_query delete_query
    solr.commit
  else
    logger.info "Failed to delete records older than current harvest start because they exceeded #{DELETE_DOCUMENTS_RATIO} of the total records for this data center."
    logger.info "\tTotal records: #{all_response_count}"
    logger.info "\tNon-updated records: #{numfound}"
  end
end
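
The guard in remove_documents is a simple ratio test. The sketch below isolates it in a hypothetical predicate, `deletion_allowed?`, to show where the threshold falls; the counts are made-up numbers.

```ruby
DELETE_DOCUMENTS_RATIO = 0.1

# Hypothetical predicate isolating the guard used in remove_documents.
def deletion_allowed?(stale_count, total_count, force: false)
  force || (stale_count / total_count.to_f < DELETE_DOCUMENTS_RATIO)
end

puts deletion_allowed?(5, 100)               # 0.05 is below the ratio
puts deletion_allowed?(10, 100)              # 0.1 is not strictly below the ratio
puts deletion_allowed?(60, 100, force: true) # force bypasses the ratio check
```

The comparison is strict, so a harvest that leaves exactly 10% of a data center's records untouched is not cleaned up unless `force` is set.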

#sanitize_data_centers_constraints(query_string) ⇒ `Object`

# File 'lib/search_solr_tools/harvesters/base.rb', line 102

def sanitize_data_centers_constraints(query_string)
  # Remove lucene special characters, preserve the query parameter and compress whitespace
  query_string = query_string.gsub(/[:&|!~\-(){}\[\]^*?+]+/, ' ')
  query_string = query_string.gsub('data_centers ', 'data_centers:')
  query_string = query_string.gsub('source ', 'source:')
  query_string.squeeze(' ').strip
end
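
To see what the sanitizer does to a constraint string, the same steps can be run as a free-standing method. `sanitize_sketch` is a hypothetical name and the query strings are invented examples.

```ruby
# Copy of the sanitizing steps as a free-standing method for experimentation.
def sanitize_sketch(query_string)
  # Replace each run of Lucene special characters with a single space.
  query_string = query_string.gsub(/[:&|!~\-(){}\[\]^*?+]+/, ' ')
  # Restore the colon for the two field names the harvesters rely on.
  query_string = query_string.gsub('data_centers ', 'data_centers:')
  query_string = query_string.gsub('source ', 'source:')
  query_string.squeeze(' ').strip
end

puts sanitize_sketch('data_centers:"Example Center" AND (source:ADE)')
puts sanitize_sketch('data_centers:"Example Center" AND title:ice*')
```

In these examples the parentheses and the wildcard are stripped, and only the `data_centers` and `source` qualifiers get their colon back; `title:ice*` comes out as the bare terms `title ice`.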

#solr_url ⇒ `Object`

# File 'lib/search_solr_tools/harvesters/base.rb', line 32

def solr_url
  env = SolrEnvironments[@environment]
  "https://#{env[:host]}/#{env[:collection_path]}"
end
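
As a rough illustration of how the base URL combines with the paths used elsewhere in this class, the sketch below uses a hypothetical environment table. The host, collection path, and core name are invented values; the real ones come from `SolrEnvironments`.

```ruby
# Hypothetical environment table; the real values live in SolrEnvironments.
ENVIRONMENTS_SKETCH = {
  'development' => { host: 'localhost:8983', collection_path: 'solr', collection_name: 'example_core' }
}.freeze

def solr_url_sketch(environment)
  env = ENVIRONMENTS_SKETCH[environment]
  "https://#{env[:host]}/#{env[:collection_path]}"
end

base = solr_url_sketch('development')
core = ENVIRONMENTS_SKETCH['development'][:collection_name]
puts "#{base}/#{core}/update?commit=true"  # path used by insert_solr_doc
puts "#{base}/#{core}/admin/ping?df=title" # path used by ping_solr
```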

#valid_solr_spatial_coverage?(spatial_coverages) ⇒ `Boolean`

spatial_coverages is an array with length 4: [North, East, South, West]

Returns:

(Boolean)

# File 'lib/search_solr_tools/harvesters/base.rb', line 243

def valid_solr_spatial_coverage?(spatial_coverages)
  north, east, south, west = spatial_coverages

  polar_point = (north == south) && (north.to_f.abs == 90)

  (east == west) || !polar_point
end
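
The check can be tried on a few sample bounding boxes. The sketch below is a free-standing copy under the hypothetical name `valid_coverage_sketch?`; the coordinate strings are invented, and they are split on whitespace the same way doc_valid? splits the field text.

```ruby
# Free-standing copy of the check; values arrive as strings because
# doc_valid? splits the text of the spatial_coverages field.
def valid_coverage_sketch?(spatial_coverages)
  north, east, south, west = spatial_coverages

  # A zero-height box sitting exactly on a pole.
  polar_point = (north == south) && (north.to_f.abs == 90)

  (east == west) || !polar_point
end

puts valid_coverage_sketch?('90 0 90 0'.split)      # single point at the pole
puts valid_coverage_sketch?('90 -180 90 180'.split) # zero-height strip along the pole
puts valid_coverage_sketch?('80 -180 60 180'.split) # ordinary bounding box
```

Only the second case is rejected: a box with zero height at ±90 whose east and west values differ. Note that the equality tests compare the raw strings, so `'90'` and `'90.0'` are treated as different values.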
