Class: SemanticTextChunker::Splitters::SentenceSplitter

Inherits:
Object
Defined in:
lib/semantic_text_chunker/splitters/sentence_splitter.rb

Constant Summary

ABBREVS =
%w[Mr Mrs Dr Prof Sr Jr vs etc e.g i.e U.S U.K U.S.A Fig Vol No].freeze
SPLIT_PATTERN =

Splits on whitespace that follows a sentence terminator (., ?, or !), optionally followed by one closing quote, parenthesis, or square bracket, when the next sentence starts with a quote, an opening parenthesis or square bracket, an uppercase ASCII letter, or a digit.

/(?<=[.?!]|[.?!]["')\]])\s+(?=["'(\[A-Z0-9])/
ABBREV_PLACEHOLDER =
"__STC_ABBREV__".freeze
DECIMAL_PLACEHOLDER =
"__STC_DEC__".freeze
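
To make the behavior of SPLIT_PATTERN concrete, the sketch below applies the same regex to a sample string with plain String#split. The constant name DEMO_SPLIT_PATTERN and the sample text are assumptions made for this illustration only; they are not part of the library.

```ruby
# Same regex as SPLIT_PATTERN, repeated here so the snippet runs standalone.
# DEMO_SPLIT_PATTERN is a hypothetical name used only in this example.
DEMO_SPLIT_PATTERN = /(?<=[.?!]|[.?!]["')\]])\s+(?=["'(\[A-Z0-9])/

sample = 'He said "Stop!" Then he left. it was late. 3 people saw it.'

# The lookbehind accepts either a bare terminator or a terminator plus one
# closing quote/bracket; the lookahead rejects a lowercase sentence start.
sample.split(DEMO_SPLIT_PATTERN).each { |part| puts part }
# He said "Stop!"
# Then he left. it was late.
# 3 people saw it.
```

Note that "left. it" is not split because the next word starts with a lowercase letter, while "late. 3" is split because a digit counts as a valid sentence start.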

Instance Method Summary

Constructor Details

#initialize(extra_abbreviations: []) ⇒ SentenceSplitter

Returns a new instance of SentenceSplitter.



# File 'lib/semantic_text_chunker/splitters/sentence_splitter.rb', line 14

def initialize(extra_abbreviations: [])
  @abbrevs = (ABBREVS + extra_abbreviations).freeze
  @abbrev_pattern = /\b(#{@abbrevs.map { |a| Regexp.escape(a) }.join("|")})\.\s/
end
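
The effect of extra_abbreviations can be sketched in isolation. The helper below is hypothetical (mask_abbreviations and BUILT_IN_ABBREVS are names invented for this example); it builds the pattern the same way the constructor does and applies the masking step, so the difference between the default list and an extended one is visible. "Inc" is an assumed extra abbreviation.

```ruby
# Copy of the ABBREVS list, under a hypothetical name for this example.
BUILT_IN_ABBREVS = %w[Mr Mrs Dr Prof Sr Jr vs etc e.g i.e U.S U.K U.S.A Fig Vol No].freeze

# Hypothetical helper: builds the abbreviation pattern as #initialize does,
# then replaces the period after each matched abbreviation with a placeholder.
def mask_abbreviations(text, extra_abbreviations: [])
  abbrevs = BUILT_IN_ABBREVS + extra_abbreviations
  pattern = /\b(#{abbrevs.map { |a| Regexp.escape(a) }.join("|")})\.\s/
  text.gsub(pattern) { "#{$1}__STC_ABBREV__ " }
end

text = "Acme Inc. hired Dr. Lee. She starts Monday."

puts mask_abbreviations(text)
# Acme Inc. hired Dr__STC_ABBREV__ Lee. She starts Monday.

puts mask_abbreviations(text, extra_abbreviations: %w[Inc])
# Acme Inc__STC_ABBREV__ hired Dr__STC_ABBREV__ Lee. She starts Monday.
```

With the default list, the period after "Inc" stays unmasked and would be treated as a sentence boundary; passing "Inc" as an extra abbreviation protects it.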

Instance Method Details

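#split(text) ⇒ Array<String>

Splits text into sentences. Periods inside decimal numbers and after known abbreviations are masked with placeholders before splitting on SPLIT_PATTERN and restored afterwards. Returns the stripped, non-empty sentences in their original order.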



# File 'lib/semantic_text_chunker/splitters/sentence_splitter.rb', line 19

def split(text)
  # Protect periods between digits (e.g. 3.14, v1.2.3). Lookarounds leave the
  # digits unconsumed, so every period in a run like 1.2.3 is replaced.
  masked = text.gsub(/(?<=\d)\.(?=\d)/, DECIMAL_PLACEHOLDER)

  # Temporarily replace the period that ends a known abbreviation
  masked = masked.gsub(@abbrev_pattern) { "#{$1}#{ABBREV_PLACEHOLDER} " }

  # Split on sentence boundaries, then restore the protected periods
  masked
    .split(SPLIT_PATTERN)
    .map { |s| s.gsub(ABBREV_PLACEHOLDER, ".").gsub(DECIMAL_PLACEHOLDER, ".").strip }
    .reject(&:empty?)
end
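
An end-to-end usage sketch follows. So that it runs without the gem installed, it defines a standalone stand-in class that follows the same steps as the listing above, with the SemanticTextChunker::Splitters nesting omitted; in real code the class would be loaded from the library instead. The sample sentences are made up for illustration.

```ruby
# Standalone stand-in for the documented class (module nesting omitted),
# included only so this example is self-contained.
class SentenceSplitter
  ABBREVS = %w[Mr Mrs Dr Prof Sr Jr vs etc e.g i.e U.S U.K U.S.A Fig Vol No].freeze
  SPLIT_PATTERN = /(?<=[.?!]|[.?!]["')\]])\s+(?=["'(\[A-Z0-9])/
  ABBREV_PLACEHOLDER = "__STC_ABBREV__".freeze
  DECIMAL_PLACEHOLDER = "__STC_DEC__".freeze

  def initialize(extra_abbreviations: [])
    @abbrevs = (ABBREVS + extra_abbreviations).freeze
    @abbrev_pattern = /\b(#{@abbrevs.map { |a| Regexp.escape(a) }.join("|")})\.\s/
  end

  def split(text)
    # Mask decimal and abbreviation periods, split, then restore them.
    masked = text.gsub(/(\d)\.(\d)/) { "#{$1}#{DECIMAL_PLACEHOLDER}#{$2}" }
    masked = masked.gsub(@abbrev_pattern) { "#{$1}#{ABBREV_PLACEHOLDER} " }
    masked
      .split(SPLIT_PATTERN)
      .map { |s| s.gsub(ABBREV_PLACEHOLDER, ".").gsub(DECIMAL_PLACEHOLDER, ".").strip }
      .reject(&:empty?)
  end
end

splitter = SentenceSplitter.new
text = "Dr. Smith measured 3.14 cm. Was it right? Yes! See Fig. 2 for details."
splitter.split(text).each { |sentence| puts sentence }
# Dr. Smith measured 3.14 cm.
# Was it right?
# Yes!
# See Fig. 2 for details.
```

One consequence of the masking approach: a listed abbreviation followed by a period and whitespace is always protected, so a sentence that ends in one (for example "We sold pens, ink, etc. They left.") is returned as a single sentence.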