Class: SemanticTextChunker::Splitters::SentenceSplitter
- Inherits:
- Object
- Object
- SemanticTextChunker::Splitters::SentenceSplitter
- Defined in:
- lib/semantic_text_chunker/splitters/sentence_splitter.rb
Constant Summary
- ABBREVS =
%w[Mr Mrs Dr Prof Sr Jr vs etc e.g i.e U.S U.K U.S.A Fig Vol No].freeze
- SPLIT_PATTERN =
Splits after a sentence terminator (optionally followed by a closing quote or bracket) and whitespace, when the next sentence starts with an opening quote or bracket, an uppercase letter, or a digit.
/(?<=[.?!]|[.?!]["')\]])\s+(?=["'(\[A-Z0-9])/
- ABBREV_PLACEHOLDER =
"__STC_ABBREV__".freeze
- DECIMAL_PLACEHOLDER =
"__STC_DEC__".freeze
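To make the behaviour of SPLIT_PATTERN concrete, here is a minimal sketch that applies a local copy of the pattern to a sample string. The name SPLIT_PATTERN_DEMO and the sample text are assumptions made for illustration; only the regular expression itself comes from the constant listed above.

```ruby
# Local copy of the documented pattern so the snippet runs without the gem.
# SPLIT_PATTERN_DEMO is a hypothetical name used only in this sketch.
SPLIT_PATTERN_DEMO = /(?<=[.?!]|[.?!]["')\]])\s+(?=["'(\[A-Z0-9])/

text = 'He said "Stop." Then he left. it rained! 3 people stayed.'

# The lookbehind requires a terminator (optionally plus a closing quote or
# bracket); the lookahead requires an opening quote/bracket, A-Z, or a digit.
parts = text.split(SPLIT_PATTERN_DEMO)

parts.each { |sentence| puts sentence }
# He said "Stop."
# Then he left. it rained!
# 3 people stayed.
```

Note that "left. it" is not split, because the lookahead rejects a lowercase letter after the whitespace.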
Instance Method Summary
- #initialize(extra_abbreviations: []) ⇒ SentenceSplitter (constructor)
A new instance of SentenceSplitter.
- #split(text) ⇒ Object
Constructor Details
#initialize(extra_abbreviations: []) ⇒ SentenceSplitter
Returns a new instance of SentenceSplitter.
# File 'lib/semantic_text_chunker/splitters/sentence_splitter.rb', line 14
def initialize(extra_abbreviations: [])
  @abbrevs = (ABBREVS + extra_abbreviations).freeze
  @abbrev_pattern = /\b(#{@abbrevs.map { |a| Regexp.escape(a) }.join("|")})\.\s/
end
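The constructor merges the built-in list with any caller-supplied abbreviations and compiles them into a single alternation. The following stand-alone sketch reproduces that step outside the class; the shortened base list, the extra entries, and the sample sentence are assumptions chosen for illustration.

```ruby
# Stand-alone reconstruction of the pattern-building step (illustrative only).
base  = %w[Mr Dr e.g]   # shortened stand-in for ABBREVS
extra = %w[Inc Ltd]     # hypothetical extra_abbreviations

abbrevs = (base + extra).freeze

# Regexp.escape keeps the dot in "e.g" literal instead of "any character".
abbrev_pattern = /\b(#{abbrevs.map { |a| Regexp.escape(a) }.join("|")})\.\s/

sample  = "Acme Inc. hired Dr. Lee, e.g. for audits."
matches = sample.scan(abbrev_pattern).flatten

p matches
# ["Inc", "Dr", "e.g"]
```

Because the pattern requires whitespace after the period, an abbreviation at the very end of the input (with no trailing whitespace) is not matched.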
Instance Method Details
#split(text) ⇒ Object
# File 'lib/semantic_text_chunker/splitters/sentence_splitter.rb', line 19
def split(text)
  # Protect periods inside decimal numbers (e.g. 3.14, v1.2.3)
  protected = text.gsub(/(\d)\.(\d)/) { "#{$1}#{DECIMAL_PLACEHOLDER}#{$2}" }
  # Temporarily replace abbreviation periods
  protected = protected.gsub(@abbrev_pattern) { "#{$1}#{ABBREV_PLACEHOLDER} " }
  protected
    .split(SPLIT_PATTERN)
    .map { |s| s.gsub(ABBREV_PLACEHOLDER, ".").gsub(DECIMAL_PLACEHOLDER, ".").strip }
    .reject(&:empty?)
end
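The protect, split, restore sequence can be tried end to end with a condensed copy of the class. This is a sketch for experimentation, not the gem itself: DemoSentenceSplitter is a hypothetical name, the local variable is renamed to guarded, and the sample sentences are assumptions.

```ruby
# Condensed stand-alone copy of the documented logic (illustrative only).
# The real class lives under SemanticTextChunker::Splitters.
class DemoSentenceSplitter
  ABBREVS = %w[Mr Mrs Dr Prof Sr Jr vs etc e.g i.e U.S U.K U.S.A Fig Vol No].freeze
  SPLIT_PATTERN = /(?<=[.?!]|[.?!]["')\]])\s+(?=["'(\[A-Z0-9])/
  ABBREV_PLACEHOLDER = "__STC_ABBREV__".freeze
  DECIMAL_PLACEHOLDER = "__STC_DEC__".freeze

  def initialize(extra_abbreviations: [])
    abbrevs = ABBREVS + extra_abbreviations
    @abbrev_pattern = /\b(#{abbrevs.map { |a| Regexp.escape(a) }.join("|")})\.\s/
  end

  def split(text)
    # Step 1: hide periods that sit between two digits.
    guarded = text.gsub(/(\d)\.(\d)/) { "#{$1}#{DECIMAL_PLACEHOLDER}#{$2}" }
    # Step 2: hide periods that end a known abbreviation.
    guarded = guarded.gsub(@abbrev_pattern) { "#{$1}#{ABBREV_PLACEHOLDER} " }
    # Step 3: split, then restore the hidden periods in each piece.
    guarded
      .split(SPLIT_PATTERN)
      .map { |s| s.gsub(ABBREV_PLACEHOLDER, ".").gsub(DECIMAL_PLACEHOLDER, ".").strip }
      .reject(&:empty?)
  end
end

default_result = DemoSentenceSplitter.new.split("Dr. Smith paid 3.14 dollars. Was it fair? Yes!")
# ["Dr. Smith paid 3.14 dollars.", "Was it fair?", "Yes!"]

plain  = DemoSentenceSplitter.new.split("Acme Inc. Reported gains. Shares rose.")
# ["Acme Inc.", "Reported gains.", "Shares rose."]

custom = DemoSentenceSplitter.new(extra_abbreviations: %w[Inc])
                             .split("Acme Inc. Reported gains. Shares rose.")
# ["Acme Inc. Reported gains.", "Shares rose."]
```

The last two calls show why extra_abbreviations exists: "Inc" is not in the default list, so "Inc. Reported" is treated as a sentence boundary unless the caller registers it.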