Class: Kotoshu::Components::PosTagger Abstract
- Inherits: Object
  - Object
    - Kotoshu::Components::PosTagger
- Defined in: lib/kotoshu/components/pos_tagger.rb
Overview
This class is abstract. Subclasses must implement #tag.
Base class for POS (Part-of-Speech) taggers.
POS taggers assign grammatical categories (NOUN, VERB, ADJ, etc.) to tokens. Different languages use different POS tagging strategies:
- Latin scripts: Dictionary-based (Hunspell flags → POS tags)
- CJK: Integrated with morphological analysis (tokenizer provides POS)
- German: Compound word decomposition affects tagging
Common POS tags (Penn Treebank style):
- CC: Coordinating conjunction
- CD: Cardinal number
- DT: Determiner
- EX: Existential there
- FW: Foreign word
- IN: Preposition or subordinating conjunction
- JJ: Adjective
- JJR: Adjective, comparative
- JJS: Adjective, superlative
- LS: List item marker
- MD: Modal
- NN: Noun, singular or mass
- NNS: Noun, plural
- NNP: Proper noun, singular
- NNPS: Proper noun, plural
- PDT: Predeterminer
- POS: Possessive ending
- PRP: Personal pronoun
- PRP$: Possessive pronoun
- RB: Adverb
- RBR: Adverb, comparative
- RBS: Adverb, superlative
- RP: Particle
- SYM: Symbol
- TO: to
- UH: Interjection
- VB: Verb, base form
- VBD: Verb, past tense
- VBG: Verb, gerund or present participle
- VBN: Verb, past participle
- VBP: Verb, non-3rd person singular present
- VBZ: Verb, 3rd person singular present
- WDT: Wh-determiner
- WP: Wh-pronoun
- WP$: Possessive wh-pronoun
- WRB: Wh-adverb
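The overview names coarse categories (NOUN, VERB, ADJ) while the list above uses fine-grained Penn Treebank tags. As a rough illustration of how the two relate, the sketch below collapses Penn tags into coarse categories by tag prefix. The `COARSE_BY_PREFIX` table and the `coarse_pos` helper are hypothetical names introduced here; nothing in this page indicates that Kotoshu performs this mapping.

```ruby
# Hypothetical helper, not part of Kotoshu: collapse Penn Treebank tags
# into coarse categories by looking at the tag prefix.
COARSE_BY_PREFIX = {
  "NN" => "NOUN", # NN, NNS, NNP, NNPS
  "VB" => "VERB", # VB, VBD, VBG, VBN, VBP, VBZ
  "JJ" => "ADJ",  # JJ, JJR, JJS
  "RB" => "ADV"   # RB, RBR, RBS
}.freeze

# Returns the coarse category for a Penn tag, or nil when the tag is nil
# or falls outside the four families covered by this sketch.
def coarse_pos(penn_tag)
  return nil if penn_tag.nil?

  _prefix, coarse = COARSE_BY_PREFIX.find { |prefix, _| penn_tag.start_with?(prefix) }
  coarse
end

noun_category = coarse_pos("NNPS")
verb_category = coarse_pos("VBZ")
other_category = coarse_pos("DT")
```

Matching on the prefix keeps the table small, at the cost of returning nil for every tag outside these four families (for example DT, IN, or WRB).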
Language-specific tags:
- CJK uses its own tagset (e.g., Japanese: 名詞, 動詞, etc.)
- German uses STTS tagset
Direct Known Subclasses
Languages::English::POSTagger, Languages::French::POSTagger, Languages::German::POSTagger, Languages::Japanese::POSTagger, Languages::Portuguese::POSTagger, Languages::Russian::POSTagger, Languages::Spanish::POSTagger
Instance Method Summary
- #tag(tokens) ⇒ Array<Hash> (abstract)
  Tag tokens with POS information.
- #tag_word(word) ⇒ Hash
  Tag a single word.
Instance Method Details
#tag(tokens) ⇒ Array<Hash>
This method is abstract. Subclasses must implement it.
Tag tokens with POS information.
Takes an array of token hashes (from Tokenizer#tokenize) and adds:
- :pos_tag (String, nil) - POS category (NOUN, VERB, etc.) or nil if unknown
- :lemma (String, nil) - Lemma/base form or nil if unknown
# File 'lib/kotoshu/components/pos_tagger.rb', line 81
def tag(tokens)
  raise NotImplementedError, "#{self.class} must implement #tag"
end
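To make the #tag contract concrete, here is a minimal sketch of a subclass that looks tokens up in a small in-memory lexicon. `LexiconTagger` and its `LEXICON` table are hypothetical and do not correspond to any of the known subclasses. The base class is reproduced from the source shown on this page so the example runs without the gem installed. The token hash keys (:token, :position, :length) follow the hash that #tag_word builds.

```ruby
# Stand-in for the base class, reproduced from the source shown on this
# page so the sketch runs on its own.
module Kotoshu
  module Components
    class PosTagger
      def tag(tokens)
        raise NotImplementedError, "#{self.class} must implement #tag"
      end

      def tag_word(word)
        token = { token: word, position: 0, length: word.length }
        result = tag([token])
        result.first || { pos_tag: nil, lemma: nil }
      end
    end
  end
end

# Hypothetical subclass: looks each token up in a tiny lexicon.
class LexiconTagger < Kotoshu::Components::PosTagger
  # Hypothetical data: lowercase surface form => [POS tag, lemma].
  LEXICON = {
    "the"  => ["DT", "the"],
    "cats" => ["NNS", "cat"],
    "ran"  => ["VBD", "run"]
  }.freeze

  def tag(tokens)
    tokens.map do |token|
      pos_tag, lemma = LEXICON[token[:token].downcase]
      # Unknown words still receive both keys, set to nil.
      token.merge(pos_tag: pos_tag, lemma: lemma)
    end
  end
end

tokens = [
  { token: "The",   position: 0, length: 3 },
  { token: "cats",  position: 4, length: 4 },
  { token: "xyzzy", position: 9, length: 5 }
]
tagged = LexiconTagger.new.tag(tokens)
```

Using `merge` returns new hashes and leaves the input tokens untouched. Whether the real subclasses mutate tokens in place or copy them is not stated on this page.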
#tag_word(word) ⇒ Hash
Tag a single word.
Convenience method for single-word tagging.
# File 'lib/kotoshu/components/pos_tagger.rb', line 91
def tag_word(word)
  token = { token: word, position: 0, length: word.length }
  result = tag([token])
  result.first || { pos_tag: nil, lemma: nil }
end
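The sketch below exercises the three paths through #tag_word: a normal result, the fallback hash returned when #tag yields an empty array, and the error raised when the abstract base class is used directly. `NounTagger` and `EmptyTagger` are hypothetical subclasses written only for this illustration, and the base class is again reproduced from the source shown on this page.

```ruby
# Stand-in for the base class, reproduced from the source shown on this
# page so the sketch runs on its own.
module Kotoshu
  module Components
    class PosTagger
      def tag(tokens)
        raise NotImplementedError, "#{self.class} must implement #tag"
      end

      def tag_word(word)
        token = { token: word, position: 0, length: word.length }
        result = tag([token])
        result.first || { pos_tag: nil, lemma: nil }
      end
    end
  end
end

# Hypothetical subclass that tags every token as a singular noun.
class NounTagger < Kotoshu::Components::PosTagger
  def tag(tokens)
    tokens.map { |t| t.merge(pos_tag: "NN", lemma: t[:token].downcase) }
  end
end

# Hypothetical subclass whose #tag returns nothing, to reach the fallback.
class EmptyTagger < Kotoshu::Components::PosTagger
  def tag(_tokens)
    []
  end
end

noun = NounTagger.new.tag_word("Dog")
empty = EmptyTagger.new.tag_word("Dog")

# tag_word delegates to #tag, so the base class raises here as well.
# NotImplementedError descends from ScriptError, not StandardError,
# so a bare `rescue` would not catch it; it must be named explicitly.
base_error =
  begin
    Kotoshu::Components::PosTagger.new.tag_word("Dog")
    nil
  rescue NotImplementedError => e
    e.message
  end
```

Note that the fallback hash contains only :pos_tag and :lemma, whereas a normal result also carries the :token, :position, and :length keys of the token that #tag_word built.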