Class: Kotoshu::Components::PosTagger Abstract

Inherits: Object

Object
  Kotoshu::Components::PosTagger

Defined in: lib/kotoshu/components/pos_tagger.rb

Overview

This class is abstract.

Subclasses must implement #tag

Base class for POS (Part-of-Speech) taggers.

POS taggers assign grammatical categories (NOUN, VERB, ADJ, etc.) to tokens. Different languages use different POS tagging strategies:

Latin scripts: Dictionary-based (Hunspell flags → POS tags)
CJK: Integrated with morphological analysis (tokenizer provides POS)
German: Compound word decomposition affects tagging

Common POS tags (Penn Treebank style):

CC: Coordinating conjunction
CD: Cardinal number
DT: Determiner
EX: Existential there
FW: Foreign word
IN: Preposition or subordinating conjunction
JJ: Adjective
JJR: Adjective, comparative
JJS: Adjective, superlative
LS: List item marker
MD: Modal
NN: Noun, singular or mass
NNS: Noun, plural
NNP: Proper noun, singular
NNPS: Proper noun, plural
PDT: Predeterminer
POS: Possessive ending
PRP: Personal pronoun
PRP$: Possessive pronoun
RB: Adverb
RBR: Adverb, comparative
RBS: Adverb, superlative
RP: Particle
SYM: Symbol
TO: to
UH: Interjection
VB: Verb, base form
VBD: Verb, past tense
VBG: Verb, gerund or present participle
VBN: Verb, past participle
VBP: Verb, non-3rd person singular present
VBZ: Verb, 3rd person singular present
WDT: Wh-determiner
WP: Wh-pronoun
WP$: Possessive wh-pronoun
WRB: Wh-adverb
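
The usage example in this overview shows coarse categories (DET, NOUN, VERB) rather than the fine-grained Penn Treebank tags listed above. The source does not say how, or whether, Kotoshu converts between the two. As a rough sketch under that caveat, collapsing Penn tags into coarse categories could look like the following. `PennTagExample`, its tables, and the category names are illustrative assumptions, not part of the library, and the mapping is deliberately partial.

```ruby
# Hypothetical helper, not part of Kotoshu: collapses fine-grained
# Penn Treebank tags into coarse categories such as NOUN or VERB.
module PennTagExample
  # Tags that map one-to-one to a coarse category (partial list).
  EXACT = {
    "CC"  => "CONJ",
    "CD"  => "NUM",
    "DT"  => "DET",
    "PDT" => "DET",
    "WDT" => "DET",
    "IN"  => "ADP",
    "MD"  => "VERB",
    "UH"  => "INTJ",
    "WRB" => "ADV"
  }.freeze

  # Tag families that share a prefix (NN, NNS, NNP, NNPS => NOUN, and so on).
  PREFIXES = {
    "NN"  => "NOUN",
    "VB"  => "VERB",
    "JJ"  => "ADJ",
    "RB"  => "ADV",
    "PRP" => "PRON",
    "WP"  => "PRON"
  }.freeze

  # Returns the coarse category for a Penn tag, or nil if it is not covered.
  def self.coarse(tag)
    return EXACT[tag] if EXACT.key?(tag)

    PREFIXES.find { |prefix, _category| tag.start_with?(prefix) }&.last
  end
end

%w[DT NN VBZ PRP$ XYZ].each do |tag|
  puts "#{tag} -> #{PennTagExample.coarse(tag).inspect}"
end
```

Returning nil for uncovered tags mirrors the documented convention that :pos_tag may be nil when the category is unknown.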

Language-specific tags:

CJK languages use their own tagsets (e.g., Japanese: 名詞 "noun", 動詞 "verb", etc.)
German uses the STTS tagset

Examples:

Tagging tokens

tagger = EnglishPosTagger.new(aff_path: "en_US.aff", dic_path: "en_US.dic")
tokens = [
  { token: "The", position: 0, length: 3 },
  { token: "dog", position: 4, length: 3 },
  { token: "runs", position: 8, length: 4 }
]
tagged = tagger.tag(tokens)
# => [
#      { token: "The", position: 0, length: 3, pos_tag: "DET", lemma: "the" },
#      { token: "dog", position: 4, length: 3, pos_tag: "NOUN", lemma: "dog" },
#      { token: "runs", position: 8, length: 4, pos_tag: "VERB", lemma: "run" }
#    ]

Direct Known Subclasses

Languages::English::POSTagger, Languages::French::POSTagger, Languages::German::POSTagger, Languages::Japanese::POSTagger, Languages::Portuguese::POSTagger, Languages::Russian::POSTagger, Languages::Spanish::POSTagger

Instance Method Summary

#tag(tokens) ⇒ Array<Hash> abstract

Tag tokens with POS information.

#tag_word(word) ⇒ Hash

Tag a single word.

Instance Method Details

#tag(tokens) ⇒ `Array<Hash>`

This method is abstract.

Subclasses must implement this method.

Tag tokens with POS information.

Takes an array of token hashes (from Tokenizer#tokenize) and adds the following keys to each hash:

:pos_tag (String, nil) - POS category (NOUN, VERB, etc.) or nil if unknown
:lemma (String, nil) - Lemma/base form or nil if unknown

Parameters:

tokens (Array<Hash>) —

Array of token hashes from Tokenizer

Returns:

(Array<Hash>) —

Token hashes with added :pos_tag and :lemma keys

Raises:

(NotImplementedError) —

if not implemented by subclass



# File 'lib/kotoshu/components/pos_tagger.rb', line 81

def tag(tokens)
  raise NotImplementedError, "#{self.class} must implement #tag"
end
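
To illustrate the contract described above, here is a minimal sketch of a concrete subclass. `LexiconPosTagger` and its in-memory lexicon are hypothetical and exist only for this example; the real language-specific taggers are not shown in this page and may work quite differently. The base class is re-declared as a stand-in so the snippet runs without the gem loaded.

```ruby
# Stand-in for the abstract base class, copied from the source listing above,
# so that this sketch is runnable on its own.
module Kotoshu
  module Components
    class PosTagger
      def tag(tokens)
        raise NotImplementedError, "#{self.class} must implement #tag"
      end
    end
  end
end

# Hypothetical subclass: looks each token up in a plain Hash lexicon.
class LexiconPosTagger < Kotoshu::Components::PosTagger
  # lexicon maps a lowercase word to { pos_tag: String, lemma: String }.
  def initialize(lexicon)
    @lexicon = lexicon
  end

  # Returns new token hashes with :pos_tag and :lemma added.
  # Both values are nil when the word is not in the lexicon.
  def tag(tokens)
    tokens.map do |token|
      entry = @lexicon.fetch(token[:token].downcase, {})
      token.merge(pos_tag: entry[:pos_tag], lemma: entry[:lemma])
    end
  end
end

lexicon = {
  "the"  => { pos_tag: "DET",  lemma: "the" },
  "dog"  => { pos_tag: "NOUN", lemma: "dog" },
  "runs" => { pos_tag: "VERB", lemma: "run" }
}

tagger = LexiconPosTagger.new(lexicon)
tagged = tagger.tag([
  { token: "The",   position: 0, length: 3 },
  { token: "xyzzy", position: 4, length: 5 }
])
tagged.each { |t| puts t.inspect }
```

Using `Hash#merge` leaves the caller's token hashes untouched and keeps the original :token, :position, and :length keys in the result. Whether the real implementations copy or mutate their input is not stated in the source.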

#tag_word(word) ⇒ `Hash`

Tag a single word.

Convenience method for single-word tagging.
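
The source does not show how #tag_word is implemented. One plausible shape for such a convenience method, offered purely as an assumption, is to wrap the word in a one-element token array and delegate to #tag. `SketchTagger` and its tiny lexicon below are hypothetical and only illustrate that delegation pattern.

```ruby
# Hypothetical tagger used only to illustrate the delegation pattern;
# it is not the Kotoshu implementation.
class SketchTagger
  # word => [pos_tag, lemma]
  LEXICON = { "dog" => ["NOUN", "dog"], "runs" => ["VERB", "run"] }.freeze

  def tag(tokens)
    tokens.map do |token|
      pos_tag, lemma = LEXICON[token[:token].downcase]
      token.merge(pos_tag: pos_tag, lemma: lemma)
    end
  end

  # Assumed behaviour: build a single token hash and reuse #tag.
  def tag_word(word)
    tag([{ token: word, position: 0, length: word.length }]).first
  end
end

result = SketchTagger.new.tag_word("runs")
puts result.inspect
```

Delegating to #tag keeps single-word and batch tagging consistent, since both paths go through the same subclass logic.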