Module: Girb::LanguageDetector
- Defined in: lib/girb/language_detector.rb
Overview
Lightweight, dependency-free heuristic for the response language.
The goal is NOT accurate language identification — it is to stop the model from drifting to English when the user clearly wrote in Japanese (a recurring bug), without forcing a wrong language onto users of other languages.
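This page does not show how the result is consumed. Every value the detector returns, including the fallback phrase "the user's language", reads naturally inside a sentence, so one plausible use is to interpolate it into a prompt instruction. A minimal sketch under that assumption; `language_instruction` is a hypothetical helper and not part of Girb:

```ruby
# Hypothetical helper (not part of Girb): turns a detector result into
# a one-line instruction for the model.
def language_instruction(language)
  "Respond in #{language}."
end

puts language_instruction("Japanese")
puts language_instruction("the user's language")
```

With this shape, an uncertain detection yields the instruction "Respond in the user's language." rather than naming a possibly wrong language.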
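Returns one of:
"Japanese" - high confidence (kana present)
"English" - high confidence (Latin-only text that contains a common English function word)
"the user's language" - uncertain (blank input, Han without kana, any other non-Latin script, or Latin-only text without an English marker); fall back to "match the user" guidance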
Constant Summary
- FALLBACK =
  "the user's language"
- ENGLISH_MARKERS =
  Common English function words. Latin-only text is classified as "English" only when one of these appears, so Spanish, French, German, or romaji questions fall back to "match the user" instead of being forced into English. Note: the single-letter "i" is intentionally excluded. Because the pattern is case-insensitive (the /i flag), it would match a Ruby loop variable `i` and misclassify code as English.
  /\b(?:the|is|are|was|were|what|why|how|does|do|did|this|that|these|those|
     can|could|should|would|will|when|where|which|who|of|to|in|on|for|
     with|and|or|not|please|explain|show|tell|my|it|its)\b/xi
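To see the marker rule in isolation, the sketch below copies the pattern from ENGLISH_MARKERS and runs it against a few Latin-only inputs. The sample strings are illustrative and do not come from the Girb test suite.

```ruby
# Pattern copied from ENGLISH_MARKERS above. With the /x flag the line
# breaks and indentation inside the literal are ignored.
ENGLISH_MARKERS = /\b(?:the|is|are|was|were|what|why|how|does|do|did|this|that|these|those|
                      can|could|should|would|will|when|where|which|who|of|to|in|on|for|
                      with|and|or|not|please|explain|show|tell|my|it|its)\b/xi

samples = [
  "What does this method do?", # contains "what", "does", "this", "do"
  "Hola, necesito ayuda",      # Spanish, no marker word
  "kore wa nan desu ka",       # romaji, no marker word
  "i += 1"                     # code; "i" is deliberately not a marker
]

# Print each sample next to whether the marker pattern matches it.
samples.each do |text|
  puts format("%-28s %s", text.inspect, ENGLISH_MARKERS.match?(text))
end
```

Only the first sample matches. The `\b` anchors require whole words, so "to" inside "necesito" and "or" inside "kore" do not count as markers.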
Class Method Summary
Class Method Details
.detect(text) ⇒ Object
# File 'lib/girb/language_detector.rb', line 28

def detect(text)
  s = text.to_s
  return FALLBACK if s.strip.empty?

  # Kana is unique to Japanese; any occurrence is a strong signal.
  return "Japanese" if s.match?(/[\p{Hiragana}\p{Katakana}]/)

  # Han without kana is ambiguous (could be Chinese); stay cautious.
  return FALLBACK if s.match?(/\p{Han}/)

  # Any other non-Latin script (Hangul, Cyrillic, Arabic, ...) -> don't guess.
  return FALLBACK if s.match?(/[^\p{Latin}\p{Common}\p{Inherited}]/)

  # Latin-only: only commit to English when a clear English marker is present.
  return "English" if s.match?(ENGLISH_MARKERS)

  FALLBACK
end
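The script checks in `detect` rely on Ruby's Unicode script properties. The sketch below isolates those checks to show which one fires first for a given input. The constant names and the `script_signal` helper are hypothetical and exist only for illustration; the patterns and their order are taken from the method above.

```ruby
# Hypothetical names for the three script checks used in detect.
KANA         = /[\p{Hiragana}\p{Katakana}]/
HAN          = /\p{Han}/
OTHER_SCRIPT = /[^\p{Latin}\p{Common}\p{Inherited}]/

# Returns a symbol naming the first check that fires, in the same
# order as detect. :latin_only means no script check fired.
def script_signal(text)
  s = text.to_s
  return :empty        if s.strip.empty?
  return :kana         if s.match?(KANA)
  return :han          if s.match?(HAN)
  return :other_script if s.match?(OTHER_SCRIPT)

  :latin_only
end

[
  "漢字とかな",            # kanji mixed with hiragana
  "这个方法是什么",        # Han only
  "안녕하세요",            # Hangul
  "Hola, necesito ayuda", # Latin letters plus punctuation
  nil                     # becomes "" via to_s
].each do |text|
  puts "#{text.inspect} -> #{script_signal(text)}"
end
```

The order matters: Japanese text usually mixes kanji (Han) with kana, so the kana check has to run before the Han check. Punctuation, digits, and spaces belong to the Common script, so they never trigger the "other script" branch.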