Module: Girb::LanguageDetector

Defined in:
lib/girb/language_detector.rb

Overview

Lightweight, dependency-free heuristic for the response language.

The goal is NOT accurate language identification — it is to stop the model from drifting to English when the user clearly wrote in Japanese (a recurring bug), without forcing a wrong language onto users of other languages.

Returns one of:

"Japanese"            - high confidence (kana present)
"English"             - high confidence (Latin-only, with at least one English marker word)
"the user's language" - uncertain; fall back to "match the user" guidance
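
The three return values read naturally when dropped into a single instruction template, which is presumably why the fallback is phrased as "the user's language". The sketch below illustrates that idea; `language_instruction` is a hypothetical helper invented for this example, not part of Girb, and the prompt wording is an assumption.

```ruby
# Hypothetical helper (not part of Girb): builds one prompt line from a
# detector label. The template wording is an assumption for illustration.
def language_instruction(label)
  "Respond in #{label}."
end

# The three labels documented above all fit the same template.
["Japanese", "English", "the user's language"].each do |label|
  puts language_instruction(label)
end
```

With this shape the caller needs no special case for the uncertain result: the fallback label simply yields "match the user" guidance.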

Constant Summary

FALLBACK =
"the user's language"
ENGLISH_MARKERS =

Common English function words. Latin-only text is called "English" only when one of these appears, so Spanish/French/German/romaji questions fall back to "match the user" instead of being forced into English. Note: the single-letter "i" is intentionally excluded, because with the /i flag it would match a Ruby loop variable `i` and misclassify code as English.

/\b(?:the|is|are|was|were|what|why|how|does|do|did|this|that|these|those|
can|could|should|would|will|when|where|which|who|of|to|in|on|for|
with|and|or|not|please|explain|show|tell|my|it|its)\b/xi
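
As a rough illustration of how the marker list behaves, the snippet below runs a local copy of the pattern against a few Latin-only inputs. `MARKERS` and `english_marker?` are names made up for this example; the pattern itself is copied from the constant above so the snippet runs on its own.

```ruby
# Local copy of the documented pattern (the /x flag ignores the line breaks).
MARKERS = /\b(?:the|is|are|was|were|what|why|how|does|do|did|this|that|these|those|
  can|could|should|would|will|when|where|which|who|of|to|in|on|for|
  with|and|or|not|please|explain|show|tell|my|it|its)\b/xi

# Hypothetical wrapper, for readability only.
def english_marker?(text)
  text.match?(MARKERS)
end

english_marker?("Why does this raise NoMethodError?") # true: "why", "does", "this"
english_marker?("¿Por qué falla este método?")        # false: no listed word
english_marker?("kore wa nan desu ka")                # false: romaji, no listed word
english_marker?("i = 0")                              # false: "i" is not in the list
english_marker?("Was ist das?")                       # true: German "was" collides with English "was"
```

The last case shows the limits of a word-list heuristic: a few listed words are also valid words in other Latin-script languages, so an occasional false "English" remains possible.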

Class Method Summary

Class Method Details

.detect(text) ⇒ Object



# File 'lib/girb/language_detector.rb', line 28

def detect(text)
  s = text.to_s
  return FALLBACK if s.strip.empty?

  # Kana is unique to Japanese — any occurrence is a strong signal.
  return "Japanese" if s.match?(/[\p{Hiragana}\p{Katakana}]/)

  # Han without kana is ambiguous (could be Chinese); stay cautious.
  return FALLBACK if s.match?(/\p{Han}/)

  # Any other non-Latin script (Hangul, Cyrillic, Arabic, ...) -> don't guess.
  return FALLBACK if s.match?(/[^\p{Latin}\p{Common}\p{Inherited}]/)

  # Latin-only: only commit to English when a clear English marker is present.
  return "English" if s.match?(ENGLISH_MARKERS)

  FALLBACK
end
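
To make the order of the checks easier to follow, here is a small sketch that evaluates the three script tests from the method independently on sample inputs. `KANA`, `HAN`, `NON_LATIN`, and `script_signals` are hypothetical names for this example; only the regular expressions are taken from the listing above.

```ruby
# The three script tests used by detect, pulled out as local constants.
KANA      = /[\p{Hiragana}\p{Katakana}]/
HAN       = /\p{Han}/
NON_LATIN = /[^\p{Latin}\p{Common}\p{Inherited}]/

# Hypothetical helper: reports which tests fire for a given string.
def script_signals(text)
  {
    kana: text.match?(KANA),
    han: text.match?(HAN),
    non_latin: text.match?(NON_LATIN)
  }
end

script_signals("これは何ですか")      # kana: true,  han: true,  non_latin: true
script_signals("这是什么")            # kana: false, han: true,  non_latin: true
script_signals("이게 뭐야?")          # kana: false, han: false, non_latin: true
script_signals("¿Qué es esto? 123")  # kana: false, han: false, non_latin: false
```

Because the kana test runs first, Japanese text that mixes kana and kanji is classified before the Han check can send it to the fallback. Punctuation and digits belong to the Common script, so they never trigger the non-Latin test on their own.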