Files
galaxy-game/backend/internal/diplomail/detector/detector.go
T
Ilia Denisov e22f4b7800
Tests · Go / test (push) Successful in 1m59s
Tests · Go / test (pull_request) Successful in 2m0s
Tests · Integration / integration (pull_request) Successful in 1m35s
diplomail (Stage D): language detection + lazy translation cache
Replaces the LangUndetermined placeholder with whatlanggo-backed
body detection on every send path, then adds a translation cache
keyed on (message_id, target_lang) populated lazily on the
per-message read endpoint. The noop translator that ships with
Stage D returns engine="noop", which the service treats as
"translation unavailable" — wiring a real backend (LibreTranslate
HTTP client is the documented next step) is a one-file swap.

GetMessage and ListInbox now accept a targetLang argument; the HTTP
layer resolves the caller's accounts.preferred_language and
forwards it. Inbox uses the cache only (never calls the
translator) so bulk reads stay fast under future SaaS backends.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 19:16:12 +02:00

80 lines
2.9 KiB
Go

// Package detector wraps the body-language detection used by the
// diplomail subsystem. The package exposes a narrow `LanguageDetector`
// interface so the implementation can be swapped without touching the
// callers; the default backed-by-whatlanggo detector handles 84
// natural languages and ships with the embedded statistical profiles.
//
// Detection happens only on the body. Subjects are short and
// frequently template-like ("Re: ..."), so detecting on them adds
// noise. The diplomail Service feeds the body, captures the BCP 47
// tag returned here, and stores it in `diplomail_messages.body_lang`.
package detector
import (
"strings"
"unicode/utf8"
"github.com/abadojack/whatlanggo"
)
// Undetermined is the BCP 47 placeholder stored when detection cannot
// confidently identify a language (empty body, too-short body, mixed
// scripts the detector refuses to bet on).
const Undetermined = "und"
// LanguageDetector is the read-only surface diplomail consumes when
// it needs to label a message body. Detect must never panic and
// must never return an error: detection failure simply yields
// `Undetermined`.
type LanguageDetector interface {
Detect(body string) string
}
// New returns the package-default detector backed by `whatlanggo`.
// The instance is safe for concurrent use; whatlanggo's `Detect`
// reads the embedded profiles without state mutation. Callers that
// want a fixed allow-list can build their own implementation around
// the same interface.
func New() LanguageDetector {
return &whatlangDetector{}
}
type whatlangDetector struct{}
// minRunes is the lower bound on body length below which whatlanggo
// can flip between near-synonyms; for shorter bodies we return
// `Undetermined` and let the noop translator skip the slot. The
// value matches whatlanggo's documented "stable above ~25 runes"
// guidance.
const minRunes = 25
// Detect returns the BCP 47 tag for body, or `Undetermined` when the
// body is empty / too short / whatlanggo refuses to label it. The
// trim is applied so leading whitespace does not bias the script
// detector toward Latin. We deliberately do not gate on
// `info.IsReliable()` because the gate is too conservative for the
// short sentences typical of in-game mail; a misclassification only
// hurts the translation cache key, never correctness.
func (d *whatlangDetector) Detect(body string) string {
body = strings.TrimSpace(body)
if body == "" {
return Undetermined
}
if utf8.RuneCountInString(body) < minRunes {
return Undetermined
}
info := whatlanggo.Detect(body)
tag := info.Lang.Iso6391()
if tag == "" {
return Undetermined
}
return tag
}
// NoopDetector returns the placeholder unconditionally. Used by
// tests and by Stage A code paths that predate the real detector.
type NoopDetector struct{}
// Detect always returns `Undetermined` regardless of input.
func (NoopDetector) Detect(string) string { return Undetermined }