Kira uses BetterStack for unified observability and collects the following through OpenTelemetry:
- Logs: application logs
- Traces: request tracing
- Metrics: performance metrics
kira-be
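// src/telemetry/index.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { BatchSpanProcessor, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { HttpInstrumentation } from "@opentelemetry/instrumentation-http";
import { UndiciInstrumentation } from "@opentelemetry/instrumentation-undici";
import { IORedisInstrumentation } from "@opentelemetry/instrumentation-ioredis";
// resource, traceExporter, metricExporter, and FetchInstrumentation are set up elsewhere in this file (not shown).

const sdk = new NodeSDK({
  resource,
  sampler: new TraceIdRatioBasedSampler(0.1), // 10% sampling rate
  spanProcessors: [new BatchSpanProcessor(traceExporter)],
  metricReader: new PeriodicExportingMetricReader({
    exporter: metricExporter,
    exportIntervalMillis: 30000, // export metrics every 30 seconds
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new FetchInstrumentation(),
    new UndiciInstrumentation(),
    new IORedisInstrumentation(),
  ],
});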
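With a ratio of 0.1, roughly one trace in ten is kept, and the decision is derived from the trace ID so that every span of a trace gets the same verdict. The following is a simplified illustration of that idea, not the library's actual code; `shouldSample` and the way the ID is folded into a number are assumptions made for the example.

```typescript
// Simplified illustration of ratio-based sampling keyed on the trace ID.
// Not the OpenTelemetry implementation; shouldSample is a hypothetical helper.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Fold the hex trace ID, 8 characters at a time, into one unsigned 32-bit number.
  let acc = 0;
  for (let i = 0; i < traceId.length; i += 8) {
    acc = (acc ^ parseInt(traceId.slice(i, i + 8), 16)) >>> 0;
  }
  // Keep the trace when the folded value lands in the lowest `ratio` share of the range.
  return acc < Math.floor(ratio * 0xffffffff);
}
```

Because the result depends only on the trace ID and the ratio, calling the function twice with the same ID always gives the same answer.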
kira-imgproxy
IMGPROXY_OPEN_TELEMETRY_ENABLE = "true"
IMGPROXY_OPEN_TELEMETRY_ENABLE_METRICS = "true"
OTEL_EXPORTER_OTLP_PROTOCOL = "http/protobuf"
OTEL_SERVICE_NAME = "kira-imgproxy"
Logs
Log levels
| Level | Purpose |
|---|---|
| error | Errors and exceptions |
| warn | Warnings |
| info | General information |
| debug | Debugging information (development only) |
Custom logs
import { logs } from "@opentelemetry/api-logs";

const logger = logs.getLogger("kira-be");

logger.emit({
  severityText: "INFO",
  body: "User generated image",
  attributes: {
    userId: "xxx",
    toolName: "generateImageWithAI",
    durationMs: 1234,
  },
});
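The snippet above sets only `severityText`. OpenTelemetry log records can also carry a numeric `severityNumber`, which makes level-based filtering independent of the text label. Below is a minimal sketch that maps the four levels to the base severity numbers defined by the OpenTelemetry log data model; `buildLogRecord` is a hypothetical helper, not part of kira-be.

```typescript
type Level = "error" | "warn" | "info" | "debug";

// Base severity numbers from the OpenTelemetry log data model.
const SEVERITY_NUMBER: Record<Level, number> = {
  debug: 5,
  info: 9,
  warn: 13,
  error: 17,
};

// Hypothetical helper: builds an object shaped like the argument of logger.emit().
function buildLogRecord(
  level: Level,
  body: string,
  attributes: Record<string, string | number> = {},
) {
  return {
    severityText: level.toUpperCase(),
    severityNumber: SEVERITY_NUMBER[level],
    body,
    attributes,
  };
}

const record = buildLogRecord("info", "User generated image", { durationMs: 1234 });
```

The resulting object could then be passed to `logger.emit(record)`.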
AI tool logs
Every AI tool execution is logged:
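// src/telemetry/logger.ts
import { logs } from "@opentelemetry/api-logs";

// Same logger setup as in the custom log example.
const logger = logs.getLogger("kira-be");

export function logAIToolExecuted(
  toolName: string,
  resourceId: string,
  threadId: string,
  durationMs: number
) {
  logger.emit({
    severityText: "INFO",
    body: `AI tool executed: ${toolName}`,
    attributes: {
      "ai.tool.name": toolName,
      "user.id": resourceId, // resourceId is recorded as the user ID
      "thread.id": threadId,
      "duration.ms": durationMs,
    },
  });
}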
Traces
Automatic tracing
Spans are generated automatically for the following:
- Inbound HTTP requests
- Outbound HTTP requests (fetch/undici)
- Redis operations
External service tracing
// src/telemetry/external.ts
export enum ExternalService {
  FAL = "fal",
  REPLICATE = "replicate",
  OPENAI = "openai",
  ANTHROPIC = "anthropic",
  GOOGLE = "google",
  XAI = "xai",
  BYTEPLUS = "byteplus",
  ILLUSTRIOUS = "illustrious",
}

// Usage example
const response = await tracedFetch(url, options, {
  service: ExternalService.BYTEPLUS,
  operation: "seedream_edit",
});
Span attributes
| Attribute | Description |
|---|---|
| service.name | Service name |
| http.method | HTTP method |
| http.url | Request URL |
| http.status_code | Response status code |
| external.service | External service name |
| external.operation | Operation name |
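The real `tracedFetch` in `src/telemetry/external.ts` is not shown in this document. As a rough sketch of how such a wrapper might attach the `external.*` and `http.*` attributes, the example below injects the span factory and the fetch function so it can run without the OpenTelemetry SDK. `MinimalSpan`, `tracedFetchSketch`, and the span naming scheme are assumptions made for illustration.

```typescript
// Hypothetical minimal span shape; a real OpenTelemetry Span also has setAttribute() and end().
interface MinimalSpan {
  setAttribute(key: string, value: string | number): void;
  end(): void;
}

interface TraceMeta {
  service: string;
  operation: string;
}

type FetchLike = (url: string, init?: object) => Promise<{ status: number }>;

async function tracedFetchSketch(
  url: string,
  init: object | undefined,
  meta: TraceMeta,
  startSpan: (name: string) => MinimalSpan,
  fetchImpl: FetchLike,
): Promise<{ status: number }> {
  // Assumed naming scheme: "<service>.<operation>".
  const span = startSpan(`${meta.service}.${meta.operation}`);
  span.setAttribute("external.service", meta.service);
  span.setAttribute("external.operation", meta.operation);
  span.setAttribute("http.url", url);
  try {
    const res = await fetchImpl(url, init);
    span.setAttribute("http.status_code", res.status);
    return res;
  } finally {
    // End the span whether the request succeeded or threw.
    span.end();
  }
}
```

Ending the span in `finally` keeps failed requests visible in the trace instead of leaving the span open.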
Metrics
kira-be metrics
| Metric | Type | Description |
|---|---|---|
| ai_tool_executions | Counter | Number of AI tool executions |
| ai_tool_duration_ms | Histogram | AI tool execution duration |
| http_request_duration_ms | Histogram | HTTP request duration |
// src/telemetry/metrics.ts
export function recordAITool(
  toolName: string,
  success: boolean,
  durationMs: number
) {
  aiToolCounter.add(1, {
    tool_name: toolName,
    success: success.toString(),
  });
  aiToolDuration.record(durationMs, {
    tool_name: toolName,
  });
}
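How `recordAITool` is called is not shown in this document. One way to make sure both successful and failed executions are counted is to wrap the tool call and report from a `finally` block. This is a minimal sketch; `withToolMetrics` and `RecordFn` are hypothetical names.

```typescript
type RecordFn = (toolName: string, success: boolean, durationMs: number) => void;

// Hypothetical wrapper: runs a tool, measures it, and reports the outcome.
async function withToolMetrics<T>(
  toolName: string,
  run: () => Promise<T>,
  record: RecordFn, // for example, recordAITool
): Promise<T> {
  const startedAt = Date.now();
  let success = false;
  try {
    const result = await run();
    success = true;
    return result;
  } finally {
    // Runs on both success and failure, so failed executions are counted too.
    record(toolName, success, Date.now() - startedAt);
  }
}
```

A thrown error still propagates to the caller; the wrapper only observes the outcome.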
kira-imgproxy metrics
| Metric | Type | Description | Alert thresholds |
|---|---|---|---|
| workers_utilization | Gauge | Worker utilization (0-1) | > 0.5 Warning, > 0.8 Critical |
| vips_memory_bytes | Gauge | libvips memory | - |
| images_in_progress | Gauge | Images being processed | > 100 Warning, > 150 Critical |
| goroutines | Gauge | Number of goroutines | > 50 Warning, > 100 Critical |
| heap_mb | Gauge | Go heap memory | > 150 MB Warning, > 300 MB Critical |
| process_mb | Gauge | Total process memory | > 6 GB Warning, > 7 GB Critical |
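To make the threshold semantics concrete, here is a small sketch that classifies a metric value against the warning and critical limits from the table. `classify` and `IMGPROXY_THRESHOLDS` are hypothetical names, and the memory limits are written in MB (6 GB as 6000, matching the `process_mb > 6000` alert condition used later in this document).

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Thresholds copied from the table; memory values are in MB.
const IMGPROXY_THRESHOLDS: Record<string, Threshold> = {
  workers_utilization: { warning: 0.5, critical: 0.8 },
  images_in_progress: { warning: 100, critical: 150 },
  goroutines: { warning: 50, critical: 100 },
  heap_mb: { warning: 150, critical: 300 },
  process_mb: { warning: 6000, critical: 7000 },
};

function classify(metric: string, value: number): Severity {
  const t = IMGPROXY_THRESHOLDS[metric];
  if (!t) return "ok"; // No threshold defined, for example vips_memory_bytes.
  if (value > t.critical) return "critical";
  if (value > t.warning) return "warning";
  return "ok";
}
```

The comparisons are strict, so a value exactly at a limit (for example a utilization of 0.5) does not raise the level.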
Dashboard queries
Query syntax
BetterStack uses ClickHouse SQL syntax:
SELECT
toStartOfInterval(dt, INTERVAL 1 minute) as time,
avgMerge(metric_name) as value
FROM {{source}}
WHERE dt BETWEEN {{start_time}} AND {{end_time}}
GROUP BY time
ORDER BY time
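The dashboard queries in this document all follow this template: bucket `dt` into intervals, apply a `-Merge` aggregate to one metric, and filter on the `{{start_time}}` and `{{end_time}}` placeholders. A hedged sketch of generating such a query from TypeScript follows; `buildMetricQuery` is a hypothetical helper, not part of BetterStack or kira-be.

```typescript
type MergeFn = "avgMerge" | "maxMerge";

// Hypothetical helper that fills in the query template shown above.
function buildMetricQuery(metric: string, merge: MergeFn, intervalMinutes = 1): string {
  // Allow only plain identifiers so the metric name cannot inject SQL.
  if (!/^[a-z_][a-z0-9_]*$/i.test(metric)) {
    throw new Error(`invalid metric name: ${metric}`);
  }
  return [
    "SELECT",
    `  toStartOfInterval(dt, INTERVAL ${intervalMinutes} minute) AS time,`,
    `  ${merge}(${metric}) AS value`,
    "FROM {{source}}",
    "WHERE dt BETWEEN {{start_time}} AND {{end_time}}",
    "GROUP BY time",
    "ORDER BY time",
  ].join("\n");
}
```

For example, `buildMetricQuery("workers_utilization", "avgMerge")` produces the same shape as the generic query above with the metric name filled in.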
imgproxy Dashboard
Worker utilization
SELECT
toStartOfInterval(dt, INTERVAL 1 minute) as time,
avgMerge(workers_utilization) * 100 as utilization
FROM {{source}}
WHERE dt BETWEEN {{start_time}} AND {{end_time}}
GROUP BY time
ORDER BY time
Memory usage
SELECT
toStartOfInterval(dt, INTERVAL 1 minute) as time,
maxMerge(process_mb) as process_mb,
maxMerge(vips_mb) as vips_mb
FROM {{source}}
WHERE dt BETWEEN {{start_time}} AND {{end_time}}
GROUP BY time
ORDER BY time
Images in progress
SELECT
toStartOfInterval(dt, INTERVAL 1 minute) as time,
maxMerge(images_in_progress) as images,
maxMerge(requests_in_progress) as requests
FROM {{source}}
WHERE dt BETWEEN {{start_time}} AND {{end_time}}
GROUP BY time
ORDER BY time
Alert configuration
imgproxy alerts
| Alert | Condition | Duration | Level |
|---|---|---|---|
| High worker utilization | workers_utilization > 0.5 | 5 minutes | Warning |
| Worker utilization saturated | workers_utilization > 0.8 | 2 minutes | Critical |
| High memory usage | process_mb > 6000 | 5 minutes | Warning |
| Dangerous memory usage | process_mb > 7000 | 2 minutes | Critical |
| Concurrency too high | images_in_progress > 100 | 5 minutes | Warning |
| Goroutine leak | goroutines > 50 | 5 minutes | Warning |
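Each alert combines a threshold with a duration: the condition has to hold for the whole window before the alert fires. The sketch below shows one plausible reading of that rule, assuming one sample per minute; BetterStack's actual evaluation may differ, and `isAlertFiring` is a hypothetical helper.

```typescript
// Hypothetical evaluator. `samples` holds one value per minute, oldest first.
// Returns true when the most recent `durationMinutes` samples all exceed the threshold.
function isAlertFiring(
  samples: number[],
  threshold: number,
  durationMinutes: number,
): boolean {
  if (durationMinutes <= 0 || samples.length < durationMinutes) return false;
  const recent = samples.slice(-durationMinutes);
  return recent.every((value) => value > threshold);
}
```

Under this reading, a single spike above 0.8 does not trigger the critical worker utilization alert; two consecutive minutes above 0.8 do.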
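Configuration steps
- Go to BetterStack Dashboard → Alerting
- Click Create Alert
- Select the Source
- Configure the query condition
- Set the threshold and duration
- Configure the notification channels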
Sentry integration
User context
The backend sets the user context after JWT authentication:
// src/hono/middleware/auth.ts
const payload = c.get("jwtPayload");
if (payload?.sub) {
  Sentry.setUser({
    id: payload.sub,
    email: payload.email,
  });
}
The frontend sets it after login and clears it on logout:
// login-provider.tsx
useEffect(() => {
  if (isAuth && profile) {
    Sentry.setUser({
      id: profile.id,
      email: profile.email,
      username: profile.nickname,
    });
  } else {
    Sentry.setUser(null);
  }
}, [isAuth, profile]);
Error tracking
try {
  // ...
} catch (error) {
  Sentry.captureException(error, {
    tags: {
      toolName: "generateImageWithAI",
    },
    extra: {
      input: sanitizedInput,
    },
  });
  throw error;
}
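The snippet above attaches `sanitizedInput` rather than the raw input, but how it is produced is not shown. As an assumption-laden sketch, a sanitizer could redact values whose keys look sensitive before they are sent to Sentry. `sanitizeInput` and the `SENSITIVE_KEYS` list are hypothetical and would need to match the real input shape.

```typescript
// Hypothetical list of key fragments treated as sensitive.
const SENSITIVE_KEYS = ["password", "token", "secret", "authorization", "apikey"];

// Hypothetical sanitizer: returns a copy with sensitive-looking values redacted.
function sanitizeInput(input: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(input)) {
    const lower = key.toLowerCase();
    if (SENSITIVE_KEYS.some((s) => lower.includes(s))) {
      out[key] = "[redacted]";
    } else if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      // Recurse into nested plain objects; arrays are copied as-is in this sketch.
      out[key] = sanitizeInput(value as Record<string, unknown>);
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

The function returns a new object and leaves the original input untouched, so the tool itself still sees the unredacted values.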
Environment variables
# BetterStack
BETTERSTACK_HOST=in-otel.logs.betterstack.com
BETTERSTACK_TOKEN=xxx
# Sentry
SENTRY_DSN=https://<public-key>@<host>/xxx
# PostHog (Analytics)
POSTHOG_API_KEY=xxx
POSTHOG_PROJECT_ID=xxx
OpenTelemetry and Sentry are enabled only in production; in development they are skipped automatically.
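The document does not say which setting decides whether the environment counts as production. A minimal sketch of such a gate, assuming `NODE_ENV` is the switch; `shouldEnableTelemetry` is a hypothetical name.

```typescript
// Hypothetical gate; assumes NODE_ENV distinguishes production from development.
function shouldEnableTelemetry(env: Record<string, string | undefined>): boolean {
  return env.NODE_ENV === "production";
}
```

At startup this could guard initialization, for example by calling `sdk.start()` only when `shouldEnableTelemetry(process.env)` returns true.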