Overview

Kira uses BetterStack as its unified observability backend and collects three signals through OpenTelemetry:
  • Logs - application logs
  • Traces - request tracing
  • Metrics - performance metrics

Configuration

kira-be

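// src/telemetry/index.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { BatchSpanProcessor, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { HttpInstrumentation } from "@opentelemetry/instrumentation-http";
import { UndiciInstrumentation } from "@opentelemetry/instrumentation-undici";
import { IORedisInstrumentation } from "@opentelemetry/instrumentation-ioredis";
// FetchInstrumentation, resource, traceExporter, and metricExporter are
// imported or defined elsewhere in this file (not shown here).

const sdk = new NodeSDK({
  resource,
  sampler: new TraceIdRatioBasedSampler(0.1), // sample 10% of traces
  spanProcessors: [new BatchSpanProcessor(traceExporter)],
  metricReader: new PeriodicExportingMetricReader({
    exporter: metricExporter,
    exportIntervalMillis: 30000, // export metrics every 30 seconds
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new FetchInstrumentation(),
    new UndiciInstrumentation(),
    new IORedisInstrumentation(),
  ],
});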
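
To make the 10% sampling rate concrete, the following is a simplified sketch of how a ratio-based sampler can decide from the trace ID alone. It illustrates the idea behind TraceIdRatioBasedSampler and is not a copy of the library source; the function names foldTraceId and shouldSample are hypothetical.

```typescript
// Simplified sketch of ratio-based sampling keyed on the trace ID.
// Function names are illustrative, not part of any OpenTelemetry API.

/** XOR-fold a 32-hex-character trace ID into one unsigned 32-bit integer. */
function foldTraceId(traceId: string): number {
  let acc = 0;
  for (let pos = 0; pos < traceId.length; pos += 8) {
    const part = parseInt(traceId.slice(pos, pos + 8), 16);
    acc = (acc ^ part) >>> 0; // ">>> 0" keeps the value unsigned
  }
  return acc;
}

/** Keep the trace when its folded ID falls below ratio * 2^32 (approximately). */
function shouldSample(traceId: string, ratio: number): boolean {
  const upperBound = Math.floor(ratio * 0xffffffff);
  return foldTraceId(traceId) < upperBound;
}
```

Because the decision depends only on the trace ID, the same trace gets the same answer every time it is evaluated with the same ratio. At a ratio of 0.1, roughly nine out of ten traces are not exported, which is worth keeping in mind when searching for a specific request.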

kira-imgproxy

IMGPROXY_OPEN_TELEMETRY_ENABLE = "true"
IMGPROXY_OPEN_TELEMETRY_ENABLE_METRICS = "true"
OTEL_EXPORTER_OTLP_PROTOCOL = "http/protobuf"
OTEL_SERVICE_NAME = "kira-imgproxy"

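Logs

Log levels

| Level | Purpose |
| --- | --- |
| error | Errors and exceptions |
| warn | Warnings |
| info | General information |
| debug | Debugging information (development only) |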
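
The "development only" rule for debug can be expressed as a minimum-severity filter. The sketch below is one possible way to do that; the function names and the "development" environment string are assumptions, while the severity numbers follow the OpenTelemetry logs data model (DEBUG=5, INFO=9, WARN=13, ERROR=17).

```typescript
type LogLevel = "error" | "warn" | "info" | "debug";

// Severity numbers as defined by the OpenTelemetry logs data model.
const SEVERITY_NUMBER: Record<LogLevel, number> = {
  debug: 5,
  info: 9,
  warn: 13,
  error: 17,
};

// Hypothetical policy: debug is allowed only when the environment is "development".
function minLevelFor(environment: string): LogLevel {
  return environment === "development" ? "debug" : "info";
}

// Returns true when a record at `level` should be emitted in `environment`.
function shouldEmit(level: LogLevel, environment: string): boolean {
  return SEVERITY_NUMBER[level] >= SEVERITY_NUMBER[minLevelFor(environment)];
}
```

A caller could check shouldEmit before logger.emit, or pass SEVERITY_NUMBER[level] as the record's severityNumber so the backend can filter by level.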

Custom logs

import { logs } from "@opentelemetry/api-logs";

const logger = logs.getLogger("kira-be");

logger.emit({
  severityText: "INFO",
  body: "User generated image",
  attributes: {
    userId: "xxx",
    toolName: "generateImageWithAI",
    durationMs: 1234,
  },
});

AI tool logs

Every AI tool execution is logged:
// src/telemetry/logger.ts
export function logAIToolExecuted(
  toolName: string,
  resourceId: string,
  threadId: string,
  durationMs: number
) {
  logger.emit({
    severityText: "INFO",
    body: `AI tool executed: ${toolName}`,
    attributes: {
      "tool_name": toolName,
      "user_id": resourceId,
      "thread_id": threadId,
      "duration_ms": durationMs,
    },
  });
}
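
One way to guarantee that every execution is logged is to wrap the tool call and measure its duration in one place. The sketch below assumes a hypothetical helper named withToolLogging; the log function is passed in so the example stays self-contained, and in the real code it would be logAIToolExecuted. Logging failed runs as well is a design choice made here, not something the source states.

```typescript
// Same parameter order as logAIToolExecuted above.
type ToolLogFn = (
  toolName: string,
  resourceId: string,
  threadId: string,
  durationMs: number
) => void;

// Hypothetical wrapper: runs the tool, then logs its duration exactly once.
async function withToolLogging<T>(
  log: ToolLogFn,
  toolName: string,
  resourceId: string,
  threadId: string,
  run: () => Promise<T>,
  now: () => number = Date.now // injectable clock, useful in tests
): Promise<T> {
  const startedAt = now();
  try {
    return await run();
  } finally {
    // Runs on success and on failure, so no execution goes unlogged.
    log(toolName, resourceId, threadId, now() - startedAt);
  }
}
```

Keeping the timing inside the wrapper means individual tools do not need their own stopwatch code, and the duration_ms attribute is computed the same way everywhere.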

Traces

Automatic tracing

Spans are generated automatically for the following requests:
  • Inbound HTTP requests
  • Outbound HTTP requests (fetch/undici)
  • Redis operations

External service tracing

// src/telemetry/external.ts
export const ExternalService = {
  FAL: "fal",
  REPLICATE: "replicate",
  OPENAI: "openai",
  ANTHROPIC: "anthropic",
  GOOGLE: "google",
  XAI: "xai",
  BYTEPLUS: "byteplus",
  IDEOGRAM: "ideogram",
  RECRAFT: "recraft",
  COHERE: "cohere",
  SERPER: "serper",
} as const;

// Usage example
const response = await tracedFetch(url, options, {
  service: ExternalService.BYTEPLUS,
  operation: "seedream_edit",
});

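Span attributes

| Attribute | Description |
| --- | --- |
| service.name | Service name |
| http.method | HTTP method |
| http.url | Request URL |
| http.status_code | Response status code |
| external.service | External service name |
| external.operation | Operation name |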
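
The source does not show how tracedFetch is implemented. As a rough sketch of how such a wrapper could attach the attributes listed above, the example below takes the fetch implementation and a span sink as parameters instead of using the OpenTelemetry API. The names tracedFetchSketch, SpanRecord, and the span naming scheme are all assumptions.

```typescript
interface SpanRecord {
  name: string;
  attributes: Record<string, string | number>;
  durationMs: number;
  error?: string | undefined;
}

interface ExternalCall {
  service: string;
  operation: string;
}

type MinimalResponse = { status: number };
type FetchLike<R extends MinimalResponse> = (
  url: string,
  init?: { method?: string }
) => Promise<R>;

// Hypothetical wrapper: performs the request and records one span-like entry.
async function tracedFetchSketch<R extends MinimalResponse>(
  fetchImpl: FetchLike<R>,
  record: (span: SpanRecord) => void,
  url: string,
  init: { method?: string } | undefined,
  call: ExternalCall
): Promise<R> {
  const attributes: Record<string, string | number> = {
    "http.method": init?.method ?? "GET",
    "http.url": url,
    "external.service": call.service,
    "external.operation": call.operation,
  };
  const startedAt = Date.now();
  let error: string | undefined;
  try {
    const response = await fetchImpl(url, init);
    attributes["http.status_code"] = response.status;
    return response;
  } catch (err) {
    error = err instanceof Error ? err.message : String(err);
    throw err; // the caller still sees the original failure
  } finally {
    record({
      name: `${call.service}.${call.operation}`, // assumed naming scheme
      attributes,
      durationMs: Date.now() - startedAt,
      error,
    });
  }
}
```

In a real implementation the record callback would be replaced by a tracer span, and service.name would come from the SDK resource instead of being set per call.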

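Metrics

kira-be metrics

| Metric | Type | Description |
| --- | --- | --- |
| ai.tool.count | Counter | AI tool executions |
| ai.tool.duration | Histogram | AI tool execution time |
| ai.chat.count | Counter | AI chat conversations |
| ai.chat.duration | Histogram | AI chat duration |
| ai.image.count | Counter | AI image generations |
| ai.image.duration | Histogram | AI image generation time |
| http.request.count | Counter | HTTP requests |
| http.request.duration | Histogram | HTTP request duration |
| error.count | Counter | Errors |
| external.service.duration | Histogram | External service call duration |
| cache.hit.count | Counter | Cache hits |
| cache.miss.count | Counter | Cache misses |

// src/telemetry/metrics.ts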
export function recordAITool(
  toolName: string,
  success: boolean,
  durationMs: number
) {
  aiToolCounter.add(1, {
    tool_name: toolName,
    success: success.toString(),
  });
  aiToolDuration.record(durationMs, {
    tool_name: toolName,
    success: success.toString(),
  });
}
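
Because recordAITool tags every increment with tool_name and success, a per-tool success rate can be derived from the counter alone. The sketch below assumes the counter data has already been read into plain objects; the CounterPoint shape and the successRateByTool name are hypothetical.

```typescript
// One exported data point of ai.tool.count, reduced to what the example needs.
interface CounterPoint {
  attributes: { tool_name: string; success: string }; // success is "true" or "false"
  value: number;
}

// Returns successes / total executions for each tool that has at least one execution.
function successRateByTool(points: CounterPoint[]): Record<string, number> {
  const totals: Record<string, { ok: number; all: number }> = {};
  for (const point of points) {
    const tool = point.attributes.tool_name;
    const entry = totals[tool] ?? { ok: 0, all: 0 };
    entry.all += point.value;
    if (point.attributes.success === "true") entry.ok += point.value;
    totals[tool] = entry;
  }
  const rates: Record<string, number> = {};
  for (const tool of Object.keys(totals)) {
    const entry = totals[tool];
    if (entry && entry.all > 0) rates[tool] = entry.ok / entry.all;
  }
  return rates;
}
```

The same grouping works for ai.tool.duration, for example to compare the latency of successful and failed runs of one tool.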

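kira-imgproxy metrics

| Metric | Type | Description | Alert threshold |
| --- | --- | --- | --- |
| workers_utilization | Gauge | Worker utilization (0-1) | > 0.5 Warning, > 0.8 Critical |
| vips_memory_bytes | Gauge | libvips memory | - |
| images_in_progress | Gauge | Images being processed | > 100 Warning, > 150 Critical |
| goroutines | Gauge | Go goroutines | > 50 Warning, > 100 Critical |
| heap_mb | Gauge | Go heap memory | > 150MB Warning, > 300MB Critical |
| process_mb | Gauge | Total process memory | > 6GB Warning, > 7GB Critical |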
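
The thresholds in this table can be read as a simple two-level classification. The sketch below encodes them as data and classifies a single gauge reading; the names IMGPROXY_THRESHOLDS and classify are hypothetical, and the process_mb limits are written in MB (6000 and 7000), matching the alert rules later in this document.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Values copied from the table; vips_memory_bytes has no threshold and is omitted.
const IMGPROXY_THRESHOLDS: Record<string, Threshold> = {
  workers_utilization: { warning: 0.5, critical: 0.8 },
  images_in_progress: { warning: 100, critical: 150 },
  goroutines: { warning: 50, critical: 100 },
  heap_mb: { warning: 150, critical: 300 },
  process_mb: { warning: 6000, critical: 7000 },
};

// Classifies one reading; metrics without a threshold are always "ok".
function classify(metric: string, value: number): Severity {
  const threshold = IMGPROXY_THRESHOLDS[metric];
  if (!threshold) return "ok";
  if (value > threshold.critical) return "critical";
  if (value > threshold.warning) return "warning";
  return "ok";
}
```

The comparisons are strict, so a reading exactly at a threshold (for example a utilization of 0.5) does not raise the severity.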

Dashboard queries

Query syntax

BetterStack dashboards are queried with ClickHouse SQL:
SELECT
  toStartOfInterval(dt, INTERVAL 1 minute) as time,
  avgMerge(metric_name) as value
FROM {{source}}
WHERE dt BETWEEN {{start_time}} AND {{end_time}}
GROUP BY time
ORDER BY time
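
All dashboard queries in this document follow the template above and differ only in the aggregate function and the metric column. As an illustration, a small helper could generate them; buildTimeSeriesQuery is a hypothetical name, and the {{source}}, {{start_time}}, and {{end_time}} placeholders are left untouched for the dashboard to fill in.

```typescript
type MergeAggregate = "avgMerge" | "maxMerge";

const IDENTIFIER = /^[A-Za-z_][A-Za-z0-9_]*$/;

// Builds a one-minute time-series query for a single metric column.
function buildTimeSeriesQuery(
  metric: string,
  aggregate: MergeAggregate,
  alias: string = "value"
): string {
  // Reject anything that is not a plain identifier, since the names are interpolated into SQL.
  if (!IDENTIFIER.test(metric) || !IDENTIFIER.test(alias)) {
    throw new Error("metric and alias must be plain identifiers");
  }
  return [
    "SELECT",
    "  toStartOfInterval(dt, INTERVAL 1 minute) as time,",
    `  ${aggregate}(${metric}) as ${alias}`,
    "FROM {{source}}",
    "WHERE dt BETWEEN {{start_time}} AND {{end_time}}",
    "GROUP BY time",
    "ORDER BY time",
  ].join("\n");
}
```

For example, buildTimeSeriesQuery("goroutines", "maxMerge") produces a query shaped like the ones in this document with maxMerge(goroutines) as the value column.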

imgproxy Dashboard

Worker utilization

SELECT
  toStartOfInterval(dt, INTERVAL 1 minute) as time,
  avgMerge(workers_utilization) * 100 as utilization
FROM {{source}}
WHERE dt BETWEEN {{start_time}} AND {{end_time}}
GROUP BY time
ORDER BY time

Memory usage

SELECT
  toStartOfInterval(dt, INTERVAL 1 minute) as time,
  maxMerge(process_mb) as process_mb,
  maxMerge(vips_mb) as vips_mb
FROM {{source}}
WHERE dt BETWEEN {{start_time}} AND {{end_time}}
GROUP BY time
ORDER BY time

Images in progress

SELECT
  toStartOfInterval(dt, INTERVAL 1 minute) as time,
  maxMerge(images_in_progress) as images,
  maxMerge(requests_in_progress) as requests
FROM {{source}}
WHERE dt BETWEEN {{start_time}} AND {{end_time}}
GROUP BY time
ORDER BY time

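Alert configuration

imgproxy alerts

| Alert | Condition | Duration | Severity |
| --- | --- | --- | --- |
| High worker utilization | workers_utilization > 0.5 | 5 minutes | Warning |
| Worker utilization saturated | workers_utilization > 0.8 | 2 minutes | Critical |
| High memory usage | process_mb > 6000 | 5 minutes | Warning |
| Critical memory usage | process_mb > 7000 | 2 minutes | Critical |
| High concurrency | images_in_progress > 100 | 5 minutes | Warning |
| Goroutine leak | goroutines > 50 | 5 minutes | Warning |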
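
Each rule combines a condition with a duration, which is usually read as "the condition held for the whole window". How BetterStack evaluates this internally is not described in the source, so the sketch below is only one plausible interpretation; Sample, AlertRule, and isFiring are hypothetical names.

```typescript
interface Sample {
  timestampMs: number;
  value: number;
}

interface AlertRule {
  threshold: number;
  forMinutes: number;
}

// Fires when every sample inside the trailing window exceeds the threshold.
// An empty window never fires, so missing data does not page anyone.
function isFiring(samples: Sample[], rule: AlertRule, nowMs: number): boolean {
  const windowStart = nowMs - rule.forMinutes * 60000;
  const inWindow = samples.filter(
    (s) => s.timestampMs >= windowStart && s.timestampMs <= nowMs
  );
  return inWindow.length > 0 && inWindow.every((s) => s.value > rule.threshold);
}
```

Under this reading, a single dip below the threshold inside the window resets the alert, which is why the Critical rules use a shorter duration than the Warning rules.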

Setup steps

  1. Open BetterStack Dashboard → Alerting
  2. Click Create Alert
  3. Select the Source
  4. Configure the query condition
  5. Set the threshold and duration
  6. Configure the notification channels

Sentry integration

User context

The backend sets the user context after JWT authentication:
// src/hono/middleware/auth.ts
const payload = c.get("jwtPayload");
if (payload?.sub) {
  Sentry.setUser({
    id: payload.sub,
    email: payload.email,
  });
}
The frontend sets it after login:
// login-provider.tsx
useEffect(() => {
  if (isAuth && profile) {
    Sentry.setUser({
      id: profile.id,
      email: profile.email,
      username: profile.nickname,
    });
  } else {
    Sentry.setUser(null);
  }
}, [isAuth, profile]);

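Error tracking

Sentry is initialized in src/telemetry/sentry.ts and captures exceptions automatically at the middleware layer. Route handlers do not call Sentry.captureException directly; they rely on the middleware-level integration to report uncaught errors.
// src/telemetry/sentry.ts - initialization example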
import * as Sentry from "@sentry/bun";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  // ...configuration
});
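
The middleware-level capture described above can be pictured as a wrapper around each handler. The sketch below is framework-agnostic and does not reproduce the project's actual middleware: withErrorCapture is a hypothetical name, and the capture callback stands in for Sentry.captureException.

```typescript
type Handler<C, R> = (ctx: C) => Promise<R>;

// Wraps a handler so that any uncaught error is reported once and then rethrown.
function withErrorCapture<C, R>(
  handler: Handler<C, R>,
  capture: (error: unknown) => void
): Handler<C, R> {
  return async (ctx: C): Promise<R> => {
    try {
      return await handler(ctx);
    } catch (error) {
      capture(error); // in a real setup: Sentry.captureException(error)
      throw error; // rethrow so the framework still produces the error response
    }
  };
}
```

With reporting centralized like this, route handlers stay free of Sentry calls, and errors that a handler catches and handles itself are not reported.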

Environment variables

# BetterStack
BETTERSTACK_HOST=in-otel.logs.betterstack.com
BETTERSTACK_TOKEN=xxx

# Sentry
SENTRY_DSN=https://xxx@o.betterstack.com/xxx

# PostHog (Analytics)
POSTHOG_API_KEY=xxx
POSTHOG_PROJECT_ID=xxx
OpenTelemetry and Sentry are enabled only in production; they are skipped automatically in development.
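
The source does not say how the production check is made. A minimal sketch, assuming NODE_ENV is the switch and that the exporter authenticates with a Bearer token, could resolve the OTLP endpoints from the variables above as follows; resolveOtlpTarget and OtlpTarget are hypothetical names. The /v1/traces, /v1/logs, and /v1/metrics paths are the standard OTLP/HTTP paths.

```typescript
interface OtlpTarget {
  tracesUrl: string;
  logsUrl: string;
  metricsUrl: string;
  headers: Record<string, string>;
}

// Returns null when telemetry should stay disabled.
function resolveOtlpTarget(
  env: Record<string, string | undefined>
): OtlpTarget | null {
  // Assumption: production is detected through NODE_ENV.
  if (env["NODE_ENV"] !== "production") return null;

  const host = env["BETTERSTACK_HOST"];
  const token = env["BETTERSTACK_TOKEN"];
  if (!host || !token) return null; // incomplete configuration: skip instead of crashing

  const base = `https://${host}`;
  return {
    tracesUrl: `${base}/v1/traces`,
    logsUrl: `${base}/v1/logs`,
    metricsUrl: `${base}/v1/metrics`,
    // Assumption: the ingest endpoint expects a Bearer token.
    headers: { Authorization: `Bearer ${token}` },
  };
}
```

Calling resolveOtlpTarget(process.env) at startup and skipping SDK initialization when it returns null keeps development runs free of exporter traffic.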