
Overview

Kira uses BetterStack for unified observability and collects the following signals through OpenTelemetry:
  • Logs - application logs
  • Traces - request tracing
  • Metrics - performance metrics

Configuration

kira-be

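// src/telemetry/index.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import {
  BatchSpanProcessor,
  TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { HttpInstrumentation } from "@opentelemetry/instrumentation-http";
import { UndiciInstrumentation } from "@opentelemetry/instrumentation-undici";
import { IORedisInstrumentation } from "@opentelemetry/instrumentation-ioredis";

// `resource`, `traceExporter`, `metricExporter`, and `FetchInstrumentation`
// are defined or imported elsewhere in this file (not shown here).
const sdk = new NodeSDK({
  resource,
  sampler: new TraceIdRatioBasedSampler(0.1), // sample 10% of traces
  spanProcessors: [new BatchSpanProcessor(traceExporter)],
  metricReader: new PeriodicExportingMetricReader({
    exporter: metricExporter,
    exportIntervalMillis: 30000, // export metrics every 30 seconds
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new FetchInstrumentation(),
    new UndiciInstrumentation(),
    new IORedisInstrumentation(),
  ],
});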
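
To make the 10% sampling ratio concrete, here is a minimal sketch of how a ratio-based sampler can reach a deterministic decision from the trace ID. It is a simplified illustration and not the SDK's actual implementation; `shouldSample` is a hypothetical name.

```typescript
// Simplified illustration of ratio-based sampling; NOT the SDK's real code.
// Idea: map the trace ID to a number in [0, 2^32) and keep the trace when
// that number falls below ratio * 2^32. The same trace ID always yields the
// same decision, so every service that sees the trace agrees on it.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio <= 0) return false;
  if (ratio >= 1) return true;
  // Use the first 8 hex characters (32 bits) of the trace ID as the bucket.
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < Math.floor(ratio * 0x100000000);
}
```

With a ratio of 0.1, roughly one in ten uniformly distributed trace IDs falls below the bound, which is why only about 10% of requests appear as traces in BetterStack.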

kira-imgproxy

IMGPROXY_OPEN_TELEMETRY_ENABLE = "true"
IMGPROXY_OPEN_TELEMETRY_ENABLE_METRICS = "true"
OTEL_EXPORTER_OTLP_PROTOCOL = "http/protobuf"
OTEL_SERVICE_NAME = "kira-imgproxy"

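Logs

Log levels

| Level | Purpose |
| --- | --- |
| error | Errors and exceptions |
| warn | Warnings |
| info | General information |
| debug | Debug information (development only) |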
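
The level table can be expressed as a small filter. The sketch below maps each level to its severity number from the OpenTelemetry log data model and drops `debug` outside development. `shouldEmit` and the use of a `NODE_ENV`-style string are assumptions for illustration, not code from kira-be.

```typescript
type LogLevel = "error" | "warn" | "info" | "debug";

// Severity numbers from the OpenTelemetry log data model
// (DEBUG = 5, INFO = 9, WARN = 13, ERROR = 17).
const SEVERITY_NUMBER: Record<LogLevel, number> = {
  debug: 5,
  info: 9,
  warn: 13,
  error: 17,
};

// Hypothetical helper: debug logs pass only in development,
// every other level passes in all environments.
function shouldEmit(level: LogLevel, nodeEnv: string): boolean {
  const minLevel: LogLevel = nodeEnv === "development" ? "debug" : "info";
  return SEVERITY_NUMBER[level] >= SEVERITY_NUMBER[minLevel];
}
```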

Custom logs

import { logs } from "@opentelemetry/api-logs";

const logger = logs.getLogger("kira-be");

logger.emit({
  severityText: "INFO",
  body: "User generated image",
  attributes: {
    userId: "xxx",
    toolName: "generateImageWithAI",
    durationMs: 1234,
  },
});

AI tool logs

Every AI tool execution is logged:
// src/telemetry/logger.ts
export function logAIToolExecuted(
  toolName: string,
  resourceId: string,
  threadId: string,
  durationMs: number
) {
  logger.emit({
    severityText: "INFO",
    body: `AI tool executed: ${toolName}`,
    attributes: {
      "ai.tool.name": toolName,
      "user.id": resourceId,
      "thread.id": threadId,
      "duration.ms": durationMs,
    },
  });
}

Traces

Automatic tracing

Spans are generated automatically for the following:
  • Inbound HTTP requests
  • Outbound HTTP requests (fetch/undici)
  • Redis operations

External service tracing

// src/telemetry/external.ts
export enum ExternalService {
  FAL = "fal",
  REPLICATE = "replicate",
  OPENAI = "openai",
  ANTHROPIC = "anthropic",
  GOOGLE = "google",
  XAI = "xai",
  BYTEPLUS = "byteplus",
  ILLUSTRIOUS = "illustrious",
}

// Usage example
const response = await tracedFetch(url, options, {
  service: ExternalService.BYTEPLUS,
  operation: "seedream_edit",
});

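Span attributes

| Attribute | Description |
| --- | --- |
| service.name | Service name |
| http.method | HTTP method |
| http.url | Request URL |
| http.status_code | Response status code |
| external.service | External service name |
| external.operation | Operation name |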
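
The implementation of `tracedFetch` is not shown in this document. As a rough sketch of how such a wrapper could attach the attributes listed above, the example below injects the fetch function and a span callback so it runs without any tracing library. `tracedFetchSketch`, `FetchLike`, `onSpanEnd`, and the `service.operation` span name are all hypothetical.

```typescript
// Hypothetical sketch; the real tracedFetch in src/telemetry/external.ts
// is not shown and may differ.
type FetchLike = (
  url: string,
  init?: { method?: string }
) => Promise<{ status: number }>;

interface ExternalCallMeta {
  service: string; // e.g. "byteplus"
  operation: string; // e.g. "seedream_edit"
}

type SpanAttributes = Record<string, string | number>;

async function tracedFetchSketch(
  fetchImpl: FetchLike,
  onSpanEnd: (name: string, attributes: SpanAttributes) => void,
  url: string,
  init: { method?: string } | undefined,
  meta: ExternalCallMeta
): Promise<{ status: number }> {
  const attributes: SpanAttributes = {
    "http.method": init?.method ?? "GET",
    "http.url": url,
    "external.service": meta.service,
    "external.operation": meta.operation,
  };
  try {
    const response = await fetchImpl(url, init);
    attributes["http.status_code"] = response.status;
    return response;
  } finally {
    // Report the span whether the request succeeded or threw.
    onSpanEnd(`${meta.service}.${meta.operation}`, attributes);
  }
}
```

Injecting `fetchImpl` keeps the sketch testable; a real wrapper would more likely start a span through the OpenTelemetry API and call the global `fetch`.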

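Metrics

kira-be metrics

| Metric | Type | Description |
| --- | --- | --- |
| ai_tool_executions | Counter | Number of AI tool executions |
| ai_tool_duration_ms | Histogram | AI tool execution time |
| http_request_duration_ms | Histogram | HTTP request duration |
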
// src/telemetry/metrics.ts
export function recordAITool(
  toolName: string,
  success: boolean,
  durationMs: number
) {
  aiToolCounter.add(1, {
    tool_name: toolName,
    success: success.toString(),
  });
  aiToolDuration.record(durationMs, {
    tool_name: toolName,
  });
}
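
One way to call `recordAITool` consistently is to wrap each tool invocation so that the duration and the success flag are recorded on both the success and failure paths. The wrapper below is a sketch under that assumption; `withAIToolMetrics` is a hypothetical name, and the recorder is passed in so the example stays self-contained.

```typescript
// Matches the signature of recordAITool from src/telemetry/metrics.ts.
type RecordAITool = (
  toolName: string,
  success: boolean,
  durationMs: number
) => void;

// Hypothetical wrapper: times a tool call and records its outcome.
async function withAIToolMetrics<T>(
  toolName: string,
  record: RecordAITool,
  run: () => Promise<T>
): Promise<T> {
  const startedAt = Date.now();
  let success = false;
  try {
    const result = await run();
    success = true;
    return result;
  } finally {
    // Runs on both paths, so failed executions are counted too.
    record(toolName, success, Date.now() - startedAt);
  }
}
```

Errors thrown by `run` still propagate to the caller; the wrapper only observes them.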

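kira-imgproxy metrics

| Metric | Type | Description | Alert threshold |
| --- | --- | --- | --- |
| workers_utilization | Gauge | Worker utilization (0-1) | > 0.5 Warning, > 0.8 Critical |
| vips_memory_bytes | Gauge | libvips memory | - |
| images_in_progress | Gauge | Images being processed | > 100 Warning, > 150 Critical |
| goroutines | Gauge | Number of goroutines | > 50 Warning, > 100 Critical |
| heap_mb | Gauge | Go heap memory | > 150MB Warning, > 300MB Critical |
| process_mb | Gauge | Total process memory | > 6GB Warning, > 7GB Critical |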
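
The thresholds above can be read as a simple classification rule. The sketch below encodes them in a lookup table; `classify` and `IMGPROXY_THRESHOLDS` are hypothetical names, and the `process_mb` limits assume 6 GB and 7 GB are expressed as 6000 and 7000 MB, matching the alert conditions later in this document.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Thresholds {
  warning: number;
  critical: number;
}

// Values copied from the kira-imgproxy metrics table.
// process_mb is in MB (assumption: 6 GB = 6000 MB, 7 GB = 7000 MB).
const IMGPROXY_THRESHOLDS: Record<string, Thresholds> = {
  workers_utilization: { warning: 0.5, critical: 0.8 },
  images_in_progress: { warning: 100, critical: 150 },
  goroutines: { warning: 50, critical: 100 },
  heap_mb: { warning: 150, critical: 300 },
  process_mb: { warning: 6000, critical: 7000 },
};

function classify(metric: string, value: number): Severity {
  const limits = IMGPROXY_THRESHOLDS[metric];
  if (!limits) return "ok"; // no threshold defined, e.g. vips_memory_bytes
  if (value > limits.critical) return "critical";
  if (value > limits.warning) return "warning";
  return "ok";
}
```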

Dashboard queries

Query syntax

BetterStack dashboards are queried with ClickHouse SQL:
SELECT
  toStartOfInterval(dt, INTERVAL 1 minute) as time,
  avgMerge(metric_name) as value
FROM {{source}}
WHERE dt BETWEEN {{start_time}} AND {{end_time}}
GROUP BY time
ORDER BY time
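
Since every dashboard query in this document follows the same template, it can be generated from a list of columns. The helper below is a sketch; `buildDashboardQuery` and `MetricColumn` are hypothetical names, and the `{{source}}`, `{{start_time}}`, and `{{end_time}}` placeholders are emitted literally on the assumption that the dashboard substitutes them.

```typescript
type MergeAggregate = "avgMerge" | "maxMerge";

interface MetricColumn {
  aggregate: MergeAggregate;
  metric: string; // column name, e.g. "workers_utilization"
  alias: string; // output column name, e.g. "value"
}

const IDENTIFIER = /^[A-Za-z_][A-Za-z0-9_]*$/;

// Builds a query in the same shape as the template above.
function buildDashboardQuery(
  columns: MetricColumn[],
  intervalMinutes = 1
): string {
  const selects = columns.map((c) => {
    // Reject anything that is not a plain identifier before interpolating.
    if (!IDENTIFIER.test(c.metric) || !IDENTIFIER.test(c.alias)) {
      throw new Error(`Invalid identifier: ${c.metric} / ${c.alias}`);
    }
    return `  ${c.aggregate}(${c.metric}) as ${c.alias}`;
  });
  return [
    "SELECT",
    `  toStartOfInterval(dt, INTERVAL ${intervalMinutes} minute) as time,`,
    selects.join(",\n"),
    "FROM {{source}}",
    "WHERE dt BETWEEN {{start_time}} AND {{end_time}}",
    "GROUP BY time",
    "ORDER BY time",
  ].join("\n");
}
```

Calling it with `[{ aggregate: "avgMerge", metric: "metric_name", alias: "value" }]` reproduces the template query line for line.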

imgproxy dashboard

Worker utilization

SELECT
  toStartOfInterval(dt, INTERVAL 1 minute) as time,
  avgMerge(workers_utilization) * 100 as utilization
FROM {{source}}
WHERE dt BETWEEN {{start_time}} AND {{end_time}}
GROUP BY time
ORDER BY time

Memory usage

SELECT
  toStartOfInterval(dt, INTERVAL 1 minute) as time,
  maxMerge(process_mb) as process_mb,
  maxMerge(vips_mb) as vips_mb
FROM {{source}}
WHERE dt BETWEEN {{start_time}} AND {{end_time}}
GROUP BY time
ORDER BY time

Images in progress

SELECT
  toStartOfInterval(dt, INTERVAL 1 minute) as time,
  maxMerge(images_in_progress) as images,
  maxMerge(requests_in_progress) as requests
FROM {{source}}
WHERE dt BETWEEN {{start_time}} AND {{end_time}}
GROUP BY time
ORDER BY time

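Alert configuration

imgproxy alerts

| Alert | Condition | Duration | Severity |
| --- | --- | --- | --- |
| High worker utilization | workers_utilization > 0.5 | 5 minutes | Warning |
| Worker utilization saturated | workers_utilization > 0.8 | 2 minutes | Critical |
| High memory usage | process_mb > 6000 | 5 minutes | Warning |
| Critical memory usage | process_mb > 7000 | 2 minutes | Critical |
| High concurrency | images_in_progress > 100 | 5 minutes | Warning |
| Goroutine leak | goroutines > 50 | 5 minutes | Warning |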
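
Each alert combines a threshold with a duration, so a single spike does not fire it. How BetterStack evaluates this internally is not described here; the sketch below shows one plausible reading, assuming one sample per minute. `isAlertFiring` and `Sample` are hypothetical names.

```typescript
interface Sample {
  minute: number; // minutes since an arbitrary reference point
  value: number;
}

// Hypothetical evaluation: the alert fires when every sample in the
// trailing `durationMinutes` window exceeds the threshold and the window
// is fully covered (assumption: one sample per minute).
function isAlertFiring(
  samples: Sample[],
  threshold: number,
  durationMinutes: number
): boolean {
  if (samples.length === 0) return false;
  const latest = Math.max(...samples.map((s) => s.minute));
  const recent = samples.filter((s) => s.minute > latest - durationMinutes);
  return (
    recent.length >= durationMinutes &&
    recent.every((s) => s.value > threshold)
  );
}
```

Under this reading, the "High worker utilization" alert corresponds to `isAlertFiring(samples, 0.5, 5)`.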

Setup steps

  1. Go to BetterStack Dashboard → Alerting
  2. Click Create Alert
  3. Select the Source
  4. Configure the query condition
  5. Set the threshold and duration
  6. Configure the notification channels

Sentry integration

User context

The backend sets the user context after JWT authentication:
// src/hono/middleware/auth.ts
const payload = c.get("jwtPayload");
if (payload?.sub) {
  Sentry.setUser({
    id: payload.sub,
    email: payload.email,
  });
}

The frontend sets it after login and clears it on logout:
// login-provider.tsx
useEffect(() => {
  if (isAuth && profile) {
    Sentry.setUser({
      id: profile.id,
      email: profile.email,
      username: profile.nickname,
    });
  } else {
    Sentry.setUser(null);
  }
}, [isAuth, profile]);

Error tracking

try {
  // ...
} catch (error) {
  Sentry.captureException(error, {
    tags: {
      toolName: "generateImageWithAI",
    },
    extra: {
      input: sanitizedInput,
    },
  });
  throw error;
}
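
The example above passes `sanitizedInput` to Sentry, but how it is produced is not shown. A minimal sketch of one approach is to redact values whose keys look sensitive before attaching the input to the event. `sanitizeInput` and the key list are assumptions, not the project's actual code.

```typescript
// Hypothetical list of sensitive key fragments; adjust to the real payloads.
const SENSITIVE_KEYS = ["password", "token", "apikey", "authorization", "secret"];

// Recursively copies the input, replacing values of sensitive keys.
function sanitizeInput(input: unknown): unknown {
  if (Array.isArray(input)) return input.map(sanitizeInput);
  if (input !== null && typeof input === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(input as Record<string, unknown>)) {
      const isSensitive = SENSITIVE_KEYS.some((k) =>
        key.toLowerCase().includes(k)
      );
      out[key] = isSensitive ? "[REDACTED]" : sanitizeInput(value);
    }
    return out;
  }
  return input; // primitives are returned unchanged
}
```

The function returns a copy, so the original tool input is left untouched for the rest of the request.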

Environment variables

# BetterStack
BETTERSTACK_HOST=in-otel.logs.betterstack.com
BETTERSTACK_TOKEN=xxx

# Sentry
SENTRY_DSN=https://<public-key>@<sentry-host>/xxx

# PostHog (Analytics)
POSTHOG_API_KEY=xxx
POSTHOG_PROJECT_ID=xxx

OpenTelemetry and Sentry are enabled only in production; in development they are skipped automatically.
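
The production-only behavior can be sketched as a guard evaluated before the SDKs are initialized. The example assumes the environment is identified by `NODE_ENV`, which this document does not state; `isTelemetryEnabled` and `missingTelemetryVars` are hypothetical helpers.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: production is signalled by NODE_ENV === "production".
function isTelemetryEnabled(env: Env): boolean {
  return env.NODE_ENV === "production";
}

// Lists the variables from the block above that are unset or empty,
// so a misconfigured production deploy can fail fast.
function missingTelemetryVars(env: Env): string[] {
  const required = ["BETTERSTACK_HOST", "BETTERSTACK_TOKEN", "SENTRY_DSN"];
  return required.filter((name) => !env[name]);
}
```

In a Node.js service both helpers would typically be called with `process.env` at startup, before constructing `NodeSDK` or calling `Sentry.init`.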