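Key Summary

OpenTelemetry 1.0 is the first stable release to unify the three signals (traces, metrics, and logs) under a single SDK and a single Collector. Datadog, New Relic, and Grafana all accept OTLP as a first-class input, so it is now possible to build observability infrastructure without vendor lock-in.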

1. Architecture at a glance

+--------+     OTLP/gRPC      +------------+    +-------+
| App    |  -- traces ------> | OTel       | -> | Tempo |
|        |  -- metrics -----> | Collector  | -> | Mimir |
|        |  -- logs --------> |            | -> | Loki  |
+--------+                    +------------+    +-------+

The SDK's auto-instrumentation lets you adopt it with almost no code changes. The Collector handles routing, sampling, and transformation.

2. Node.js: 5-minute setup

npm i @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http
npm i @opentelemetry/exporter-metrics-otlp-http @opentelemetry/exporter-logs-otlp-http

// instrumentation.ts (the same for Next.js and Express)
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'

new NodeSDK({
  serviceName: 'my-api',
  traceExporter: new OTLPTraceExporter({ url: 'http://otelcol:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()],
}).start()

3. Python: FastAPI

pip install opentelemetry-distro opentelemetry-exporter-otlp opentelemetry-instrumentation-fastapi
opentelemetry-bootstrap -a install

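# Start with environment variables only
# (port 4318 is the Collector's OTLP/HTTP receiver, so select the HTTP protocol explicitly)
OTEL_SERVICE_NAME=my-api \
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otelcol:4318 \
opentelemetry-instrument uvicorn main:app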

4. Collector configuration: the essentials

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  batch:
    timeout: 10s
    send_batch_size: 8192
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlp/tempo:    { endpoint: tempo:4317, tls: { insecure: true } }
  prometheus:    { endpoint: 0.0.0.0:8889 }
  loki:          { endpoint: http://loki:3100/loki/api/v1/push }

service:
  pipelines:
    traces:  { receivers: [otlp], processors: [tail_sampling, batch], exporters: [otlp/tempo] }
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheus] }
    logs:    { receivers: [otlp], processors: [batch], exporters: [loki] }

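5. The key to cutting costs: tail sampling

Head sampling decides when a request starts. Tail sampling decides after the trace has completed, so it can keep exactly the errors and slow requests.

Strategy	Storage cost	Risk of missing errors
1% head	Very low	High
100% + tail (5% of normal traffic + 100% of errors)	Low	0%
Keep 100% of everything	High	0%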
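
To make the tail-sampling decision concrete, here is a minimal sketch of the three policies from the Collector config in section 4 (errors, slow, probabilistic) written as a plain function. `TraceSummary`, `traceBucket`, and `decideTailSampling` are hypothetical names for illustration. The Collector's real processor buffers spans for `decision_wait` and uses its own hashing and threshold comparison, none of which is reproduced here.

```typescript
// Hypothetical summary of a finished trace; field names are illustrative.
interface TraceSummary {
  traceId: string     // 32 hex characters
  hasError: boolean   // true if any span ended with ERROR status
  durationMs: number  // end-to-end latency of the trace
}

// Map a trace ID to a stable bucket in [0, 100) so the same trace
// always receives the same probabilistic decision.
function traceBucket(traceId: string): number {
  // Treat the last 8 hex characters (32 bits) as a pseudo-random number.
  const low32 = parseInt(traceId.slice(-8), 16)
  return (low32 / 0x100000000) * 100
}

// Keep a trace if any policy matches; otherwise drop it.
// The return value names the policy that matched.
function decideTailSampling(
  t: TraceSummary,
  thresholdMs: number = 1000,
  samplingPercentage: number = 5,
): 'errors' | 'slow' | 'probabilistic' | 'drop' {
  if (t.hasError) return 'errors'
  if (t.durationMs > thresholdMs) return 'slow'
  if (traceBucket(t.traceId) < samplingPercentage) return 'probabilistic'
  return 'drop'
}
```

Because the decision runs only after the whole trace is known, an error anywhere in the trace is enough to keep it, which is what head sampling cannot guarantee.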

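6. Context propagation: gRPC and message queues

// Inject/extract trace context via Kafka message headers
import { propagation, context } from '@opentelemetry/api'

// Producer side: write the active trace context into the message headers
const headers: Record<string, string> = {}
propagation.inject(context.active(), headers)
await producer.send({ topic: 't', messages: [{ value: 'x', headers }] })

// Consumer side: header values may arrive as Buffers (as with kafkajs), so decode them to strings first
const carrier: Record<string, string> = {}
for (const [key, value] of Object.entries(msg.headers ?? {})) {
  if (value != null) carrier[key] = value.toString()
}
const ctx = propagation.extract(context.active(), carrier)
context.with(ctx, () => {
  // Spans started here are linked to the producer's trace
})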
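
For reference, what travels in those headers under the default W3C Trace Context propagator is a `traceparent` value of the form `version-traceid-spanid-flags`. The sketch below formats and parses that value by hand, purely to show the wire format. `SpanContextLite`, `formatTraceparent`, and `parseTraceparent` are hypothetical helpers, not part of `@opentelemetry/api`, and only version `00` is handled.

```typescript
// Minimal stand-in for a span context; not the @opentelemetry/api type.
interface SpanContextLite {
  traceId: string   // 32 lowercase hex characters
  spanId: string    // 16 lowercase hex characters
  sampled: boolean  // lowest bit of the trace-flags byte
}

// Build a version-00 traceparent header value.
function formatTraceparent(c: SpanContextLite): string {
  const flags = c.sampled ? '01' : '00'
  return `00-${c.traceId}-${c.spanId}-${flags}`
}

// Parse a version-00 traceparent value; returns null if it is malformed.
function parseTraceparent(header: string): SpanContextLite | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header)
  if (!m) return null
  const traceId = m[1]
  const spanId = m[2]
  const flags = m[3]
  // All-zero trace or span IDs are invalid in W3C Trace Context.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 }
}
```

In real code, leave this to `propagation.inject` and `propagation.extract`; the point is only that the carrier holds short string values, which is why Buffer-valued headers need decoding before extraction.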

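7. Cold start impact

On Lambda and Cloud Run, SDK initialization adds 200 to 400 ms to a cold start. Mitigations:

Move the work out of process with a Lambda extension (@opentelemetry/lambda)
Do not replace BatchSpanProcessor with SimpleSpanProcessor (it exports synchronously)
Use 1% head sampling on the cold path only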
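
To illustrate why batching matters, the toy sketch below counts how many export calls a batching queue makes compared with exporting every span individually. `BatchQueue` and `countExports` are hypothetical and are not SDK classes. The real BatchSpanProcessor also flushes on a timer and exports asynchronously, which this sketch leaves out.

```typescript
// Toy batching queue: buffers items and exports them in groups.
class BatchQueue<T> {
  private buffer: T[] = []
  private maxBatchSize: number
  private exportBatch: (batch: T[]) => void

  constructor(maxBatchSize: number, exportBatch: (batch: T[]) => void) {
    this.maxBatchSize = maxBatchSize
    this.exportBatch = exportBatch
  }

  // Buffer one item; export as soon as the batch is full.
  add(item: T): void {
    this.buffer.push(item)
    if (this.buffer.length >= this.maxBatchSize) this.flush()
  }

  // Export whatever is buffered (for example, before the process freezes).
  flush(): void {
    if (this.buffer.length === 0) return
    const batch = this.buffer
    this.buffer = []
    this.exportBatch(batch)
  }
}

// Count export calls needed to ship `spanCount` spans.
function countExports(spanCount: number, maxBatchSize: number): number {
  let calls = 0
  const queue = new BatchQueue<number>(maxBatchSize, () => { calls++ })
  for (let i = 0; i < spanCount; i++) queue.add(i)
  queue.flush() // ship the final partial batch
  return calls
}
```

With a batch size of 1 (roughly the per-span behavior), 10 spans cost 10 export calls; with a batch size of 4 they cost 3. On a cold path each avoided network call is latency saved.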

8. Cost: real numbers

Scale	Monthly trace volume	Storage cost
100k req/day, 100% retained	~3M traces	$80~120
100k req/day, tail 5%	~150k traces	$5~10
1M req/day, tail 5% + 100% of errors	~2M traces	$60~90
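
The trace volumes in the table follow from simple arithmetic, sketched below. `monthlyTraces` is a hypothetical helper that assumes one trace per request and a 30-day month. The 1.75% error rate used for the last row is an assumption chosen to land near the table's ~2M figure; the source does not state it.

```typescript
// Estimate traces stored per month.
// baselineRate: fraction of non-error traces kept (0.05 = 5%).
// errorRate: fraction of requests that fail; all of these are kept.
function monthlyTraces(
  dailyRequests: number,
  baselineRate: number,
  errorRate: number = 0,
  days: number = 30,
): number {
  const keptFraction = errorRate + (1 - errorRate) * baselineRate
  return Math.round(dailyRequests * days * keptFraction)
}

const fullRetention = monthlyTraces(100_000, 1)           // 100k req/day, keep everything
const tailFivePercent = monthlyTraces(100_000, 0.05)      // 100k req/day, tail 5%
const tailPlusErrors = monthlyTraces(1_000_000, 0.05, 0.0175) // assumed 1.75% error rate
```

The first two calls give 3,000,000 and 150,000 traces, matching the first two rows; the third comes out at roughly 2 million under the assumed error rate.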

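9. Common mistakes

Turning on auto-instrumentation but adding no manual spans to business logic → too little meaningful information
Sampling at 100% in every environment → costs explode
Putting PII in resource attributes → GDPR violation
Exporting directly to a SaaS backend without going through the Collector → lock-in and no way to reprocess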
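
As one way to guard against the PII mistake, attributes can be filtered against a deny-list before they are attached to a resource or span. The sketch below is a minimal illustration. `scrubAttributes` and the keys in `SENSITIVE_KEYS_EXAMPLE` are hypothetical; which attributes count as PII depends on your own data.

```typescript
type AttributeValue = string | number | boolean

// Hypothetical deny-list; these keys are examples only, not an official list.
const SENSITIVE_KEYS_EXAMPLE: string[] = ['user.email', 'user.phone', 'user.full_name']

// Return a copy of the attributes with deny-listed keys removed.
// The input object is left unmodified.
function scrubAttributes(
  attrs: Record<string, AttributeValue>,
  denyList: string[] = SENSITIVE_KEYS_EXAMPLE,
): Record<string, AttributeValue> {
  const clean: Record<string, AttributeValue> = {}
  for (const key of Object.keys(attrs)) {
    if (denyList.indexOf(key) === -1) clean[key] = attrs[key]
  }
  return clean
}
```

An exact-match deny-list is the simplest option; prefix or pattern matching is a natural extension if sensitive keys share a namespace.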

References

opentelemetry.io/docs/concepts/sampling
Grafana LGTM stack guide

OpenTelemetry 1.0: An Operations Guide to Unifying Traces, Metrics, and Logs Under a Single Standard