LLM Cost Optimization Overview: Caching, Routing, Prompt Compression, Local Fallback | Tech Notes

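Key summary

Running everything on frontier LLMs (Opus 4.7, GPT-5.5) can easily reach $5,000 to $50,000 per month. Combining the five techniques below can deliver comparable quality at 1/5 to 1/10 of the cost.

Prompt Caching: reuse context that repeats across calls
Model Routing: pick the model by task difficulty
Prompt Compression: reduce the token count itself
Local Fallback: send simple tasks to a local model
Batch API: 50% discount for asynchronous work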

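1. Prompt Caching (largest effect)

When the same system prompt, codebase, or instructions are repeated across many calls, caching pays off quickly. On Anthropic, a cache hit is billed at 10% of the base input price; a cache write is billed at 1.25x with the default 5-minute TTL and at 2x with the 1-hour TTL.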

{
  "model": "claude-opus-4-7",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "<codebase>...300K tokens...</codebase>",
        "cache_control": {"type": "ephemeral", "ttl": "1h"}
      },
      {
        "type": "text",
        "text": "User question: ..."
      }
    ]
  }]
}

Call pattern	Cost
1st call (cache write)	$5.625 (1.25x, 5-minute TTL rate)
Calls 2 to 10 (cache hit, same context)	$0.45 each (0.1x)
10-call total	$9.68 vs. $45 without caching
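The arithmetic behind the table can be checked with a short sketch. It assumes a 300K-token cached prefix, a base input price of $15 per 1M tokens, a 1.25x write multiplier, and a 0.1x read multiplier. The function name and parameters are illustrative and not part of any SDK, and the sketch ignores output tokens and TTL expiry.

```python
def caching_cost(prefix_tokens: int, price_per_mtok: float, calls: int,
                 write_mult: float = 1.25, read_mult: float = 0.10) -> dict:
    """Compare the input cost of a repeated prefix with and without caching.

    Assumes every call after the first is a cache hit (no TTL expiry).
    """
    base = prefix_tokens / 1_000_000 * price_per_mtok  # one uncached pass
    cached = base * write_mult + base * read_mult * (calls - 1)
    uncached = base * calls
    return {"cached": cached, "uncached": uncached,
            "savings": 1 - cached / uncached}


result = caching_cost(prefix_tokens=300_000, price_per_mtok=15.0, calls=10)
print(f"cached=${result['cached']:.3f} uncached=${result['uncached']:.2f}")
```

With `calls=1` the cached path costs more than the uncached one, so caching only pays off from the first hit onward.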

2. Model Routing

Do not send every request to the most expensive model. Classify by difficulty first, then route.

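from typing import Callable

# Placeholder keyword heuristic; tune it for your own traffic.
REASONING_HINTS = ("why", "prove", "compare", "design", "debug")

def requires_reasoning(q: str) -> bool:
    return any(hint in q.lower() for hint in REASONING_HINTS)

class Classifier:
    def __init__(self, score_complexity: Callable[[str], float]):
        # score_complexity: a small-model call (e.g. Haiku) returning a score in [0, 1]
        self.score_complexity = score_complexity

    def classify(self, q: str) -> str:
        # 1) Cheap length and keyword rules
        if len(q) < 100 and not requires_reasoning(q):
            return "simple"
        # 2) Small-model classification
        score = self.score_complexity(q)
        return "simple" if score < 0.3 else "moderate" if score < 0.7 else "complex"

MODEL_BY_CLASS = {
    "simple": "haiku-3.5",     # $0.80 / 1M
    "moderate": "sonnet-4.6",  # $3 / 1M
    "complex": "opus-4.7",     # $15 / 1M
}

def route(query: str, classifier: Classifier) -> str:
    # Classify (small model or rules); unknown labels fall back to the strongest model
    return MODEL_BY_CLASS.get(classifier.classify(query), "opus-4.7")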

Measured: with a user query distribution of 60% simple, 30% moderate, and 10% complex, average cost fell by 70 to 80%.
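As a rough cross-check of that figure, the sketch below computes a traffic-weighted price from the per-tier prices and the measured distribution. It assumes every tier uses the same number of tokens per query and that the baseline is all traffic on the most expensive tier; classifier overhead is left out, which is one reason real savings land somewhat lower than this upper bound.

```python
# Per-1M-token prices and traffic shares taken from this section.
PRICE = {"simple": 0.80, "moderate": 3.00, "complex": 15.00}
SHARE = {"simple": 0.60, "moderate": 0.30, "complex": 0.10}


def blended_price(price: dict, share: dict) -> float:
    """Traffic-weighted average price per 1M tokens."""
    if abs(sum(share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(price[tier] * share[tier] for tier in share)


blended = blended_price(PRICE, SHARE)
reduction = 1 - blended / PRICE["complex"]  # versus all traffic on the top tier
print(f"blended=${blended:.2f}/1M, reduction={reduction:.1%}")
```

Under these assumptions the blended price is $2.88 per 1M tokens, roughly 81% below the all-Opus baseline.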

3. Prompt Compression

Techniques that reduce the token count itself.

3-1. Code compression

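import re

def compress_code(code: str) -> str:
    """Strip comments, blank lines, and indentation from C-style source."""
    code = re.sub(r'/\*[\s\S]*?\*/', '', code)                # block comments
    code = re.sub(r'(?<!:)//.*', '', code)                     # line comments (leaves "://" in URLs alone)
    code = re.sub(r'\n\s*\n', '\n', code)                      # blank lines
    code = re.sub(r'^[ \t]+', '', code, flags=re.MULTILINE)    # indentation (LLMs can follow brace-delimited code without it; do not use on Python or YAML)
    return code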

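Reduces tokens by 30 to 50% with almost no effect on accuracy.

3-2. LLMLingua: semantic compression

Microsoft's LLMLingua uses a small model to remove low-importance tokens, for an additional 15 to 30% reduction on average.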

4. Local Fallback

Send simple tasks to a local LLM (Llama 4, Qwen3). The per-call API cost is zero.

def smart_query(q):
    if is_routine(q):  # classification, summarization, translation
        return local_llm(q)  # local model, e.g. Llama 4
    return cloud_llm(q)  # Claude/GPT

Good fit: classification, summarization, OCR post-processing, simple Q&A. Poor fit: complex reasoning, code generation.
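A slightly fuller sketch of the same idea adds escalation: if the local model errors out or returns nothing, the request goes to the cloud model instead. The task labels, the `ROUTINE_TASKS` set, and the stub backends are assumptions made for illustration; in practice `local_llm` might wrap a vLLM or Ollama endpoint.

```python
from typing import Callable, Optional, Tuple

ROUTINE_TASKS = {"classify", "summarize", "translate"}  # assumed task labels


def smart_query(task: str, prompt: str,
                local_llm: Callable[[str], Optional[str]],
                cloud_llm: Callable[[str], str]) -> Tuple[str, str]:
    """Return (backend, answer); escalate to the cloud if the local model fails."""
    if task in ROUTINE_TASKS:
        try:
            answer = local_llm(prompt)
        except Exception:
            answer = None  # local server down or timed out
        if answer:  # an empty or missing answer also triggers escalation
            return "local", answer
    return "cloud", cloud_llm(prompt)


# Stub backends so the sketch runs without any model installed.
def fake_local(prompt: str) -> Optional[str]:
    return None if "fail" in prompt else f"[local] {prompt}"


def fake_cloud(prompt: str) -> str:
    return f"[cloud] {prompt}"


print(smart_query("summarize", "meeting notes", fake_local, fake_cloud))
print(smart_query("summarize", "fail case", fake_local, fake_cloud))
print(smart_query("codegen", "write a parser", fake_local, fake_cloud))
```

Returning the backend name alongside the answer makes it easy to log how often escalation happens, which feeds directly into quality monitoring for the local model.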

5. Batch API

For work that tolerates asynchronous processing, use the Anthropic or OpenAI Batch API: 50% discount, with results within 24 hours.

Daily report generation
Background analysis (sentiment, entity)
Bulk embeddings
Overnight data processing
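For the analysis use case, a batch is essentially a list of ordinary requests, each tagged with an ID. The sketch below only builds that payload and makes no network call. The `custom_id` plus `params` layout follows the request shape of Anthropic's Message Batches API as I understand it; the model ID is copied from the caching example, and the document IDs and prompt wording are made up. Check the current API reference before submitting anything.

```python
import json


def build_batch_requests(documents: dict, model: str = "claude-opus-4-7",
                         max_tokens: int = 256) -> list:
    """Build one batch entry per document for sentiment and entity analysis."""
    requests = []
    for doc_id, text in documents.items():
        requests.append({
            "custom_id": doc_id,  # used to match results back to inputs
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{
                    "role": "user",
                    "content": f"Extract sentiment and entities as JSON:\n{text}",
                }],
            },
        })
    return requests


batch = build_batch_requests({
    "review-1": "Great battery life.",
    "review-2": "Screen cracked in a week.",
})
print(json.dumps({"requests": batch}, indent=2)[:300])
```

Because results come back asynchronously and not necessarily in order, a stable `custom_id` per input is what lets the nightly job join results back to source rows.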

6. Combined strategy: a RAG system example

A RAG system serving 1 million queries per month:

Strategy	Monthly cost
All queries on Opus 4.7	$15,000
+ Prompt Caching	$8,500
+ Model Routing (mostly Sonnet)	$3,200
+ Prompt Compression	$2,400
+ Batch API (for analytics)	$1,800
+ Local Fallback (classification)	$1,200

Final result: a 92% cost reduction.
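To see which step contributes most, the table can be turned into per-step and cumulative reductions. This is only arithmetic on the figures above; the function name is illustrative.

```python
# Monthly cost after each cumulative step, copied from the table.
STEPS = [
    ("All queries on Opus 4.7", 15_000),
    ("+ Prompt Caching", 8_500),
    ("+ Model Routing", 3_200),
    ("+ Prompt Compression", 2_400),
    ("+ Batch API", 1_800),
    ("+ Local Fallback", 1_200),
]


def savings_report(steps):
    """Return (name, cost, cut vs. previous step, cut vs. baseline) per row."""
    baseline = steps[0][1]
    rows, prev = [], baseline
    for name, cost in steps:
        rows.append((name, cost, 1 - cost / prev, 1 - cost / baseline))
        prev = cost
    return rows


rows = savings_report(STEPS)
for name, cost, step_cut, total_cut in rows:
    print(f"{name:<26} ${cost:>6,}  step {step_cut:6.1%}  total {total_cut:6.1%}")
```

In this example, routing gives the largest relative cut on what remains (from $8,500 to $3,200), while caching gives the largest absolute saving.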

7. Monitoring

# Read the usage field of the Anthropic response body
response.usage = {
    "input_tokens": 30000,
    "output_tokens": 1500,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 280000
}

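# Cache hit rate = cache_read / total_input, where
# total_input = input_tokens + cache_creation_input_tokens + cache_read_input_tokens
# 0.6 or higher indicates healthy operation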
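A minimal sketch of that metric as a function, using the sample numbers above. It assumes `input_tokens` counts only uncached input, so total input is the sum of the three input fields; the function name and the 0.6 threshold check are illustrative.

```python
def cache_hit_rate(usage: dict) -> float:
    """Share of input tokens served from cache for one response."""
    read = usage.get("cache_read_input_tokens", 0)
    total = (usage.get("input_tokens", 0)
             + usage.get("cache_creation_input_tokens", 0)
             + read)
    return read / total if total else 0.0


usage = {
    "input_tokens": 30000,
    "output_tokens": 1500,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 280000,
}
rate = cache_hit_rate(usage)
print(f"hit rate={rate:.3f}", "ok" if rate >= 0.6 else "investigate")
```

For the sample response this gives 280,000 / 310,000, about 0.90. Averaging the rate over a time window rather than per call gives a steadier signal, since the first call after a TTL expiry always reads zero cached tokens.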

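8. Common pitfalls

Caches expire after a 5-minute (or 1-hour) TTL: after a long idle period you pay the write cost again
The routing classifier itself adds cost and latency: a combination of rules and a small model is recommended
Overly aggressive compression costs LLM accuracy: A/B validation is required
Local fallback needs model-quality monitoring: watch for drift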

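Frequently asked questions

Which single technique has the largest effect?

Prompt Caching. When the same context is reused, the cached portion of the input costs 90% less. It can be applied immediately to most workloads.

Is a 5x to 10x reduction really possible without quality loss?

Proper routing is the key. It is achievable once you stop spending Opus on simple queries. Choose the classification criteria carefully, though.

What about the operational burden of local models?

Run a single GPU server with vLLM or Ollama. For 24/7 use, compare cloud GPU rental ($1500/month) with self-hosting (about $200 in electricity). Above roughly 10,000 queries per day, self-hosting comes out ahead.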