2026-04-27 11:11:54 +01:00


| title | tags | created | updated |
|---|---|---|---|
| AI Cost Tracker — Billing Units per Provider | reference, ai, cost-tracking, providers | 2026-04-27 | 2026-04-27 |

Billing Units per Provider

Reference for how each AI provider bills and how to extract usage data from their API responses.

Gemini (Google AI / Vertex AI)

Billing unit: tokens (input + output separately)

SDK: google-genai Python SDK

How to get usage:

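```python
from google import genai

client = genai.Client()
# client.models is synchronous; the awaitable variant lives under client.aio.
response = await client.aio.models.generate_content(...)

input_tokens  = response.usage_metadata.prompt_token_count
output_tokens = response.usage_metadata.candidates_token_count
total_tokens  = response.usage_metadata.total_token_count
```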

⚠️ usage_metadata is available on all generate_content responses, including multimodal ones (video + text prompts). Before the cost-tracker integration, video-accessibility did not read it; reading it was added as part of Phase 1.
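The snippet above reads each count directly, but google-genai types every usage_metadata field as optional, so a count can come back as None. Below is a minimal sketch of a defensive extractor, assuming the tracker wants plain integers; GeminiUsage and gemini_usage are hypothetical names. It also adds thoughts_token_count, which thinking models such as Gemini 2.5 report separately from candidates_token_count, to the output total. Billing those tokens at the output rate is an assumption to confirm against Google's pricing page.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class GeminiUsage:
    input_tokens: int
    output_tokens: int
    total_tokens: int


def _count(meta: Any, field: str) -> int:
    # usage_metadata fields are Optional[int]; treat a missing value as 0.
    return getattr(meta, field, None) or 0


def gemini_usage(response: Any) -> GeminiUsage:
    """Read token counts from a GenerateContentResponse-like object."""
    meta = getattr(response, "usage_metadata", None)
    if meta is None:
        return GeminiUsage(0, 0, 0)
    prompt = _count(meta, "prompt_token_count")
    # Assumption: thinking tokens are billed at the output rate.
    output = _count(meta, "candidates_token_count") + _count(meta, "thoughts_token_count")
    # Prefer the API's own total; fall back to the sum of the parts.
    total = _count(meta, "total_token_count") or prompt + output
    return GeminiUsage(prompt, output, total)
```

The function only reads attributes, so it accepts a real GenerateContentResponse as well as a test double.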

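Token estimation before the call:

  • Text: len(text) / 4 (rough heuristic; actual tokenisation varies ±30%)
  • Video file: use Google's published per-second rates:
    • 258 tokens per frame at default media resolution (frames sampled at 1 fps) plus 32 tokens per second of audio: ≈ 290 tokens/s, so a 1-minute clip ≈ 17,400 tokens
    • Exact: call client.models.count_tokens() with the uploaded file after upload
  • Image (Gemini 2.x): 258 tokens if both sides are ≤ 384 px; larger images are tiled into 768×768 crops at 258 tokens each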
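The text heuristic can be turned into a pre-call budget check that carries its error band with it. Below is a minimal sketch, assuming the ~4 characters per token and ±30% figures above; estimate_text_tokens is a hypothetical helper. It uses integer arithmetic so the bounds are not affected by float rounding.

```python
CHARS_PER_TOKEN = 4   # rough heuristic for Gemini text
ERROR_PCT = 30        # spread versus real tokenisation, in percent


def _ceil_div(a: int, b: int) -> int:
    # Integer ceiling division; avoids float rounding.
    return -(-a // b)


def estimate_text_tokens(text: str) -> dict[str, int]:
    """Pre-call token estimate with a low/high band around the point value."""
    point = _ceil_div(len(text), CHARS_PER_TOKEN)
    return {
        "low": point * (100 - ERROR_PCT) // 100,
        "estimate": point,
        "high": _ceil_div(point * (100 + ERROR_PCT), 100),
    }
```

For a 400-character prompt, this gives a point estimate of 100 tokens with a band of 70 to 130. Budgeting against "high" keeps pre-call checks on the safe side.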

Pricing: auto-synced from LiteLLM. See wiki/tech-patterns/cost-tracker-pricing-sources.


Gemini TTS (audio generation via generate_content)

Billing unit: tokens (output audio tokens, different rate from text)

SDK: same google-genai, with response_modalities=["AUDIO"]

How to get usage:

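```python
from google.genai.types import GenerateContentConfig

# client.aio provides the awaitable API; client.models is synchronous.
response = await client.aio.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents=...,
    config=GenerateContentConfig(response_modalities=["AUDIO"]),
)
output_tokens = response.usage_metadata.candidates_token_count
```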

The audio output token rate differs from the text output rate; verify it in LiteLLM for model gemini-2.5-flash-preview-tts.
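Because audio output has its own rate, a Gemini TTS call is best costed with two rates instead of one. Below is a minimal sketch, assuming per-million-token rates are supplied by the caller (for example, after lookup in LiteLLM); tts_call_cost is a hypothetical helper. Decimal keeps the currency arithmetic exact.

```python
from decimal import Decimal

PER_MILLION = Decimal(1_000_000)


def tts_call_cost(
    input_tokens: int,
    output_audio_tokens: int,
    input_rate_per_m: Decimal,
    audio_output_rate_per_m: Decimal,
) -> Decimal:
    """Cost of one Gemini TTS call: text input and audio output at separate rates."""
    input_cost = Decimal(input_tokens) * input_rate_per_m / PER_MILLION
    audio_cost = Decimal(output_audio_tokens) * audio_output_rate_per_m / PER_MILLION
    return input_cost + audio_cost
```

With placeholder rates (not Google's prices) of 0.50 and 10.00 per million tokens, a call with 2,000 input tokens and 30,000 audio tokens costs 0.001 + 0.30 = 0.301.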


ElevenLabs TTS

Billing unit: characters (input text length)

SDK: custom HTTP (aiohttp POST to https://api.elevenlabs.io/v1/text-to-speech/{voice_id})

Response: returns raw audio bytes. No usage metadata in response.

How to measure: compute len(text) at the call site before making the request:

```python
char_count = len(text)
# make the ElevenLabs call
await ct.record(..., chars=char_count, model="eleven_multilingual_v2", ...)
```

Subscription vs pay-as-you-go: ElevenLabs bills against a monthly character quota; once the quota is exceeded, the pay-as-you-go rate applies. The cost-tracker assumes the pay-as-you-go rate for all characters (a conservative upper bound). If you are on a subscription plan, adjust this via an admin override.
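The gap between the tracker's conservative default and a subscription-aware figure comes down to simple arithmetic. Below is a minimal sketch, assuming a flat pay-as-you-go rate per 1,000 characters and a known count of characters already used this billing month. Both function names are hypothetical, and actual ElevenLabs plans may price overage differently.

```python
from decimal import Decimal


def elevenlabs_cost_conservative(chars: int, payg_rate_per_1k: Decimal) -> Decimal:
    """Tracker default: every character is billed at the pay-as-you-go rate."""
    return Decimal(chars) * payg_rate_per_1k / 1000


def elevenlabs_cost_with_quota(
    chars: int,
    used_this_month: int,
    monthly_quota: int,
    payg_rate_per_1k: Decimal,
) -> Decimal:
    """Subscription-aware cost: only characters beyond the quota are charged.

    The subscription fee itself is treated as a fixed cost outside this call.
    """
    remaining_quota = max(monthly_quota - used_this_month, 0)
    overage_chars = max(chars - remaining_quota, 0)
    return Decimal(overage_chars) * payg_rate_per_1k / 1000
```

The difference between the two results is the amount an admin override would correct for.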


Google Cloud TTS

Billing unit: characters (input text length, after SSML stripping)

SDK: google.cloud.texttospeech Python SDK

Response: SynthesizeSpeechResponse with audio_content (bytes). No character count in response.

How to measure:

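```python
# SynthesisInput holds either text or ssml; the unset field reads as "".
char_count = len(synthesis_input.text or synthesis_input.ssml)
# For SSML, Google bills the stripped char count; len(ssml) over-approximates it.
await ct.record(..., chars=char_count, model="standard", ...)
```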

Voice tiers and pricing:

| Voice type | Billing model name | Price per 1M chars |
|---|---|---|
| Standard | google_tts/standard | $4.00 |
| WaveNet | google_tts/wavenet | $16.00 |
| Neural2 | google_tts/neural2 | $16.00 |
| Studio | google_tts/studio | $160.00 |

Defined in pricing/models.yaml in the cost-tracker repo.
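The voice-tier table translates into a per-call cost by scaling the character count by the per-million price. Below is a minimal sketch, assuming the listed prices are current. The dict is a hypothetical in-memory stand-in for the Google Cloud TTS rows of pricing/models.yaml, whose actual schema is not shown here, and any free-tier allowance is ignored.

```python
from decimal import Decimal

# Hypothetical stand-in for the Google Cloud TTS rows in pricing/models.yaml,
# in USD per 1M characters (values from the voice-tier table).
GOOGLE_TTS_PRICE_PER_M_CHARS = {
    "google_tts/standard": Decimal("4.00"),
    "google_tts/wavenet": Decimal("16.00"),
    "google_tts/neural2": Decimal("16.00"),
    "google_tts/studio": Decimal("160.00"),
}


def google_tts_cost(model: str, chars: int) -> Decimal:
    """USD cost of synthesising `chars` characters with a given voice tier."""
    try:
        rate = GOOGLE_TTS_PRICE_PER_M_CHARS[model]
    except KeyError:
        raise ValueError(f"no Google TTS price for {model!r}") from None
    return Decimal(chars) * rate / 1_000_000
```

For example, 250,000 characters on a WaveNet voice come to 250,000 / 1,000,000 × $16.00 = $4.00.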


OpenAI (future)

Billing unit: tokens (input + output)

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(...)
input_tokens  = response.usage.prompt_tokens
output_tokens = response.usage.completion_tokens
```

Pricing: auto-synced from LiteLLM.
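When streaming, the Chat Completions API only reports usage if the request sets stream_options={"include_usage": True}. The counts then arrive on a final chunk whose choices list is empty, while earlier chunks have usage set to None. Below is a minimal sketch of collecting usage from either a plain response or an iterable of stream chunks, assuming attribute access as in the openai Python SDK; openai_usage is a hypothetical helper.

```python
from collections.abc import Iterable
from typing import Any, Optional, Tuple


def openai_usage(response_or_chunks: Any) -> Optional[Tuple[int, int]]:
    """Return (input_tokens, output_tokens), or None if no usage was reported.

    Accepts a non-streaming ChatCompletion or an iterable of stream chunks.
    """
    usage = getattr(response_or_chunks, "usage", None)
    if usage is None and isinstance(response_or_chunks, Iterable):
        # Streaming: scan the chunks (this consumes a live stream).
        for chunk in response_or_chunks:
            chunk_usage = getattr(chunk, "usage", None)
            if chunk_usage is not None:
                usage = chunk_usage
    if usage is None:
        return None
    return usage.prompt_tokens, usage.completion_tokens
```

In practice, the scan would run inside the loop that already consumes the stream, so that the audio or text output and the usage are read in one pass.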


Anthropic Claude (future)

Billing unit: tokens (input + output)

```python
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(...)
input_tokens  = response.usage.input_tokens
output_tokens = response.usage.output_tokens
```

Pricing: auto-synced from LiteLLM.
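With prompt caching, Anthropic reports cache writes and cache reads in separate usage fields (cache_creation_input_tokens and cache_read_input_tokens), and input_tokens covers only the uncached remainder. Recording input_tokens alone would therefore under-count the prompt. Below is a minimal sketch, assuming attribute access as in the anthropic Python SDK; AnthropicUsage and anthropic_usage are hypothetical names, and each bucket would need its own rate when costed.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class AnthropicUsage:
    input_tokens: int          # uncached input, billed at the base input rate
    cache_write_tokens: int    # from cache_creation_input_tokens
    cache_read_tokens: int     # from cache_read_input_tokens
    output_tokens: int

    @property
    def total_input_tokens(self) -> int:
        return self.input_tokens + self.cache_write_tokens + self.cache_read_tokens


def anthropic_usage(response: Any) -> AnthropicUsage:
    """Split a Messages API response's usage into separately priced buckets."""
    u = response.usage
    # The cache fields may be absent or None when caching is not in use.
    return AnthropicUsage(
        input_tokens=u.input_tokens,
        cache_write_tokens=getattr(u, "cache_creation_input_tokens", None) or 0,
        cache_read_tokens=getattr(u, "cache_read_input_tokens", None) or 0,
        output_tokens=u.output_tokens,
    )
```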


Whisper (self-hosted)

Not billed per token. Runs on Cloud Run / GPU compute.

Cost is infrastructure cost (compute time). Phase 1 does not track it. Phase 2 (planned): record audio_duration_seconds per request and approximate the cost from Cloud Run billing data.
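For Phase 2, one way to turn audio_duration_seconds into a cost is to estimate compute time from a measured real-time factor and multiply it by the instance's per-second price. Below is a minimal sketch. The real-time factor, per-second price, and fixed overhead are hypothetical inputs that would come from benchmarking and Cloud Run billing data, not documented values.

```python
from decimal import Decimal


def whisper_compute_cost(
    audio_duration_seconds: Decimal,
    real_time_factor: Decimal,
    price_per_second: Decimal,
    overhead_seconds: Decimal = Decimal("0"),
) -> Decimal:
    """Approximate cost of one self-hosted Whisper transcription.

    real_time_factor: processing seconds per second of audio (0.1 means a
    60 s clip takes about 6 s of compute). overhead_seconds covers fixed
    per-request time such as model warm-up.
    """
    if real_time_factor <= 0 or price_per_second < 0:
        raise ValueError("real_time_factor must be > 0 and price_per_second >= 0")
    compute_seconds = audio_duration_seconds * real_time_factor + overhead_seconds
    return compute_seconds * price_per_second
```

Comparing the summed estimates against the Cloud Run bill for the same period shows whether the assumed real-time factor holds.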