Voice Messages

Kimaki transcribes Discord voice messages with an AI model that is given your project context, so file names and technical terms are more likely to be transcribed correctly.

Overview

When you record a voice message in a Kimaki-linked channel:
  1. Upload completes - Discord processes the audio
  2. Kimaki detects the voice attachment
  3. Transcription starts using Gemini or OpenAI
  4. Project context is included for accuracy
  5. Text appears in the thread
  6. Session starts (unless it’s a /queue message)
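
As a rough illustration of step 2, the sketch below checks whether a message carries a voice attachment. It assumes detection relies on Discord's IS_VOICE_MESSAGE message flag (bit 13) plus an audio content type; `MessageLike`, `AttachmentLike`, and `findVoiceAttachment` are hypothetical names, and Kimaki's actual check may differ.

```typescript
// Hypothetical minimal shapes; a real bot would use discord.js Message and Attachment.
interface AttachmentLike {
  contentType: string | null
  url: string
}

interface MessageLike {
  flags: number
  attachments: AttachmentLike[]
}

// Discord sets the IS_VOICE_MESSAGE flag (1 << 13) on voice messages.
const IS_VOICE_MESSAGE = 1 << 13

// Returns the first audio attachment of a voice message, or undefined.
function findVoiceAttachment(message: MessageLike): AttachmentLike | undefined {
  const flagged = (message.flags & IS_VOICE_MESSAGE) !== 0
  if (!flagged) {
    return undefined
  }
  return message.attachments.find((attachment) => {
    return attachment.contentType?.startsWith('audio/') ?? false
  })
}
```

An ordinary audio file upload does not carry the flag, so under this assumption it would not be treated as a voice message.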

Setup

Voice transcription requires an API key from either OpenAI or Google.
1. Choose a provider

  • OpenAI (recommended): Uses Whisper for high-quality transcription
  • Gemini: Google’s transcription with multimodal understanding
2. Set your API key

Run /login in a Kimaki channel, or set environment variables:
# OpenAI (recommended)
export OPENAI_API_KEY="sk-..."

# Gemini
export GEMINI_API_KEY="..."
3. Record and send

Hold the microphone icon in Discord to record, then release to send. Kimaki transcribes the message automatically.
OpenAI is the preferred provider. Set it first if both keys are available.
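
One way to read "preferred" is that OpenAI wins when both keys are present. The sketch below encodes that reading. `resolveTranscriptionProvider` is a hypothetical helper, not Kimaki's code, and it only looks at environment variables (keys stored through /login are not modeled).

```typescript
type TranscriptionProvider = 'openai' | 'gemini'

interface ResolvedProvider {
  provider: TranscriptionProvider
  apiKey: string
}

// Assumption: OpenAI is tried first, Gemini is the fallback.
function resolveTranscriptionProvider(
  env: Record<string, string | undefined>,
): ResolvedProvider | undefined {
  const openaiKey = env.OPENAI_API_KEY?.trim()
  if (openaiKey) {
    return { provider: 'openai', apiKey: openaiKey }
  }
  const geminiKey = env.GEMINI_API_KEY?.trim()
  if (geminiKey) {
    return { provider: 'gemini', apiKey: geminiKey }
  }
  // No key configured: the caller would prompt the user to run /login.
  return undefined
}
```

In a Node process this could be called as `resolveTranscriptionProvider(process.env)`.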

Context-Aware Transcription

Kimaki enhances transcription accuracy by including your project’s file tree:
# File tree is generated with:
git ls-files | tree --fromfile -a
This helps the AI correctly transcribe:
  • File names (“auth utils dot TS” → auth-utils.ts)
  • Function names from your codebase
  • Project-specific terminology
  • Path references

Example

Without context:
“Open the off file and fix the error in line 42”
With project context:
“Open the auth.ts file and fix the error in line 42”

Session Context

For even better transcription, Kimaki also sends:
  • Current session context (last message in the thread)
  • Last session context (previous session’s final state)
This helps the AI understand what you’re referring to when you say “that file” or “the function we just edited”.
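
To make the idea concrete, here is a minimal sketch of how a file tree and session context could be folded into one transcription prompt. The function name, the wording of the prompt, and the `maxTreeChars` limit are all assumptions for illustration. The real code passes session context to `transcribeAudio` as separate fields.

```typescript
interface PromptInput {
  fileTree: string
  currentSessionContext?: string
  lastSessionContext?: string
  // Hypothetical cap so a very large repository does not dominate the prompt.
  maxTreeChars?: number
}

function buildTranscriptionPrompt(input: PromptInput): string {
  const limit = input.maxTreeChars ?? 4000
  const tree =
    input.fileTree.length > limit ? input.fileTree.slice(0, limit) + '\n...' : input.fileTree

  const sections = [
    'Transcribe the audio. Prefer spellings that match these project files:',
    tree,
  ]
  if (input.currentSessionContext) {
    sections.push('Last message in the current thread:', input.currentSessionContext)
  }
  if (input.lastSessionContext) {
    sections.push('Final state of the previous session:', input.lastSessionContext)
  }
  return sections.join('\n\n')
}
```

Optional sections are left out entirely when no context exists, which keeps the prompt short for a brand-new thread.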

Implementation Details

From discord/src/voice-handler.ts:564-572:
const transcription = await transcribeAudio({
  audio: audioBuffer,
  prompt: transcriptionPrompt, // includes file tree
  apiKey: transcriptionApiKey,
  provider: transcriptionProvider, // 'openai' or 'gemini'
  mediaType: audioAttachment.contentType,
  currentSessionContext,
  lastSessionContext,
})

Queue Messages

Voice messages work with the /queue command:
/queue [voice message]
Kimaki transcribes the audio and queues it to send when the current response finishes.
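
The queueing behavior can be pictured with a small sketch: text is sent right away when the session is idle, and held until the current response finishes otherwise. `QueuedSender` and its methods are hypothetical names, and Kimaki's real queue may track more state than this.

```typescript
class QueuedSender {
  private readonly pending: string[] = []
  private busy = false
  private readonly send: (text: string) => void

  constructor(send: (text: string) => void) {
    this.send = send
  }

  // Send immediately when idle, otherwise hold the text until the response ends.
  enqueue(text: string): void {
    if (this.busy) {
      this.pending.push(text)
      return
    }
    this.busy = true
    this.send(text)
  }

  // Call when the current response finishes; sends the next queued text, if any.
  responseFinished(): void {
    const next = this.pending.shift()
    if (next === undefined) {
      this.busy = false
      return
    }
    this.send(next)
  }
}
```

For a voice message, the transcribed text would be what gets passed to `enqueue`.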

Permissions

Voice messages follow the same permission rules as text messages. The sender must be or have one of the following:
  • Server Owner
  • Administrator
  • Manage Server
  • “Kimaki” role
Messages from unauthorized users are ignored.
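
The rule above could be checked roughly as follows. The bit positions for Administrator (1 << 3) and Manage Server (1 << 5) come from Discord's permission bitfield. `MemberLike`, `isAuthorized`, and the case-insensitive role match are assumptions made for this sketch.

```typescript
// Hypothetical minimal member shape; discord.js exposes richer objects.
interface MemberLike {
  userId: string
  permissions: bigint
  roleNames: string[]
}

const ADMINISTRATOR = 1n << 3n
const MANAGE_GUILD = 1n << 5n // shown as "Manage Server" in the Discord UI

function isAuthorized(member: MemberLike, guildOwnerId: string): boolean {
  if (member.userId === guildOwnerId) {
    return true
  }
  if ((member.permissions & (ADMINISTRATOR | MANAGE_GUILD)) !== 0n) {
    return true
  }
  // Assumption: the role name is compared case-insensitively.
  return member.roleNames.some((name) => name.toLowerCase() === 'kimaki')
}
```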

Thread Naming

When a voice message starts a new thread, the transcribed text becomes the thread name (truncated to 80 characters). From discord/src/voice-handler.ts:595-617:
if (isNewThread) {
  const threadName = text.replace(/\s+/g, ' ').trim().slice(0, 80)
  if (threadName) {
    await thread.setName(threadName)
  }
}

Audio Processing Pipeline

For real-time voice chat (when the bot joins a voice channel):

1. Audio capture

Discord Opus stream → Opus decoder
2. Resampling

48kHz stereo → 16kHz mono PCM. From voice-handler.ts:63-87:
export function convertToMono16k(buffer: Buffer): Buffer {
  const inputSampleRate = 48000
  const outputSampleRate = 16000
  const ratio = inputSampleRate / outputSampleRate
  // ... downsampling logic
}
3. Streaming to Gemini

100ms audio frames → Gemini Multimodal Live API. Real-time bidirectional audio streaming for voice assistant interaction.
Voice channel features (real-time voice chat with the AI) require a Gemini API key and use the Multimodal Live API.
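
The resampling and framing steps above can be sketched as follows. This is not the elided body of `convertToMono16k`: it is a simplified stand-in that assumes interleaved 16-bit samples, averages the two channels, and averages each group of three frames as a crude low-pass filter. `downsampleToMono16k` and `splitIntoFrames` are hypothetical names.

```typescript
// Convert interleaved 48 kHz stereo 16-bit samples to 16 kHz mono.
function downsampleToMono16k(stereo48k: Int16Array): Int16Array {
  const ratio = 48000 / 16000 // 3 input frames per output sample
  const inputFrames = Math.floor(stereo48k.length / 2)
  const outputFrames = Math.floor(inputFrames / ratio)
  const out = new Int16Array(outputFrames)
  for (let i = 0; i < outputFrames; i++) {
    let sum = 0
    for (let j = 0; j < ratio; j++) {
      const frame = i * ratio + j
      // Left and right samples of one stereo frame sit next to each other.
      sum += stereo48k[frame * 2] + stereo48k[frame * 2 + 1]
    }
    out[i] = Math.round(sum / (ratio * 2))
  }
  return out
}

// Split mono samples into fixed-duration frames; leftover samples are returned
// so the caller can prepend them to the next chunk.
function splitIntoFrames(
  samples: Int16Array,
  sampleRate = 16000,
  frameMs = 100,
): { frames: Int16Array[]; remainder: Int16Array } {
  const samplesPerFrame = Math.floor((sampleRate * frameMs) / 1000)
  const frames: Int16Array[] = []
  let offset = 0
  while (offset + samplesPerFrame <= samples.length) {
    frames.push(samples.subarray(offset, offset + samplesPerFrame))
    offset += samplesPerFrame
  }
  return { frames, remainder: samples.subarray(offset) }
}
```

At 16 kHz, a 100 ms frame holds 1600 samples, or 3200 bytes of 16-bit PCM. A production resampler would normally use a proper anti-aliasing filter instead of a box average.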

Error Handling

If transcription fails:
  • No API key: Button appears to set one via /login
  • Network error: Error message with details
  • Empty transcription: “No speech detected” message
  • Invalid format: Unsupported audio format error
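
The four cases above map naturally onto a discriminated union. In the sketch below only the "No speech detected" text comes from this page. The type names, the other message strings, and the `showLoginButton` flag are illustrative assumptions.

```typescript
type TranscriptionFailure =
  | { kind: 'missing-api-key' }
  | { kind: 'network'; detail: string }
  | { kind: 'empty' }
  | { kind: 'unsupported-format'; mediaType: string }

interface FailureReply {
  text: string
  showLoginButton: boolean
}

function describeFailure(failure: TranscriptionFailure): FailureReply {
  switch (failure.kind) {
    case 'missing-api-key':
      // Hypothetical wording; the button would lead to /login.
      return { text: 'No transcription API key is set.', showLoginButton: true }
    case 'network':
      return { text: `Transcription failed: ${failure.detail}`, showLoginButton: false }
    case 'empty':
      return { text: 'No speech detected', showLoginButton: false }
    case 'unsupported-format':
      return { text: `Unsupported audio format: ${failure.mediaType}`, showLoginButton: false }
  }
}
```

Because the switch covers every `kind`, adding a new failure type without handling it becomes a compile-time error under strict settings.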

Supported Formats

Discord voice messages are sent as:
  • OGG Opus (most common)
  • WebM Opus
Both OpenAI Whisper and Gemini handle these formats natively.

Privacy

Voice messages are:
  • Sent to OpenAI or Google APIs for transcription
  • Not stored by Kimaki beyond the session
  • Processed with the same privacy as text messages
If you’re concerned about audio privacy, use text messages or file attachments instead.