What is code-switched voice data?

Code-switched voice data is audio in which speakers move between two or more languages within a single sentence — for example, Hinglish (Hindi + English), Tanglish (Tamil + English), or Banglish (Bengali + English). It is the way most urban Indians actually speak, and it is missing from almost every public speech dataset.
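To make the idea concrete, a code-switched utterance can be modelled as a sequence of language-tagged spans, with a switch point wherever adjacent spans differ in language. The sketch below is illustrative only: the Span type, the field names, and the two-letter tags are assumptions, not the Capstrix transcript schema.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    lang: str  # assumed ISO 639-1 style tag, e.g. "hi" or "en"


def switch_points(spans):
    """Return the indices of spans whose language differs from the previous span."""
    return [i for i in range(1, len(spans)) if spans[i].lang != spans[i - 1].lang]


# Hinglish example taken from the page. Tagging a borrowed word such as
# "office" as English is an annotation choice made here for illustration.
utterance = [
    Span("main aaj", "hi"),
    Span("office", "en"),
    Span("nahi ja raha,", "hi"),
    Span("working from home", "en"),
]

print(switch_points(utterance))  # [1, 2, 3]
```

How finely to segment, and whether established loanwords count as a switch, are annotation-guideline decisions that a real dataset would need to document.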

What audio format does Capstrix AI deliver?

Every Capstrix recording is captured natively as 16 kHz, 16-bit, mono, uncompressed WAV, the format that speech training pipelines built around OpenAI Whisper, NVIDIA NeMo, and Meta SeamlessM4T typically expect.
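If you want to confirm that a delivered file matches this spec before it enters a training job, a check along the following lines may help. It is a minimal sketch using only Python's standard-library wave module; the function name check_wav and its defaults are assumptions chosen for this example.

```python
import wave


def check_wav(path, rate=16000, sample_width=2, channels=1):
    """Return a list of spec mismatches; an empty list means the file conforms."""
    problems = []
    # The standard-library wave module only reads uncompressed PCM WAV
    # and raises wave.Error for files it cannot handle.
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != rate:
            problems.append(f"sample rate {wf.getframerate()} Hz, expected {rate}")
        if wf.getsampwidth() != sample_width:
            problems.append(
                f"{8 * wf.getsampwidth()}-bit samples, expected {8 * sample_width}-bit"
            )
        if wf.getnchannels() != channels:
            problems.append(f"{wf.getnchannels()} channels, expected {channels}")
    return problems
```

Calling check_wav on a conforming clip returns an empty list, so the result can be used directly as a pass/fail signal in an ingestion script.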

How is contributor consent handled?

Every contributor reviews and accepts a versioned consent text before recording. The consent version is logged with each clip, the contributor can request deletion at any time, and the full audit trail is preserved.
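As a rough illustration of what binding a consent version to each clip could look like, the sketch below keeps one record per clip plus an append-only audit list. Every name in it (ConsentRecord, ConsentLog, the field names, the version string format) is hypothetical and chosen for the example; it does not describe Capstrix's internal system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ConsentRecord:
    clip_id: str
    contributor_id: str
    consent_version: str
    accepted_at: datetime
    deleted_at: Optional[datetime] = None


class ConsentLog:
    def __init__(self):
        self.records = {}  # clip_id -> ConsentRecord
        self.audit = []    # append-only list of (timestamp, event, clip_id)

    def _log(self, event, clip_id):
        self.audit.append((datetime.now(timezone.utc), event, clip_id))

    def accept(self, clip_id, contributor_id, consent_version):
        """Record which consent version the contributor accepted for this clip."""
        record = ConsentRecord(
            clip_id, contributor_id, consent_version, datetime.now(timezone.utc)
        )
        self.records[clip_id] = record
        self._log("accepted", clip_id)
        return record

    def request_deletion(self, contributor_id):
        """Mark all of a contributor's clips as deleted; audit entries are kept."""
        now = datetime.now(timezone.utc)
        affected = []
        for record in self.records.values():
            if record.contributor_id == contributor_id and record.deleted_at is None:
                record.deleted_at = now
                self._log("deleted", record.clip_id)
                affected.append(record.clip_id)
        return affected
```

The design choice worth noting is that a deletion request adds audit entries rather than erasing earlier ones, so the history of what was consented to, and when it was withdrawn, stays reconstructable.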

Which languages are covered today?

Today: Hinglish, Tanglish, and Banglish — recorded by native speakers in Mumbai, Bangalore, Chennai, Kolkata, and Delhi. Next: Taglish (Philippines) and code-switched English in Nigeria.

How do I request a dataset sample?

Email hi@capstrix.com with the languages, scenarios, and approximate volume you need. We respond within one business day with a sample clip and a delivery quote.

Hinglish · Tanglish · Banglish voice data

The voice India actually speaks. Captured natively.

Name: Capstrix AI — Indian Code-Switched Voice Dataset
Creator: Capstrix AI

Capstrix AI supplies consented, native-speaker audio in Hinglish, Tanglish and Banglish: the code-switched Indian languages your STT and TTS models have never properly heard.

Indian voice today. Global, multi-modal tomorrow.


01 / The gap

Models trained on English break the moment a sentence switches code.

Nearly a billion Indians speak in two languages within a single breath — "main aaj office nahi ja raha, working from home", "naan office poren, traffic romba heavy iruku", "aami coffee khabo, then meeting attend korbo". Almost none of it lives in an open training set.

Capstrix AI records it the way it's actually spoken — on the phones of native urban speakers across Mumbai, Bangalore, Chennai, Kolkata and Delhi, with explicit consent, scored for quality, and shipped in the format your training pipeline already expects.

02 / The delivery

Spec-clean audio, drop-in for any voice training stack.

Format

16 kHz

16-bit, mono PCM WAV. No transcoding; captured natively on device.

Transcripts

Aligned

Per-utterance text with language tags on each code-switch boundary.

Consent

Logged

Versioned consent record bound to every clip. Audit trail by default.

QA

Scored

Signal-to-noise ratio (SNR), voice-activity detection (VAD), deduplication and speaker-match checks. Borderline clips go to human review.
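The split between automated scoring and human review can be pictured as a simple three-way routing rule on a quality score. The sketch below uses SNR as the score; the thresholds (20 dB and 10 dB), the function names, and the idea of routing on a single metric are assumptions made for illustration, not Capstrix's actual QA criteria.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech_rms, noise_rms):
    """SNR in decibels from the RMS amplitude of speech and of background noise."""
    if noise_rms <= 0:
        return math.inf  # no measurable noise
    return 20.0 * math.log10(speech_rms / noise_rms)


def route(score_db, accept_at=20.0, reject_below=10.0):
    """Accept clean clips, reject noisy ones, send the band in between to reviewers."""
    if score_db >= accept_at:
        return "accept"
    if score_db < reject_below:
        return "reject"
    return "human_review"
```

In practice the speech and noise segments would come from a voice-activity detector, and a production pipeline would combine several scores rather than rely on SNR alone.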

Languages & geographies

Starting India-first. Urban native speakers across Mumbai, Bangalore, Chennai, Kolkata and Delhi — recorded on their own mobile devices, not in a studio, so the acoustics match the products you're shipping.

Hinglish

Hindi × English — Mumbai, Delhi, Bangalore

IN · ~600M

Tanglish

Tamil × English — Chennai, Bangalore

IN · ~80M

Banglish

Bengali × English — Kolkata & the broader Bengali belt

IN · ~270M

Coming next

Expanding to Taglish (Philippines), Nigerian Pidgin English (Nigeria), and Indonesian–English — then text, image, video and behavioral modalities on the same consent-logged pipeline.

03 / The pipeline

From a speaker's phone to your training bucket.

01

Native capture

Verified speakers record scenario-driven prompts on their own devices, in their own homes, matching real product acoustics rather than studio booths.

02

Automated QA

Every clip goes through format, SNR, voice-activity, transcript and speaker-match checks. Clips that fail never reach the dataset.

03

Human review

Borderline scores route to internal reviewers. We'd rather drop a clip than ship one that pollutes your eval set.

04

Signed delivery

Manifest CSV, hashed identifiers, signed checksums. Delivered under a data processing agreement (DPA) and sized to the partnership.
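On the receiving side, a manifest with per-file checksums lets you confirm that nothing was corrupted or dropped in transit. The sketch below assumes a CSV with columns named filename and sha256; those column names, the choice of SHA-256, and the function names are assumptions for this example, and verifying the signature on the manifest itself is out of scope here.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large audio files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, audio_root):
    """Return filenames that are missing or whose SHA-256 differs from the manifest."""
    bad = []
    with open(manifest_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            clip = Path(audio_root) / row["filename"]
            if not clip.is_file() or sha256_of(clip) != row["sha256"].lower():
                bad.append(row["filename"])
    return bad
```

An empty return value means every listed file is present and matches its recorded hash; anything else is a list of files to re-request.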

04 / Who we work with

Built for the teams shipping voice models for India — and the global, multi-modal ones after that.

If you're training STT, TTS, voice agents, or speech-aware multi-modal systems and your evals fall apart on Hinglish, Tanglish or Banglish audio, we want to talk.

Direct line

hi@capstrix.com

Tell us your target Indian language pair, target hours, and the eval you keep failing. We'll send a representative sample within a few days.

Request a sample