Speech Data Builder: TTS & STT Dataset Creation Tool

2025-09-20
Web Development, AI Tools, Open Source, JavaScript, HTML, CSS, Bootstrap, WaveSurfer.js, IndexedDB

Overview

Speech Data Builder is an open-source web application that streamlines the creation of speech datasets for training Text-to-Speech (TTS) and Speech-to-Text (STT) models. It runs entirely in the browser with no server-side processing, which keeps dataset files on the user's machine.

Main Interface (Light Mode) showing the waveform and transcription tools.

Main Interface (Dark Mode).

The tool features AI-powered transcription (integrating Google Gemini and OpenAI), precise audio visualization using WaveSurfer.js, and automatic text normalization. Users can export datasets in popular formats like LJSpeech, CSV, and JSON.

Objectives

  • Create a privacy-focused, client-side tool for speech dataset creation.
  • Integrate AI services (Google Gemini, OpenAI) to speed up transcription.
  • Provide precise audio visualization and region selection.
  • Support industry-standard export formats like LJSpeech.

Key Features

AI-Powered Transcription

Automatically transcribe audio using Google Gemini or OpenAI Whisper models.

Audio Visualization

Visualizes audio waveforms with region selection for accurate timestamping.
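
As a rough illustration of how a selected region can become a timestamped clip, the sketch below converts a region's start and end times (in seconds) into sample indices and cuts the matching slice. It assumes mono PCM samples in a Float32Array and a region object with `start` and `end` fields; the function name `sliceRegion` is hypothetical and not taken from the project.

```javascript
// Hypothetical helper: cut the samples covered by a selected region.
// Assumes `samples` is a mono Float32Array and `region` has start/end in seconds.
function sliceRegion(samples, sampleRate, region) {
  const total = samples.length;
  // Convert seconds to sample indices and clamp them to the buffer bounds.
  const from = Math.min(total, Math.max(0, Math.round(region.start * sampleRate)));
  const to = Math.min(total, Math.max(from, Math.round(region.end * sampleRate)));
  return {
    samples: samples.subarray(from, to), // a view on the original buffer, not a copy
    startSec: from / sampleRate,         // timestamps snapped to sample boundaries
    endSec: to / sampleRate,
  };
}
```

Snapping the timestamps to sample boundaries keeps the exported start and end times consistent with the audio that is actually cut.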

Text Normalization

Automatically normalizes text for TTS training, for example by spelling out numbers as words and handling special characters.
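
The number-to-words step might look something like the following minimal sketch. It assumes English text and integers below one million, and expands only `&` and `%` as sample special characters; the helper names `numberToWords` and `normalizeText` are hypothetical, and the project's actual rules are not shown in the source.

```javascript
// Hypothetical helpers; not the project's actual normalization rules.
const ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine',
  'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen',
  'eighteen', 'nineteen'];
const TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety'];

// Spell out a non-negative integer below one million.
function numberToWords(n) {
  if (n < 20) return ONES[n];
  if (n < 100) {
    const rest = n % 10;
    return TENS[Math.floor(n / 10)] + (rest ? '-' + ONES[rest] : '');
  }
  if (n < 1000) {
    const rest = n % 100;
    return ONES[Math.floor(n / 100)] + ' hundred' + (rest ? ' ' + numberToWords(rest) : '');
  }
  const rest = n % 1000;
  return numberToWords(Math.floor(n / 1000)) + ' thousand' + (rest ? ' ' + numberToWords(rest) : '');
}

// Replace digit runs with words, expand a few symbols, and collapse whitespace.
function normalizeText(text) {
  return text
    .replace(/\d+/g, (digits) => {
      const n = Number(digits);
      return n < 1000000 ? numberToWords(n) : digits; // leave larger numbers untouched
    })
    .replace(/&/g, ' and ')
    .replace(/%/g, ' percent')
    .replace(/\s+/g, ' ')
    .trim();
}
```

For example, `normalizeText('I have 42 cats & 3 dogs')` returns `'I have forty-two cats and three dogs'`. Dates, currency, ordinals, and decimals would need their own rules.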

Multiple Export Formats

Exports data to LJSpeech, CSV, JSON, and TXT formats.
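
For reference, the LJSpeech layout pairs a `wavs/` folder of audio clips with a `metadata.csv` file whose lines are pipe-delimited: clip ID, raw transcription, normalized transcription. The sketch below builds that file's contents; the record shape (`id`, `text`, `normalized`) and the function name `toLjspeechMetadata` are assumptions for illustration, not the project's actual data model.

```javascript
// Build the contents of an LJSpeech-style metadata.csv.
// Each line is: <clip id>|<raw transcription>|<normalized transcription>
// The clip record shape ({ id, text, normalized }) is an assumption for this sketch.
function toLjspeechMetadata(clips) {
  // Pipes and line breaks would corrupt the pipe-delimited format, so replace them with spaces.
  const clean = (s) => String(s).replace(/[|\r\n]+/g, ' ').trim();
  return clips
    // Fall back to the raw text when no normalized form is present.
    .map((c) => [clean(c.id), clean(c.text), clean(c.normalized ?? c.text)].join('|'))
    .join('\n') + '\n';
}
```

A clip with ID `clip_0001` would then correspond to the audio file `wavs/clip_0001.wav` in the exported archive.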

Offline Support (PWA)

Installable as a Progressive Web App (PWA) that works offline.

Challenges & Solutions

Challenge:

Handling large audio files and datasets purely in the browser.

Solution:

Utilized IndexedDB for efficient client-side storage and JSZip for generating large export files without crashing the browser.
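
A minimal sketch of the storage side, using the standard browser IndexedDB API wrapped in Promises. The database name `speech-dataset`, the `clips` object store, and the record shape are hypothetical; the project's actual schema is not described in the source. The JSZip export step is not shown here.

```javascript
// Hypothetical names: 'speech-dataset' database, 'clips' store keyed by `id`.
// Browser-only: relies on the global `indexedDB` object.
function openDatasetDb() {
  return new Promise((resolve, reject) => {
    const request = indexedDB.open('speech-dataset', 1);
    // Runs on first open (or a version bump): create the object store.
    request.onupgradeneeded = () => {
      request.result.createObjectStore('clips', { keyPath: 'id' });
    };
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}

// Store one clip record, e.g. { id, audio: Blob, text }.
function saveClip(db, clip) {
  return new Promise((resolve, reject) => {
    const tx = db.transaction('clips', 'readwrite');
    tx.objectStore('clips').put(clip);
    // Resolve only once the whole transaction has committed.
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
    tx.onabort = () => reject(tx.error);
  });
}
```

Because IndexedDB can store Blob values directly, audio can be kept on disk by the browser and loaded one clip at a time instead of holding the whole dataset in memory.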

Challenge:

Visualizing audio data accurately for precise splitting.

Solution:

Implemented WaveSurfer.js to render detailed waveforms and allow precise region manipulation.