Hugging Face Dataset

Transcription-Cleanup-Trainer

Text Cleanup Fine-Tuning Dataset A curated dataset for training speech-to-text cleanup models to achieve optimal transcript refinement. Dataset Description This dataset contains paired examples of raw speech-to-text transcriptions and manually-cleaned versions, designed for fine-tuning models to clean up transcripts to a specific quality level ("Goldilocks" cleanup - not too much, not too little). Dataset Structure dataset/ ├── data/ │ ├── audio/… See the full description on the dataset page: https://huggingface.co/datasets/danielrosehill/Transcription-Cleanup-Trainer.

Project Information

Created

December 18, 2025

Platform

Hugging Face Dataset

Type

Dataset

Topics

size_categories:n<1Kformat:textmodality:audiomodality:textlibrary:datasetslibrary:mlcroissantregion:us

View on Hugging Face Dataset

← Back to Projects Index