Introducing synthetictext
This article is the Quarto version of my post introducing synthetictext, an LLM-powered Python package for generating synthetic text data for text classification tasks.
Interactive demo
Use the controls below to simulate how I think about picking a generation strategy. The demo is intentionally simple, but it shows that the article can host interactive HTML and JavaScript, not just static text.
What it does
- Task-agnostic generation for binary and multi-class text classification tasks
- Five generation strategies: direct generation, paraphrasing, contrastive pairs, backtranslation, and pivot translation
- Multi-stage quality filtering, including deduplication, label leakage detection, embedding-based near-duplicate removal, LLM-as-judge checks, and keyword marker checks
- Multilingual support, including cross-lingual transfer for lower-resource settings
- Provider-agnostic design, with built-in OpenAI support and extension points for custom LLM and translation providers
- Both a Python API and a CLI
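To make the quality-filtering idea above concrete, here is a minimal sketch of two of the listed stages, exact deduplication and label leakage detection, in plain Python. It does not use the synthetictext API; the function names (`normalize`, `deduplicate`, `leaks_label`, `filter_examples`) and the sample data are hypothetical, and the package's own filters may work differently.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())

def deduplicate(examples):
    """Keep the first occurrence of each normalized text (hypothetical helper)."""
    seen = set()
    kept = []
    for text, label in examples:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            kept.append((text, label))
    return kept

def leaks_label(text: str, label: str) -> bool:
    """Flag texts that contain the label name as a whole word (hypothetical helper)."""
    pattern = r"\b" + re.escape(label.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None

def filter_examples(examples):
    """Run deduplication first, then drop examples that state their own label."""
    return [(t, l) for t, l in deduplicate(examples) if not leaks_label(t, l)]

# Made-up sample data for illustration only.
examples = [
    ("The battery died after two days.", "negative"),
    ("the battery  died after two days.", "negative"),   # duplicate after normalization
    ("This is a positive review: great screen!", "positive"),  # names its own label
    ("Setup took five minutes and it just works.", "positive"),
]
filtered = filter_examples(examples)
```

A whole-word match is a deliberately crude leakage check. Embedding-based near-duplicate removal and LLM-as-judge checks would sit after these cheaper stages, so the expensive steps only see examples that survived the simple ones.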
Why I built it
I wanted a reusable way to create synthetic training data for classification workflows without rebuilding the whole pipeline for every new task.