Introducing synthetictext

Published:

This article is the Quarto version of my post introducing [synthetictext](https://pypi.org/project/synthetictext/), an LLM-powered Python package for generating synthetic text data for text classification tasks.

## Interactive demo

Use the controls below to simulate how I think about picking a generation strategy. This is intentionally simple, but it proves the article can host interactive HTML and JavaScript, not just static text.

## What it does

- Task-agnostic generation for binary and multi-class text classification tasks
- Five generation strategies: direct generation, paraphrasing, contrastive pairs, backtranslation, and pivot translation
- Multi-stage quality filtering, including deduplication, label leakage detection, embedding-based near-duplicate removal, LLM-as-judge checks, and keyword marker checks
- Multilingual support, including cross-lingual transfer for lower-resource settings
- Provider-agnostic design, with built-in OpenAI support and extension points for custom LLM and translation providers
- Both a Python API and a CLI

## Why I built it

I wanted a reusable way to create synthetic training data for classification workflows without rebuilding the whole pipeline for every new task.

## Links

- PyPI:
- GitHub:
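
## Sketch: what a few filtering stages might look like

To make the quality filtering stages listed under "What it does" more concrete, here is a minimal standard-library sketch of three of them: exact deduplication, label leakage detection, and near-duplicate removal. This is not the synthetictext API. Every function name here (`drop_exact_duplicates`, `leaks_label`, `drop_near_duplicates`, `filter_examples`) and the `0.8` threshold are hypothetical, and token-set Jaccard overlap is swapped in as a crude stand-in for the embedding-based similarity the package uses.

```python
import re
from typing import Iterable

# Hypothetical data shape for this sketch: (text, label) pairs.
Example = tuple[str, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def drop_exact_duplicates(examples: Iterable[Example]) -> list[Example]:
    """Keep the first occurrence of each normalized text."""
    seen: set[str] = set()
    kept: list[Example] = []
    for text, label in examples:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            kept.append((text, label))
    return kept


def leaks_label(text: str, label: str) -> bool:
    """Flag texts that mention the label name as a whole word."""
    pattern = r"\b" + re.escape(label.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; an assumed stand-in for embedding similarity."""
    ta = set(re.findall(r"\w+", normalize(a)))
    tb = set(re.findall(r"\w+", normalize(b)))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def drop_near_duplicates(
    examples: Iterable[Example], threshold: float = 0.8
) -> list[Example]:
    """Greedily keep an example only if it is below the threshold
    against every example kept so far."""
    kept: list[Example] = []
    for text, label in examples:
        if all(jaccard(text, other) < threshold for other, _ in kept):
            kept.append((text, label))
    return kept


def filter_examples(
    examples: Iterable[Example], threshold: float = 0.8
) -> list[Example]:
    """Run the three stages in order, cheapest first."""
    stage1 = drop_exact_duplicates(examples)
    stage2 = [(t, lab) for t, lab in stage1 if not leaks_label(t, lab)]
    return drop_near_duplicates(stage2, threshold)
```

Running the cheap checks first means the pairwise similarity stage, which is quadratic in this naive form, sees fewer candidates. A real pipeline would likely replace `jaccard` with cosine similarity over embeddings and add the LLM-as-judge and keyword marker checks as later stages.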