Introducing synthetictext

Published

June 1, 2026

This article is the Quarto version of my post introducing synthetictext, an LLM-powered Python package for generating synthetic text data for text classification tasks.

Interactive demo

Use the controls below to simulate how I think about picking a generation strategy. The demo is intentionally simple, but it shows that the article can host interactive HTML and JavaScript, not just static text.

What it does

  • Task-agnostic generation for binary and multi-class text classification tasks
  • Five generation strategies: direct generation, paraphrasing, contrastive pairs, backtranslation, and pivot translation
  • Multi-stage quality filtering, including deduplication, label leakage detection, embedding-based near-duplicate removal, LLM-as-judge checks, and keyword marker checks
  • Multilingual support, including cross-lingual transfer for lower-resource settings
  • Provider-agnostic design, with built-in OpenAI support and extension points for custom LLM and translation providers
  • Both a Python API and a CLI
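
To make the multi-stage filtering idea concrete, here is a minimal sketch of two of the stages listed above, exact deduplication and label leakage detection, chained as plain functions. It is not synthetictext's actual API: the names Example, normalize, drop_exact_duplicates, drop_label_leakage, and run_filters are hypothetical and chosen only for illustration, and the leakage rule (the label appearing as a whole word in the text) is an assumed heuristic.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """A generated candidate: the text plus the label it was generated for."""
    text: str
    label: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def drop_exact_duplicates(examples):
    """Keep only the first occurrence of each normalized text."""
    seen = set()
    kept = []
    for ex in examples:
        key = normalize(ex.text)
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept


def drop_label_leakage(examples):
    """Drop examples whose text contains its own label as a whole word.

    Assumed heuristic: a classifier trained on such text could learn to
    read the label instead of the content.
    """
    kept = []
    for ex in examples:
        pattern = r"\b" + re.escape(ex.label.lower()) + r"\b"
        if not re.search(pattern, ex.text.lower()):
            kept.append(ex)
    return kept


def run_filters(examples, stages):
    """Apply each filtering stage in order, passing survivors along."""
    for stage in stages:
        examples = stage(examples)
    return examples


candidates = [
    Example("The battery died after two days.", "negative"),
    Example("the battery  died after two days.", "negative"),  # duplicate after normalization
    Example("This review is negative: it broke.", "negative"),  # leaks the label
    Example("Works exactly as described.", "positive"),
]

filtered = run_filters(candidates, [drop_exact_duplicates, drop_label_leakage])
for ex in filtered:
    print(ex.label, "|", ex.text)
```

Keeping each stage as a function from a list of examples to a list of examples means the cheap checks can run first, so more expensive stages such as embedding-based near-duplicate removal or LLM-as-judge checks would only see the candidates that survive.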

Why I built it

I wanted a reusable way to create synthetic training data for classification workflows without rebuilding the whole pipeline for every new task.