Synthetic Text
Python package for generating and filtering synthetic text data for NLP tasks
Building synthetictext, a Python package that generates synthetic training and evaluation data for text classification, style transfer, and retrieval-augmented generation (RAG) evaluation.
Current focus:
- Task-agnostic generation for binary and multi-class classification
- Multiple generation strategies: direct generation, paraphrasing, contrastive pairs, backtranslation, and pivot translation
- Quality filters for deduplication, label leakage, embedding similarity, marker artifacts, and LLM-as-judge scoring
- Multilingual and low-resource workflows, including cross-lingual transfer and provider-agnostic model support
Tech: Python, OpenAI API, embeddings, Google Translate, packaging, CLI design
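To make the quality-filter idea concrete, here is a minimal sketch of two of the filters listed above: deduplication and label-leakage detection. It is an illustration only, not the synthetictext API. The function names (`normalize`, `leaks_label`, `filter_examples`) and the `(text, label)` tuple format are assumptions, and the dedup here is exact matching after normalization rather than embedding similarity.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def leaks_label(text: str, label: str) -> bool:
    """Return True if the label name appears as a whole word in the text."""
    pattern = r"\b" + re.escape(label.lower()) + r"\b"
    return re.search(pattern, normalize(text)) is not None


def filter_examples(examples):
    """Drop normalized duplicates and label-leaking texts, preserving order.

    `examples` is an iterable of (text, label) pairs (hypothetical format).
    """
    seen = set()
    kept = []
    for text, label in examples:
        key = normalize(text)
        if key in seen or leaks_label(text, label):
            continue
        seen.add(key)
        kept.append((text, label))
    return kept


samples = [
    ("The battery died after two days.", "negative"),
    ("the  battery died after two days.", "negative"),        # duplicate after normalization
    ("This is a negative review of the phone.", "negative"),  # label name leaks into the text
    ("Setup took five minutes and it just works.", "positive"),
]
filtered = filter_examples(samples)
```

With these samples, `filtered` keeps only the first and last entries. A fuller filter would likely add near-duplicate detection via embeddings and an LLM-as-judge pass, both of which need a model provider and are left out of this sketch.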