Synthetic Data Generator Simplifies Dataset Creation with Large Language Models


Hugging Face has introduced the Synthetic Data Generator, a new tool that leverages Large Language Models (LLMs) to offer a streamlined, no-code approach to creating custom datasets. The tool facilitates the creation of text classification and chat datasets through a clear and accessible process, making it usable for both non-technical users and experienced AI practitioners.
The Synthetic Data Generator uses a simple three-step process to create datasets:

  1. Describe Your Dataset: Users start by defining the dataset's purpose and providing examples to guide the tool. This step ensures the generator aligns with the user's specific requirements.
  2. Configure and Refine: After generating an initial sample dataset, users can refine it by adjusting task-specific settings, such as the system prompt or dataset parameters, iterating until they achieve the desired output.
  3. Generate and Push: Finally, users can name the dataset, specify the number of samples to generate, and set parameters like the temperature for the output. The completed dataset is saved directly to Argilla and the Hugging Face Hub for further use.

After generation, the tool integrates with Argilla, enabling users to review, explore, and curate the dataset with features like semantic search and composable filters. This step is critical for maintaining data quality, even in synthetic datasets. Once the dataset is reviewed, it can be exported to the Hugging Face Hub to fine-tune models.

The tool currently supports two tasks:

  • Text Classification: For categorizing text into predefined classes.
  • Chat Datasets: For conversational AI tasks, such as training customer support chatbots.

For instance, users can train a text classification model using the argilla/synthetic-text-classification-news dataset, which classifies news articles into eight categories. The Synthetic Data Generator simplifies the process by enabling model training through AutoTrain, a no-code platform for creating AI models.

Shashi Bhushan, a data scientist, highlighted the broader impact of the tool:

This is a fantastic development! The ability to generate high-quality datasets rapidly without requiring coding skills will democratize AI and empower a broader range of professionals to leverage machine learning. This tool could significantly reduce the time and resources typically needed for data preparation, allowing teams to focus more on model development and innovation. Additionally, the integration with AutoTrain suggests a seamless workflow from data generation to model training, which is a huge plus for efficiency. Looking forward to seeing the impact this will have on the AI community!

The Synthetic Data Generator can produce 50 text classification samples or 20 chat samples per minute with the free Hugging Face API. Users can scale this further by using custom APIs or advanced models. Planned improvements include support for Retrieval Augmented Generation (RAG) and customized evaluations using LLMs as judges. 
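The quoted rates make it easy to estimate how long a generation run will take on the free API. A back-of-the-envelope helper, using only the figures above:

```python
# Generation-time estimate from the rates quoted for the free
# Hugging Face API: 50 text classification or 20 chat samples
# per minute.
RATES_PER_MINUTE = {"text-classification": 50, "chat": 20}

def minutes_to_generate(task: str, num_samples: int) -> float:
    """Estimated wall-clock minutes to generate num_samples for a task."""
    return num_samples / RATES_PER_MINUTE[task]

print(minutes_to_generate("text-classification", 1000))  # 20.0
print(minutes_to_generate("chat", 1000))                 # 50.0
```

So a 1,000-sample text classification dataset takes roughly 20 minutes on the free tier, while the same number of chat samples takes about 50; custom APIs or faster models would shorten these estimates.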

The tool is also available as an open-source Python package via GitHub, enabling local deployment and further customization under the Apache 2.0 license.
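For local use, the project's GitHub README describes installing the package from PyPI and launching the app with a `launch()` entry point; the package name and command below are taken from that README and should be verified against the repository.

```shell
# Install the open-source package and launch the app locally.
# A Hugging Face token (HF_TOKEN) may be needed for Hub access.
pip install synthetic-dataset-generator
python -c "from synthetic_dataset_generator import launch; launch()"
```

Running locally makes it possible to swap in custom model endpoints rather than the free Hugging Face API.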
