Paper ID: 2410.09168

Hybrid Training Approaches for LLMs: Leveraging Real and Synthetic Data to Enhance Model Performance in Domain-Specific Applications

Alexey Zhezherau, Alexei Yanockin

This research explores a hybrid approach to fine-tuning large language models (LLMs) by integrating real-world and synthetic data to boost model performance, particularly in generating accurate and contextually relevant responses. By leveraging a dataset combining transcribed real interactions with high-quality synthetic sessions, we aimed to overcome the limitations of scarce, noisy, and domain-specific real data. Synthetic personas and scenarios were employed to enhance training diversity. The study evaluated three models: a base foundational model, a model fine-tuned with real data, and a hybrid fine-tuned model. Experimental results showed that the hybrid model consistently outperformed the others in specific vertical applications, achieving the highest scores across all metrics. Further testing confirmed the hybrid model's superior adaptability and contextual understanding across diverse scenarios. These findings suggest that combining real and synthetic data can significantly improve the robustness and contextual sensitivity of LLMs, particularly in domain-specific and vertical use cases.

Submitted: Oct 11, 2024