Domain-Specific Continual Pre-Training
Domain-specific continual pre-training (CPT) adapts large language models (LLMs) by incrementally training them on data from a target domain, improving performance on domain-relevant tasks while preserving the models' general capabilities. Current research focuses on optimizing data selection strategies, developing efficient algorithms for knowledge integration (e.g., model merging and soft-masking), and establishing scaling laws that guide the balance between general and domain-specific training data. CPT is a cost-effective alternative to training domain-specific LLMs from scratch, enabling specialized models for finance, e-commerce, mental health, and other areas where labeled data is limited.
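To make two of these ideas concrete, below is a minimal Python sketch (not drawn from any of the listed papers): a replay-style data mixer that interleaves general and domain text at a fixed ratio, and a weighted parameter average as the simplest form of model merging. The function names, the `domain_ratio` and `alpha` defaults, and the plain-list corpus representation are all illustrative assumptions.

```python
import random


def mixed_batches(domain_docs, general_docs, domain_ratio=0.7,
                  n_batches=1000, batch_size=8, seed=0):
    """Yield CPT batches that replay general-corpus text alongside domain
    text at a fixed ratio, one common way to limit forgetting of general
    knowledge. `domain_ratio` is the knob that scaling-law studies aim to
    set; 0.7 here is an arbitrary illustrative default.
    """
    rng = random.Random(seed)
    for _ in range(n_batches):
        yield [
            rng.choice(domain_docs if rng.random() < domain_ratio else general_docs)
            for _ in range(batch_size)
        ]


def merge_state_dicts(general_sd, domain_sd, alpha=0.3):
    """Weighted parameter averaging, the simplest form of model merging:
    fold continually pre-trained (domain) weights back into the general
    checkpoint. alpha=0 keeps the general model, alpha=1 the domain model.
    Values are assumed to be floating-point tensors (or arrays) sharing
    the same keys and shapes.
    """
    return {
        name: (1.0 - alpha) * general_sd[name] + alpha * domain_sd[name]
        for name in general_sd
    }
```

In practice the merged state dict would be loaded back into the base architecture (e.g., via `model.load_state_dict` in PyTorch), and both the mixing ratio and the merge weight would be tuned against held-out general and domain benchmarks.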
Papers
Eight papers, dated from April 20, 2023 to September 30, 2024.