Paper ID: 2412.13205 • Published Dec 3, 2024
Adaptive Two-Phase Finetuning LLMs for Japanese Legal Text Retrieval
Text Retrieval (TR) involves finding and retrieving text-based content
relevant to a user's query from a large repository, with applications in
real-world scenarios such as legal document retrieval. While most existing
studies focus on English, limited work addresses Japanese contexts. In this
paper, we introduce a new dataset specifically designed for Japanese legal
contexts and propose a novel two-phase pipeline tailored to this domain.
In the first phase, the model learns a broad understanding of global
contexts, enhancing its generalization and adaptability to diverse queries. In
the second phase, the model is fine-tuned to address complex queries specific
to legal scenarios. Extensive experiments show that our method outperforms
existing baselines.
Furthermore, our pipeline proves effective in English contexts, surpassing
comparable baselines on the MS MARCO dataset. We have made our code publicly
available on GitHub, and the model checkpoints are accessible via HuggingFace.
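
The two-phase idea can be illustrated with a small sketch of how training data might be staged. This is not the authors' implementation: the abstract does not say how complex queries are identified, so the sketch assumes a query counts as "complex" when a baseline scorer fails to rank its gold passage at the top. The names `token_overlap_score`, `rank_of_positive`, `split_into_phases`, and the `easy_cutoff` parameter are all hypothetical, and the toy scorer uses whitespace tokenization, which would need to be replaced by a proper tokenizer for Japanese text.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (query text, id of the gold passage)


def token_overlap_score(query: str, doc: str) -> float:
    """Toy baseline scorer (assumption): fraction of query tokens found in the passage."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


def rank_of_positive(
    query: str,
    positive_id: str,
    corpus: Dict[str, str],
    score: Callable[[str, str], float],
) -> int:
    """Return the 1-based rank of the gold passage under the baseline scorer."""
    ranked = sorted(corpus, key=lambda doc_id: score(query, corpus[doc_id]), reverse=True)
    return ranked.index(positive_id) + 1


def split_into_phases(
    examples: List[Example],
    corpus: Dict[str, str],
    score: Callable[[str, str], float],
    easy_cutoff: int = 1,
) -> Tuple[List[Example], List[Example]]:
    """Stage the data for two training phases.

    Phase 1 uses every (query, passage) pair for broad coverage.
    Phase 2 keeps only pairs whose gold passage the baseline ranks below
    `easy_cutoff` (an assumed proxy for "complex" queries).
    """
    phase1 = list(examples)
    phase2 = [
        (query, pos_id)
        for query, pos_id in examples
        if rank_of_positive(query, pos_id, corpus, score) > easy_cutoff
    ]
    return phase1, phase2


# Toy corpus and queries, in English for readability.
corpus = {
    "d1": "contract termination notice period",
    "d2": "inheritance tax rate",
    "d3": "lease contract renewal",
}
examples = [
    ("inheritance tax rate", "d2"),      # baseline ranks the gold passage first
    ("contract renewal dispute", "d1"),  # baseline prefers d3, so this is "hard"
]

phase1, phase2 = split_into_phases(examples, corpus, token_overlap_score)
```

In a real pipeline each phase's list would feed a finetuning run of the retrieval model, with the second run starting from the first run's checkpoint; the selection rule here only shows one plausible way to separate general from difficult queries.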