Data Recipe

In the context of large language models (LLMs), a "data recipe" is an optimized combination of training data sources designed to improve model performance on a specific task or across a range of benchmarks. Current research focuses on algorithms and frameworks that generate and evaluate these recipes automatically, including methods for programmatically creating synthetic data and for efficiently processing massive, heterogeneous datasets. This work matters because manually curating LLM training data is costly and complex; automating it could make LLM development and deployment more efficient and effective across a range of applications.
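
As a rough illustration of the idea of a recipe as a weighted combination of data sources, the sketch below represents a recipe as a mapping from source name to sampling weight and draws a training mix from it. The source names, the weights, and the helper functions are hypothetical and are not taken from any of the papers listed here; real systems tune such weights against benchmark results rather than fixing them by hand.

```python
import random

# Hypothetical recipe: source name -> sampling weight (illustrative values only).
RECIPE = {"web_text": 0.6, "code": 0.3, "synthetic_instructions": 0.1}


def sample_sources(recipe, n, seed=0):
    """Draw n source names with probability proportional to each weight."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    names = list(recipe)
    weights = [recipe[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def mix_counts(recipe, n, seed=0):
    """Count how many of n sampled examples come from each source."""
    counts = {name: 0 for name in recipe}
    for name in sample_sources(recipe, n, seed):
        counts[name] += 1
    return counts


if __name__ == "__main__":
    print(mix_counts(RECIPE, 1000))
```

Evaluating a recipe would then mean training or fine-tuning on the sampled mix and comparing benchmark scores across candidate weightings, which this sketch does not attempt.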

Papers

October 7, 2024

Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates
Avanika Narayan, Mayee F. Chen, Kush Bhatia, Christopher Ré
Training Data New Framework Complete Recipe Instruction Tuned Model Instruction Dataset Generative LLM Recipe Dataset Data Recipe

June 6, 2024

Repurposing Language Models into Embedding Models: Finding the Compute-Optimal Recipe
Alicja Ziarko, Albert Q. Jiang, Bartosz Piotrowski, Wenda Li, Mateja Jamnik, Piotr Miłoś
Language Model Full Model Semantic Similarity Text Embeddings Low Rank Adaptation Decoder Only Language Model Data Recipe

September 5, 2023

Data-Juicer: A One-Stop Data Processing System for Large Language Models
Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, Jingren Zhou
Processing Framework Recipe Dataset Data Recipe

June 24, 2023

Large Language Models as Sous Chefs: Revising Recipes with GPT-3
Alyssa Hwang, Bryan Li, Zhaoyi Hou, Dan Roth
Large Language Model GPT 3 Amazon Mechanical Turk Recipe Completion Data Recipe

February 2, 2023

STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition
Yucheng Lu, Shivani Agrawal, Suvinay Subramanian, Oleg Rybakov, Christopher De Sa, Amir Yazdanbakhsh
Cross Over Step Scratch Project Sparse Mask Mask Learning Adam Algorithm Data Recipe

October 14, 2022

RecipeMind: Guiding Ingredient Choices from Food Pairing to Recipe Completion using Cascaded Set Transformer
Mogan Gim, Donghee Choi, Kana Maruyama, Jihun Choi, Hajung Kim, Donghyeon Park, Jaewoo Kang
Recipe Completion Data Recipe Key Ingredient

November 9, 2021

A Differentiable Recipe for Learning Visual Non-Prehensile Planar Manipulation
Bernardo Aceituno, Alberto Rodriguez, Shubham Tulsiani, Abhinav Gupta, Mustafa Mukadam
Differentiable Model Differentiable Architecture Contact Aware Non Prehensile Planar Manipulation Data Recipe