Model Deployment
Model deployment focuses on integrating machine learning models into real-world applications efficiently and reliably, addressing challenges such as resource constraints, privacy concerns, and the need for model stability and fairness. Current research emphasizes better resource utilization through microservice-based serving architectures, efficient model compression (e.g., smaller, locally deployable LLMs, or models stitched together from parts of larger ones), and automated model training and deployment pipelines (MLOps). These advances are crucial for making AI accessible across diverse environments and applications, improving cost-effectiveness, and mitigating the risks of deploying complex models.
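To make the elastic-scaling idea behind microservice-based serving concrete, here is a minimal sketch of a replica-count policy that sizes a model-serving deployment to the observed request rate. All names (ScalingPolicy, decide_replicas, the capacity numbers) are illustrative assumptions, not APIs from ElasticRec or any particular serving framework.

```python
# Toy autoscaling policy for a model-serving microservice.
# Illustrative only: the class/function names and capacity figures are
# assumptions, not taken from ElasticRec or any real serving system.

import math
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    target_qps_per_replica: float = 200.0  # assumed per-replica capacity
    min_replicas: int = 1
    max_replicas: int = 16


def decide_replicas(observed_qps: float, policy: ScalingPolicy) -> int:
    """Return the replica count that keeps per-replica load near the target."""
    desired = math.ceil(observed_qps / policy.target_qps_per_replica)
    return max(policy.min_replicas, min(policy.max_replicas, desired))


if __name__ == "__main__":
    policy = ScalingPolicy()
    for qps in (50, 450, 1800, 5000):
        print(f"{qps:>5} req/s -> {decide_replicas(qps, policy)} replica(s)")
```

In a real deployment this decision would typically be delegated to the orchestrator (for example, a Kubernetes horizontal autoscaler driven by serving-latency or throughput metrics) rather than hand-rolled, but the core trade-off is the same: match replica count to load while respecting resource bounds.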
Papers
ElasticRec: A Microservice-based Model Serving Architecture Enabling Elastic Resource Scaling for Recommendation Models
Yujeong Choi, Jiin Kim, Minsoo Rhu
Flextron: Many-in-One Flexible Large Language Model
Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov