Cloud Incident

Research on cloud incident management aims to automate the identification and resolution of issues in large-scale cloud services, reducing downtime and manual effort for engineers. Current work relies heavily on large language models (LLMs) such as GPT-3 and GPT-4, using techniques like in-context learning and prompt engineering to improve the accuracy of root cause analysis and mitigation recommendations. It draws on diverse data sources, including logs, monitoring data, and historical incident reports, to make AI-driven incident management systems more accurate and reliable, with the goal of improving service reliability and operational efficiency.
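To make the in-context learning idea concrete, here is a minimal sketch of how historical incident reports might be turned into a few-shot prompt for root cause analysis. It is an illustration under stated assumptions, not the method of any particular paper: the `Incident` record, the word-overlap (Jaccard) similarity used to pick examples, the prompt wording, and the sample incidents are all hypothetical. Real systems would more likely use embedding-based retrieval, and the resulting string would then be sent to an LLM of choice.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical record for a resolved incident (not a real schema)."""
    summary: str
    root_cause: str


def tokenize(text: str) -> set[str]:
    """Lowercase, whitespace-split bag of words."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity; a stand-in for embedding similarity."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_examples(new_summary: str, history: list[Incident], k: int = 2) -> list[Incident]:
    """Return the k past incidents whose summaries best match the new one."""
    query = tokenize(new_summary)
    ranked = sorted(
        history,
        key=lambda inc: jaccard(query, tokenize(inc.summary)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(new_summary: str, history: list[Incident], k: int = 2) -> str:
    """Assemble a few-shot prompt: instruction, k solved examples, open query."""
    parts = ["You are assisting with cloud incident root cause analysis."]
    for inc in select_examples(new_summary, history, k):
        parts.append(f"Incident: {inc.summary}\nRoot cause: {inc.root_cause}")
    parts.append(f"Incident: {new_summary}\nRoot cause:")
    return "\n\n".join(parts)


# Made-up sample data, for illustration only.
HISTORY = [
    Incident("write failures on storage node after disk full alert",
             "log volume filled the disk"),
    Incident("users cannot login due to expired certificate on auth gateway",
             "TLS certificate expired"),
    Incident("api latency spike after deployment of new build",
             "regression in connection pooling"),
]

NEW_SUMMARY = "disk full alert and write failures on storage node"

if __name__ == "__main__":
    print(build_prompt(NEW_SUMMARY, HISTORY, k=2))
```

The prompt ends with an open "Root cause:" line so that the model completes it in the same format as the retrieved examples. Choosing which past incidents to include is the main design decision, since the examples steer the model's answer.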

Papers