Apache Spark

Apache Spark is an open-source distributed computing framework for processing massive datasets efficiently. Current research focuses on automated tuning of Spark's many configuration parameters, using techniques such as Bayesian optimization and transfer learning to reduce resource consumption (CPU, memory) and shorten execution time across diverse workloads, including machine learning tasks. Efficient resource management and automated optimization are critical for large-scale data analysis in domains such as healthcare, finance, and scientific computing, where datasets keep growing.
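As a toy illustration of the automated parameter tuning described above, the sketch below runs a small Bayesian-optimization loop (Gaussian-process surrogate with an expected-improvement acquisition) over two common Spark knobs: the shuffle-partition count and executor memory. The runtime model here is synthetic and all parameter ranges are illustrative assumptions; a real tuner would measure actual job runtimes on a cluster rather than evaluate a formula.

```python
import numpy as np
from math import erf, sqrt, pi

rng = np.random.default_rng(0)

# Synthetic stand-in for a measured Spark job runtime: penalizes a
# shuffle-partition count far from 200 and low executor memory.
# A real tuner would launch the job and time it instead.
def job_runtime(partitions, memory_gb):
    return np.log(partitions / 200.0) ** 2 + 4.0 / memory_gb

# Candidate configurations (illustrative search space).
parts = np.geomspace(25, 1600, 13)          # shuffle partitions
mems = np.arange(1, 17)                     # executor memory in GB
grid = np.array([(p, m) for p in parts for m in mems])
X = np.column_stack([np.log(grid[:, 0]), grid[:, 1] / 16.0])  # normalized

def rbf(A, B, ls=0.5):
    # Squared-exponential kernel between two sets of points.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def pdf(z): return np.exp(-z * z / 2) / sqrt(2 * pi)
def cdf(z): return 0.5 * (1 + erf(z / sqrt(2)))

# Start from a few random trials, then pick each next configuration by
# expected improvement under the Gaussian-process surrogate.
idx = list(rng.choice(len(grid), size=4, replace=False))
y = [job_runtime(*grid[i]) for i in idx]
for _ in range(20):
    Xo = X[idx]
    yo = np.array(y)
    ys = (yo - yo.mean()) / (yo.std() + 1e-9)   # standardize observations
    K = rbf(Xo, Xo) + 1e-6 * np.eye(len(idx))
    Ks = rbf(X, Xo)
    mu = Ks @ np.linalg.solve(K, ys)
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    sd = np.sqrt(np.clip(var, 1e-12, None))
    best = ys.min()
    z = (best - mu) / sd
    ei = (best - mu) * np.array([cdf(v) for v in z]) \
        + sd * np.array([pdf(v) for v in z])
    ei[idx] = -1.0                              # skip already-tried configs
    nxt = int(np.argmax(ei))
    idx.append(nxt)
    y.append(job_runtime(*grid[nxt]))

best_partitions, best_memory = grid[idx[int(np.argmin(y))]]
best_runtime = min(y)
print(f"best config: partitions={best_partitions:.0f}, "
      f"memory={best_memory:.0f} GB, runtime={best_runtime:.2f}")
```

In a production setting the synthetic `job_runtime` would be replaced by an actual Spark run (e.g. submitting the job with `spark.sql.shuffle.partitions` and `spark.executor.memory` set to the candidate values), and off-the-shelf Bayesian-optimization libraries would replace the hand-rolled surrogate.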

Papers