Apache Spark

Apache Spark is an open-source distributed computing framework for processing massive datasets efficiently. Current research focuses on automated tuning of Spark's many configuration parameters, using techniques such as Bayesian optimization and transfer learning to reduce resource consumption (CPU, memory) and shorten execution time across diverse workloads, including machine learning tasks. Efficient resource management and automated optimization are critical for large-scale data analysis in domains such as healthcare, finance, and scientific computing, where datasets keep growing.
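As a toy illustration of the automated parameter tuning described above, the sketch below runs a small Bayesian-optimization loop (Gaussian-process surrogate with an expected-improvement acquisition) over two common Spark knobs: the shuffle-partition count and executor memory. The runtime model here is synthetic and all parameter ranges are illustrative assumptions; a real tuner would measure actual job runtimes on a cluster rather than evaluate a formula.

```python
import numpy as np
from math import erf, sqrt, pi

rng = np.random.default_rng(0)

# Synthetic stand-in for a measured Spark job runtime: penalizes a
# shuffle-partition count far from 200 and low executor memory.
# A real tuner would launch the job and time it instead.
def job_runtime(partitions, memory_gb):
    return np.log(partitions / 200.0) ** 2 + 4.0 / memory_gb

# Candidate configurations (illustrative search space).
parts = np.geomspace(25, 1600, 13)          # shuffle partitions
mems = np.arange(1, 17)                     # executor memory in GB
grid = np.array([(p, m) for p in parts for m in mems])
X = np.column_stack([np.log(grid[:, 0]), grid[:, 1] / 16.0])  # normalized

def rbf(A, B, ls=0.5):
    # Squared-exponential kernel between two sets of points.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def pdf(z): return np.exp(-z * z / 2) / sqrt(2 * pi)
def cdf(z): return 0.5 * (1 + erf(z / sqrt(2)))

# Start from a few random trials, then pick each next configuration by
# expected improvement under the Gaussian-process surrogate.
idx = list(rng.choice(len(grid), size=4, replace=False))
y = [job_runtime(*grid[i]) for i in idx]
for _ in range(20):
    Xo = X[idx]
    yo = np.array(y)
    ys = (yo - yo.mean()) / (yo.std() + 1e-9)   # standardize observations
    K = rbf(Xo, Xo) + 1e-6 * np.eye(len(idx))
    Ks = rbf(X, Xo)
    mu = Ks @ np.linalg.solve(K, ys)
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    sd = np.sqrt(np.clip(var, 1e-12, None))
    best = ys.min()
    z = (best - mu) / sd
    ei = (best - mu) * np.array([cdf(v) for v in z]) \
        + sd * np.array([pdf(v) for v in z])
    ei[idx] = -1.0                              # skip already-tried configs
    nxt = int(np.argmax(ei))
    idx.append(nxt)
    y.append(job_runtime(*grid[nxt]))

best_partitions, best_memory = grid[idx[int(np.argmin(y))]]
best_runtime = min(y)
print(f"best config: partitions={best_partitions:.0f}, "
      f"memory={best_memory:.0f} GB, runtime={best_runtime:.2f}")
```

In a production setting the synthetic `job_runtime` would be replaced by an actual Spark run (e.g. submitting the job with `spark.sql.shuffle.partitions` and `spark.executor.memory` set to the candidate values), and off-the-shelf Bayesian-optimization libraries would replace the hand-rolled surrogate.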

Papers