Web Scraping

Web scraping automates the extraction of data from websites so that information can be gathered efficiently from diverse online sources. Current research focuses on improving data quality and extraction efficiency, using techniques such as large language models (LLMs) combined with retrieval-augmented generation (RAG) architectures and bespoke content extractors tailored to specific website structures. The field matters for applications ranging from generating training datasets for machine learning models to supporting large-scale data analysis in areas such as news aggregation and medical research. It also raises concerns about sampling bias in the collected data.
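To make the idea of a bespoke content extractor concrete, here is a minimal sketch using only Python's standard-library `html.parser`. The page layout it targets (an `h1` with class `headline` and a `div` with class `article-body`), the class name `ArticleExtractor`, the helper `extract`, and the sample HTML are all hypothetical and chosen for illustration; they are not taken from any particular paper or site. A real extractor would be written against the actual markup of the target site.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collect the headline and body paragraphs from one assumed page layout."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_body = False   # True while inside <div class="article-body">
        self._div_depth = 0     # nesting depth of <div> tags inside the body
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and attrs.get("class") == "headline":
            self._in_title = True
        elif tag == "div":
            if self._in_body:
                self._div_depth += 1
            elif attrs.get("class") == "article-body":
                self._in_body = True
                self._div_depth = 1
        elif tag == "p" and self._in_body and self._div_depth == 1:
            # Only paragraphs directly inside the body container are kept,
            # so nested blocks (ads, widgets) are skipped.
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_title = False
        elif tag == "div" and self._in_body:
            self._div_depth -= 1
            if self._div_depth == 0:
                self._in_body = False
        elif tag == "p" and self._in_p:
            self._in_p = False
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif self._in_p:
            self._buf.append(data)


# Hypothetical page used only to exercise the extractor offline.
SAMPLE_HTML = """
<html><body>
  <div class="nav"><p>Home | About</p></div>
  <h1 class="headline">Example headline</h1>
  <div class="article-body">
    <p>First paragraph.</p>
    <div class="ad"><p>Buy now!</p></div>
    <p>Second   paragraph.</p>
  </div>
</body></html>
"""


def extract(html):
    """Parse an HTML string and return the extracted title and paragraphs."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "paragraphs": parser.paragraphs}
```

Calling `extract(SAMPLE_HTML)` returns the headline and the two body paragraphs while ignoring the navigation and the nested ad block. Because the rules are tied to one layout, an extractor like this is precise on the site it was written for but must be rewritten for every new site, which is one reason the LLM-based approaches mentioned above are being explored.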

Papers