Web Scraping
Web scraping automates the extraction of data from websites so that information can be gathered efficiently from diverse online sources. Current research focuses on improving data quality and extraction efficiency, using techniques such as large language models (LLMs) combined with retrieval-augmented generation (RAG) architectures, and bespoke content extractors tailored to specific website structures. The field matters for applications ranging from building training datasets for machine learning models to large-scale data analysis in areas such as news aggregation and medical research. It also raises concerns about sampling bias in the collected data.
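To make the idea of a bespoke content extractor concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not a method taken from any of the papers: the page layout (an `h1` with class `headline` and a `div` with class `article-body`), the class `ArticleExtractor`, the function `extract_article`, and the `SAMPLE` markup are all hypothetical names assumed for this example.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collects headline and paragraph text for one ASSUMED page layout."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._body_depth = 0  # div nesting depth inside the article body
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and attrs.get("class") == "headline":
            self._in_title = True
        elif tag == "div":
            # Enter the article body, or track nested divs once inside it.
            if self._body_depth > 0 or attrs.get("class") == "article-body":
                self._body_depth += 1
        elif tag == "p" and self._body_depth > 0:
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_title = False
        elif tag == "div" and self._body_depth > 0:
            self._body_depth -= 1
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            # Inline tags such as <b> are ignored; their text is kept.
            self.paragraphs[-1] += data


def extract_article(html):
    """Return the title and body paragraphs, skipping navigation and footers."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "paragraphs": [p.strip() for p in parser.paragraphs if p.strip()],
    }


# Hypothetical markup standing in for a fetched page.
SAMPLE = """
<html><body>
<div class="nav"><p>Home | About</p></div>
<h1 class="headline">Example headline</h1>
<div class="article-body">
  <p>First paragraph.</p>
  <p>Second <b>paragraph</b>.</p>
</div>
<div class="footer"><p>Contact us</p></div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_article(SAMPLE))
```

Because the selectors are hard-coded to one layout, an extractor like this tends to be precise on the site it targets but needs rewriting whenever the markup changes, and pages that do not match the layout are silently dropped. That is one way the sampling biases mentioned above can enter a scraped dataset.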