Web Crawler
Web crawlers are automated programs that systematically browse and index the World Wide Web, playing a crucial role in search engines and data extraction. Current research emphasizes improving crawler efficiency, particularly through targeted crawling strategies that prioritize relevant content and reduce unnecessary downloads, often employing machine learning models to predict promising links or infer document language. These advances matter for search engine performance, for data collection in applications such as speech recognition corpus creation, and for more effective analysis of online information, including monitoring online safety for children.
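The two ideas above — prioritizing promising links and respecting site restrictions — can be sketched in a minimal best-first (focused) crawler. This is an illustrative sketch, not any of the listed papers' methods: the keyword-based `relevance` heuristic, the in-memory `link_graph` (standing in for real page fetching and link extraction), and all URLs are hypothetical assumptions. The robots.txt handling uses Python's standard `urllib.robotparser`.

```python
import heapq
from urllib.robotparser import RobotFileParser

def relevance(url, keywords):
    # Hypothetical heuristic: count topic keywords in the URL.
    # Real focused crawlers often score anchor text or page content,
    # sometimes with a learned model.
    return sum(1 for kw in keywords if kw in url.lower())

def focused_crawl(seeds, link_graph, keywords, robots_rules, max_pages=10):
    """Best-first crawl over an in-memory link graph.

    link_graph: dict mapping a URL to the URLs it links to
    robots_rules: robots.txt text applied to every URL (simplification)
    """
    rp = RobotFileParser()
    rp.parse(robots_rules.splitlines())

    # Max-heap via negated scores: most relevant URL is expanded first.
    frontier = [(-relevance(u, keywords), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []

    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if not rp.can_fetch("*", url):  # skip disallowed pages
            continue
        visited.append(url)
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(link, keywords), link))
    return visited
```

A toy run: with keywords `["crawler", "research"]`, a link like `http://example.com/crawler-research` scores higher than `http://example.com/sports` and is expanded first, while any URL under a `Disallow`ed path is skipped entirely.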
Papers
Targeted and Troublesome: Tracking and Advertising on Children's Websites
Zahra Moti, Asuman Senol, Hamid Bostani, Frederik Zuiderveen Borgesius, Veelasha Moonsamy, Arunesh Mathur, Gunes Acar
Web crawler strategies for web pages under robot.txt restriction
Piyush Vyas, Akhilesh Chauhan, Tushar Mandge, Surbhi Hardikar