Issue | BIO Web Conf., Volume 102, 2024: 70th Scientific Conference with International Participation “FOOD SCIENCE, ENGINEERING AND TECHNOLOGY – 2023”
---|---
Article Number | 03008
Number of page(s) | 5
Section | Food Process Engineering
DOI | https://doi.org/10.1051/bioconf/202410203008
Published online | 11 April 2024

JavaScript Web Scraping Tool for Extraction Information from Agriculture Websites
University of Food Technology, 26 Maritza Blvd, Plovdiv 4002, Bulgaria
* Corresponding author: m_jekova@uft-plovdiv.bg
Information can be extracted from an information platform, site or system if it is structured or annotated in a way that supports subsequent analysis, data processing, decision making and reasoning. The goal of this paper is to review and categorize techniques, tools and libraries for extracting information from unstructured web content (platforms, sites, systems), and to develop a JavaScript application that crawls dynamic web pages and extracts data from them without the user having to browse, read and search the page content. The paper presents an implementation of a JavaScript web scraper that retrieves a list of news headlines from the official European Union Agriculture and Rural Development website. The scraper is configured to extract the requested content directly from the HTML source of the document, regardless of whether the information is explicit or implicit. It also searches all pages related to the document and, finally, exports the data in a suitable format. The benefits of extracting web content from source code with such a tool are savings in time, manual labour and resources when generating quality content in the biotech and agriculture industry.
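The workflow summarized in the abstract (extract headlines from the HTML source, follow related pages, export the result) can be illustrated with a minimal sketch. This is not the authors' implementation: the markup patterns (`<h3 class="headline">`, `<a rel="next">`), the function names and the injected `fetchPage` callback are all assumptions made for illustration, and the regular expressions stand in for whatever HTML parsing the paper actually uses.

```javascript
// Hypothetical markup, assumed for illustration only:
//   <h3 class="headline"><a href="/news/x">Title</a></h3>
//   <a rel="next" href="/news?page=2">Next</a>
const HEADLINE_RE = /<h3 class="headline">\s*<a href="([^"]+)">([^<]+)<\/a>\s*<\/h3>/g;
const NEXT_RE = /<a rel="next" href="([^"]+)">/;

// Collect { title, url } records from one page of HTML source.
function extractHeadlines(html, baseUrl) {
  const rows = [];
  for (const match of html.matchAll(HEADLINE_RE)) {
    rows.push({
      title: match[2].trim(),
      url: new URL(match[1], baseUrl).href, // resolve relative links
    });
  }
  return rows;
}

// Return the absolute URL of the next page, or null if there is none.
function extractNextPageUrl(html, baseUrl) {
  const match = NEXT_RE.exec(html);
  return match ? new URL(match[1], baseUrl).href : null;
}

// Follow "next" links from startUrl, visiting at most maxPages pages.
// fetchPage is an injected async function: url -> HTML string.
async function scrape(startUrl, fetchPage, maxPages = 10) {
  const rows = [];
  const visited = new Set();
  let url = startUrl;
  while (url && !visited.has(url) && visited.size < maxPages) {
    visited.add(url);
    const html = await fetchPage(url);
    rows.push(...extractHeadlines(html, url));
    url = extractNextPageUrl(html, url);
  }
  return rows;
}

// Export the records as CSV, doubling any embedded quotes.
function toCsv(rows) {
  const quote = (value) => `"${String(value).replace(/"/g, '""')}"`;
  const lines = rows.map((row) => [row.title, row.url].map(quote).join(','));
  return ['"title","url"', ...lines].join('\n');
}
```

On Node.js 18 or later, `fetchPage` could be as simple as `(url) => fetch(url).then((res) => res.text())`. Pages that build their content with client-side scripts would need a rendering step before extraction, and a proper HTML parser would be more robust than the regular expressions used here.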
© The Authors, published by EDP Sciences, 2024
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.