Search Results
160 results for "Web Crawling"
Tutorial: Legality and Ethics of Web Scraping
Researchers and practitioners often use various tools and technologies to automatically retrieve data from the Web (often referred to as Web scraping) when conducting their projects. Unfortunately, they often overlook the legality and ethics of using these tools to collect data. Failure to pay due attention to these aspects of Web scraping can result in serious ethical controversies and lawsuits. Accordingly, we review legal literature together with the literature on ethics and privacy to identify broad areas of concern together with a list of specific questions that researchers and practitioners engaged in Web scraping need to address. Reflecting on these questions and concerns can potentially help researchers and practitioners decrease the likelihood of ethical and legal controversies in their work.
Efficient intelligent crawler for hamming distance based on prioritization of web documents
Search engines play a crucial role in today's Internet landscape, especially with the exponential growth in stored data. Ranking models, an integral component of a search engine, are used to locate relevant pages and rank them in decreasing order of relevance. Offline gathering of documents is crucial for providing the user with more accurate and pertinent results. With the web's ongoing expansion, the number of documents that need to be crawled has grown enormously. Because the resources available for continuous crawling are fixed, any academic or mid-sized organization must prioritize wisely the documents to be crawled in each iteration. The benefits of prioritization are realized by algorithms designed to operate within the existing crawling pipeline; to avoid becoming the bottleneck in that pipeline, these algorithms must be fast and efficient. A highly efficient and intelligent web crawler has been developed that employs the Hamming distance method to prioritize the pages to be downloaded in each iteration, making the crawling process more streamlined and effective. Compared with other existing methods, the implemented Hamming distance method achieves 99.8% accuracy.
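The abstract does not spell out how the Hamming distance is computed or applied; a minimal sketch of one plausible reading, in which each page gets a simhash-style bit fingerprint and frontier URLs are prioritized by Hamming distance to a known-relevant reference page, is given below. The fingerprint function, the 64-bit width, and the priority rule are illustrative assumptions, not the paper's implementation.

    import hashlib
    import heapq

    def fingerprint(text: str, bits: int = 64) -> int:
        """Simhash-style fingerprint: every token votes on each bit position."""
        votes = [0] * bits
        for token in text.lower().split():
            h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
            for i in range(bits):
                votes[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i, v in enumerate(votes) if v > 0)

    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    # Priority queue of (distance, url): frontier pages whose snippets look most
    # similar to a known-relevant reference page are crawled first.
    reference = fingerprint("web crawler search engine ranking documents")
    candidates = {
        "http://example.org/crawling-tutorial": "focused web crawler ranking pages",
        "http://example.org/cooking-recipes": "pasta recipes with tomato sauce",
    }
    frontier = []
    for url, snippet in candidates.items():
        heapq.heappush(frontier, (hamming(reference, fingerprint(snippet)), url))

    while frontier:
        distance, url = heapq.heappop(frontier)
        print(f"crawl {url} (Hamming distance {distance})")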
The impact of energy certification on the real estate market: an analysis using Big Data and hedonic pricing
This study analyses the effect of Energy Performance Certificates (EPC) on residential buildings in Padua (Italy). Introduced by the 2002/91/EC Directive, EPCs became mandatory in European Union countries for buildings offered for sale or lease. Using web scraping, we collected 5,188 real estate offers in 2022 and 2023, of which 1,738 were in the ‘apartment’ category and suitable for data analysis. We examined EPC effects on prices both in aggregate terms (2022–2023 combined) and by year, to test short-run stability. Results confirm previous findings: a price premium emerges as energy classes improve, with the highest values for top EPC ratings. While most housing characteristics showed stable short-run effects, EPC classes revealed a stronger and more significant impact over time, especially for lower energy classes.
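Hedonic pricing boils down to regressing (log) price on property characteristics plus energy-class indicators, so each EPC coefficient approximates a percentage premium relative to the reference class. A minimal sketch with statsmodels follows; the column names, toy figures, and log-linear form are assumptions for illustration, not the study's specification.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Toy listings; in the study these would come from the scraped offers.
    listings = pd.DataFrame({
        "price":      [150000, 180000, 205000, 230000, 250000, 280000, 310000, 340000],
        "floor_area": [60, 70, 78, 85, 90, 100, 110, 120],
        "rooms":      [2, 3, 3, 3, 4, 4, 5, 5],
        "epc_class":  ["G", "G", "C", "C", "C", "A", "A", "A"],
    })

    # Log-linear hedonic model: EPC class enters as categorical dummies, so each
    # coefficient is (approximately) a percentage premium over the reference class.
    model = smf.ols("np.log(price) ~ floor_area + rooms + C(epc_class)",
                    data=listings).fit()
    print(model.params)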
Tractable near-optimal policies for crawling
The problem of maintaining a local cache of n constantly changing pages arises in multiple mechanisms such as web crawlers and proxy servers. In these, the resources for polling pages for possible updates are typically limited. The goal is to devise a polling and fetching policy that maximizes the utility of served pages that are up to date. Cho and Garcia-Molina [(2003) ACM Trans Database Syst 28:390–426] formulated this as an optimization problem, which can be solved numerically for small values of n, but appears intractable in general. Here, we show that the optimal randomized policy can be found exactly in O(n log n) operations. Moreover, using the optimal probabilities to define in linear time a deterministic schedule yields a tractable policy that in experiments attains 99% of the optimum.
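The abstract mentions converting optimal randomized polling probabilities into a deterministic schedule. One common and simple way to do this, not necessarily the paper's construction, is a credit-based (weighted round-robin) scheduler, sketched below with made-up probabilities.

    def deterministic_schedule(probabilities, slots):
        """Credit-based schedule: each slot, every page earns credit equal to its
        polling probability; the page with the most credit is polled and charged one unit."""
        credits = [0.0] * len(probabilities)
        schedule = []
        for _ in range(slots):
            for i, p in enumerate(probabilities):
                credits[i] += p
            page = max(range(len(credits)), key=lambda i: credits[i])
            credits[page] -= 1.0
            schedule.append(page)
        return schedule

    # Three pages with hypothetical polling probabilities summing to 1.
    print(deterministic_schedule([0.5, 0.3, 0.2], slots=10))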
Exploring Public Response to COVID-19 on Weibo with LDA Topic Modeling and Sentiment Analysis
It is necessary and important to understand public responses to crises, including disease outbreaks. Traditionally, surveys have played an essential role in collecting public opinion, while nowadays, with the increasing popularity of social media, mining social media data serves as another popular tool in opinion mining research. To understand the public response to COVID-19 on Weibo, this research collects 719,570 Weibo posts through a web crawler and analyzes the data with text mining techniques, including Latent Dirichlet Allocation (LDA) topic modeling and sentiment analysis. It is found that, in response to the COVID-19 outbreak, people learn about COVID-19, show their support for frontline warriors, encourage each other, take preventive measures, and express concerns about economic recovery and the restoration of daily life. Analysis of sentiments and semantic networks further reveals that country media, as well as influential individuals and “self-media,” together contribute to the spread of information carrying positive sentiment.
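A minimal sketch of the LDA step with scikit-learn is shown below; the toy English posts, tokenization, and topic count are placeholders rather than the study's pipeline, which works on Chinese Weibo text and additionally applies sentiment analysis.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Toy English stand-ins for Weibo posts; the study tokenizes Chinese text.
    posts = [
        "stay home wash hands wear mask",
        "support doctors nurses frontline hospital",
        "worried about jobs economy reopening",
        "mask mask hands sanitizer prevention",
        "economy shops reopening work salary",
        "thank you doctors nurses stay strong",
    ]

    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(posts)

    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    lda.fit(counts)

    # Print the top terms of each inferred topic.
    terms = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[-4:][::-1]]
        print(f"topic {k}: {', '.join(top)}")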
Global Trends in Social Prescribing: Web-Based Crawling Approach
Social loneliness is a prevalent issue in industrialized countries that can lead to adverse health outcomes, including a 26% increased risk of premature mortality, coronary heart disease, stroke, depression, cognitive impairment, and Alzheimer disease. The United Kingdom has implemented a strategy to address loneliness, including social prescribing, a health care model where physicians prescribe nonpharmacological interventions to tackle social loneliness. However, there is a need for evidence-based plans for global social prescribing dissemination. This study aims to identify global trends in social prescribing from 2018. To this end, we intend to collect and analyze words related to social prescribing worldwide and evaluate various trends of related words by classifying the core areas of social prescribing. Google's searchable data were collected to analyze web-based data related to social prescribing. With the help of web crawling, 3796 news items were collected for the 5-year period from 2018 to 2022. Key topics were selected to identify keywords for each major topic related to social prescribing. The topics were grouped into 4 categories, namely Healthy, Program, Governance, and Target, and keywords for each topic were selected thereafter. Text mining was used to determine the importance of words collected from new data. Word clouds were generated for words related to social prescribing, which collected 3796 words from Google News databases, including 128 in 2018, 432 in 2019, 566 in 2020, 748 in 2021, and 1922 in 2022, increasing nearly 15-fold between 2018 and 2022 (5 years). Words such as health, prescribing, and GPs (general practitioners) were the highest in terms of frequency in the list for all the years. Between 2020 and 2021, COVID, gardening, and UK were found to be highly related words. In 2022, NHS (National Health Service) and UK ranked high. This dissertation examines social prescribing-related term frequency and classification (2018-2022) in Healthy, Program, Governance, and Target categories. Key findings include increased "Healthy" terms from 2020, "gardening" prominence in "Program," "community" growth across categories, and "Target" term spikes in 2021. This study's discussion highlights four key aspects: (1) the "Healthy" category trends emphasize mental health, cancer, and sleep; (2) the "Program" category prioritizes gardening, community, home-schooling, and digital initiatives; (3) "Governance" underscores the significance of community resources in social prescribing implementation; and (4) "Target" focuses on 4 main groups: individuals with long-term conditions, low-level mental health issues, social isolation, or complex social needs impacting well-being. Social prescribing is gaining global acceptance and is becoming a global national policy, as the world is witnessing a sharp rise in the aging population, noncontagious diseases, and mental health problems. A successful and sustainable model of social prescribing can be achieved by introducing social prescribing schemes based on the understanding of roles and the impact of multisectoral partnerships.
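The term-frequency analysis described above can be approximated with a simple counter over the collected headlines, grouped by year. The sketch below uses made-up records and a hand-picked stop list, not the study's text-mining setup.

    from collections import Counter, defaultdict

    # Hypothetical (year, headline) records standing in for the crawled news items.
    records = [
        (2020, "GPs turn to gardening on prescription during COVID"),
        (2021, "Social prescribing link workers tackle loneliness in the UK"),
        (2022, "NHS expands social prescribing for mental health"),
        (2022, "Community gardening programmes grow under NHS scheme"),
    ]
    stopwords = {"the", "to", "on", "in", "for", "under", "during"}

    frequency_by_year = defaultdict(Counter)
    for year, headline in records:
        words = [w.strip(".,").lower() for w in headline.split()]
        frequency_by_year[year].update(w for w in words if w not in stopwords)

    for year in sorted(frequency_by_year):
        print(year, frequency_by_year[year].most_common(3))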
A novel combining method of dynamic and static web crawler with parallel computing
Recovering information from a targeted website that undergoes dynamic changes is a complicated undertaking. It necessitates the use of a highly efficient web crawler by search engines. In this study, we merged two web crawlers, Selenium with parallel computing capabilities and Scrapy, to gather electron molecular collision cross-section data from the National Fusion Research Institute (NFRI) database. The method effectively combines static and dynamic web crawling. The primary challenges lie in the time-consuming nature of dynamic web crawling with Selenium and in Scrapy's limited support for parallel computing within its “download middleware”. Nevertheless, this combined approach proves exceptionally well suited to extracting data from an online database comprising multiple web pages whose URLs do not change when specific keywords are submitted. We applied natural language processing techniques to identify species and dissect reaction formulas into various states. Employing these methodologies, we extracted a total of 76,893 data points pertaining to 112 species. These data offer detailed insight into the processes unfolding within the plasma and were collected within a span of ten minutes. Compared to traditional web crawling methods, our approach is roughly 100 times faster than dynamic web crawlers and more flexible than static web crawlers. In this report, we present the retrieved results, encompassing reaction formulas, reference information, species metadata, and a time comparison among the various methods.
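A rough sketch of the combination described above, rendering JavaScript-heavy pages with headless Selenium in parallel workers and parsing the resulting HTML with parsel (the selector library Scrapy uses), is given below; the URLs, selectors, and thread-pool layout are assumptions for illustration, not the authors' pipeline.

    from concurrent.futures import ThreadPoolExecutor

    from parsel import Selector                      # Scrapy's own selector library
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    def render_and_parse(url: str) -> list[str]:
        """Render a JavaScript-heavy page with headless Chrome, then extract
        table cells with a static (Scrapy-style) CSS selector."""
        options = Options()
        options.add_argument("--headless=new")
        driver = webdriver.Chrome(options=options)
        try:
            driver.get(url)
            selector = Selector(text=driver.page_source)
            return selector.css("table tr td::text").getall()
        finally:
            driver.quit()

    # Placeholder URLs; the real database pages keep the same URL per keyword query.
    urls = [
        "https://example.org/cross-sections?species=H2",
        "https://example.org/cross-sections?species=He",
    ]

    # Render several dynamic pages in parallel, one browser per worker.
    with ThreadPoolExecutor(max_workers=2) as pool:
        for url, cells in zip(urls, pool.map(render_and_parse, urls)):
            print(url, len(cells), "cells extracted")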
Crawling the German Health Web: Exploratory Study and Graph Analysis
The internet has become an increasingly important resource for health information. However, with a growing number of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of all web-based health information as given in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the necessary data for a computational and statistical analysis to answer (1) to (3). This study demonstrates the suitability of a focused crawler for the acquisition of the German Health Web (GHW), which includes all health-related web content of the three predominantly German-speaking countries: Germany, Austria, and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW's graph structure covering its size, most important content providers, and the ratio of public to private stakeholders. In addition, we provide our experiences in building and operating such a highly scalable crawler. A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non-health-related web pages. The classifier was evaluated using accuracy, recall and precision on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl was conducted for 227 days. The crawler was evaluated using harvest rate, and its recall was estimated using a seed-target approach. In total, n=22,405 seed URLs with country-code top level domains .de: 85.36% (19,126/22,405), .at: 6.83% (1530/22,405), .ch: 7.81% (1749/22,405), were collected from Curlie and a previous crawl. The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954) and a recall on TD1 of 0.944 (TD2=0.989). The crawl yielded 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were websites published by public institutions, 25% (19/75) were published by nonprofit organizations, and 35% (26/75) by private organizations or individuals. The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistical data allows for determining major information hubs and important content providers on the GHW. In the future, the acquired data may be used not only to assess important topics and trends but also to build health-specific search engines.
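A minimal sketch of the classification step, an SVM over TF-IDF features with an 80/20 split as in the TD1 evaluation, is shown below; the toy English page texts and the TfidfVectorizer/LinearSVC choices are placeholders, since the paper's classifier is trained on a large German corpus.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score, precision_score, recall_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy page texts; 1 = health-related, 0 = not.
    pages = [
        "symptoms of diabetes and treatment options",
        "vaccination schedule for children and adults",
        "used cars for sale in your area",
        "football league results and transfer news",
        "tips for managing high blood pressure",
        "cheap flights and hotel deals this summer",
    ]
    labels = [1, 1, 0, 0, 1, 0]

    # 80/20 split, mirroring the paper's TD1 evaluation setup.
    X_train, X_test, y_train, y_test = train_test_split(
        pages, labels, test_size=0.2, random_state=0, stratify=labels)

    classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
    classifier.fit(X_train, y_train)

    predicted = classifier.predict(X_test)
    print("accuracy:", accuracy_score(y_test, predicted))
    print("precision:", precision_score(y_test, predicted, zero_division=0))
    print("recall:", recall_score(y_test, predicted, zero_division=0))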
A Method of Constructing a Food Classification Image Dataset by Cleansing Web-Crawling Data
We propose the construction of image datasets via data cleansing for food recognition using a convolutional neural network (CNN). A dataset was constructed by using web crawling to collect food images and their classes from sites that post cooking recipes. The collected images included images that cannot be effectively learned by the CNN. Examples include images of foods that look extremely similar to other foods, or images with mismatched foods and classes. Here, these images were termed “content and description discrepancy images.” The number of images was reduced using two criteria based on the food recognition results obtained using CNNs. The first criterion was a threshold for the difference in the estimated probabilities, and the second was whether the estimated class and food class matched. These criteria were applied using multiple classifiers. Based on the results, the dataset size was reduced and a new image dataset was constructed. A CNN was trained on the constructed image dataset, and the food recognition accuracy was calculated and compared using a test dataset. The results showed that the accuracy obtained with the dataset constructed using the proposed method was 7.4% higher than that obtained with the dataset taken directly from web crawling. This study demonstrates that the proposed method can efficiently construct a food image dataset, demonstrating the data-cleansing effect of the two selected criteria.
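A minimal sketch of the two cleansing criteria, a top-1 versus top-2 probability margin and agreement between the predicted and crawled class, applied across multiple classifiers, is shown below; the probability values and the 0.30 threshold are made up for illustration.

    import numpy as np

    # Hypothetical per-image class probabilities from two pre-trained classifiers
    # (rows = images, columns = food classes), plus the class from the recipe site.
    probabilities_a = np.array([[0.80, 0.15, 0.05],
                                [0.40, 0.35, 0.25],
                                [0.10, 0.85, 0.05]])
    probabilities_b = np.array([[0.75, 0.20, 0.05],
                                [0.30, 0.45, 0.25],
                                [0.15, 0.80, 0.05]])
    labels = np.array([0, 0, 1])
    margin_threshold = 0.30   # assumed value, not the paper's

    def passes(probs: np.ndarray) -> np.ndarray:
        """Criterion 1: top-1 vs top-2 probability margin exceeds the threshold.
        Criterion 2: the predicted class matches the crawled class label."""
        top2 = np.sort(probs, axis=1)[:, -2:]
        margin_ok = (top2[:, 1] - top2[:, 0]) >= margin_threshold
        class_ok = probs.argmax(axis=1) == labels
        return margin_ok & class_ok

    # Keep only images accepted by every classifier; the rest are treated as
    # "content and description discrepancy images" and dropped.
    keep = passes(probabilities_a) & passes(probabilities_b)
    print("kept image indices:", np.flatnonzero(keep))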
A Keyword-based IP Tracking Method for Illegal Web Content Distribution Using Port Scanning on HTTP and HTTPS
The rapid expansion of online content distribution has led to a significant increase in copyright infringement, where unauthorized works are illegally shared through various web-based platforms. To fundamentally block these copyright-infringing websites, it is essential to accurately identify the IP address or physical location of the original server. However, most illegal content distribution sites utilize advanced security mechanisms, such as DNS resolvers, reverse proxies, and anonymization techniques, to conceal their true IP addresses, making direct tracking increasingly difficult. These evasive tactics allow illegal sites to continue operating while avoiding enforcement measures. To address this challenge, this paper proposes a keyword-based IP tracking method for identifying illegal web content distribution sites by leveraging port scanning on HTTP and HTTPS (ports 80 and 443). The proposed approach systematically detects and analyzes servers that provide unauthorized content by scanning network ports commonly used for web services. By correlating detected IP addresses with keyword-based filtering techniques, this method enables efficient tracking of illegal sites that actively hide their original server’s IP address. Through experimental validation, the proposed method successfully pinpoints the IP addresses of illegal content distribution servers, even when they employ obfuscation techniques to mask their identity. This study contributes to enhancing copyright protection by introducing a web-based detection approach that integrates network security techniques, web engineering principles, and automated keyword analysis. Furthermore, the findings provide a practical solution for law enforcement agencies, copyright holders, and regulatory bodies to combat illegal web content distribution more effectively.
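A stripped-down illustration of the basic mechanics, a TCP connect check on ports 80 and 443 followed by a keyword match on the returned landing page, is sketched below; the host (a TEST-NET placeholder), the keywords, and the single-page fetch stand in for the paper's correlation and tracking logic, which is not reproduced here.

    import socket

    import requests

    def http_ports_open(host: str, timeout: float = 2.0) -> list[int]:
        """Return which of the standard web ports (80, 443) accept connections."""
        open_ports = []
        for port in (80, 443):
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    open_ports.append(port)
            except OSError:
                pass
        return open_ports

    def matches_keywords(host: str, port: int, keywords: list[str]) -> bool:
        """Fetch the landing page on an open port and look for target keywords."""
        scheme = "https" if port == 443 else "http"
        try:
            response = requests.get(f"{scheme}://{host}", timeout=5, verify=False)
        except requests.RequestException:
            return False
        body = response.text.lower()
        return any(keyword.lower() in body for keyword in keywords)

    host = "198.51.100.7"                      # placeholder address (TEST-NET-2)
    keywords = ["free streaming", "download full episodes"]
    for port in http_ports_open(host):
        if matches_keywords(host, port, keywords):
            print(f"{host}:{port} serves pages matching the target keywords")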