Catalogue Search | MBRL

Web scraping with Python : collecting more data from the modern web

by Mitchell, Ryan E., author in Python (Computer program language) , Data mining. , Automatic data collection systems.

If programming is magic then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. The expanded edition of this practical book not only introduces you to web scraping, but also serves as a comprehensive guide to scraping almost every type of data from the modern web. Part I focuses on web scraping mechanics: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you're likely to encounter.

Book

The National Health and Nutrition Examination Survey (NHANES), 2021–2022: Adapting Data Collection in a COVID-19 Environment

by Ahluwalia, Namanjeet , Woodwell, David , Paulose-Ram, Ryne in Adult , Analytic , Communicable Disease Control - organization & administration

2021

The National Health and Nutrition Examination Survey (NHANES) is a unique source of national data on the health and nutritional status of the US population, collecting data through interviews, standard exams, and biospecimen collection. Because of the COVID-19 pandemic, NHANES data collection was suspended, with more than a year gap in data collection. NHANES resumed operations in 2021 with the NHANES 2021–2022 survey, which will monitor the health and nutritional status of the nation while adding to the knowledge of COVID-19 in the US population. This article describes the reshaping of the NHANES program and, specifically, the planning of NHANES 2021–2022 for data collection during the COVID-19 pandemic. Details are provided on how NHANES transformed its participant recruitment and data collection plans at home and at the mobile examination center to safely collect data in a COVID-19 environment. The potential implications for data users are also discussed. (Am J Public Health. 2021;111(12):2149–2156. https://doi.org/10.2105/AJPH.2021.306517 )

Journal Article

Practical web scraping for data science : best practices and examples with Python

by Broucke, Seppe vanden, 1986- author , Baesens, Bart, author in Python (Computer program language) , Data mining. , Automatic data collection systems.

This book provides a complete and modern guide to web scraping, using Python as the programming language, without glossing over important details or best practices. Written with a data science audience in mind, the book explores both scraping and the larger context of web technologies in which it operates, to ensure full understanding. The authors recommend web scraping as a powerful tool for any data scientist's arsenal, as many data science projects start by obtaining an appropriate data set.

Book

Ensuring the quality and specificity of preregistrations

by Bakker, Marjan , Wicherts, Jelte M. , Veldkamp, Coosje L. S. in Analysis , Behavioral sciences , Biology and Life Sciences

2020

Researchers face many, often seemingly arbitrary, choices in formulating hypotheses, designing protocols, collecting data, analyzing data, and reporting results. Opportunistic use of “researcher degrees of freedom” aimed at obtaining statistical significance increases the likelihood of obtaining and publishing false-positive results and overestimated effect sizes. Preregistration is a mechanism for reducing such degrees of freedom by specifying designs and analysis plans before observing the research outcomes. The effectiveness of preregistration may depend, in part, on whether the process facilitates sufficiently specific articulation of such plans. In this preregistered study, we compared 2 formats of preregistration available on the OSF: Standard Pre-Data Collection Registration and Prereg Challenge Registration (now called “OSF Preregistration,” http://osf.io/prereg/ ). The Prereg Challenge format was a “structured” workflow with detailed instructions and an independent review to confirm completeness; the “Standard” format was “unstructured” with minimal direct guidance to give researchers flexibility for what to prespecify. Results of comparing random samples of 53 preregistrations from each format indicate that the “structured” format restricted the opportunistic use of researcher degrees of freedom better (Cliff’s Delta = 0.49) than the “unstructured” format, but neither eliminated all researcher degrees of freedom. We also observed very low concordance among coders about the number of hypotheses (14%), indicating that they are often not clearly stated. We conclude that effective preregistration is challenging, and registration formats that provide effective guidance may improve the quality of research.

Journal Article

Website scraping with Python: using BeautifulSoup and Scrapy

by Hajba, Gábor László, author in Python (Computer program language) , Downloading of data. , Information retrieval.

2018

"Closely examine website scraping and data processing: the technique of extracting data from websites in a format suitable for further analysis. You'll review which tools to use, and compare their features and efficiency. Focusing on BeautifulSoup4 and Scrapy, this concise, focused book highlights common problems and suggests solutions that readers can implement on their own. Website Scraping with Python starts by introducing and installing the scraping tools and explaining the features of the full application that readers will build throughout the book. You'll see how to use BeautifulSoup4 and Scrapy individually or together to achieve the desired results. Because many sites use JavaScript, you'll also employ Selenium with a browser emulator to render these sites and make them ready for scraping. By the end of this book, you'll have a complete scraping application to use and rewrite to suit your needs. As a bonus, the author shows you options of how to deploy your spiders into the Cloud to leverage your computer from long-running scraping tasks"--Back cover.

Book

Evaluation of Electronic and Paper-Pen Data Capturing Tools for Data Quality in a Public Health Survey in a Health and Demographic Surveillance Site, Ethiopia: Randomized Controlled Crossover Health Care Information Technology Evaluation

by Wilken, Marc , Zeleke, Atinkut Alamirrew , Worku, Abebaw Gebeyehu in Adult , Cross-Over Studies , Data Accuracy

2019

Periodic demographic health surveillance and surveys are the main sources of health information in developing countries. Conducting a survey requires extensive use of paper-pen and manual work and lengthy processes to generate the required information. Despite the rise of popularity in using electronic data collection systems to alleviate the problems, sufficient evidence is not available to support the use of electronic data capture (EDC) tools in interviewer-administered data collection processes. This study aimed to compare data quality parameters in the data collected using mobile electronic and standard paper-based data capture tools in one of the health and demographic surveillance sites in northwest Ethiopia. A randomized controlled crossover health care information technology evaluation was conducted from May 10, 2016, to June 3, 2016, in a demographic and surveillance site. A total of 12 interviewers, as 2 individuals (one of them with a tablet computer and the other with a paper-based questionnaire) in 6 groups were assigned in the 6 towns of the surveillance premises. Data collectors switched the data collection method based on computer-generated random order. Data were cleaned using a MySQL program and transferred to SPSS (IBM SPSS Statistics for Windows, Version 24.0) and R statistical software (R version 3.4.3, the R Foundation for Statistical Computing Platform) for analysis. Descriptive and mixed ordinal logistic analyses were employed. The qualitative interview audio record from the system users was transcribed, coded, categorized, and linked to the International Organization for Standardization 9241-part 10 dialogue principles for system usability. The usability of this open data kit-based system was assessed using quantitative System Usability Scale (SUS) and matching of qualitative data with the isometric dialogue principles. 
From the submitted 1246 complete records of questionnaires in each tool, 41.89% (522/1246) of the paper and pen data capture (PPDC) and 30.89% (385/1246) of the EDC tool questionnaires had one or more types of data quality errors. The overall error rates were 1.67% and 0.60% for PPDC and EDC, respectively. The chances of more errors on the PPDC tool were multiplied by 1.015 for each additional question in the interview compared with EDC. The SUS score of the data collectors was 85.6. In the qualitative data response mapping, EDC had more positive suitability of task responses with few error tolerance characteristics. EDC possessed significantly better data quality and efficiency compared with PPDC, explained with fewer errors, instant data submission, and easy handling. The EDC proved to be a usable data collection tool in the rural study setting. Implementation organization needs to consider consistent power source, decent internet connection, standby technical support, and security assurance for the mobile device users for planning full-fledged implementation and integration of the system in the surveillance site.

Journal Article

Blockchain and clinical trial : securing patient data

by Jahankhani, Hamid, editor , Kendzierskyj, Stefan, editor , Jamal, Arshad (Associate Dean), editor in Blockchains (Databases) , Data integrity. , Clinical trials Data processing.

"This book aims to highlight the gaps and the transparency issues in the clinical research and trials processes and how there is a lack of information flowing back to researchers and patients involved in those trials. Lack of data transparency is an underlying theme within the clinical research world and causes issues of corruption, fraud, errors and a problem of reproducibility. Blockchain can prove to be a method to ensure a much more joined up and integrated approach to data sharing and improving patient outcomes. Surveys undertaken by creditable organisations in the healthcare industry are analysed in this book that show strong support for using blockchain technology regarding strengthening data security, interoperability and a range of beneficial use cases where mostly all respondents of the surveys believe blockchain will be important for the future of the healthcare industry. Another aspect considered in the book is the coming surge of healthcare wearables using Internet of Things (IoT) and the prediction that the current capacity of centralised networks will not cope with the demands of data storage. The benefits are great for clinical research, but will add more pressure to the transparency of clinical trials and how this is managed unless a secure mechanism like, blockchain is used"--Publisher's description.

Book

Addressing missing data in randomized clinical trials: A causal inference perspective

by Cornelisz, Ilja , van Klaveren, Chris , Donker, Tara in Analysis , Bias , Causal inference

2020

The importance of randomization in clinical trials has long been acknowledged for avoiding selection bias. Yet, bias concerns re-emerge with selective attrition. This study takes a causal inference perspective in addressing distinct scenarios of missing outcome data (MCAR, MAR and MNAR). This study adopts a causal inference perspective in providing an overview of empirical strategies to estimate the average treatment effect, improve precision of the estimator, and to test whether the underlying identifying assumptions hold. We propose to use Random Forest Lee Bounds (RFLB) to address selective attrition and to obtain more precise average treatment effect intervals. When assuming MCAR or MAR, the often untenable identifying assumptions with respect to causal inference can hardly be verified empirically. Instead, missing outcome data in clinical trials should be considered as potentially non-random unobserved events (i.e. MNAR). Using simulated attrition data, we show how average treatment effect intervals can be tightened considerably using RFLB, by exploiting both continuous and discrete attrition predictor variables. Bounding approaches should be used to acknowledge selective attrition in randomized clinical trials in acknowledging the resulting uncertainty with respect to causal inference. As such, Random Forest Lee Bounds estimates are more informative than point estimates obtained assuming MCAR or MAR.

Journal Article

Collecting experiments : making Big Data biology

by Strasser, Bruno J., author in Biology, Experimental Data processing. , Biology, Experimental Databases. , Biological models Data processing.

2019

Databases have revolutionized nearly every aspect of our lives. Information of all sorts is being collected on a massive scale, from Google to Facebook and well beyond. But as the amount of information in databases explodes, we are forced to reassess our ideas about what knowledge is, how it is produced, to whom it belongs, and who can be credited for producing it. Every scientist working today draws on databases to produce scientific knowledge. Databases have become more common than microscopes, voltmeters, and test tubes, and the increasing amount of data has led to major changes in research practices and profound reflections on the proper professional roles of data producers, collectors, curators, and analysts. Collecting Experiments traces the development and use of data collections, especially in the experimental life sciences, from the early twentieth century to the present. It shows that the current revolution is best understood as the coming together of two older ways of knowing--collecting and experimenting, the museum and the laboratory. Ultimately, Bruno J. Strasser argues that by serving as knowledge repositories, as well as indispensable tools for producing new knowledge, these databases function as digital museums for the twenty-first century.

Book

Design Guidelines for Improving Mobile Sensing Data Collection: Prospective Mixed Methods Study

by Washington, Peter , Slade, Christopher , Benzo, Roberto M in Adult , Data Collection - instrumentation , Data Collection - methods

2024

Machine learning models often use passively recorded sensor data streams as inputs to train machine learning models that predict outcomes captured through ecological momentary assessments (EMA). Despite the growth of mobile data collection, challenges in obtaining proper authorization to send notifications, receive background events, and perform background tasks persist. We investigated challenges faced by mobile sensing apps in real-world settings in order to develop design guidelines. For active data, we compared 2 prompting strategies: setup prompting, where the app requests authorization during its initial run, and contextual prompting, where authorization is requested when an event or notification occurs. Additionally, we evaluated 2 passive data collection paradigms: collection during scheduled background tasks and persistent reminders that trigger passive data collection. We investigated the following research questions (RQs): (RQ1) how do setup prompting and contextual prompting affect scheduled notification delivery and the response rate of notification-initiated EMA? (RQ2) Which authorization paradigm, setup or contextual prompting, is more successful in leading users to grant authorization to receive background events? and (RQ3) Which polling-based method, persistent reminders or scheduled background tasks, completes more background sessions? We developed mobile sensing apps for iOS and Android devices and tested them through a 30-day user study asking college students (n=145) about their stress levels. Participants responded to a daily EMA question to test active data collection. The sensing apps collected background location events, polled for passive data with persistent reminders, and scheduled background tasks to test passive data collection. For RQ1, setup and contextual prompting yielded no significant difference (ANOVA F =0.0227; P=.88) in EMA compliance, with an average of 23.4 (SD 7.36) out of 30 assessments completed. 
However, qualitative analysis revealed that contextual prompting on iOS devices resulted in inconsistent notification deliveries. For RQ2, contextual prompting for background events was 55.5% (χ²=4.4; P=.04) more effective in gaining authorization. For RQ3, users demonstrated resistance to installing the persistent reminder, but when installed, the persistent reminder performed 226.5% more background sessions than traditional background tasks. We developed design guidelines for improving mobile sensing on consumer mobile devices based on our qualitative and quantitative results. Our qualitative results demonstrated that contextual prompts on iOS devices resulted in inconsistent notification deliveries, unlike setup prompting on Android devices. We therefore recommend using setup prompting for EMA when possible. We found that contextual prompting is more efficient for authorizing background events. We therefore recommend using contextual prompting for passive sensing. Finally, we conclude that developing a persistent reminder and requiring participants to install it provides an additional way to poll for sensor and user data and could improve data collection to support adaptive interventions powered by machine learning.

Journal Article
