Catalogue Search | MBRL

Using systematic data categorisation to quantify the types of data collected in clinical trials: the DataCat project

by Crowley, Evelyn; McDonald, Alison; Breeman, Suzanne in Biomedicine; Clinical trials; Clinical Trials as Topic - statistics & numerical data

2020

Background: Data collection consumes a large proportion of clinical trial resources. Each data item requires time and effort for collection, processing and quality control procedures. In general, more data means a heavier burden for trial staff and participants. It is also likely to increase costs. Knowing the types of data being collected, and in what proportion, will help to ensure that limited trial resources and participant goodwill are used wisely.
Aim: The aim of this study is to categorise the types of data collected across a broad range of trials and assess what proportion of collected data each category represents.
Methods: We developed a standard operating procedure to categorise data into primary outcome, secondary outcome and 15 other categories. We categorised all variables collected on trial data collection forms from 18, mainly publicly funded, randomised superiority trials, including trials of an investigational medicinal product and complex interventions. Categorisation was done independently in pairs: one person having in-depth knowledge of the trial, the other independent of the trial. Disagreement was resolved through reference to the trial protocol and discussion, with the project team being consulted if necessary.
Key results: Primary outcome data accounted for 5.0% (median)/11.2% (mean) of all data items collected. Secondary outcomes accounted for 39.9% (median)/42.5% (mean) of all data items. Non-outcome data such as participant identifiers and demographic data represented 32.4% (median)/36.5% (mean) of all data items collected.
Conclusion: A small proportion of the data collected in our sample of 18 trials was related to the primary outcome. Secondary outcomes accounted for about eight times as much data as the primary outcome (by median). A substantial amount of data collection is not related to trial outcomes. Trialists should work to make sure that the data they collect are only those essential to support the health and treatment decisions of those whom the trial is designed to inform.
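
The ratio quoted in the conclusion can be checked against the percentages reported under key results. The short sketch below uses only those reported figures (the variable names are illustrative) and suggests that the "eight times" figure corresponds to the medians, while the means give a ratio closer to four.

```python
# Percentages of all data items, as reported in the abstract.
primary = {"median": 5.0, "mean": 11.2}
secondary = {"median": 39.9, "mean": 42.5}

# Ratio of secondary-outcome data volume to primary-outcome data volume,
# computed separately for the median and the mean figures.
ratios = {stat: secondary[stat] / primary[stat] for stat in primary}

for stat, ratio in ratios.items():
    print(f"{stat}: secondary/primary = {ratio:.1f}")
```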

Journal Article

Performance criteria for verbal autopsy-based systems to estimate national causes of death: development and application to the Indian Million Death Study

by Awasthi, Shally; Suraweera, Wilson; Malhotra, Varun in Adolescent; Adult; Aged

2014

Background: Verbal autopsy (VA) has been proposed to determine the cause of death (COD) distributions in settings where most deaths occur without medical attention or certification. We develop performance criteria for VA-based COD systems and apply these to the Registrar General of India’s ongoing, nationally-representative Indian Million Death Study (MDS).
Methods: Performance criteria include a low ill-defined proportion of deaths before old age; reproducibility, including consistency of COD distributions with independent resampling; differences in COD distribution of hospital, home, urban or rural deaths; age-, sex- and time-specific plausibility of specific diseases; stability and repeatability of dual physician coding; and the ability of the mortality classification system to capture a wide range of conditions.
Results: The introduction of the MDS in India reduced the proportion of ill-defined deaths before age 70 years from 13% to 4%. The cause-specific mortality fractions (CSMFs) at ages 5 to 69 years for independently resampled deaths and the MDS were very similar across 19 disease categories. By contrast, CSMFs at these ages differed between hospital and home deaths and between urban and rural deaths. Thus, reliance mostly on urban or hospital data can distort national estimates of CODs. Age-, sex- and time-specific patterns for various diseases were plausible. Initial physician agreement on COD occurred about two-thirds of the time. The MDS COD classification system was able to capture more eligible records than alternative classification systems. By these metrics, the Indian MDS performs well for deaths prior to age 70 years. The key implication for low- and middle-income countries where medical certification of death remains uncommon is to implement COD surveys that randomly sample all deaths, use simple but high-quality field work with built-in resampling, and use electronic rather than paper systems to expedite field work and coding.
Conclusions: Simple criteria can evaluate the performance of VA-based COD systems. Despite the misclassification of VA, the MDS demonstrates that national surveys of CODs using VA are an order of magnitude better than the limited COD data previously available.

Journal Article

Studying user income through language, behaviour and affect in social media

by Lampos, Vasileios; Volkova, Svitlana; Bachrach, Yoram in Affect; Age differences; Artificial intelligence

2015

Automatically inferring user demographics from social media posts is useful for both social science research and a range of downstream applications in marketing and politics. We present the first extensive study where user behaviour on Twitter is used to build a predictive model of income. We apply non-linear methods for regression, i.e. Gaussian Processes, achieving strong correlation between predicted and actual user income. This allows us to shed light on the factors that characterise income on Twitter and analyse their interplay with user emotions and sentiment, perceived psycho-demographics and language use expressed through the topics of their posts. Our analysis uncovers correlations between different feature categories and income. Some of these reflect common belief (e.g. higher perceived education and intelligence indicate higher earnings) or known differences (e.g. gender and age differences), while others are novel findings (e.g. higher income users express more fear and anger, whereas lower income users express more of the time emotion and opinions).

Journal Article

Residential scene classification for gridded population sampling in developing countries using deep convolutional neural networks on satellite imagery

by Jones, Kasey; Amer, Safaa; Chew, Robert F. in Clustering; Complex sample design; Data Collection - classification

2018

Background: Conducting surveys in low- and middle-income countries is often challenging because many areas lack a complete sampling frame, have outdated census information, or have limited data available for designing and selecting a representative sample. Geosampling is a probability-based, gridded population sampling method that addresses some of these issues by using geographic information system (GIS) tools to create logistically manageable area units for sampling. GIS grid cells are overlaid to partition a country’s existing administrative boundaries into area units that vary in size from 50 m × 50 m to 150 m × 150 m. To avoid sending interviewers to unoccupied areas, researchers manually classify grid cells as “residential” or “nonresidential” through visual inspection of aerial images. “Nonresidential” units are then excluded from sampling and data collection. This process of manually classifying sampling units has drawbacks since it is labor intensive, prone to human error, and creates the need for simplifying assumptions during calculation of design-based sampling weights. In this paper, we discuss the development of a deep learning classification model to predict whether aerial images are residential or nonresidential, thus reducing manual labor and eliminating the need for simplifying assumptions.
Results: On our test sets, the model performs comparably to a human-level baseline in both Nigeria (94.5% accuracy) and Guatemala (96.4% accuracy), and outperforms baseline machine learning models trained on crowdsourced or remote-sensed geospatial features. Additionally, our findings suggest that this approach can work well in new areas with relatively modest amounts of training data.
Conclusions: Gridded population sampling methods like geosampling are becoming increasingly popular in countries with outdated or inaccurate census data because of their timeliness, flexibility, and cost. Using deep learning models directly on satellite images, we provide a novel method for sample frame construction that identifies residential gridded aerial units. In cases where manual classification of satellite images is used to (1) correct for errors in gridded population data sets or (2) classify grids where population estimates are unavailable, this methodology can help reduce annotation burden with comparable quality to human analysts.

Journal Article

The Assignment of Scores Procedure for Ordinal Categorical Data

by Chen, Han-Ching; Wang, Nae-Sheng in Alcohol Drinking - adverse effects; Alcohol Drinking - epidemiology; Analysis

2014

Ordinal data are the most frequently encountered type of data in the social sciences. Many statistical methods can be used to process such data. One common method is to assign scores to the categories, treat the scored data as interval data, and then perform further statistical analysis. Several authors have recently developed methods for assigning scores to ordered categorical data. This paper proposes a scoring system for an ordinal categorical variable based on an underlying continuous latent distribution and illustrates its interpretation with three case studies. The results show that the proposed scoring system works well for skewed ordinal categorical data.
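
As a rough illustration of scoring ordinal categories from an underlying continuous latent distribution, the sketch below assumes a standard normal latent variable and scores each category by the conditional mean of that variable within the category's interval. This is a generic "normal scores" construction chosen for illustration; it is not necessarily the procedure proposed in the paper, and the function name and example counts are hypothetical.

```python
from statistics import NormalDist


def latent_normal_scores(counts):
    """Score ordered categories by the mean of a standard normal latent
    variable truncated to each category's interval.

    Assumes every category has a positive count.
    """
    std = NormalDist()
    total = sum(counts)
    scores = []
    cumulative = 0
    for count in counts:
        lower_p = cumulative / total
        cumulative += count
        upper_p = cumulative / total
        # Normal density at the cut points; it vanishes at -inf and +inf.
        pdf_lower = std.pdf(std.inv_cdf(lower_p)) if 0 < lower_p < 1 else 0.0
        pdf_upper = std.pdf(std.inv_cdf(upper_p)) if 0 < upper_p < 1 else 0.0
        # E[Z | a < Z < b] = (pdf(a) - pdf(b)) / (cdf(b) - cdf(a))
        scores.append((pdf_lower - pdf_upper) / (upper_p - lower_p))
    return scores


# Hypothetical right-skewed counts for a four-level ordinal variable.
skewed_counts = [50, 30, 15, 5]
skewed_scores = latent_normal_scores(skewed_counts)
print([round(s, 3) for s in skewed_scores])
```

With skewed counts the resulting scores are unequally spaced, unlike the equally spaced integer scores 1, 2, 3, 4 that are often assigned by default.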

Journal Article

Reaching black men who have sex with men: a comparison between respondent-driven sampling and time-location sampling

by Colfax, Grant N; McFarland, Willi; Raymond, H Fisher in Adolescent; Adult; African Continental Ancestry Group

2012

Objectives: The authors explored whether respondent-driven sampling (RDS) can generate a more diverse sample of black men who have sex with men (MSM) than time-location sampling (TLS) by comparing sample characteristics accrued by each method in two independent studies.
Methods: The first study exclusively recruited black MSM through RDS (N=256), while the second recruited MSM through TLS including a subsample of black MSM (N=69). Crude and adjusted point estimates and 95% CIs were calculated for socio-demographic and behavioural characteristics, HIV prevalence and prevalence of unrecognised infections, and were compared using the Z-test.
Results: The samples differed significantly regarding all socio-demographic and some behavioural characteristics. Compared with TLS, RDS estimated higher proportions of older, less educated, poorer, currently homeless and self-identified bisexual black MSM. Participants in RDS were less likely to have a main partner, had fewer male partners, were more likely to have a female partner and have both male and female partners, and reported greater methamphetamine, crack and heroin use. Prevalence of HIV and unrecognised infections were slightly higher among RDS participants.
Conclusions: The RDS sample comprised black MSM who were more diverse with respect to socio-demographic characteristics and may also be at higher risk for HIV. Thus, RDS has advantages in reaching higher risk black MSM who are most hidden from intervention research and service delivery. Future studies of black MSM using RDS could use steering strategies to recruit younger participants and other subgroups of greatest interest to public health and prevention.
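
The abstract says the two samples were compared using the Z-test. A minimal sketch of a pooled two-proportion Z-test is given below for orientation. The sample sizes (256 and 69) come from the abstract, but the event counts are hypothetical, and the sketch works on crude proportions only: it ignores the adjusted (weighted) estimates the authors also report, so it should not be read as their exact analysis.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion Z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Sample sizes from the abstract; the counts 64 and 10 are hypothetical.
z, p_value = two_proportion_z(64, 256, 10, 69)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```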

Journal Article

A method for encoding clinical datasets with SNOMED CT

by Lau, Francis Y; Lee, Dennis H; Quan, Hue in Abbreviations as Topic; Algorithms; Audit trails

2010

Background: Over the past decade there has been a growing body of literature on how the Systematised Nomenclature of Medicine Clinical Terms (SNOMED CT) can be implemented and used in different clinical settings. Yet, for those charged with incorporating SNOMED CT into their organisation's clinical applications and vocabulary systems, there are few detailed encoding instructions and examples available to show how this can be done and the issues involved. This paper describes a heuristic method that can be used to encode clinical terms in SNOMED CT and an illustration of how it was applied to encode an existing palliative care dataset.
Methods: The encoding process involves: identifying input data items; cleaning the data items; encoding the cleaned data items; and exporting the encoded terms as output term sets. Four outputs are produced: the SNOMED CT reference set; interface terminology set; SNOMED CT extension set and unencodeable term set.
Results: The original palliative care database contained 211 data elements, 145 coded values and 37,248 free text values. We were able to encode ~84% of the terms, another ~8% require further encoding and verification while terms that had a frequency of fewer than five were not encoded (~7%).
Conclusions: From the pilot, it would seem our SNOMED CT encoding method has the potential to become a general purpose terminology encoding approach that can be used in different clinical systems.
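
The clean, encode and export steps described in the methods, together with the frequency cut-off of five mentioned in the results, can be sketched roughly as follows. Everything here is an assumption made for illustration: the lookup table, its placeholder codes (which are not real SNOMED CT identifiers), the function names, and the simplification of the paper's four output sets into three buckets.

```python
from collections import Counter

# Terms seen fewer than five times are left unencoded, per the abstract.
MIN_FREQUENCY = 5

# Hypothetical lookup table with placeholder codes (not real SNOMED CT IDs).
TERM_TO_CODE = {
    "shortness of breath": "PLACEHOLDER-001",
    "nausea": "PLACEHOLDER-002",
}


def clean(term):
    """Lower-case a free-text value and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def encode_terms(raw_terms, lookup=TERM_TO_CODE, min_frequency=MIN_FREQUENCY):
    """Split cleaned terms into encoded, needs-review and low-frequency sets."""
    counts = Counter(clean(t) for t in raw_terms if t.strip())
    encoded, needs_review, low_frequency = {}, [], []
    for term, frequency in counts.items():
        if frequency < min_frequency:
            low_frequency.append(term)      # too rare to be worth encoding
        elif term in lookup:
            encoded[term] = lookup[term]    # direct match in the lookup table
        else:
            needs_review.append(term)       # frequent, but no match found yet
    return encoded, needs_review, low_frequency


# Hypothetical free-text values.
raw = ["Nausea"] * 6 + ["  nausea "] + ["fatigue"] * 5 + ["itch"] * 2
encoded, needs_review, low_frequency = encode_terms(raw)
print(encoded, needs_review, low_frequency)
```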

Journal Article

A classification of tasks for the systematic study of immune response using functional genomics data

by BEHNKE, J. M.; HAMSHERE, M. G.; ELSE, K. J. in Allergy and Immunology - classification; Animals; Biological and medical sciences

2006

A full understanding of the immune system and its responses to infection by different pathogens is important for the development of anti-parasitic vaccines. A growing number of large-scale experimental techniques, such as microarrays, are being used to gain a better understanding of the immune system. To analyse the data generated by these experiments, methods such as clustering are widely used. However, individual applications of these methods tend to analyse the experimental data without taking publicly available biological and immunological knowledge into account systematically and in an unbiased manner. To make best use of the experimental investment, to benefit from existing evidence, and to support the findings in the experimental data, available biological information should be included in the analysis in a systematic manner. In this review we present a classification of tasks that shows how experimental data produced by studies of the immune system can be placed in a broader biological context. Taking into account available evidence, the classification can be used to identify different ways of analysing the experimental data systematically. We have used the classification to identify alternative ways of analysing microarray data, and illustrate its application using studies of immune responses in mice to infection with the intestinal nematode parasites Trichuris muris and Heligmosomoides polygyrus.

Journal Article
