Search Results

296 results for "Big data science workflows"
Current approaches for executing big data science projects—a systematic literature review
There is an increasing number of big data science projects aiming to create value for organizations by improving decision making, streamlining costs, or enhancing business processes. However, many of these projects fail to deliver the expected value. It has been observed that a key reason many data science projects do not succeed is not technical in nature but rather lies in the process aspects of the project. The lack of established and mature methodologies for executing data science projects has been frequently noted as a reason for these project failures. To help move the field forward, this study presents a systematic review of research focused on the adoption of big data science process frameworks. The goal of the review was to identify (1) the key themes in current research on how teams execute data science projects, (2) the most common approaches regarding how data science projects are organized, managed, and coordinated, (3) the activities involved in a data science project's life cycle, and (4) the implications for future research in this field. In short, the review identified 68 primary studies, thematically classified into six categories. Two of the themes (workflow and agility) accounted for approximately 80% of the identified studies. The findings regarding workflow approaches consist mainly of adaptations of CRISP-DM (rather than entirely new methodologies). With respect to agile approaches, most of the studies only explored the conceptual benefits of using an agile approach in a data science project (rather than actually evaluating an agile framework used in a data science context). Hence, one finding from this research is that future research should explore how to best achieve the theorized benefits of agility. Another finding is the need to explore how to efficiently combine workflow and agile frameworks within a data science context to achieve a more comprehensive approach to project execution.
The Role of ChatGPT in Data Science: How AI-Assisted Conversational Interfaces Are Revolutionizing the Field
ChatGPT, a conversational AI interface that utilizes natural language processing and machine learning algorithms, is taking the world by storm and is the buzzword across many sectors today. Given the likely impact of this model on data science, this perspective article seeks to provide an overview of the potential opportunities and challenges associated with using ChatGPT in data science, give readers a snapshot of its advantages, and stimulate interest in its use for data science projects. The paper discusses how ChatGPT can assist data scientists in automating various aspects of their workflow, including data cleaning and preprocessing, model training, and result interpretation. It also highlights how ChatGPT has the potential to provide new insights and improve decision-making processes by analyzing unstructured data. We then examine the advantages of ChatGPT's architecture, including its ability to be fine-tuned for a wide range of language-related tasks and to generate synthetic data. Limitations and issues are also addressed, particularly concerns about bias and plagiarism when using ChatGPT. Overall, the paper concludes that the benefits outweigh the costs: ChatGPT has the potential to greatly enhance the productivity and accuracy of data science workflows and is likely to become an increasingly important tool for intelligence augmentation in the field of data science. ChatGPT can assist with a wide range of natural language processing tasks in data science, including language translation, sentiment analysis, and text classification. However, while ChatGPT can save time and resources compared to training a model from scratch, and can be fine-tuned for specific use cases, it may not perform well on certain tasks if it has not been specifically trained for them. Additionally, the output of ChatGPT may be difficult to interpret, which could pose challenges for decision-making in data science applications.
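To make the idea of embedding a conversational model in a data science workflow concrete, here is a minimal sketch of a sentiment-labelling step over free-text records. It assumes the openai Python client (v1+) and an API key in the environment; the model name, prompt, and helper function are illustrative choices, not taken from the paper.

```python
# Minimal sketch: an LLM-assisted sentiment-analysis step in a data pipeline.
# Assumes the `openai` client (v1+) and OPENAI_API_KEY in the environment;
# the model name and prompt are illustrative, not prescribed by the paper.
from openai import OpenAI

client = OpenAI()

def label_sentiment(texts):
    """Return a coarse sentiment label for each free-text record."""
    labels = []
    for text in texts:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[
                {"role": "system",
                 "content": "Classify the sentiment of the user's text as "
                            "positive, negative, or neutral. Reply with one word."},
                {"role": "user", "content": text},
            ],
            temperature=0,
        )
        labels.append(resp.choices[0].message.content.strip().lower())
    return labels

if __name__ == "__main__":
    print(label_sentiment(["The new release fixed every bug I reported."]))
```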
Closha 2.0: a bio-workflow design system for massive genome data analysis on high performance cluster infrastructure
Background: The explosive growth of next-generation sequencing (NGS) data has resulted in ultra-large-scale datasets and significant computational challenges. As the cost of NGS has decreased, the amount of genomic data has surged globally. However, the cost and complexity of the computational resources required continue to be substantial barriers to leveraging big data. A promising solution to these computational challenges is cloud computing, which provides researchers with the necessary CPUs, memory, storage, and software tools. Results: Here, we present Closha 2.0, a cloud computing service that offers a user-friendly platform for analyzing massive genomic datasets. Closha 2.0 is designed to provide a cloud-based environment that enables all genomic researchers, including those with limited or no programming experience, to easily analyze their genomic data. The new 2.0 version of Closha has more user-friendly features than the previous 1.0 version. First, the workbench features a script editor that supports Python, R, and shell script programming, enabling users to write scripts and integrate them into their pipelines. This functionality is particularly useful for downstream analysis. Second, Closha 2.0 runs on containers, which execute each tool in an independent environment. This provides a stable environment and prevents dependency issues and version conflicts among tools. Additionally, users can execute each step of a pipeline individually, allowing them to test applications at each stage and adjust parameters to achieve the desired results. We also updated a high-speed data transmission tool called GBox that facilitates the rapid transfer of large datasets. Conclusions: The analysis pipelines on Closha 2.0 are reproducible, with all analysis parameters and inputs permanently recorded. Closha 2.0 simplifies multi-step analysis with drag-and-drop functionality and provides a user-friendly interface for genomic scientists to obtain accurate results from NGS data. Closha 2.0 is freely available at https://www.kobic.re.kr/closha2.
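As a rough illustration of the container-per-tool idea described above (not Closha's actual mechanism), the Python sketch below launches one analysis step inside an isolated Docker container; the image name, tool command, and file names are placeholders.

```python
# Illustration only: run one pipeline step in its own container so each tool
# gets an isolated environment and version conflicts between tools are avoided.
# Assumes Docker is installed; image name, command, and files are placeholders.
import os
import subprocess

def run_in_container(image, command, workdir="/data"):
    """Execute a single analysis step inside an isolated container."""
    return subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{os.getcwd()}:{workdir}",  # mount the working directory
         "-w", workdir,
         image, "sh", "-c", command],
        check=True,
    )

# Example step: run a QC tool on a FASTQ file via a containerized image.
run_in_container("example/fastqc:latest", "fastqc reads.fastq")
```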
Development and validation of machine learning models to identify high-risk surgical patients using automatically curated electronic health record data (Pythia): A retrospective, single-site study
Pythia is an automated, clinically curated surgical data pipeline and repository housing all surgical patient electronic health record (EHR) data from a large, quaternary, multisite health institution for data science initiatives. In an effort to better identify high-risk surgical patients from complex data, a machine learning project trained on Pythia was built to predict postoperative complication risk. A curated data repository of surgical outcomes was created using automated SQL and R code that extracted and processed patient clinical and surgical data across 37 million clinical encounters from the EHRs. A total of 194 clinical features including patient demographics (e.g., age, sex, race), smoking status, medications, comorbidities, procedure information, and proxies for surgical complexity were constructed and aggregated. A cohort of 66,370 patients who had undergone 99,755 invasive procedural encounters between January 1, 2014, and January 31, 2017, was studied further for the purpose of predicting postoperative complications. The average complication and 30-day postoperative mortality rates of this cohort were 16.0% and 0.51%, respectively. Least absolute shrinkage and selection operator (lasso) penalized logistic regression, random forest models, and extreme gradient boosted decision trees were trained on this surgical cohort with cross-validation on 14 specific postoperative outcome groupings. Resulting models had area under the receiver operating characteristic curve (AUC) values ranging between 0.747 and 0.924, calculated on an out-of-sample test set from the last 5 months of data. Lasso penalized regression was identified as a high-performing model, providing clinically interpretable, actionable insights. The highest- and lowest-performing lasso models predicted postoperative shock and genitourinary outcomes with AUCs of 0.924 (95% CI: 0.901, 0.946) and 0.780 (95% CI: 0.752, 0.810), respectively. A calculator requiring input of 9 data fields was created to produce a risk assessment for the 14 groupings of postoperative outcomes. A high-risk threshold (15% risk of any complication) was determined to identify high-risk surgical patients. The model sensitivity was 76%, with a specificity of 76%. Compared to heuristics developed by clinical experts to identify high-risk patients and to the ACS NSQIP calculator, this tool performed better, providing an improved approach for clinicians to estimate postoperative risk for patients. Limitations of this study include the missing data that were removed for analysis. Extracting and curating a large, local institution's EHR data for machine learning purposes resulted in models with strong predictive performance. These models can be used in clinical settings as decision support tools for identification of high-risk patients as well as patient evaluation and care management. Further work is necessary to evaluate the impact of the Pythia risk calculator within the clinical workflow on postoperative outcomes and to optimize this data flow for future machine learning efforts.
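The modeling pattern described in this abstract can be sketched in a few lines of scikit-learn. The example below is a hedged illustration on synthetic data, not the authors' Pythia code: an L1-penalized ("lasso") logistic regression scored by AUC, with a 15% risk threshold used to flag high-risk cases.

```python
# Hedged sketch of the modeling pattern (synthetic data, not the Pythia code):
# lasso-penalized logistic regression, AUC evaluation, 15% high-risk threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic cohort with a ~16% event rate, mirroring the reported complication rate.
X, y = make_classification(n_samples=5000, n_features=50, weights=[0.84, 0.16],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=0)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=1000),
)
model.fit(X_tr, y_tr)

risk = model.predict_proba(X_te)[:, 1]
print("AUC:", round(roc_auc_score(y_te, risk), 3))

high_risk = risk >= 0.15  # 15% threshold, mirroring the paper's cutoff
print("Flagged high-risk:", int(high_risk.sum()), "of", len(risk))
```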
Applications of Remote Sensing in Precision Agriculture: A Review
Agriculture provides for the most basic needs of humankind: food and fiber. The introduction of new farming techniques in the past century (e.g., during the Green Revolution) has helped agriculture keep pace with growing demands for food and other agricultural products. However, further increases in food demand, a growing population, and rising income levels are likely to put additional strain on natural resources. With growing recognition of the negative impacts of agriculture on the environment, new techniques and approaches are needed that can meet future food demands while maintaining or reducing the environmental footprint of agriculture. Emerging technologies, such as geospatial technologies, the Internet of Things (IoT), Big Data analysis, and artificial intelligence (AI), could be utilized to make informed management decisions aimed at increasing crop production. Precision agriculture (PA) entails the application of a suite of such technologies to optimize agricultural inputs in order to increase agricultural production and reduce input losses. Use of remote sensing technologies for PA has increased rapidly during the past few decades. The unprecedented availability of high-resolution (spatial, spectral, and temporal) satellite images has promoted the use of remote sensing in many PA applications, including crop monitoring, irrigation management, nutrient application, disease and pest management, and yield prediction. In this paper, we provide an overview of remote sensing systems, techniques, and vegetation indices along with their recent (2015–2020) applications in PA. Remote-sensing-based PA technologies such as variable fertilizer rate application technology in GreenSeeker and Crop Circle have already been incorporated in commercial agriculture. Use of unmanned aerial vehicles (UAVs) has increased tremendously during the last decade due to their cost-effectiveness and flexibility in obtaining the high-resolution (cm-scale) images needed for PA applications. At the same time, the availability of a large amount of satellite data has prompted researchers to explore advanced data storage and processing techniques such as cloud computing and machine learning. Given the complexity of image processing and the amount of technical knowledge and expertise needed, it is critical to explore and develop a simple yet reliable workflow for the real-time application of remote sensing in PA. Development of accurate yet easy-to-use systems is likely to result in broader adoption of remote sensing technologies in commercial and non-commercial PA applications.
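Vegetation indices such as those reviewed above are simple band arithmetic. The sketch below computes NDVI, defined as (NIR - Red) / (NIR + Red), on a toy array; in practice the bands would come from satellite or UAV imagery.

```python
# Computing NDVI from red and near-infrared reflectance arrays.
# Band values below are synthetic; real inputs would be satellite/UAV rasters.
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)

# Toy 2x2 "image": higher NDVI suggests denser, healthier vegetation.
nir_band = np.array([[0.60, 0.55], [0.20, 0.10]])
red_band = np.array([[0.10, 0.12], [0.15, 0.09]])
print(ndvi(nir_band, red_band).round(2))
```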
Small data machine learning in materials science
This review discusses the dilemma of small data in materials machine learning. First, we analyze the limitations imposed by small data. We then introduce the typical workflow of materials machine learning. Next, we survey methods for dealing with small data at three levels: at the data-source level, data extraction from publications, construction of materials databases, and high-throughput computations and experiments; at the algorithm level, modeling algorithms suited to small data and imbalanced learning; and at the machine learning strategy level, active learning and transfer learning. Finally, we propose future directions for small data machine learning in materials science.
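One strategy named above, active learning, can be illustrated with a short, self-contained sketch. The example below performs pool-based uncertainty sampling on synthetic data; it is not tied to any materials dataset from the review.

```python
# Pool-based active learning with uncertainty sampling (synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_pool, y_pool = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]

labeled = list(rng.choice(len(X_pool), size=20, replace=False))  # tiny initial set
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

for round_ in range(5):
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_pool[labeled], y_pool[labeled])
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"round {round_}: {len(labeled)} labeled samples, test accuracy {acc:.3f}")
    # Query the 10 pool points the current model is least certain about.
    proba = clf.predict_proba(X_pool[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)
    query = [unlabeled[i] for i in np.argsort(uncertainty)[-10:]]
    labeled += query
    unlabeled = [i for i in unlabeled if i not in query]
```

Each round retrains on the labels gathered so far and then requests labels only where the model is least confident, which is the main lever when labeled data are scarce.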
How can Big Data and machine learning benefit environment and water management: a survey of methods, applications, and future directions
Big Data and machine learning (ML) technologies have the potential to impact many facets of environment and water management (EWM). Big Data are information assets characterized by high volume, velocity, variety, and veracity. Fast advances in high-resolution remote sensing techniques, smart information and communication technologies, and social media have contributed to the proliferation of Big Data in many EWM fields, such as weather forecasting, disaster management, smart water and energy management systems, and remote sensing. Big Data brings about new opportunities for data-driven discovery in EWM, but it also requires new forms of information processing, storage, retrieval, and analytics. ML, a subdomain of artificial intelligence (AI), refers broadly to computer algorithms that can automatically learn from data. ML may help unlock the power of Big Data if properly integrated with data analytics. Recent breakthroughs in AI and computing infrastructure have led to the fast development of powerful deep learning (DL) algorithms that can extract hierarchical features from data, with better predictive performance and less human intervention. Collectively, Big Data and ML techniques have shown great potential for data-driven decision making, scientific discovery, and process optimization. These technological advances may greatly benefit EWM, especially because (1) many EWM applications (e.g., early flood warning) require the capability to extract useful information from a large amount of data in an autonomous manner and in real time, (2) EWM research has become highly multidisciplinary, and handling the ever-increasing volume and variety of data using the traditional workflow is simply not an option, and (3) the current theoretical knowledge about many EWM processes is still incomplete but may now be complemented through data-driven discovery. A large number of applications of Big Data and ML have already appeared in the EWM literature in recent years. The purposes of this survey are to (1) examine the potential and benefits of data-driven research in EWM, (2) give a synopsis of key concepts and approaches in Big Data and ML, (3) provide a systematic review of current applications, and (4) discuss major issues and challenges and recommend future research directions. EWM includes a broad range of research topics. Instead of attempting to survey each individual area, this review focuses on areas of nexus in EWM, with an emphasis on elucidating the potential benefits of increased data availability and predictive analytics for improving EWM research.
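As a hedged illustration of the data-driven pattern the survey describes (not any specific system it reviews), the sketch below trains a model to predict a near-term water level from lagged rainfall and gauge readings, using synthetic data in place of real sensor feeds.

```python
# Toy data-driven early-warning sketch: predict water level 3 hours ahead
# from the previous 6 hours of rainfall and level readings (synthetic data).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
n = 2000
rain = rng.gamma(shape=2.0, scale=1.0, size=n)                      # hourly rainfall
level = np.convolve(rain, np.ones(6) / 6, mode="same") + 0.1 * rng.normal(size=n)

lags, horizon = 6, 3
rows = range(lags, n - horizon)
X = np.array([np.r_[rain[t - lags:t], level[t - lags:t]] for t in rows])
y = np.array([level[t + horizon] for t in rows])

split = int(0.8 * len(X))
model = GradientBoostingRegressor(random_state=0).fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("MAE (3 h ahead):", round(mean_absolute_error(y[split:], pred), 3))
```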
The challenges of colposcopy for cervical cancer screening in LMICs and solutions by artificial intelligence
Background: The World Health Organization (WHO) has called for global action towards the elimination of cervical cancer. One of the main strategies is to screen 70% of women between the ages of 35 and 45 years and to ensure that 90% of women identified with cervical disease are managed appropriately by 2030. Approximately 85% of cervical cancers occur in low- and middle-income countries (LMICs). Colposcopy-guided biopsy is crucial for detecting cervical intraepithelial neoplasia (CIN) and has become the main bottleneck limiting screening performance. Unprecedented advances in artificial intelligence (AI) enable the synergy of deep learning and digital colposcopy, which offers opportunities for automatic image-based diagnosis. To this end, we discuss the main challenges of traditional colposcopy and solutions that apply AI-guided digital colposcopy as an auxiliary diagnostic tool in LMICs. Main body: Existing challenges for the application of colposcopy in LMICs include a strong dependence on the subjective experience of operators, substantial inter- and intra-operator variability, a shortage of experienced colposcopists and of comprehensive colposcopy training courses, and uniform diagnostic standards and strict quality control that are hard for colposcopists with limited diagnostic ability to follow, resulting in discrepant reporting and documentation of colposcopy impressions. Organized colposcopy training courses should be viewed as an effective way to enhance the diagnostic ability of colposcopists, but implementing these courses in practice may not be a feasible way to improve overall diagnostic performance in a short period of time. Fortunately, AI has the potential to address this colposcopy bottleneck by assisting colposcopists in judging colposcopy images, detecting underlying CIN, and guiding biopsy sites. An automated colposcopy examination workflow could create a novel cervical cancer screening model, reduce potential false negatives and false positives, and improve the accuracy of colposcopic diagnosis and cervical biopsy. Conclusion: We believe that practical and accurate AI-guided digital colposcopy has the potential to strengthen diagnostic ability in guiding cervical biopsy, thereby improving cervical cancer screening performance in LMICs and ultimately accelerating global cervical cancer elimination.
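The kind of image-based classifier envisioned here is typically a fine-tuned convolutional network. The sketch below is an illustration under stated assumptions, not the authors' system: it adapts a pretrained ResNet-18 to a colposcopy dataset whose on-disk layout (an ImageFolder directory with one subfolder per class) is hypothetical.

```python
# Illustration only: fine-tune a pretrained CNN to classify colposcopy images.
# The dataset path and class layout are assumptions, not from the paper.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_ds = datasets.ImageFolder("colposcopy/train", transform=tfm)  # hypothetical path
train_dl = DataLoader(train_ds, batch_size=16, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(train_ds.classes))  # e.g. normal vs. CIN

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in train_dl:   # one pass shown; real training runs many epochs
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```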
DolphinNext: a distributed data processing platform for high throughput genomics
Background: The emergence of high-throughput technologies that produce vast amounts of genomic data, such as next-generation sequencing (NGS), is transforming biological research. The dramatic increase in the volume of data and the variety and continuous change of data processing tools, algorithms, and databases make analysis the main bottleneck for scientific discovery. The processing of high-throughput datasets typically involves many different computational programs, each of which performs a specific step in a pipeline. Given the wide range of applications and organizational infrastructures, there is a great need for highly parallel, flexible, portable, and reproducible data processing frameworks. Several platforms currently exist for the design and execution of complex pipelines. Unfortunately, current platforms lack the necessary combination of parallelism, portability, flexibility, and/or reproducibility that is required by the current research environment. To address these shortcomings, workflow frameworks that provide a platform to develop and share portable pipelines have recently arisen. We complement these new platforms by providing a graphical user interface to create, maintain, and execute complex pipelines. Such a platform simplifies robust and reproducible workflow creation for non-technical users and provides a robust platform for large organizations to maintain pipelines. Results: To simplify the development, maintenance, and execution of complex pipelines, we created DolphinNext. DolphinNext facilitates the building and deployment of complex pipelines using a modular approach implemented in a graphical interface that relies on the powerful Nextflow workflow framework by providing (1) a drag-and-drop user interface that visualizes pipelines and allows users to create pipelines without familiarity with the underlying programming languages; (2) modules to execute and monitor pipelines in distributed computing environments such as high-performance clusters and/or the cloud; (3) reproducible pipelines with version tracking and stand-alone versions that can be run independently; (4) modular process design with process revisioning support to increase reusability and pipeline development efficiency; (5) pipeline sharing with GitHub and automated testing; and (6) extensive reports with R Markdown and Shiny support for interactive data visualization and analysis. Conclusion: DolphinNext is a flexible, intuitive, web-based data processing and analysis platform that enables creating, deploying, sharing, and executing complex Nextflow pipelines with extensive revisioning and interactive reporting to enhance reproducible results.
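DolphinNext itself builds on Nextflow; as a language-neutral sketch of the underlying idea of modular, provenance-recorded pipeline steps (an illustration of the pattern, not DolphinNext's implementation), the Python snippet below runs each step and appends its parameters, input checksums, and exit status to a log.

```python
# Illustration of the reproducible-pipeline idea: run each step in isolation and
# record enough provenance (command, parameters, input hashes, exit status) to
# rerun it later. Not DolphinNext's implementation; file names are placeholders.
import hashlib
import json
import subprocess
import time

def run_step(name, cmd, inputs, params, log="provenance.jsonl"):
    """Run one pipeline step and append a provenance record."""
    record = {
        "step": name,
        "command": cmd,
        "params": params,
        "inputs": {p: hashlib.sha256(open(p, "rb").read()).hexdigest() for p in inputs},
        "started": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    record["exit_code"] = result.returncode
    with open(log, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return result

# Two chained steps of a toy pipeline (placeholder commands and files).
run_step("count_reads", "wc -l reads.fastq > counts.txt", inputs=["reads.fastq"], params={})
run_step("summarize", "cat counts.txt", inputs=["counts.txt"], params={})
```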
Geodata Science-Based Mineral Prospectivity Mapping: A Review
This paper introduces the concept of geodata science-based mineral prospectivity mapping (GSMPM), which is based on analyzing the spatial associations between geological prospecting big data (GPBD) and locations of known mineralization. Geodata science reveals the inter-correlations between GPBD and mineralization, converts GPBD into mappable criteria, and combines multiple mappable criteria into a mineral potential map. A workflow of the GSMPM is proposed and compared with the traditional workflow of mineral prospectivity mapping. More specifically, each component in such a workflow is explained in detail to demonstrate how geodata science serves mineral prospectivity mapping by deriving geoinformation from geoscience data, generating geo-knowledge from geoinformation, and allowing spatial decision-making by integrating geoinformation and geo-knowledge on the formation of mineral deposits. This review also presents several research directions for GSMPM in the future.
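The final integration step described above, combining mappable criteria into a mineral potential map, is often a weighted overlay of evidence layers. The sketch below is a toy example with synthetic layers and arbitrary weights, not a workflow from the review.

```python
# Weighted overlay of evidence layers into a prospectivity score per grid cell.
# Layers and weights are synthetic stand-ins for real mappable criteria.
import numpy as np

rng = np.random.default_rng(42)
shape = (4, 4)  # tiny grid; real studies use large rasters

evidence = {
    "fault_proximity": rng.random(shape),          # e.g. inverse distance to faults
    "geochem_anomaly": rng.random(shape),          # normalized assay anomaly score
    "favourable_lithology": rng.integers(0, 2, shape).astype(float),
}
weights = {"fault_proximity": 0.4, "geochem_anomaly": 0.4, "favourable_lithology": 0.2}

# Weighted linear combination -> prospectivity score in [0, 1] per cell.
potential = sum(weights[k] * evidence[k] for k in evidence)
print(np.round(potential, 2))
```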