Search Results
41 result(s) for "Wickham, Hadley"
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data
\"This book introduces you to R, RStudio, and the tidyverse, a collection of R packages designed to work together to make data science fast, fluent, and fun. Suitable for readers with no previous programming experience ... \"--Page 4 of cover.
A Layered Grammar of Graphics
A grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the "scatterplot") and gain insight into the deep structure that underlies statistical graphics. This article builds on Wilkinson, Anand, and Grossman (2005), describing extensions and refinements developed while building an open source implementation of the grammar of graphics for R, ggplot2. The topics in this article include an introduction to the grammar by working through the process of creating a plot, and discussing the components that we need. The grammar is then presented formally and compared to Wilkinson's grammar, highlighting the hierarchy of defaults, and the implications of embedding a graphical grammar into a programming language. The power of the grammar is illustrated with a selection of examples that explore different components and their interactions in more detail. The article concludes by discussing some perceptual issues, and thinking about how we can build on the grammar to learn how to create graphical "poems." Supplemental materials are available online.
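As a toy illustration of the layered-grammar idea (not ggplot2's actual API — the names and structure below are hypothetical), one can sketch how a single layer resolves an aesthetic mapping against a dataset:

```python
# Illustrative sketch only: a toy "layered grammar" specification.
# In the grammar, a plot is data + aesthetic mappings + layers, where
# each layer combines a geometric object (geom) with a statistical
# transformation (stat), inheriting defaults from the plot.

def build_layer(data, mapping, geom="point", stat="identity"):
    """Resolve one layer: apply the aesthetic mapping to each row.

    `mapping` sends aesthetic names (e.g. "x", "y") to variable names
    in the data; the result is data expressed in aesthetic space.
    """
    mapped = [{aes: row[var] for aes, var in mapping.items()} for row in data]
    return {"geom": geom, "stat": stat, "data": mapped}

# A two-row toy dataset standing in for a real one.
cars = [{"speed": 4, "dist": 2}, {"speed": 7, "dist": 22}]
layer = build_layer(cars, {"x": "speed", "y": "dist"})
```

Separating the mapping, geom, and stat like this is what lets the same data be re-rendered as, say, a scatterplot or a smoothed line by swapping one component.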
Statistical inference for exploratory data analysis and model diagnostics
We propose to furnish visual statistical methods with an inferential framework and protocol, modelled on confirmatory statistical testing. In this framework, plots take on the role of test statistics, and human cognition the role of statistical tests. Statistical significance of 'discoveries' is measured by having the human viewer compare the plot of the real dataset with collections of plots of simulated datasets. A simple but rigorous protocol that provides inferential validity is modelled after the 'lineup' popular from criminal legal procedures. Another protocol modelled after the 'Rorschach' inkblot test, well known from (pop-)psychology, will help analysts acclimatize to random variability before being exposed to the plot of the real data. The proposed protocols will be useful for exploratory data analysis, with reference datasets simulated by using a null assumption that structure is absent. The framework is also useful for model diagnostics in which case reference datasets are simulated from the model in question. This latter point follows up on previous proposals. Adopting the protocols will mean an adjustment in working procedures for data analysts, adding more rigour, and teachers might find that incorporating these protocols into the curriculum improves their students' statistical thinking.
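The 'lineup' protocol described above can be sketched in a few lines. The function name and interface here are hypothetical, assuming the null datasets come from a user-supplied simulator:

```python
import random

def lineup(real_data, simulate_null, m=20, seed=None):
    """Build a 'lineup': the real dataset hidden among m-1 null datasets.

    `simulate_null` draws one dataset under the null hypothesis.  All m
    datasets are plotted identically; a viewer who can pick out the plot
    of the real data provides evidence against the null, since under the
    null a correct pick happens with probability 1/m (0.05 for m = 20).
    Returns (datasets, position); `position` stays secret until the
    viewer has chosen.
    """
    rng = random.Random(seed)
    datasets = [simulate_null() for _ in range(m - 1)]
    position = rng.randrange(m)
    datasets.insert(position, real_data)
    return datasets, position
```

For model diagnostics, `simulate_null` would instead simulate from the fitted model, as the abstract notes.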
Letter-Value Plots: Boxplots for Large Data
Boxplots are useful displays that convey rough information about the distribution of a variable. Boxplots were designed to be drawn by hand and work best for small datasets, where detailed estimates of tail behavior beyond the quartiles may not be trustworthy. Larger datasets afford more precise estimates of tail behavior, but boxplots do not take advantage of this precision, instead presenting large numbers of extreme, though not unexpected, observations. Letter-value plots address this problem by including more detailed information about the tails using "letter values," order statistics defined by Tukey. Boxplots display the first two letter values (the median and quartiles); letter-value plots display further letter values so far as they are reliable estimates of their corresponding quantiles. We illustrate letter-value plots with real data that demonstrate their usefulness for large datasets. All graphics are created using the R package lvplot, and code and data are available in the supplementary materials.
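Tukey's letter values are simple to compute directly. The sketch below is a hypothetical helper (not the lvplot package's API) that returns the first few lower/upper letter-value pairs of a sample:

```python
def letter_values(data, k=3):
    """Return the first k letter-value pairs (M, F, E, ...) of `data`.

    Letter values are order statistics at Tukey's "letter depths":
    the median's depth is d = (n + 1) / 2, and each subsequent letter
    (fourths, eighths, ...) has depth d_next = (floor(d) + 1) / 2.
    At half-integer depths the two adjacent order statistics are
    averaged.  Each pair is (value from the low end, value from the
    high end) at the same depth.
    """
    xs = sorted(data)
    n = len(xs)

    def at_depth(d):
        lo = xs[int(d) - 1]          # depth counted from the low end
        hi = xs[n - int(d)]          # same depth from the high end
        if d != int(d):              # half depth: average neighbours
            lo = (lo + xs[int(d)]) / 2
            hi = (hi + xs[n - int(d) - 1]) / 2
        return lo, hi

    out = []
    d = (n + 1) / 2
    for _ in range(k):
        out.append(at_depth(d))
        d = (int(d) + 1) / 2
    return out
```

For the sample 1..8 this yields the median pair (4.5, 4.5), the fourths (2.5, 6.5), and the eighths (1.5, 7.5); a letter-value plot draws one nested box per pair, going only as deep as the tail estimates stay reliable.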
Montane meadow change during drought varies with background hydrologic regime and plant functional group
Climate change models for many ecosystems predict more extreme climatic events in the future, including exacerbated drought conditions. Here we assess the effects of drought by quantifying temporal variation in community composition of a complex montane meadow landscape characterized by a hydrological gradient. The meadows occur in two regions of the Greater Yellowstone Ecosystem (Gallatin and Teton) and were classified into six categories (M1–M6, designating hydric to xeric) based upon Satellite pour l'Observation de la Terre (SPOT) satellite imagery. Both regions have similar plant communities, but patch sizes of meadows are much smaller in the Gallatin region. We measured changes in the percent cover of bare ground and plants by species and functional groups during five years between 1997 and 2007. We hypothesized that drought effects would not be manifested evenly across the hydrological gradient, but rather would be observed as hotspots of change in some areas and minimally evident in others. We also expected varying responses by plant functional groups (forbs vs. woody plants). Forbs, which typically use water from relatively shallow soils compared to woody plants, were expected to decrease in cover in mesic meadows, but increase in hydric meadows. Woody plants, such as Artemisia, were expected to increase, especially in mesic meadows. We identified several important trends in our meadow plant communities during this period of drought: (1) bare ground increased significantly in xeric meadows of both regions (Gallatin M6 and Teton M5) and in mesic (M3) meadows of the Teton, (2) forbs decreased significantly in the mesic and xeric meadows in both regions, (3) forbs increased in hydric (M1) meadows of the Gallatin region, and (4) woody species showed increases in M2 and M5 meadows of the Teton region and in M3 meadows of the Gallatin region. The woody response was dominated by changes in Artemisia spp. and Chrysothamnus viscidiflorus. Thus, our results supported our expectations that community change was not uniform across the landscape, but instead could be predicted based upon functional group responses to the spatial and temporal patterns of water availability, which are largely a function of plant water use and the hydrological gradient.
A Cognitive Interpretation of Data Analysis
This paper proposes a scientific model to explain the analysis process. We argue that data analysis is primarily a procedure to build understanding, and as such, it dovetails with the cognitive processes of the human mind. Data analysis tasks closely resemble the cognitive process known as sensemaking. We demonstrate how data analysis is a sensemaking task adapted to use quantitative data. This identification highlights a universal structure within data analysis activities and provides a foundation for a theory of data analysis. The competing tensions of cognitive compatibility and scientific rigour create a series of problems that characterise the data analysis process. These problems form a useful organising model for the data analysis task while allowing methods to remain flexible and situation dependent. The insights of this model are especially helpful for consultants, applied statisticians and teachers of data analysis.
Data Science: A Three Ring Circus or a Big Tent?
For context, we both trained as statisticians and spent several years as regular professors of Statistics. We still have academic appointments. Yet today we work for RStudio, building tools to improve the workflows for data scientists and statisticians. This gives us an informed and unique perspective on Donoho's piece, which explores aspects of the academic statistical establishment that are deeply connected with this unusual career path. Overall, much of the article resonated with us. Our comments deal with three main areas: academic statistics, the skills meme, and coupling of cognitive and computation tools.
ASA 2009 Data Expo
The ASA Statistical Computing and Graphics Data Expo is a biannual data exploration challenge. Participants are challenged to provide a graphical summary of important features of the data. The task is intentionally vague to allow different entries to focus on different aspects of the data, giving the participants maximum freedom to apply their skills. The 2009 data expo consisted of flight arrival and departure details for all commercial flights on major carriers within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, taking up 1.6 gigabytes of space compressed and 12 gigabytes uncompressed. The complete dataset and challenge are available on the competition website http://stat-computing.org/dataexpo/2009/. Because the dataset is so large, we also provided participants introductions to useful tools for dealing with data at this scale: Linux command-line tools, including sort, awk, and cut, and sqlite, a simple SQL database. Additionally, we provided pointers to supplemental data on airport locations, airline carrier codes, individual plane information, and weather.
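A minimal sketch of the suggested SQL workflow, here using Python's standard sqlite3 module; the rows and column names below are synthetic placeholders, not actual Data Expo records:

```python
import sqlite3

# Load flight records into an SQLite table and aggregate with a query --
# the kind of out-of-memory-friendly summary step the organizers
# recommended for a dataset too large to hold comfortably in R.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (year INT, carrier TEXT, arr_delay INT)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [(1987, "AA", 5), (1987, "UA", -3), (2008, "AA", 12), (2008, "UA", 7)],
)

# Mean arrival delay per carrier: a summary an entry might then plot.
rows = conn.execute(
    "SELECT carrier, AVG(arr_delay) FROM flights GROUP BY carrier ORDER BY carrier"
).fetchall()
```

With the real data one would bulk-load the 120 million records once (e.g. via the sqlite command-line `.import`), then issue queries like this to pull small summaries into the plotting environment.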