Wrangling messy CSV files by detecting row and type patterns
by Nazábal, A; Sutton, C; van den Burg, G J J
in Consistency / Format / Human wastes / Inspection / Tables (data)
2019
Journal Article
Overview
Data scientists spend the majority of their time on preparing data for analysis. One of the first steps in this preparation phase is to load the data from the raw storage format. Comma-separated value (CSV) files are a popular format for tabular data due to their simplicity and ostensible ease of use. However, formatting standards for CSV files are not followed consistently, so each file requires manual inspection and potentially repair before the data can be loaded, an enormous waste of human effort for a task that should be one of the simplest parts of data science. The first and most essential step in retrieving data from CSV files is deciding on the dialect of the file, such as the cell delimiter and quote character. Existing dialect detection approaches are few and non-robust. In this paper, we propose a dialect detection method based on a novel measure of data consistency of parsed data files. Our method achieves 97% overall accuracy on a large corpus of real-world CSV files and improves the accuracy on messy CSV files by almost 22% compared to existing approaches, including those in the Python standard library. Our measure of data consistency is not specific to the data parsing problem, and has potential for more general applicability.
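To make the dialect detection problem concrete, the sketch below uses the Python standard library's `csv.Sniffer`, one of the existing approaches the overview compares against. It is not the consistency-based method proposed in the paper. The function name `sniff_and_parse` and the candidate delimiter set are assumptions chosen for illustration.

```python
import csv
import io


def sniff_and_parse(text, candidate_delimiters=";,\t|"):
    """Guess the dialect of CSV text with csv.Sniffer, then parse it.

    This is the standard-library baseline, not the paper's method.
    `candidate_delimiters` is an illustrative assumption; csv.Sniffer
    raises csv.Error when it cannot settle on a delimiter.
    """
    dialect = csv.Sniffer().sniff(text, delimiters=candidate_delimiters)
    rows = list(csv.reader(io.StringIO(text), dialect))
    return dialect, rows


if __name__ == "__main__":
    # A small semicolon-delimited sample (made up for this example).
    sample = "name;age;city\nAlice;30;Paris\nBob;25;Berlin\n"
    dialect, rows = sniff_and_parse(sample)
    print(repr(dialect.delimiter), repr(dialect.quotechar))
    for row in rows:
        print(row)
```

On clean input like this sample the heuristic usually suffices; the overview's point is that such heuristics are less reliable on messy real-world files, which is where a consistency-based measure is claimed to help.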
Publisher
Springer Nature B.V.