Catalogue Search | MBRL
Explore the vast range of titles available.
89 result(s) for "Apache Hadoop"
Hadoop for dummies
Let Hadoop For Dummies help harness the power of your data and rein in the information overload. Big data has become big business, and companies and organizations of all sizes are struggling to find ways to retrieve valuable information from their massive data sets without becoming overwhelmed. Enter Hadoop and this easy-to-understand For Dummies guide. Hadoop For Dummies helps readers understand the value of big data, make a business case for using Hadoop, navigate the Hadoop ecosystem, and build and manage Hadoop applications and clusters.
SSK-DDoS: distributed stream processing framework based classification system for DDoS attacks
by
Krishna, C. Rama
,
Kumar, Krishan
,
Patil, Nilesh Vishwasrao
in
Algorithms
,
Batch processing
,
Classification
2022
Distributed denial of service (DDoS) is an immense threat to Internet-based applications and their resources. It floods the victim system with a large number of network packets, making the victim's resources unavailable to legitimate users. This attack is therefore considered dangerous for Internet-based applications and their resources. Several security approaches have been proposed in the literature to protect Internet-based applications from this type of threat. However, the frequency and strength of DDoS attacks are increasing day by day. Further, most traditional and distributed processing framework-based DDoS attack detection systems analyze network flows in offline batch processing and hence fail to classify network flows in real time. This paper proposes a novel Spark Streaming and Kafka-based distributed classification system, named SSK-DDoS, for classifying different types of DDoS attacks and legitimate network flows. The classification approach is implemented using distributed Spark MLlib machine learning algorithms on a Hadoop cluster and deployed on the Spark Streaming platform to classify streams in real time. Incoming streams are consumed from a Kafka topic, where preprocessing tasks such as extracting and formulating features are performed to classify flows into seven groups: Benign, DDoS-DNS, DDoS-LDAP, DDoS-MSSQL, DDoS-NetBIOS, DDoS-UDP, and DDoS-SYN. Further, the SSK-DDoS classification system stores the formulated features with their predicted class in HDFS, which helps retrain the distributed classification approach using a new set of samples. The proposed SSK-DDoS classification system has been validated using the recent CICDDoS2019 dataset. The results show that the proposed SSK-DDoS efficiently classifies network flows into seven classes and stores the formulated features, with the predicted class of each incoming network flow, in HDFS.
Journal Article
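The abstract above describes a pipeline in which Spark Structured Streaming consumes flow records from a Kafka topic, scores them with a pre-trained MLlib model, and writes the features plus predicted class back to HDFS. The PySpark sketch below illustrates that shape only; the topic name, feature schema, model path, and HDFS paths are placeholder assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a Kafka -> Spark Structured Streaming -> MLlib -> HDFS flow,
# mirroring the SSK-DDoS design described in the abstract (all names are placeholders).
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("ssk-ddos-sketch").getOrCreate()

# Hypothetical flow-feature schema; the paper derives its features from CICDDoS2019.
schema = StructType([
    StructField("flow_duration", DoubleType()),
    StructField("pkt_rate", DoubleType()),
    StructField("byte_rate", DoubleType()),
])

# Requires the spark-sql-kafka connector on the classpath; broker and topic are placeholders.
flows = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "network-flows")
         .load()
         .select(from_json(col("value").cast("string"), schema).alias("f"))
         .select("f.*"))

# Pre-trained MLlib pipeline (feature assembler + classifier) saved earlier on the cluster.
model = PipelineModel.load("hdfs:///models/ssk_ddos_pipeline")
scored = model.transform(flows)

# Persist features and predicted class to HDFS so the model can later be retrained on new samples.
query = (scored.writeStream
         .format("parquet")
         .option("path", "hdfs:///ddos/classified_flows")
         .option("checkpointLocation", "hdfs:///ddos/checkpoints")
         .start())
query.awaitTermination()
```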
Pro Hadoop data analytics : designing and building big data systems using the Hadoop ecosystem
\"Learn advanced analytical techniques and leverage existing toolkits to make your analytic applications more powerful, precise, and efficient. This book provides the right combination of architecture, design, and implementation information to create analytical systems which go beyond the basics of classification, clustering, and recommendation\" -- All IT eBooks website.
A comprehensive bibliometric analysis of Apache Hadoop from 2008 to 2020
2023
Purpose: The purpose of this paper is to provide an overview of 6,618 publications on Apache Hadoop from 2008 to 2020, offering a conclusive and comprehensive analysis for researchers in this field, as well as preliminary knowledge of Apache Hadoop for interested researchers. Design/methodology/approach: This paper employs bibliometric analysis and visual analysis approaches to systematically study publications about Apache Hadoop in the Web of Science database, with the aid of visualization applications. Through the bibliometric analysis of the collected documents, the paper analyzes the main statistical characteristics and cooperation networks; research themes, research hotspots and future development trends are also investigated through keyword analysis. Findings: Research on Apache Hadoop will remain a top priority in the future, and how to improve the performance of Apache Hadoop in the era of big data is one of the research hotspots. Research limitations/implications: This paper makes a comprehensive analysis of Apache Hadoop with bibliometric methods, helping researchers quickly grasp the hot topics in this area. Originality/value: This paper draws the structural characteristics of the publications in this field and summarizes the research hotspots and trends of recent years, aiming to understand the development status and trends in this field and inspire new ideas for researchers.
Journal Article
Pig design patterns
by
Pasupuleti, Pradeep
in
Apache Hadoop
,
Open source software
,
Programming languages (Electronic computers)
2014
Pig makes Hadoop programming simple, intuitive, and fun to work with. It removes the complexity from MapReduce programming by giving the programmer immense power through its flexibility. What used to be extremely lengthy and intricate code written in other high-level languages can now be written in almost one-tenth of the size using its easy-to-understand constructs. Pig has proven to be the easiest way to learn how to program Hadoop clusters, as evidenced by its widespread adoption.
Learning Apache Drill : query and analyze distributed data sources with SQL
Get up to speed with Apache Drill, an extensible distributed SQL query engine that reads massive datasets in many popular file formats such as Parquet, JSON, and CSV. Drill reads data in HDFS or in cloud-native storage such as S3 and works with Hive metastores along with distributed databases such as HBase, MongoDB, and relational databases. Drill works everywhere: on your laptop or in your largest cluster. In this practical book, Drill committers Charles Givre and Paul Rogers show analysts and data scientists how to query and analyze raw data using this powerful tool. Data scientists today spend about 80% of their time just gathering and cleaning data. With this book, you'll learn how Drill helps you analyze data more effectively to drive down time to insight. Use Drill to clean, prepare, and summarize delimited data for further analysis ; Query file types including logfiles, Parquet, JSON, and other complex formats ; Query Hadoop, relational databases, MongoDB, and Kafka with standard SQL ; Connect to Drill programmatically using a variety of languages ; Use Drill even with challenging or ambiguous file formats ; Perform sophisticated analysis by extending Drill's functionality with user-defined functions ; Facilitate data analysis for network security, image metadata, and machine learning
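The blurb above mentions connecting to Drill programmatically; one common route is Drill's REST API on its embedded web server (port 8047 by default). The Python sketch below posts a SQL query to the /query.json endpoint; the host, the CSV path, and the column layout are illustrative assumptions.

```python
# Minimal sketch: run a SQL query against Apache Drill's REST API with the
# requests library. Host, file path, and column positions are placeholders.
import requests

DRILL_URL = "http://localhost:8047/query.json"

payload = {
    "queryType": "SQL",
    # Query a headerless CSV file in place via the dfs storage plugin (hypothetical path);
    # Drill exposes CSV fields through the columns[] array.
    "query": "SELECT columns[0] AS ip, columns[1] AS bytes "
             "FROM dfs.`/data/logs/traffic.csv` LIMIT 10",
}

resp = requests.post(DRILL_URL, json=payload, timeout=30)
resp.raise_for_status()

# The response body contains the column names and the result rows.
result = resp.json()
for row in result.get("rows", []):
    print(row)
```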
A distributed data processing scheme based on Hadoop for synchrotron radiation experiments
2024
With the development of synchrotron radiation sources and high-frame-rate detectors, the amount of experimental data collected at synchrotron radiation beamlines has increased exponentially. As a result, data processing for synchrotron radiation experiments has entered the era of big data. It is becoming increasingly important for beamlines to have the capability to process large-scale data in parallel to keep up with the rapid growth of data. Currently, there is no established data processing solution for beamlines based on a big data technology framework. Apache Hadoop is a widely used distributed system architecture for solving the problem of massive data storage and computation. This paper presents a distributed data processing scheme for beamline experimental data using Hadoop. The Hadoop Distributed File System is utilized as the distributed file storage system, and Hadoop YARN serves as the resource scheduler for the distributed computing cluster. A distributed data processing pipeline that can carry out massively parallel computation is designed and developed using Spark. The entire data processing platform adopts a distributed microservice architecture, which makes the system easy to expand, reduces module coupling and improves reliability.
Journal Article
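The abstract above describes HDFS for storage, YARN for resource scheduling, and a Spark pipeline for massively parallel processing. As a hedged illustration of that arrangement, the PySpark sketch below reads frame records from HDFS, performs a simple parallel reduction, and writes the result back to HDFS; it could be submitted to YARN with `spark-submit --master yarn`. The paths, column names, and the reduction itself are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch of a batch step in an HDFS + YARN + Spark beamline pipeline
# (all paths and columns are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("beamline-reduction-sketch").getOrCreate()

# Raw per-frame records previously ingested into HDFS (hypothetical layout).
frames = spark.read.parquet("hdfs:///beamline/raw_frames")

# Example massively parallel reduction: average detector intensity per scan.
reduced = (frames
           .groupBy("scan_id")
           .agg(F.avg("intensity").alias("mean_intensity"),
                F.count("*").alias("n_frames")))

# Write the reduced results back to HDFS for downstream analysis services.
reduced.write.mode("overwrite").parquet("hdfs:///beamline/reduced")
spark.stop()
```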
Getting started with Kudu : perform fast analytics on fast data
\"Begun as an internal project at [the firm] Cloudera, Kudu is an open source solution compatible with many data processing frameworks in the Hadoop environment. In this book, current and former solutions professionals from Cloudera provide use cases, examples, best practices, and sample code\"--Page 4 of cover.
Hadoop for dummies
2014
Let Hadoop For Dummies help harness the power of your data and rein in the information overload. Big data has become big business, and companies and organizations of all sizes are struggling to find ways to retrieve valuable information from their massive data sets without becoming overwhelmed. Enter Hadoop and this easy-to-understand For Dummies guide. Hadoop For Dummies helps readers understand the value of big data, make a business case for using Hadoop, navigate the Hadoop ecosystem, and build and manage Hadoop applications and clusters.
* Explains the origins of Hadoop, its economic benefits, and its functionality and practical applications
* Helps you find your way around the Hadoop ecosystem, program MapReduce, utilize design patterns, and get your Hadoop cluster up and running quickly and easily
* Details how to use Hadoop applications for data mining, web analytics and personalization, large-scale text processing, data science, and problem-solving
* Shows you how to improve the value of your Hadoop cluster, maximize your investment in Hadoop, and avoid common pitfalls when building your Hadoop cluster
From programmers challenged with building and maintaining affordable, scalable data systems to administrators who must deal with huge volumes of information effectively and efficiently, this how-to has something to help you with Hadoop.