Catalogue Search | MBRL

The ATLAS Data Management System Rucio: Supporting LHC Run-2 and beyond

by Serfon, C , Barisits, M , Lassnig, M in Data management , Machine learning , Physics

2018

With this contribution we present some recent developments made to Rucio, the data management system of the High-Energy Physics Experiment ATLAS. Already managing 300 Petabytes of both official and user data, Rucio has seen incremental improvements throughout LHC Run-2, and is currently laying the groundwork for HEP computing in the HL-LHC era. The focus of this contribution are (a) the automations that have been put in place such as data rebalancing or dynamic replication of user data, as well as their supporting infrastructures such as real-time networking metrics or transfer time predictions; (b) the flexible approach towards inclusion of heterogeneous storage systems, including object stores, while unifying the potential access paths using generally available tools and protocols; (c) machine learning approaches to help with transfer throughput estimation; and (d) the adoption of Rucio for two other experiments, AMS and Xenon1t. We conclude by presenting operational numbers and figures to quantify these improvements, and extrapolate the necessary changes and developments for future LHC runs.

Journal Article

Share this book

Add to My Shelf

Rucio – The next generation of large scale distributed system for ATLAS Data Management

by Vigne, R , Serfon, C , Barisits, M in Architecture (computers) , Computer networks , Computer programs

2014

Rucio is the next-generation Distributed Data Management (DDM) system benefiting from recent advances in cloud and \"Big Data\" computing to address HEP experiments scaling requirements. Rucio is an evolution of the ATLAS DDM system Don Quijote 2 (DQ2), which has demonstrated very large scale data management capabilities with more than 140 petabytes spread worldwide across 130 sites, and accesses from 1,000 active users. However, DQ2 is reaching its limits in terms of scalability, requiring a large number of support staff to operate and being hard to extend with new technologies. Rucio will deal with these issues by relying on a conceptual data model and new technology to ensure system scalability, address new user requirements and employ new automation framework to reduce operational overheads. We present the key concepts of Rucio, including its data organization/representation and a model of how to manage central group and user activities. The Rucio design, and the technology it employs, is described, specifically looking at its RESTful architecture and the various software components it uses. We show also the performance of the system.

Journal Article

Share this book

Add to My Shelf

Global heterogeneous resource harvesting: the next-generation PanDA Pilot for ATLAS

by Drizhuk, D. , Oleynik, D. , Guan, W. in Harvesters , Harvesting , Physics

2018

The Production and Distributed Analysis system (PanDA), used for workload management in the ATLAS Experiment at the LHC for over a decade, has in recent years expanded its reach to diverse new resource types such as HPCs, and innovative new workflows such as the Event Service. PanDA meets the heterogeneous resources it harvests in the PanDA Pilot, which has embarked on a next-generation reengineering to efficiently integrate and exploit the new platforms and workflows. The new modular architecture is the product of a year of design and prototyping in conjunction with the design of a completely new component, Harvester, that will mediate a richer flow of control and information between Pilot and PanDA. Harvester will enable more intelligent and dynamic matching between processing tasks and resources, with an initial focus on HPCs, simplifying the operator and user view of a PanDA site but internally leveraging deep information gathering on the resource to accrue detailed knowledge of a site's capabilities and dynamic state to inform the matchmaking. This paper will give an overview of the new Pilot architecture, how it will be used in and beyond ATLAS, its relation to Harvester, and the work ahead.

Journal Article

Share this book

Add to My Shelf

Experiences with the new ATLAS Distributed Data Management System

by Serfon, C , Barisits, M , Lassnig, M in Data management , Evolution , New technology

2017

The ATLAS Distributed Data Management (DDM) system has evolved drastically in the last two years with the Rucio software fully replacing the previous system before the start of LHC Run-2. The ATLAS DDM system manages now more than 250 petabytes spread on 130 storage sites and can handle file transfer rates of up to 30Hz. In this paper, we discuss our experience acquired in developing, commissioning, running and maintaining such a large system. First, we describe the general architecture of the system, our integration with external services like the WLCG File Transfer Service and the evolution of the system over its first years of production. Then, we show the performance of the system, describe the integration of new technologies such as object stores, and outline some new developments, which mainly focus on performance and automation.

Journal Article

Share this book

Add to My Shelf

C3PO - A Dynamic Data Placement Agent for ATLAS Distributed Data Management

by Serfon, C , Barisits, M , Lassnig, M in Algorithms , Data management , Physics

2017

This paper introduces a new dynamic data placement agent for the ATLAS distributed data management system. This agent is designed to pre-place potentially popular data to make it more widely available. It therefore incorporates information from a variety of sources. Those include input datasets and sites workload information from the ATLAS workload management system, network metrics from different sources like FTS and PerfSonar, historical popularity data collected through a tracer mechanism and more. With this data it decides if, when and where to place new replicas that then can be used by the WMS to distribute the workload more evenly over available computing resources and then ultimately reduce job waiting times. This paper gives an overview of the architecture and the final implementation of this new agent. The paper also includes an evaluation of the placement algorithm by comparing the transfer times and the new replica usage.

Journal Article

Share this book

Add to My Shelf

Automatic rebalancing of data in ATLAS distributed data management

by Serfon, C , Barisits, M , Lassnig, M in Data management , Physics , Workload

2017

The ATLAS Distributed Data Management system stores more than 220PB of physics data across more than 130 sites globally. Rucio, the next generation data management system of the ATLAS collaboration, has now been successfully operated for two years. However, with the increasing workload and utilization, more automated and advanced methods of managing the data are needed. In this article we present an extension to the data management system, which is in charge of detecting and foreseeing storage elements reaching and surpassing their capacity limit. The system automatically and dynamically rebalances the data to other storage elements, while respecting and guaranteeing data distribution policies and ensuring the availability of the data. This concept not only lowers the operational burden, as these cumbersome procedures had previously to be done manually, but it also enables the system to use its distributed resources more efficiently, which not only affects the data management system itself, but in consequence also the workload management and production systems. This contribution describes the concept and architecture behind those components and shows the benefits made by the system.

Journal Article

Share this book

Add to My Shelf

Federating distributed storage for clouds in ATLAS

by Paterson, M , Serfon, C , Di Girolamo, A in Cloud computing , Distribution management , Industry standards

2018

Input data for applications that run in cloud computing centres can be stored at distant repositories, often with multiple copies of the popular data stored at many sites. Locating and retrieving the remote data can be challenging, and we believe that federating the storage can address this problem. A federation would locate the closest copy of the data on the basis of GeoIP information. Currently we are using the dynamic data federation Dynafed, a software solution developed by CERN IT. Dynafed supports several industry standards for connection protocols like Amazon's S3, Microsoft's Azure, as well as WebDAV and HTTP. Dynafed functions as an abstraction layer under which protocol-dependent authentication details are hidden from the user, requiring the user to only provide an X509 certificate. We have setup an instance of Dynafed and integrated it into the ATLAS data distribution management system. We report on the challenges faced during the installation and integration. We have tested ATLAS analysis jobs submitted by the PanDA production system and we report on our first experiences with its operation.

Journal Article

Share this book

Add to My Shelf

Analysis of CERN computing infrastructure and monitoring data

by Motesnitsalis, E , Lassnig, M , Menichetti, L in CERN , Computation , Dashboards

2015

Optimizing a computing infrastructure on the scale of LHC requires a quantitative understanding of a complex network of many different resources and services. For this purpose the CERN IT department and the LHC experiments are collecting a large multitude of logs and performance probes, which are already successfully used for short-term analysis (e.g. operational dashboards) within each group. The IT analytics working group has been created with the goal to bring data sources from different services and on different abstraction levels together and to implement a suitable infrastructure for mid- to long-term statistical analysis. It further provides a forum for joint optimization across single service boundaries and the exchange of analysis methods and tools. To simplify access to the collected data, we implemented an automated repository for cleaned and aggregated data sources based on the Hadoop ecosystem. This contribution describes some of the challenges encountered, such as dealing with heterogeneous data formats, selecting an efficient storage format for map reduce and external access, and will describe the repository user interface. Using this infrastructure we were able to quantitatively analyze the relationship between CPU/wall fraction, latency/throughput constraints of network and disk and the effective job throughput. In this contribution we will first describe the design of the shared analysis infrastructure and then present a summary of first analysis results from the combined data sources.

Journal Article

Share this book

Add to My Shelf

Resource control in ATLAS distributed data management: Rucio Accounting and Quotas

by Serfon, C , Vigne, R , Barisits, M in Data management , Experiments , Physics

2015

The ATLAS Distributed Data Management system manages more than 160PB of physics data across more than 130 sites globally. Rucio, the next generation Distributed Data Management system of the ATLAS experiment, replaced DQ2 in December 2014 and will manage the experiment's data throughout Run 2 of the LHC and beyond. The previous data management system pursued a rather simplistic approach for resource management, but with the increased data volume and more dynamic handling of data workflows required by the experiment, a more elaborate approach is needed. Rucio was delivered with an initial quota system, but during the first months of operation it turned out to not fully satisfy the collaboration's resource management needs. We consequently introduce a new concept of declaring quota policies (limits) for accounts in Rucio. This new quota concept is based on accounts and RSE (Rucio storage element) expressions, which allows the definition of hierarchical quotas in a dynamic way. This concept enables the operators of the data management system to implement very specific policies for users, physics groups and production systems while, at the same time, lowering the operational burden. This contribution describes the concept, architecture and workflow of the system and includes an evaluation measuring the performance of the system.

Journal Article

Share this book

Add to My Shelf

Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio

by Vigne, R , Serfon, C , Barisits, M in Data centers , Data management , Error recovery

2015

This contribution details the deployment of Rucio, the ATLAS Distributed Data Management system. The main complication is that Rucio interacts with a wide variety of external services, and connects globally distributed data centres under different technological and administrative control, at an unprecedented data volume. It is therefore not possible to create a duplicate instance of Rucio for testing or integration. Every software upgrade or configuration change is thus potentially disruptive and requires fail-safe software and automatic error recovery. Rucio uses a three-layer scaling and mitigation strategy based on quasi-realtime monitoring. This strategy mainly employs independent stateless services, automatic failover, and service migration. The technologies used for deployment and mitigation include OpenStack, Puppet, Graphite, HAProxy and Apache. In this contribution, the interplay between these components, their deployment, software mitigation, and the monitoring strategy are discussed.

Journal Article

Share this book

Add to My Shelf

Language Selector

MBRLGlobalSearch

Language Selector

Catalogue Search | MBRL

Search Results Heading

Explore the vast range of titles available.

MBRLSearchResults

MBRLHappinessMeter