18 results for "Barisits, M.-S."
The ATLAS Data Management System Rucio: Supporting LHC Run-2 and beyond
With this contribution we present some recent developments made to Rucio, the data management system of the high-energy physics experiment ATLAS. Already managing 300 petabytes of both official and user data, Rucio has seen incremental improvements throughout LHC Run-2 and is currently laying the groundwork for HEP computing in the HL-LHC era. The focus of this contribution is (a) the automations that have been put in place, such as data rebalancing or dynamic replication of user data, as well as their supporting infrastructures, such as real-time networking metrics or transfer time predictions; (b) the flexible approach towards the inclusion of heterogeneous storage systems, including object stores, while unifying the potential access paths using generally available tools and protocols; (c) machine learning approaches to help with transfer throughput estimation; and (d) the adoption of Rucio by two other experiments, AMS and Xenon1t. We conclude by presenting operational numbers and figures to quantify these improvements, and extrapolate the necessary changes and developments for future LHC runs.
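As a flavour of the throughput-estimation idea mentioned in (c), here is a toy least-squares regression over transfer features; the features, synthetic history, and linear model are illustrative assumptions, not the production approach:

# Toy transfer-duration regression; features and data are illustrative.
import numpy as np

# Synthetic history: (file size in GB, concurrent transfers) -> duration in s.
X = np.array([[1.0, 2], [10.0, 2], [10.0, 20], [50.0, 5]])
y = np.array([12.0, 95.0, 240.0, 460.0])

# Ordinary least squares with an intercept column.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_seconds(size_gb: float, concurrency: int) -> float:
    """Predict a transfer's duration under the fitted linear model."""
    return float(np.dot([size_gb, concurrency, 1.0], coef))

print(round(predict_seconds(20.0, 10), 1))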
Search for light long-lived neutral particles produced in pp collisions at √s = 13 TeV and decaying into collimated leptons or light hadrons with the ATLAS detector
Several models of physics beyond the Standard Model predict the existence of dark photons, light neutral particles decaying into collimated leptons or light hadrons. This paper presents a search for long-lived dark photons produced from the decay of a Higgs boson or a heavy scalar boson and decaying into displaced collimated Standard Model fermions. The search uses data corresponding to an integrated luminosity of 36.1 fb⁻¹ collected in proton–proton collisions at √s = 13 TeV recorded in 2015–2016 with the ATLAS detector at the Large Hadron Collider. The observed number of events is consistent with the expected background, and limits on the production cross section times branching fraction as a function of the proper decay length of the dark photon are reported. A cross section times branching fraction above 4 pb is excluded for a Higgs boson decaying into two dark photons for dark-photon decay lengths between 1.5 mm and 307 mm.
Experiences with the new ATLAS Distributed Data Management System
The ATLAS Distributed Data Management (DDM) system has evolved drastically in the last two years, with the Rucio software fully replacing the previous system before the start of LHC Run-2. The ATLAS DDM system now manages more than 250 petabytes spread across 130 storage sites and can handle file transfer rates of up to 30 Hz. In this paper, we discuss our experience acquired in developing, commissioning, running, and maintaining such a large system. First, we describe the general architecture of the system, our integration with external services like the WLCG File Transfer Service, and the evolution of the system over its first years of production. Then, we show the performance of the system, describe the integration of new technologies such as object stores, and outline some new developments, which mainly focus on performance and automation.
C3PO - A Dynamic Data Placement Agent for ATLAS Distributed Data Management
This paper introduces a new dynamic data placement agent for the ATLAS distributed data management system. The agent is designed to pre-place potentially popular data to make it more widely available, and it therefore incorporates information from a variety of sources: input dataset and site workload information from the ATLAS workload management system, network metrics from sources such as FTS and PerfSonar, historical popularity data collected through a tracer mechanism, and more. With this data it decides if, when, and where to place new replicas, which the WMS can then use to distribute the workload more evenly over the available computing resources and ultimately reduce job waiting times. This paper gives an overview of the architecture and the final implementation of this new agent. It also includes an evaluation of the placement algorithm, comparing transfer times and the usage of the new replicas.
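A minimal sketch of the kind of if/when/where scoring such an agent performs; the metrics, weights, and site names below are illustrative assumptions, not the C3PO algorithm itself:

# Hypothetical placement scoring; metrics and weights are assumptions.
from dataclasses import dataclass

@dataclass
class SiteMetrics:
    queued_jobs: int        # waiting jobs that want the dataset
    free_storage_tb: float  # available space on the storage element
    throughput_mbps: float  # recent inbound transfer rate

def placement_score(m: SiteMetrics, dataset_size_tb: float) -> float:
    """Rank candidate sites; higher is better."""
    if m.free_storage_tb < dataset_size_tb:
        return float("-inf")  # cannot host the replica at all
    # Favour sites with many waiting jobs and fast inbound links.
    return 2.0 * m.queued_jobs + 0.5 * m.throughput_mbps

def choose_site(candidates: dict, dataset_size_tb: float) -> str:
    return max(candidates, key=lambda s: placement_score(candidates[s], dataset_size_tb))

sites = {
    "SITE_A": SiteMetrics(queued_jobs=120, free_storage_tb=50.0, throughput_mbps=800.0),
    "SITE_B": SiteMetrics(queued_jobs=40, free_storage_tb=500.0, throughput_mbps=400.0),
}
print(choose_site(sites, dataset_size_tb=20.0))  # SITE_A: score 640 vs 280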
Automatic rebalancing of data in ATLAS distributed data management
The ATLAS Distributed Data Management system stores more than 220 PB of physics data across more than 130 sites globally. Rucio, the next-generation data management system of the ATLAS collaboration, has now been operated successfully for two years. However, with the increasing workload and utilisation, more automated and advanced methods of managing the data are needed. In this article we present an extension to the data management system that detects, and anticipates, storage elements reaching or surpassing their capacity limits. The system automatically and dynamically rebalances data to other storage elements, while respecting and guaranteeing data distribution policies and ensuring the availability of the data. This not only lowers the operational burden, as these cumbersome procedures previously had to be done manually, but also enables the system to use its distributed resources more efficiently, which in consequence benefits not only the data management system itself but also the workload management and production systems. This contribution describes the concept and architecture behind these components and shows the benefits delivered by the system.
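A toy illustration of the detection-and-pairing step; the thresholds, data structures, and RSE names are assumptions, and the real component additionally enforces the distribution policies this sketch omits:

# Toy capacity-based rebalancing; thresholds and names are assumed.
def find_overfull(storage: dict, limit: float = 0.9) -> list:
    """storage maps RSE name -> (used_pb, total_pb); return RSEs above limit."""
    return [rse for rse, (used, total) in storage.items() if used / total > limit]

def plan_moves(storage: dict, overfull: list, headroom: float = 0.8) -> list:
    """Pair each overfull RSE with the emptiest RSE still below headroom."""
    targets = sorted(
        (rse for rse, (used, total) in storage.items() if used / total < headroom),
        key=lambda rse: storage[rse][0] / storage[rse][1],
    )
    return list(zip(overfull, targets))

storage = {"RSE_A": (9.5, 10.0), "RSE_B": (3.0, 10.0), "RSE_C": (9.2, 10.0)}
print(plan_moves(storage, find_overfull(storage)))  # [('RSE_A', 'RSE_B')]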
Rucio – The next generation of large scale distributed system for ATLAS Data Management
Rucio is the next-generation Distributed Data Management (DDM) system, benefiting from recent advances in cloud and "Big Data" computing to address the scaling requirements of HEP experiments. Rucio is an evolution of the ATLAS DDM system Don Quijote 2 (DQ2), which has demonstrated very large scale data management capabilities, with more than 140 petabytes spread worldwide across 130 sites and accesses from 1,000 active users. However, DQ2 is reaching its limits in terms of scalability: it requires a large number of support staff to operate and is hard to extend with new technologies. Rucio will deal with these issues by relying on a conceptual data model and new technology to ensure system scalability, address new user requirements, and employ a new automation framework to reduce operational overheads. We present the key concepts of Rucio, including its data organisation and representation and a model of how to manage central group and user activities. The Rucio design and the technology it employs are described, looking specifically at its RESTful architecture and the various software components it uses. We also show the performance of the system.
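To illustrate the RESTful style, a hypothetical client interaction; the hostname, endpoint path, auth header, and JSON fields are assumptions for illustration, not Rucio's documented API:

# Hypothetical REST interaction; endpoint, header, and fields are assumed.
import requests

BASE = "https://ddm.example.org"       # placeholder server
HEADERS = {"X-Auth-Token": "<token>"}  # token from a prior auth call

# Resource-oriented: a dataset's replicas are addressable by scope and name.
resp = requests.get(f"{BASE}/dids/user.jdoe/mydataset/replicas",
                    headers=HEADERS, timeout=10)
resp.raise_for_status()
for replica in resp.json():
    print(replica["rse"], replica["state"])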
Resource control in ATLAS distributed data management: Rucio Accounting and Quotas
The ATLAS Distributed Data Management system manages more than 160 PB of physics data across more than 130 sites globally. Rucio, the next-generation Distributed Data Management system of the ATLAS experiment, replaced DQ2 in December 2014 and will manage the experiment's data throughout Run 2 of the LHC and beyond. The previous data management system pursued a rather simplistic approach to resource management, but the increased data volume and the more dynamic handling of data workflows required by the experiment call for a more elaborate approach. Rucio was delivered with an initial quota system, but during the first months of operation it turned out not to fully satisfy the collaboration's resource management needs. We consequently introduce a new concept for declaring quota policies (limits) for accounts in Rucio. This new quota concept is based on accounts and RSE (Rucio storage element) expressions, which allows hierarchical quotas to be defined in a dynamic way. It enables the operators of the data management system to implement very specific policies for users, physics groups, and production systems while, at the same time, lowering the operational burden. This contribution describes the concept, architecture, and workflow of the system and includes an evaluation measuring its performance.
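A simplified sketch of quotas keyed by RSE expressions; the expression evaluator below handles only '&'-conjunctions of attribute=value terms, and the attributes, accounts, and limits are assumptions for illustration:

# Simplified quota check keyed by RSE expressions; names are assumed.
RSES = {
    "SITE_A_DATADISK": {"tier": "1", "country": "de"},
    "SITE_B_DATADISK": {"tier": "2", "country": "de"},
}

def matching_rses(expression: str) -> list:
    """Return RSEs whose attributes satisfy every term of the expression."""
    terms = dict(term.split("=") for term in expression.split("&"))
    return [rse for rse, attrs in RSES.items()
            if all(attrs.get(k) == v for k, v in terms.items())]

# One limit can cover a whole class of storage elements at once.
QUOTAS = {("jdoe", "country=de"): 100.0}  # (account, expression) -> limit in TB

def within_quota(account: str, rse: str, usage_tb: float, request_tb: float) -> bool:
    for (acct, expr), limit in QUOTAS.items():
        if acct == account and rse in matching_rses(expr):
            if usage_tb + request_tb > limit:
                return False
    return True

print(within_quota("jdoe", "SITE_A_DATADISK", usage_tb=95.0, request_tb=10.0))  # False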
Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio
This contribution details the deployment of Rucio, the ATLAS Distributed Data Management system. The main complication is that Rucio interacts with a wide variety of external services and connects globally distributed data centres under different technological and administrative control, at an unprecedented data volume. It is therefore not possible to create a duplicate instance of Rucio for testing or integration. Every software upgrade or configuration change is thus potentially disruptive and requires fail-safe software and automatic error recovery. Rucio uses a three-layer scaling and mitigation strategy based on quasi-real-time monitoring. This strategy mainly employs independent stateless services, automatic failover, and service migration. The technologies used for deployment and mitigation include OpenStack, Puppet, Graphite, HAProxy and Apache. In this contribution, the interplay between these components, their deployment, software mitigation, and the monitoring strategy are discussed.
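As a client-side counterpart to such a fail-safe deployment of stateless services, a minimal retry-across-replicas sketch; the hostnames and retry policy are illustrative assumptions:

# Minimal failover sketch; hostnames and retry policy are assumed.
import requests

ENDPOINTS = ["https://ddm-1.example.org", "https://ddm-2.example.org"]

def get_with_failover(path: str, attempts_per_host: int = 2) -> dict:
    """Try each redundant stateless endpoint in turn before giving up."""
    last_error = None
    for host in ENDPOINTS:
        for _ in range(attempts_per_host):
            try:
                resp = requests.get(host + path, timeout=5)
                resp.raise_for_status()
                return resp.json()
            except requests.RequestException as err:
                last_error = err  # stateless servers make retries safe
    raise RuntimeError(f"all endpoints failed: {last_error}")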
Monitoring and controlling ATLAS data management: The Rucio web user interface
The monitoring and controlling interfaces of the previous data management system, DQ2, followed the evolutionary requirements and needs of the ATLAS collaboration. The new data management system, Rucio, has put in place a redesigned web-based interface based upon the lessons learnt from DQ2 and the increased volume of managed information. This interface encompasses both a monitoring and a controlling component, and allows easy integration of user-generated views. The interface follows three design principles. First, the collection and storage of data from internal and external systems is asynchronous to reduce latency; this includes the use of technologies like ActiveMQ or Nagios. Second, due to its volume, the analysis of the data into information is done in a massively parallel fashion, using a combined approach with an Oracle database and Hadoop MapReduce. Third, sharing the information does not distinguish between human and programmatic access, making it easy to retrieve selected parts of the information both from constrained frontends like web browsers and from remote services. This contribution details the reasons for these principles and the design choices taken. Additionally, the implementation, the interactions with external systems, and an evaluation of the system in production, from both a technological and a user perspective, conclude this contribution.
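To make the third principle concrete, a sketch of a single endpoint serving browsers and programs alike; the Flask app, route, and data are assumptions for illustration, not the Rucio web interface code:

# Sketch of one endpoint for humans and programs; app and data are assumed.
from flask import Flask, jsonify, request

app = Flask(__name__)
TRANSFERS = [{"rse": "SITE_A", "rate_mbps": 820},
             {"rse": "SITE_B", "rate_mbps": 410}]

@app.route("/transfers")
def transfers():
    # Browsers asking for HTML get a table; everything else gets JSON.
    if request.accept_mimetypes.best == "text/html":
        rows = "".join(f"<tr><td>{t['rse']}</td><td>{t['rate_mbps']}</td></tr>"
                       for t in TRANSFERS)
        return f"<table>{rows}</table>"
    return jsonify(TRANSFERS)

if __name__ == "__main__":
    app.run()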
Experience with Rucio in the wider HEP community
Managing the data of scientific projects is an increasingly complicated challenge, which was historically met by developing experiment-specific solutions. However, the ever-growing data rates and requirements of even small experiments make this approach very difficult, if not prohibitive. In recent years, the scientific data management system Rucio has evolved into a successful open-source project that is now being used by many scientific communities and organisations. Rucio is incorporating the contributions and expertise of many scientific projects and is offering common features useful to a diverse research community. This article describes the recent experiences in operating Rucio, as well as contributions to the project, by ATLAS, Belle II, CMS, ESCAPE, IGWN, LDMX, Folding@Home, and the UK’s Science and Technology Facilities Council (STFC).