Catalogue Search | MBRL

Reading Tea Leaves - Understanding internal events and addressing performance issues within a CephFS/XRootD Storage Element

by Simpson, Steven; Doidge, Matt; Love, Peter in Data storage, Monitoring, Reliability

2025

Erasure-coded storage systems based on Ceph have become a mainstay within UK Grid sites as a means of providing bulk data storage whilst maintaining a good balance between data safety and space efficiency. These storage systems are complex and self-correcting, but despite access to a myriad of metrics, the inner workings of the storage tend to be opaque to the storage admin. One of the common problems seen within Ceph-based systems is slow ops: operations that take longer than expected and are often blocking in nature, impacting the overall performance and reliability of the system. Identifying the causes of slow ops can help to prevent or reduce the impact of future occurrences, leading to an increase in performance and reliability. We detail the efforts of the Lancaster Grid Site to understand the causes of, and mitigate against, these slow ops and other performance bottlenecks within our storage system. We endeavour to bring together a holistic monitoring model, utilising Ceph metrics, detailed XRootD monitoring streams, and client-side logging, in order to understand how data-management events impact the health of the storage.

Journal Article

A Blueprint for a Contemporary Storage Element, building a new WLCG storage system with widely available hardware and software components: Ceph, XRootD, and Prometheus

by Jones, Roger; Simpson, Steven; Doidge, Matt in Failure, Hardware, Monitoring

2024

When a new long-term storage facility was needed at the Lancaster WLCG Tier-2 Site, an architecture was chosen involving CephFS as a failure-tolerant back-end volume, and load-balanced XRootD as an endpoint exposing the volume via the HTTPS/DAVS protocols increasingly favoured by the WLCG and other users. This allows operations to continue in the face of disc/node failures with minimal management, and enables good utilization of network connectivity for remote access. We deployed a Prometheus/Loki/Grafana monitoring/alerting stack for timely detection and resolution of failures in such a production environment. Some custom scripts were required to integrate the off-the-shelf functional components with the monitoring. With such a monitoring system in place, failures such as disc defects, data corruption and resource exhaustion in long-running processes can be anticipated, and their management planned. We describe the hardware platform and our requirements on it, and detail the software architecture from initial design, through adaptations to face challenges encountered during production, to its present condition. Developments and contributions to related projects that help to fully exploit our design decisions are described. We include performance metrics of the system, the lessons learned during production, and our future plans.

Journal Article

Preliminary findings and recommendations from the Token Trust and Traceability Working Group

by Hardt, Marcus; Kelsey, David; Doidge, Matt in Best practice, Certificates, Distributed processing

2025

Created in 2023, the Token Trust and Traceability Working Group (TTT) was formed in order to answer questions of policy and best practice with the ongoing move from X.509 and VOMS proxy certificates to token-based solutions as the primary authorisation and authentication method in distributed computing environments. With a remit to act in an investigatory and advisory capacity alongside other working groups in the token space, the TTT is composed of a broad variety of stakeholders to provide a breadth of experience and viewpoints. While the requirements of grid sites, users, identity providers and virtual organisations to be able to trace workflows have remained largely the same in a token paradigm as in one using X.509 certificates, tokens provide a new set of challenges, requiring a rethink and restructure of the policies and processes that were defined with just X.509 and VOMS in mind, in order to meet these requirements in the new context. After providing an overview of the current status of the token trust landscape, we will detail the initial findings, future plans and recommendations to be made by the TTT. This will include best practice for sites and identity providers, suggestions for token development, and methodologies for tracing token usage by system administrators within common grid middleware stacks.

Journal Article

Overview of the distributed image processing infrastructure to produce the Legacy Survey of Space and Time

by White, Brandon; Le Boulc’h, Quentin; Doidge, Matt in Algorithms, Astronomical catalogs, Astronomy

2024

The Vera C. Rubin Observatory is preparing to execute the most ambitious astronomical survey ever attempted, the Legacy Survey of Space and Time (LSST). Currently the final phase of construction is under way in the Chilean Andes, with the Observatory’s ten-year science mission scheduled to begin in 2025. Rubin’s 8.4-meter telescope will nightly scan the southern hemisphere collecting imagery in the wavelength range 320–1050 nm covering the entire observable sky every 4 nights using a 3.2 gigapixel camera, the largest imaging device ever built for astronomy. Automated detection and classification of celestial objects will be performed by sophisticated algorithms on high-resolution images to progressively produce an astronomical catalog eventually composed of 20 billion galaxies and 17 billion stars and their associated physical properties. In this article we present an overview of the system currently being constructed to perform data distribution as well as the annual campaigns which reprocess the entire image dataset collected since the beginning of the survey. These processing campaigns will utilize computing and storage resources provided by three Rubin data facilities (one in the US and two in Europe). Each year a Data Release will be produced and disseminated to science collaborations for use in studies comprising four main science pillars: probing dark matter and dark energy, taking inventory of solar system objects, exploring the transient optical sky and mapping the Milky Way. Also presented is the method by which we leverage some of the common tools and best practices used for management of large-scale distributed data processing projects in the high energy physics and astronomy communities. We also demonstrate how these tools and practices are utilized within the Rubin project in order to overcome the specific challenges faced by the Observatory.

Journal Article

Overview of the distributed image processing infrastructure to produce the Legacy Survey of Space and Time

by White, Brandon; Doidge, Matt; Beckett, George in Algorithms, Astronomical catalogs, Astronomy

2023

The Vera C. Rubin Observatory is preparing to execute the most ambitious astronomical survey ever attempted, the Legacy Survey of Space and Time (LSST). Currently the final phase of construction is under way in the Chilean Andes, with the Observatory's ten-year science mission scheduled to begin in 2025. Rubin's 8.4-meter telescope will nightly scan the southern hemisphere collecting imagery in the wavelength range 320-1050 nm covering the entire observable sky every 4 nights using a 3.2 gigapixel camera, the largest imaging device ever built for astronomy. Automated detection and classification of celestial objects will be performed by sophisticated algorithms on high-resolution images to progressively produce an astronomical catalog eventually composed of 20 billion galaxies and 17 billion stars and their associated physical properties. In this article we present an overview of the system currently being constructed to perform data distribution as well as the annual campaigns which reprocess the entire image dataset collected since the beginning of the survey. These processing campaigns will utilize computing and storage resources provided by three Rubin data facilities (one in the US and two in Europe). Each year a Data Release will be produced and disseminated to science collaborations for use in studies comprising four main science pillars: probing dark matter and dark energy, taking inventory of solar system objects, exploring the transient optical sky and mapping the Milky Way. Also presented is the method by which we leverage some of the common tools and best practices used for management of large-scale distributed data processing projects in the high energy physics and astronomy communities. We also demonstrate how these tools and practices are utilized within the Rubin project in order to overcome the specific challenges faced by the Observatory.

Paper
