5 result(s) for "Doidge, Matt"
Reading Tea Leaves - Understanding internal events and addressing performance issues within a CephFS/XRootD Storage Element
Erasure-coded storage systems based on Ceph have become a mainstay within UK Grid sites as a means of providing bulk data storage whilst maintaining a good balance between data safety and space efficiency. These storage systems are complex and self-correcting, but despite access to a myriad of metrics, the inner workings of the storage tend to be opaque to the storage admin. One of the common problems seen within Ceph-based systems is slow ops: operations that take longer than expected and are often blocking in nature, impacting the overall performance and reliability of the system. Identifying the causes of slow ops can help to prevent or reduce the impact of future occurrences, leading to an increase in performance and reliability. We detail the efforts of the Lancaster Grid Site to understand the causes of and mitigate against these slow ops and other performance bottlenecks within our storage system. We endeavour to bring together a holistic monitoring model, utilising Ceph metrics, detailed XRootD monitoring streams, and client-side logging, in order to understand how data-management events impact the health of the storage.
A Blueprint for a Contemporary Storage Element, building a new WLCG storage system with widely available hardware and software components: Ceph, XRootD, and Prometheus
When a new long-term storage facility was needed at the Lancaster WLCG Tier-2 Site, an architecture was chosen involving CephFS as a failure-tolerant back-end volume, and load-balanced XRootD as an endpoint exposing the volume via the HTTPS/DAVS protocols increasingly favoured by the WLCG and other users. This allows operations to continue in the face of disc/node failures with minimal management, and enables good utilization of network connectivity for remote access. We deployed a Prometheus/Loki/Grafana monitoring/alerting stack for timely detection and resolution of failures in such a production environment. Some custom scripts were required to integrate the off-the-shelf functional components with the monitoring. With such a monitoring system in place, failures such as disc defects, data corruption and resource exhaustion in long-running processes can be anticipated, and their management planned. We describe the hardware platform and our requirements on it, and detail the software architecture from initial design, through adaptations to face challenges encountered during production, to its present condition. Developments and contributions to related projects that help to fully exploit our design decisions are described. We include performance metrics of the system, the lessons learned during production, and our future plans.
Preliminary findings and recommendations from the Token Trust and Traceability Working Group
The Token Trust and Traceability Working Group (TTT) was formed in 2023 to answer questions of policy and best practice arising from the ongoing move from X.509 and VOMS proxy certificates to token-based solutions as the primary authorisation and authentication method in distributed computing environments. With a remit to act in an investigatory and advisory capacity alongside other working groups in the token space, the TTT is composed of a broad variety of stakeholders to provide a breadth of experience and viewpoints. While the requirements of grid sites, users, identity providers and virtual organisations to trace workflows remain largely the same in a token paradigm as in one using X.509 certificates, tokens present a new set of challenges, requiring a rethink and restructure of the policies and processes that were defined with just X.509 and VOMS in mind, in order to meet these requirements in the new context. After providing an overview of the current status of the token trust landscape, we will detail the initial findings, future plans and recommendations to be made by the TTT. This will include best practice for sites and identity providers, suggestions for token development, and methodologies for tracing token usage by system administrators within common grid middleware stacks.
Overview of the distributed image processing infrastructure to produce the Legacy Survey of Space and Time
The Vera C. Rubin Observatory is preparing to execute the most ambitious astronomical survey ever attempted, the Legacy Survey of Space and Time (LSST). Currently the final phase of construction is under way in the Chilean Andes, with the Observatory’s ten-year science mission scheduled to begin in 2025. Rubin’s 8.4-meter telescope will nightly scan the southern hemisphere collecting imagery in the wavelength range 320–1050 nm covering the entire observable sky every 4 nights using a 3.2 gigapixel camera, the largest imaging device ever built for astronomy. Automated detection and classification of celestial objects will be performed by sophisticated algorithms on high-resolution images to progressively produce an astronomical catalog eventually composed of 20 billion galaxies and 17 billion stars and their associated physical properties. In this article we present an overview of the system currently being constructed to perform data distribution as well as the annual campaigns which reprocess the entire image dataset collected since the beginning of the survey. These processing campaigns will utilize computing and storage resources provided by three Rubin data facilities (one in the US and two in Europe). Each year a Data Release will be produced and disseminated to science collaborations for use in studies comprising four main science pillars: probing dark matter and dark energy, taking inventory of solar system objects, exploring the transient optical sky and mapping the Milky Way. Also presented is the method by which we leverage some of the common tools and best practices used for management of large-scale distributed data processing projects in the high energy physics and astronomy communities. We also demonstrate how these tools and practices are utilized within the Rubin project in order to overcome the specific challenges faced by the Observatory.