P10.43 Martin Melchior (Institute for Data Science)
Theme: Data processing pipelines
Scheduling the Euclid Pipeline in the Ground Segment Processing Infrastructure
The science products of the Euclid mission will be created by running a complex, multi-stage pipeline at processing sites spread across Europe and the US. To deploy centrally specified pipeline processing requests to the distributed HPC infrastructures, a science-agnostic middleware component, the Infrastructure Abstraction Layer, is used at the different data centers. It decomposes functionally into two parts: the first part is responsible for communication with the Euclid Archive, fetching and updating processing requests and transferring input and output data between the archive and a staging area visible to the processing nodes; the second part manages the pipeline workflow at the data center level and abstracts over the specific system setups at the different processing sites. The pipeline workflow is specified with a Python-based dataflow framework that allows processing steps to be chained and parallelized. The processing steps themselves are implemented as executables with a generic signature, so that they can be translated into processing jobs at runtime. With the help of suitable agents started on the processing nodes, pipeline jobs can be executed without passing through the queuing system, which significantly improves the efficiency and flexibility of job submission. The Infrastructure Abstraction Layer has gone through several software development cycles. It has been successfully tested in various data processing challenges performed in the Euclid ground segment, and its flexible design has proven well suited to the various development, test, and production use cases that the distributed ground segment architecture needs to cope with.
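
To make the dataflow idea concrete, the following is a minimal sketch of how chained and parallelized processing steps could be expressed and scheduled in Python. The Step class, the execute stub, and the run_pipeline loop are illustrative assumptions, not the actual API of the Euclid framework: steps declare named input and output products, and all steps whose inputs are available are dispatched together.

    from concurrent.futures import ThreadPoolExecutor

    class Step:
        """A processing step: an executable with named input/output products."""
        def __init__(self, name, inputs, outputs):
            self.name, self.inputs, self.outputs = name, inputs, outputs

    # 'shear_measurement' and 'photo_z' both depend only on 'catalog', so
    # they can run in parallel once 'source_detection' has finished.
    steps = [
        Step("source_detection", inputs=["exposure"], outputs=["catalog"]),
        Step("shear_measurement", inputs=["catalog"], outputs=["shear"]),
        Step("photo_z", inputs=["catalog"], outputs=["photoz"]),
    ]

    def execute(step):
        # a real framework would translate the step into a processing job
        # here; this sketch only reports the invocation
        print("running", step.name)
        return step

    def run_pipeline(steps, available):
        """Simple dataflow loop: run steps as their inputs become available."""
        pending = list(steps)
        with ThreadPoolExecutor() as pool:
            while pending:
                ready = [s for s in pending if set(s.inputs) <= available]
                if not ready:
                    raise RuntimeError("unsatisfiable inputs for remaining steps")
                # all ready steps are submitted together, i.e. in parallel
                for done in pool.map(execute, ready):
                    available |= set(done.outputs)
                pending = [s for s in pending if s not in ready]

    run_pipeline(steps, available={"exposure"})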
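The generic executable signature means a step can be turned into a concrete job by purely mechanical command-line construction. A small sketch under assumed flag names (--workdir, --input, --output; the abstract does not give the real convention):

    def make_job(executable, workdir, inputs, outputs):
        """Translate a processing step into a concrete command line. Because
        every step obeys the same signature, this translation is generic and
        can happen at runtime. Flag names are assumptions for illustration."""
        cmd = [executable, "--workdir", workdir]
        for name in inputs:
            cmd += ["--input", name]
        for name in outputs:
            cmd += ["--output", name]
        return cmd

    cmd = make_job("shear_measurement", "/staging/run_42",
                   inputs=["catalog.xml"], outputs=["shear.xml"])
    print(" ".join(cmd))
    # a worker on a processing node would then run it,
    # e.g. subprocess.run(cmd, check=True)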
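The agent mechanism resembles a pilot-job pattern: the agent itself is started on a processing node (for example once, through the batch system), then pulls pipeline jobs from a shared queue and executes them directly, so individual jobs never wait in the queuing system. Below is a self-contained sketch in which threads stand in for processing nodes and an in-process queue stands in for the data-center-wide job queue; both are assumptions for illustration.

    import queue
    import subprocess
    import sys
    import threading

    job_queue = queue.Queue()  # stands in for a data-center-wide job queue

    def agent(node):
        """Runs on a processing node; executes queued jobs until none remain."""
        while True:
            try:
                cmd = job_queue.get_nowait()
            except queue.Empty:
                return
            # direct execution on the node: no per-job batch submission
            subprocess.run(cmd, check=True)
            print(node, "finished:", " ".join(cmd))

    # enqueue a few portable dummy jobs (real jobs would be pipeline steps)
    for i in range(4):
        job_queue.put([sys.executable, "-c", f"print('pipeline job {i}')"])

    agents = [threading.Thread(target=agent, args=(f"node{j}",)) for j in range(2)]
    for a in agents:
        a.start()
    for a in agents:
        a.join()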