Poster Abstract
P10.6 Maohai Huang (National Astr. Obs., CAS)
Theme: Data processing pipelines

Self-describing Portable Dataset Container
With SPDC one can package data sets into modular Data Products carrying annotations (descriptions and units) and metadata. By combining arrays or tables of Products one can define highly complex structures. The access APIs of SPDC components are convenient, making scripting and data mining easier.
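The packaging idea above can be sketched in a few lines of Python. The class and attribute names here (`Product`, `ArrayDataset`, `meta`) are illustrative assumptions, not the actual SPDC API:

```python
# Minimal sketch of packaging data into an annotated Product.
# All names here are hypothetical, chosen to mirror the description.

class ArrayDataset:
    """An annotated array: data plus a description and a unit."""
    def __init__(self, data, description="", unit=""):
        self.data = data
        self.description = description
        self.unit = unit

class Product:
    """A modular Data Product: named datasets plus metadata."""
    def __init__(self, description=""):
        self.meta = {"description": description}
        self.datasets = {}
    def __setitem__(self, name, dataset):
        self.datasets[name] = dataset
    def __getitem__(self, name):
        return self.datasets[name]

p = Product(description="Raw detector frame")
p.meta["creator"] = "pipeline-v1"
p["image"] = ArrayDataset([[1, 2], [3, 4]], description="counts", unit="adu")
print(p["image"].unit)  # component access is convenient for scripting
```

Nesting Products inside the datasets of other Products is what allows arbitrarily complex structures to be composed.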
All levels of SPDC Products and their components (datasets or metadata) are portable (serializable) in a human-friendly standard format, allowing machine data processors on different platforms to reconstruct or parse an SPDC. Even a human with a web browser can understand the data.
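As a sketch of that portability, a component can round-trip through JSON, a human-readable standard format that any platform can parse; the keys below are illustrative, not SPDC's actual wire format:

```python
import json

# Sketch: serialize an annotated dataset to human-readable JSON and
# back. The "_type" tag lets a reader re-construct the right class.
dataset = {
    "_type": "ArrayDataset",
    "description": "counts",
    "unit": "adu",
    "data": [[1, 2], [3, 4]],
}

text = json.dumps(dataset, indent=2)   # human- and browser-friendly
restored = json.loads(text)            # any platform can re-parse it
assert restored == dataset             # lossless round trip
```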
Most SPDC Products and components implement event-sender and event-listener interfaces, allowing scalable, data-driven processing pipelines to be constructed.
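The sender/listener pattern behind such data-driven pipelines can be sketched as follows; the interface names are hypothetical:

```python
# Sketch of the event sender/listener idea: downstream pipeline
# stages register as listeners and react when data changes.

class EventSender:
    def __init__(self):
        self._listeners = []
    def add_listener(self, fn):
        self._listeners.append(fn)
    def fire(self, event):
        for fn in self._listeners:
            fn(event)

sender = EventSender()
log = []
sender.add_listener(lambda ev: log.append(ev))   # a downstream stage
sender.fire({"type": "DATA_CHANGED", "source": "image"})
```

Because stages only see events, new processing steps can be attached without modifying the producers, which is what makes the pipeline scalable.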
Reference SPDC storage pools (file-based, and a partially implemented memory-based pool) are provided for data storage, so that all persistent data can be referenced with URNs (Uniform Resource Names).
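A memory-based pool keyed by URNs can be sketched as below; the URN layout `urn:<pool>:<type>:<serial>` is an assumption for illustration, not SPDC's exact scheme:

```python
# Sketch of a memory-based storage pool: every saved product gets a
# URN, and the URN alone is enough to retrieve the product later.

class MemPool:
    def __init__(self, name):
        self.name = name
        self._store = {}
        self._serial = 0
    def save(self, product):
        urn = f"urn:{self.name}:{type(product).__name__}:{self._serial}"
        self._store[urn] = product
        self._serial += 1
        return urn
    def load(self, urn):
        return self._store[urn]

pool = MemPool("poolA")
urn = pool.save({"description": "calibrated frame"})
assert pool.load(urn)["description"] == "calibrated frame"
```

A file-based pool would have the same `save`/`load` surface but persist serialized products on disk instead of in a dict.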
A ‘Context’ type of SPDC is provided so that references to SPDCs can become components, enabling SPDCs to encapsulate rich, deep, sophisticated, and accessible contextual information while remaining lightweight.
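The Context idea can be sketched as a product whose components are URN references rather than the data itself, resolved against a pool only when needed; all names here are hypothetical:

```python
# Sketch: a lightweight Context carries URN strings, not data, and
# resolves a reference through a storage pool on demand.

class Context:
    def __init__(self):
        self.refs = {}                 # name -> URN string, cheap to carry
    def add_ref(self, name, urn):
        self.refs[name] = urn
    def resolve(self, name, pool):
        return pool.load(self.refs[name])   # fetch only when needed

# A stand-in pool backed by a plain dict, for demonstration.
store = {"urn:poolA:Product:0": {"description": "dark frame"}}

class Pool:
    def load(self, urn):
        return store[urn]

ctx = Context()
ctx.add_ref("dark", "urn:poolA:Product:0")
dark = ctx.resolve("dark", Pool())
```

Because a Context holds only strings until resolution, deeply nested contextual information stays cheap to serialize and pass around.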
On the data-processor end, an HTTP server with RESTful APIs is provided to exchange SPDC data, especially suitable for Docker containers running Linux. This insulates processing-task software with incompatible dependencies from one another when constructing a pipeline. SPDC allows such processing tasks to run in the Processing Node Server’s memory space, in a daemon process, or at the OS level, whether in a Docker container or on an ordinary server, receiving input and delivering output through a ‘delivery man’ protocol.
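A RESTful exchange of serialized products can be sketched with Python's standard-library HTTP server; the route shape and payloads below are illustrative assumptions, not the Processing Node Server's actual API:

```python
# Sketch: serve a product as JSON over a RESTful route, then fetch
# it back as a client would. Route "/products/<id>" is hypothetical.

import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PRODUCTS = {"0": {"description": "raw frame", "unit": "adu"}}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /products/0 returns that product serialized as JSON
        pid = self.path.rsplit("/", 1)[-1]
        body = json.dumps(PRODUCTS[pid]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):      # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)   # ephemeral port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/products/0"
data = json.loads(urllib.request.urlopen(url).read())
server.shutdown()
```

Because the exchange is plain HTTP plus a portable serialization, tasks with mutually incompatible dependencies can each live in their own container and still pass products to one another.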