Metadata of datasets

Related Issues	Document Status
#18	In Progress

Section authors:

Philipp S. Sommer, Marie Ryan, Linda Baldewein, Hatef Takyar, Andrea Pörsch, Beate Geyer, Lars Buntemeyer, Emanuel Söding, Nikolaus Groll, Ludwid Lierhammer, Klaus Getzlaff, Tilman Dinter

A central aspect of the model data explorer is the metadata of the datasets. We do not want to mimic a metadata portal such as GeoNetwork, but we nevertheless need information to

let users know what they are looking at
link datasets to groups
link datasets to people

Each author and group has a dedicated site where all the related datasets are listed. So we need to find a way to uniquely identify authors and associate the datasets with them.

We will implement a manual way where you can select the authors and edit the metadata through a web interface, but it should also be possible to interpret metadata standards automatically.

In the following sections, we describe which metadata we implement in the model data explorer, and how.

Note

This document only covers the global metadata of a dataset. Variable-related metadata (units, standard name, etc.) shall be handled in a different document.

Todo

Document variable related metadata

Required metadata

Datasets in the model data explorer must define the following metadata attributes:

title: A one-line description of the dataset

Optional metadata

Optional but recommended metadata attributes are:

contacts: A list of authors who have a role related to the dataset, for example because they participated in its generation or are responsible for providing the data
institutions: The institutions that are responsible for the dataset
projects: Related projects that provided funding for the generation of the dataset
bbox: The bounding box of the geographic region of the data
abstract: A short description of the dataset
data_relations: DataCite relation types (see Relations between Users, Groups and datasets and https://support.datacite.org/docs/relationtype_for_citation)
temporal_extent: The temporal window that is covered by the dataset
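
The split between required and optional attributes can be sketched as a plain Python container. This is a minimal illustration, not the data model of the model data explorer: the class name `DatasetMetadata`, the field types, the assumed bounding box order and the `validate` helper are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DatasetMetadata:
    """Hypothetical container for the global metadata of one dataset."""

    # required attribute
    title: str
    # optional but recommended attributes
    abstract: Optional[str] = None
    contacts: List[str] = field(default_factory=list)
    institutions: List[str] = field(default_factory=list)
    projects: List[str] = field(default_factory=list)
    data_relations: List[str] = field(default_factory=list)
    # assumed order: (west, south, east, north) in degrees
    bbox: Optional[Tuple[float, float, float, float]] = None
    # assumed representation: (start, end) as ISO 8601 strings
    temporal_extent: Optional[Tuple[str, str]] = None

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the metadata is usable."""
        problems = []
        if not self.title.strip():
            problems.append("title is required")
        elif "\n" in self.title.strip():
            problems.append("title must be a single line")
        return problems
```

Only the title is checked here because it is the only required attribute; everything else may stay empty.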

Todo

document geographic and temporal resolution as well?

Todo

Add descriptive spatial extent (such as global, continental, etc.)

Todo

add creation, publication and revision date

Interpretation of standards

The items mentioned in the previous sections are encoded in the metadata standards that we support, namely the netCDF header and the INSPIRE ISO standard. Our aim is to develop a reader for each standard that transforms the corresponding conventions into the metadata scheme of the model data explorer (see next section, Implementation details).

The exact database structure that allows this interpretation is, however, part of a different user story, namely #20.

CF-Conventions

For netCDF headers (and NcML, a special markup language used by THREDDS), we want to develop guidelines based on the Binding Regulations for Storing Data as netCDF Files. For this purpose, we will transform the regulations into a web-based format and enhance them with templates to make them easier to apply.

The guidelines are based on the CF-Conventions and extended by further attributes that are mainly motivated by the Conversion methodology to INSPIRE developed at Geomar.

Todo

The UnidataDD2MI.xsl methodology needs to be elaborated further.

INSPIRE ISO

ISO-conformant XML files will be read using the OWSLib Python library. We will base the format on the UnidataDD2MI.xsl file that has been developed by Franziska Weng (Geomar) and Andrea Pörsch (GFZ) and is currently still work in progress.

Implementation

Note

Graph

Figure: Object attribute vs. object relations

Django source

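from django.db import models


class Author(models.Model):
    """A person that can be linked to datasets."""

    name = models.CharField(max_length=100)
    email = models.EmailField(max_length=100)


class Dataset(models.Model):
    """A dataset with attributes (title, abstract) and a relation (contacts)."""

    # attributes: plain properties stored on the dataset itself
    title = models.CharField(max_length=50)
    abstract = models.CharField(max_length=50, null=True, blank=True)
    # relation: a connection to Author objects
    contacts = models.ManyToManyField(Author)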

Metadata items described above are represented in the model data explorer as properties of Django objects that in turn translate into connections and attributes in a relational database. But for this document we will keep it simple and distinguish two metadata types: attributes and relations.

Attributes are simple string properties of a dataset, a title for instance. Relations describe how the dataset is connected to other items in the database. A dataset won’t have an authors string property, for instance, but it will define a connection to author objects, where each author holds its own attributes (a name and an email, for instance).

An example is shown in the figure Object attribute vs. object relations.

Authors and Contact Persons 

Todo

Add OrcID

Description

Authors can have a dedicated role when they are related to a dataset. These roles describe how the authors have been involved in the generation of the dataset (motivated by the available roles for the CI_RoleCode tag in INSPIRE, see the Roles tab).

In the metadata display on the frontend, we will then group the contributors by their role so that it is clearly visible who the responsible contact person is.
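
The grouping for the frontend could look roughly like the following. This is a sketch under assumptions: contacts are passed as plain `(name, role)` pairs, and the function name `group_contacts_by_role` and the example names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_contacts_by_role(
    contacts: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Group (author name, role) pairs by role, keeping the input order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for name, role in contacts:
        grouped[role].append(name)
    return dict(grouped)


# hypothetical contacts of one dataset
contacts = [
    ("A. Example", "pointOfContact"),
    ("B. Example", "author"),
    ("C. Example", "author"),
]
grouped = group_contacts_by_role(contacts)
```

A template can then render one block per role, with the `pointOfContact` group shown first.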

Graph

Model graph (generated with Django-Extensions graph_models): DatasetContact links Author (id, email, name) and Dataset (id, title) through the foreign keys author and dataset, and stores the role as a CharField.

Django source

from django.db import models


class Author(models.Model):

    name = models.CharField(max_length=100)
    email = models.EmailField(max_length=100)


class DatasetContact(models.Model):
    """Intermediate table that stores the role of an author in a dataset."""

    author = models.ForeignKey(Author, on_delete=models.CASCADE)
    dataset = models.ForeignKey("Dataset", on_delete=models.CASCADE)
    # one of the role codes listed in the Roles tab
    role = models.CharField(max_length=30)


class Dataset(models.Model):

    title = models.CharField(max_length=50)
    # many-to-many relation that carries the role via DatasetContact
    contacts = models.ManyToManyField(Author, through=DatasetContact)

Roles

Roles are taken from the INSPIRE ISO standard, https://inspire.ec.europa.eu/metadata-codelist/ResponsiblePartyRole. The role of an Author in a Dataset can be one of the following:

Code	Description
`resourceProvider`	Party that supplies the resource.
`custodian`	Party that accepts accountability and responsibility for the data and ensures appropriate care and maintenance of the resource.
`owner`	Party that owns the resource.
`user`	Party who uses the resource.
`distributor`	Party who distributes the resource.
`originator`	Party who created the resource.
`pointOfContact`	Party who can be contacted for acquiring knowledge about or acquisition of the resource.
`principalInvestigator`	Key party responsible for gathering information and conducting research.
`processor`	Party who has processed the data in a manner such that the resource has been modified.
`publisher`	Party who published the resource.
`author`	Party who authored the resource.
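
As an illustration, the codes from the table can be collected in an enumeration so that role strings read from metadata are validated before they are stored. The names `ContactRole` and `parse_role` are assumptions for this sketch; in a Django model the same values could be passed as `choices` of the role field.

```python
from enum import Enum


class ContactRole(str, Enum):
    """Role codes from the table above."""

    RESOURCE_PROVIDER = "resourceProvider"
    CUSTODIAN = "custodian"
    OWNER = "owner"
    USER = "user"
    DISTRIBUTOR = "distributor"
    ORIGINATOR = "originator"
    POINT_OF_CONTACT = "pointOfContact"
    PRINCIPAL_INVESTIGATOR = "principalInvestigator"
    PROCESSOR = "processor"
    PUBLISHER = "publisher"
    AUTHOR = "author"


def parse_role(code: str) -> ContactRole:
    """Return the role for a code, or raise ValueError for unknown codes."""
    try:
        return ContactRole(code)
    except ValueError:
        raise ValueError(f"unknown role code: {code!r}") from None
```

All codes are shorter than 30 characters, so they fit into a role field with `max_length=30`.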

Interpretation of Authors and Contact Persons

CF-Conventions

Although the CF-Conventions define an originator attribute, the information is rather limited. Therefore we aim to follow the suggestions by the UnidataDD2MI.xsl file of Franziska Weng (see INSPIRE ISO), and introduce further attributes such as creator_email, originator_email, contact_email, pi_email, contributor_role, etc.
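
A reader for these attributes might be sketched as follows. The global attributes are assumed to be available as a plain dictionary, and the mapping from attribute name to role code is an assumption of this example, not something the guidelines define; `contacts_from_attributes` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Assumed mapping from attribute name to role code (not taken from the guidelines)
EMAIL_ATTRIBUTE_ROLES = {
    "creator_email": "author",
    "originator_email": "originator",
    "contact_email": "pointOfContact",
    "pi_email": "principalInvestigator",
}


def contacts_from_attributes(attrs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (email, role) pairs found in the global attributes."""
    contacts = []
    for attr_name, role in EMAIL_ATTRIBUTE_ROLES.items():
        value = attrs.get(attr_name, "")
        # assumption: several addresses may be separated by commas or semicolons
        for email in value.replace(";", ",").split(","):
            email = email.strip()
            if email:
                contacts.append((email, role))
    return contacts


# hypothetical global attributes of a netCDF file
attrs = {
    "title": "Example dataset",
    "pi_email": "pi@example.org",
    "contact_email": "a@example.org; b@example.org",
}
found = contacts_from_attributes(attrs)
```

The email addresses can then be used to look up existing authors in the database and to suggest them to the data submitter.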

INSPIRE

The implementation is straightforward: authors and their roles will be taken from the CI_ResponsibleParty tags.
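
To illustrate which parts of a CI_ResponsibleParty tag are relevant, the following sketch extracts name, organisation, email and role. It uses `xml.etree.ElementTree` from the standard library instead of OWSLib to stay self-contained; the function name `responsible_parties` and the sample record are made up for this example.

```python
import xml.etree.ElementTree as ET

# namespaces of the ISO 19139 XML encoding
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def _text(node, path):
    """Return the stripped text of the first match, or None."""
    found = node.find(path, NS)
    return found.text.strip() if found is not None and found.text else None


def responsible_parties(xml_text):
    """Return one dictionary per CI_ResponsibleParty in the document."""
    root = ET.fromstring(xml_text.strip())
    parties = []
    for party in root.iter("{%s}CI_ResponsibleParty" % NS["gmd"]):
        role = party.find("gmd:role/gmd:CI_RoleCode", NS)
        parties.append({
            "name": _text(party, "gmd:individualName/gco:CharacterString"),
            "organisation": _text(party, "gmd:organisationName/gco:CharacterString"),
            "email": _text(party, ".//gmd:electronicMailAddress/gco:CharacterString"),
            "role": role.get("codeListValue") if role is not None else None,
        })
    return parties


# made-up, shortened sample record
SAMPLE = """
<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:contact>
    <gmd:CI_ResponsibleParty>
      <gmd:individualName><gco:CharacterString>A. Example</gco:CharacterString></gmd:individualName>
      <gmd:organisationName><gco:CharacterString>Example Institute</gco:CharacterString></gmd:organisationName>
      <gmd:role><gmd:CI_RoleCode codeListValue="pointOfContact"/></gmd:role>
    </gmd:CI_ResponsibleParty>
  </gmd:contact>
</gmd:MD_Metadata>
"""
parties = responsible_parties(SAMPLE)
```

The `organisation` entry is the organisationName that is also used to identify institutions.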

Projects 

Description

Projects are data groups within the Model Data Explorer Framework (see Data Group). As such, a project is also a relation between two objects in the database (see the Graph tab).

This relation can also be equipped with permissions, namely can_edit, can_view and can_list (see Datasets and data groups). These permissions need to be approved by both sides, the data group (project) owner and the dataset.

A relation can also be made visible or invisible, which will determine whether the group is listed explicitly on the detail page of the dataset or not.
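
The two-sided approval can be illustrated without a database. The sketch below uses plain dataclasses; the `granted` property, the `Relation` class and the `granted_permissions` helper are assumptions for this example, as is the question of which side counts as left and which as right.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RelationPermission:
    name: str  # can_edit, can_view or can_list
    left_approved: bool = False
    right_approved: bool = False

    @property
    def granted(self) -> bool:
        """A permission only takes effect once both sides approved it."""
        return self.left_approved and self.right_approved


@dataclass
class Relation:
    permissions: List[RelationPermission] = field(default_factory=list)
    visible: bool = True

    def granted_permissions(self) -> List[str]:
        """Names of the permissions that both sides approved."""
        return [p.name for p in self.permissions if p.granted]


relation = Relation(
    permissions=[
        RelationPermission("can_view", left_approved=True, right_approved=True),
        RelationPermission("can_edit", left_approved=True, right_approved=False),
    ]
)
```

Here only `can_view` is in effect, because `can_edit` still lacks the approval of one side.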

Note

A dataset can also be related to other types of data groups, such as institutions, using the same methodology. We will distinguish projects from platforms, etc. based on the DS_InitiativeTypeCode identifier (see Data Group and #20).

Graph

Django source

from django.db import models


class DataGroup(models.Model):
    """A group of datasets, e.g. a project or an institution."""

    name = models.CharField(max_length=100)


class RelationPermission(models.Model):
    """A permission that both sides of a relation have to approve."""

    name = models.CharField(max_length=20)
    left_approved = models.BooleanField(default=False)
    right_approved = models.BooleanField(default=False)


class DatasetDataGroupRelation(models.Model):
    """Intermediate table between a dataset and a data group."""

    data_group = models.ForeignKey(DataGroup, on_delete=models.CASCADE)
    dataset = models.ForeignKey("Dataset", on_delete=models.CASCADE)
    permissions = models.ManyToManyField(RelationPermission)
    # whether the group is listed on the detail page of the dataset
    visible = models.BooleanField(default=True)


class Dataset(models.Model):

    title = models.CharField(max_length=50)
    data_groups = models.ManyToManyField(
        DataGroup, through=DatasetDataGroupRelation
    )

Interpretation of Projects

CF-Conventions

netCDF files can define a project, program, projects or project_name attribute. We will then search for matching names in the data groups whose DS_InitiativeTypeCode is project and suggest them to the data submitter. This will also be documented in the netCDF guidelines (see CF-Conventions).
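
The name matching could be sketched like this. The example assumes that the project names of the data groups are available as a list of strings, that multiple names in one attribute are comma-separated, and that a case-insensitive exact match is sufficient; `suggest_projects` and the project names are hypothetical.

```python
from typing import Dict, Iterable, List

# attribute names listed in the text above
PROJECT_ATTRIBUTES = ("project", "program", "projects", "project_name")


def suggest_projects(attrs: Dict[str, str], known_projects: Iterable[str]) -> List[str]:
    """Return the known project names that occur in the global attributes."""
    known = {name.casefold(): name for name in known_projects}
    suggestions: List[str] = []
    for attr_name in PROJECT_ATTRIBUTES:
        value = attrs.get(attr_name, "")
        # assumption: several project names are separated by commas
        for candidate in value.split(","):
            match = known.get(candidate.strip().casefold())
            if match and match not in suggestions:
                suggestions.append(match)
    return suggestions


suggestions = suggest_projects(
    {"project": "coastal futures, Some Unknown Project"},
    ["Coastal Futures", "Another Project"],
)
```

Names without a match are simply dropped, so the data submitter only sees data groups that already exist.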

INSPIRE

We will look for MD_AggregateInformation entries that define a DS_AssociationTypeCode of largerWorkCitation and match the MD_Identifier against the available data groups.

Institutions 

Todo

Add ROR ID

Internally, institutions are handled in the same way as projects, as both are represented as data groups in the model data explorer. Only the interpretation of the metadata standards differs.

Interpretation of Institutions

CF-Conventions

netCDF files can define an institution or creator_institution attribute, together with a corresponding institution_references attribute. They will then be matched against available names of institutions in the database to make suggestions to the data submitter.

INSPIRE

Institutions will be identified from the organisationName in a CI_ResponsibleParty (see Authors and Contact Persons above).

Other relations 

Description

Other relations are references to internal or external resources, such as related studies or datasets. They are commonly described by DataCite related identifiers, see https://support.datacite.org/docs/relationtype_for_citation.

Neither the CF-Conventions nor INSPIRE define such a relation type, but both make it possible to add supplementary studies (see below), and we will simply add this information as a DatasetReference object (see the Graph tab).

If the URI corresponds to a handle in the model data explorer, however, we can also directly transfer this into a relation between datasets (see Graph tab) and suggest that Dataset A is a supplement to Dataset B (see Datasets).

Graph

Model graph (generated with Django-Extensions graph_models): DatasetReference (id, dataset, description, uri) points to one Dataset (id, title) through the foreign key dataset. DatasetRelation (id, left, right, relation_type) connects two Dataset objects through the foreign keys left and right.

Django source

from django.db import models


class Dataset(models.Model):

    title = models.CharField(max_length=50)


class DatasetReference(models.Model):
    """Reference from a dataset to an internal or external resource."""

    dataset = models.ForeignKey(Dataset, on_delete=models.CASCADE)
    description = models.CharField(max_length=400)
    uri = models.URLField(max_length=300)


class DatasetRelation(models.Model):
    """Typed relation between two datasets in the model data explorer."""

    left = models.ForeignKey(Dataset, on_delete=models.CASCADE, related_name="left_relation")
    right = models.ForeignKey(Dataset, on_delete=models.CASCADE, related_name="right_relation")
    relation_type = models.CharField(max_length=30)

Interpretation of relations

CF-Conventions

netCDF files can define global references and doi attributes. We will check these attributes for common DOI patterns and use the matches to extract the uri for the DatasetReference (see the Graph tab above).
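
A DOI check of this kind could be sketched with a regular expression. The pattern below is a commonly used approximation and will not catch every valid DOI; the function name `doi_urls` and the example DOI are made up for illustration.

```python
import re
from typing import List

# "10.", a registrant code of 4 to 9 digits, a slash and a suffix
DOI_PATTERN = re.compile(r'\b(10\.\d{4,9}/[^\s"<>]+)')


def doi_urls(text: str) -> List[str]:
    """Return the https://doi.org URLs for all DOIs found in the text."""
    urls: List[str] = []
    for match in DOI_PATTERN.finditer(text):
        # drop punctuation that usually belongs to the surrounding sentence
        doi = match.group(1).rstrip(".,;)")
        url = "https://doi.org/" + doi
        if url not in urls:
            urls.append(url)
    return urls


urls = doi_urls("See doi:10.1234/abcd.5678, and https://doi.org/10.1234/abcd.5678.")
```

Both spellings in the example resolve to the same URL, so the DatasetReference is only suggested once.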

INSPIRE

INSPIRE encodes references as MD_AggregateInformation with a specific DS_AssociationTypeCode, namely crossReference, so we will use the MD_Identifier of these tags. If the MD_Identifier is listed as gmd:code, we will assume it is a DOI and transform it into the corresponding URL; otherwise we take it as the description of the DatasetReference and try to find a URL in it.
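
The lookup of MD_AggregateInformation entries by association type might look like the following sketch. It relies on `xml.etree.ElementTree` from the standard library rather than OWSLib so that it runs on its own; the function name `aggregate_identifiers` and the sample record are made up for this example. The same function with `largerWorkCitation` would serve the project matching described under Projects.

```python
import xml.etree.ElementTree as ET
from typing import List

# namespaces of the ISO 19139 XML encoding
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def aggregate_identifiers(xml_text: str, association_type: str) -> List[str]:
    """Return the gmd:code values of all entries with the given association type."""
    root = ET.fromstring(xml_text.strip())
    codes = []
    for info in root.iter("{%s}MD_AggregateInformation" % NS["gmd"]):
        assoc = info.find("gmd:associationType/gmd:DS_AssociationTypeCode", NS)
        if assoc is None or assoc.get("codeListValue") != association_type:
            continue
        code = info.find(".//gmd:MD_Identifier/gmd:code/gco:CharacterString", NS)
        if code is not None and code.text:
            codes.append(code.text.strip())
    return codes


# made-up, shortened sample record
SAMPLE = """
<gmd:MD_DataIdentification xmlns:gmd="http://www.isotc211.org/2005/gmd"
                           xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:aggregationInfo>
    <gmd:MD_AggregateInformation>
      <gmd:aggregateDataSetIdentifier>
        <gmd:MD_Identifier>
          <gmd:code><gco:CharacterString>10.1234/abcd</gco:CharacterString></gmd:code>
        </gmd:MD_Identifier>
      </gmd:aggregateDataSetIdentifier>
      <gmd:associationType>
        <gmd:DS_AssociationTypeCode codeListValue="crossReference"/>
      </gmd:associationType>
    </gmd:MD_AggregateInformation>
  </gmd:aggregationInfo>
  <gmd:aggregationInfo>
    <gmd:MD_AggregateInformation>
      <gmd:aggregateDataSetIdentifier>
        <gmd:MD_Identifier>
          <gmd:code><gco:CharacterString>example-project</gco:CharacterString></gmd:code>
        </gmd:MD_Identifier>
      </gmd:aggregateDataSetIdentifier>
      <gmd:associationType>
        <gmd:DS_AssociationTypeCode codeListValue="largerWorkCitation"/>
      </gmd:associationType>
    </gmd:MD_AggregateInformation>
  </gmd:aggregationInfo>
</gmd:MD_DataIdentification>
"""
references = aggregate_identifiers(SAMPLE, "crossReference")
```

Each returned code would then be turned into a DOI URL for the uri of a DatasetReference.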
