Metadata of datasets
Related Issues | Document Status |
---|---|
 | In Progress |
A central aspect of the model data explorer is the metadata of the datasets. We do not want to mimic a metadata portal such as GeoNetwork; nevertheless, we need information to:
- let users know what they are looking at
- link datasets to groups
- link datasets to people
Each author and group has a dedicated site where all the related datasets are listed. So we need to find a way to uniquely identify authors and associate the datasets with them.
We will implement a manual way where you can select the authors and edit the metadata through a web interface, but it should also be possible to interpret metadata standards automatically.
In the following sections, we describe which metadata we implement in the model data explorer, and how.
Note
This document only covers global metadata of a dataset. Variable related metadata (units, standard name, etc.) shall be handled in a different document.
Todo
Document variable related metadata
Required metadata
Datasets in the model data explorer must define the following metadata attributes:
- title
A one-line description of the dataset
Optional metadata
Optional but recommended metadata attributes are:
- contacts
A list of authors that have some role related to the dataset. They participated in the generation, are responsible for providing the data, etc.
- institutions
The institutions that are responsible for the dataset
- projects
Related projects that provided funding for the generation of the dataset
- bbox
The bounding box of the geographic region of the data
- abstract
A short description of the dataset
- data_relations
DataCite relation types (see Relations between Users, Groups and datasets and https://support.datacite.org/docs/relationtype_for_citation)
- temporal_extent
The temporal window that is covered by the dataset
Todo
document geographic and temporal resolution as well?
Todo
Add descriptive spatial extent (such as global, continental, etc.)
Todo
add creation, publication and revision date
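The required and optional attributes listed above can be sketched as a plain Python record. This is only an illustration under stated assumptions, not the implementation of the model data explorer: the class name `DatasetMetadata`, its `validate` method and the field types are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetMetadata:
    """Hypothetical container for the global metadata of one dataset."""

    # required
    title: str
    # optional but recommended (field types are assumptions)
    contacts: list = field(default_factory=list)
    institutions: list = field(default_factory=list)
    projects: list = field(default_factory=list)
    bbox: Optional[dict] = None
    abstract: Optional[str] = None
    data_relations: list = field(default_factory=list)
    temporal_extent: Optional[tuple] = None

    def validate(self) -> list:
        """Return a list of problems; an empty list means the record is valid."""
        problems = []
        if not self.title or not self.title.strip():
            problems.append("title is required")
        elif "\n" in self.title.strip():
            problems.append("title must be a one-line description")
        return problems
```

Only the title is checked here because it is the only required attribute; all other fields may stay empty.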
Interpretation of standards
The items mentioned in the previous sections are encoded in the metadata standards that we support, namely the netCDF header and the INSPIRE ISO standard. Our aim is to develop readers for each standard that transform the corresponding conventions into the metadata scheme of the model data explorer (see next section, Implementation details).
The exact database structure that allows this interpretation is however part of a different user story, namely #20.
CF-Conventions
For netCDF headers (and NcML, a special markup language used by THREDDS), we want to develop guidelines based on the Binding Regulations for Storing Data as netCDF Files. For this purpose, we will transform the guidelines into a web-based format and enhance them with templates to make them easier to apply.
The guidelines are based on the CF-Conventions and extended by further attributes that are mainly motivated by the Conversion methodology to INSPIRE developed at Geomar.
Todo
The UnidataDD2MI.xsl methodology needs to be elaborated further.
INSPIRE ISO
ISO-conformant XML files will be read using the OWSLib Python library. We will base the format on the UnidataDD2MI.xsl file that has been developed by Franziska Weng (Geomar) and Andrea Pörsch (GFZ) (currently still work in progress).
Implementation
Note
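from django.db import models


class Author(models.Model):
    name = models.CharField(max_length=100)
    email = models.EmailField(max_length=100)


class Dataset(models.Model):
    # attributes: simple properties stored on the dataset itself
    title = models.CharField(max_length=50)
    abstract = models.CharField(max_length=50, null=True, blank=True)
    # relation: a connection to other objects in the database
    contacts = models.ManyToManyField(Author)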
Metadata items described above are represented in the model data explorer as properties of Django objects, which in turn translate into connections and attributes in a relational database. For this document, however, we will keep it simple and distinguish two metadata types: attributes and relations.
Attributes are simple string properties of a dataset, for instance a title. Relations describe how the dataset is connected to other items in the database. A dataset won’t have an authors string property, for instance, but it will define a connection to author objects, where each author holds, for example, a first name and a last name attribute.
An example is shown in the graph on the right, Object attribute vs. object relations.
Attributes
A Dataset defines the following simple attributes: title, abstract, bounding box (bbox) and temporal extent (start and end); see the graph about Attributes of a dataset.
from django.db import models


class Dataset(models.Model):
    title = models.CharField(max_length=50)
    abstract = models.TextField(max_length=10000, null=True, blank=True)
    bbox = models.JSONField(null=True, blank=True)
    # temporal extent as timestamps
    start = models.DateTimeField(null=True, blank=True)
    end = models.DateTimeField(null=True, blank=True)
    # plain-text temporal extent, e.g. for paleo dates
    start_s = models.CharField(max_length=50, null=True, blank=True)
    end_s = models.CharField(max_length=50, null=True, blank=True)
Title
The title is a short, human-readable string that describes the purpose of the dataset in one sentence.
In netCDF files, the CF-Conventions define a global title attribute, which we will use.
In ISO XML files, we use the <gmd:title> tag of the CI_Citation element.
Abstract
The abstract is a longer human-readable description of the dataset that describes the content, purpose and methodology in more detail.
In netCDF files, we will look for a global summary or abstract attribute.
In ISO XML files, we use the <gmd:abstract> tag.
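To illustrate where title and abstract sit in an ISO record, here is a minimal sketch. It uses the `xml.etree.ElementTree` module from the Python standard library instead of owslib so that it has no dependencies. The function name `read_title_and_abstract` is hypothetical, and the sketch assumes the ISO 19139 XML encoding with the usual `gmd` and `gco` namespaces.

```python
import xml.etree.ElementTree as ET

# Namespaces of the ISO 19139 XML encoding
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def read_title_and_abstract(xml_text):
    """Return (title, abstract) of an ISO record; missing values are None."""
    root = ET.fromstring(xml_text)
    # the title lives inside the CI_Citation element
    title = root.find(".//gmd:CI_Citation/gmd:title/gco:CharacterString", NS)
    abstract = root.find(".//gmd:abstract/gco:CharacterString", NS)
    return (
        title.text.strip() if title is not None and title.text else None,
        abstract.text.strip() if abstract is not None and abstract.text else None,
    )
```

A reader based on owslib would expose the same information through its own objects; the XPath expressions above only show which elements are involved.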
Bounding box
The bbox is a JSONField (or optionally we can also make it a georeferenced polygon) that defines the region where this dataset can be applied.
In netCDF files, we will look for the global geospatial_lon_min, geospatial_lat_min, geospatial_lon_max and geospatial_lat_max attributes, as well as a Bbox attribute.
In ISO XML files, we use the EX_GeographicBoundingBox element in the <gmd:geographicElement> tag, namely westBoundLongitude, eastBoundLongitude, southBoundLatitude and northBoundLatitude.
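The lookup of the four geospatial attributes could be sketched as follows. This is an assumption-laden illustration: the function name `bbox_from_attrs` and the keys of the returned dictionary (`west`, `south`, `east`, `north`) are hypothetical, and the `Bbox` attribute is not handled because its format is not specified here.

```python
def bbox_from_attrs(attrs):
    """Build a bbox dict from netCDF global attributes, or return None.

    `attrs` is a plain mapping of global attribute names to values,
    for example as read from a netCDF header.
    """
    keys = {
        "west": "geospatial_lon_min",
        "south": "geospatial_lat_min",
        "east": "geospatial_lon_max",
        "north": "geospatial_lat_max",
    }
    try:
        bbox = {name: float(attrs[key]) for name, key in keys.items()}
    except (KeyError, TypeError, ValueError):
        return None  # incomplete or non-numeric bounds
    # Only latitudes are checked; longitude conventions differ between
    # datasets (-180..180 or 0..360), so they are left untouched.
    if not (-90 <= bbox["south"] <= bbox["north"] <= 90):
        return None
    return bbox
```

The resulting dictionary is JSON-serializable, so it could be stored in a JSONField as described above.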
Temporal extent
The temporal extent consists of two DateTimeField attributes, start and end, that define a time window. We expect two ISO-formatted timestamps here, one for the start and one for the end of the coverage. This might not always be possible, as Python's datetime does not support paleo dates. So we will also add the attributes start_s and end_s that accept plain text.
In netCDF files, we will look for the global StartTime, StopTime, time_coverage_start and time_coverage_end attributes.
In ISO XML files, we use the EX_TemporalExtent element in the <gmd:temporalElement> tag, namely beginPosition and endPosition.
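The fallback from timestamps to plain text can be sketched with the standard library. The function names `parse_time` and `temporal_extent` are hypothetical. Two further assumptions are made here: time_coverage_start and time_coverage_end take precedence over StartTime and StopTime, and the original text is always kept in start_s and end_s, even when parsing succeeds.

```python
from datetime import datetime


def parse_time(value):
    """Return (datetime_or_None, text) for one end of the temporal extent."""
    text = str(value).strip()
    try:
        # older Python versions do not accept a trailing "Z" in fromisoformat
        return datetime.fromisoformat(text.replace("Z", "+00:00")), text
    except ValueError:
        # e.g. paleo dates that datetime cannot represent
        return None, text


def temporal_extent(attrs):
    """Map netCDF global attributes to start, end, start_s and end_s."""
    start_raw = attrs.get("time_coverage_start", attrs.get("StartTime"))
    end_raw = attrs.get("time_coverage_end", attrs.get("StopTime"))
    result = {"start": None, "end": None, "start_s": None, "end_s": None}
    if start_raw is not None:
        result["start"], result["start_s"] = parse_time(start_raw)
    if end_raw is not None:
        result["end"], result["end_s"] = parse_time(end_raw)
    return result
```

A value such as `21000 BP` cannot be parsed as an ISO timestamp, so it ends up only in the plain-text field.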
Projects
Projects are data groups within the Model Data Explorer Framework (see Data Group). As such, a project is also a relation between two objects in the database (see the Graph tab).
This relation can also be equipped with permissions, namely can_edit, can_view and can_list (see Datasets and data groups). These permissions need to be approved by both sides: the data group (project) owner and the dataset.
A relation can also be made visible or invisible, which will determine whether the group is listed explicitly on the detail page of the dataset or not.
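The two-sided approval can be illustrated with a plain Python stand-in for the permission object. This is a sketch only: the class `Permission`, the property `is_active` and the helper `active_permissions` are hypothetical, and which side counts as "left" or "right" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Permission:
    """Plain-Python stand-in for one permission on a dataset/group relation."""

    name: str  # "can_edit", "can_view" or "can_list"
    left_approved: bool = False  # assumed: approved by the data group owner
    right_approved: bool = False  # assumed: approved on the dataset side

    @property
    def is_active(self):
        # the permission only takes effect once both sides have approved it
        return self.left_approved and self.right_approved


def active_permissions(permissions):
    """Return the sorted names of all permissions approved by both sides."""
    return sorted(p.name for p in permissions if p.is_active)
```

A permission that only one side has approved stays pending and does not grant anything.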
Note
A dataset can also be related to other types of data groups, such as institutions, using the same methodology. We will distinguish projects from platforms, etc. based on the DS_InitiativeTypeCode identifier (see Data Group and #20).
from django.db import models


class DataGroup(models.Model):
    name = models.CharField(max_length=100)


class RelationPermission(models.Model):
    name = models.CharField(max_length=20)
    # a permission needs to be approved by both sides of the relation
    left_approved = models.BooleanField(default=False)
    right_approved = models.BooleanField(default=False)


class DatasetDataGroupRelation(models.Model):
    data_group = models.ForeignKey(DataGroup, on_delete=models.CASCADE)
    dataset = models.ForeignKey("Dataset", on_delete=models.CASCADE)
    permissions = models.ManyToManyField(RelationPermission)
    # whether the group is listed on the detail page of the dataset
    visible = models.BooleanField(default=True)


class Dataset(models.Model):
    title = models.CharField(max_length=50)
    data_groups = models.ManyToManyField(
        DataGroup, through=DatasetDataGroupRelation
    )
netCDF files can define a project, program, projects or project_name attribute. We will then search for matching names in the data groups whose DS_InitiativeTypeCode is project and suggest them to the data submitter. This will also be documented in the netCDF guidelines (see CF-Conventions).
In ISO XML files, we will look for MD_AggregateInformation entries that define a DS_AssociationTypeCode of largerWorkCitation and match the MD_Identifier against the available data groups.
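The name matching for netCDF files could look like the following sketch. It uses fuzzy matching from the standard library `difflib` module, which is one possible choice and not necessarily what will be implemented. The function name `suggest_projects`, the cutoff value and the comma-separated handling of several projects are assumptions.

```python
import difflib

PROJECT_ATTRS = ("project", "program", "projects", "project_name")


def suggest_projects(attrs, project_names, cutoff=0.8):
    """Suggest known project names that match the netCDF global attributes."""
    lookup = {name.lower(): name for name in project_names}
    suggestions = []
    for attr in PROJECT_ATTRS:
        value = attrs.get(attr)
        if not value:
            continue
        # assumption: several projects may be given as a comma-separated list
        for candidate in str(value).split(","):
            matches = difflib.get_close_matches(
                candidate.strip().lower(), list(lookup), n=3, cutoff=cutoff
            )
            for match in matches:
                if lookup[match] not in suggestions:
                    suggestions.append(lookup[match])
    return suggestions
```

The function only returns suggestions; the data submitter still confirms which data group the dataset belongs to.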
Institutions
Todo
Add ROR ID
Internally, institutions are handled the same way as projects, as both are represented as data groups in the model data explorer. Only the interpretation of the metadata standards differs.
netCDF files can define an institution or creator_institution attribute, together with a corresponding institution_references attribute. They will then be matched against available names of institutions in the database to make suggestions to the data submitter.
In ISO XML files, institutions will be identified from the organisationName in a CI_ResponsibleParty (see Authors and Contact Persons above).
Other relations
Other relations are references to internal or external resources, such as related studies or datasets. They are commonly described by DataCite related identifiers, see https://support.datacite.org/docs/relationtype_for_citation.
However, neither the CF-Conventions nor INSPIRE define such a relation type. Both do allow adding supplementary studies (see below), and we’ll simply add this information as a DatasetReference object (see the Graph tab).
If, however, the URI corresponds to a handle in the model data explorer, we can also directly transfer this into a relation between datasets (see Graph tab) and suggest that Dataset A is a supplement to Dataset B (see Datasets).
from django.db import models


class Dataset(models.Model):
    title = models.CharField(max_length=50)


class DatasetReference(models.Model):
    dataset = models.ForeignKey(Dataset, on_delete=models.CASCADE)
    description = models.CharField(max_length=400)
    uri = models.URLField(max_length=300)


class DatasetRelation(models.Model):
    left = models.ForeignKey(Dataset, on_delete=models.CASCADE, related_name="left_relation")
    right = models.ForeignKey(Dataset, on_delete=models.CASCADE, related_name="right_relation")
    relation_type = models.CharField(max_length=30)
netCDF files can define global references and doi attributes. We will check these for common DOI patterns and use them to extract the uri for the DatasetReference (see the Graph tab above).
INSPIRE encodes references as MD_AggregateInformation with a specific DS_AssociationTypeCode, namely crossReference. So we will simply use the MD_Identifier of these tags. If the MD_Identifier is listed as gmd:code, we will assume it’s a DOI and transform it to the corresponding URL; otherwise we take it as the description of the DatasetReference and try to find a URL in it.
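The check for common DOI patterns can be sketched with a regular expression. The pattern below covers the usual shape of a DOI (the prefix `10.`, a numeric registrant code, a slash and a suffix) but is an approximation, not a full validation. The function name `doi_to_url` is hypothetical, and the DOIs in the usage note are made up.

```python
import re

# Common DOI shape: "10.", a registrant code of 4 to 9 digits, "/", a suffix
DOI_PATTERN = re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)")


def doi_to_url(text):
    """Return the https://doi.org/ URL of the first DOI in text, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # drop punctuation that commonly trails a DOI at the end of a sentence
    doi = match.group(1).rstrip(".,;)")
    return "https://doi.org/" + doi
```

For instance, `doi_to_url("doi:10.1234/abcd.5678")` yields `https://doi.org/10.1234/abcd.5678`, while a text without a DOI yields `None` and would be kept as the description only.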