Metadata: An Introduction
Metadata is descriptive information about other data. It is nothing new: digital images and music files, for example, have carried metadata for many years. In the case of media files, metadata holds information on the author or artist, the year of publication, the genre and, in the case of music, whether a track belongs to an album or collection. With these additional bits of information, data can be put in context, connections between individual items can be established, and files can be organized and searched. Similar motives drive the implementation of metadata for open data. Publishing datasets as machine-readable open data on public data portals is only a first step. Only when this data is complemented with standardised and detailed metadata does it become a cornerstone of a transparent and sustainable digital infrastructure. The following paragraphs discuss the new German open data standard DCAT-AP.de and use applied examples to demonstrate the need for good metadata.
DCAT is short for Data Catalog Vocabulary and AP stands for Application Profile. DCAT-AP is a European standard for describing datasets. DCAT-AP.de is the German adaptation of the standard (specifically of the vocabulary). So even if you are using DCAT-AP or another national adaptation, this article might still be a useful introduction to the topic. DCAT-AP.de can be broken down into two components: 1) The structure of a metadata file, that is, which parameters to include in a metadata description, for example, the URL of the dataset or the license. 2) A controlled vocabulary that should be used for the values of these parameters, for example, license type or spatial granularity (national or city data). For use of the standard, the committee suggests a minimum set of attributes to include (download examples). This light version can be extended depending on data type and available information. The metadata can be provided in the machine-readable formats XML, RDF and JSON. Definitions of each attribute and its vocabulary can be found on www.DCAT-AP.de.
DCAT-AP Minimum Version
Metadata descriptions according to DCAT-AP are split into multiple main components. In the minimum variant these are the two elements dcat:Catalog and dcat:Dataset. dcat:Catalog describes the catalogue the dataset belongs to, for example, the open data portal of Berlin or, in the example above, Hamburg. All attributes within this element (besides dcat:dataset) describe the overall catalogue or portal, not the individual item. dcat:Dataset, in contrast, describes the specific dataset in question. In addition, there are multiple other main components, for example, dcat:Distribution.
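To make the nesting of catalogue and dataset more concrete, here is a minimal sketch in Python that assembles such a record as a JSON-style dictionary. It is an illustration only, not a validated DCAT-AP.de record: the function name build_minimal_record, the choice of attributes, the placement of the license on the distribution and all example values are assumptions made for this sketch.

```python
import json

# Namespace prefixes for the DCAT and Dublin Core Terms vocabularies.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def build_minimal_record(catalog_title, dataset_title, description,
                         license_uri, download_url):
    """Return a catalogue description that contains exactly one dataset."""
    dataset = {
        "@type": "dcat:Dataset",
        "dct:title": dataset_title,
        "dct:description": description,
        # Assumption for this sketch: license and URL sit on the distribution.
        "dcat:distribution": [{
            "@type": "dcat:Distribution",
            "dcat:accessURL": {"@id": download_url},
            "dct:license": {"@id": license_uri},
        }],
    }
    # Everything on this level describes the portal, not the single dataset.
    return {
        "@context": CONTEXT,
        "@type": "dcat:Catalog",
        "dct:title": catalog_title,
        "dcat:dataset": [dataset],
    }


# Hypothetical example values, chosen only for illustration.
record = build_minimal_record(
    catalog_title="Example open data portal",
    dataset_title="Street trees",
    description="Locations of street trees.",
    license_uri="http://example.org/licenses/example-license",
    download_url="http://example.org/data/street-trees.csv",
)
print(json.dumps(record, indent=2))
```

The outer dictionary plays the role of dcat:Catalog, and each entry under dcat:dataset is one dcat:Dataset, which mirrors the split described above.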
The minimum variant only contains essential attributes like title, description, license, URLs and information on the contributor of the data. The full strength of the metadata standard unfolds when additional attributes are added. A few examples of such additional attributes are explored below.
dct:issued and dct:modified describe when a catalogue or dataset was created and last modified. Using this information, users can determine whether their copy of the data is still up to date or needs updating. dct:accrualPeriodicity tells users at which intervals the data is updated. For a sustainable and long-lasting relationship between data providers and consumers, transparent and reliable documentation of data provision processes is important. If, for example, update intervals are described, users can streamline their own processes accordingly.
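As an illustration of how a data consumer might use these attributes, the following Python sketch checks whether a dataset looks overdue for an update. It assumes that dct:accrualPeriodicity is given as a URI whose last path segment names the frequency (as in the EU frequency vocabulary) and that dct:modified is an ISO date. The function is_overdue and the day thresholds in MAX_AGE_DAYS are hypothetical choices for this sketch, not part of the standard.

```python
from datetime import date, timedelta

# Assumed tolerance per frequency label, in days; adjust to your own needs.
MAX_AGE_DAYS = {"DAILY": 1, "WEEKLY": 7, "MONTHLY": 31, "ANNUAL": 366}


def is_overdue(modified_iso, frequency_uri, today):
    """Return True if the last modification is older than the stated interval.

    Raises KeyError for frequency labels missing from MAX_AGE_DAYS.
    """
    # Take the last path segment of the URI as the frequency label.
    label = frequency_uri.rstrip("/").rsplit("/", 1)[-1]
    last_modified = date.fromisoformat(modified_iso)
    return today - last_modified > timedelta(days=MAX_AGE_DAYS[label])


MONTHLY = "http://publications.europa.eu/resource/authority/frequency/MONTHLY"
print(is_overdue("2018-01-01", MONTHLY, today=date(2018, 3, 1)))
print(is_overdue("2018-02-15", MONTHLY, today=date(2018, 3, 1)))
```

The first call compares a gap of 59 days against the 31-day tolerance, the second a gap of 14 days. A data pipeline could run such a check before each import and only download the file again when needed.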
dct:temporal, on the other hand, does not describe the data file but the data contained in the file. Using schema:startDate and schema:endDate, the dct:temporal attribute defines the timespan covered by the dataset. With this information, users can find datasets for the same period without analysing the datasets themselves. Search engines for open data can use it to help users find the most suitable data sources.
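A search for datasets covering the same period boils down to an interval overlap test. The following Python sketch assumes that dct:temporal has already been read into a dictionary with ISO dates under the keys schema:startDate and schema:endDate. The helper names parse_temporal and periods_overlap and the sample periods are hypothetical.

```python
from datetime import date


def parse_temporal(temporal):
    """Read start and end date from a dct:temporal-like dictionary."""
    start = date.fromisoformat(temporal["schema:startDate"])
    end = date.fromisoformat(temporal["schema:endDate"])
    return start, end


def periods_overlap(a, b):
    """Return True if the two periods share at least one day."""
    a_start, a_end = parse_temporal(a)
    b_start, b_end = parse_temporal(b)
    # Two closed intervals overlap if each one starts before the other ends.
    return a_start <= b_end and b_start <= a_end


# Hypothetical periods, chosen only for illustration.
census = {"schema:startDate": "2011-01-01", "schema:endDate": "2011-12-31"}
series = {"schema:startDate": "2005-01-01", "schema:endDate": "2015-12-31"}
recent = {"schema:startDate": "2016-01-01", "schema:endDate": "2017-12-31"}

print(periods_overlap(census, series))
print(periods_overlap(census, recent))
```

Because only the metadata is compared, such a filter can run over a whole catalogue without opening a single data file.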
To underline the importance of such temporal attributes for describing the contained data, the following visualisation shows a sample of datasets from the Berlin-Brandenburg office for statistics and the periods they describe. In some cases, for example, elections and the census, data is only available for the years in which the events happened. In other cases, like the BQFG (the law on determining professional qualifications) or the trainee program (Auszubildende), the statistical methods have changed over the years and, therefore, two time series datasets are available. Sometimes the digitization of data does not reach back far enough, and sometimes the gap between the time of observation and the publication of the data is longer.
Similar to the temporal description, the spatial attributes are organised in two layers. The catalogue itself is georeferenced through a GeoNames ID, which describes the catchment area of the catalogue using the attribute "dct:spatial". GeoNames is an open database for spatial descriptors. On the second level the actual data itself is described. The granularity of the data is defined through dcatde:politicalGeocodingLevelURI. At the moment, possible values are EU, national, federal state, districts and cities. Through dcatde:politicalGeocodingURI a predefined ID can be assigned (Berlin, for example, would be 11). An additional textual description of the area can be provided through dcatde:geocodingText. More detailed, and more interesting for computational processing, is the dct:spatial attribute on the dataset level. It can hold spatial information, for example, a polygon. The simplest example would be a bounding box (see graphic below). Using this spatial information, users can search for datasets of similar spatial extent in order to combine them.
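Searching for datasets of similar spatial extent can be sketched as a bounding box comparison. The Python example below assumes that the polygon from dct:spatial has already been parsed into a list of (longitude, latitude) pairs. The helpers bounding_box and boxes_intersect are hypothetical, and the coordinates are rough, illustrative values rather than official boundaries.

```python
def bounding_box(polygon):
    """Return (min_lon, min_lat, max_lon, max_lat) for a list of points."""
    lons = [lon for lon, _ in polygon]
    lats = [lat for _, lat in polygon]
    return (min(lons), min(lats), max(lons), max(lats))


def boxes_intersect(a, b):
    """Return True if two bounding boxes overlap or touch."""
    a_min_lon, a_min_lat, a_max_lon, a_max_lat = a
    b_min_lon, b_min_lat, b_max_lon, b_max_lat = b
    # The boxes intersect only if they overlap on both axes.
    return (a_min_lon <= b_max_lon and b_min_lon <= a_max_lon
            and a_min_lat <= b_max_lat and b_min_lat <= a_max_lat)


# Rough, illustrative extents (longitude, latitude), not official boundaries.
berlin = bounding_box([(13.09, 52.34), (13.76, 52.34),
                       (13.76, 52.68), (13.09, 52.68)])
brandenburg = (11.27, 51.36, 14.77, 53.56)
hamburg = (9.73, 53.39, 10.33, 53.74)

print(boxes_intersect(berlin, brandenburg))
print(boxes_intersect(berlin, hamburg))
```

A bounding box test is only a coarse first filter. Two regions can have overlapping boxes without actually sharing any area, so a precise comparison would need the full polygons.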
To allow users to get in touch with the responsible person, for example, regarding license questions, the standard offers the possibility to include contact information.
A standard that is applied by various data providers with heterogeneous backgrounds requires a controlled vocabulary to ensure a sustainable and long-lasting application of the standard. Controlled vocabularies are available for many of the DCAT-AP attributes; for example, a list of available licenses is provided. An item from the vocabulary is a URL, which points to the description of the term or the list of available terms on the DCAT-AP definition site. It is highly recommended to make use of the existing vocabulary. If a term is not yet included, one should get in touch with the committee to have the term added to the next revision of the standard.
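Because vocabulary items are plain URLs, checking a metadata file against a controlled vocabulary can be as simple as a set lookup. The Python sketch below shows the idea. The two license URLs are meant as an illustrative excerpt and should be checked against the official license list, and the function unknown_licenses is a hypothetical helper.

```python
# Illustrative excerpt of a license vocabulary; verify the entries against
# the official list before relying on them.
LICENSE_VOCABULARY = frozenset({
    "http://dcat-ap.de/def/licenses/cc-by/4.0",
    "http://dcat-ap.de/def/licenses/dl-by-de/2.0",
})


def unknown_licenses(license_uris, vocabulary=LICENSE_VOCABULARY):
    """Return the license URIs that are not part of the vocabulary."""
    # Strip surrounding whitespace so copy-paste errors do not cause misses.
    return [uri for uri in license_uris if uri.strip() not in vocabulary]


used = [
    "http://dcat-ap.de/def/licenses/cc-by/4.0",
    "CC BY 4.0",  # free text instead of a vocabulary URL
]
print(unknown_licenses(used))
```

The free-text entry is reported as unknown, which is exactly the kind of inconsistency a controlled vocabulary is meant to prevent.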
In this introduction to the new metadata standard DCAT-AP.de we tried to shed some light on its advantages with regard to sustainable data infrastructures, usability and machine-readability. Besides the attributes highlighted in this article there are many more, from pointing out relationships to other versions of a file (dct:hasVersion or dct:isVersionOf) to describing the language of the available information (dct:language). In some cases there will be redundancy between attributes, which is a feature, not a bug: such redundancies are a by-product of covering as many use cases as possible. In everyday practice, such metadata files will often be written semi-automatically, for example, by automatically filling in the information about the data provider.
More information, complete documentation of all attributes available in the standard and additional documents are available here:
The Technologiestiftung Berlin (Technology Foundation Berlin) is not a data provider in the metropolitan area of Berlin, but we regularly work with open data. Whenever we refine and combine data, we would like the community to benefit from our work. Therefore, we publish the refined data on our open data site. Every dataset on our site has a DCAT-AP.de metadata file. If you are interested in some applied examples, check the files on our site.