CHESS Course: An Introduction to FAIR Data Management for Geoscientists

Course description:

This course provides an introduction to the FAIR guiding principles for data management, their specific implementations within geoscience, and practical exercises. Practical steps towards Findable, Accessible, Interoperable and Reusable data are discussed and exercised, emphasising both the data provider and the data consumer perspectives. The elements of the FAIR guiding principles are introduced through the following topics:

- discovery metadata and use metadata
- persistent identifiers (e.g. Digital Object Identifiers) and how they help traceability of decisions (e.g. through scientific citation of data)
- containers for data (e.g. NetCDF)
- semantics for geoscientific data (glossaries, thesauri, taxonomies and ontologies) in an interdisciplinary context, and terminology as a mechanism for scientific collaboration
- tools for generating FAIR data (e.g. how to work with Rosetta and other tools for converting data, how to use Python)
- how to work with FAIR data
- how to publish data with the help of data centres
- how to publish data with the help of schema.org (focusing on discoverability by Google)
- national structures that facilitate data sharing (e.g. Norwegian Marine Data Centre, Norwegian Scientific Data Network, Norwegian Infrastructure for Research Data) and how these are connected
- how to work with Data Management Plans that are or will be required by funding agencies and resource providers (e.g. UNINETT Sigma2)

Practical work is based on students bringing their own data, evaluating its FAIRness, and improving it by using Rosetta and Python to create NetCDF files according to the Climate and Forecast (CF) Convention with the Attribute Convention for Dataset Discovery (ACDD) embedded.

Outcomes:

At the end of the course, students will know the FAIR guiding principles, best practices for FAIR data within geoscience, and practical approaches to achieving FAIR data using Rosetta and Python, as well as how to work with data management plans in their future careers.

Learning modules/structure:

The first day will be a full day (6 hours) of lectures introducing different concepts. One day will be for self study, where students work on their own dataset. Lecturers will be available on Zoom (open room outside lecture hours) and in a dedicated Slack channel throughout the week to support students. A more detailed outline of the lectures will be provided online. Students are required to describe and upload the dataset they will work with. At the end of the course (last day), each student presents the FAIRness status of their data following the exercises undertaken. This session is scheduled for 5 hours (a 10-minute presentation by each student and a longer discussion session).

Course home page:

https://chess.w.uib.no/event/introduction-to-fair-data-management-for-geoscientists/?instance_id=112

Lecturers:

Øystein Godøy (o.godoy@met.no, Head of Division for Remote Sensing and Data Management, The Norwegian Meteorological Institute)
Markus Fiebig (mf@nilu.no, Senior Scientist at the Atmosphere and Climate Department, Norwegian Institute for Air Research)
Torill Hamre (Torill.Hamre@nersc.no, Research Leader/Senior Researcher at the Scientific Data Management Division, Nansen Environmental and Remote Sensing Center)
Lara Ferrighi (laraf@met.no, Research Scientist at the Remote Sensing and Data Management Division, The Norwegian Meteorological Institute)

Further contacts: to contact the project about this course, we recommend using the contact page at https://www.nordatanet.no/en/contact. This will ensure a more prompt answer to your enquiry compared to writing emails directly to one of the lecturers.

Relevant info: Some of your questions/issues might already be tackled in the FAQ.

Syllabus:

Day 1

09:00-09:15 Introduction

Online meeting ethics
Use of break-out rooms
Use of “raise hand” function
Interactive course!

09:15-10:15: Motivation: Why do we need data management? (Øystein)

Why do we need data management?
Data Sharing and Management Snafu in 3 Short Acts
https://www.youtube.com/watch?v=N2zK3sAtr-4
Science life cycle/Data life cycle
How to change data sharing culture.
What are the FAIR data principles?
How do they help with good data management?
External boundary conditions by funding agencies and publishers, scientific data as service.
Data management plan.

MATERIAL: Session-1

10:15-10:30: Break

10:30-11:45: The basics: data and metadata (Lara, Markus)

What are data? What are metadata?
Discovery, site, and use metadata.
What is provenance?
Plan your experiment. Which data and metadata do you need to record?
How to record various types of metadata.
Metadata templates (Arven etter Nansen, EBAS)
Gap handling for metadata (missing elements).

MATERIAL: Session-2, Session-2-B

11:45-12:30: Lunch break

12:30-13:45: Data structure/formatting (Øystein, Markus)

NetCDF/CF grid, trajectory, profile, timeseries
Granularity requirements
Standard names, vocabularies

MATERIAL: Session-3, Session-3-B

14:00-15:30: Summary of the day

Group work. Groups will present a summary of today’s lessons in their own words. One group per section. (Moderator: Lara)

Day 2

09:00-10:15: Documentation of data (Torill)

Tools for documenting data
- Rosetta (web application), NCO/CDO (command line), Python (netCDF4), R
- More detail on Python
Validation tools for NetCDF-CF.
- What is actually validated?
- NorDataNet validator, PUMA validator
Rosetta in more detail
- Profiles, time series, trajectory
- Template concept, benefits for processing multiple datasets, possibilities for collaboration (e.g. place template files in GitHub)
- Examples of e.g. CTD profile from Seabird sensor
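
As a rough illustration of the question "What is actually validated?", the sketch below checks a dataset description for the four global attributes that ACDD 1.3 lists as highly recommended (title, summary, keywords, Conventions) and for basic per-variable attributes (units, and a standard_name or long_name). It is a simplified stand-in and does not reproduce what the NorDataNet or PUMA validators do. The function name `check_metadata` and the example dataset are hypothetical, and plain dictionaries stand in for attributes that would normally be read from a NetCDF file.

```python
# Simplified metadata check; NOT the logic of the NorDataNet or PUMA validators.

# ACDD 1.3 lists these four global attributes as "highly recommended".
ACDD_HIGHLY_RECOMMENDED = ("title", "summary", "keywords", "Conventions")


def check_metadata(global_attrs, variables):
    """Return a list of human-readable findings (an empty list means none)."""
    problems = []

    # Discovery metadata: global attributes must be present and non-empty.
    for name in ACDD_HIGHLY_RECOMMENDED:
        if not str(global_attrs.get(name, "")).strip():
            problems.append(f"missing global attribute: {name}")

    # Use metadata: each variable needs units and a name a reader can interpret.
    for var_name, attrs in variables.items():
        if "units" not in attrs:
            problems.append(f"{var_name}: missing units")
        if "standard_name" not in attrs and "long_name" not in attrs:
            problems.append(f"{var_name}: missing standard_name or long_name")

    return problems


# Hypothetical dataset description, standing in for attributes read from a file.
example_globals = {
    "title": "CTD profile, example station",
    "summary": "Temperature profile from a single cast.",
    "Conventions": "CF-1.8, ACDD-1.3",
}
example_variables = {
    "sea_water_temperature": {
        "standard_name": "sea_water_temperature",
        "units": "degree_Celsius",
    },
    "depth": {"long_name": "depth below surface"},
}

findings = check_metadata(example_globals, example_variables)
for line in findings:
    print(line)
```

Run on the example, the check reports the missing keywords attribute and the missing units on depth. A real validator goes much further, for example by checking standard names and units against the CF tables.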

MATERIAL: Session-1

10:15-10:30: Break

10:30-11:45: Workshop: Document your own data (Torill, Øystein, Lara, Markus)

11:45-12:30: Lunch break

12:30-13:45: Publishing your data (Øystein, Markus)

Mandated and long term archives
Data publications
Persistent identifiers (PIDs), with explicit mention of DOIs
Data policies / Licensing
Tracking usage (using DOI)
Repositories:
- NorDataNet (distributed network of data centres)
- NIRD RDA
- GAW repositories
- Repositories for model data
- Figshare

MATERIAL: Session-2, Session-2-B

13:45-14:00: Break

14:00-15:30: Summary of the day

Group work. Groups will present a summary of today’s discoveries in their own words. (Moderator: Markus)

Day 3

09:00-10:15: How to exploit / process further / consume data (Torill)

Interfaces to data
Examples of benefits when using truly interoperable data.
Interfaces: WMS, OGC API, OpenAPI, OPeNDAP, and RESTful interfaces in general
Integration in tools e.g.:
- Python
- R
- Jupyter

MATERIAL: Session-1

10:15-10:30: Break

10:30-11:45: Workshop: Analysing data (Moderators: Torill, Øystein)

11:45-12:30: Lunch break

12:30-13:45: Data sharing ethics & culture, and how NorDataNet services help. (Øystein)

Data sharing ethics, particularly before publishing
Data Life Cycle and its relation to the scientific workflow, revisited from a scientist's point of view
Data sharing in a cultural perspective and relations to the scientific workflow
NorDataNet service overview

MATERIAL: Session-2

13:45-14:00: Break

14:00-15:30: Student summary of the course, what has been useful (and not). (Moderator: Lara)

Group work. Groups will present a summary of today’s discoveries in their own words.