CHESS Course: An Introduction to FAIR Data Management for Geoscientists

Course description:

This course introduces the FAIR guiding principles for data management, their specific implementations within geoscience, and practical exercises. Practical steps towards Findable, Accessible, Interoperable and Reusable data are discussed and exercised, emphasising both the data provider and the data consumer perspectives. The elements of the FAIR guiding principles are introduced through the following topics: discovery metadata and use metadata; persistent identifiers (e.g. Digital Object Identifiers) and how they help traceability of decisions (e.g. through scientific citation of data); containers for data (e.g. NetCDF); semantics for geoscientific data (glossaries, thesauri, taxonomies and ontologies) in an interdisciplinary context, and terminology as a mechanism for scientific collaboration; tools for generating FAIR data (e.g. how to work with Rosetta and other tools for converting data, and how to use Python); how to work with FAIR data; how to publish data with the help of data centres; how to publish data with the help of schema.org (focusing on discoverability by Google); national structures that facilitate data sharing (e.g. Norwegian Marine Data Centre, Norwegian Scientific Data Network, Norwegian Infrastructure for Research Data) and how these are connected; and how to work with Data Management Plans that are or will be required by funding agencies and resource providers (e.g. UNINETT Sigma2).

Practical work is based on students bringing their own data, evaluating its FAIRness, and improving that FAIRness by using Rosetta and Python to create NetCDF files according to the Climate and Forecast (CF) Convention with Attribute Convention for Dataset Discovery (ACDD) attributes embedded.

Outcomes:

At the end of the course, students will know the FAIR guiding principles, best practices for FAIR data within geoscience, and practical approaches to achieving FAIR data using Rosetta and Python. They will also know how to work with data management plans in their future careers.

Learning modules/structure:

The course will be held online (Zoom), with the link provided to participants in good time before the course. The first day will be a full day (6 hours) of lectures that introduce the main concepts and the first assignment, on data curation. The 2nd day will be dedicated to self-study, where students work on curating their own dataset. Lecturers will be available on Zoom (open room outside lecture hours) and in a dedicated Slack channel throughout the week to support students. On the 3rd day, students will present the results of the 1st assignment, followed by lectures and an introduction to the 2nd assignment (on data exploitation). After another day of self-study on the assignment (day 4), day 5 concludes with student presentations of the 2nd assignment, further lectures, and a course wrap-up.
A more detailed outline of the lectures is provided below. Students are required to describe and upload the dataset they will work with one week in advance of the course start.

Course home page:

https://chess.w.uib.no/activities/upcoming-activities/an-introduction-to-fair-data-management-for-geoscientists/

Lecturers:

  • Øystein Godøy (o.godoy@met.no, Head of Division for Remote Sensing and Data Management, The Norwegian Meteorological Institute)
  • Markus Fiebig (mf@nilu.no, Senior Scientist at the Atmosphere and Climate Department, Norwegian Institute for Air Research)
  • Torill Hamre (Torill.Hamre@nersc.no, Research Leader/Senior Researcher at the Scientific Data Management Division, Nansen Environmental and Remote Sensing Center)
  • Lara Ferrighi (laraf@met.no, Research Scientist at the Remote Sensing and Data Management Division, The Norwegian Meteorological Institute)

 

Lecture program:

Lecture Day 1, Monday

09:00-09:15: Introduction (Øystein)

09:15-10:15: Motivation: Why do we need data management? (Øystein)

  • Why do we need data management?
  • Data Sharing and Management Snafu in 3 Short Acts
    https://www.youtube.com/watch?v=N2zK3sAtr-4
  • Science life cycle/Data life cycle
  • How to change data sharing culture.
  • What are the FAIR data principles?
  • How do they help with good data management?
  • External boundary conditions by funding agencies and publishers, scientific data as service.
  • Data management plan.

Material: Day1-Session1

10:15-10:30: Break

10:30-11:45: The basics: data and metadata (Lara, Markus)

  • What are data? What are metadata?
  • Discovery, site, and use metadata.
  • What is provenance?
  • Plan your experiment. Which data and metadata do you need to record?
  • How to record various types of metadata.
  • Metadata templates (Arven etter Nansen, EBAS)
  • Gap handling for metadata (missing elements).

Material: Day1-Session2, Day1-Session2-B

11:45-12:30: Lunch break

12:30-13:45: Data structure/formatting (Øystein, Markus)

  • NetCDF/CF grid, trajectory, profile, timeseries
  • Standard names, vocabularies
  • Granularity requirements

Material: Day1-Session3, Day1-Session3-B

13:45-14:00: Break

14:00-15:30: Documentation of data (Torill)

  • Tools for documenting data
  • Rosetta (web application), NCO/CDO (command line), Python (netCDF4/xarray), R
  • More detail on Python
  • Validation tools for NetCDF-CF.
  • What is actually validated?
  • NorDataNet validator, PUMA validator
  • Rosetta in more detail
  • Profiles, time series, trajectory
  • Template concept, benefits for processing multiple datasets, possibilities for collaboration (e.g. place template files in GitHub)
  • Examples of e.g. CTD profile from Seabird sensor
  • Introduction of assignment

Material: Day1-Session4
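
The question "What is actually validated?" from the session above can be illustrated with a small sketch. It is not how the NorDataNet or PUMA validators work internally; it only shows the kind of attribute check such tools perform. The global attribute list is the "highly recommended" set from ACDD 1.3, and the variable rule (a `units` attribute plus either `standard_name` or `long_name`) is a simplification of the CF requirements. The function names and the example dataset are hypothetical. The attributes are written as plain dictionaries here; in practice they would be read from a NetCDF file, for instance with netCDF4 or xarray.

```python
# Highly recommended global attributes in ACDD 1.3.
ACDD_HIGHLY_RECOMMENDED = ("title", "summary", "keywords", "Conventions")


def missing_global_attributes(global_attrs, required=ACDD_HIGHLY_RECOMMENDED):
    """Return the required attribute names that are absent or empty."""
    return [
        name for name in required
        if not str(global_attrs.get(name, "")).strip()
    ]


def variables_without_cf_basics(variables):
    """Return variables lacking 'units' or any of 'standard_name'/'long_name'.

    This is a simplified rule, not a full CF conformance check.
    """
    flagged = []
    for name, attrs in variables.items():
        has_name = "standard_name" in attrs or "long_name" in attrs
        if not has_name or "units" not in attrs:
            flagged.append(name)
    return flagged


# Hypothetical example dataset, written out as plain dictionaries.
example_globals = {
    "title": "Air temperature at a hypothetical station",
    "summary": "",
    "Conventions": "CF-1.8, ACDD-1.3",
}
example_variables = {
    "air_temperature": {"standard_name": "air_temperature", "units": "K"},
    "pressure": {"long_name": "air pressure"},
}

print(missing_global_attributes(example_globals))      # ['summary', 'keywords']
print(variables_without_cf_basics(example_variables))  # ['pressure']
```

A check like this only confirms that attributes exist. Whether a `standard_name` is a valid entry in the CF standard name table, or whether the units are consistent with it, requires a dedicated validator.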

Git repository with training data: https://github.com/NorDataNet/TrainingMaterial

Further instructions on the assignment and its delivery are given in the shared Google folder.

Lecture Day 2, Wednesday

09:00-10:15: Presentation of assignment results and feedback

10:15-10:30: Break

10:30-11:45: Publishing your data (Øystein, Markus)

  • Mandated and long term archives
  • Data publications
  • Persistent identifiers (PIDs), with explicit mention of DOIs
  • Data policies / Licensing
  • Tracking usage (using DOI)
  • Repositories:
    • NorDataNet (distributed network of data centres)
    • NIRD RDA
    • GAW repositories
    • Repositories for model data
    • Figshare

Material: Day2-Session2, Day2-Session2-B
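
As an illustration of how the publishing topics above (persistent identifier, licence, discoverability) come together, the following is a minimal sketch of a schema.org `Dataset` description, the kind of record the course description refers to when it mentions publishing with schema.org. The property names (`name`, `description`, `identifier`, `license`, `keywords`) are schema.org terms; the helper function and all values, including the DOI, are placeholders and do not refer to a real dataset.

```python
import json


def dataset_jsonld(title, description, doi, license_url, keywords):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": title,
        "description": description,
        # The DOI is expressed as a resolvable URL.
        "identifier": f"https://doi.org/{doi}",
        "license": license_url,
        "keywords": list(keywords),
    }


# All values below are placeholders, not a real dataset or DOI.
record = dataset_jsonld(
    title="Example CTD profiles",
    description="Hypothetical conductivity, temperature and depth profiles.",
    doi="10.1234/example",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    keywords=["oceanography", "CTD"],
)

# The JSON text would be embedded in the dataset landing page inside a
# <script type="application/ld+json"> element.
print(json.dumps(record, indent=2))
```

Data centres typically generate such records from the discovery metadata they already hold, so in practice a data provider rarely writes them by hand.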

11:45-12:30: Lunch break

12:30-13:45: How to exploit / process further / consume data (Torill)

  • Interfaces to data
  • Examples of benefits when using truly interoperable data.
  • Interfaces: WMS, OGC API, OpenAPI, OPeNDAP, REST APIs in general
  • Integration in tools e.g.:
    • Python
    • R
    • Jupyter

Material: Day2-Session3

13:45-14:00: Break

14:00-15:30: Intro to Workshop: Analysing data (Torill, Øystein)

Git repository with training data: https://github.com/NorDataNet/TrainingMaterial

Further instructions on the assignment and its delivery are given in the shared Google folder.

Lecture Day 3, Friday

09:00-10:15: Presentation of assignment results and feedback, part 1

10:15-10:30: Break

10:30-11:45: Presentation of assignment results and feedback, part 2

11:45-12:30: Lunch break

12:30-13:45: Data sharing ethics & culture, and how NorDataNet services help. (Øystein)

  • Data sharing ethics, in particular before publishing
  • Data Life Cycle and its relation to the scientific workflow, revisited from a scientist's point of view
  • Data sharing in a cultural perspective and relations to the scientific workflow
  • NorDataNet service overview

Material: Day3-Session3

13:45-14:00: Break

14:00-15:30: Student summary of the course: what has been useful (and what has not?).