In tandem - a smart approach to Data Stewardship

How can data stewardship tasks be successfully shared? Dr. Sabina Keller and Dr. Lukas Hörtnagl (Grassland Sciences research group at D-USYS) provide some insights.

Dr. Lukas Hörtnagl and Dr. Sabina Keller

Data stewardship in a research group does not have to be a “one-person show”. What synergies can be used when working in tandem to overcome challenges and make a research group's work more efficient? The ETH Library shows how this can be achieved by interviewing Dr. Sabina Keller (SK) and Dr. Lukas Hörtnagl (LH).

Do you work at ETH Zurich and are you involved in research data management, data stewardship or open research data?  

Network with other experts, share experiences and solutions, and learn from best practices in the Data Stewardship Network. Tap into the community's knowledge and actively shape data stewardship at ETH: get in touch with us.

Sabina and Lukas, you currently work as a data archive manager and a data scientist. What are your tasks?

SK: As a data archive manager, I introduce all new group members – from bachelor students to postdocs – to our data management standards. This involves topics such as the FAIR principles, our variable naming convention, metadata documentation, our data policy, etc. At the beginning of their work, new group members sign a “Research Data Access and Use Agreement”. When a project is completed, I meet with the data author to discuss the preparation of the data for long-term archiving. After the data has been entered, I check the data sets of the project for completeness, adjust them if necessary with regard to formats and encoding, and transfer them with the associated metadata and the corresponding usage licence to the long-term data archive of the ETH Library. Since I sometimes need to transfer very large amounts of data from a project to the ETH Data Archive, we use a special “batch processing ingest” in collaboration with the ETH Library. The data is later linked to the Research Collection and can be found and, if open access, downloaded there.

LH: As a data scientist, I cover a wide range of tasks in our team, including both technical data processing and scientific analysis. My main focus is on calculating the gas exchange between the biosphere and the atmosphere. This includes carefully checking the quality of data and correcting and completing it to create seamless long-term data sets. We make these data sets, which at some of our measuring stations in the Swiss FluxNet already cover several decades, openly available via platforms such as the ETH Research Collection and FLUXNET.

To monitor the data from our measuring stations more efficiently, I implemented a database (InfluxDB) a few years ago, into which new measurements are fed daily. I also support students and postdocs in processing and interpreting their ecosystem data to help them successfully implement their research projects.

Dr Sabina Keller has been working in various roles as a teaching and research assistant in the group since 2004: as a data archive manager in the training of group members and the archiving of research data, as a lecturer, and with outreach projects in science and research communication and education.

Dr Lukas Hörtnagl has been working in the group since 2014, initially as a postdoc and later as a data scientist. Among other things, Lukas studies the exchange of gases such as carbon dioxide and methane between the biosphere and the atmosphere, and coordinates the upload of data to international databases to make current research data openly accessible.

What challenges do you face in your data management tasks?

SK: A lack of documentation during data collection can mean that metadata needed for later use by other researchers is missing. For example, if the geographical coordinates (GIS) of field sampling locations are not recorded, follow-up surveys cannot be conducted at exactly the same locations in the future.
Often, you only realise where there is a need for action or clarification when problems arise. Recently, we discussed useful unique sample IDs for data collected by hand.

LH: One challenge is defining and documenting transparent and comprehensible processing criteria, which is often complex due to the diversity of our measurement data. In data processing, we therefore make a fundamental distinction between raw data and processed data. Raw data is the original, unaltered data collected directly from measuring instruments in the field. It forms the basis for all further processing steps. Processed data is created by processing the raw data. For example, erroneous measurements can be removed based on defined criteria, or data can be transformed. To increase transparency, we have been documenting data processing in Jupyter Notebooks for several years. These notebooks contain both the program code used and visual representations of the data processing steps. This makes it possible to quickly and easily check the creation of the data sets at any time.
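
The distinction between raw and processed data can be illustrated with a minimal Python sketch. It is not the group's actual notebook code: the variable name `co2_flux`, the plausibility limits, and the helpers `RangeCriterion` and `apply_criteria` are hypothetical and chosen only for illustration. The sketch assumes that a "defined criterion" is a simple allowed value range, leaves the raw records untouched, and keeps a log of every removed value so that the processing step stays traceable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeCriterion:
    """Hypothetical quality criterion: values outside [lower, upper] are rejected."""
    variable: str
    lower: float
    upper: float


def apply_criteria(raw_records, criteria):
    """Create processed records from raw records without modifying the raw data.

    Returns (processed_records, removal_log). Rejected values are replaced
    by None, and each removal is logged as (record_index, variable, value).
    """
    processed_records = []
    removal_log = []
    for index, record in enumerate(raw_records):
        processed = dict(record)  # copy, so the raw record stays unaltered
        for criterion in criteria:
            value = processed.get(criterion.variable)
            if value is None:
                continue  # nothing measured, nothing to check
            if not (criterion.lower <= value <= criterion.upper):
                processed[criterion.variable] = None
                removal_log.append((index, criterion.variable, value))
        processed_records.append(processed)
    return processed_records, removal_log


# Illustrative raw data; the values and limits are made up.
raw = [
    {"timestamp": "2024-06-01T12:00", "co2_flux": -8.2},
    {"timestamp": "2024-06-01T12:30", "co2_flux": 950.0},
]
processed, log = apply_criteria(raw, [RangeCriterion("co2_flux", -50.0, 50.0)])
for entry in log:
    print("removed:", entry)
```

Keeping the removal log next to the processed data serves the same purpose as the notebooks described above: anyone can later check which measurements were dropped and why.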

Another challenge is choosing a suitable data format for data storage. We value easy and direct access to our data sets. Therefore, we store raw data and results in generally readable text files (CSV, packaged as ZIP). For the actual data processing, we also use formats such as Apache Parquet, which enable high processing speeds for large amounts of data.
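
As a rough illustration of the "CSV, packaged as ZIP" storage format, the following sketch writes a small table to a CSV file inside a ZIP archive and reads it back, using only the Python standard library. The function names `write_csv_zip` and `read_csv_zip`, the file names, and the column names are hypothetical and not taken from the group's workflow. Parquet is left out here because it requires a third-party library.

```python
import csv
import io
import zipfile


def write_csv_zip(zip_path, member_name, fieldnames, rows):
    """Write rows (a list of dicts) as one CSV file inside a new ZIP archive."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, buffer.getvalue().encode("utf-8"))


def read_csv_zip(zip_path, member_name):
    """Read a CSV file from a ZIP archive; all values come back as strings."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as binary_file:
            text_file = io.TextIOWrapper(binary_file, encoding="utf-8", newline="")
            return list(csv.DictReader(text_file))
```

A call such as `write_csv_zip("station.zip", "station.csv", ["timestamp", "co2_flux"], rows)` produces an archive that any standard ZIP tool and text editor can open, which is the point of choosing a generally readable format.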

What opportunities do you see for exchanging ideas across research units in your role as data stewards? What has been your experience so far?

SK: For me, opportunities for exchange have so far arisen as a result of requests: for example, I presented our data management to the members of another professorship in the department and to a visiting professor. Both heard that we have an established data management system. Recently, we also received a request from a professor at the Karlsruhe Institute of Technology, a former doctoral student in the Grassland Sciences Group, who wants to prepare her research group for the topic.

LH: The close integration of the group into international research projects such as ICOS has led to numerous collaborations. We generally make our data openly available under a generous Creative Commons licence.

As a result, we receive numerous requests either for the data themselves or for input to scientific publications. We are also often asked whether additional data are available, which we then provide if possible. It is important to me to support the use of our open data, so I try to respond to incoming requests as promptly as possible.

What potential do you see in implementing data stewardship as a shared task, as you do at Grassland Sciences?

SK: We have different research priorities, and the data management requirements vary accordingly. However, together we cover the different aspects of data management well. On the one hand, there are the Swiss FluxNet measuring stations with long-term measurements of greenhouse gas fluxes and time series of various meteorological variables. The measurement series are processed and checked by different station managers. Lukas implemented all the processes here and, for example, monitors and coordinates compliance with standards and the transfer of data to international databases. This requires both discipline-specific know-how and an affinity for IT, which, fortunately for us, Lukas can provide. On the other hand, there are projects in the field of plant and ecosystem physiology and functional plant diversity, where individual data sets are generated during field campaigns. Archiving these data sets is an administrative matter. That's my job: to transfer them to the ETH Data Archive and to ensure that the documentation (e.g. field books) is complete.

LH: Sabina describes our successful division of labour very aptly. I monitor and supervise the work steps up to the final data sets. Together with our team, I take care of the regular data review of current measurement data. This is where our database, which combines historical and current data and creates data images, comes in handy. I think this is an important distinction with regard to data stewardship: on the one hand, there is current, continuously updated data that we put into a historical context in relation to each other via the database and check before further processing; on the other hand, there is the archived long-term data as an end product. This distinction enables an optimised workflow between me and Sabina, with a clear division of tasks and responsibilities. Both aspects are thus adequately taken into account.

How do you plan to improve data management in the Grassland Sciences group?

SK: As I said, I think we are already in a good position overall - not least because we already live and implement data stewardship as a shared task.

“However, one challenge is certainly to create awareness among young researchers that data management should be considered as early as the planning of an experiment and in all data processing steps [...].”
Sabina Keller

By introducing data management at the very beginning of a research project, we raise awareness among young researchers and also address the responsibility in the further steps of research, e.g. the correct citation of data sources.

LH: We have regular meetings, usually once a month, to exchange or agree on methods (e.g. calculation, processing, naming of variables) and to go through the latest data together. The experience already available within the group is passed on to new group members. This is initially time-consuming, but in the long term it leads to a significant increase in efficiency in our daily work. We also discuss data topics at every group retreat. My permanent position allows me to help maintain and develop continuity, consistency and domain-specific knowledge within the group. I believe this is extremely valuable and enables us to sustainably anchor data stewardship in our research group.

Data Stewardship at ETH

Data Stewardship is supported at ETH Zurich as part of the swissuniversities national ORD strategy and the ETH Domain's ORD programme. The ETH Library is actively involved in these programmes and coordinates activities related to Data Stewardship under Dr. Julian Dederke as the project lead.
Read the earlier interviews on data stewardship models at ETH Zurich and the news on the launch of the Data Stewardship Network.
