A-Z Glossary Research Data
A
Analog research materials include photographs, handwritten notes, books, audio cassettes, paintings or 3D objects, such as fossils or architectural models. In order to make them usable in a repository, the materials must first be digitized or at least the associated metadata must be provided in a digital form. Analog materials differ from “born digital” data, which exist in digital form from the beginning, for example digital photos, CAD drawings, measurement data or blogs.
An archive is generally understood to be a collection of documents that are to be preserved indefinitely. In the research data management context, an archive is a collection of data.
B
Say no to data loss and back up your data
C
Cold data are data records that have been finalized and will no longer be modified. These are usually data that are stored in repositories together with the descriptive metadata (e.g. for publication or archiving). Only cold research data can be assigned a DOI.
Literary, artistic and scientific works are protected by German copyright law.
In specific terms, this means that without a corresponding license, reuse is only possible to a limited extent.
We recommend licensing that is as open as possible, because this increases data reusability and boosts the reputation of researchers. If you have any questions, the CDI will be happy to advise you.
Read this article for further details.
CRIS stands for Current Research Information System, the research information system at FAU. It stores information about research achievements, for example about publications, research projects, research data or inventions. Only metadata are stored: the full text of a publication, for example, is not kept directly in CRIS, but CRIS indicates where the full text can be found (for example, by specifying the DOI).
In research information systems, the various data areas are linked: publications are assigned not only to persons but also to projects, and projects in turn are assigned to specific research areas. For this purpose, internal data sources of the university are also used, which offers added value compared to classic list formats and to data providers such as Scopus or Web of Science.
D
Data literacy is a key skill in the 21st century and, in short, describes the ability of an individual to handle data. What knowledge, skills and attitudes are needed in society, the world of work and science today? Aspects of data literacy are described in detail in this document. Data literacy is fundamental to the entire research process in collecting, organizing, using, publishing and re-using data.
The following video illustrates the importance of data literacy as a key skill.
According to forschungsdaten.info, a data management plan (DMP) structures the handling of research data, or its “collection, saving, documentation, maintenance, processing, transfer, publication and storage, as well as the necessary resources, legal framework and responsible persons.” A data management plan (DMP) documents the entire data lifecycle.
Many third-party funding organizations (DFG, FWF, SNSF, Horizon Europe, Volkswagen Foundation) expect information on the handling of research data as part of a funding application for the allocation of funds from certain funding lines.
The DMP describes how to handle research data from the planning stage, to collection, to long-term archiving or, if applicable, planned deletion. At the very least, the data management plan answers the following questions:
- What is collected?
- Which bodies must be consulted before collecting data?
- In what form and where will research data be stored in the various project phases?
- Who can access the data and when will it be available?
- Who is responsible for the individual steps?
- Which legal requirements must be observed?

The DMP is a useful and necessary part of the project application. What exactly does this mean for research?
Why this approach is meaningful and sustainable is explained in this video.
For many scientists, dealing with research data is the basis of their daily work. It therefore saves time and effort if this data is efficiently structured, documented and backed up from the outset.
Most data is initially stored in files. Files have different types or file formats, which are sometimes identified by the file name extension, for example in the Windows operating system. Furthermore, files are stored in directories (folders). Naming files and directories systematically is very important; the Stanford File Naming Handout, for example, provides guidance on this.
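A systematic naming scheme can even be checked automatically. The following sketch validates file names against a hypothetical convention (project, ISO date, description, version); the pattern itself is an illustrative assumption, not a prescribed standard.

```python
import re

# Hypothetical convention: <project>_<ISO date>_<description>_v<NN>.<ext>
# e.g. "soilsurvey_2024-03-01_ph-measurements_v02.csv"
PATTERN = re.compile(
    r"^[a-z0-9]+_\d{4}-\d{2}-\d{2}_[a-z0-9-]+_v\d{2}\.[a-z0-9]+$"
)

def is_well_named(filename: str) -> bool:
    """Check whether a file name follows the convention above."""
    return PATTERN.match(filename) is not None

print(is_well_named("soilsurvey_2024-03-01_ph-measurements_v02.csv"))  # True
print(is_well_named("Final Version (2).xlsx"))                         # False
```

Such a check can be run over a whole project directory to find files that drift from the agreed scheme.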
Alternatively, data can also be stored in databases. Here, the effort is greater, because a database management system such as MySQL must first be set up. A database schema needs to be defined which provides a structure for storing data. Here, too, naming is of great importance. Databases support managed shared access to data much better than data stored in files. There are different types of databases: relational, hierarchical, graph-based, RDF triple stores and a few more.
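As a minimal illustration of a database schema, the following sketch uses SQLite (the `sqlite3` module from the Python standard library) rather than a server system such as MySQL; the table and column names are illustrative assumptions.

```python
import sqlite3

# In-memory database with one table for measurement records.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurement (
        id        INTEGER PRIMARY KEY,
        sensor    TEXT NOT NULL,   -- which instrument produced the value
        taken_at  TEXT NOT NULL,   -- ISO 8601 timestamp
        value     REAL NOT NULL
    )
""")
conn.execute(
    "INSERT INTO measurement (sensor, taken_at, value) VALUES (?, ?, ?)",
    ("thermo-01", "2024-03-01T12:00:00", 21.5),
)
rows = list(conn.execute("SELECT sensor, value FROM measurement"))
for row in rows:
    print(row)  # ('thermo-01', 21.5)
conn.close()
```

The schema enforces structure (every record has a sensor, timestamp and value), which is exactly the kind of discipline that a directory full of loose files cannot guarantee.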
The following animated video clearly summarizes the topic of data organization.
In order to ensure transparent research and traceability of results, research data should be published wherever possible. In order to comply with FAIR principles, the corresponding metadata must be recorded. A repository is required for the publication of the research data.
E
eLabFTW is open source software used as an electronic lab notebook (ELN), data management platform and laboratory inventory management system.
Data can be exported in various formats, including JSON and CSV files, which makes it easier to import them into another system.
Electronic Lab Notebooks (ELNs) are software applications that are used to document research data and replace paper laboratory books.
ELNs offer several advantages over their paper equivalent:
- Data findability through search and filter functions
- Accessibility: Network access regardless of location and time
- Backup copies of lab notebooks (previous versions can be restored)
- Reusability of templates, protocols and processes
- Time savings through templates, standardization and existing digital data
- Automatic recording of measurement results
The decision to use ELN software should be made carefully, as it is generally used over a long period of time. Important points that should be taken into account in the decision can be found here.
The European Open Science Cloud (EOSC) is a multi-disciplinary information service where you can publish and search for data and find tools and services. This service is one of the flagship projects of the EU Framework Programme for Research and Innovation. The portal can be reached here.
F
FAIR principles are requirements that ensure sustainable and re-usable research data. The acronym FAIR stands for Findable, Accessible, Interoperable and Re-Usable. A number of research funding providers (including the EU, the DFG and the SNF) believe the FAIR principles are an important requirement for sustainable research and therefore expect them to be complied with. Using persistent identifiers and detailed metadata is considered particularly important for the findability of data. Using standards for interfaces, metadata and data supports accessibility and interoperability of data. Extensive content-related metadata and documentation and clear rights of reuse make it easier to reuse data. Data do not have to be classed as open in order to meet FAIR principles, but their metadata ought to be freely accessible. By complying with these guidelines, “machine-actionability” is to be ensured. This means that a computer-aided system can find, access and reuse the digital objects with minimal human input.
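A machine-actionable metadata record can be as simple as a structured document with a persistent identifier, clear license and standard format. The sketch below shows such a record as JSON; the field names loosely echo common conventions (e.g. DataCite) but are simplified assumptions, and the identifier and URLs are invented.

```python
import json

# Minimal machine-readable metadata record; comments mark which
# FAIR aspect each (illustrative) field mainly supports.
record = {
    "identifier": "10.9999/example.2024.1",               # persistent ID (Findable)
    "title": "Soil pH measurements 2024",
    "creators": ["Doe, Jane"],
    "access_url": "https://repo.example.org/datasets/1",  # (Accessible)
    "format": "text/csv",                                 # standard format (Interoperable)
    "license": "CC-BY-4.0",                               # clear reuse rights (Re-usable)
}
serialized = json.dumps(record, indent=2)
print(serialized)
```

Because the record is plain structured data, a harvester can find, filter and cite the dataset without human intervention.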
FAUWissKICloud
The purpose of FAUWissKICloud is to host and maintain the FAU WissKI instances. The CDI manages the software and RRZE is responsible for maintaining the hardware. WissKIs are maintained and updated using the WissKI Distillery.
The file format defines the structure of the data contained in the file. This allows applications to interpret the contents of a file.
Many filenames contain an extension separated from the filename by a period. This declares the file format.
Under the following link you will find details and a list of which file formats are suitable for long-term storage.
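The extension is only a declared claim about the format, but it is the first thing software inspects. A minimal sketch of reading it with the standard library:

```python
from pathlib import Path

# The suffix declares the (claimed) file format; the file name itself
# is an invented example.
p = Path("results/measurement_2024-03-01.csv")
print(p.suffix)  # -> .csv
print(p.stem)    # -> measurement_2024-03-01
```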
H
Hot data are accessed frequently and must be available almost immediately for processing.
In specific terms, this means that the data is processed frequently and changes occur as a result. Hot data are ideally located close to the machine that processes them, so that delays do not occur over a network, for example.
Hot research data are not published and are rarely shared with other people. If the data are not easy to recover, the backup strategy must also include hot data.
L
Labfolder is an electronic lab notebook and inventory management tool, as well as the name of the company, founded in 2013, that offers the software.
As Labfolder is proprietary software, usage incurs monthly costs per user. A free version is also available, but with limited features and a limited number of users per group.
Data can only be exported as XHTML and PDF. Labfolder licences can be purchased via the University Hospital; for further details, please visit the Labfolder page of the Faculty of Medicine.
Linked Open Data (LOD) is an approach to representing and publishing research data. It consists of two aspects:
- “Linked”: related, machine-readable data on the Internet
- “Open”: the data is freely accessible and distributable
This results in a network of data in which individual elements refer to others. Individual data can be retrieved via a URI.
A visual representation can be found here.
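The two aspects above can be sketched as a small set of subject–predicate–object triples, where dereferencing a URI yields all statements about it. The URIs and facts below are illustrative examples, not real identifiers.

```python
# A tiny triple store as a plain list; each entry is (subject, predicate, object).
triples = [
    ("https://example.org/person/42", "https://example.org/prop/name",
     "Ada Lovelace"),
    ("https://example.org/person/42", "https://example.org/prop/wrote",
     "https://example.org/work/notes-g"),
    ("https://example.org/work/notes-g", "https://example.org/prop/year",
     "1843"),
]

def describe(uri: str) -> dict:
    """Collect all statements about one URI, as dereferencing it would."""
    return {p: o for s, p, o in triples if s == uri}

print(describe("https://example.org/person/42"))
```

Because objects can themselves be URIs (here, the work written by the person), following them link by link produces exactly the network of data described above.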
The standard retention period for research data is at least ten years [1]. This poses both organizational and technical challenges.
From an organizational point of view, it must be regulated who has responsibility and control over data when the original owner leaves FAU.
From a technical point of view, there is a need for specialized archiving systems and plans that prevent data loss. In addition, the file format is relevant, as it may no longer be supported at a later time.
[1]: https://forschungsdaten.info/praxis-kompakt/glossar/#c269839
M
Metadata describe other data using information that is useful for interpreting and (automatically) processing the actual data, for example digital research data; they represent “data about data”.
Metadata can be elementary descriptions such as length, encoding and type (number, string, date and time, currency amount, etc.). Much more important are metadata that help to categorize and characterize the properties of digital objects and provide further information about their meaning. For measured values in research data, these are, for example, the measuring device or sensor used, the accuracy, or the location of the measurement. Even the name of a data object says something about its meaning, but usually this is not enough: such terms are often too short and too general (such as “measurement”), and what they mean is usually only clear in the context in which the data are used. In this way, research projects develop terms that can be misunderstood outside the project.
Ontologies are intended to relate such specific terms to a general system of terms.
There are different categories of metadata:
- Technical metadata give information about the data volume and data format and are essential for saving data in the long term.
- Descriptive metadata (also known as content metadata) give information about the information contained in the digital objects and are therefore decisive for the findability, referencing and reusability of data. Descriptions of the measuring method used, an abstract or keywords all fall into this category.
- Structural metadata describe relationships between individual elements of a data set or the internal structure of the data themselves.
- Administrative metadata include information required for assuring the quality of data (for example checksum), and information on access rights and licenses or the provenance of the data.
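As a concrete example of administrative metadata, a checksum makes silent data corruption detectable: the stored digest is compared against one freshly computed from the data. The sketch below uses SHA-256 from the standard library; the data bytes are invented.

```python
import hashlib

# Compute a checksum at ingest time and store it as administrative metadata.
data = b"temperature;21.5\ntemperature;21.7\n"
checksum = hashlib.sha256(data).hexdigest()
print(checksum)

# At any later point, recompute and compare; a mismatch signals corruption.
assert hashlib.sha256(data).hexdigest() == checksum  # data unchanged
```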
O
An ontology is a system of terms that attempts to relate and thus define all concepts of a subject area as far as possible. Relationships include: “superordinate term – subordinate term”, “whole – component” or also “means the same as” (synonym). The terms are not simply named by words, but more precisely and unambiguously by URIs.
“Open Access” means that digital scientific content is available free of charge and in an accessible form. The copyright remains in place.
Further information is available at forschungsdaten.info
“Open source” means that the source code is publicly available and the license permits modification and reproduction.
In addition, there are no restrictions on using the source code in other products or services.
Further details are available from the Open Source Initiative.
openBIS is an open source combination of electronic lab notebook (ELN), data management platform and laboratory inventory management system that has been actively developed since 2007. Depending on requirements, all or only selected features can be used. The modular design of openBIS enables flexible adaptation to the requirements of a wide variety of working groups. In addition, it has an interface that allows Jupyter notebooks to be used for data analysis in openBIS. There is also an interface for exporting data to the Zenodo repository.
P
Persistent identifiers (PI) are long-lasting references to digital resources. A PI is a unique name for digital objects of any kind (essays, data, software, etc., especially data records in research data management). This name, usually a longer sequence of digits and / or alphanumeric characters, is linked to the web URL of the digital resource. If the URL for the resource changes, only the address to which the PI refers has to be changed, whilst the PI itself can stay the same. This guarantees, for example, that a resource cited using the PI can still be found even if its physical storage place has changed. Examples of persistent identifiers are digital object identifiers (DOI), uniform resource names (URN) and handles.
Using a specific example, this video clearly explains what persistent identifiers are.
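The indirection described above can be sketched as a resolver table: the PID stays stable while only its target URL changes. The identifier and URLs below are invented examples, and a real resolver (e.g. doi.org) is of course a global service, not a local dictionary.

```python
# Resolver table mapping a persistent identifier to the current URL.
resolver = {"10.9999/example.2024.1": "https://repo.example.org/datasets/1"}

def resolve(pid: str) -> str:
    """Look up the current location of a persistently identified resource."""
    return resolver[pid]

print(resolve("10.9999/example.2024.1"))

# When the resource moves, only the resolver entry changes; citations
# using the PID keep working.
resolver["10.9999/example.2024.1"] = "https://archive.example.org/d/1"
print(resolve("10.9999/example.2024.1"))  # same PID, new location
```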
Personal data is any information relating to an identified or identifiable natural person. It should be noted that this also applies if a person can be identified indirectly. Since May 25, 2018, the Federal Data Protection Act (Germany) and the GDPR (EU) apply. Both laws deal with data protection and privacy.
Further information on this topic can be found on the FAU Data Protection page and here.
R
A repository is a managed location for storing digital objects. The visibility of the digital objects can be restricted.
For example:
- The institutional repository of the University Library, which enables FAU researchers to publish their dissertations and research papers free of charge.
- The version management system GitLab, which is provided by the RRZE.
- CERN offers a globally visible repository, Zenodo, for data sets < 50 GB.
Details at forschungsdaten.info
According to the definition on the website Forschungsdaten.info, research data are data generated during scientific activity (e.g. through measurements, surveys, source work).
This definition applies to both newly generated and processed data, regardless of whether these data are incorporated into a publication and whether they are available in analog or digital form. Research data form the basis of scientific work.
This results in the need for a discipline and project-specific understanding of research data with different requirements for the preparation, processing and management of the data: this is known as research data management.
Research data and their metadata are explained simply in this video using the example of squirrel research.
FAU believes that storing and managing research data is crucial for successful, sustainable research and scientific integrity. It is essential that research data are handled responsibly and methodically. If the University, its members and the general public are to benefit, this must not only be encouraged but also made a requirement, and it is important to raise awareness of research data and FAIR principles in the long term. See FAU Research Data Policy.
According to the definition on the website Forschungsdaten.info, research data management is the process of transforming, selecting and storing research data with the aim of keeping them accessible, usable and verifiable in the long term and independent of the data generator. To this end, structured measures can be taken at all points of the data life cycle to maintain the scientific validity of research data, to preserve accessibility by third parties for evaluation and analysis and to secure the chain of custody.
The practical application and benefits for the researchers are illustrated in the following video
“Forschungsdaten leben länger” (German)
A research data policy contains basic guidelines for handling data in a larger organization, e.g. a university. In addition to general recommendations for action, a research data policy usually regulates the responsibilities and support structures on site. In some cases, the guidelines also include details on licensing and repositories for research data.
The current version of the FAU research data policy can be found at https://www.fau.info/fdm-policy.
U
A right of use describes the manner in which an object may be used. Examples of “uses” of digital objects are copying, saving or publishing. The use can be subject to conditions, such as a monetary fee.
A right of use between the rights holder and the contractual partner can be established through a license.
W
Warm data are rarely changed. It is also acceptable if access to warm data takes longer, for instance when copying files (“copy’n’tea”). Warm data are usually already suitable for sharing in the research group or with external researchers.
WissKI
WissKI is a virtual research environment that enables scientific, location-independent and collaborative work with linked data. On the basis of an ontology, the research data are semantically enriched and stored in the form of triples in a coherent data network (graph database).