Research Data Management Glossary
Backup
An additional copy of digital data which is stored for use as a replacement in case the main copy is either deleted or corrupted. A backup service does not provide the same service as an archive: it is not generally intended for long-term preservation after the end of a project, and it does not provide access for data consumers or individuals other than the data owner and possibly IT support.
Archive
A service designed to preserve, organise, and catalogue (digital) items in optimal conditions, with standardised labelling to ensure their longevity and continued access. The term is frequently used interchangeably with repository.
Data dictionary
Documentation describing the contents, format, and structure of a dataset and the relationships between its elements. A data dictionary provides metadata about data elements: for example, it might include a table listing data attributes, with columns giving the attribute name, whether it is optional or required, the attribute type or format, and so on. An additional column for explanatory notes about each attribute is also helpful, especially if the data is to be shared with others. This could include a brief explanation of how the attribute is obtained or calculated.
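A data dictionary of the kind described above can be sketched in code. The attribute names, types, and notes below are invented for illustration, not taken from any real standard:

```python
# A minimal sketch of a data dictionary for a hypothetical survey dataset.
# Each row describes one attribute: its name, whether it is required,
# its type, and explanatory notes for people reusing the data.
data_dictionary = [
    {"attribute": "participant_id", "required": True,  "type": "string",
     "notes": "Unique code assigned at recruitment."},
    {"attribute": "age",            "required": True,  "type": "integer",
     "notes": "Age in whole years at date of first interview."},
    {"attribute": "postcode_area",  "required": False, "type": "string",
     "notes": "Outward postcode only, to reduce identifiability."},
]

def describe(attribute_name):
    """Look up the explanatory notes for a named attribute."""
    for row in data_dictionary:
        if row["attribute"] == attribute_name:
            return row["notes"]
    return None
```

In practice a data dictionary is often a table in a README or a separate spreadsheet; the structured form simply makes the same information easy to check automatically.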
Data management plan (DMP)
A data management plan, or DMP, is a formal document which outlines how a project will manage its research data throughout the whole project life cycle. This covers details of the type of data involved, how data will be gathered, stored, and backed up, how it will be accessed by collaborators in a secure way, and how any legal or ethical requirements will be met. It should also address the longer-term questions of preservation and sharing.
Database
A database is a structured set of data, accessed via a database management system (DBMS). The goal is to make it possible for the data to be easily queried, allowing users to locate a particular piece of information, or to answer more general questions.
There are various types of database, providing different ways of structuring information. The most common type is a relational database, which consists of a set of connected tables which record the properties of, and the relationships between, entities of various types. Microsoft Access and MySQL are examples of relational database management systems. XML databases, such as eXist, are designed to work with information that has been tagged with XML. Document-orientated (or NoSQL) databases are flexible systems which do not require the attributes of objects to have a consistent structure. MongoDB is an example of a document-orientated database.
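A minimal sketch of the relational idea, using Python's built-in sqlite3 module: two connected tables, with a query joining them. The table and column names are invented for illustration:

```python
import sqlite3

# An in-memory relational database: two tables, linked by author_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, "
             "author_id INTEGER REFERENCES author(id))")
conn.execute("INSERT INTO author VALUES (1, 'Charles Dickens')")
conn.execute("INSERT INTO book VALUES (1, 'Oliver Twist', 1)")

# A query joins the tables to answer: who wrote 'Oliver Twist'?
row = conn.execute(
    "SELECT author.name FROM book JOIN author ON book.author_id = author.id "
    "WHERE book.title = ?", ("Oliver Twist",)
).fetchone()
```

The join is the characteristic relational operation: the relationship between books and authors is recorded once, as a key, and reassembled at query time.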
Dataset
A general term often used to describe a collection of research data. A digital dataset might comprise a single item such as a spreadsheet of numerical data, or it might be much larger, comprising a collection of related items such as spreadsheets, images, the readings on a particular day from a scientific instrument, or a mixture of these and many other types of data.
Digital object
A digital object is a specific digital ‘thing’. It may comprise a single file, such as a research publication, with its associated metadata, or it may be a package containing multiple files and metadata.
Digital objects are frequently assigned identifiers, which distinguish them from other similar objects, and can be used for citation purposes. Digital Object Identifiers, or DOIs, are assigned to many research data items.
Digital Object Identifier (DOI)
A DOI is a particular type of persistent identifier assigned to digital items, such as articles or datasets, to enable them to be located and cited. It is standardised by the International Organization for Standardization (ISO).
DOIs can be incorporated into URLs so that users can always access the digital content, even if its online location changes. If the content is unavailable, the DOI should still resolve to a record for the item. Publishers use DOIs to identify articles: for example, the DOI 10.1103/PhysRevLett.107.133902 is incorporated into the publisher’s URL: http://link.aps.org/doi/10.1103/PhysRevLett.107.133902. An item can always be traced using the DOI by using it with the prefix http://dx.doi.org/ (in this case, giving the URL http://dx.doi.org/10.1103/PhysRevLett.107.133902).
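The resolution step described above is just string concatenation, which a minimal helper can show (using the dx.doi.org resolver prefix and the example DOI from the text):

```python
# A minimal sketch: turning a DOI into a resolvable URL by prepending
# the resolver prefix mentioned above.
def doi_to_url(doi, resolver="http://dx.doi.org/"):
    """Return a URL that resolves the given DOI."""
    return resolver + doi

url = doi_to_url("10.1103/PhysRevLett.107.133902")
```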
Digital preservation
The process of storing the bits and bytes that comprise digital objects. Preservation does not necessarily imply continued access.
Preservation is an important part of prolonging the life of research data, but is not sufficient by itself. Simply storing data files without actively managing them can result in data which still exists, but which is unusable. For example, data integrity checks may not have been carried out, bit-rot (data decay) may have corrupted the data, or the software needed to open the files may no longer be available.
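The data integrity checks mentioned above are typically checksum (fixity) comparisons. A minimal sketch with Python's standard hashlib, using invented file contents:

```python
import hashlib

# A fixity check: store a checksum alongside the data when it is
# deposited, and recompute it later to detect bit-rot.
def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

original = b"temperature,reading\n2021-01-01,4.2\n"
stored_checksum = checksum(original)

# Later: recompute and compare. Any flipped bit changes the digest.
corrupted = b"temperature,reading\n2021-01-01,4.3\n"
intact = checksum(original) == stored_checksum
damaged = checksum(corrupted) == stored_checksum
```

Running such checks on a schedule, rather than only at deposit, is what turns passive storage into active management.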
Documentation
Contextual information provided with data to enable users to make sense of it and to interpret it properly. Documentation may relate to a whole dataset (e.g. a README file that accompanies the data files, or a detailed description of data gathering methods), or to specific aspects of it (e.g. labelling of columns in a spreadsheet, or annotation of apparent anomalies in the data).
Embargo
If a dataset deposited in a data archive has an embargo placed on it, the dataset is not accessible. Typically, there will be a metadata record describing the data, but the data itself will not be available. Embargoes may be permanent or for a fixed period of time. Researchers may sometimes choose to deposit a dataset at the end of their project, but to embargo it for a further period – for example, until publications which make use of the data have appeared.
FAIR data
Data which meets a set of principles for data management and stewardship established by a consortium of scientists, and endorsed by the G20 Hangzhou summit in 2016. FAIR is an acronym, standing for Findable, Accessible, Interoperable and Reusable.
Licence
A statement about an item (such as a creative work or a dataset) which indicates what potential users may and may not do with it. Some licences are custom-written formal legal contracts which need to be signed by both the owner of the item and the reuser. Others are open licences, which grant reuse rights to anyone, sometimes subject to conditions such as attribution of the data creator or a requirement that any derivative works are made available under a similar open licence. Creative Commons and Open Data Commons are examples of open licences.
Linked data
Structured digital data which is connected to other digital data, often using common web technologies such as HTTP, RDF, and URIs. Linked data is designed not just to be comprehensible to humans, but also to make information machine readable. For example, a sentence such as ‘Charles Dickens wrote Oliver Twist’ might be represented as ‘Charles Dickens’ isAuthorOf ‘Oliver Twist’, with each element of the statement being given a unique machine-readable identifier that points the machine (and the reader) to a page that clearly explains who or what they are – thus making it easy to discover that Charles Dickens was a person, while Oliver Twist is a novel.
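The statement above can be sketched as a subject–predicate–object triple. The DBpedia resource URIs and the Dublin Core `creator` property below are real identifiers, but the tiny in-memory ‘triple store’ and query helper are illustrative only:

```python
# Machine-readable identifiers for each element of the statement.
DICKENS = "http://dbpedia.org/resource/Charles_Dickens"
OLIVER  = "http://dbpedia.org/resource/Oliver_Twist"
CREATOR = "http://purl.org/dc/terms/creator"

# One triple: 'Oliver Twist' has creator 'Charles Dickens'.
triples = [
    (OLIVER, CREATOR, DICKENS),
]

def objects(subject, predicate):
    """Return every object linked from subject via predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

Because every element is a resolvable URI rather than a bare string, a machine (or a reader) can follow each identifier to find out what it denotes.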
Metadata
Literally, data about data – for example, data that describes an item such as a dataset.
The term is sometimes used interchangeably with documentation, but often means information with a defined structure, designed to be machine readable. A metadata schema or standard specifies a set of pieces of information to be recorded about an object in a consistent way. For example, metadata for a research dataset might include fields for the author or creator of the item, the title, the date of creation or publication, the publisher, a unique identifier, and so on. The type of metadata it makes sense to record may also depend on the type of data: for example, metadata for a digital photograph file might include information about the light conditions, lens, and location of the camera when the image was taken.
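A structured metadata record of the kind described above can be sketched as a simple mapping. The field names follow the Dublin Core-style example in the text; the values, including the DOI-like identifier, are invented for illustration:

```python
# A minimal metadata record for a hypothetical dataset.
record = {
    "creator":    "A. Researcher",
    "title":      "River temperature readings, 2020-2021",
    "date":       "2022-03-01",
    "publisher":  "Example University",
    "identifier": "10.1234/example.5678",   # invented DOI-style identifier
}

# Because the structure is defined in advance, a machine can check that
# a record is complete before it is accepted into a catalogue.
REQUIRED_FIELDS = {"creator", "title", "date", "publisher", "identifier"}
is_valid = REQUIRED_FIELDS <= record.keys()
```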
Personal data
Any data about living, identifiable individuals. Personal data must be handled in accordance with the relevant legislation, including the UK General Data Protection Regulation (GDPR).
Pseudonymisation
Pseudonymising requires the physical separation of ‘real-world’ identifiers from the rest of the research data. A link is maintained between the research data and the ‘real-world’ identifiers via a cipher or code. The cipher is kept secure and separate from the research data. This limits how many people in the research team have access to real-world identifiers, makes it more difficult to identify individuals from the research data, and so helps to guard against accidental disclosure. However, the ICO notes that while pseudonymised data can help reduce privacy risks by making it more difficult to identify individuals, it is still personal data.
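The separation described above can be sketched in a few lines: codes go into the research data, and the cipher linking codes back to real names is held in a separate structure (in practice, stored securely elsewhere). All names, codes, and values are invented for illustration:

```python
# The cipher: kept secure and separate, mapping code -> real identifier.
cipher = {}

# The research data shared within the team: codes only, no real names.
research_data = []

def pseudonymise(name, measurement):
    """Record a measurement under a generated code, not the real name."""
    code = "P{:03d}".format(len(cipher) + 1)
    cipher[code] = name
    research_data.append({"participant": code, "value": measurement})
    return code

code = pseudonymise("Jane Smith", 4.2)
```

Note that anyone holding both structures can re-identify participants, which is why pseudonymised data remains personal data.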
Repository
A service in which research data or publications can be deposited and preserved safely for the long term. Data in repositories may be open access, or restricted.
Research data
The Digital Curation Centre defines research data as ‘Representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship’. Research data can take many different forms, depending on the field of study: it may be numerical, textual, consisting of images or audio-visual data, or it may be something else entirely. Some research data is highly structured (e.g. tabular data); some is unstructured; some is somewhere in between.
Version
A dataset as it is at a particular point in time. Datasets frequently evolve through a project: they grow as more data is gathered, and they change as data is edited, processed, and manipulated. Thus there may be many different versions of one dataset.