Data organization
Cecile data structure
The following schema represents how data are structured on Cecile. Please take a moment to look at it and read the explanation below; it will give you an overview of the cluster's organization.
/
├── data/
│   ├── archive/
│   │   └── projects/
│   │       ├── project_1/
│   │       └── project_n/
│   ├── groups/
│   │   ├── biopsy/
│   │   │   ├── archive/
│   │   │   │   └── project_1 (symlink to /data/archive/projects/project_1)
│   │   │   └── projects/
│   │   │       └── project_1 (symlink to /data/projects/project_1)
│   │   ├── exppsy/
│   │   ├── methpsy1/
│   │   └── neuropsy/
│   └── projects/
│       ├── project_1/
│       │   └── scratch/
│       ├── ...
│       └── project_n/
│           └── scratch/
├── home/
│   └── user/
│       └── scratch/
└── software/
Understanding Cecile data structure
/home/<username>
: Your personal home folder. This folder must not contain your data or your analyses. In home you are allowed a maximum storage space of 1 GB; see the section on data storage quotas for further details.
/home/<username>/scratch
: Your personal scratch folder. scratch is a folder that does not get backed up; it is useful for testing and for dumping temporary processed data that can be deleted. For a proper introduction on how to use scratch, read the dedicated paragraph.
/software
: Stores the software stack. To understand how the software stack works, refer to the software stack page.
/data/archive
: Archived projects. This folder contains projects that have been archived.
/data/groups
: Group folders. These folders are dedicated to each group, for example /data/groups/biopsy. As for home, there is a maximum amount of storage allowed; see the section on data storage quotas.
/data/groups/archive
: This folder contains symlinks (see the tab below) that allow you to access archived projects directly from the group folder.
/data/groups/projects
: As for data/groups/archive, this folder contains symlinks to access ongoing projects from the group folder.
/data/projects
: Projects folder. The actual storage of all ongoing projects, which can be comfortably accessed and worked on from the data/groups/projects folder.
/data/projects/scratch
: Project scratch folder. A scratch folder is also available in each single project; please refer to the dedicated paragraph.
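As a quick illustration of the layout above, here is a minimal sketch of how you would typically reach an ongoing project through your group folder; the group (biopsy) and project (project_1) names come from the example tree and are purely illustrative:
# the group projects folder contains symlinks to the actual project folders
ls -l /data/groups/biopsy/projects
# following the symlink brings you to the real project location under /data/projects
cd /data/groups/biopsy/projects/project_1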
What is a symlink?
In a nutshell, a symbolic link, also known as a symlink, is a file or folder that points to another file or folder, known as the target. Once a symlink is in place, you can work on it as if it were the original file or folder. Deleting a symlink does not affect the target; however, if the target is deleted, moved to a different location, or renamed, the symlink does not get updated: it keeps pointing to the original path and is now broken. An example of how a symbolic link works:
# create an empty file
touch my_file.txt
# create a symlink of my_file.txt, we can give to the symlink the same name as the original file or a different one
ln -s my_file.txt my_file_symlink.txt
# now list the directory content (use the flag -l)
ls -l
In the listing, my_file_symlink.txt points to the original file my_file.txt.
Now, if you modify the symlink, you can see that the change affects the original file:
# write something into the symlink
echo I am a symlink > my_file_symlink.txt
# check the content of the symlink
cat my_file_symlink.txt
If you check the content of the original file, you can see that it also contains the sentence.
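You can verify this with the files created above:
# the original file now contains the text written through the symlink
cat my_file.txt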
How to create a project on Cecile
As mentioned in the research planning section, we adopt a specific project workflow. Please follow the next steps carefully to create a project:
As a first step, contact the cluster admins at cecile-admins-l@ovgu.de.
As a second step, fill in a questionnaire about your project. The questionnaire can be downloaded as a text file.
Please try to answer all the questions. Even if you do not know the exact answer to some of them, for example Indicate the size of your raw data, you should provide a reasonable estimate; your answer can be updated later. If you are not able to answer some questions by yourself, ask your PI or more experienced members of your group, or alternatively contact us and we will try to help you.
Once the questionnaire has been properly completed, the cluster admin will create your project.
Project structure
The project structure follows a specific BIDS structure (see examples in the tabs below). This structure is generated upon project creation. According to BIDS, only rawdata is required to be BIDS compliant; however, we strongly recommend that derivatives are BIDS compliant as well.
How to handle derivatives and code
Content of derivatives and code
derivatives must contain only processed data and no code. code must contain only code and no processed or unprocessed data.
In derivatives, unlike rawdata, you are free to create sub-folders for your processed data at your discretion, with one small caveat: when naming your sub-folders in derivatives you must follow the convention <pipeline>-<variant>, where <pipeline> is the name of the tool you used for that step and <variant> is the analysis step. For example, assuming your preprocessing has been done with spm, the sub-folder name would be spm-preproc/. For further information see the official BIDS derivatives section.
Mirroring derivatives and code structure
We strongly recommend mirroring the sub-folder names between derivatives and code to keep an intuitive relationship between code and data.
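For instance, a minimal sketch of how the mirrored sub-folders could be created inside a project; the pipeline names are just examples:
# create mirrored sub-folders for a hypothetical SPM preprocessing step and first-level analysis
mkdir -p derivatives/spm-preproc code/spm-preproc
mkdir -p derivatives/spm-first_level code/spm-first_level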
Good naming practices
As you know, file and folder naming is an essential aspect of BIDS and the FAIR principles; we suggest you look into this very helpful page to learn about good naming practices and machine-readable names.
For modalities, such as eye-tracking, for which there is not yet a BIDS consensus, we recommend using the same structure as for the other modalities and converting the data into BIDS format anyway. Here you can follow the discussion around BIDS for eye-tracking.
Sides effects of using a different structure
Failing to maintain this structure might create future issues when assigning less canonical permissions. Not adopting BIDS would make your dataset harder to interoperate with and, ultimately, it would not follow the FAIR principles.
A prototypical fMRI dataset:
project-metadata.json
: This file is created by the cluster admin; it contains general metadata about the project.
sourcedata
: It comprises the raw (unprocessed) data, in this case raw DICOM files.
rawdata
: It comprises the BIDS-converted NIfTIs and the corresponding .json and .tsv files.
derivatives
: It comprises sub-folders representing the macro-steps of your analysis, e.g. spm-preproc, spm-first_level, etc. These folders should follow the BIDS convention as well, and must contain only data and no code.
code
: It includes sub-folders that mirror the derivatives sub-folder names. They must contain only the code relative to each step, and no data at all. It must also include a sub-folder containing the code for the BIDS conversion.
stimuli
: It includes a sub-folder for the experimental code and the stimuli (e.g. images), if any.
/data/projects/
└── dataset/
    ├── project-metadata.json
    ├── sourcedata/
    ├── rawdata/
    │   ├── dataset_description.json
    │   ├── participants.tsv
    │   ├── sub-01/
    │   ├── sub-02/
    │   └── ...
    ├── derivatives/
    │   ├── spm-preproc/
    │   ├── spm-first_level/
    │   ├── nilearn-decoding/
    │   └── ...
    ├── code/
    │   ├── bids-conversion/
    │   ├── spm-preproc/
    │   ├── spm-first_level/
    │   ├── nilearn-decoding/
    │   └── ...
    └── stimuli/
        ├── experiment/
        └── ...
A prototypical EEG/MEG dataset:
project-metadata.json
: This file is created by the cluster admin; it contains general metadata about the project.
sourcedata
: It comprises the raw (unprocessed) data, for example .bdf files.
rawdata
: It comprises the BIDS-converted data and the corresponding .json and .tsv files.
derivatives
: It comprises sub-folders representing the macro-steps of your analysis, e.g. eeglab-preproc, eeglab-erp, etc. These folders should follow the BIDS convention as well, and must contain only processed data and no code.
code
: It includes sub-folders that mirror the derivatives sub-folder names. They must contain only the code relative to each step, and no data at all. It must also include a sub-folder containing the code for the BIDS conversion.
stimuli
: It includes a sub-folder for the experimental code and the stimuli (e.g. images), if any.
/data/projects/
└── dataset/
    ├── project-metadata.json
    ├── sourcedata/
    ├── rawdata/
    │   ├── dataset_description.json
    │   ├── participants.tsv
    │   ├── sub-01/
    │   ├── sub-02/
    │   └── ...
    ├── derivatives/
    │   ├── eeglab-preproc/
    │   ├── eeglab-erp/
    │   ├── fieldtrip-time_frequency/
    │   └── ...
    ├── code/
    │   ├── bids-conversion/
    │   ├── eeglab-preproc/
    │   ├── eeglab-erp/
    │   ├── fieldtrip-time_frequency/
    │   └── ...
    └── stimuli/
        └── experiment/
A prototypical behavioral dataset:
project-metadata.json
: This file is created by the cluster admin; it contains general metadata about the project.
sourcedata
: It comprises the raw (unprocessed) data, in whatever file type you have acquired them.
rawdata
: It comprises the BIDS-converted data and the corresponding .json files.
derivatives
: It comprises sub-folders representing the macro-steps of your analysis, e.g. reaction_times, curve_fitting, etc. These folders should follow the BIDS convention as well, and must contain only processed data and no code.
code
: It includes sub-folders that mirror the derivatives sub-folder names. They must contain only the code relative to each step, and no data at all. It must also include a sub-folder containing the code for the BIDS conversion.
stimuli
: It includes a sub-folder for the experimental code and the stimuli (e.g. images), if any.
/data/projects/
└── dataset/
    ├── project-metadata.json
    ├── sourcedata/
    ├── rawdata/
    │   ├── dataset_description.json
    │   ├── participants.tsv
    │   ├── sub-01/
    │   ├── sub-02/
    │   └── ...
    ├── derivatives/
    │   ├── reaction_times/
    │   ├── curve_fitting/
    │   └── ...
    ├── code/
    │   ├── bids-conversion/
    │   ├── reaction_times/
    │   ├── curve_fitting/
    │   └── ...
    └── stimuli/
        └── experiment/
How related projects are handled
In case you want to create a project that is related to an existing one, for example projects that are part of the same grant, the project names will share a common prefix decided by the project owner upon project creation.
If two experiments are part of the same project and need to live together in the same project folder, two sub-directories, each containing the project structure discussed above, will be created.
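A hedged sketch of how such a folder could look; the project and experiment names are purely illustrative:
/data/projects/
└── project_name/
    ├── experiment_1/
    │   ├── project-metadata.json
    │   ├── sourcedata/
    │   ├── rawdata/
    │   ├── derivatives/
    │   ├── code/
    │   └── stimuli/
    └── experiment_2/
        └── ...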
How to transfer data from/to Cecile
We recommend using rsync to transfer files from/to Cecile. rsync is a powerful tool for data synchronization: it minimizes the data transfer by copying only data that have changed, meaning that if the files you want to transfer already exist in the destination, rsync will only copy files that have been modified or that are not yet present there. In order to work, rsync must be installed on both machines, the source machine and the destination machine. Please refer to man rsync or rsync --help for further usage information.
Be aware of the .zfs folder when transferring data from a project
As we explain in the backup section, every project contains a hidden folder called .zfs
. The .zfs
folder stores temporary snapshots of your project to facilitate the recovery of lost data. Make sure to exclude this folder when transferring your data. In the following sections we explain how to conveniently exclude any file or directory during data transfer using either the command line or Filezilla.
Transferring files with the command line
From your computer to Cecile:
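A minimal sketch to copy a local folder into a project; <username>, <cluster-address> and the paths are placeholders to replace with your own values:
# run this on your own machine
rsync -avz /path/to/local_folder <username>@<cluster-address>:/data/projects/my_project/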
From Cecile to your computer: Also in this case the following command should be typed on your machine.
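Again as a hedged sketch, with the same placeholders:
# run this on your own machine as well
rsync -avz <username>@<cluster-address>:/data/projects/my_project/derivatives /path/to/local_destination/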
How to exclude files or folders from a data transfer:
- Filter a folder out, for example the .zfs folder (see the sketch after this list).
- Filter a file out (see the sketch after this list).
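A hedged sketch of both cases, with the same placeholders as above:
# exclude the .zfs folder when copying a project from Cecile
rsync -avz --exclude '.zfs' <username>@<cluster-address>:/data/projects/my_project/ /path/to/local_destination/
# exclude a single file, here a hypothetical notes.txt
rsync -avz --exclude 'notes.txt' /path/to/local_folder <username>@<cluster-address>:/data/projects/my_project/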
Transferring files with a GUI
In case you feel more comfortable with a GUI, you may use the Filezilla client, available for all operating systems. Please refer to the Filezilla documentation for the installation.
How to use Filezilla
- Install Filezilla (please follow the instructions provided on the Filezilla webpage).
- Open the Site manager and create a new instance for Cecile.
- Set up the new instance using the Site manager: choose the SFTP-SSH protocol, as shown in the image; this ensures that the correct port is chosen automatically.
- Once you are logged in to the cluster, you can simply select your local directory and the destination directory and drag and drop the files you want to transfer.
How to exclude files or folders during data transfer:
Transferring files with the command line
From your computer to Cecile, or from Cecile to your computer: in both cases the command should be typed on your own machine; the rsync commands are the same as in the examples shown above.
How to exclude files or folders from a data transfer:
- Filter a folder out, for example the .zfs folder.
- Filter a file out.
In both cases use the --exclude flag, as in the sketch shown above.
Transferring files with a GUI
In case you feel more comfortable with a GUI, you may use one of the following SFTP clients: Filezilla, available for all operating systems, or Cyberduck. Please refer to the respective documentation for the installation.
How to use Filezilla
- Install Filezilla (please follow the instructions provided on the Filezilla webpage).
- Open the Site manager and create a new instance for Cecile.
- Set up the new instance using the Site manager: choose the SFTP-SSH protocol, as shown in the image; this ensures that the correct port is chosen automatically.
- Once you are logged in to the cluster, you can simply select your local directory and the destination directory and drag and drop the files you want to transfer.
How to exclude files or folders during data transfer:
Transferring files with the command line
If you are using WSL on your Windows machine, you can simply transfer data using rsync.
From your computer to Cecile, or from Cecile to your computer: in both cases the command should be typed on your own machine; the rsync commands are the same as in the examples shown above.
How to exclude files or folders from a data transfer:
- Filter a folder out, for example the .zfs folder.
- Filter a file out.
In both cases use the --exclude flag, as in the sketch shown above.
Transferring files with a GUI
In case you feel more comfortable with a GUI, you may use one of the following SFTP clients: Filezilla, available for all operating systems, WinSCP, or Cyberduck. Please refer to the respective documentation for the installation.
How to use Filezilla
- Install Filezilla (please follow the instructions provided on the Filezilla webpage).
- Open the Site manager and create a new instance for Cecile.
- Set up the new instance using the Site manager: choose the SFTP-SSH protocol, as shown in the image; this ensures that the correct port is chosen automatically.
- Once you are logged in to the cluster, you can simply select your local directory and the destination directory and drag and drop the files you want to transfer.
How to exclude files or folders during data transfer:
How to retrieve lost data
In case you have lost data within the last seven days, you can retrieve them by yourself. It is sufficient to go to the folder called .zfs/snapshot within the folder in which your data were previously hosted and transfer the data back to their previous location. The transfer can easily be done using cp or rsync.
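A minimal sketch of such a restore, assuming a hypothetical project at /data/projects/my_project and a lost rawdata/sub-01 folder; the snapshot names on your system will differ:
# list the available snapshots
ls /data/projects/my_project/.zfs/snapshot
# copy the lost folder back from the chosen snapshot to its previous location
cp -r /data/projects/my_project/.zfs/snapshot/<snapshot_name>/rawdata/sub-01 /data/projects/my_project/rawdata/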
If your data were lost more than seven days ago, you cannot retrieve them by yourself; please contact manuela.khun at ovgu.de
Backups of home, group, project and archive are done daily by moving the snapshots to the backup storage. The only folder that is never backed up is scratch.
Data storage quotas
Cecile is a shared resource; as such, users are subject to certain restrictions to guarantee fair access to resources for everybody.
Data storage quotas set the maximum amount of storage space that can be used in a given directory (e.g. your home directory).
Such restrictions are enforced to ensure that users make reasoned and parsimonious choices when storing data, thus avoiding that unnecessary data pollute the cluster. As a byproduct, quotas help you keep your projects much tidier and force you to keep your directories organized.
| Directory | Quota |
|---|---|
| home | 1 GB |
| group | 10 GB |
| project | 500 GB |
| scratch | 1 TB |
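If you want to check how much space you are currently using in a given directory, a simple hedged example from the command line (the project path is hypothetical):
# total size of your home directory
du -sh ~
# total size of a specific project folder
du -sh /data/projects/my_project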
Scratch: What it is and how to use it
Scratch is a particular folder that is never backed up. For this reason it is meant to be used for temporary data, for example intermediate outputs that can be deleted with no consequences, or code that you are testing.
How to decide which data should go into scratch:
The following are not absolute rules; please use your common sense and domain knowledge when applying them.
- It is an intermediate output.
- It is easily reproducible, meaning that it is not generated by long-lasting jobs.
- It is an accessory output from a specific software that you are not going to use and that is not essential for your dataset.
- It is code that is not essential for your project.
No important data on scratch
Do not store important data or code in scratch. Clean your scratch periodically.
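A hedged example of such a periodic cleanup, assuming your personal scratch lives at ~/scratch and that anything untouched for 30 days can go; adjust the path and the age to your needs:
# delete files in your personal scratch that have not been modified for 30 days
find ~/scratch -type f -mtime +30 -delete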