Dataset curation#

Converting data to BIDS#

All git-annex datasets should be BIDS-compliant. For more information about the BIDS standard, please visit http://bids.neuroimaging.io.

When you receive data from an external collaborator, you can save them under a temporary location: duke/temp.

Then, inspect the data and convert them to BIDS. It is recommended to write a script that does the conversion. The script should then be saved under the code folder of the final dataset. Some previous scripts can be found on GitHub or under the code folder of already existing datasets.

Once the data are converted to BIDS and uploaded to the git-annex repository, delete the temporary folder to save space.

Subject naming convention#

Basic convention: sub-XXX

Example:

sub-001
sub-002

Multi-institution/Multi-pathology convention: sub-<site><pathology>XXX

Example of Multi-institution dataset:

sub-mon001      # mon stands for Montreal
sub-tor001      # tor stands for Toronto

Example of Multi-institution/Multi-pathology dataset:

For a multi-pathology dataset (two or more distinct diseases plus healthy controls), it is convenient to also include the pathology in the subject ID, for example:

sub-torDCM001      # tor stands for Toronto and DCM stands for Degenerative Cervical Myelopathy
sub-torHC001       # tor stands for Toronto and HC stands for Healthy Controls
sub-zurSCI001      # zur stands for Zurich and SCI stands for Spinal Cord Injury
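
The naming conventions above can be sketched as a pair of helpers. This is a minimal illustration, not an official tool: the function names are hypothetical, and the pattern assumes (based only on the examples above) a lowercase site code, a pathology code starting with an uppercase letter, and a three-digit zero-padded index.

```python
import re

# Assumption drawn from the examples above: lowercase site code, pathology
# code starting with an uppercase letter, three-digit zero-padded index.
SUBJECT_ID_PATTERN = re.compile(
    r"^sub-(?P<site>[a-z]+)?(?P<pathology>[A-Z][A-Za-z]*)?(?P<index>\d{3})$"
)


def make_subject_id(index, site="", pathology=""):
    """Build a subject ID such as sub-001, sub-tor001 or sub-torDCM001."""
    return f"sub-{site}{pathology}{index:03d}"


def parse_subject_id(subject_id):
    """Split a subject ID into site, pathology and index; raise if malformed."""
    match = SUBJECT_ID_PATTERN.match(subject_id)
    if match is None:
        raise ValueError(f"Not a valid subject ID: {subject_id}")
    return {
        "site": match.group("site") or "",
        "pathology": match.group("pathology") or "",
        "index": int(match.group("index")),
    }
```

For instance, `make_subject_id(1, site="tor", pathology="DCM")` gives `sub-torDCM001`, and `parse_subject_id("sub-mon001")` reports the site `mon` with an empty pathology.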

Data collected from actual subjects goes under their specific sub-folder, for example:

sub-001
├── anat
│   ├── sub-001_T1w.json
│   ├── sub-001_T1w.nii.gz
│   ├── sub-001_T2star.json
│   ├── sub-001_T2star.nii.gz
│   ├── sub-001_T2w.json
│   └── sub-001_T2w.nii.gz
└── dwi
    ├── sub-001_dwi.bval
    ├── sub-001_dwi.bvec
    ├── sub-001_dwi.json
    └── sub-001_dwi.nii.gz

Many kinds of data have a place specified for them by BIDS. See file naming conventions and the MRI and Microscopy extensions for full details.

Note

If you need to differentiate spinal cord images from the brain, use the acq-cspine tag. For example, sub-001_acq-cspine_T1w.nii.gz.

ℹ️ We opted for the acq-cspine tag (see BIDS template) because bp-cspine is not currently supported by the BIDS convention (see the BEP25 BIDS extension proposal).

Note

If you need to differentiate between sequences acquired with different orientations, use the acq-ax, acq-cor, or acq-sag tag. For example, sub-001_acq-ax_T1w.nii.gz.

Note

If you need to differentiate between different magnetization transfer (MT) sequences, use the flip-<index>_mt-<on|off> tag. For example, sub-001_flip-1_mt-on_MTS.nii.gz, sub-001_flip-1_mt-off_MTS.nii.gz or sub-001_flip-2_mt-off_MTS.nii.gz.

Note

If you need to combine several of the above-mentioned tags, use camelCase. For example, sub-001_acq-cspineSagittal_T1w.nii.gz.
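
As a rough illustration of the camelCase rule, the sketch below joins several acq labels into one value and assembles a file name. Both function names are hypothetical, and the sketch only covers the subject ID, an optional acq tag, and the suffix.

```python
def combine_acq_labels(labels):
    """Join acq labels in camelCase, e.g. ["cspine", "sagittal"] -> cspineSagittal."""
    first, *rest = labels
    return first + "".join(label[:1].upper() + label[1:] for label in rest)


def build_filename(subject_id, suffix, acq_labels=(), extension=".nii.gz"):
    """Assemble <subject>[_acq-<labels>]_<suffix><extension>."""
    parts = [subject_id]
    if acq_labels:
        parts.append("acq-" + combine_acq_labels(list(acq_labels)))
    parts.append(suffix)
    return "_".join(parts) + extension
```

Calling `build_filename("sub-001", "T1w", ["cspine", "sagittal"])` reproduces the example file name given in the note above.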

BIDS template#

⚠️ Every dataset must have the following files:

├── README.md
├── dataset_description.json
├── participants.tsv
├── participants.json
├── code/
│   └── curate.py
├── sub-XXX
│   └── anat
│       └── sub-XXX_T1w.nii.gz
 ...
 ...
└── derivatives
    ├── dataset_description.json
    └── labels
        └── sub-XXX
        ...
        ...

For details, see BIDS specification.
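
A quick way to check a dataset against the template is sketched below. It is only an illustration: the function name is hypothetical, and it assumes that the top-level entries of the template are the ones to verify (the derivatives folder is left out because not every dataset contains derivatives).

```python
from pathlib import Path

# Top-level entries taken from the template above.
REQUIRED_PATHS = [
    "README.md",
    "dataset_description.json",
    "participants.tsv",
    "participants.json",
    "code",
]


def find_missing_paths(dataset_root):
    """Return the required entries that do not exist under dataset_root."""
    root = Path(dataset_root)
    return [name for name in REQUIRED_PATHS if not (root / name).exists()]
```

An empty list means that all required top-level entries are present. This does not replace a full BIDS validation.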

README.md#

The README.md is a markdown file describing the dataset in more detail.

Please use the README.md template below:

# <NAME OF DATASET>

This is an <MRI/Microscopy> dataset acquired in the context of the <XYZ> project. 
<IF DATASET CONTAINS DERIVATIVES>It also contains <manual segmentation/labels> of <MS lesions/tumors/etc> from <one/two/or more> expert raters located under the derivatives folder.

## Contact Person

Dataset shared by: <NAME AND EMAIL>
<IF THERE WAS EMAIL COMM>Email communication: <DATE OF EMAIL AND SUBJECT>
<IF THERE IS A PRIMARY PROJECT/MODEL>Repository: https://github.com/<organization>/<repository_name>

## <IF DATA ARE MISSING FOR SOME SUBJECT(S)>missing data

<LIST HERE MISSING SUBJECTS>

dataset_description.json#

The dataset_description.json is a JSON file describing the dataset.

Please use the dataset_description.json template below:

{
    "BIDSVersion": "BIDS X.Y.Z",
    "Name": "<dataset_name>",
    "DatasetType": "raw"
}

Note

Refer to the BIDS spec to know what version to fill in here.

Warning

The dataset_description.json file within the top-level dataset should include "DatasetType": "raw".

participants.tsv#

The participants.tsv is a TSV file and should include at least the following columns:

| participant_id | source_id | species      | age | sex | pathology | institution |
| sub-001        | 001       | homo sapiens | 30  | F   | HC        | montreal    |
| sub-002        | 005       | homo sapiens | 40  | O   | MS        | montreal    |
| sub-003        | 007       | homo sapiens | n/a | n/a | MS        | toronto     |

Authorized values for pathology are listed under participants.json.

Please use the participants.tsv template below:

participant_id	source_id	species	age	sex	pathology	institution
sub-001	001	homo sapiens	30	F	HC	montreal
sub-002	005	homo sapiens	40	O	MS	montreal
sub-003	007	homo sapiens	n/a	n/a	MS	toronto

Other columns may be added if the data exist to fill them and they would be useful to keep.

Warning

Indicate missing values with n/a (for "not available"), not with empty cells!

Warning

This is a Tab-Separated-Values file. Make sure to use tabs between entries if editing with a text editor. Most spreadsheet software can read and write .tsv correctly.
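
If you generate participants.tsv from a script, the two warnings above (tabs between entries, n/a for missing values) can be handled as in the following sketch. The function name is hypothetical, and it assumes each participant is given as a dictionary keyed by column name.

```python
import csv

COLUMNS = ["participant_id", "source_id", "species", "age", "sex", "pathology", "institution"]


def write_participants_tsv(path, rows):
    """Write rows (dicts keyed by column name) as a tab-separated file."""
    with open(path, "w", newline="", encoding="utf-8") as tsv_file:
        writer = csv.DictWriter(
            tsv_file,
            fieldnames=COLUMNS,
            delimiter="\t",       # tabs between entries
            restval="n/a",        # columns absent from a row become n/a
            lineterminator="\n",
        )
        writer.writeheader()
        for row in rows:
            # None and empty strings are also written as n/a, never as empty cells.
            writer.writerow(
                {key: ("n/a" if value in (None, "") else value) for key, value in row.items()}
            )
```

A row that lacks the `sex` key, or whose `age` is `None`, is written with n/a in those cells.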

participants.json#

The participants.json is a JSON file providing a legend for the columns in participants.tsv, with longer descriptions, units, and in the case of categorical variables, allowed levels. Please use the template below:

{
    "participant_id": {
        "Description": "Unique Participant ID",
        "LongName": "Participant ID"
    },
    "source_id": {
        "Description": "Subject ID in the source unprocessed data",
        "LongName": "Subject ID in the source unprocessed data"
    },
    "species": {
        "Description": "Binomial species name of participant",
        "LongName": "Species"
    },
    "age": {
        "Description": "Participant age",
        "LongName": "Participant age",
        "Units": "years"
    },
    "sex": {
        "Description": "sex of the participant as reported by the participant",
        "Levels": {
            "M": "male",
            "F": "female",
            "O": "other"
        }
    },
    "pathology": {
        "Description": "The diagnosis of pathology of the participant",
        "LongName": "Pathology name",
        "Levels": {
            "HC": "Healthy Control",
            "DCM": "Degenerative Cervical Myelopathy (synonymous with CSM - Cervical Spondylotic Myelopathy)",
            "MildCompression": "Asymptomatic cord compression, without myelopathy",
            "MS": "Multiple Sclerosis",
            "SCI": "Traumatic Spinal Cord Injury"
        }
    },
    "institution": {
        "Description": "Human-friendly institution name",
        "LongName": "BIDS Institution ID"
    },
    "notes": {
        "Description": "Additional notes about the participant. For example, if there is more information about a disease, indicate it here.",
        "LongName": "Additional notes"
    }
}
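
Because participants.json lists the allowed levels for categorical columns, it can be used to check participants.tsv. The sketch below is one possible approach, with a hypothetical function name; it assumes both files follow the templates on this page and treats n/a as always acceptable.

```python
import csv
import json


def find_invalid_levels(tsv_path, json_path):
    """Return (participant_id, column, value) for values not listed under Levels."""
    with open(json_path, encoding="utf-8") as json_file:
        legend = json.load(json_file)
    # Only columns that declare "Levels" are checked.
    allowed = {column: set(spec["Levels"]) for column, spec in legend.items() if "Levels" in spec}
    problems = []
    with open(tsv_path, newline="", encoding="utf-8") as tsv_file:
        for row in csv.DictReader(tsv_file, delimiter="\t"):
            for column, levels in allowed.items():
                value = row.get(column)
                if value is not None and value != "n/a" and value not in levels:
                    problems.append((row["participant_id"], column, value))
    return problems
```

An empty result means that every categorical value in the table appears in the legend.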

code/#

The data cleaning and curation script(s) that create the sub-XXX/ folders should be kept with them, under the code/ folder. Within reason, every dataset should have a script that when run like

python code/curate.py path/to/sourcedata ./

unpacks, converts and renames all the images and related files in path/to/sourcedata/ into BIDS format in the current dataset ./.

This program should be committed first, before the curated data it produces. Afterwards, every commit that modifies the code should also re-run it, and the code and re-curated data should be committed in tandem.
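
The shape of such a script can be sketched as follows. This is not an existing curate.py: the source layout is purely hypothetical (one folder per subject, each containing a file named T1w.nii.gz), and a real script would also handle JSON sidecars, other contrasts, and participants.tsv.

```python
import argparse
import shutil
from pathlib import Path


def curate(source_dir, output_dir):
    """Copy <source>/<subject>/T1w.nii.gz to <output>/sub-XXX/anat/sub-XXX_T1w.nii.gz."""
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    created = []
    # Hypothetical source layout: one folder per subject, processed in sorted order.
    for subject_dir in sorted(path for path in source_dir.iterdir() if path.is_dir()):
        source_image = subject_dir / "T1w.nii.gz"
        if not source_image.is_file():
            continue  # skip folders without the expected image
        subject_id = f"sub-{len(created) + 1:03d}"
        target = output_dir / subject_id / "anat" / f"{subject_id}_T1w.nii.gz"
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_image, target)
        created.append(target)
    return created


def main(argv=None):
    parser = argparse.ArgumentParser(description="Convert source data to BIDS.")
    parser.add_argument("sourcedata", help="folder holding the unprocessed data")
    parser.add_argument("output", help="root of the BIDS dataset, e.g. ./")
    args = parser.parse_args(argv)
    for path in curate(args.sourcedata, args.output):
        print(path)
```

In an actual code/curate.py, `main()` would be called under the usual `if __name__ == "__main__":` guard so that the command shown above works.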

Note

Analysis scripts should not be kept here. Keep them in separate repositories, usually public on GitHub, with instructions on how to run them. See PIPELINE-DOC.

Changelog policy#

We use git log to track our changes. That means care should be taken to write good messages: they are there to help both you and future researchers understand how the dataset evolved.

Good commit message examples:

git commit -m 'Segment spines of subjects 010 through 023
    
Produced manually, using fsleyes.'

or

git commit -m 'Add new subjects provided by <email_address>'

If you choose to also fill in BIDS's optional CHANGES file, make sure it reflects the git log.

Derivatives Structure#

This is a folder at the root of the dataset that holds derivative files generated from the top-level dataset, such as segmentations or labels. According to BIDS, these data should go under the derivatives/ folder and follow the same folder logic as the sub-* data.

Warning

If derivative files were generated from preprocessed data (e.g., after reorientation and resampling), describe the preprocessing steps in the README.md file. Also include in the README.md file a link to the pipeline, pointing to a fixed GitHub version.

Example:

...
...
├── sub-XXX
│   └── anat
│       └── sub-XXX_T1w.nii.gz
...
...
└── derivatives
    ├── dataset_description.json
    └── manual_labels
        ├── sub-XXX
        │   ├── anat
        │   │   ├── sub-XXX_T1w_label-SC_seg.nii.gz
        │   │   ├── sub-XXX_T1w_label-SC_probseg.nii.gz
        │   │   ├── sub-XXX_T1w_label-SC_mask.nii.gz
        │   │   ├── sub-XXX_T1w_label-GM_seg.nii.gz
        │   │   ├── sub-XXX_T1w_label-WM_seg.nii.gz
        │   │   ├── sub-XXX_T1w_label-centerline.nii.gz
        │   │   ├── sub-XXX_T1w_label-disc.nii.gz
        │   │   ├── sub-XXX_T1w_label-lesion.nii.gz
        │   │   ├── sub-XXX_T1w_label-compression.nii.gz
        ...
        ...

The suffix convention is inspired by the BIDS convention and is as follows:

  • label-<region>_seg.nii.gz: binary segmentation of the region <region>

  • label-<region>_probseg.nii.gz: probabilistic (soft) segmentation (i.e., values can lie between 0 and 1) of the region <region>

  • label-<region>_mask.nii.gz: binary mask of the region <region>, for example, cylinder mask with diameter of 35mm centered at the center of the spinal cord

  • label-<region>_probmask.nii.gz: probabilistic mask of the region <region>

  • label-centerline.nii.gz: binary spinal cord centerline

  • label-disc.nii.gz: voxels located at the posterior tip of each intervertebral disc, with values corresponding to SCT convention

  • label-pmj.nii.gz: a single voxel with value of 50 corresponding to the pontomedullary junction (PMJ), see SCT convention for details

  • label-compression.nii.gz: voxel(s) with value of 1 located at the posterior tip of each intervertebral disc corresponding to the spinal cord compression(s), see here for details

  • label-<region>_lesion.nii.gz: lesion (for example in multiple sclerosis), see here for details

Fields:
- region = {SC, GM, WM, CSF, brain, brainstem, tumor, edema, cavity, axon, myelin}
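
The suffix convention above can be expressed as a regular expression. The sketch below is an illustration with hypothetical names; it accepts the lesion label both with a region (as in the list above) and without one (as in the example tree), since both forms appear on this page.

```python
import re

REGIONS = ["SC", "GM", "WM", "CSF", "brain", "brainstem", "tumor", "edema", "cavity", "axon", "myelin"]
REGION_KINDS = ["seg", "probseg", "mask", "probmask", "lesion"]
# Labels written without a region; "lesion" is included because the example
# tree shows label-lesion.nii.gz.
STANDALONE_LABELS = ["centerline", "disc", "pmj", "compression", "lesion"]

DERIVATIVE_PATTERN = re.compile(
    r"_label-(?:"
    r"(?P<region>" + "|".join(REGIONS) + r")_(?P<kind>" + "|".join(REGION_KINDS) + r")"
    r"|(?P<standalone>" + "|".join(STANDALONE_LABELS) + r")"
    r")\.nii\.gz$"
)


def classify_derivative(filename):
    """Return {"region", "kind"} for a derivative file name, or None if it does not match."""
    match = DERIVATIVE_PATTERN.search(filename)
    if match is None:
        return None
    return {
        "region": match.group("region"),
        "kind": match.group("kind") or match.group("standalone"),
    }
```

A file name whose region is not in the Fields list, such as `sub-XXX_T1w_label-foo_seg.nii.gz`, yields `None`.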

If you have multiple derivatives, you can create a folder for each of them, and then follow the same logic as above. For example:

...
...
└── derivatives
    ├── dataset_description.json
    ├── manual_labels
    └── manual_labels_softseg

Convention for derivatives JSON metadata:

{
  "Author": "Firstname Lastname",
  "Date": "YYYY-MM-DD HH:MM:SS"
}
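
When labels are created or corrected from a Python script, the metadata above can be written automatically. The following sketch uses a hypothetical function name and assumes that the JSON file sits next to the label file and shares its base name.

```python
import json
from datetime import datetime
from pathlib import Path


def write_sidecar(label_path, author, include_date=True):
    """Write the JSON metadata for a .nii.gz label file and return its path."""
    label_path = Path(label_path)
    if not label_path.name.endswith(".nii.gz"):
        raise ValueError(f"Expected a .nii.gz file, got: {label_path.name}")
    # Assumption: same folder and base name as the label file.
    sidecar_path = label_path.with_name(label_path.name[: -len(".nii.gz")] + ".json")
    metadata = {"Author": author}
    if include_date:
        metadata["Date"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    sidecar_path.write_text(json.dumps(metadata, indent=2) + "\n", encoding="utf-8")
    return sidecar_path
```

Passing `include_date=False` writes only the "Author" field.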

Note

"Date" is optional. We usually include it when running the manual correction via python scripts.

Warning

The derivatives folder must include its own dataset_description.json file (with "DatasetType": "derivative").
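
Both dataset_description.json files (top-level and derivatives) share the same three fields, so a single helper can write either. This is a minimal sketch with a hypothetical function name; the version string is passed in by the caller and should be taken from the BIDS specification.

```python
import json
from pathlib import Path


def write_dataset_description(folder, name, bids_version, dataset_type):
    """Write dataset_description.json in folder; dataset_type is "raw" or "derivative"."""
    if dataset_type not in ("raw", "derivative"):
        raise ValueError(f"Unexpected DatasetType: {dataset_type}")
    description = {
        "BIDSVersion": bids_version,
        "Name": name,
        "DatasetType": dataset_type,
    }
    path = Path(folder) / "dataset_description.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(description, indent=4) + "\n", encoding="utf-8")
    return path
```

The top-level file would be written with `dataset_type="raw"` and the one under derivatives/ with `dataset_type="derivative"`.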