Dataset curation#

Subject naming convention#

Basic convention: sub-XXX

Example:

sub-001
sub-002

Multi-institution/Multi-pathology convention: sub-<site><pathology>XXX

Example of Multi-institution dataset:

sub-mon001      # mon stands for Montreal
sub-tor001      # tor stands for Toronto

Example of Multi-institution/Multi-pathology dataset:

In the case of a multi-pathology dataset (two or more distinct diseases plus healthy controls), it is convenient to also include the pathology in the subject ID, for example:

sub-torDCM001      # tor stands for Toronto and DCM stands for Degenerative Cervical Myelopathy
sub-torHC001       # tor stands for Toronto and HC stands for Healthy Controls
sub-zurSCI001      # zur stands for Zurich and SCI stands for Spinal Cord Injury
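For scripted checks, the conventions above can be captured with a regular expression. This is a sketch only: the three-letter lowercase site code, the two-to-three-letter uppercase pathology code, and the zero-padded three-digit number are assumptions based on the examples above.

```python
import re

# Accepts sub-XXX, sub-<site>XXX, and sub-<site><pathology>XXX, where the
# site code is assumed to be 3 lowercase letters (e.g. "tor") and the
# pathology code 2-3 uppercase letters (e.g. "HC", "DCM").
SUBJECT_ID = re.compile(
    r"^sub-"
    r"(?:[a-z]{3})?"    # optional site code, e.g. "mon", "tor", "zur"
    r"(?:[A-Z]{2,3})?"  # optional pathology code, e.g. "DCM", "HC", "SCI"
    r"\d{3}$"           # zero-padded subject number
)

def is_valid_subject_id(subject_id: str) -> bool:
    """Return True if subject_id matches one of the naming conventions."""
    return SUBJECT_ID.fullmatch(subject_id) is not None
```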

Data collected from each subject goes under that subject's own sub-folder, for example:

sub-001
β”œβ”€β”€ anat
β”‚Β Β  β”œβ”€β”€ sub-001_T1w.json
β”‚Β Β  β”œβ”€β”€ sub-001_T1w.nii.gz
β”‚Β Β  β”œβ”€β”€ sub-001_T2star.json
β”‚Β Β  β”œβ”€β”€ sub-001_T2star.nii.gz
β”‚Β Β  β”œβ”€β”€ sub-001_T2w.json
β”‚Β Β  └── sub-001_T2w.nii.gz
└── dwi
    β”œβ”€β”€ sub-001_dwi.bval
    β”œβ”€β”€ sub-001_dwi.bvec
    β”œβ”€β”€ sub-001_dwi.json
    └── sub-001_dwi.nii.gz
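As a sketch, the layout above can also be generated programmatically, which is handy when checking many subjects at once. The contrast suffixes below simply mirror this example and are not an exhaustive list:

```python
from pathlib import Path

# Build the expected BIDS-relative paths for one subject, mirroring the
# anat/dwi layout shown above (suffixes taken from this example only).
def expected_files(subject: str) -> list[Path]:
    anat = [f"{subject}_{suffix}{ext}"
            for suffix in ("T1w", "T2star", "T2w")
            for ext in (".nii.gz", ".json")]
    dwi = [f"{subject}_dwi{ext}"
           for ext in (".nii.gz", ".bval", ".bvec", ".json")]
    return ([Path(subject, "anat", name) for name in anat]
            + [Path(subject, "dwi", name) for name in dwi])
```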

Many kinds of data have a place specified for them by BIDS. See file naming conventions and the MRI and Microscopy extensions for full details.

Warning

TODO: describe neuropoly-specific BIDS entities, like bp-cspine or acq-MTon

BIDS template#

⚠️ Every dataset must have the following files:

β”œβ”€β”€ README.md
β”œβ”€β”€ dataset_description.json
β”œβ”€β”€ participants.tsv
β”œβ”€β”€ participants.json
β”œβ”€β”€ code/
β”‚   └── curate.py
β”œβ”€β”€ sub-XXX
β”‚   └── anat
β”‚       └── sub-XXX_T1w.nii.gz
 ...
 ...
└── derivatives
    β”œβ”€β”€ dataset_description.json
    └── labels
        └── sub-XXX
        ...
        ...

For details, see BIDS specification.

README.md#

The README.md is a markdown file describing the dataset in more detail.

Below is a template; adapt it to your dataset.

README.md template:
# <NAME OF DATASET>

This is an <MRI/Microscopy> dataset acquired in the context of the <XYZ> project. 
<IF DATASET CONTAINS DERIVATIVES>It also contains <manual segmentation/labels> of <MS lesions/tumors/etc> from <one/two/or more> expert raters located under the derivatives folder.

## Contact Person

Dataset shared by: <NAME AND EMAIL>
<IF THERE WAS EMAIL COMM>Email communication: <DATE OF EMAIL AND SUBJECT>
<IF THERE IS A PRIMARY PROJECT/MODEL>Repository: https://github.com/<organization>/<repository_name>

## <IF DATA ARE MISSING FOR SOME SUBJECT(S)>Missing data

<LIST HERE MISSING SUBJECTS>

dataset_description.json#

The dataset_description.json is a JSON file describing the dataset.

dataset_description.json template:

Refer to the BIDS spec to know what version to fill in here.

{
    "BIDSVersion": "X.Y.Z",
    "Name": "<dataset_name>",
    "DatasetType": "raw"
}
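As a minimal sketch, this file can be generated from Python rather than written by hand; the `"X.Y.Z"` default below is a placeholder that must be replaced with the actual BIDS version the dataset follows:

```python
import json
from pathlib import Path

# Write a minimal dataset_description.json at the dataset root.
# "X.Y.Z" is a placeholder: fill in the BIDS version you follow.
def write_dataset_description(root: Path, name: str,
                              bids_version: str = "X.Y.Z") -> None:
    description = {
        "BIDSVersion": bids_version,
        "Name": name,
        "DatasetType": "raw",
    }
    (root / "dataset_description.json").write_text(
        json.dumps(description, indent=4) + "\n"
    )
```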

Warning

The dataset_description.json file within the top-level dataset should include "DatasetType": "raw".

participants.tsv#

The participants.tsv is a TSV file and should include at least the following columns:

| participant_id | source_id | species | age | sex | pathology | institution |
| --- | --- | --- | --- | --- | --- | --- |
| sub-001 | 001 | homo sapiens | 30 | F | HC | montreal |
| sub-002 | 005 | homo sapiens | 40 | O | MS | montreal |
| sub-003 | 007 | homo sapiens | n/a | n/a | MS | toronto |

participants.tsv template:
participant_id	source_id	species	age	sex	pathology	institution
sub-001	001	homo sapiens	30	F	HC	montreal
sub-002	005	homo sapiens	40	O	MS	montreal
sub-003	007	homo sapiens	n/a	n/a	MS	toronto

Other columns may be added if the data exist to fill them and they are worth keeping.

Warning

Indicate missing values with n/a (for β€œnot available”), not by empty cells!

Warning

This is a Tab-Separated-Values file. Make sure to use tabs between entries if editing with a text editor. Most spreadsheet software can read and write .tsv correctly.
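Both warnings above can be checked with a quick script. This sketch (the function name is hypothetical) flags empty cells, which should contain n/a instead:

```python
import csv
import io

# Scan a participants.tsv for empty cells, which should be "n/a" instead.
def check_participants_tsv(text: str) -> list[str]:
    """Return a list of problems found; an empty list means the file looks OK."""
    problems = []
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    for line_no, row in enumerate(rows, start=2):  # header is line 1
        for column, value in row.items():
            if value is None or value.strip() == "":
                problems.append(f"line {line_no}: empty cell in column '{column}'")
    return problems
```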

participants.json#

The participants.json is a JSON file providing a legend for the columns in participants.tsv, with longer descriptions, units, and, for categorical variables, the allowed levels.

participants.json template:
{
    "participant_id": {
        "Description": "Unique Participant ID",
        "LongName": "Participant ID"
    },
    "source_id": {
        "Description": "Subject ID in the unprocessed data",
        "LongName": "Subject ID in the unprocessed data"
    },
    "species": {
        "Description": "Binomial species name of participant",
        "LongName": "Species"
    },
    "age": {
        "Description": "Participant age",
        "LongName": "Participant age",
        "Units": "years"
    },
    "sex": {
        "Description": "sex of the participant as reported by the participant",
        "Levels": {
            "M": "male",
            "F": "female",
            "O": "other"
        }
    },
    "pathology": {
        "Description": "The diagnosis of pathology of the participant",
        "LongName": "Pathology name",
        "Levels": {
            "HC": "Healthy Control",
            "DCM": "Degenerative Cervical Myelopathy (synonymous with CSM - Cervical Spondylotic Myelopathy)",
            "MS": "Multiple Sclerosis",
            "SCI": "Traumatic Spinal Cord Injury"
        }
    },
    "institution": {
        "Description": "Human-friendly institution name",
        "LongName": "BIDS Institution ID"
    }
}
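Since the two files must stay in sync, a small consistency check can be sketched like this (the function name is hypothetical):

```python
import csv
import io
import json

# Compare participants.tsv columns against participants.json keys.
# Returns (columns missing a legend entry, legend keys with no column).
def compare_columns(tsv_text: str, json_text: str) -> tuple[set, set]:
    header = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    columns, described = set(header), set(json.loads(json_text))
    return columns - described, described - columns
```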

code/#

The data cleaning and curation script(s) that create the sub-XXX/ folders should be kept with them, under the code/ folder. Within reason, every dataset should have a script that, when run like

python code/curate.py path/to/sourcedata ./

unpacks, converts and renames all the images and related files in path/to/sourcedata/ into BIDS format in the current dataset ./.

This program should be committed first, before the curated data it produces. Afterwards, every commit that modifies the code should also re-run it, and the code and re-curated data should be committed in tandem.
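A hypothetical skeleton for such a curate.py is sketched below; the actual unpacking and conversion logic (DICOM handling, renaming, JSON sidecars) depends entirely on the source data and is elided:

```python
import argparse
from pathlib import Path

# Hypothetical curate.py skeleton: maps each source subject folder to a
# BIDS sub-XXX/ folder. Real conversion logic is elided.
def curate(sourcedata: Path, bids_root: Path) -> None:
    for source_subject in sorted(p for p in sourcedata.iterdir() if p.is_dir()):
        subject_id = f"sub-{source_subject.name}"
        anat_dir = bids_root / subject_id / "anat"
        anat_dir.mkdir(parents=True, exist_ok=True)
        # Placeholder: unpack, convert and rename images into BIDS names here.

def main() -> None:
    parser = argparse.ArgumentParser(description="Curate source data into BIDS.")
    parser.add_argument("sourcedata", type=Path)
    parser.add_argument("bids_root", type=Path)
    args = parser.parse_args()
    curate(args.sourcedata, args.bids_root)

# In the real script, add: if __name__ == "__main__": main()
```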

Note

Analysis scripts should not be kept here. Keep them in separate repositories, usually public on GitHub, with instructions on how to run them. See PIPELINE-DOC.

Changelog policy#

We use git log to track our changes. That means care should be taken to write good messages: they are there to help both you and future researchers understand how the dataset evolved.

Good commit message examples:

git commit -m 'Segment spines of subjects 010 through 023
    
Produced manually, using fsleyes.'

or

git commit -m 'Add new subjects provided by <email_address>'

If you choose to also fill in BIDS’s optional CHANGES file make sure it reflects the git log.

Derivatives Structure#

The derivatives are files generated from the top-level dataset, such as segmentations or labels.

Convention for derivatives JSON metadata:

{
  "Author": "Firstname Lastname",
  "Date": "YYYY-MM-DD HH:MM:SS"
}

Note

β€œDate” is optional. We usually include it when the manual correction is run via Python scripts.
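An illustrative sketch for writing this sidecar from a correction script (the function name is hypothetical, and the filename handling assumes .nii.gz labels):

```python
import json
from datetime import datetime
from pathlib import Path

# Write the Author/Date JSON sidecar next to a derivative label file.
# Assumes the label ends in .nii.gz, so both suffixes are stripped.
def write_derivative_sidecar(label_path: Path, author: str) -> Path:
    sidecar = label_path.with_suffix("").with_suffix(".json")  # .nii.gz -> .json
    metadata = {
        "Author": author,
        "Date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    }
    sidecar.write_text(json.dumps(metadata, indent=2) + "\n")
    return sidecar
```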

Warning

The derivatives folder must include its own dataset_description.json file (with "DatasetType": "derivative").