*outputdir/
Manifest.toml
notebook-test.jl
# Tags
## [0.15.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.15.0) - 19/06/2024
### Added
- Support for the Speech2Tex dataset
## [0.14.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.14.0) - 11/06/2024
### Added
- Support for the AVID dataset
## [0.13.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.13.0) - 10/06/2024
### Added
- Support for the INA Diachrony dataset
### Fixed
- MiniLibriSpeech data preparation
## [0.12.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.12.0) - 21/05/2024
### Changed
- `SpeechDataset` is now a collection of `(Recording, Annotation)` tuples
## [0.11.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.11.0) - 21/05/2024
### Added
- Filtering a speech dataset based on recording id
### Improved
- Faster TIMIT preparation
## [0.10.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.10.0) - 22/02/2024
### Added
- Extraction of alignments from TIMIT
### Changed
- `Supervision` is now `Annotation`
## [0.9.4](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.4) - 22/02/2024
### Fixed
- TIMIT data preparation
## [0.9.3](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.3) - 12/02/2024
### Fixed
- `CMUDICT("dir/path")` failed if `dir` did not already exist
## [0.9.2](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.2) - 09/02/2024
### Fixed
- Invalid type for field `channels` of `Recording`
- Broken `MINILIBRISPEECH`
## [0.9.1](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.1) - 09/02/2024
### Fixed
- Impossible to use `:` as a channel specifier
## [0.9.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.0) - 09/02/2024
### Changed
- `TIMIT` and `MINILIBRISPEECH` directly create the `dataset`
### Added
- CMU and TIMIT lexicons
## [0.8.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.8.0) - 02/02/2024
### Added
- New `dataset` function, which builds a `SpeechDataset` from manifest files
- Compatibility with `MLUtils.DataLoader`
## [0.7.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.7.0) - 14/12/2023
### Changed
- Refactored API; the TIMIT dataset works (but LibriSpeech no longer does)
## [0.6.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.6.0) - 28/09/2023
### Added
- Raw audio data source
## [0.5.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.5.0) - 25/09/2023
### Added
- Data can be loaded directly from an audio source with the `load` function
## [0.4.1](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.4.1) - 25/09/2023
### Added
- HTML display of `AudioSource` rather than `Recording`
### Fixed
- Creating a `Recording` from an audio source without specifying the channels and the sampling rate
## [0.4.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.4.0) - 08/03/2023
### Removed
- `play` function and the dependency on PortAudio
- Dependency on Fast
### Added
- HTML display of recordings (used in Pluto notebooks, for instance)
- `setrootdir` function to specify the location of the corpora
## [0.3.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.3.0) - 03/03/2023
### Added
- Users no longer need to specify the output directory; Fast.jl provides the default directory
- MiniLibriSpeech
## [0.2.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.2.0) - 03/03/2023
### Added
- MiniLibriSpeech
## [0.1.1](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.1.1) - 17/02/2023
### Fixed
- Do not regenerate the manifests if they have already been created
## [0.1.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.1.0) - 17/02/2023
### Added
- Download and preparation of the multilingual LibriSpeech corpus
name = "SpeechDatasets"
uuid = "ae813453-fab8-46d9-ab8f-a64c05464021"
authors = ["Lucas ONDEL YANG <lucas.ondel@cnrs.fr>",
"Simon DEVAUCHELLE <simon.devauchelle@universite-paris-saclay.fr>",
"Nicolas DENIER <nicolas.denier@lisn.fr>"]
version = "0.15.0"
[deps]
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
SpeechFeatures = "6f3487c4-5ca2-4050-bfeb-2cf56df92307"
[compat]
julia = "1.10"
JSON = "0.21"
SpeechFeatures = "0.8"
# SpeechDatasets.jl
A Julia package to download and prepare speech corpora.
## Installation
Make sure to add the [FAST registry](https://gitlab.lisn.upsaclay.fr/fast/registry)
to your Julia installation. Then install the package as usual:
```
pkg> add SpeechDatasets
```
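If the FAST registry has not been added yet, this can be done from the same `pkg>` prompt (the registry URL is the one linked above):

```
pkg> registry add https://gitlab.lisn.upsaclay.fr/fast/registry
```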
## Example
```
julia> using SpeechDatasets
julia> dataset = MINILIBRISPEECH("outputdir", :train) # :dev | :test
...
julia> dataset = TIMIT("/path/to/timit/dir", "outputdir", :train) # :dev | :test
...
julia> dataset = INADIACHRONY("/path/to/ina_wav/dir", "outputdir", "/path/to/ina_csv/dir") # ina_csv dir optional
...
julia> dataset = AVID("/path/to/avid/dir", "outputdir")
...
julia> dataset = SPEECH2TEX("/path/to/speech2tex/dir", "outputdir")
...
julia> for ((signal, fs), supervision) in dataset
# do something
end
# Lexicons
julia> CMUDICT("outputfile")
...
julia> TIMITDICT("/path/to/timit/dir")
...
```
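Under the hood, a `SpeechDataset` is just a vector of utterance ids plus two dictionaries (one for recordings, one for annotations), and integer indexing resolves an id and returns the matching pair. A minimal, package-free sketch of that lookup pattern (toy data, hypothetical ids):

```julia
# Toy stand-ins for the recording and annotation manifests (hypothetical ids).
recordings = Dict("utt1" => "audio for utt1", "utt2" => "audio for utt2")
annotations = Dict("utt1" => "text one", "utt2" => "text two")
idxs = sort(collect(keys(annotations)))

# Integer indexing resolves the id first, then returns the
# (recording, annotation) pair; this id-based indirection is what makes
# the dataset usable with MLUtils-style data loaders.
getitem(i::Integer) = (recordings[idxs[i]], annotations[idxs[i]])

getitem(1)  # ("audio for utt1", "text one")
```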
## License
This software is provided under the CeCILL 2.1 license (see [`LICENSE`](/LICENSE)).
# SPDX-License-Identifier: CECILL-2.1
module SpeechDatasets
using JSON
using SpeechFeatures
import MLUtils
export
# ManifestItem
Recording,
Annotation,
load,
# Manifest interface
writemanifest,
readmanifest,
# Corpora interface
download,
lang,
name,
prepare,
# Corpora
MultilingualLibriSpeech,
MINILIBRISPEECH,
TIMIT,
INADIACHRONY,
AVID,
SPEECH2TEX,
# Lexicon
CMUDICT,
TIMITDICT,
MFAFRDICT,
# Dataset
dataset
include("speechcorpus.jl")
include("manifest_item.jl")
include("manifest_io.jl")
include("dataset.jl")
# Supported corpora
for file in filter(contains(r"\.jl$"), readdir(joinpath(@__DIR__, "corpora"); join = true))
    include(file)
end
include("lexicons.jl")
end
# SPDX-License-Identifier: CECILL-2.1
function avid_recordings(dir::AbstractString)
checkdir(dir)
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
id = filename
path = joinpath(root, file)
audio_src = FileAudioSource(path)
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 16000
)
end
end
recordings
end
function load_metadata_files(dir::AbstractString)
tasksdict = Dict('s' => "SENT", 'p' => "PARA")
metadatadict = Dict(key =>
readlines(joinpath(dir, "Metadata_with_labels_$(tasksdict[key]).csv"))
for key in keys(tasksdict))
return metadatadict
end
function get_metadata(filename, metadatadict)
task = split(filename, "_")[3][1]
headers = metadatadict[task][1]
headers = split(headers, ",")
file_metadata = filter(x -> contains(x, filename), metadatadict[task])[1]
file_metadata = split(file_metadata, ",")
metadata = Dict(
headers[i] => file_metadata[i]
for i = 1:length(headers)
)
return metadata
end
function avid_annotations(dir)
checkdir(dir)
annotations = Dict()
metadatadict = load_metadata_files(dir)
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
# extract metadata from csv files
metadata = get_metadata(filename, metadatadict)
id = filename
# generate annotation
annotations[id] = Annotation(
id, # audio id
id, # annotation id
-1, # a start and a duration of -1 mean that
-1, # the whole recording is used
[1], # only 1 channel (mono recording)
metadata # additional information
)
end
end
annotations
end
function download_avid(dir)
@info "Directory $dir not found.\nDownloading AVID dataset (9.9 GB)"
url = "https://zenodo.org/records/10524873/files/AVID.zip?download=1"
filename = "AVID.zip"
filepath = joinpath(dir,filename)
run(`mkdir -p $dir`)
run(`wget $url -O $filepath`)
@info "Download complete, extracting files"
run(`unzip $filepath -d $dir`)
run(`rm $filepath`)
return joinpath(dir, "AVID")
end
function avid_prepare(datadir, outputdir)
# Validate the data directory
isdir(datadir) || (datadir = download_avid(datadir))
# Create the output directory.
outputdir = mkpath(outputdir)
rm(joinpath(outputdir, "recordings.jsonl"), force=true)
# Recordings
recordings = Array{Dict}(undef, 2)
recordings_path = joinpath(datadir, "Repository 2")
@info "Extracting recordings from $recordings_path"
recordings[1] = avid_recordings(recordings_path)
# Calibration tones
calibtones_path = joinpath(datadir, "Calibration_tones")
@info "Extracting recordings from $calibtones_path"
recordings[2] = avid_recordings(calibtones_path)
for (i, manifestpath) in enumerate([joinpath(outputdir, "recordings.jsonl"), joinpath(outputdir, "calibration_tones.jsonl")])
open(manifestpath, "w") do f
writemanifest(f, recordings[i])
end
end
# Annotations
annotations_path = recordings_path
@info "Extracting annotations from $annotations_path"
annotations = avid_annotations(annotations_path)
manifestpath = joinpath(outputdir, "annotations.jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, annotations)
end
end
function AVID(datadir, outputdir)
if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
isfile(joinpath(outputdir, "calibration_tones.jsonl")) &&
isfile(joinpath(outputdir, "annotations.jsonl")))
avid_prepare(datadir, outputdir)
end
dataset(outputdir, "")
end
# SPDX-License-Identifier: CECILL-2.1
function ina_diachrony_recordings(dir::AbstractString)
checkdir(dir)
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
id = "ina_diachrony§$filename"
path = joinpath(root, file)
audio_src = FileAudioSource(path)
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 16000
)
end
end
recordings
end
function ina_diachrony_get_metadata(filename)
metadata = split(filename, "§")
age, sex = split(metadata[2], "_")
Dict(
"speaker" => metadata[3],
"timeperiod" => metadata[1],
"age" => age,
"sex" => sex,
)
end
function ina_diachrony_annotations_whole(dir)
checkdir(dir)
annotations = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
# extract metadata from filename
metadata = ina_diachrony_get_metadata(filename)
# extract transcription text (same filename but .txt)
textfilepath = joinpath(root, "$filename.txt")
metadata["text"] = isfile(textfilepath) ? join(readlines(textfilepath), "\n") : ""
id = "ina_diachrony§$filename"
annotation_id = id*"§0"
# generate annotation
annotations[annotation_id] = Annotation(
id, # audio id
annotation_id, # annotation id
-1, # a start and a duration of -1 mean that
-1, # the whole recording is used
[1], # only 1 channel (mono recording)
metadata # additional information
)
end
end
annotations
end
function ina_diachrony_annotations_csv(dir)
checkdir(dir)
annotations = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".csv" && continue
# extract metadata from filename
metadata = ina_diachrony_get_metadata(filename)
id = "ina_diachrony§$filename"
# generate annotation for each line in csv
open(joinpath(root, file)) do f
header = readline(f)
line = 1
# read till end of file
while ! eof(f)
current_line = readline(f)
start_time, end_time, text = split(current_line, ",", limit=3)
start_time = parse(Float64, start_time)
duration = parse(Float64, end_time) - start_time
annotation_id = id * "§$line"
annotations[annotation_id] = Annotation(
id, # audio id
annotation_id, # annotation id
start_time, # start
duration, # duration
[1], # only 1 channel (mono recording)
# copy the metadata so annotations do not share (and overwrite) one dict
merge(metadata, Dict("text" => String(text))) # additional information
)
line += 1
end
end
end
end
annotations
end
function ina_diachrony_prepare(ina_wav_dir, ina_csv_dir, outputdir)
# Validate the data directory
for d in [ina_wav_dir, ina_csv_dir]
isnothing(d) || checkdir(d)
end
# Create the output directory.
outputdir = mkpath(outputdir)
rm(joinpath(outputdir, "recordings.jsonl"), force=true)
# Recordings
@info "Extracting recordings from $ina_wav_dir"
recordings = ina_diachrony_recordings(ina_wav_dir)
manifestpath = joinpath(outputdir, "recordings.jsonl")
open(manifestpath, "w") do f
writemanifest(f, recordings)
end
# Annotations
@info "Extracting annotations from $ina_wav_dir"
annotations = ina_diachrony_annotations_whole(ina_wav_dir)
if ! isnothing(ina_csv_dir)
@info "Extracting annotations from $ina_csv_dir"
csv_annotations = ina_diachrony_annotations_csv(ina_csv_dir)
annotations = merge(annotations, csv_annotations)
end
manifestpath = joinpath(outputdir, "annotations.jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, annotations)
end
end
function INADIACHRONY(ina_wav_dir, outputdir, ina_csv_dir=nothing)
if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
isfile(joinpath(outputdir, "annotations.jsonl")))
ina_diachrony_prepare(ina_wav_dir, ina_csv_dir, outputdir)
end
dataset(outputdir, "")
end
# SPDX-License-Identifier: CECILL-2.1
#######################################################################
const MINILS_URL = Dict(
"dev" => "https://www.openslr.org/resources/31/dev-clean-2.tar.gz",
"train" => "https://www.openslr.org/resources/31/train-clean-5.tar.gz"
)
const MINILS_SUBSETS = Dict(
"train" => "train-clean-5",
"dev" => "dev-clean-2"
)
#######################################################################
struct MINILIBRISPEECH <: SpeechCorpus
recordings
train
dev
test
end
function minils_recordings(dir, subset)
subsetdir = joinpath(dir, "LibriSpeech", MINILS_SUBSETS[subset])
recs = Dict()
for d1 in readdir(subsetdir; join = true)
for d2 in readdir(d1; join = true)
for path in readdir(d2; join = true)
endswith(path, ".flac") || continue
id = replace(basename(path), ".flac" => "")
r = Recording(
id,
CmdAudioSource(`sox $path -t wav -`);
channels = [1],
samplerate = 16000
)
recs[r.id] = r
end
end
end
recs
end
function minils_annotations(dir, subset)
subsetdir = joinpath(dir, "LibriSpeech", MINILS_SUBSETS[subset])
sups = Dict()
for d1 in readdir(subsetdir; join = true)
for d2 in readdir(d1; join = true)
k1 = d1 |> basename
k2 = d2 |> basename
open(joinpath(d2, "$(k1)-$(k2).trans.txt"), "r") do f
for line in eachline(f)
tokens = split(line)
s = Annotation(
tokens[1], # annotation id
tokens[1]; # recording id
channels = [1],
data = Dict("text" => join(tokens[2:end], " "))
)
sups[s.id] = s
end
end
end
end
sups
end
function minils_download(dir)
donefile = joinpath(dir, ".download.done")
if ! isfile(donefile)
run(`mkdir -p $dir`)
@debug "downloading the corpus"
for subset in ["train", "dev"]
run(`wget --no-check-certificate -P $dir $(MINILS_URL[subset])`)
tarpath = joinpath(dir, "$(MINILS_SUBSETS[subset]).tar.gz")
@debug "extracting"
run(`tar -xf $tarpath -C $dir`)
run(`rm $tarpath`)
end
run(pipeline(`date`, stdout = donefile))
end
@debug "dataset in $dir"
end
function minils_prepare(dir)
# 1. Recording manifest.
out = joinpath(dir, "recordings.jsonl")
if ! isfile(out)
open(out, "w") do f
for subset in ["train", "dev"]
@debug "preparing recording manifest ($subset) $out"
recs = minils_recordings(dir, subset)
writemanifest(f, recs)
end
end
end
# 2. Annotation manifests.
for (subset, name) in [("train", "train"), ("dev", "dev"), ("dev", "test")]
out = joinpath(dir, "annotations-$name.jsonl")
if ! isfile(out)
@debug "preparing annotation manifest ($subset) $out"
sups = minils_annotations(dir, subset)
open(out, "w") do f
writemanifest(f, sups)
end
end
end
end
function MINILIBRISPEECH(dir, subset)
minils_download(dir)
minils_prepare(dir)
dataset(dir, subset)
end
# SPDX-License-Identifier: CECILL-2.1
struct MultilingualLibriSpeech <: SpeechCorpus
lang
name
function MultilingualLibriSpeech(lang)
new(lang, "multilingual_librispeech")
end
end
const MLS_LANG_CODE = Dict(
"deu" => "german",
"eng" => "english",
"esp" => "spanish",
"fra" => "french",
"ita" => "italian",
"nld" => "dutch",
"pol" => "polish",
"prt" => "portuguese"
)
const MLS_AUDIO_URLS = Dict(
"deu" => "https://dl.fbaipublicfiles.com/mls/mls_german.tar.gz",
"eng" => "https://dl.fbaipublicfiles.com/mls/mls_english.tar.gz",
"esp" => "https://dl.fbaipublicfiles.com/mls/mls_spanish.tar.gz",
"fra" => "https://dl.fbaipublicfiles.com/mls/mls_french.tar.gz",
"ita" => "https://dl.fbaipublicfiles.com/mls/mls_italian.tar.gz",
"nld" => "https://dl.fbaipublicfiles.com/mls/mls_dutch.tar.gz",
"pol" => "https://dl.fbaipublicfiles.com/mls/mls_polish.tar.gz",
"prt" => "https://dl.fbaipublicfiles.com/mls/mls_portuguese.tar.gz"
)
const MLS_LM_URLS = Dict(
"deu" => "https://dl.fbaipublicfiles.com/mls/mls_lm_german.tar.gz",
"eng" => "https://dl.fbaipublicfiles.com/mls/mls_lm_english.tar.gz",
"esp" => "https://dl.fbaipublicfiles.com/mls/mls_lm_spanish.tar.gz",
"fra" => "https://dl.fbaipublicfiles.com/mls/mls_lm_french.tar.gz",
"ita" => "https://dl.fbaipublicfiles.com/mls/mls_lm_italian.tar.gz",
"nld" => "https://dl.fbaipublicfiles.com/mls/mls_lm_dutch.tar.gz",
"pol" => "https://dl.fbaipublicfiles.com/mls/mls_lm_polish.tar.gz",
"prt" => "https://dl.fbaipublicfiles.com/mls/mls_lm_portuguese.tar.gz"
)
function Base.download(corpus::MultilingualLibriSpeech, outdir)
dir = path(corpus, outdir)
donefile = joinpath(dir, ".download.done")
if ! isfile(donefile)
run(`mkdir -p $dir`)
@info "downloading the corpus"
run(`wget -P $dir $(MLS_AUDIO_URLS[corpus.lang])`)
tarpath = joinpath(dir, "mls_$(MLS_LANG_CODE[corpus.lang]).tar.gz")
@info "extracting"
run(`tar -xf $tarpath -C $dir`)
run(`rm $tarpath`)
@info "downloading LM data"
run(`wget -P $dir $(MLS_LM_URLS[corpus.lang])`)
tarpath = joinpath(dir, "mls_lm_$(MLS_LANG_CODE[corpus.lang]).tar.gz")
@info "extracting"
run(`tar -xf $tarpath -C $dir`)
run(`rm $tarpath`)
run(pipeline(`date`, stdout = donefile))
end
@info "dataset in $dir"
corpus
end
function recordings(corpus::MultilingualLibriSpeech, dir, subset)
subsetdir = joinpath(dir, "mls_$(MLS_LANG_CODE[corpus.lang])", subset, "audio")
recs = Dict()
for d1 in readdir(subsetdir; join = true)
for d2 in readdir(d1; join = true)
for path in readdir(d2; join = true)
id = replace(basename(path), ".flac" => "")
r = Recording(
id,
CmdAudioSource(`sox $path -t wav -`);
channels = [1],
samplerate = 16000
)
recs[r.id] = r
end
end
end
recs
end
function annotations(corpus::MultilingualLibriSpeech, dir, subset)
trans = joinpath(dir, "mls_$(MLS_LANG_CODE[corpus.lang])", subset, "transcripts.txt")
sups = Dict()
open(trans, "r") do f
for line in eachline(f)
tokens = split(line)
s = Annotation(tokens[1], tokens[1]; channel = 1,
data = Dict("text" => join(tokens[2:end], " ")))
sups[s.id] = s
end
end
sups
end
function prepare(corpus::MultilingualLibriSpeech, outdir)
dir = path(corpus, outdir)
# 1. Recording manifests.
for subset in ["train", "dev", "test"]
out = joinpath(dir, "recording-manifest-$subset.jsonl")
@info "preparing recording manifest ($subset) $out"
if ! isfile(out)
recs = recordings(corpus, dir, subset)
open(out, "w") do f
writemanifest(f, recs)
end
end
end
# 2. Annotation manifests.
for subset in ["train", "dev", "test"]
out = joinpath(dir, "annotation-manifest-$subset.jsonl")
@info "preparing annotation manifest ($subset) $out"
if ! isfile(out)
sups = annotations(corpus, dir, subset)
open(out, "w") do f
writemanifest(f, sups)
end
end
end
corpus
end
# SPDX-License-Identifier: CECILL-2.1
function speech2tex_recordings(dir::AbstractString)
checkdir(dir)
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
id = filename
path = joinpath(root, file)
audio_src = FileAudioSource(path)
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 48000
)
end
end
recordings
end
extract_digits(str::AbstractString) = filter(isdigit, str)
isnumber(str::AbstractString) = extract_digits(str) == str
function speech2tex_get_metadata(filename)
# possible cases: line123_p1 line123_124_p1 line123_p1_part2 (not observed but also supported: line123_124_p1_part2)
split_name = split(filename, "_")
metadata = Dict()
if isnumber(split_name[2])
metadata["line"] = extract_digits(split_name[1])*"_"*split_name[2]
metadata["speaker"] = split_name[3]
else
metadata["line"] = extract_digits(split_name[1])
metadata["speaker"] = split_name[2]
end
if occursin("part", split_name[end])
metadata["part"] = extract_digits(split_name[end])
end
metadata
end
function speech2tex_annotations(audiodir, transcriptiondir, texdir)
checkdir.([audiodir, transcriptiondir, texdir])
annotations = Dict()
for (root, subdirs, files) in walkdir(audiodir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
# extract metadata from csv files
metadata = speech2tex_get_metadata(filename)
# extract transcription and tex (same filenames but .txt)
dirdict = Dict(transcriptiondir => "transcription", texdir => "latex")
for (d, label) in dirdict
textfilepath = joinpath(d, "$filename.txt")
metadata[label] = isfile(textfilepath) ? join(readlines(textfilepath), "\n") : ""
end
id = filename
# generate annotation
annotations[id] = Annotation(
id, # audio id
id, # annotation id
-1, # a start and a duration of -1 mean that
-1, # the whole recording is used
[1], # only 1 channel (mono recording)
metadata # additional information
)
end
end
annotations
end
function speech2tex_prepare(datadir, outputdir)
# Validate the data directory
checkdir(datadir)
# Create the output directory.
outputdir = mkpath(outputdir)
rm(joinpath(outputdir, "recordings.jsonl"), force=true)
# Recordings
recordings_path = joinpath(datadir, "audio")
@info "Extracting recordings from $recordings_path"
recordings = speech2tex_recordings(recordings_path)
manifestpath = joinpath(outputdir, "recordings.jsonl")
open(manifestpath, "w") do f
writemanifest(f, recordings)
end
# Annotations
transcriptiondir = joinpath(datadir, "sequences")
texdir = joinpath(datadir, "latex")
@info "Extracting annotations from $transcriptiondir and $texdir"
annotations = speech2tex_annotations(recordings_path, transcriptiondir, texdir)
manifestpath = joinpath(outputdir, "annotations.jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, annotations)
end
end
function SPEECH2TEX(datadir, outputdir)
if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
isfile(joinpath(outputdir, "annotations.jsonl")))
speech2tex_prepare(datadir, outputdir)
end
dataset(outputdir, "")
end
# SPDX-License-Identifier: CECILL-2.1
#######################################################################
const TIMIT_SUBSETS = Dict(
"train" => "train",
"dev" => "dev",
"test" => "test"
)
const TIMIT_DEV_SPK_LIST = Set([
"faks0",
"fdac1",
"fjem0",
"mgwt0",
"mjar0",
"mmdb1",
"mmdm2",
"mpdf0",
"fcmh0",
"fkms0",
"mbdg0",
"mbwm0",
"mcsh0",
"fadg0",
"fdms0",
"fedw0",
"mgjf0",
"mglb0",
"mrtk0",
"mtaa0",
"mtdt0",
"mthc0",
"mwjg0",
"fnmr0",
"frew0",
"fsem0",
"mbns0",
"mmjr0",
"mdls0",
"mdlf0",
"mdvc0",
"mers0",
"fmah0",
"fdrw0",
"mrcs0",
"mrjm4",
"fcal1",
"mmwh0",
"fjsj0",
"majc0",
"mjsw0",
"mreb0",
"fgjd0",
"fjmg0",
"mroa0",
"mteb0",
"mjfc0",
"mrjr0",
"fmml0",
"mrws1"
])
const TIMIT_TEST_SPK_LIST = Set([
"mdab0",
"mwbt0",
"felc0",
"mtas1",
"mwew0",
"fpas0",
"mjmp0",
"mlnt0",
"fpkt0",
"mlll0",
"mtls0",
"fjlm0",
"mbpm0",
"mklt0",
"fnlp0",
"mcmj0",
"mjdh0",
"fmgd0",
"mgrt0",
"mnjm0",
"fdhc0",
"mjln0",
"mpam0",
"fmld0"
])
const TIMIT_PHONE_MAP48 = Dict(
"aa" => "aa",
"ae" => "ae",
"ah" => "ah",
"ao" => "ao",
"aw" => "aw",
"ax" => "ax",
"ax-h" => "ax",
"axr" => "er",
"ay" => "ay",
"b" => "b",
"bcl" => "vcl",
"ch" => "ch",
"d" => "d",
"dcl" => "vcl",
"dh" => "dh",
"dx" => "dx",
"eh" => "eh",
"el" => "el",
"em" => "m",
"en" => "en",
"eng" => "ng",
"epi" => "epi",
"er" => "er",
"ey" => "ey",
"f" => "f",
"g" => "g",
"gcl" => "vcl",
"h#" => "sil",
"hh" => "hh",
"hv" => "hh",
"ih" => "ih",
"ix" => "ix",
"iy" => "iy",
"jh" => "jh",
"k" => "k",
"kcl" => "cl",
"l" => "l",
"m" => "m",
"n" => "n",
"ng" => "ng",
"nx" => "n",
"ow" => "ow",
"oy" => "oy",
"p" => "p",
"pau" => "sil",
"pcl" => "cl",
"q" => "",
"r" => "r",
"s" => "s",
"sh" => "sh",
"t" => "t",
"tcl" => "cl",
"th" => "th",
"uh" => "uh",
"uw" => "uw",
"ux" => "uw",
"v" => "v",
"w" => "w",
"y" => "y",
"z" => "z",
"zh" => "zh"
)
const TIMIT_PHONE_MAP39 = Dict(
"aa" => "aa",
"ae" => "ae",
"ah" => "ah",
"ao" => "aa",
"aw" => "aw",
"ax" => "ah",
"ax-h" => "ah",
"axr" => "er",
"ay" => "ay",
"b" => "b",
"bcl" => "sil",
"ch" => "ch",
"d" => "d",
"dcl" => "sil",
"dh" => "dh",
"dx" => "dx",
"eh" => "eh",
"el" => "l",
"em" => "m",
"en" => "n",
"eng" => "ng",
"epi" => "sil",
"er" => "er",
"ey" => "ey",
"f" => "f",
"g" => "g",
"gcl" => "sil",
"h#" => "sil",
"hh" => "hh",
"hv" => "hh",
"ih" => "ih",
"ix" => "ih",
"iy" => "iy",
"jh" => "jh",
"k" => "k",
"kcl" => "sil",
"l" => "l",
"m" => "m",
"n" => "n",
"ng" => "ng",
"nx" => "n",
"ow" => "ow",
"oy" => "oy",
"p" => "p",
"pau" => "sil",
"pcl" => "sil",
"q" => "",
"r" => "r",
"s" => "s",
"sh" => "sh",
"t" => "t",
"tcl" => "sil",
"th" => "th",
"uh" => "uh",
"uw" => "uw",
"ux" => "uw",
"v" => "v",
"w" => "w",
"y" => "y",
"z" => "z",
"zh" => "sh"
)
#######################################################################
function timit_prepare(timitdir, dir; audio_fmt="SPHERE")
# Validate the data directory
! isdir(timitdir) && throw(ArgumentError("invalid path $(timitdir)"))
# Create the output directory.
dir = mkpath(dir)
rm(joinpath(dir, "recordings.jsonl"), force=true)
## Recordings
@info "Extracting recordings from $timitdir/train"
train_recordings = timit_recordings(joinpath(timitdir, "train"); fmt=audio_fmt)
# We extract the name of speakers that are not in the dev set
TIMIT_TRAIN_SPK_LIST = Set()
for id in keys(train_recordings)
_, spk, _ = split(id, "_")
if spk ∉ TIMIT_DEV_SPK_LIST
push!(TIMIT_TRAIN_SPK_LIST, spk)
end
end
@info "Extracting recordings from $timitdir/test"
test_recordings = timit_recordings(joinpath(timitdir, "test"); fmt=audio_fmt)
recordings = merge(train_recordings, test_recordings)
manifestpath = joinpath(dir, "recordings.jsonl")
open(manifestpath, "a") do f
writemanifest(f, recordings)
end
# Annotations
@info "Extracting annotations from $timitdir/train"
train_annotations = timit_annotations(joinpath(timitdir, "train"))
@info "Extracting annotations from $timitdir/test"
test_annotations = timit_annotations(joinpath(timitdir, "test"))
annotations = merge(train_annotations, test_annotations)
train_annotations = filter(annotations) do (k, v)
stype = v.data["sentence type"]
spk = v.data["speaker"]
(
(stype == "compact" || stype == "diverse") &&
spk ∈ TIMIT_TRAIN_SPK_LIST
)
end
dev_annotations = filter(annotations) do (k, v)
stype = v.data["sentence type"]
spk = v.data["speaker"]
(
(stype == "compact" || stype == "diverse") &&
spk ∈ TIMIT_DEV_SPK_LIST
)
end
test_annotations = filter(annotations) do (k, v)
stype = v.data["sentence type"]
spk = v.data["speaker"]
(
(stype == "compact" || stype == "diverse") &&
spk ∈ TIMIT_TEST_SPK_LIST
)
end
for (x, y) in ("train" => train_annotations,
"dev" => dev_annotations,
"test" => test_annotations)
manifestpath = joinpath(dir, "annotations-$(x).jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, y)
end
end
end
function timit_recordings(dir::AbstractString; fmt="SPHERE")
! isdir(dir) && throw(ArgumentError("expected directory $dir"))
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
name, ext = splitext(file)
ext != ".wav" && continue
spk = basename(root)
path = joinpath(root, file)
id = "timit_$(spk)_$(name)"
audio_src = if fmt == "SPHERE"
CmdAudioSource(`sph2pipe -f wav $path`)
else
FileAudioSource(path)
end
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 16000
)
end
end
recordings
end
function timit_annotations(dir)
! isdir(dir) && throw(ArgumentError("expected directory $dir"))
splitline(line) = rsplit(line, limit=3)
annotations = Dict()
processed = Set()
for (root, subdirs, files) in walkdir(dir)
for file in files
name, ext = splitext(file)
_, dialect, spk = rsplit(root, "/", limit=3)
# Annotation files already processed (".wrd" and ".phn")
idtuple = (dialect, spk, name)
(idtuple in processed) && continue
push!(processed, (dialect, spk, name))
# Words
wpath = joinpath(root, name * ".wrd")
words = [last(split(line)) for line in eachline(wpath)]
# Phones
ppath = joinpath(root, name * ".phn")
palign = Tuple{Int,Int,String}[]
for line in eachline(ppath)
t0, t1, p = split(line)
push!(palign, (parse(Int, t0), parse(Int, t1), String(p)))
end
sentence_type = if startswith(name, "sa")
"dialect"
elseif startswith(name, "sx")
"compact"
else # startswith(name, "si")
"diverse"
end
id = "timit_$(spk)_$(name)"
annotations[id] = Annotation(
id, # recording id and annotation id are the same since we have
id, # a one-to-one mapping
-1, # a start and a duration of -1 mean that
-1, # the whole recording is used
[1], # only 1 channel (mono recording)
Dict(
"text" => join(words, " "),
"sentence type" => sentence_type,
"alignment" => palign,
"dialect" => dialect,
"speaker" => spk,
"sex" => string(first(spk)),
)
)
end
end
annotations
end
function TIMIT(timitdir, dir, subset)
if ! (isfile(joinpath(dir, "recordings.jsonl")) &&
isfile(joinpath(dir, "annotations-train.jsonl")) &&
isfile(joinpath(dir, "annotations-dev.jsonl")) &&
isfile(joinpath(dir, "annotations-test.jsonl")))
timit_prepare(timitdir, dir)
end
dataset(dir, subset)
end
# SPDX-License-Identifier: CECILL-2.1
struct SpeechDataset <: MLUtils.AbstractDataContainer
idxs::Vector{AbstractString}
annotations::Dict{AbstractString, Annotation}
recordings::Dict{AbstractString, Recording}
end
"""
dataset(manifestroot, partition)
Load a `SpeechDataset` from the manifest files stored in `manifestroot`.
Each item of the dataset is a nested tuple `((samples, sampling_rate), Annotation.data)`.
See also [`Annotation`](@ref).
# Examples
```julia-repl
julia> ds = dataset("./manifests", :train)
SpeechDataset(
...
)
julia> ds[1]
(
(samples=[...], sampling_rate=16_000),
Dict(
"text" => "Annotation text here"
)
)
```
"""
function dataset(manifestroot::AbstractString, partition)
partition_name = partition == "" ? "" : "-$(partition)"
annot_path = joinpath(manifestroot, "annotations$(partition_name).jsonl")
rec_path = joinpath(manifestroot, "recordings.jsonl")
annotations = load(Annotation, annot_path)
recordings = load(Recording, rec_path)
dataset(annotations, recordings)
end
function dataset(annotations::AbstractDict, recordings::AbstractDict)
idxs = collect(keys(annotations))
SpeechDataset(idxs, annotations, recordings)
end
Base.getindex(d::SpeechDataset, key::AbstractString) = d.recordings[key], d.annotations[key]
Base.getindex(d::SpeechDataset, idx::Integer) = getindex(d, d.idxs[idx])
# Base.Fix1 -> partial function with the 1st argument fixed
Base.getindex(d::SpeechDataset, idxs::AbstractVector) = map(Base.Fix1(getindex, d), idxs)
Base.length(d::SpeechDataset) = length(d.idxs)
function Base.filter(fn, d::SpeechDataset)
fidxs = filter(d.idxs) do i
fn((d.recordings[i], d.annotations[i]))
end
idset = Set(fidxs)
fannotations = filter(d.annotations) do (k, v)
k ∈ idset
end
frecs = filter(d.recordings) do (k, v)
k ∈ idset
end
SpeechDataset(fidxs, fannotations, frecs)
end
# SPDX-License-Identifier: CECILL-2.1
const CMUDICT_URL = "http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/sphinxdict/cmudict_SPHINX_40"
const FRMFA_DICT_URL = "https://raw.githubusercontent.com/MontrealCorpusTools/mfa-models/main/dictionary/french/mfa/french_mfa.dict"
function normalizeword(word)
String(uppercase(word))
end
function normalizephoneme(phoneme)
String(uppercase(phoneme))
end
"""
CMUDICT(path)
Return the pronunciation dictionary loaded from the CMU Sphinx dictionary.
The CMU dictionary will be downloaded and stored at `path`. Subsequent
calls will only read the file at `path` without downloading the data again.
"""
function CMUDICT(path)
if ! isfile(path)
mkpath(dirname(path))
dir = mktempdir()
run(`wget -P $dir $CMUDICT_URL`)
mv(joinpath(dir, "cmudict_SPHINX_40"), path)
end
lexicon = Dict()
open(path, "r") do f
for line in eachline(f)
word, pron... = split(line)
word = replace(word, "(1)" => "", "(2)" => "", "(3)" => "", "(4)" => "")
pronunciations = get(lexicon, word, [])
push!(pronunciations, pron)
lexicon[word] = pronunciations
end
end
lexicon
end
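A quick lookup sketch (the cache path is hypothetical): each entry maps an upper-case word to a vector of pronunciation variants, each variant itself a vector of phones.

```julia
# Hypothetical sketch: the first call downloads the dictionary,
# later calls reuse the cached file at the given path.
lex = CMUDICT("outputdir/cmudict_SPHINX_40")
prons = get(lex, "HELLO", [])  # all pronunciation variants for "HELLO"
```

Note that variant markers such as `(2)` are stripped from the word, so all variants of a word accumulate under the same key.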
"""
TIMITDICT(timitdir)
Return the pronunciation dictionary provided by the TIMIT corpus (located
in `timitdir`).
"""
function TIMITDICT(timitdir)
dictfile = joinpath(timitdir, "doc", "timitdic.txt")
iscomment(line) = startswith(line, ';')
lexicon = Dict{String,Vector{Vector{String}}}()
for line in eachline(dictfile)
iscomment(line) && continue
word, pron = split(line, limit=2)
pron = strip(pron, ['/', '\t', ' '])
word = '~' in word ? split(word, "~", limit=2)[1] : word
word = normalizeword(word)
pron = normalizephoneme.(split(pron))
pronunciations = get(lexicon, word, Vector{String}[])
push!(pronunciations, pron)
lexicon[word] = pronunciations
end
lexicon
end
"""
MFAFRDICT(path)
Return the French pronunciation dictionary provided by MFA (french_mfa v2.0.0a).
"""
function MFAFRDICT(path)
if ! isfile(path)
mkpath(dirname(path))
dir = mktempdir()
run(`wget -P $dir $FRMFA_DICT_URL`)
mv(joinpath(dir, "french_mfa.dict"), path)
end
lexicon = Dict()
open(path, "r") do f
for line in eachline(f)
word, pron... = split(line)
pronunciations = get(lexicon, word, [])
push!(pronunciations, pron)
lexicon[word] = pronunciations
end
end
lexicon
end
# SPDX-License-Identifier: CECILL-2.1
#=====================================================================#
# JSON serialization of a manifest item
function Base.show(io::IO, m::MIME"application/json", s::FileAudioSource)
compact = get(io, :compact, false)
indent = get(io, :indent, 0)
printfn = compact ? print : println
printfn(io, "{")
printfn(io, repeat(" ", indent + (compact ? 0 : 2)), "\"type\": \"path\", ")
printfn(io, repeat(" ", indent + (compact ? 0 : 2)), "\"data\": \"", s.path, "\"")
print(io, repeat(" ", indent), "}")
end
function Base.show(io::IO, m::MIME"application/json", s::URLAudioSource)
compact = get(io, :compact, false)
indent = get(io, :indent, 0)
printfn = compact ? print : println
printfn(io, repeat(" ", indent), "{")
printfn(io, repeat(" ", indent + (compact ? 0 : 2)), "\"type\": \"url\", ")
printfn(io, repeat(" ", indent + (compact ? 0 : 2)), "\"data\": \"", s.url, "\"")
print(io, repeat(" ", indent), "}")
end
function Base.show(io::IO, m::MIME"application/json", s::CmdAudioSource)
compact = get(io, :compact, false)
indent = get(io, :indent, 0)
printfn = compact ? print : println
printfn(io, "{")
printfn(io, repeat(" ", indent + (compact ? 0 : 2)), "\"type\": \"cmd\", ")
strcmd = replace("$(s.cmd)", "`" => "")
printfn(io, repeat(" ", indent + (compact ? 0 : 2)), "\"data\": \"$(strcmd)\"")
print(io, repeat(" ", indent), "}")
end
function Base.show(io::IO, m::MIME"application/json", r::Recording)
compact = get(io, :compact, false)
indent = compact ? 0 : 2
printfn = compact ? print : println
printfn(io, "{")
printfn(io, repeat(" ", indent), "\"id\": \"", r.id, "\", ")
print(io, repeat(" ", indent), "\"src\": ")
show(IOContext(io, :indent => compact ? 0 : 2), m, r.source)
printfn(io, ", ")
print(io, repeat(" ", indent), "\"channels\": [")
for (i, c) in enumerate(r.channels)
print(io, c)
i < length(r.channels) && print(io, ",")
end
printfn(io, "], ")
printfn(io, repeat(" ", indent), "\"samplerate\": ", r.samplerate)
print(io, "}")
end
function Base.show(io::IO, m::MIME"application/json", a::Annotation)
compact = get(io, :compact, false)
indent = compact ? 0 : 2
printfn = compact ? print : println
printfn(io, "{")
printfn(io, repeat(" ", indent), "\"id\": \"", a.id, "\", ")
printfn(io, repeat(" ", indent), "\"recording_id\": \"", a.recording_id, "\", ")
printfn(io, repeat(" ", indent), "\"start\": ", a.start, ", ")
printfn(io, repeat(" ", indent), "\"duration\": ", a.duration, ", ")
printfn(io, repeat(" ", indent), "\"channels\": ", a.channels |> json, ", ")
printfn(io, repeat(" ", indent), "\"data\": ", a.data |> json)
print(io, "}")
end
function JSON.json(r::Union{Recording, Annotation}; compact = true)
out = IOBuffer()
show(IOContext(out, :compact => compact), MIME("application/json"), r)
String(take!(out))
end
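A round-trip sketch (the audio path is hypothetical): serialize a `Recording` to a one-line JSON string and rebuild the manifest item from the parsed dictionary.

```julia
rec = Recording("rec1", FileAudioSource("/data/rec1.wav"), [1], 16000)
line = json(rec)                    # compact, one-line JSON by default
rec2 = Recording(JSON.parse(line))  # rebuild the item from the parsed Dict
rec2.id == rec.id                   # fields survive the round trip
```

The `"type"` field of the serialized source selects which `AbstractAudioSource` subtype is reconstructed.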
#=====================================================================#
# Converting a dictionary to a manifest item.
function AudioSource(d::Dict)
if d["type"] == "path"
T = FileAudioSource
elseif d["type"] == "url"
T = URLAudioSource
elseif d["type"] == "cmd"
T = CmdAudioSource
else
throw(ArgumentError("invalid type: $(d["type"])"))
end
T(d["data"])
end
Recording(d::Dict) = Recording(
d["id"],
AudioSource(d["src"]),
convert(Vector{Int}, d["channels"]),
d["samplerate"]
)
Annotation(d::Dict) = Annotation(
d["id"],
d["recording_id"],
d["start"],
d["duration"],
d["channels"],
d["data"]
)
#=====================================================================#
# Writing / reading manifest from file.
function writemanifest(io::IO, manifest::Dict)
writefn = x -> println(io, x)
for item in values(manifest)
item |> json |> writefn
end
end
function readmanifest(io::IO, T)
manifest = Dict()
for line in eachline(io)
item = JSON.parse(line) |> T
manifest[item.id] = item
end
manifest
end
# Some utilities
manifestname(::Type{<:Recording}, name) = "recordings.jsonl"
manifestname(::Type{<:Annotation}, name) = "annotations-$name.jsonl"
"""
load(Annotation, path)
load(Recording, path)
Load Recording/Annotation manifest from `path`.
"""
load(T::Type{<:Union{Recording, Annotation}}, path) = open(f -> readmanifest(f, T), path, "r")
function checkdir(dir::AbstractString)
isdir(dir) || throw(ArgumentError("$dir is not an existing directory"))
end
# SPDX-License-Identifier: CECILL-2.1
"""
abstract type ManifestItem end
Base class for all manifest items. Every manifest item should have an
`id` attribute.
"""
abstract type ManifestItem end
"""
struct Recording{Ts<:AbstractAudioSource} <: ManifestItem
id::AbstractString
source::Ts
channels::Vector{Int}
samplerate::Int
end
A recording is an audio source associated with an id.
# Constructors
Recording(id, source, channels, samplerate)
Recording(id, source[; channels = missing, samplerate = missing])
If the channels or the sample rate are not provided, they will be
read from `source`.
!!! warning
When preparing a large corpus, not providing the channels and/or the
sample rate can drastically reduce the speed, as it forces reading
the source.
"""
struct Recording{Ts<:AbstractAudioSource} <: ManifestItem
id::AbstractString
source::Ts
channels::Vector{Int}
samplerate::Int
end
function Recording(uttid, s::AbstractAudioSource; channels = missing, samplerate = missing)
if ismissing(channels) || ismissing(samplerate)
x, sr = loadaudio(s)
samplerate = ismissing(samplerate) ? Int(sr) : samplerate
channels = ismissing(channels) ? collect(1:size(x,2)) : channels
end
Recording(uttid, s, channels, samplerate)
end
"""
struct Annotation <: ManifestItem
id::AbstractString
recording_id::AbstractString
start::Float64
duration::Float64
channels::Union{Vector, Colon}
data::Dict
end
An "annotation" defines a segment of a recording on one or more channels.
The `data` field is an arbitrary dictionary holding the content of the
annotation. `start` and `duration` (in seconds) define where the segment
is located within the recording `recording_id`.
# Constructor
Annotation(id, recording_id, start, duration, channels, data)
Annotation(id, recording_id[; channels = missing, start = -1, duration = -1, data = missing])
If `start` and/or `duration` are negative, the segment is considered to
"""
struct Annotation <: ManifestItem
id::AbstractString
recording_id::AbstractString
start::Float64
duration::Float64
channels::Union{Vector, Colon}
data::Dict
end
Annotation(id, recid; channels = missing, start = -1, duration = -1, data = missing) =
Annotation(id, recid, start, duration, channels, data)
"""
load(recording[; start = -1, duration = -1, channels = recording.channels])
load(recording, annotation)
Load the signal from a recording. `start` and `duration` (in seconds) can
be used to load only a segment. If an `annotation` is given, the function
returns only the portion of the signal corresponding to the annotation
segment.
The function returns a tuple `(x, sr)` where `x` is an ``N×C`` array
(``N`` is the length of the signal and ``C`` the number of channels)
and `sr` is the sampling rate of the signal.
"""
function load(r::Recording; start = -1, duration = -1, channels = r.channels)
if start >= 0 && duration >= 0
s = Int(floor(start * r.samplerate + 1))
e = Int(ceil((start + duration) * r.samplerate))  # end index is absolute within the recording
subrange = (s:e)
else
subrange = (:)
end
x, sr = loadaudio(r.source, subrange)
x[:,channels], sr
end
load(r::Recording, a::Annotation) = load(r; start = a.start, duration = a.duration, channels = a.channels)
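A loading sketch (file path hypothetical): reading a 2-second excerpt starting at 0.5 s. Since channels and sample rate are passed explicitly, the constructor does not need to open the file.

```julia
# Hypothetical sketch: load a 2 s segment starting at 0.5 s.
rec = Recording("rec1", FileAudioSource("/data/rec1.wav");
                channels = [1], samplerate = 16_000)
x, sr = load(rec; start = 0.5, duration = 2.0)
size(x)  # roughly (duration * samplerate, 1), i.e. (32_000, 1) at 16 kHz
```

The segment is converted to a sample range `s:e` computed from `start` and `start + duration`, so the row count is approximately `duration * samplerate`.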
# SPDX-License-Identifier: CECILL-2.1
"""
abstract type SpeechCorpus end
Abstract type for all speech corpora.
"""
abstract type SpeechCorpus end
"""
lang(corpus)
Return the ISO 639-3 code of the language of the corpus.
"""
lang
"""
name(corpus)
Return the name identifier of the corpus.
"""
name
"""
download(corpus, rootdir)
Download the data of the corpus to `rootdir`.
"""
Base.download
"""
prepare(corpus, rootdir)
Prepare the manifests of the corpus.
"""
prepare