*outputdir/
Manifest.toml
notebook-test.jl
# Tags
## [0.15.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.15.0) - 19/06/2024
### Added
- Added support for the Speech2Tex dataset
## [0.14.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.14.0) - 11/06/2024
### Added
- Added support for the AVID dataset
## [0.13.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.13.0) - 10/06/2024
### Added
- Added support for the INA Diachrony dataset
### Fixed
- Fixed Mini LibriSpeech data preparation
## [0.12.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.12.0) - 21/05/2024
### Changed
- `SpeechDataset` is now a collection of `(Recording, Annotation)` tuples.
## [0.11.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.11.0) - 21/05/2024
### Added
- Filtering a speech dataset based on recording id.
### Improved
- Faster TIMIT preparation
## [0.10.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.10.0) - 22/02/2024
### Added
- Extract alignments from TIMIT
### Changed
- `Supervision` is now `Annotation`
## [0.9.4](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.4) - 22/02/2024
### Fixed
- TIMIT data preparation
## [0.9.3](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.3) - 12/02/2024
### Fixed
- `CMUDICT("dir/path")` fails if `dir` does not already exist.
## [0.9.2](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.2) - 09/02/2024
### Fixed
- Invalid type for field `channels` of `Recording`
- `MINILIBRISPEECH` broken
## [0.9.1](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.1) - 09/02/2024
### Fixed
- Not possible to use `:` as channel specifier
## [0.9.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.0) - 09/02/2024
### Changed
- `TIMIT` and `MINILIBRISPEECH` directly create the `dataset`
### Added
- CMU and TIMIT lexicons
## [0.8.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.8.0) - 02/02/2024
### Added
- New `dataset` function, which builds a `SpeechDataset` from manifest files
- Compatibility with `MLUtils.DataLoader`
## [0.7.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.7.0) - 14/12/2023
### Changed
- Refactored API; the TIMIT dataset works (but LibriSpeech does not anymore)
## [0.6.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.6.0) - 28/09/2023
### Added
- Raw audio data source
## [0.5.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.5.0) - 25/09/2023
### Added
- The data can be loaded directly from an audio source with the `load` function.
## [0.4.1](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.4.1) - 25/09/2023
### Added
......
name = "SpeechCorpora"
uuid = "3225a15e-d855-4a07-9546-2418058331ae"
authors = ["Lucas ONDEL YANG <lucas.ondel@cnrs.fr>"]
version = "0.4.1"
name = "SpeechDatasets"
uuid = "ae813453-fab8-46d9-ab8f-a64c05464021"
authors = ["Lucas ONDEL YANG <lucas.ondel@cnrs.fr>",
"Simon DEVAUCHELLE <simon.devauchelle@universite-paris-saclay.fr>",
"Nicolas DENIER <nicolas.denier@lisn.fr>"]
version = "0.15.0"
[deps]
Base64 = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
WAV = "8149f6b0-98f6-5db9-b78f-408fbbb8ef88"
MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
SpeechFeatures = "6f3487c4-5ca2-4050-bfeb-2cf56df92307"
[compat]
julia = "1.10"
JSON = "0.21"
WAV = "1.2"
julia = "1.8"
SpeechFeatures = "0.8"
# SpeechCorpora.jl
# SpeechDatasets.jl
A Julia package to download and prepare speech corpora.
......@@ -7,35 +7,44 @@ A Julia package to download and prepare speech corpora.
Make sure to add the [FAST registry](https://gitlab.lisn.upsaclay.fr/fast/registry)
to your Julia installation. Then, install the package as usual:
```
pkg> add SpeechCorpora
pkg> add SpeechDatasets
```
## Example
```
julia> using SpeechCorpora
julia> using SpeechDatasets
julia> corpus = MultilingualLibriSpeech("fra") |> download |> prepare
julia> dataset = MINILIBRISPEECH("outputdir", :train) # :dev | :test
...
# Load the recording manifest.
julia> recs = load(corpus, Recording, "dev") # use "train", "dev" or "test"
julia> dataset = TIMIT("/path/to/timit/dir", "outputdir", :train) # :dev | :test
...
# Load the supervision manifest.
julia> sups = load(corpus, Supervision, "dev") # use "train", "dev" or "test"
julia> dataset = INADIACHRONY("/path/to/ina_wav/dir", "outputdir", "/path/to/ina_csv/dir") # ina_csv dir optional
...
# Load the signal of the first supervision segment
julia> s = first(values(sups))
julia> x, samplerate = load(recs[s.recording_id], s)
julia> dataset = AVID("/path/to/avid/dir", "outputdir")
...
# Play the recording of the first supervision segment
julia> play(recs[s.recording_id], s)
julia> dataset = SPEECH2TEX("/path/to/speech2tex/dir", "outputdir")
...
```
## Author
* Lucas ONDEL YANG (LISN, CNRS)
julia> for ((signal, fs), supervision) in dataset
# do something
end
# Lexicons
julia> CMUDICT("outputfile")
...
julia> TIMITDICT("/path/to/timit/dir")
...
```
## License
This software is provided under the CeCILL 2.1 license (see the [`/LICENSE`](/LICENSE)
This software is provided under the CeCILL 2.1 license (see the [`/LICENSE`](/LICENSE))
# SPDX-License-Identifier: CECILL-2.1
module SpeechCorpora
module SpeechDatasets
using Base64
using HTTP
using JSON
using WAV
using SpeechFeatures
import MLUtils
export
# ManifestItem
FileAudioSource,
CmdAudioSource,
URLAudioSource,
Recording,
Supervision,
Annotation,
load,
# Manifest interface
......@@ -22,28 +18,34 @@ export
# Corpora interface
download,
lang,
name,
prepare,
# Corpora
MultilingualLibriSpeech,
MiniLibriSpeech
MINILIBRISPEECH,
TIMIT,
INADIACHRONY,
AVID,
SPEECH2TEX,
# Lexicon
CMUDICT,
TIMITDICT,
MFAFRDICT,
SPEECH_CORPORA_ROOTDIR = homedir()
"""
setrootdir(path)
Set the root directory where the datasets are stored. Defaults to the user's
home directory.
"""
setrootdir(path) = global SPEECH_CORPORA_ROOTDIR = path
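# A minimal usage sketch (the path is hypothetical): redirect where datasets
# are stored before creating any of them.
#
#     setrootdir("/data/speech")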
# Dataset
dataset
include("speechcorpus.jl")
include("manifest_item.jl")
include("manifest_io.jl")
include("corpora/multilingual_librispeech.jl")
include("corpora/mini_librispeech.jl")
include("dataset.jl")
# Supported corpora
include.("corpora/" .* filter(contains(r".jl$"), readdir(joinpath(@__DIR__, "corpora"))))
include("lexicons.jl")
end
# SPDX-License-Identifier: CECILL-2.1
function avid_recordings(dir::AbstractString)
checkdir(dir)
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
id = filename
path = joinpath(root, file)
audio_src = FileAudioSource(path)
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 16000
)
end
end
recordings
end
function load_metadata_files(dir::AbstractString)
tasksdict = Dict('s' => "SENT", 'p' => "PARA")
metadatadict = Dict(key =>
readlines(joinpath(dir, "Metadata_with_labels_$(tasksdict[key]).csv"))
for key in keys(tasksdict))
return metadatadict
end
function get_metadata(filename, metadatadict)
task = split(filename, "_")[3][1]
headers = metadatadict[task][1]
headers = split(headers, ",")
file_metadata = filter(x -> contains(x, filename), metadatadict[task])[1]
file_metadata = split(file_metadata, ",")
metadata = Dict(
headers[i] => file_metadata[i]
for i = 1:length(headers)
)
return metadata
end
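# Hypothetical illustration (the rows below are made up): the third
# underscore-separated field of the filename starts with 's' (sentence task)
# or 'p' (paragraph task), which selects the metadata file to search.
#
#     md = Dict('s' => ["id,task,label", "rec1_a_s01,read,modal"])
#     get_metadata("rec1_a_s01", md)
#     # -> Dict("id" => "rec1_a_s01", "task" => "read", "label" => "modal")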
function avid_annotations(dir)
checkdir(dir)
annotations = Dict()
metadatadict = load_metadata_files(dir)
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
# extract metadata from csv files
metadata = get_metadata(filename, metadatadict)
id = filename
# generate annotation
annotations[id] = Annotation(
id, # audio id
id, # annotation id
-1, # start and duration of -1 mean that we take
-1, # the whole recording
[1], # only 1 channel (mono recording)
metadata # additional information
)
end
end
annotations
end
function download_avid(dir)
@info "Directory $dir not found.\nDownloading AVID dataset (9.9 GB)"
url = "https://zenodo.org/records/10524873/files/AVID.zip?download=1"
filename = "AVID.zip"
filepath = joinpath(dir,filename)
run(`mkdir -p $dir`)
run(`wget $url -O $filepath`)
@info "Download complete, extracting files"
run(`unzip $filepath -d $dir`)
run(`rm $filepath`)
return joinpath(dir, "AVID")
end
function avid_prepare(datadir, outputdir)
# Validate the data directory
isdir(datadir) || (datadir = download_avid(datadir))
# Create the output directory.
outputdir = mkpath(outputdir)
rm(joinpath(outputdir, "recordings.jsonl"), force=true)
# Recordings
recordings = Array{Dict}(undef, 2)
recordings_path = joinpath(datadir, "Repository 2")
@info "Extracting recordings from $recordings_path"
recordings[1] = avid_recordings(recordings_path)
# Calibration tones
calibtones_path = joinpath(datadir, "Calibration_tones")
@info "Extracting recordings from $calibtones_path"
recordings[2] = avid_recordings(calibtones_path)
for (i, manifestpath) in enumerate([joinpath(outputdir, "recordings.jsonl"), joinpath(outputdir, "calibration_tones.jsonl")])
open(manifestpath, "w") do f
writemanifest(f, recordings[i])
end
end
# Annotations
annotations_path = recordings_path
@info "Extracting annotations from $annotations_path"
annotations = avid_annotations(annotations_path)
manifestpath = joinpath(outputdir, "annotations.jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, annotations)
end
end
function AVID(datadir, outputdir)
if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
isfile(joinpath(outputdir, "calibration_tones.jsonl")) &&
isfile(joinpath(outputdir, "annotations.jsonl")))
avid_prepare(datadir, outputdir)
end
dataset(outputdir, "")
end
# SPDX-License-Identifier: CECILL-2.1
function ina_diachrony_recordings(dir::AbstractString)
checkdir(dir)
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
id = "ina_diachrony§$filename"
path = joinpath(root, file)
audio_src = FileAudioSource(path)
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 16000
)
end
end
recordings
end
function ina_diachrony_get_metadata(filename)
metadata = split(filename, "§")
age, sex = split(metadata[2], "_")
Dict(
"speaker" => metadata[3],
"timeperiod" => metadata[1],
"age" => age,
"sex" => sex,
)
end
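# Hypothetical illustration (field values are made up): filenames are expected
# to follow the pattern "timeperiod§age_sex§speaker".
#
#     ina_diachrony_get_metadata("1950§45_f§dupont")
#     # -> Dict("timeperiod" => "1950", "age" => "45",
#     #         "sex" => "f", "speaker" => "dupont")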
function ina_diachrony_annotations_whole(dir)
checkdir(dir)
annotations = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
# extract metadata from filename
metadata = ina_diachrony_get_metadata(filename)
# extract transcription text (same filename but .txt)
textfilepath = joinpath(root, "$filename.txt")
metadata["text"] = isfile(textfilepath) ? join(readlines(textfilepath), "\n") : ""
id = "ina_diachrony§$filename"
annotation_id = id*"§0"
# generate annotation
annotations[annotation_id] = Annotation(
id, # audio id
annotation_id, # annotation id
-1, # start and duration of -1 mean that we take
-1, # the whole recording
[1], # only 1 channel (mono recording)
metadata # additional information
)
end
end
annotations
end
function ina_diachrony_annotations_csv(dir)
checkdir(dir)
annotations = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".csv" && continue
# extract metadata from filename
metadata = ina_diachrony_get_metadata(filename)
id = "ina_diachrony§$filename"
# generate annotation for each line in csv
open(joinpath(root, file)) do f
header = readline(f)
line = 1
# read till end of file
while ! eof(f)
current_line = readline(f)
start_time, end_time, text = split(current_line, ",", limit=3)
start_time = parse(Float64, start_time)
duration = parse(Float64, end_time)-start_time
metadata["text"] = text
annotation_id = id*"§$line"
annotations[annotation_id] = Annotation(
id, # audio id
annotation_id, # annotation id
start_time, # start
duration, # duration
[1], # only 1 channel (mono recording)
metadata # additional information
)
line += 1
end
end
end
end
annotations
end
function ina_diachrony_prepare(ina_wav_dir, ina_csv_dir, outputdir)
# Validate the data directory
for d in [ina_wav_dir, ina_csv_dir]
isnothing(d) || checkdir(d)
end
# Create the output directory.
outputdir = mkpath(outputdir)
rm(joinpath(outputdir, "recordings.jsonl"), force=true)
# Recordings
@info "Extracting recordings from $ina_wav_dir"
recordings = ina_diachrony_recordings(ina_wav_dir)
manifestpath = joinpath(outputdir, "recordings.jsonl")
open(manifestpath, "w") do f
writemanifest(f, recordings)
end
# Annotations
@info "Extracting annotations from $ina_wav_dir"
annotations = ina_diachrony_annotations_whole(ina_wav_dir)
if ! isnothing(ina_csv_dir)
@info "Extracting annotations from $ina_csv_dir"
csv_annotations = ina_diachrony_annotations_csv(ina_csv_dir)
annotations = merge(annotations, csv_annotations)
end
manifestpath = joinpath(outputdir, "annotations.jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, annotations)
end
end
function INADIACHRONY(ina_wav_dir, outputdir, ina_csv_dir=nothing)
if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
isfile(joinpath(outputdir, "annotations.jsonl")))
ina_diachrony_prepare(ina_wav_dir, ina_csv_dir, outputdir)
end
dataset(outputdir, "")
end
......@@ -11,15 +11,9 @@ const MINILS_SUBSETS = Dict(
"dev" => "dev-clean-2"
)
const MINILS_LANG = "eng"
const MINILS_NAME = "mini_librispeech"
#######################################################################
struct MiniLibriSpeech <: SpeechCorpus
lang
name
struct MINILIBRISPEECH <: SpeechCorpus
recordings
train
dev
......@@ -48,7 +42,7 @@ function minils_recordings(dir, subset)
recs
end
function minils_supervisions(dir, subset)
function minils_annotations(dir, subset)
subsetdir = joinpath(dir, "LibriSpeech", MINILS_SUBSETS[subset])
sups = Dict()
for d1 in readdir(subsetdir; join = true)
......@@ -58,8 +52,12 @@ function minils_supervisions(dir, subset)
open(joinpath(d2, "$(k1)-$(k2).trans.txt"), "r") do f
for line in eachline(f)
tokens = split(line)
s = Supervision(tokens[1], tokens[1]; channel = 1,
data = Dict("text" => join(tokens[2:end], " ")))
s = Annotation(
tokens[1], # annotation id
tokens[1]; # recording id
channels = [1],
data = Dict("text" => join(tokens[2:end], " "))
)
sups[s.id] = s
end
end
......@@ -89,7 +87,7 @@ end
function minils_prepare(dir)
# 1. Recording manifest.
out = joinpath(dir, "recording-manifest.jsonl")
out = joinpath(dir, "recordings.jsonl")
if ! isfile(out)
open(out, "w") do f
for subset in ["train", "dev"]
......@@ -100,12 +98,12 @@ function minils_prepare(dir)
end
end
# 2. Supervision manifests.
for subset in ["train", "dev"]
out = joinpath(dir, "supervision-manifest-$subset.jsonl")
# 2. Annotation manifests.
for (subset, name) in [("train", "train"), ("dev", "dev"), ("dev", "test")]
out = joinpath(dir, "annotations-$name.jsonl")
if ! isfile(out)
@debug "preparing supervision manifest ($subset) $out"
sups = minils_supervisions(dir, subset)
@debug "preparing annotation manifest ($subset) $out"
sups = minils_annotations(dir, subset)
open(out, "w") do f
writemanifest(f, sups)
end
......@@ -113,20 +111,10 @@ function minils_prepare(dir)
end
end
function MiniLibriSpeech(outdir)
dir = joinpath(outdir, MINILS_LANG, MINILS_NAME)
function MINILIBRISPEECH(dir, subset)
minils_download(dir)
minils_prepare(dir)
MiniLibriSpeech(
MINILS_LANG,
MINILS_NAME,
load(Recording, joinpath(dir, "recording-manifest.jsonl")),
load(Supervision, joinpath(dir, "supervision-manifest-train.jsonl")),
load(Supervision, joinpath(dir, "supervision-manifest-dev.jsonl")),
load(Supervision, joinpath(dir, "supervision-manifest-dev.jsonl")),
)
dataset(dir, subset)
end
MiniLibriSpeech() = MiniLibriSpeech(SPEECH_CORPORA_ROOTDIR)
......@@ -89,13 +89,13 @@ function recordings(corpus::MultilingualLibriSpeech, dir, subset)
recs
end
function supervisions(corpus::MultilingualLibriSpeech, dir, subset)
function annotations(corpus::MultilingualLibriSpeech, dir, subset)
trans = joinpath(dir, "mls_$(MLS_LANG_CODE[corpus.lang])", subset, "transcripts.txt")
sups = Dict()
open(trans, "r") do f
for line in eachline(f)
tokens = split(line)
s = Supervision(tokens[1], tokens[1]; channel = 1,
s = Annotation(tokens[1], tokens[1]; channels = [1],
data = Dict("text" => join(tokens[2:end], " ")))
sups[s.id] = s
end
......@@ -118,12 +118,12 @@ function prepare(corpus::MultilingualLibriSpeech, outdir)
end
end
# 2. Supervision manifests.
# 2. Annotation manifests.
for subset in ["train", "dev", "test"]
out = joinpath(dir, "supervision-manifest-$subset.jsonl")
@info "preparing supervision manifest ($subset) $out"
out = joinpath(dir, "annotation-manifest-$subset.jsonl")
@info "preparing annotation manifest ($subset) $out"
if ! isfile(out)
sups = supervisions(corpus, dir, subset)
sups = annotations(corpus, dir, subset)
open(out, "w") do f
writemanifest(f, sups)
end
......
# SPDX-License-Identifier: CECILL-2.1
function speech2tex_recordings(dir::AbstractString)
checkdir(dir)
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
id = filename
path = joinpath(root, file)
audio_src = FileAudioSource(path)
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 48000
)
end
end
recordings
end
extract_digits(str::AbstractString) = filter(isdigit, str)
isnumber(str::AbstractString) = extract_digits(str) == str
function speech2tex_get_metadata(filename)
# possible cases: line123_p1 line123_124_p1 line123_p1_part2 (not observed but also supported: line123_124_p1_part2)
split_name = split(filename, "_")
metadata = Dict()
if isnumber(split_name[2])
metadata["line"] = extract_digits(split_name[1])*"_"*split_name[2]
metadata["speaker"] = split_name[3]
else
metadata["line"] = extract_digits(split_name[1])
metadata["speaker"] = split_name[2]
end
if occursin("part", split_name[end])
metadata["part"] = extract_digits(split_name[end])
end
metadata
end
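# Hypothetical illustration of the cases listed above:
#
#     speech2tex_get_metadata("line123_p1")        # -> line "123", speaker "p1"
#     speech2tex_get_metadata("line123_124_p1")    # -> line "123_124", speaker "p1"
#     speech2tex_get_metadata("line123_p1_part2")  # additionally sets part "2"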
function speech2tex_annotations(audiodir, transcriptiondir, texdir)
checkdir.([audiodir, transcriptiondir, texdir])
annotations = Dict()
for (root, subdirs, files) in walkdir(audiodir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
# extract metadata from the filename
metadata = speech2tex_get_metadata(filename)
# extract transcription and tex (same filenames but .txt)
dirdict = Dict(transcriptiondir => "transcription", texdir => "latex")
for (d, label) in dirdict
textfilepath = joinpath(d, "$filename.txt")
metadata[label] = isfile(textfilepath) ? join(readlines(textfilepath), "\n") : ""
end
id = filename
# generate annotation
annotations[id] = Annotation(
id, # audio id
id, # annotation id
-1, # start and duration of -1 mean that we take
-1, # the whole recording
[1], # only 1 channel (mono recording)
metadata # additional information
)
end
end
annotations
end
function speech2tex_prepare(datadir, outputdir)
# Validate the data directory
checkdir(datadir)
# Create the output directory.
outputdir = mkpath(outputdir)
rm(joinpath(outputdir, "recordings.jsonl"), force=true)
# Recordings
recordings_path = joinpath(datadir, "audio")
@info "Extracting recordings from $recordings_path"
recordings = speech2tex_recordings(recordings_path)
manifestpath = joinpath(outputdir, "recordings.jsonl")
open(manifestpath, "w") do f
writemanifest(f, recordings)
end
# Annotations
transcriptiondir = joinpath(datadir, "sequences")
texdir = joinpath(datadir, "latex")
@info "Extracting annotations from $transcriptiondir and $texdir"
annotations = speech2tex_annotations(recordings_path, transcriptiondir, texdir)
manifestpath = joinpath(outputdir, "annotations.jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, annotations)
end
end
function SPEECH2TEX(datadir, outputdir)
if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
isfile(joinpath(outputdir, "annotations.jsonl")))
speech2tex_prepare(datadir, outputdir)
end
dataset(outputdir, "")
end
# SPDX-License-Identifier: CECILL-2.1
#######################################################################
const TIMIT_SUBSETS = Dict(
"train" => "train",
"dev" => "dev",
"test" => "test"
)
const TIMIT_DEV_SPK_LIST = Set([
"faks0",
"fdac1",
"fjem0",
"mgwt0",
"mjar0",
"mmdb1",
"mmdm2",
"mpdf0",
"fcmh0",
"fkms0",
"mbdg0",
"mbwm0",
"mcsh0",
"fadg0",
"fdms0",
"fedw0",
"mgjf0",
"mglb0",
"mrtk0",
"mtaa0",
"mtdt0",
"mthc0",
"mwjg0",
"fnmr0",
"frew0",
"fsem0",
"mbns0",
"mmjr0",
"mdls0",
"mdlf0",
"mdvc0",
"mers0",
"fmah0",
"fdrw0",
"mrcs0",
"mrjm4",
"fcal1",
"mmwh0",
"fjsj0",
"majc0",
"mjsw0",
"mreb0",
"fgjd0",
"fjmg0",
"mroa0",
"mteb0",
"mjfc0",
"mrjr0",
"fmml0",
"mrws1"
])
const TIMIT_TEST_SPK_LIST = Set([
"mdab0",
"mwbt0",
"felc0",
"mtas1",
"mwew0",
"fpas0",
"mjmp0",
"mlnt0",
"fpkt0",
"mlll0",
"mtls0",
"fjlm0",
"mbpm0",
"mklt0",
"fnlp0",
"mcmj0",
"mjdh0",
"fmgd0",
"mgrt0",
"mnjm0",
"fdhc0",
"mjln0",
"mpam0",
"fmld0"
])
TIMIT_PHONE_MAP48 = Dict(
"aa" => "aa",
"ae" => "ae",
"ah" => "ah",
"ao" => "ao",
"aw" => "aw",
"ax" => "ax",
"ax-h" => "ax",
"axr" => "er",
"ay" => "ay",
"b" => "b",
"bcl" => "vcl",
"ch" => "ch",
"d" => "d",
"dcl" => "vcl",
"dh" => "dh",
"dx" => "dx",
"eh" => "eh",
"el" => "el",
"em" => "m",
"en" => "en",
"eng" => "ng",
"epi" => "epi",
"er" => "er",
"ey" => "ey",
"f" => "f",
"g" => "g",
"gcl" => "vcl",
"h#" => "sil",
"hh" => "hh",
"hv" => "hh",
"ih" => "ih",
"ix" => "ix",
"iy" => "iy",
"jh" => "jh",
"k" => "k",
"kcl" => "cl",
"l" => "l",
"m" => "m",
"n" => "n",
"ng" => "ng",
"nx" => "n",
"ow" => "ow",
"oy" => "oy",
"p" => "p",
"pau" => "sil",
"pcl" => "cl",
"q" => "",
"r" => "r",
"s" => "s",
"sh" => "sh",
"t" => "t",
"tcl" => "cl",
"th" => "th",
"uh" => "uh",
"uw" => "uw",
"ux" => "uw",
"v" => "v",
"w" => "w",
"y" => "y",
"z" => "z",
"zh" => "zh"
)
TIMIT_PHONE_MAP39 = Dict(
"aa" => "aa",
"ae" => "ae",
"ah" => "ah",
"ao" => "aa",
"aw" => "aw",
"ax" => "ah",
"ax-h" => "ah",
"axr" => "er",
"ay" => "ay",
"b" => "b",
"bcl" => "sil",
"ch" => "ch",
"d" => "d",
"dcl" => "sil",
"dh" => "dh",
"dx" => "dx",
"eh" => "eh",
"el" => "l",
"em" => "m",
"en" => "n",
"eng" => "ng",
"epi" => "sil",
"er" => "er",
"ey" => "ey",
"f" => "f",
"g" => "g",
"gcl" => "sil",
"h#" => "sil",
"hh" => "hh",
"hv" => "hh",
"ih" => "ih",
"ix" => "ih",
"iy" => "iy",
"jh" => "jh",
"k" => "k",
"kcl" => "sil",
"l" => "l",
"m" => "m",
"n" => "n",
"ng" => "ng",
"nx" => "n",
"ow" => "ow",
"oy" => "oy",
"p" => "p",
"pau" => "sil",
"pcl" => "sil",
"q" => "",
"r" => "r",
"s" => "s",
"sh" => "sh",
"t" => "t",
"tcl" => "sil",
"th" => "th",
"uh" => "uh",
"uw" => "uw",
"ux" => "uw",
"v" => "v",
"w" => "w",
"y" => "y",
"z" => "z",
"zh" => "sh"
)
#######################################################################
function timit_prepare(timitdir, dir; audio_fmt="SPHERE")
# Validate the data directory
! isdir(timitdir) && throw(ArgumentError("invalid path $(timitdir)"))
# Create the output directory.
dir = mkpath(dir)
rm(joinpath(dir, "recordings.jsonl"), force=true)
## Recordings
@info "Extracting recordings from $timitdir/train"
train_recordings = timit_recordings(joinpath(timitdir, "train"); fmt=audio_fmt)
# We extract the name of speakers that are not in the dev set
TIMIT_TRAIN_SPK_LIST = Set()
for id in keys(train_recordings)
_, spk, _ = split(id, "_")
if spk ∉ TIMIT_DEV_SPK_LIST
push!(TIMIT_TRAIN_SPK_LIST, spk)
end
end
@info "Extracting recordings from $timitdir/test"
test_recordings = timit_recordings(joinpath(timitdir, "test"); fmt=audio_fmt)
recordings = merge(train_recordings, test_recordings)
manifestpath = joinpath(dir, "recordings.jsonl")
open(manifestpath, "a") do f
writemanifest(f, recordings)
end
# Annotations
@info "Extracting annotations from $timitdir/train"
train_annotations = timit_annotations(joinpath(timitdir, "train"))
@info "Extracting annotations from $timitdir/test"
test_annotations = timit_annotations(joinpath(timitdir, "test"))
annotations = merge(train_annotations, test_annotations)
train_annotations = filter(annotations) do (k, v)
stype = v.data["sentence type"]
spk = v.data["speaker"]
(
(stype == "compact" || stype == "diverse") &&
spk ∈ TIMIT_TRAIN_SPK_LIST
)
end
dev_annotations = filter(annotations) do (k, v)
stype = v.data["sentence type"]
spk = v.data["speaker"]
(
(stype == "compact" || stype == "diverse") &&
spk ∈ TIMIT_DEV_SPK_LIST
)
end
test_annotations = filter(annotations) do (k, v)
stype = v.data["sentence type"]
spk = v.data["speaker"]
(
(stype == "compact" || stype == "diverse") &&
spk ∈ TIMIT_TEST_SPK_LIST
)
end
for (x, y) in ("train" => train_annotations,
"dev" => dev_annotations,
"test" => test_annotations)
manifestpath = joinpath(dir, "annotations-$(x).jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, y)
end
end
end
function timit_recordings(dir::AbstractString; fmt="SPHERE")
! isdir(dir) && throw(ArgumentError("expected directory $dir"))
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
name, ext = splitext(file)
ext != ".wav" && continue
spk = basename(root)
path = joinpath(root, file)
id = "timit_$(spk)_$(name)"
audio_src = if fmt == "SPHERE"
CmdAudioSource(`sph2pipe -f wav $path`)
else
FileAudioSource(path)
end
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 16000
)
end
end
recordings
end
function timit_annotations(dir)
! isdir(dir) && throw(ArgumentError("expected directory $dir"))
splitline(line) = rsplit(line, limit=3)
annotations = Dict()
processed = Set()
for (root, subdirs, files) in walkdir(dir)
for file in files
name, ext = splitext(file)
_, dialect, spk = rsplit(root, "/", limit=3)
# Annotation files already processed (".wrd" and ".phn")
idtuple = (dialect, spk, name)
(idtuple in processed) && continue
push!(processed, (dialect, spk, name))
# Words
wpath = joinpath(root, name * ".wrd")
words = [last(split(line)) for line in eachline(wpath)]
# Phones
ppath = joinpath(root, name * ".phn")
palign = Tuple{Int,Int,String}[]
for line in eachline(ppath)
t0, t1, p = split(line)
push!(palign, (parse(Int, t0), parse(Int, t1), String(p)))
end
sentence_type = if startswith(name, "sa")
"dialect"
elseif startswith(name, "sx")
"compact"
else # startswith(name, "si")
"diverse"
end
id = "timit_$(spk)_$(name)"
annotations[id] = Annotation(
id, # recording id and annotation id are the same since we have
id, # a one-to-one mapping
-1, # start and duration of -1 mean that we take
-1, # the whole recording
[1], # only 1 channel (mono recording)
Dict(
"text" => join(words, " "),
"sentence type" => sentence_type,
"alignment" => palign,
"dialect" => dialect,
"speaker" => spk,
"sex" => string(first(spk)),
)
)
end
end
annotations
end
function TIMIT(timitdir, dir, subset)
if ! (isfile(joinpath(dir, "recordings.jsonl")) &&
isfile(joinpath(dir, "annotations-train.jsonl")) &&
isfile(joinpath(dir, "annotations-dev.jsonl")) &&
isfile(joinpath(dir, "annotations-test.jsonl")))
timit_prepare(timitdir, dir)
end
dataset(dir, subset)
end
# SPDX-License-Identifier: CECILL-2.1
struct SpeechDataset <: MLUtils.AbstractDataContainer
idxs::Vector{AbstractString}
annotations::Dict{AbstractString, Annotation}
recordings::Dict{AbstractString, Recording}
end
"""
dataset(manifestroot)
Load `SpeechDataset` from manifest files stored in `manifestroot`.
Each item of the dataset is a nested tuple `((samples, sampling_rate), Annotation.data)`.
See also [`Annotation`](@ref).
# Examples
```julia-repl
julia> ds = dataset("./manifests", :train)
SpeechDataset(
...
)
julia> ds[1]
(
(samples=[...], sampling_rate=16_000),
Dict(
"text" => "Annotation text here"
)
)
```
"""
function dataset(manifestroot::AbstractString, partition)
partition_name = partition == "" ? "" : "-$(partition)"
annot_path = joinpath(manifestroot, "annotations$(partition_name).jsonl")
rec_path = joinpath(manifestroot, "recordings.jsonl")
annotations = load(Annotation, annot_path)
recordings = load(Recording, rec_path)
dataset(annotations, recordings)
end
function dataset(annotations::AbstractDict, recordings::AbstractDict)
idxs = collect(keys(annotations))
SpeechDataset(idxs, annotations, recordings)
end
Base.getindex(d::SpeechDataset, key::AbstractString) = d.recordings[key], d.annotations[key]
Base.getindex(d::SpeechDataset, idx::Integer) = getindex(d, d.idxs[idx])
# Fix1 -> partial function with the 1st argument fixed
Base.getindex(d::SpeechDataset, idxs::AbstractVector) = map(Base.Fix1(getindex, d), idxs)
Base.length(d::SpeechDataset) = length(d.idxs)
function Base.filter(fn, d::SpeechDataset)
fidxs = filter(d.idxs) do i
fn((d.recordings[i], d.annotations[i]))
end
idset = Set(fidxs)
fannotations = filter(d.annotations) do (k, v)
k ∈ idset
end
frecs = filter(d.recordings) do (k, v)
k ∈ idset
end
SpeechDataset(fidxs, fannotations, frecs)
end
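# A minimal usage sketch (paths and ids are hypothetical): keep only the
# segments of one recording, then iterate mini-batches with MLUtils.
#
#     ds  = dataset("outputdir", "train")
#     sub = filter(((rec, ann),) -> ann.recording_id == "timit_faks0_sa1", ds)
#     for batch in MLUtils.DataLoader(sub; batchsize = 4)
#         # ...
#     end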
# SPDX-License-Identifier: CECILL-2.1
const CMUDICT_URL = "http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/sphinxdict/cmudict_SPHINX_40"
const FRMFA_DICT_URL = "https://raw.githubusercontent.com/MontrealCorpusTools/mfa-models/main/dictionary/french/mfa/french_mfa.dict"
function normalizeword(word)
String(uppercase(word))
end
function normalizephoneme(phoneme)
String(uppercase(phoneme))
end
"""
CMUDICT(path)
Return the pronunciation dictionary loaded from the CMU Sphinx dictionary.
The CMU dictionary will be downloaded and stored at `path`. Subsequent
calls will only read the file at `path` without downloading the data again.
"""
function CMUDICT(path)
if ! isfile(path)
mkpath(dirname(path))
dir = mktempdir()
run(`wget -P $dir $CMUDICT_URL`)
mv(joinpath(dir, "cmudict_SPHINX_40"), path)
end
lexicon = Dict()
open(path, "r") do f
for line in eachline(f)
word, pron... = split(line)
word = replace(word, "(1)" => "", "(2)" => "", "(3)" => "", "(4)" => "")
pronunciations = get(lexicon, word, [])
push!(pronunciations, pron)
lexicon[word] = pronunciations
end
end
lexicon
end
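# A short usage sketch (the path and the word are illustrative):
#
#     lexicon = CMUDICT("outputdir/cmudict.txt")  # downloads on the first call
#     lexicon["HELLO"]  # -> vector of pronunciations, each a vector of phonemes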
"""
TIMITDICT(timitdir)
Return the pronunciation dictionary provided with the TIMIT corpus (located
in `timitdir`).
"""
function TIMITDICT(timitdir)
dictfile = joinpath(timitdir, "doc", "timitdic.txt")
iscomment(line) = first(line) == ';'
lexicon = Dict{String,Vector{Vector{String}}}()
for line in eachline(dictfile)
iscomment(line) && continue
word, pron = split(line, limit=2)
pron = strip(pron, ['/', '\t', ' '])
word = '~' in word ? split(word, "~", limit=2)[1] : word
word = normalizeword(word)
pron = normalizephoneme.(split(pron))
pronunciations = get(lexicon, word, Vector{String}[])
push!(pronunciations, pron)
lexicon[word] = pronunciations
end
lexicon
end
"""
MFAFRDICT(path)
Return the French pronunciation dictionary as provided by MFA (french_mfa v2.0.0a).
"""
function MFAFRDICT(path)
if ! isfile(path)
mkpath(dirname(path))
dir = mktempdir()
run(`wget -P $dir $FRMFA_DICT_URL`)
mv(joinpath(dir, "french_mfa.dict"), path)
end
lexicon = Dict()
open(path, "r") do f
for line in eachline(f)
word, pron... = split(line)
pronunciations = get(lexicon, word, [])
push!(pronunciations, pron)
lexicon[word] = pronunciations
end
end
lexicon
end
\ No newline at end of file
# SPDX-License-Identifier: CECILL-2.1
#=====================================================================#
# HTML pretty display
function Base.show(io::IO, ::MIME"text/html", r::AbstractAudioSource)
print(io, "<audio controls ")
print(io, "src=\"data:audio/wav;base64,")
x, s, _ = loadsource(r, :)
iob64_encode = Base64EncodePipe(io)
wavwrite(x, iob64_encode, Fs = s, nbits = 8, compression = WAV.WAVE_FORMAT_PCM)
close(iob64_encode)
println(io, "\" />")
end
#=====================================================================#
# JSON serialization of a manifest item
......@@ -68,21 +53,21 @@ function Base.show(io::IO, m::MIME"application/json", r::Recording)
print(io, "}")
end
function Base.show(io::IO, m::MIME"application/json", s::Supervision)
function Base.show(io::IO, m::MIME"application/json", a::Annotation)
compact = get(io, :compact, false)
indent = compact ? 0 : 2
printfn = compact ? print : println
printfn(io, "{")
printfn(io, repeat(" ", indent), "\"id\": \"", s.id, "\", ")
printfn(io, repeat(" ", indent), "\"recording_id\": \"", s.recording_id, "\", ")
printfn(io, repeat(" ", indent), "\"start\": ", s.start, ", ")
printfn(io, repeat(" ", indent), "\"duration\": ", s.duration, ", ")
printfn(io, repeat(" ", indent), "\"channel\": ", s.channel, ", ")
printfn(io, repeat(" ", indent), "\"data\": ", s.data |> json)
printfn(io, repeat(" ", indent), "\"id\": \"", a.id, "\", ")
printfn(io, repeat(" ", indent), "\"recording_id\": \"", a.recording_id, "\", ")
printfn(io, repeat(" ", indent), "\"start\": ", a.start, ", ")
printfn(io, repeat(" ", indent), "\"duration\": ", a.duration, ", ")
printfn(io, repeat(" ", indent), "\"channels\": ", a.channels |> json, ", ")
printfn(io, repeat(" ", indent), "\"data\": ", a.data |> json)
print(io, "}")
end
function JSON.json(r::Union{Recording, Supervision}; compact = true)
function JSON.json(r::Union{Recording, Annotation}; compact = true)
out = IOBuffer()
show(IOContext(out, :compact => compact), MIME("application/json"), r)
String(take!(out))
......@@ -111,12 +96,12 @@ Recording(d::Dict) = Recording(
d["samplerate"]
)
Supervision(d::Dict) = Supervision(
Annotation(d::Dict) = Annotation(
d["id"],
d["recording_id"],
d["start"],
d["duration"],
d["channel"],
d["channels"],
d["data"]
)
......@@ -139,13 +124,18 @@ function readmanifest(io::IO, T)
manifest
end
manifestname(T::Type{<:Recording}, subset) = "recording-manifest-$(subset).jsonl"
manifestname(T::Type{<:Supervision}, subset) = "supervision-manifest-$(subset).jsonl"
# Some utilities
manifestname(::Type{<:Recording}, name) = "recordings.jsonl"
manifestname(::Type{<:Annotation}, name) = "annotations-$name.jsonl"
load(T::Type{<:Union{Recording,Supervision}}, path::AbstractString) =
open(f -> readmanifest(f, T), path, "r")
load(corpus::SpeechCorpus, dir, T, subset) =
load(T, joinpath(path(corpus, dir), manifestname(T, subset)))
load(corpus::SpeechCorpus, T, subset) =
load(corpus, corporadir, T, subset)
"""
load(Annotation, path)
load(Recording, path)
Load a `Recording`/`Annotation` manifest from `path`.
"""
load(T::Type{<:Union{Recording, Annotation}}, path) = open(f -> readmanifest(f, T), path, "r")
function checkdir(dir::AbstractString)
isdir(dir) || throw(ArgumentError("$dir is not an existing directory"))
end
# SPDX-License-Identifier: CECILL-2.1
"""
abstract type AbstractAudioSource end
Base type for all audio sources. Possible audio sources are:
* `FileAudioSource`
* `URLAudioSource`
* `CmdAudioSource`
You can load the data of an audio source with the internal function
loadsource(s::AbstractAudioSource, subrange)
"""
abstract type AbstractAudioSource end
struct FileAudioSource <: AbstractAudioSource
path::AbstractString
end
struct URLAudioSource <: AbstractAudioSource
url::AbstractString
end
struct CmdAudioSource <: AbstractAudioSource
cmd
end
CmdAudioSource(c::String) = CmdAudioSource(Cmd(String.(split(c))))
loadsource(s::FileAudioSource, subrange) = wavread(s.path; subrange)
loadsource(s::URLAudioSource, subrange) = wavread(IOBuffer(HTTP.get(s.url).body); subrange)
loadsource(s::CmdAudioSource, subrange) = wavread(IOBuffer(read(pipeline(s.cmd))); subrange)
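# A minimal sketch (the path is hypothetical): every source type answers the
# same loadsource call defined just above, mirroring WAV.wavread's return values.
#
#     src = FileAudioSource("/path/to/audio.wav")
#     x, sr, _, _ = loadsource(src, :)        # the whole file
#     y, _, _, _  = loadsource(src, 1:16000)  # only the first 16000 samples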
"""
abstract type ManifestItem end
......@@ -71,7 +39,7 @@ end
function Recording(uttid, s::AbstractAudioSource; channels = missing, samplerate = missing)
if ismissing(channels) || ismissing(samplerate)
x, sr = loadsource(s, :)
x, sr = loadaudio(s)
samplerate = ismissing(samplerate) ? Int(sr) : samplerate
channels = ismissing(channels) ? collect(1:size(x,2)) : channels
end
......@@ -79,47 +47,49 @@ function Recording(uttid, s::AbstractAudioSource; channels = missing, samplerate
end
"""
struct Supervision <: ManifestItem
struct Annotation <: ManifestItem
id::AbstractString
recording_id::AbstractString
start::Float64
duration::Float64
channel::Int
channels::Union{Vector, Colon}
data::Dict
end
A "supervision" defines a segment of a recording on a single channel.
An "annotation" defines a segment of a recording on a single channel.
The `data` field is an arbitrary dictionary holding the nature of the
supervision.
annotation. `start` and `duration` (in seconds) define
where the segment is located within the recording `recording_id`.
# Constructor
Supervision(id, recording_id, start, duration, channel, data)
Supervision(id, recording_id[; channel = missing, start = -1, duration = -1, data = missing])
Annotation(id, recording_id, start, duration, channels, data)
Annotation(id, recording_id[; channels = missing, start = -1, duration = -1, data = missing])
If `start` and/or `duration` are negative, the segment is considered to
be the whole sequence length of the recording.
"""
struct Supervision <: ManifestItem
struct Annotation <: ManifestItem
id::AbstractString
recording_id::AbstractString
start::Float64
duration::Float64
channel::Int
channels::Union{Vector, Colon}
data::Dict
end
Supervision(id, recid; channel = missing, start = -1, duration = -1, data = missing) =
Supervision(id, recid, start, duration, channel, data)
Annotation(id, recid; channels = missing, start = -1, duration = -1, data = missing) =
Annotation(id, recid, start, duration, channels, data)
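# A short illustration (ids and text are made up): annotate the whole first
# channel of recording "rec1".
#
#     ann = Annotation("rec1§0", "rec1"; channels = [1],
#                      data = Dict("text" => "hello world"))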
"""
load(recording[; start = -1, duration = -1, channels = recording.channels])
load(recording, supervision)
load(recording, annotation)
Load the signal from a recording. `start`, `duration` (in seconds) can
be used to load only a segment. If a `supervision` is given, function
be used to load only a segment. If an `annotation` is given, the function
will return only the portion of the signal corresponding to the
supervision segment.
annotation segment.
The function returns a tuple `(x, sr)` where `x` is an ``N×C`` array
- ``N`` is the length of the signal and ``C`` is the number of channels
......@@ -134,10 +104,9 @@ function load(r::Recording; start = -1, duration = -1, channels = r.channels)
subrange = (:)
end
x, sr, _, _ = loadsource(r.source, subrange)
x, sr = loadaudio(r.source, subrange)
x[:,channels], sr
end
load(r::Recording, s::Supervision) =
load(r; start = s.start, duration = s.duration, channels = [s.channel])
load(r::Recording, a::Annotation) = load(r; start = a.start, duration = a.duration, channels = a.channels)
# SPDX-License-Identifier: CECILL-2.1
"""
abstract type SpeechCorpus
abstract type SpeechCorpus end
Abstract type for all speech corpora.
"""
abstract type SpeechCorpus end
"""
lang(corpus)
Return the ISO 639-3 code of the language of the corpus.
"""
path(corpus)
lang
Path to the directory where the corpus' data is stored.
"""
name(corpus)
Return the name identifier of the corpus.
"""
path(corpus::SpeechCorpus, dir) = joinpath(dir, corpus.lang, corpus.name)
name
"""
download(corpus[, dir = homedir()])
download(corpus, rootdir)
Download the data of the corpus to `dir`.
"""
Base.download(corpus::SpeechCorpus) = download(corpus, SPEECH_CORPORA_ROOTDIR)
Base.download
"""
prepare(corpus[, dir = homedir()])
prepare(corpus, rootdir)
Prepare the manifests of the corpus in `dir`.
Prepare the manifests of the corpus.
"""
prepare(corpus::SpeechCorpus) = prepare(corpus, SPEECH_CORPORA_ROOTDIR)
prepare