Compare revisions

Changes are shown as if the source revision was being merged into the target revision.

56 commits on source · 16 files changed · +1238 −151
Files

.gitignore

0 → 100644
+3 −0
*outputdir/
Manifest.toml
notebook-test.jl
+70 −0
# Tags
## [0.15.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.15.0) - 19/06/2024
### Changed
- Added support for the Speech2Tex dataset

## [0.14.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.14.0) - 11/06/2024
### Changed
- Added support for the AVID dataset

## [0.13.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.13.0) - 10/06/2024
### Changed
- Added support for the INA Diachrony dataset
### Fixed
- Fixed the Minilibrispeech data preparation

## [0.12.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.12.0) - 21/05/2024
### Changed
- `SpeechDataset` is now a collection of `(Recording, Annotation)` tuples.

## [0.11.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.11.0) - 21/05/2024
### Added
- Filtering a speech dataset by recording id.
### Improved
- Faster TIMIT preparation

## [0.10.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.10.0) - 22/02/2024
### Added
- Extraction of alignments from TIMIT
### Changed
- `Supervision` is now `Annotation`

## [0.9.4](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.4) - 22/02/2024
### Fixed
- TIMIT data preparation

## [0.9.3](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.3) - 12/02/2024
### Fixed
- `CMUDICT("dir/path")` fails if `dir` does not already exist.

## [0.9.2](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.2) - 09/02/2024
### Fixed
- Invalid type for field `channels` of `Recording`
- `MINILIBRISPEECH` broken

## [0.9.1](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.1) - 09/02/2024
### Fixed
- Not possible to use `:` as channel specifier

## [0.9.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.0) - 09/02/2024
### Changed
- `TIMIT` and `MINILIBRISPEECH` directly create the `dataset`
### Added
- CMU and TIMIT lexicons

## [0.8.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.8.0) - 02/02/2024
### Features
- New `dataset` function, which builds a `SpeechDataset` from manifest files
- Compatibility with `MLUtils.DataLoader`

## [0.7.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.7.0) - 14/12/2023
### Changed
- Refactored API; the TIMIT dataset works (but Librispeech no longer does)

## [0.6.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.6.0) - 28/09/2023
### Added
- Raw audio data source

## [0.5.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.5.0) - 25/09/2023
### Added
- Data can be loaded directly from an audio source with the `load` function.

## [0.4.1](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.4.1) - 25/09/2023
### Added
+11 −9
name = "SpeechCorpora"
uuid = "3225a15e-d855-4a07-9546-2418058331ae"
authors = ["Lucas ONDEL YANG <lucas.ondel@cnrs.fr>"]
version = "0.4.1"
name = "SpeechDatasets"
uuid = "ae813453-fab8-46d9-ab8f-a64c05464021"
authors = ["Lucas ONDEL YANG <lucas.ondel@cnrs.fr>",
           "Simon DEVAUCHELLE <simon.devauchelle@universite-paris-saclay.fr>",
           "Nicolas DENIER <nicolas.denier@lisn.fr>"]
version = "0.15.0"

[deps]
Base64 = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
WAV = "8149f6b0-98f6-5db9-b78f-408fbbb8ef88"
MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
SpeechFeatures = "6f3487c4-5ca2-4050-bfeb-2cf56df92307"

[compat]
julia = "1.10"
JSON = "0.21"
WAV = "1.2"
julia = "1.8"
SpeechFeatures = "0.8"
+26 −17
# SpeechCorpora.jl
# SpeechDatasets.jl

A Julia package to download and prepare speech corpora.

@@ -7,35 +7,44 @@ A Julia package to download and prepare speech corpus.
Make sure to add the [FAST registry](https://gitlab.lisn.upsaclay.fr/fast/registry)
to your julia installation. Then, install the package as usual:
```
pkg> add SpeechCorpora
pkg> add SpeechDatasets
```

## Example

```
julia> using SpeechCorpora
julia> using SpeechDatasets

julia> corpus = MultilingualLibriSpeech("fra") |> download |> prepare
julia> dataset = MINILIBRISPEECH("outputdir", :train) # :dev | :test
...

# Load the recording manifest.
julia> recs = load(corpus, Recording, "dev") # use "train", "dev" or "test"
julia> dataset = TIMIT("/path/to/timit/dir", "outputdir", :train) # :dev | :test
...

# Load the supervision manifest.
julia> sups = load(corpus, Supervision, "dev") # use "train", "dev" or "test"
julia> dataset = INADIACHRONY("/path/to/ina_wav/dir", "outputdir", "/path/to/ina_csv/dir") # ina_csv dir optional
...

# Load the signal of the first supervision segment
julia> s = first(values(sups))
julia> x, samplerate = load(recs[s.recording_id], s)
julia> dataset = AVID("/path/to/avid/dir", "outputdir")
...

# Play the recording of the first supervision segment
julia> play(recs[s.recording_id], s)
julia> dataset = SPEECH2TEX("/path/to/speech2tex/dir", "outputdir")
...

```

## Author
* Lucas ONDEL YANG (LISN, CNRS)
julia> for ((signal, fs), supervision) in dataset
           # do something
       end

# Lexicons
julia> CMUDICT("outputfile")
...

julia> TIMITDICT("/path/to/timit/dir")
...

```

## License

This software is provided under the CeCILL 2.1 license (see the [`/LICENSE`](/LICENSE)
This software is provided under the CeCILL 2.1 license (see the [`/LICENSE`](/LICENSE) file).
# SPDX-License-Identifier: CECILL-2.1

module SpeechCorpora
module SpeechDatasets

using Base64
using HTTP
using JSON
using WAV
using SpeechFeatures
import MLUtils

export
    # ManifestItem
    FileAudioSource,
    CmdAudioSource,
    URLAudioSource,
    Recording,
    Supervision,
    Annotation,
    load,

    # Manifest interface
@@ -22,28 +18,34 @@ export

    # Corpora interface
    download,
    lang,
    name,
    prepare,

    # Corpora
    MultilingualLibriSpeech,
    MiniLibriSpeech
    MINILIBRISPEECH,
    TIMIT,
    INADIACHRONY,
    AVID,
    SPEECH2TEX,

    # Lexicon
    CMUDICT,
    TIMITDICT,
    MFAFRDICT,


SPEECH_CORPORA_ROOTDIR = homedir()

"""
    setrootdir(path)

Set the root directory where to store the datasets. Default to the user
home directory.
"""
setrootdir(path) = global SPEECH_CORPORA_ROOTDIR = path
    # Dataset
    dataset

include("speechcorpus.jl")
include("manifest_item.jl")
include("manifest_io.jl")
include("corpora/multilingual_librispeech.jl")
include("corpora/mini_librispeech.jl")
include("dataset.jl")

# Supported corpora
include.("corpora/" .* filter(contains(r"\.jl$"), readdir(joinpath(@__DIR__, "corpora"))))

include("lexicons.jl")

end

src/corpora/avid.jl

0 → 100644
+140 −0
# SPDX-License-Identifier: CECILL-2.1

function avid_recordings(dir::AbstractString)
    checkdir(dir)

    recordings = Dict()
    for (root, subdirs, files) in walkdir(dir)
        for file in files
            filename, ext = splitext(file)
            ext != ".wav" && continue
            
            id = filename
            path = joinpath(root, file)

            audio_src = FileAudioSource(path)

            recordings[id] = Recording(
                id,
                audio_src;
                channels = [1],
                samplerate = 16000
            )
        end
    end
    recordings
end


function load_metadata_files(dir::AbstractString)
    tasksdict = Dict('s' => "SENT", 'p' => "PARA")
    metadatadict = Dict(key => 
        readlines(joinpath(dir, "Metadata_with_labels_$(tasksdict[key]).csv")) 
        for key in keys(tasksdict))
    return metadatadict
end


function get_metadata(filename, metadatadict)
    task = split(filename, "_")[3][1]
    headers = metadatadict[task][1]
    headers = split(headers, ",")
    file_metadata = filter(x -> contains(x, filename), metadatadict[task])[1]
    file_metadata = split(file_metadata, ",")
    metadata = Dict(
        headers[i] => file_metadata[i]
        for i = 1:length(headers)
    )
    return metadata
end
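The column pairing that `get_metadata` performs can be sketched in isolation; the header and row below are invented examples shaped like the AVID metadata CSV, not data from this diff:

```julia
# Invented header and data row; get_metadata pairs them column by column.
headers = split("filename,task,label", ",")
row     = split("f1_a_s01,SENT,modal", ",")
metadata = Dict(zip(headers, row))
```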


function avid_annotations(dir)
    checkdir(dir)

    annotations = Dict()
    metadatadict = load_metadata_files(dir)

    for (root, subdirs, files) in walkdir(dir)
        for file in files
            filename, ext = splitext(file)
            ext != ".wav" && continue
            
            # extract metadata from csv files
            metadata = get_metadata(filename, metadatadict)
            
            id = filename
            # generate annotation
            annotations[id] = Annotation(
                id, # audio id
                id, # annotation id
                -1,  # a start and duration of -1 mean that the whole
                -1,  # recording is taken
                [1], # only 1 channel (mono recording)
                metadata # additional information
            )
        end
    end
    annotations
end


function download_avid(dir)
    @info "Directory $dir not found.\nDownloading AVID dataset (9.9 GB)"
    url = "https://zenodo.org/records/10524873/files/AVID.zip?download=1"
    filename = "AVID.zip"
    filepath = joinpath(dir, filename)
    mkpath(dir)
    run(`wget $url -O $filepath`)
    @info "Download complete, extracting files"
    run(`unzip $filepath -d $dir`)
    rm(filepath)
    return joinpath(dir, "AVID")
end


function avid_prepare(datadir, outputdir)
    # Validate the data directory
    isdir(datadir) || (datadir = download_avid(datadir))

    # Create the output directory.
    outputdir = mkpath(outputdir)
    rm(joinpath(outputdir, "recordings.jsonl"), force=true)

    # Recordings
    recordings = Array{Dict}(undef, 2)
    recordings_path = joinpath(datadir, "Repository 2")
    @info "Extracting recordings from $recordings_path"
    recordings[1] = avid_recordings(recordings_path)
    # Calibration tones
    calibtones_path = joinpath(datadir, "Calibration_tones")
    @info "Extracting recordings from $calibtones_path"
    recordings[2] = avid_recordings(calibtones_path)

    for (i, manifestpath) in enumerate([joinpath(outputdir, "recordings.jsonl"), joinpath(outputdir, "calibration_tones.jsonl")])
        open(manifestpath, "w") do f
            writemanifest(f, recordings[i])
        end
    end

    # Annotations
    annotations_path = recordings_path
    @info "Extracting annotations from $annotations_path"
    annotations = avid_annotations(annotations_path)
        
    manifestpath = joinpath(outputdir, "annotations.jsonl")
    @info "Creating $manifestpath"
    open(manifestpath, "w") do f
        writemanifest(f, annotations)
    end
end


function AVID(datadir, outputdir)
    if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
          isfile(joinpath(outputdir, "calibration_tones.jsonl")) &&
          isfile(joinpath(outputdir, "annotations.jsonl")))
        avid_prepare(datadir, outputdir)
    end
    dataset(outputdir, "")
end
+160 −0
# SPDX-License-Identifier: CECILL-2.1

function ina_diachrony_recordings(dir::AbstractString)
    checkdir(dir)

    recordings = Dict()
    for (root, subdirs, files) in walkdir(dir)
        for file in files
            filename, ext = splitext(file)
            ext != ".wav" && continue
            
            id = "ina_diachrony§$filename"
            path = joinpath(root, file)

            audio_src = FileAudioSource(path)

            recordings[id] = Recording(
                id,
                audio_src;
                channels = [1],
                samplerate = 16000
            )
        end
    end
    recordings
end


function ina_diachrony_get_metadata(filename)
    metadata = split(filename, "§")
    age, sex = split(metadata[2], "_")
    Dict(
        "speaker" => metadata[3],
        "timeperiod" => metadata[1],
        "age" => age,
        "sex" => sex,
    )
end
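As a self-contained illustration of the `timeperiod§age_sex§speaker` filename convention this function assumes (the example name is invented):

```julia
# Invented filename following the "timeperiod§age_sex§speaker" pattern
# that ina_diachrony_get_metadata expects.
function parse_ina_name(filename)
    timeperiod, age_sex, speaker = split(filename, "§")
    age, sex = split(age_sex, "_")
    Dict("speaker" => speaker, "timeperiod" => timeperiod,
         "age" => age, "sex" => sex)
end

parse_ina_name("1970-1980§45_f§dupont")
```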


function ina_diachrony_annotations_whole(dir)
    checkdir(dir)

    annotations = Dict()

    for (root, subdirs, files) in walkdir(dir)
        for file in files
            filename, ext = splitext(file)
            ext != ".wav" && continue
            
            # extract metadata from filename
            metadata = ina_diachrony_get_metadata(filename)
            
            # extract transcription text (same filename but .txt)
            textfilepath = joinpath(root, "$filename.txt")
            metadata["text"] = isfile(textfilepath) ? join(readlines(textfilepath), "\n") : ""
            
            id = "ina_diachrony§$filename"
            annotation_id = id*"§0"
            # generate annotation
            annotations[annotation_id] = Annotation(
                id, # audio id
                annotation_id, # annotation id
                -1,  # a start and duration of -1 mean that the whole
                -1,  # recording is taken
                [1], # only 1 channel (mono recording)
                metadata # additional information
            )
        end
    end
    annotations
end


function ina_diachrony_annotations_csv(dir)
    checkdir(dir)

    annotations = Dict()

    for (root, subdirs, files) in walkdir(dir)
        for file in files
            filename, ext = splitext(file)
            ext != ".csv" && continue

            # extract metadata from filename
            metadata = ina_diachrony_get_metadata(filename)

            id = "ina_diachrony§$filename"
            # generate annotation for each line in csv
            open(joinpath(root, file)) do f
                header = readline(f)   
                line = 1 
                # read till end of file
                while ! eof(f) 
                    current_line = readline(f)
                    start_time, end_time, text = split(current_line, ",", limit=3)
                    start_time = parse(Float64, start_time)
                    duration = parse(Float64, end_time)-start_time
                    # copy the metadata so each annotation gets its own dict
                    linemetadata = merge(metadata, Dict("text" => text))
                    annotation_id = id*"§$line"
                    annotations[annotation_id] = Annotation(
                        id, # audio id
                        annotation_id, # annotation id
                        start_time,  # start
                        duration,  # duration
                        [1], # only 1 channel (mono recording)
                        linemetadata # additional information
                    )
                    line += 1
                end
            end

        end
    end
    annotations
end


function ina_diachrony_prepare(ina_wav_dir, ina_csv_dir, outputdir)
    # Validate the data directory
    for d in [ina_wav_dir, ina_csv_dir]
        isnothing(d) || checkdir(d)
    end

    # Create the output directory.
    outputdir = mkpath(outputdir)
    rm(joinpath(outputdir, "recordings.jsonl"), force=true)

    # Recordings
    @info "Extracting recordings from $ina_wav_dir"
    recordings = ina_diachrony_recordings(ina_wav_dir)

    manifestpath = joinpath(outputdir, "recordings.jsonl")
    open(manifestpath, "w") do f
        writemanifest(f, recordings)
    end

    # Annotations
    @info "Extracting annotations from $ina_wav_dir"
    annotations = ina_diachrony_annotations_whole(ina_wav_dir)
    if ! isnothing(ina_csv_dir)
        @info "Extracting annotations from $ina_csv_dir"
        csv_annotations = ina_diachrony_annotations_csv(ina_csv_dir)
        annotations = merge(annotations, csv_annotations)
    end
        
    manifestpath = joinpath(outputdir, "annotations.jsonl")
    @info "Creating $manifestpath"
    open(manifestpath, "w") do f
        writemanifest(f, annotations)
    end
end

function INADIACHRONY(ina_wav_dir, outputdir, ina_csv_dir=nothing)
    if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
          isfile(joinpath(outputdir, "annotations.jsonl")))
        ina_diachrony_prepare(ina_wav_dir, ina_csv_dir, outputdir)
    end
    dataset(outputdir, "")
end
@@ -11,15 +11,9 @@ const MINILS_SUBSETS = Dict(
    "dev" => "dev-clean-2"
)

const MINILS_LANG = "eng"

const MINILS_NAME = "mini_librispeech"

#######################################################################

struct MiniLibriSpeech <: SpeechCorpus
    lang
    name
struct MINILIBRISPEECH <: SpeechCorpus
    recordings
    train
    dev
@@ -48,7 +42,7 @@ function minils_recordings(dir, subset)
    recs
end

function minils_supervisions(dir, subset)
function minils_annotations(dir, subset)
    subsetdir = joinpath(dir, "LibriSpeech", MINILS_SUBSETS[subset])
    sups = Dict()
    for d1 in readdir(subsetdir; join = true)
@@ -58,8 +52,12 @@ function minils_supervisions(dir, subset)
            open(joinpath(d2, "$(k1)-$(k2).trans.txt"), "r") do f
                for line in eachline(f)
                    tokens = split(line)
                    s = Supervision(tokens[1], tokens[1]; channel = 1,
                                    data = Dict("text" => join(tokens[2:end], " ")))
                    s = Annotation(
                        tokens[1], # annotation id
                        tokens[1]; # recording id
                        channels = [1],
                        data = Dict("text" => join(tokens[2:end], " "))
                    )
                    sups[s.id] = s
                end
            end
@@ -89,7 +87,7 @@ end

function minils_prepare(dir)
    # 1. Recording manifest.
    out = joinpath(dir, "recording-manifest.jsonl")
    out = joinpath(dir, "recordings.jsonl")
    if ! isfile(out)
        open(out, "w") do f
            for subset in ["train", "dev"]
@@ -100,12 +98,12 @@ function minils_prepare(dir)
        end
    end

    # 2. Supervision manifests.
    for subset in ["train", "dev"]
        out = joinpath(dir, "supervision-manifest-$subset.jsonl")
    # 2. Annotation manifests.
    for (subset, name) in [("train", "train"), ("dev", "dev"), ("dev", "test")]
        out = joinpath(dir, "annotations-$name.jsonl")
        if ! isfile(out)
            @debug "preparing supervision manifest ($subset) $out"
            sups = minils_supervisions(dir, subset)
            @debug "preparing annotation manifest ($subset) $out"
            sups = minils_annotations(dir, subset)
            open(out, "w") do f
                writemanifest(f, sups)
            end
@@ -113,20 +111,10 @@ function minils_prepare(dir)
    end
end

function MiniLibriSpeech(outdir)
    dir = joinpath(outdir, MINILS_LANG, MINILS_NAME)

function MINILIBRISPEECH(dir, subset)
    minils_download(dir)
    minils_prepare(dir)

    MiniLibriSpeech(
        MINILS_LANG,
        MINILS_NAME,
        load(Recording, joinpath(dir, "recording-manifest.jsonl")),
        load(Supervision, joinpath(dir, "supervision-manifest-train.jsonl")),
        load(Supervision, joinpath(dir, "supervision-manifest-dev.jsonl")),
        load(Supervision, joinpath(dir, "supervision-manifest-dev.jsonl")),
    )
    dataset(dir, subset)
end
MiniLibriSpeech() = MiniLibriSpeech(SPEECH_CORPORA_ROOTDIR)
@@ -89,13 +89,13 @@ function recordings(corpus::MultilingualLibriSpeech, dir, subset)
    recs
end

function supervisions(corpus::MultilingualLibriSpeech, dir, subset)
function annotations(corpus::MultilingualLibriSpeech, dir, subset)
    trans = joinpath(dir, "mls_$(MLS_LANG_CODE[corpus.lang])", subset, "transcripts.txt")
    sups = Dict()
    open(trans, "r") do f
        for line in eachline(f)
            tokens = split(line)
            s = Supervision(tokens[1], tokens[1]; channel = 1,
            s = Annotation(tokens[1], tokens[1]; channel = 1,
                            data = Dict("text" => join(tokens[2:end], " ")))
            sups[s.id] = s
        end
@@ -118,12 +118,12 @@ function prepare(corpus::MultilingualLibriSpeech, outdir)
        end
    end

    # 2. Supervision manifests.
    # 2. Annotation manifests.
    for subset in ["train", "dev", "test"]
        out = joinpath(dir, "supervision-manifest-$subset.jsonl")
        @info "preparing supervision manifest ($subset) $out"
        out = joinpath(dir, "annotation-manifest-$subset.jsonl")
        @info "preparing annotation manifest ($subset) $out"
        if ! isfile(out)
            sups = supervisions(corpus, dir, subset)
            sups = annotations(corpus, dir, subset)
            open(out, "w") do f
                writemanifest(f, sups)
            end
+122 −0
# SPDX-License-Identifier: CECILL-2.1

function speech2tex_recordings(dir::AbstractString)
    checkdir(dir)

    recordings = Dict()
    for (root, subdirs, files) in walkdir(dir)
        for file in files
            filename, ext = splitext(file)
            ext != ".wav" && continue
            
            id = filename
            path = joinpath(root, file)

            audio_src = FileAudioSource(path)

            recordings[id] = Recording(
                id,
                audio_src;
                channels = [1],
                samplerate = 48000
            )
        end
    end
    recordings
end

extract_digits(str::AbstractString) = filter(c->isdigit(c), str)
isnumber(str::AbstractString) = extract_digits(str)==str

function speech2tex_get_metadata(filename)
    # possible cases: line123_p1  line123_124_p1  line123_p1_part2  (not observed but also supported: line123_124_p1_part2)
    split_name = split(filename, "_")
    metadata = Dict()
    if isnumber(split_name[2])
        metadata["line"] = extract_digits(split_name[1])*"_"*split_name[2]
        metadata["speaker"] = split_name[3]
    else 
        metadata["line"] = extract_digits(split_name[1])
        metadata["speaker"] = split_name[2]
    end
    if occursin("part", split_name[end])
        metadata["part"] = extract_digits(split_name[end])
    end
    metadata
end
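A standalone sketch of the filename convention documented in the comment above, using invented names, so the three cases can be checked in isolation:

```julia
# Mirrors the parsing of names like line123_p1, line123_124_p1,
# and line123_p1_part2 described above; the names are invented.
digits_of(str) = filter(isdigit, str)
allnumeric(str) = !isempty(str) && digits_of(str) == str

function parse_speech2tex_name(filename)
    parts = split(filename, "_")
    md = Dict{String,String}()
    if allnumeric(parts[2])                 # line123_124_p1[...]
        md["line"] = digits_of(parts[1]) * "_" * parts[2]
        md["speaker"] = parts[3]
    else                                    # line123_p1[...]
        md["line"] = digits_of(parts[1])
        md["speaker"] = parts[2]
    end
    occursin("part", parts[end]) && (md["part"] = digits_of(parts[end]))
    md
end
```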


function speech2tex_annotations(audiodir, transcriptiondir, texdir)
    checkdir.([audiodir, transcriptiondir, texdir])

    annotations = Dict()

    for (root, subdirs, files) in walkdir(audiodir)
        for file in files
            filename, ext = splitext(file)
            ext != ".wav" && continue
            
            # extract metadata from csv files
            metadata = speech2tex_get_metadata(filename)

            # extract transcription and tex (same filenames but .txt)
            dirdict = Dict(transcriptiondir => "transcription", texdir => "latex")
            for (d, label) in dirdict
                textfilepath = joinpath(d, "$filename.txt")
                metadata[label] = isfile(textfilepath) ? join(readlines(textfilepath), "\n") : ""
            end
            id = filename
            # generate annotation
            annotations[id] = Annotation(
                id, # audio id
                id, # annotation id
                -1,  # a start and duration of -1 mean that the whole
                -1,  # recording is taken
                [1], # only 1 channel (mono recording)
                metadata # additional information
            )
        end
    end
    annotations
end

function speech2tex_prepare(datadir, outputdir)
    # Validate the data directory
    checkdir(datadir)

    # Create the output directory.
    outputdir = mkpath(outputdir)
    rm(joinpath(outputdir, "recordings.jsonl"), force=true)

    # Recordings
    # Recordings
    recordings_path = joinpath(datadir, "audio")
    @info "Extracting recordings from $recordings_path"
    recordings = speech2tex_recordings(recordings_path)

    manifestpath = joinpath(outputdir, "recordings.jsonl")
    open(manifestpath, "w") do f
        writemanifest(f, recordings)
    end

    # Annotations
    transcriptiondir = joinpath(datadir, "sequences")
    texdir = joinpath(datadir, "latex")
    @info "Extracting annotations from $transcriptiondir and $texdir"
    annotations = speech2tex_annotations(recordings_path, transcriptiondir, texdir)
        
    manifestpath = joinpath(outputdir, "annotations.jsonl")
    @info "Creating $manifestpath"
    open(manifestpath, "w") do f
        writemanifest(f, annotations)
    end
end


function SPEECH2TEX(datadir, outputdir)
    if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
          isfile(joinpath(outputdir, "annotations.jsonl")))
        speech2tex_prepare(datadir, outputdir)
    end
    dataset(outputdir, "")
end

src/corpora/timit.jl

0 → 100644
+403 −0
# SPDX-License-Identifier: CECILL-2.1

#######################################################################


const TIMIT_SUBSETS = Dict(
    "train" => "train",
    "dev" => "dev",
    "test" => "test"
)


const TIMIT_DEV_SPK_LIST = Set([
    "faks0",
    "fdac1",
    "fjem0",
    "mgwt0",
    "mjar0",
    "mmdb1",
    "mmdm2",
    "mpdf0",
    "fcmh0",
    "fkms0",
    "mbdg0",
    "mbwm0",
    "mcsh0",
    "fadg0",
    "fdms0",
    "fedw0",
    "mgjf0",
    "mglb0",
    "mrtk0",
    "mtaa0",
    "mtdt0",
    "mthc0",
    "mwjg0",
    "fnmr0",
    "frew0",
    "fsem0",
    "mbns0",
    "mmjr0",
    "mdls0",
    "mdlf0",
    "mdvc0",
    "mers0",
    "fmah0",
    "fdrw0",
    "mrcs0",
    "mrjm4",
    "fcal1",
    "mmwh0",
    "fjsj0",
    "majc0",
    "mjsw0",
    "mreb0",
    "fgjd0",
    "fjmg0",
    "mroa0",
    "mteb0",
    "mjfc0",
    "mrjr0",
    "fmml0",
    "mrws1"
])


const TIMIT_TEST_SPK_LIST = Set([
    "mdab0",
    "mwbt0",
    "felc0",
    "mtas1",
    "mwew0",
    "fpas0",
    "mjmp0",
    "mlnt0",
    "fpkt0",
    "mlll0",
    "mtls0",
    "fjlm0",
    "mbpm0",
    "mklt0",
    "fnlp0",
    "mcmj0",
    "mjdh0",
    "fmgd0",
    "mgrt0",
    "mnjm0",
    "fdhc0",
    "mjln0",
    "mpam0",
    "fmld0"
])


TIMIT_PHONE_MAP48 = Dict(
    "aa"    => "aa",
    "ae"    => "ae",
    "ah"    => "ah",
    "ao"    => "ao",
    "aw"    => "aw",
    "ax"    => "ax",
    "ax-h"  => "ax",
    "axr"   => "er",
    "ay"    => "ay",
    "b"     => "b",
    "bcl"   => "vcl",
    "ch"    => "ch",
    "d"     => "d",
    "dcl"   => "vcl",
    "dh"    => "dh",
    "dx"    => "dx",
    "eh"    => "eh",
    "el"    => "el",
    "em"    => "m",
    "en"    => "en",
    "eng"   => "ng",
    "epi"   => "epi",
    "er"    => "er",
    "ey"    => "ey",
    "f"     => "f",
    "g"     => "g",
    "gcl"   => "vcl",
    "h#"    => "sil",
    "hh"    => "hh",
    "hv"    => "hh",
    "ih"    => "ih",
    "ix"    => "ix",
    "iy"    => "iy",
    "jh"    => "jh",
    "k"     => "k",
    "kcl"   => "cl",
    "l"     => "l",
    "m"     => "m",
    "n"     => "n",
    "ng"    => "ng",
    "nx"    => "n",
    "ow"    => "ow",
    "oy"    => "oy",
    "p"     => "p",
    "pau"   => "sil",
    "pcl"   => "cl",
    "q"     => "",
    "r"     => "r",
    "s"     => "s",
    "sh"    => "sh",
    "t"     => "t",
    "tcl"   => "cl",
    "th"    => "th",
    "uh"    => "uh",
    "uw"    => "uw",
    "ux"    => "uw",
    "v"     => "v",
    "w"     => "w",
    "y"     => "y",
    "z"     => "z",
    "zh"    => "zh"
)


TIMIT_PHONE_MAP39 = Dict(
    "aa"    => "aa",
    "ae"    => "ae",
    "ah"    => "ah",
    "ao"    => "aa",
    "aw"    => "aw",
    "ax"    => "ah",
    "ax-h"  => "ah",
    "axr"   => "er",
    "ay"    => "ay",
    "b"     => "b",
    "bcl"   => "sil",
    "ch"    => "ch",
    "d"     => "d",
    "dcl"   => "sil",
    "dh"    => "dh",
    "dx"    => "dx",
    "eh"    => "eh",
    "el"    => "l",
    "em"    => "m",
    "en"    => "n",
    "eng"   => "ng",
    "epi"   => "sil",
    "er"    => "er",
    "ey"    => "ey",
    "f"     => "f",
    "g"     => "g",
    "gcl"   => "sil",
    "h#"    => "sil",
    "hh"    => "hh",
    "hv"    => "hh",
    "ih"    => "ih",
    "ix"    => "ih",
    "iy"    => "iy",
    "jh"    => "jh",
    "k"     => "k",
    "kcl"   => "sil",
    "l"     => "l",
    "m"     => "m",
    "n"     => "n",
    "ng"    => "ng",
    "nx"    => "n",
    "ow"    => "ow",
    "oy"    => "oy",
    "p"     => "p",
    "pau"   => "sil",
    "pcl"   => "sil",
    "q"     => "",
    "r"     => "r",
    "s"     => "s",
    "sh"    => "sh",
    "t"     => "t",
    "tcl"   => "sil",
    "th"    => "th",
    "uh"    => "uh",
    "uw"    => "uw",
    "ux"    => "uw",
    "v"     => "v",
    "w"     => "w",
    "y"     => "y",
    "z"     => "z",
    "zh"    => "sh"
)
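These maps are typically applied by relabeling each phone and dropping the empty label produced by `q`. A minimal sketch; the helper name and the map excerpt are assumptions for illustration, not code from this diff:

```julia
# Excerpt of the 39-phone map, sufficient for the example.
const MAP39_EXCERPT = Dict("hv" => "hh", "ix" => "ih", "q" => "", "pau" => "sil")

# Relabel each phone (unknown phones pass through) and drop empty labels.
map_phones(phones, mapping) =
    [mp for mp in (get(mapping, p, p) for p in phones) if !isempty(mp)]

map_phones(["hv", "ix", "q", "pau", "s"], MAP39_EXCERPT)
```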

#######################################################################


function timit_prepare(timitdir, dir; audio_fmt="SPHERE")
    # Validate the data directory
    ! isdir(timitdir) && throw(ArgumentError("invalid path $(timitdir)"))

    # Create the output directory.
    dir = mkpath(dir)
    rm(joinpath(dir, "recordings.jsonl"), force=true)

    ## Recordings
    @info "Extracting recordings from $timitdir/train"
    train_recordings = timit_recordings(joinpath(timitdir, "train"); fmt=audio_fmt)

    # We extract the name of speakers that are not in the dev set
    TIMIT_TRAIN_SPK_LIST = Set()
    for id in keys(train_recordings)
        _, spk, _ = split(id, "_")
        if spk ∉ TIMIT_DEV_SPK_LIST
            push!(TIMIT_TRAIN_SPK_LIST, spk)
        end
    end

    @info "Extracting recordings from $timitdir/test"
    test_recordings = timit_recordings(joinpath(timitdir, "test"); fmt=audio_fmt)
    recordings = merge(train_recordings, test_recordings)

    manifestpath = joinpath(dir, "recordings.jsonl")
    open(manifestpath, "a") do f
        writemanifest(f, recordings)
    end

    # Annotations
    @info "Extracting annotations from $timitdir/train"
    train_annotations = timit_annotations(joinpath(timitdir, "train"))
    @info "Extracting annotations from $timitdir/test"
    test_annotations = timit_annotations(joinpath(timitdir, "test"))
    annotations = merge(train_annotations, test_annotations)


    train_annotations = filter(annotations) do (k, v)
        stype = v.data["sentence type"]
        spk = v.data["speaker"]
        (
            (stype == "compact" || stype == "diverse") &&
            spk ∈ TIMIT_TRAIN_SPK_LIST
        )
    end

    dev_annotations = filter(annotations) do (k, v)
        stype = v.data["sentence type"]
        spk = v.data["speaker"]
        (
            (stype == "compact" || stype == "diverse") &&
            spk ∈ TIMIT_DEV_SPK_LIST
        )
    end

    test_annotations = filter(annotations) do (k, v)
        stype = v.data["sentence type"]
        spk = v.data["speaker"]
        (
            (stype == "compact" || stype == "diverse") &&
            spk ∈ TIMIT_TEST_SPK_LIST
        )
    end

    for (x, y) in ("train" => train_annotations,
                   "dev" => dev_annotations,
                   "test" => test_annotations)
        manifestpath = joinpath(dir, "annotations-$(x).jsonl")
        @info "Creating $manifestpath"

        open(manifestpath, "w") do f
            writemanifest(f, y)
        end
    end
end


function timit_recordings(dir::AbstractString; fmt="SPHERE")
    ! isdir(dir) && throw(ArgumentError("expected directory $dir"))

    recordings = Dict()
    for (root, subdirs, files) in walkdir(dir)
        for file in files
            name, ext = splitext(file)
            ext != ".wav" && continue
            spk = basename(root)
            path = joinpath(root, file)
            id = "timit_$(spk)_$(name)"

            audio_src = if fmt == "SPHERE"
                CmdAudioSource(`sph2pipe -f wav $path`)
            else
                FileAudioSource(path)
            end

            recordings[id] = Recording(
                id,
                audio_src;
                channels = [1],
                samplerate = 16000
            )
        end
    end
    recordings
end
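The recording id is derived from the directory layout: the speaker is the innermost directory name and the utterance name is the file stem. A standalone sketch of that id construction (hypothetical paths, nothing is read from disk):

```julia
# Build a recording id the same way as above, from a hypothetical layout.
root = joinpath("timit", "train", "dr1", "fcjf0")
file = "sa1.wav"
spk  = basename(root)          # speaker id: "fcjf0"
name = first(splitext(file))   # utterance name: "sa1"
id   = "timit_$(spk)_$(name)"  # "timit_fcjf0_sa1"
```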


function timit_annotations(dir)
    ! isdir(dir) && throw(ArgumentError("expected directory $dir"))
    splitline(line) = rsplit(line, limit=3)

    annotations = Dict()
    processed = Set()

    for (root, subdirs, files) in walkdir(dir)
        for file in files
            name, ext = splitext(file)
            _, dialect, spk = rsplit(root, "/", limit=3)

            # Annotation files already processed (".wrd" and ".phn")
            idtuple = (dialect, spk, name)
            (idtuple in processed) && continue
            push!(processed, (dialect, spk, name))

            # Words
            wpath = joinpath(root, name * ".wrd")
            words = [last(split(line)) for line in eachline(wpath)]

            # Phones
            ppath = joinpath(root, name * ".phn")
            palign = Tuple{Int,Int,String}[]
            for line in eachline(ppath)
                t0, t1, p = split(line)
                push!(palign, (parse(Int, t0), parse(Int, t1), String(p)))
            end

            sentence_type = if startswith(name, "sa")
                "dialect"
            elseif startswith(name, "sx")
                "compact"
            else # startswith(name, "si")
                "diverse"
            end

            id = "timit_$(spk)_$(name)"
            annotations[id] = Annotation(
                id,  # recording id and annotation id are the same since we have
                id,  # a one-to-one mapping
        -1,  # a start and duration of -1 mean that we take
        -1,  # the whole recording
                [1], # only 1 channel (mono recording)
                Dict(
                     "text" => join(words, " "),
                     "sentence type" => sentence_type,
                     "alignment" => palign,
                     "dialect" => dialect,
                     "speaker" => spk,
                     "sex" => string(first(spk)),
                )
            )
        end
    end
    annotations
end
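The filename prefix encodes the TIMIT sentence type (`sa` = dialect, `sx` = compact, `si` = diverse), which the `if`/`elseif` chain above relies on. The same mapping can be sketched standalone (hypothetical utterance names):

```julia
# Map a TIMIT utterance name to its sentence type, mirroring the
# sa/sx/si prefix convention used above.
sentence_type(name) =
    startswith(name, "sa") ? "dialect" :
    startswith(name, "sx") ? "compact" :
    "diverse"

sentence_type("sa1")    # "dialect"
sentence_type("sx217")  # "compact"
sentence_type("si648")  # "diverse"
```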


function TIMIT(timitdir, dir, subset)
    if ! (isfile(joinpath(dir, "recordings.jsonl")) &&
          isfile(joinpath(dir, "annotations-train.jsonl")) &&
          isfile(joinpath(dir, "annotations-dev.jsonl")) &&
          isfile(joinpath(dir, "annotations-test.jsonl")))
        timit_prepare(timitdir, dir)
    end
    dataset(dir, subset)
end

src/dataset.jl

# SPDX-License-Identifier: CECILL-2.1

struct SpeechDataset <: MLUtils.AbstractDataContainer
    idxs::Vector{AbstractString}
    annotations::Dict{AbstractString, Annotation}
    recordings::Dict{AbstractString, Recording}
end

"""
    dataset(manifestroot, partition)

Load a `SpeechDataset` from the manifest files stored in `manifestroot` for
the given `partition` (e.g. `:train`).

Each item of the dataset is a nested tuple `((samples, sampling_rate), Annotation.data)`.

See also [`Annotation`](@ref).

# Examples
```julia-repl
julia> ds = dataset("./manifests", :train)
SpeechDataset(
    ...
)
julia> ds[1]
(
    (samples=[...], sampling_rate=16_000),
    Dict(
        "text" => "Annotation text here"
    )
)
```
"""
function dataset(manifestroot::AbstractString, partition)
    partition_name = partition == "" ? "" : "-$(partition)"
    annot_path = joinpath(manifestroot, "annotations$(partition_name).jsonl")
    rec_path = joinpath(manifestroot, "recordings.jsonl")
    annotations = load(Annotation, annot_path)
    recordings = load(Recording, rec_path)
    dataset(annotations, recordings)
end

function dataset(annotations::AbstractDict, recordings::AbstractDict)
    idxs = collect(keys(annotations))
    SpeechDataset(idxs, annotations, recordings)
end

Base.getindex(d::SpeechDataset, key::AbstractString) = d.recordings[key], d.annotations[key]
Base.getindex(d::SpeechDataset, idx::Integer) = getindex(d, d.idxs[idx])
# Fix1 -> partial function with the 1st argument fixed
Base.getindex(d::SpeechDataset, idxs::AbstractVector) = map(Base.Fix1(getindex, d), idxs)
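`Base.Fix1(getindex, d)` is the partial application `i -> getindex(d, i)`, so mapping it over a vector of keys looks up each key in turn. A toy illustration of the pattern (plain `Dict`, not a `SpeechDataset`):

```julia
d = Dict("a" => 1, "b" => 2)
f = Base.Fix1(getindex, d)  # f(k) == getindex(d, k) == d[k]
map(f, ["a", "b"])          # [1, 2]
```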

Base.length(d::SpeechDataset) = length(d.idxs)

function Base.filter(fn, d::SpeechDataset)
    fidxs = filter(d.idxs) do i
        fn((d.recordings[i], d.annotations[i]))
    end
    idset = Set(fidxs)

    fannotations = filter(d.annotations) do (k, v)
        k  idset
    end

    frecs = filter(d.recordings) do (k, v)
        k  idset
    end

    SpeechDataset(fidxs, fannotations, frecs)
end

src/lexicons.jl

# SPDX-License-Identifier: CECILL-2.1


const CMUDICT_URL = "http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/sphinxdict/cmudict_SPHINX_40"
const FRMFA_DICT_URL = "https://raw.githubusercontent.com/MontrealCorpusTools/mfa-models/main/dictionary/french/mfa/french_mfa.dict"

function normalizeword(word)
    String(uppercase(word))
end

function normalizephoneme(phoneme)
    String(uppercase(phoneme))
end


"""
    CMUDICT(path)

Return the pronunciation dictionary loaded from the CMU Sphinx dictionary.
The CMU dictionary will be downloaded and stored at `path`. Subsequent
calls will only read the file at `path` without downloading the data again.
"""
function CMUDICT(path)
    if ! isfile(path)
        mkpath(dirname(path))
        dir = mktempdir()
        run(`wget -P $dir $CMUDICT_URL`)
        mv(joinpath(dir, "cmudict_SPHINX_40"), path)
    end

    lexicon = Dict()
    open(path, "r") do f
        for line in eachline(f)
            word, pron... = split(line)

            word = replace(word, "(1)" => "", "(2)" => "", "(3)" => "", "(4)" => "")

            pronunciations = get(lexicon, word, [])
            push!(pronunciations, pron)
            lexicon[word] = pronunciations
        end
    end
    lexicon
end
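The get/push/store sequence above accumulates every pronunciation variant under its word. The same pattern can be sketched standalone (toy entries, not the real CMU file); `get!` fetches or inserts the vector in one step:

```julia
lexicon = Dict{String, Vector{Vector{String}}}()
for line in ["READ R IY D", "READ R EH D", "CAT K AE T"]
    word, pron... = split(line)  # slurp the phone tokens after the word
    push!(get!(lexicon, String(word), Vector{String}[]), String.(pron))
end
lexicon["READ"]  # two pronunciation variants
```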


"""
    TIMITDICT(timitdir)

Return the pronunciation dictionary provided by the TIMIT corpus (located
in `timitdir`).
"""
function TIMITDICT(timitdir)
    dictfile = joinpath(timitdir, "doc", "timitdic.txt")
    iscomment(line) = first(line) == ';'

    lexicon = Dict{String,Vector{Vector{String}}}()
    for line in eachline(dictfile)
        iscomment(line) && continue

        word, pron = split(line, limit=2)
        pron = strip(pron, ['/', '\t', ' '])
        word = '~' in word ? split(word, "~", limit=2)[1] : word

        word = normalizeword(word)
        pron = normalizephoneme.(split(pron))

        pronunciations = get(lexicon, word, Vector{String}[])
        push!(pronunciations, pron)
        lexicon[word] = pronunciations
    end
    lexicon
end
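Each `timitdic.txt` entry is a word followed by a slash-delimited phone string; the loop above splits the two apart, strips the slashes, and uppercases the phones. A standalone sketch on a single illustrative entry:

```julia
line = "abbreviate  /ax b r iy1 v iy ey2 t/"   # illustrative entry
word, pron = split(line, limit=2)
pron = strip(pron, ['/', '\t', ' '])
phones = uppercase.(split(pron))
# word == "abbreviate", phones == ["AX", "B", "R", "IY1", "V", "IY", "EY2", "T"]
```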


"""
    MFAFRDICT(path)

Return the French pronunciation dictionary as provided by MFA (french_mfa v2.0.0a).
"""
function MFAFRDICT(path)
    if ! isfile(path)
        mkpath(dirname(path))
        dir = mktempdir()
        run(`wget -P $dir $FRMFA_DICT_URL`)
        mv(joinpath(dir, "french_mfa.dict"), path)
    end
    lexicon = Dict()
    open(path, "r") do f
        for line in eachline(f)
            word, pron... = split(line)
            pronunciations = get(lexicon, word, [])
            push!(pronunciations, pron)
            lexicon[word] = pronunciations
        end
    end
    lexicon
end
# SPDX-License-Identifier: CECILL-2.1

#=====================================================================#
# HTML pretty display

function Base.show(io::IO, ::MIME"text/html", r::AbstractAudioSource)
    print(io, "<audio controls ")
    print(io, "src=\"data:audio/wav;base64,")

    x, s, _ = loadsource(r, :)
    iob64_encode = Base64EncodePipe(io)
    wavwrite(x, iob64_encode, Fs = s, nbits = 8, compression = WAV.WAVE_FORMAT_PCM)
    close(iob64_encode)

    println(io, "\" />")
end

#=====================================================================#
# JSON serialization of a manifest item

@@ -68,21 +53,21 @@ function Base.show(io::IO, m::MIME"application/json", r::Recording)
    print(io, "}")
end

function Base.show(io::IO, m::MIME"application/json", s::Supervision)
function Base.show(io::IO, m::MIME"application/json", a::Annotation)
    compact = get(io, :compact, false)
    indent = compact ? 0 : 2
    printfn = compact ? print : println
    printfn(io, "{")
    printfn(io, repeat(" ", indent), "\"id\": \"", s.id, "\", ")
    printfn(io, repeat(" ", indent), "\"recording_id\": \"", s.recording_id, "\", ")
    printfn(io, repeat(" ", indent), "\"start\": ", s.start, ", ")
    printfn(io, repeat(" ", indent), "\"duration\": ", s.duration, ", ")
    printfn(io, repeat(" ", indent), "\"channel\": ", s.channel, ", ")
    printfn(io, repeat(" ", indent), "\"data\": ", s.data |> json)
    printfn(io, repeat(" ", indent), "\"id\": \"", a.id, "\", ")
    printfn(io, repeat(" ", indent), "\"recording_id\": \"", a.recording_id, "\", ")
    printfn(io, repeat(" ", indent), "\"start\": ", a.start, ", ")
    printfn(io, repeat(" ", indent), "\"duration\": ", a.duration, ", ")
    printfn(io, repeat(" ", indent), "\"channels\": ", a.channels |> json, ", ")
    printfn(io, repeat(" ", indent), "\"data\": ", a.data |> json)
    print(io, "}")
end

function JSON.json(r::Union{Recording, Supervision}; compact = true)
function JSON.json(r::Union{Recording, Annotation}; compact = true)
    out = IOBuffer()
    show(IOContext(out, :compact => compact), MIME("application/json"), r)
    String(take!(out))
@@ -111,12 +96,12 @@ Recording(d::Dict) = Recording(
    d["samplerate"]
)

Supervision(d::Dict) = Supervision(
Annotation(d::Dict) = Annotation(
    d["id"],
    d["recording_id"],
    d["start"],
    d["duration"],
    d["channel"],
    d["channels"],
    d["data"]
)

@@ -139,13 +124,18 @@ function readmanifest(io::IO, T)
    manifest
end

manifestname(T::Type{<:Recording}, subset) = "recording-manifest-$(subset).jsonl"
manifestname(T::Type{<:Supervision}, subset) = "supervision-manifest-$(subset).jsonl"
# Some utilities
manifestname(::Type{<:Recording}, name) = "recordings.jsonl"
manifestname(::Type{<:Annotation}, name) = "annotations-$name.jsonl"

load(T::Type{<:Union{Recording,Supervision}}, path::AbstractString) =
    open(f -> readmanifest(f, T), path, "r")
load(corpus::SpeechCorpus, dir, T, subset) =
    load(T, joinpath(path(corpus, dir), manifestname(T, subset)))
load(corpus::SpeechCorpus, T, subset) =
    load(corpus, corporadir, T, subset)
"""
    load(Annotation, path)
    load(Recording, path)

Load Recording/Annotation manifest from `path`.
"""
load(T::Type{<:Union{Recording, Annotation}}, path) = open(f -> readmanifest(f, T), path, "r")

function checkdir(dir::AbstractString)
    isdir(dir) || throw(ArgumentError("$dir is not an existing directory"))
end
# SPDX-License-Identifier: CECILL-2.1

"""
    abstract type AbstractAudioSource end

Abstract type for all audio sources. Possible audio sources are:
* `FileAudioSource`
* `URLAudioSource`
* `CmdAudioSource`

You can load the data of an audio source with the internal function

    loadsource(s::AbstractAudioSource, subrange)

"""
abstract type AbstractAudioSource end

struct FileAudioSource <: AbstractAudioSource
    path::AbstractString
end

struct URLAudioSource <: AbstractAudioSource
    url::AbstractString
end

struct CmdAudioSource <: AbstractAudioSource
    cmd
end
CmdAudioSource(c::String) = CmdAudioSource(Cmd(String.(split(c))))

loadsource(s::FileAudioSource, subrange) = wavread(s.path; subrange)
loadsource(s::URLAudioSource, subrange) = wavread(IOBuffer(HTTP.get(s.url).body); subrange)
loadsource(s::CmdAudioSource, subrange) = wavread(IOBuffer(read(pipeline(s.cmd))); subrange)
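The `CmdAudioSource(::String)` convenience constructor simply tokenizes the string into a `Cmd` on whitespace. A standalone illustration of that conversion (hypothetical command, nothing is executed):

```julia
cmd = Cmd(String.(split("sph2pipe -f wav utt1.sph")))
cmd.exec  # ["sph2pipe", "-f", "wav", "utt1.sph"]
```

Note that splitting on whitespace does not handle quoted arguments, so paths containing spaces would be tokenized incorrectly.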

"""
    abstract type ManifestItem end

@@ -71,7 +39,7 @@ end

function Recording(uttid, s::AbstractAudioSource; channels = missing, samplerate = missing)
    if ismissing(channels) || ismissing(samplerate)
        x, sr = loadsource(s, :)
        x, sr = loadaudio(s)
        samplerate = ismissing(samplerate) ? Int(sr) : samplerate
        channels = ismissing(channels) ? collect(1:size(x,2)) : channels
    end
@@ -79,47 +47,49 @@ function Recording(uttid, s::AbstractAudioSource; channels = missing, samplerate
end

"""
    struct Supervision <: ManifestItem
    struct Annotation <: ManifestItem
        id::AbstractString
        recording_id::AbstractString
        start::Float64
        duration::Float64
        channel::Int
        channels::Union{Vector, Colon}
        data::Dict
    end

A "supervision" defines a segment of a recording on a single channel.
An "annotation" defines a segment of a recording on one or more channels.
The `data` field is an arbitrary dictionary holding the nature of the
supervision.
annotation. `start` and `duration` (in seconds) define
where the segment is located within the recording `recording_id`.

# Constructor

    Supervision(id, recording_id, start, duration, channel, data)
    Supervision(id, recording_id[; channel = missing, start = -1, duration = -1, data = missing])
    Annotation(id, recording_id, start, duration, channels, data)
    Annotation(id, recording_id[; channels = missing, start = -1, duration = -1, data = missing])

If `start` and/or `duration` are negative, the segment is considered to
be the whole sequence length of the recording.
"""
struct Supervision <: ManifestItem
struct Annotation <: ManifestItem
    id::AbstractString
    recording_id::AbstractString
    start::Float64
    duration::Float64
    channel::Int
    channels::Union{Vector, Colon}
    data::Dict
end

Supervision(id, recid; channel = missing, start = -1, duration = -1, data = missing) =
    Supervision(id, recid, start, duration, channel, data)
Annotation(id, recid; channels = missing, start = -1, duration = -1, data = missing) =
    Annotation(id, recid, start, duration, channels, data)


"""
    load(recording[; start = -1, duration = -1, channels = recording.channels])
    load(recording, supervision)
    load(recording, annotation)

Load the signal from a recording. `start`, `duration` (in seconds) can
be used to load only a segment. If a `supervision` is given, function
be used to load only a segment. If an `annotation` is given, the function
will return only the portion of the signal corresponding to the
supervision segment.
annotation segment.

The function returns a tuple `(x, sr)` where `x` is an ``N×C`` array,
``N`` being the length of the signal and ``C`` the number of channels,
@@ -134,10 +104,9 @@ function load(r::Recording; start = -1, duration = -1, channels = r.channels)
        subrange = (:)
    end

    x, sr, _, _ = loadsource(r.source, subrange)
    x, sr = loadaudio(r.source, subrange)
    x[:,channels], sr
end

load(r::Recording, s::Supervision) =
    load(r; start = s.start, duration = s.duration, channels = [s.channel])
load(r::Recording, a::Annotation) = load(r; start = a.start, duration = a.duration, channels = a.channels)
# SPDX-License-Identifier: CECILL-2.1


"""
    abstract type SpeechCorpus
    abstract type SpeechCorpus end

Abstract type for all speech corpora.
"""
abstract type SpeechCorpus end


"""
    path(corpus)
    lang(corpus)

Path to the directory where is stored the corpus' data.
Return the ISO 639-3 code of the language of the corpus.
"""
path(corpus::SpeechCorpus, dir) = joinpath(dir, corpus.lang, corpus.name)
lang


"""
    name(corpus)

Return the name identifier of the corpus.
"""
name


"""
    download(corpus[, dir = homedir()])
    download(corpus, rootdir)

Download the data of the corpus to `dir`.
"""
Base.download(corpus::SpeechCorpus) = download(corpus, SPEECH_CORPORA_ROOTDIR)
Base.download

"""
    prepare(corpus[, dir = homedir()])
    prepare(corpus, rootdir)

Prepare the manifests of corpus to `dir`.
Prepare the manifests of the corpus.
"""
prepare(corpus::SpeechCorpus) = prepare(corpus, SPEECH_CORPORA_ROOTDIR)
prepare