Compare revisions

Changes are shown as if the source revision was being merged into the target revision.

56 commits on source · 16 files changed · +1238 −151
Files

.gitignore

0 → 100644
+3 −0
*outputdir/
Manifest.toml
notebook-test.jl
+70 −0
# Tags
## [0.15.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.15.0) - 19/06/2024
### Changed
- Added support for the Speech2Tex dataset

## [0.14.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.14.0) - 11/06/2024
### Changed
- Added support for the AVID dataset

## [0.13.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.13.0) - 10/06/2024
### Changed
- Added support for the INA Diachrony dataset
### Fixed
- Fixed the Minilibrispeech data preparation

## [0.12.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.12.0) - 21/05/2024
### Changed
- `SpeechDataset` is now a collection of `(Recording, Annotation)` tuples.

## [0.11.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.11.0) - 21/05/2024
### Added
- Filtering a speech dataset by recording id.
### Improved
- Faster TIMIT preparation

## [0.10.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.10.0) - 22/02/2024
### Added
- Extraction of alignments from TIMIT
### Changed
- `Supervision` is now `Annotation`

## [0.9.4](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.4) - 22/02/2024
### Fixed
- TIMIT data preparation

## [0.9.3](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.3) - 12/02/2024
### Fixed
- `CMUDICT("dir/path")` fails if `dir` does not already exist.

## [0.9.2](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.2) - 09/02/2024
### Fixed
- Invalid type for field `channels` of `Recording`
- `MINILIBRISPEECH` broken

## [0.9.1](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.1) - 09/02/2024
### Fixed
- Not possible to use `:` as channel specifier

## [0.9.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.0) - 09/02/2024
### Changed
- `TIMIT` and `MINILIBRISPEECH` directly create the `dataset`
### Added
- CMU and TIMIT lexicons

## [0.8.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.8.0) - 02/02/2024
### Features
- New `dataset` function, which builds a `SpeechDataset` from manifest files
- Compatibility with `MLUtils.DataLoader`

## [0.7.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.7.0) - 14/12/2023
### Changed
- Refactored API; the TIMIT dataset works (but Librispeech no longer does)

## [0.6.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.6.0) - 28/09/2023
### Added
- Raw audio data source

## [0.5.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.5.0) - 25/09/2023
### Added
- Data can be loaded directly from an audio source with the `load` function.

## [0.4.1](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.4.1) - 25/09/2023
### Added
+11 −9
name = "SpeechCorpora"
uuid = "3225a15e-d855-4a07-9546-2418058331ae"
authors = ["Lucas ONDEL YANG <lucas.ondel@cnrs.fr>"]
version = "0.4.1"
name = "SpeechDatasets"
uuid = "ae813453-fab8-46d9-ab8f-a64c05464021"
authors = ["Lucas ONDEL YANG <lucas.ondel@cnrs.fr>",
           "Simon DEVAUCHELLE <simon.devauchelle@universite-paris-saclay.fr>",
           "Nicolas DENIER <nicolas.denier@lisn.fr>"]
version = "0.15.0"

[deps]
Base64 = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
WAV = "8149f6b0-98f6-5db9-b78f-408fbbb8ef88"
MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
SpeechFeatures = "6f3487c4-5ca2-4050-bfeb-2cf56df92307"

[compat]
julia = "1.10"
JSON = "0.21"
WAV = "1.2"
julia = "1.8"
SpeechFeatures = "0.8"
+26 −17
# SpeechCorpora.jl
# SpeechDatasets.jl

A Julia package to download and prepare speech corpora.

@@ -7,35 +7,44 @@ A Julia package to download and prepare speech corpus.
Make sure to add the [FAST registry](https://gitlab.lisn.upsaclay.fr/fast/registry)
to your julia installation. Then, install the package as usual:
```
pkg> add SpeechCorpora
pkg> add SpeechDatasets
```

## Example

```
julia> using SpeechCorpora
julia> using SpeechDatasets

julia> corpus = MultilingualLibriSpeech("fra") |> download |> prepare
julia> dataset = MINILIBRISPEECH("outputdir", :train) # :dev | :test
...

# Load the recording manifest.
julia> recs = load(corpus, Recording, "dev") # use "train", "dev" or "test"
julia> dataset = TIMIT("/path/to/timit/dir", "outputdir", :train) # :dev | :test
...

# Load the supervision manifest.
julia> sups = load(corpus, Supervision, "dev") # use "train", "dev" or "test"
julia> dataset = INADIACHRONY("/path/to/ina_wav/dir", "outputdir", "/path/to/ina_csv/dir") # ina_csv dir optional
...

# Load the signal of the first supervision segment
julia> s = first(values(sups))
julia> x, samplerate = load(recs[s.recording_id], s)
julia> dataset = AVID("/path/to/avid/dir", "outputdir")
...

# Play the recording of the first supervision segment
julia> play(recs[s.recording_id], s)
julia> dataset = SPEECH2TEX("/path/to/speech2tex/dir", "outputdir")
...

```

## Author
* Lucas ONDEL YANG (LISN, CNRS)
julia> for ((signal, fs), supervision) in dataset
           # do something
       end

# Lexicons
julia> CMUDICT("outputfile")
...

julia> TIMITDICT("/path/to/timit/dir")
...

```

## License

This software is provided under the CeCILL 2.1 license (see the [`/LICENSE`](/LICENSE)
This software is provided under the CeCILL 2.1 license (see the [`/LICENSE`](/LICENSE) file).
# SPDX-License-Identifier: CECILL-2.1

module SpeechCorpora
module SpeechDatasets

using Base64
using HTTP
using JSON
using WAV
using SpeechFeatures
import MLUtils

export
    # ManifestItem
    FileAudioSource,
    CmdAudioSource,
    URLAudioSource,
    Recording,
    Supervision,
    Annotation,
    load,

    # Manifest interface
@@ -22,28 +18,34 @@ export

    # Corpora interface
    download,
    lang,
    name,
    prepare,

    # Corpora
    MultilingualLibriSpeech,
    MiniLibriSpeech
    MINILIBRISPEECH,
    TIMIT,
    INADIACHRONY,
    AVID,
    SPEECH2TEX,

    # Lexicon
    CMUDICT,
    TIMITDICT,
    MFAFRDICT,


SPEECH_CORPORA_ROOTDIR = homedir()

"""
    setrootdir(path)

Set the root directory where to store the datasets. Default to the user
home directory.
"""
setrootdir(path) = global SPEECH_CORPORA_ROOTDIR = path
    # Dataset
    dataset

include("speechcorpus.jl")
include("manifest_item.jl")
include("manifest_io.jl")
include("corpora/multilingual_librispeech.jl")
include("corpora/mini_librispeech.jl")
include("dataset.jl")

# Supported corpora
include.("corpora/" .* filter(contains(r"\.jl$"), readdir(joinpath(@__DIR__, "corpora"))))

include("lexicons.jl")

end

src/corpora/avid.jl

0 → 100644
+140 −0
# SPDX-License-Identifier: CECILL-2.1

function avid_recordings(dir::AbstractString)
    checkdir(dir)

    recordings = Dict()
    for (root, subdirs, files) in walkdir(dir)
        for file in files
            filename, ext = splitext(file)
            ext != ".wav" && continue
            
            id = filename
            path = joinpath(root, file)

            audio_src = FileAudioSource(path)

            recordings[id] = Recording(
                id,
                audio_src;
                channels = [1],
                samplerate = 16000
            )
        end
    end
    recordings
end


function load_metadata_files(dir::AbstractString)
    tasksdict = Dict('s' => "SENT", 'p' => "PARA")
    metadatadict = Dict(key => 
        readlines(joinpath(dir, "Metadata_with_labels_$(tasksdict[key]).csv")) 
        for key in keys(tasksdict))
    return metadatadict
end


function get_metadata(filename, metadatadict)
    task = split(filename, "_")[3][1]
    headers = metadatadict[task][1]
    headers = split(headers, ",")
    file_metadata = filter(x -> contains(x, filename), metadatadict[task])[1]
    file_metadata = split(file_metadata, ",")
    metadata = Dict(
        headers[i] => file_metadata[i]
        for i = 1:length(headers)
    )
    return metadata
end
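The column pairing that `get_metadata` performs can be sketched in isolation; the header and row below are invented examples shaped like the AVID metadata CSV, not data from this diff:

```julia
# Invented header and data row; get_metadata pairs them column by column.
headers = split("filename,task,label", ",")
row     = split("f1_a_s01,SENT,modal", ",")
metadata = Dict(zip(headers, row))
```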


function avid_annotations(dir)
    checkdir(dir)

    annotations = Dict()
    metadatadict = load_metadata_files(dir)

    for (root, subdirs, files) in walkdir(dir)
        for file in files
            filename, ext = splitext(file)
            ext != ".wav" && continue
            
            # extract metadata from csv files
            metadata = get_metadata(filename, metadatadict)
            
            id = filename
            # generate annotation
            annotations[id] = Annotation(
                id, # audio id
                id, # annotation id
                -1,  # a start and duration of -1 mean that the whole
                -1,  # recording is taken
                [1], # only 1 channel (mono recording)
                metadata # additional information
            )
        end
    end
    annotations
end


function download_avid(dir)
    @info "Directory $dir not found.\nDownloading AVID dataset (9.9 GB)"
    url = "https://zenodo.org/records/10524873/files/AVID.zip?download=1"
    filename = "AVID.zip"
    filepath = joinpath(dir, filename)
    mkpath(dir)
    run(`wget $url -O $filepath`)
    @info "Download complete, extracting files"
    run(`unzip $filepath -d $dir`)
    rm(filepath)
    return joinpath(dir, "AVID")
end


function avid_prepare(datadir, outputdir)
    # Validate the data directory
    isdir(datadir) || (datadir = download_avid(datadir))

    # Create the output directory.
    outputdir = mkpath(outputdir)
    rm(joinpath(outputdir, "recordings.jsonl"), force=true)

    # Recordings
    recordings = Array{Dict}(undef, 2)
    recordings_path = joinpath(datadir, "Repository 2")
    @info "Extracting recordings from $recordings_path"
    recordings[1] = avid_recordings(recordings_path)
    # Calibration tones
    calibtones_path = joinpath(datadir, "Calibration_tones")
    @info "Extracting recordings from $calibtones_path"
    recordings[2] = avid_recordings(calibtones_path)

    for (i, manifestpath) in enumerate([joinpath(outputdir, "recordings.jsonl"), joinpath(outputdir, "calibration_tones.jsonl")])
        open(manifestpath, "w") do f
            writemanifest(f, recordings[i])
        end
    end

    # Annotations
    annotations_path = recordings_path
    @info "Extracting annotations from $annotations_path"
    annotations = avid_annotations(annotations_path)
        
    manifestpath = joinpath(outputdir, "annotations.jsonl")
    @info "Creating $manifestpath"
    open(manifestpath, "w") do f
        writemanifest(f, annotations)
    end
end


function AVID(datadir, outputdir)
    if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
          isfile(joinpath(outputdir, "calibration_tones.jsonl")) &&
          isfile(joinpath(outputdir, "annotations.jsonl")))
        avid_prepare(datadir, outputdir)
    end
    dataset(outputdir, "")
end
+160 −0
# SPDX-License-Identifier: CECILL-2.1

function ina_diachrony_recordings(dir::AbstractString)
    checkdir(dir)

    recordings = Dict()
    for (root, subdirs, files) in walkdir(dir)
        for file in files
            filename, ext = splitext(file)
            ext != ".wav" && continue
            
            id = "ina_diachrony§$filename"
            path = joinpath(root, file)

            audio_src = FileAudioSource(path)

            recordings[id] = Recording(
                id,
                audio_src;
                channels = [1],
                samplerate = 16000
            )
        end
    end
    recordings
end


function ina_diachrony_get_metadata(filename)
    metadata = split(filename, "§")
    age, sex = split(metadata[2], "_")
    Dict(
        "speaker" => metadata[3],
        "timeperiod" => metadata[1],
        "age" => age,
        "sex" => sex,
    )
end
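As a self-contained illustration of the `timeperiod§age_sex§speaker` filename convention this function assumes (the example name is invented):

```julia
# Invented filename following the "timeperiod§age_sex§speaker" pattern
# that ina_diachrony_get_metadata expects.
function parse_ina_name(filename)
    timeperiod, age_sex, speaker = split(filename, "§")
    age, sex = split(age_sex, "_")
    Dict("speaker" => speaker, "timeperiod" => timeperiod,
         "age" => age, "sex" => sex)
end

parse_ina_name("1970-1980§45_f§dupont")
```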


function ina_diachrony_annotations_whole(dir)
    checkdir(dir)

    annotations = Dict()

    for (root, subdirs, files) in walkdir(dir)
        for file in files
            filename, ext = splitext(file)
            ext != ".wav" && continue
            
            # extract metadata from filename
            metadata = ina_diachrony_get_metadata(filename)
            
            # extract transcription text (same filename but .txt)
            textfilepath = joinpath(root, "$filename.txt")
            metadata["text"] = isfile(textfilepath) ? join(readlines(textfilepath), "\n") : ""
            
            id = "ina_diachrony§$filename"
            annotation_id = id*"§0"
            # generate annotation
            annotations[annotation_id] = Annotation(
                id, # audio id
                annotation_id, # annotation id
                -1,  # a start and duration of -1 mean that the whole
                -1,  # recording is taken
                [1], # only 1 channel (mono recording)
                metadata # additional information
            )
        end
    end
    annotations
end


function ina_diachrony_annotations_csv(dir)
    checkdir(dir)

    annotations = Dict()

    for (root, subdirs, files) in walkdir(dir)
        for file in files
            filename, ext = splitext(file)
            ext != ".csv" && continue

            # extract metadata from filename
            metadata = ina_diachrony_get_metadata(filename)

            id = "ina_diachrony§$filename"
            # generate annotation for each line in csv
            open(joinpath(root, file)) do f
                header = readline(f)   
                line = 1 
                # read till end of file
                while ! eof(f) 
                    current_line = readline(f)
                    start_time, end_time, text = split(current_line, ",", limit=3)
                    start_time = parse(Float64, start_time)
                    duration = parse(Float64, end_time)-start_time
                    # copy the metadata so each annotation gets its own dict
                    linemetadata = merge(metadata, Dict("text" => text))
                    annotation_id = id*"§$line"
                    annotations[annotation_id] = Annotation(
                        id, # audio id
                        annotation_id, # annotation id
                        start_time,  # start
                        duration,  # duration
                        [1], # only 1 channel (mono recording)
                        linemetadata # additional information
                    )
                    line += 1
                end
            end

        end
    end
    annotations
end


function ina_diachrony_prepare(ina_wav_dir, ina_csv_dir, outputdir)
    # Validate the data directory
    for d in [ina_wav_dir, ina_csv_dir]
        isnothing(d) || checkdir(d)
    end

    # Create the output directory.
    outputdir = mkpath(outputdir)
    rm(joinpath(outputdir, "recordings.jsonl"), force=true)

    # Recordings
    @info "Extracting recordings from $ina_wav_dir"
    recordings = ina_diachrony_recordings(ina_wav_dir)

    manifestpath = joinpath(outputdir, "recordings.jsonl")
    open(manifestpath, "w") do f
        writemanifest(f, recordings)
    end

    # Annotations
    @info "Extracting annotations from $ina_wav_dir"
    annotations = ina_diachrony_annotations_whole(ina_wav_dir)
    if ! isnothing(ina_csv_dir)
        @info "Extracting annotations from $ina_csv_dir"
        csv_annotations = ina_diachrony_annotations_csv(ina_csv_dir)
        annotations = merge(annotations, csv_annotations)
    end
        
    manifestpath = joinpath(outputdir, "annotations.jsonl")
    @info "Creating $manifestpath"
    open(manifestpath, "w") do f
        writemanifest(f, annotations)
    end
end

function INADIACHRONY(ina_wav_dir, outputdir, ina_csv_dir=nothing)
    if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
          isfile(joinpath(outputdir, "annotations.jsonl")))
        ina_diachrony_prepare(ina_wav_dir, ina_csv_dir, outputdir)
    end
    dataset(outputdir, "")
end
@@ -11,15 +11,9 @@ const MINILS_SUBSETS = Dict(
    "dev" => "dev-clean-2"
)

const MINILS_LANG = "eng"

const MINILS_NAME = "mini_librispeech"

#######################################################################

struct MiniLibriSpeech <: SpeechCorpus
    lang
    name
struct MINILIBRISPEECH <: SpeechCorpus
    recordings
    train
    dev
@@ -48,7 +42,7 @@ function minils_recordings(dir, subset)
    recs
end

function minils_supervisions(dir, subset)
function minils_annotations(dir, subset)
    subsetdir = joinpath(dir, "LibriSpeech", MINILS_SUBSETS[subset])
    sups = Dict()
    for d1 in readdir(subsetdir; join = true)
@@ -58,8 +52,12 @@ function minils_supervisions(dir, subset)
            open(joinpath(d2, "$(k1)-$(k2).trans.txt"), "r") do f
                for line in eachline(f)
                    tokens = split(line)
                    s = Supervision(tokens[1], tokens[1]; channel = 1,
                                    data = Dict("text" => join(tokens[2:end], " ")))
                    s = Annotation(
                        tokens[1], # annotation id
                        tokens[1]; # recording id
                        channels = [1],
                        data = Dict("text" => join(tokens[2:end], " "))
                    )
                    sups[s.id] = s
                end
            end
@@ -89,7 +87,7 @@ end

function minils_prepare(dir)
    # 1. Recording manifest.
    out = joinpath(dir, "recording-manifest.jsonl")
    out = joinpath(dir, "recordings.jsonl")
    if ! isfile(out)
        open(out, "w") do f
            for subset in ["train", "dev"]
@@ -100,12 +98,12 @@ function minils_prepare(dir)
        end
    end

    # 2. Supervision manifests.
    for subset in ["train", "dev"]
        out = joinpath(dir, "supervision-manifest-$subset.jsonl")
    # 2. Annotation manifests.
    for (subset, name) in [("train", "train"), ("dev", "dev"), ("dev", "test")]
        out = joinpath(dir, "annotations-$name.jsonl")
        if ! isfile(out)
            @debug "preparing supervision manifest ($subset) $out"
            sups = minils_supervisions(dir, subset)
            @debug "preparing annotation manifest ($subset) $out"
            sups = minils_annotations(dir, subset)
            open(out, "w") do f
                writemanifest(f, sups)
            end
@@ -113,20 +111,10 @@ function minils_prepare(dir)
    end
end

function MiniLibriSpeech(outdir)
    dir = joinpath(outdir, MINILS_LANG, MINILS_NAME)

function MINILIBRISPEECH(dir, subset)
    minils_download(dir)
    minils_prepare(dir)

    MiniLibriSpeech(
        MINILS_LANG,
        MINILS_NAME,
        load(Recording, joinpath(dir, "recording-manifest.jsonl")),
        load(Supervision, joinpath(dir, "supervision-manifest-train.jsonl")),
        load(Supervision, joinpath(dir, "supervision-manifest-dev.jsonl")),
        load(Supervision, joinpath(dir, "supervision-manifest-dev.jsonl")),
    )
    dataset(dir, subset)
end
MiniLibriSpeech() = MiniLibriSpeech(SPEECH_CORPORA_ROOTDIR)
@@ -89,13 +89,13 @@ function recordings(corpus::MultilingualLibriSpeech, dir, subset)
    recs
end

function supervisions(corpus::MultilingualLibriSpeech, dir, subset)
function annotations(corpus::MultilingualLibriSpeech, dir, subset)
    trans = joinpath(dir, "mls_$(MLS_LANG_CODE[corpus.lang])", subset, "transcripts.txt")
    sups = Dict()
    open(trans, "r") do f
        for line in eachline(f)
            tokens = split(line)
            s = Supervision(tokens[1], tokens[1]; channel = 1,
            s = Annotation(tokens[1], tokens[1]; channel = 1,
                            data = Dict("text" => join(tokens[2:end], " ")))
            sups[s.id] = s
        end
@@ -118,12 +118,12 @@ function prepare(corpus::MultilingualLibriSpeech, outdir)
        end
    end

    # 2. Supervision manifests.
    # 2. Annotation manifests.
    for subset in ["train", "dev", "test"]
        out = joinpath(dir, "supervision-manifest-$subset.jsonl")
        @info "preparing supervision manifest ($subset) $out"
        out = joinpath(dir, "annotation-manifest-$subset.jsonl")
        @info "preparing annotation manifest ($subset) $out"
        if ! isfile(out)
            sups = supervisions(corpus, dir, subset)
            sups = annotations(corpus, dir, subset)
            open(out, "w") do f
                writemanifest(f, sups)
            end
+122 −0
# SPDX-License-Identifier: CECILL-2.1

function speech2tex_recordings(dir::AbstractString)
    checkdir(dir)

    recordings = Dict()
    for (root, subdirs, files) in walkdir(dir)
        for file in files
            filename, ext = splitext(file)
            ext != ".wav" && continue
            
            id = filename
            path = joinpath(root, file)

            audio_src = FileAudioSource(path)

            recordings[id] = Recording(
                id,
                audio_src;
                channels = [1],
                samplerate = 48000
            )
        end
    end
    recordings
end

extract_digits(str::AbstractString) = filter(c->isdigit(c), str)
isnumber(str::AbstractString) = extract_digits(str)==str

function speech2tex_get_metadata(filename)
    # possible cases: line123_p1  line123_124_p1  line123_p1_part2  (not observed but also supported: line123_124_p1_part2)
    split_name = split(filename, "_")
    metadata = Dict()
    if isnumber(split_name[2])
        metadata["line"] = extract_digits(split_name[1])*"_"*split_name[2]
        metadata["speaker"] = split_name[3]
    else 
        metadata["line"] = extract_digits(split_name[1])
        metadata["speaker"] = split_name[2]
    end
    if occursin("part", split_name[end])
        metadata["part"] = extract_digits(split_name[end])
    end
    metadata
end
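A standalone sketch of the filename convention documented in the comment above, using invented names, so the three cases can be checked in isolation:

```julia
# Mirrors the parsing of names like line123_p1, line123_124_p1,
# and line123_p1_part2 described above; the names are invented.
digits_of(str) = filter(isdigit, str)
allnumeric(str) = !isempty(str) && digits_of(str) == str

function parse_speech2tex_name(filename)
    parts = split(filename, "_")
    md = Dict{String,String}()
    if allnumeric(parts[2])                 # line123_124_p1[...]
        md["line"] = digits_of(parts[1]) * "_" * parts[2]
        md["speaker"] = parts[3]
    else                                    # line123_p1[...]
        md["line"] = digits_of(parts[1])
        md["speaker"] = parts[2]
    end
    occursin("part", parts[end]) && (md["part"] = digits_of(parts[end]))
    md
end
```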


function speech2tex_annotations(audiodir, transcriptiondir, texdir)
    checkdir.([audiodir, transcriptiondir, texdir])

    annotations = Dict()

    for (root, subdirs, files) in walkdir(audiodir)
        for file in files
            filename, ext = splitext(file)
            ext != ".wav" && continue
            
            # extract metadata from csv files
            metadata = speech2tex_get_metadata(filename)

            # extract transcription and tex (same filenames but .txt)
            dirdict = Dict(transcriptiondir => "transcription", texdir => "latex")
            for (d, label) in dirdict
                textfilepath = joinpath(d, "$filename.txt")
                metadata[label] = isfile(textfilepath) ? join(readlines(textfilepath), "\n") : ""
            end
            id = filename
            # generate annotation
            annotations[id] = Annotation(
                id, # audio id
                id, # annotation id
                -1,  # a start and duration of -1 mean that the whole
                -1,  # recording is taken
                [1], # only 1 channel (mono recording)
                metadata # additional information
            )
        end
    end
    annotations
end

function speech2tex_prepare(datadir, outputdir)
    # Validate the data directory
    checkdir(datadir)

    # Create the output directory.
    outputdir = mkpath(outputdir)
    rm(joinpath(outputdir, "recordings.jsonl"), force=true)

    # Recordings
    # Recordings
    recordings_path = joinpath(datadir, "audio")
    @info "Extracting recordings from $recordings_path"
    recordings = speech2tex_recordings(recordings_path)

    manifestpath = joinpath(outputdir, "recordings.jsonl")
    open(manifestpath, "w") do f
        writemanifest(f, recordings)
    end

    # Annotations
    transcriptiondir = joinpath(datadir, "sequences")
    texdir = joinpath(datadir, "latex")
    @info "Extracting annotations from $transcriptiondir and $texdir"
    annotations = speech2tex_annotations(recordings_path, transcriptiondir, texdir)
        
    manifestpath = joinpath(outputdir, "annotations.jsonl")
    @info "Creating $manifestpath"
    open(manifestpath, "w") do f
        writemanifest(f, annotations)
    end
end


function SPEECH2TEX(datadir, outputdir)
    if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
          isfile(joinpath(outputdir, "annotations.jsonl")))
        speech2tex_prepare(datadir, outputdir)
    end
    dataset(outputdir, "")
end

src/corpora/timit.jl

0 → 100644
+403 −0
# SPDX-License-Identifier: CECILL-2.1

#######################################################################


const TIMIT_SUBSETS = Dict(
    "train" => "train",
    "dev" => "dev",
    "test" => "test"
)


const TIMIT_DEV_SPK_LIST = Set([
    "faks0",
    "fdac1",
    "fjem0",
    "mgwt0",
    "mjar0",
    "mmdb1",
    "mmdm2",
    "mpdf0",
    "fcmh0",
    "fkms0",
    "mbdg0",
    "mbwm0",
    "mcsh0",
    "fadg0",
    "fdms0",
    "fedw0",
    "mgjf0",
    "mglb0",
    "mrtk0",
    "mtaa0",
    "mtdt0",
    "mthc0",
    "mwjg0",
    "fnmr0",
    "frew0",
    "fsem0",
    "mbns0",
    "mmjr0",
    "mdls0",
    "mdlf0",
    "mdvc0",
    "mers0",
    "fmah0",
    "fdrw0",
    "mrcs0",
    "mrjm4",
    "fcal1",
    "mmwh0",
    "fjsj0",
    "majc0",
    "mjsw0",
    "mreb0",
    "fgjd0",
    "fjmg0",
    "mroa0",
    "mteb0",
    "mjfc0",
    "mrjr0",
    "fmml0",
    "mrws1"
])


const TIMIT_TEST_SPK_LIST = Set([
    "mdab0",
    "mwbt0",
    "felc0",
    "mtas1",
    "mwew0",
    "fpas0",
    "mjmp0",
    "mlnt0",
    "fpkt0",
    "mlll0",
    "mtls0",
    "fjlm0",
    "mbpm0",
    "mklt0",
    "fnlp0",
    "mcmj0",
    "mjdh0",
    "fmgd0",
    "mgrt0",
    "mnjm0",
    "fdhc0",
    "mjln0",
    "mpam0",
    "fmld0"
])


TIMIT_PHONE_MAP48 = Dict(
    "aa"    => "aa",
    "ae"    => "ae",
    "ah"    => "ah",
    "ao"    => "ao",
    "aw"    => "aw",
    "ax"    => "ax",
    "ax-h"  => "ax",
    "axr"   => "er",
    "ay"    => "ay",
    "b"     => "b",
    "bcl"   => "vcl",
    "ch"    => "ch",
    "d"     => "d",
    "dcl"   => "vcl",
    "dh"    => "dh",
    "dx"    => "dx",
    "eh"    => "eh",
    "el"    => "el",
    "em"    => "m",
    "en"    => "en",
    "eng"   => "ng",
    "epi"   => "epi",
    "er"    => "er",
    "ey"    => "ey",
    "f"     => "f",
    "g"     => "g",
    "gcl"   => "vcl",
    "h#"    => "sil",
    "hh"    => "hh",
    "hv"    => "hh",
    "ih"    => "ih",
    "ix"    => "ix",
    "iy"    => "iy",
    "jh"    => "jh",
    "k"     => "k",
    "kcl"   => "cl",
    "l"     => "l",
    "m"     => "m",
    "n"     => "n",
    "ng"    => "ng",
    "nx"    => "n",
    "ow"    => "ow",
    "oy"    => "oy",
    "p"     => "p",
    "pau"   => "sil",
    "pcl"   => "cl",
    "q"     => "",
    "r"     => "r",
    "s"     => "s",
    "sh"    => "sh",
    "t"     => "t",
    "tcl"   => "cl",
    "th"    => "th",
    "uh"    => "uh",
    "uw"    => "uw",
    "ux"    => "uw",
    "v"     => "v",
    "w"     => "w",
    "y"     => "y",
    "z"     => "z",
    "zh"    => "zh"
)


TIMIT_PHONE_MAP39 = Dict(
    "aa"    => "aa",
    "ae"    => "ae",
    "ah"    => "ah",
    "ao"    => "aa",
    "aw"    => "aw",
    "ax"    => "ah",
    "ax-h"  => "ah",
    "axr"   => "er",
    "ay"    => "ay",
    "b"     => "b",
    "bcl"   => "sil",
    "ch"    => "ch",
    "d"     => "d",
    "dcl"   => "sil",
    "dh"    => "dh",
    "dx"    => "dx",
    "eh"    => "eh",
    "el"    => "l",
    "em"    => "m",
    "en"    => "n",
    "eng"   => "ng",
    "epi"   => "sil",
    "er"    => "er",
    "ey"    => "ey",
    "f"     => "f",
    "g"     => "g",
    "gcl"   => "sil",
    "h#"    => "sil",
    "hh"    => "hh",
    "hv"    => "hh",
    "ih"    => "ih",
    "ix"    => "ih",
    "iy"    => "iy",
    "jh"    => "jh",
    "k"     => "k",
    "kcl"   => "sil",
    "l"     => "l",
    "m"     => "m",
    "n"     => "n",
    "ng"    => "ng",
    "nx"    => "n",
    "ow"    => "ow",
    "oy"    => "oy",
    "p"     => "p",
    "pau"   => "sil",
    "pcl"   => "sil",
    "q"     => "",
    "r"     => "r",
    "s"     => "s",
    "sh"    => "sh",
    "t"     => "t",
    "tcl"   => "sil",
    "th"    => "th",
    "uh"    => "uh",
    "uw"    => "uw",
    "ux"    => "uw",
    "v"     => "v",
    "w"     => "w",
    "y"     => "y",
    "z"     => "z",
    "zh"    => "sh"
)
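These maps are typically applied by relabeling each phone and dropping the empty label produced by `q`. A minimal sketch; the helper name and the map excerpt are assumptions for illustration, not code from this diff:

```julia
# Excerpt of the 39-phone map, sufficient for the example.
const MAP39_EXCERPT = Dict("hv" => "hh", "ix" => "ih", "q" => "", "pau" => "sil")

# Relabel each phone (unknown phones pass through) and drop empty labels.
map_phones(phones, mapping) =
    [mp for mp in (get(mapping, p, p) for p in phones) if !isempty(mp)]

map_phones(["hv", "ix", "q", "pau", "s"], MAP39_EXCERPT)
```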

#######################################################################


function timit_prepare(timitdir, dir; audio_fmt="SPHERE")
    # Validate the data directory
    ! isdir(timitdir) && throw(ArgumentError("invalid path $(timitdir)"))

    # Create the output directory.
    dir = mkpath(dir)
    rm(joinpath(dir, "recordings.jsonl"), force=true)

    ## Recordings
    @info "Extracting recordings from $timitdir/train"
    train_recordings = timit_recordings(joinpath(timitdir, "train"); fmt=audio_fmt)

    # We extract the name of speakers that are not in the dev set
    TIMIT_TRAIN_SPK_LIST = Set()
    for id in keys(train_recordings)
        _, spk, _ = split(id, "_")
        if spk ∉ TIMIT_DEV_SPK_LIST
            push!(TIMIT_TRAIN_SPK_LIST, spk)
        end
    end

    @info "Extracting recordings from $timitdir/test"
    test_recordings = timit_recordings(joinpath(timitdir, "test"); fmt=audio_fmt)
    recordings = merge(train_recordings, test_recordings)

    manifestpath = joinpath(dir, "recordings.jsonl")
    open(manifestpath, "a") do f
        writemanifest(f, recordings)
    end

    # Annotations
    @info "Extracting annotations from $timitdir/train"
    train_annotations = timit_annotations(joinpath(timitdir, "train"))
    @info "Extracting annotations from $timitdir/test"
    test_annotations = timit_annotations(joinpath(timitdir, "test"))
    annotations = merge(train_annotations, test_annotations)


    train_annotations = filter(annotations) do (k, v)
        stype = v.data["sentence type"]
        spk = v.data["speaker"]
        (
            (stype == "compact" || stype == "diverse") &&
            spk ∈ TIMIT_TRAIN_SPK_LIST
        )
    end

    dev_annotations = filter(annotations) do (k, v)
        stype = v.data["sentence type"]
        spk = v.data["speaker"]
        (
            (stype == "compact" || stype == "diverse") &&
            spk ∈ TIMIT_DEV_SPK_LIST
        )
    end

    test_annotations = filter(annotations) do (k, v)
        stype = v.data["sentence type"]
        spk = v.data["speaker"]
        (
            (stype == "compact" || stype == "diverse") &&
            spk ∈ TIMIT_TEST_SPK_LIST
        )
    end

    for (x, y) in ("train" => train_annotations,
                   "dev" => dev_annotations,
                   "test" => test_annotations)
        manifestpath = joinpath(dir, "annotations-$(x).jsonl")
        @info "Creating $manifestpath"

        open(manifestpath, "w") do f
            writemanifest(f, y)
        end
    end
end


function timit_recordings(dir::AbstractString; fmt="SPHERE")
    ! isdir(dir) && throw(ArgumentError("expected directory $dir"))

    recordings = Dict()
    for (root, subdirs, files) in walkdir(dir)
        for file in files
            name, ext = splitext(file)
            ext != ".wav" && continue
            spk = basename(root)
            path = joinpath(root, file)
            id = "timit_$(spk)_$(name)"

            audio_src = if fmt == "SPHERE"
                CmdAudioSource(`sph2pipe -f wav $path`)
            else
                FileAudioSource(path)
            end

            recordings[id] = Recording(
                id,
                audio_src;
                channels = [1],
                samplerate = 16000
            )
        end
    end
    recordings
end
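The recording id is derived from the directory layout: the speaker is the innermost directory name and the utterance name is the file stem. A standalone sketch of that id construction (hypothetical paths, nothing is read from disk):

```julia
# Build a recording id the same way as above, from a hypothetical layout.
root = joinpath("timit", "train", "dr1", "fcjf0")
file = "sa1.wav"
spk  = basename(root)          # speaker id: "fcjf0"
name = first(splitext(file))   # utterance name: "sa1"
id   = "timit_$(spk)_$(name)"  # "timit_fcjf0_sa1"
```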


function timit_annotations(dir)
    ! isdir(dir) && throw(ArgumentError("expected directory $dir"))
    splitline(line) = rsplit(line, limit=3)

    annotations = Dict()
    processed = Set()

    for (root, subdirs, files) in walkdir(dir)
        for file in files
            name, ext = splitext(file)
            _, dialect, spk = rsplit(root, "/", limit=3)

            # Annotation files already processed (".wrd" and ".phn")
            idtuple = (dialect, spk, name)
            (idtuple in processed) && continue
            push!(processed, (dialect, spk, name))

            # Words
            wpath = joinpath(root, name * ".wrd")
            words = [last(split(line)) for line in eachline(wpath)]

            # Phones
            ppath = joinpath(root, name * ".phn")
            palign = Tuple{Int,Int,String}[]
            for line in eachline(ppath)
                t0, t1, p = split(line)
                push!(palign, (parse(Int, t0), parse(Int, t1), String(p)))
            end

            sentence_type = if startswith(name, "sa")
                "dialect"
            elseif startswith(name, "sx")
                "compact"
            else # startswith(name, "si")
                "diverse"
            end

            id = "timit_$(spk)_$(name)"
            annotations[id] = Annotation(
                id,  # recording id and annotation id are the same since we have
                id,  # a one-to-one mapping
        -1,  # a start and duration of -1 mean that we take
        -1,  # the whole recording
                [1], # only 1 channel (mono recording)
                Dict(
                     "text" => join(words, " "),
                     "sentence type" => sentence_type,
                     "alignment" => palign,
                     "dialect" => dialect,
                     "speaker" => spk,
                     "sex" => string(first(spk)),
                )
            )
        end
    end
    annotations
end
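The filename prefix encodes the TIMIT sentence type (`sa` = dialect, `sx` = compact, `si` = diverse), which the `if`/`elseif` chain above relies on. The same mapping can be sketched standalone (hypothetical utterance names):

```julia
# Map a TIMIT utterance name to its sentence type, mirroring the
# sa/sx/si prefix convention used above.
sentence_type(name) =
    startswith(name, "sa") ? "dialect" :
    startswith(name, "sx") ? "compact" :
    "diverse"

sentence_type("sa1")    # "dialect"
sentence_type("sx217")  # "compact"
sentence_type("si648")  # "diverse"
```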


function TIMIT(timitdir, dir, subset)
    if ! (isfile(joinpath(dir, "recordings.jsonl")) &&
          isfile(joinpath(dir, "annotations-train.jsonl")) &&
          isfile(joinpath(dir, "annotations-dev.jsonl")) &&
          isfile(joinpath(dir, "annotations-test.jsonl")))
        timit_prepare(timitdir, dir)
    end
    dataset(dir, subset)
end

src/dataset.jl

# SPDX-License-Identifier: CECILL-2.1

struct SpeechDataset <: MLUtils.AbstractDataContainer
    idxs::Vector{AbstractString}
    annotations::Dict{AbstractString, Annotation}
    recordings::Dict{AbstractString, Recording}
end

"""
    dataset(manifestroot, partition)

Load a `SpeechDataset` from the manifest files stored in `manifestroot` for
the given `partition` (e.g. `:train`).

Each item of the dataset is a nested tuple `((samples, sampling_rate), Annotation.data)`.

See also [`Annotation`](@ref).

# Examples
```julia-repl
julia> ds = dataset("./manifests", :train)
SpeechDataset(
    ...
)
julia> ds[1]
(
    (samples=[...], sampling_rate=16_000),
    Dict(
        "text" => "Annotation text here"
    )
)
```
"""
function dataset(manifestroot::AbstractString, partition)
    partition_name = partition == "" ? "" : "-$(partition)"
    annot_path = joinpath(manifestroot, "annotations$(partition_name).jsonl")
    rec_path = joinpath(manifestroot, "recordings.jsonl")
    annotations = load(Annotation, annot_path)
    recordings = load(Recording, rec_path)
    dataset(annotations, recordings)
end

function dataset(annotations::AbstractDict, recordings::AbstractDict)
    idxs = collect(keys(annotations))
    SpeechDataset(idxs, annotations, recordings)
end

Base.getindex(d::SpeechDataset, key::AbstractString) = d.recordings[key], d.annotations[key]
Base.getindex(d::SpeechDataset, idx::Integer) = getindex(d, d.idxs[idx])
# Fix1 -> partial function with the 1st argument fixed
Base.getindex(d::SpeechDataset, idxs::AbstractVector) = map(Base.Fix1(getindex, d), idxs)
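`Base.Fix1(getindex, d)` is the partial application `i -> getindex(d, i)`, so mapping it over a vector of keys looks up each key in turn. A toy illustration of the pattern (plain `Dict`, not a `SpeechDataset`):

```julia
d = Dict("a" => 1, "b" => 2)
f = Base.Fix1(getindex, d)  # f(k) == getindex(d, k) == d[k]
map(f, ["a", "b"])          # [1, 2]
```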

Base.length(d::SpeechDataset) = length(d.idxs)

function Base.filter(fn, d::SpeechDataset)
    fidxs = filter(d.idxs) do i
        fn((d.recordings[i], d.annotations[i]))
    end
    idset = Set(fidxs)

    fannotations = filter(d.annotations) do (k, v)
        k  idset
    end

    frecs = filter(d.recordings) do (k, v)
        k  idset
    end

    SpeechDataset(fidxs, fannotations, frecs)
end

src/lexicons.jl

# SPDX-License-Identifier: CECILL-2.1


const CMUDICT_URL = "http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/sphinxdict/cmudict_SPHINX_40"
const FRMFA_DICT_URL = "https://raw.githubusercontent.com/MontrealCorpusTools/mfa-models/main/dictionary/french/mfa/french_mfa.dict"

function normalizeword(word)
    String(uppercase(word))
end

function normalizephoneme(phoneme)
    String(uppercase(phoneme))
end


"""
    CMUDICT(path)

Return the pronunciation dictionary loaded from the CMU Sphinx dictionary.
The CMU dictionary will be downloaded and stored at `path`. Subsequent
calls will only read the file at `path` without downloading the data again.
"""
function CMUDICT(path)
    if ! isfile(path)
        mkpath(dirname(path))
        dir = mktempdir()
        run(`wget -P $dir $CMUDICT_URL`)
        mv(joinpath(dir, "cmudict_SPHINX_40"), path)
    end

    lexicon = Dict()
    open(path, "r") do f
        for line in eachline(f)
            word, pron... = split(line)

            word = replace(word, "(1)" => "", "(2)" => "", "(3)" => "", "(4)" => "")

            pronunciations = get(lexicon, word, [])
            push!(pronunciations, pron)
            lexicon[word] = pronunciations
        end
    end
    lexicon
end
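The get/push/store sequence above accumulates every pronunciation variant under its word. The same pattern can be sketched standalone (toy entries, not the real CMU file); `get!` fetches or inserts the vector in one step:

```julia
lexicon = Dict{String, Vector{Vector{String}}}()
for line in ["READ R IY D", "READ R EH D", "CAT K AE T"]
    word, pron... = split(line)  # slurp the phone tokens after the word
    push!(get!(lexicon, String(word), Vector{String}[]), String.(pron))
end
lexicon["READ"]  # two pronunciation variants
```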


"""
    TIMITDICT(timitdir)

Return the pronunciation dictionary provided by the TIMIT corpus (located
in `timitdir`).
"""
function TIMITDICT(timitdir)
    dictfile = joinpath(timitdir, "doc", "timitdic.txt")
    iscomment(line) = first(line) == ';'

    lexicon = Dict{String,Vector{Vector{String}}}()
    for line in eachline(dictfile)
        iscomment(line) && continue

        word, pron = split(line, limit=2)
        pron = strip(pron, ['/', '\t', ' '])
        word = '~' in word ? split(word, "~", limit=2)[1] : word

        word = normalizeword(word)
        pron = normalizephoneme.(split(pron))

        pronunciations = get(lexicon, word, Vector{String}[])
        push!(pronunciations, pron)
        lexicon[word] = pronunciations
    end
    lexicon
end
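Each `timitdic.txt` entry is a word followed by a slash-delimited phone string; the loop above splits the two apart, strips the slashes, and uppercases the phones. A standalone sketch on a single illustrative entry:

```julia
line = "abbreviate  /ax b r iy1 v iy ey2 t/"   # illustrative entry
word, pron = split(line, limit=2)
pron = strip(pron, ['/', '\t', ' '])
phones = uppercase.(split(pron))
# word == "abbreviate", phones == ["AX", "B", "R", "IY1", "V", "IY", "EY2", "T"]
```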


"""
    MFAFRDICT(path)

Return the French pronunciation dictionary as provided by MFA (french_mfa v2.0.0a).
"""
function MFAFRDICT(path)
    if ! isfile(path)
        mkpath(dirname(path))
        dir = mktempdir()
        run(`wget -P $dir $FRMFA_DICT_URL`)
        mv(joinpath(dir, "french_mfa.dict"), path)
    end
    lexicon = Dict()
    open(path, "r") do f
        for line in eachline(f)
            word, pron... = split(line)
            pronunciations = get(lexicon, word, [])
            push!(pronunciations, pron)
            lexicon[word] = pronunciations
        end
    end
    lexicon
end
# SPDX-License-Identifier: CECILL-2.1

#=====================================================================#
# HTML pretty display

function Base.show(io::IO, ::MIME"text/html", r::AbstractAudioSource)
    print(io, "<audio controls ")
    print(io, "src=\"data:audio/wav;base64,")

    x, s, _ = loadsource(r, :)
    iob64_encode = Base64EncodePipe(io)
    wavwrite(x, iob64_encode, Fs = s, nbits = 8, compression = WAV.WAVE_FORMAT_PCM)
    close(iob64_encode)

    println(io, "\" />")
end

#=====================================================================#
# JSON serialization of a manifest item

@@ -68,21 +53,21 @@ function Base.show(io::IO, m::MIME"application/json", r::Recording)
    print(io, "}")
end

function Base.show(io::IO, m::MIME"application/json", s::Supervision)
function Base.show(io::IO, m::MIME"application/json", a::Annotation)
    compact = get(io, :compact, false)
    indent = compact ? 0 : 2
    printfn = compact ? print : println
    printfn(io, "{")
    printfn(io, repeat(" ", indent), "\"id\": \"", s.id, "\", ")
    printfn(io, repeat(" ", indent), "\"recording_id\": \"", s.recording_id, "\", ")
    printfn(io, repeat(" ", indent), "\"start\": ", s.start, ", ")
    printfn(io, repeat(" ", indent), "\"duration\": ", s.duration, ", ")
    printfn(io, repeat(" ", indent), "\"channel\": ", s.channel, ", ")
    printfn(io, repeat(" ", indent), "\"data\": ", s.data |> json)
    printfn(io, repeat(" ", indent), "\"id\": \"", a.id, "\", ")
    printfn(io, repeat(" ", indent), "\"recording_id\": \"", a.recording_id, "\", ")
    printfn(io, repeat(" ", indent), "\"start\": ", a.start, ", ")
    printfn(io, repeat(" ", indent), "\"duration\": ", a.duration, ", ")
    printfn(io, repeat(" ", indent), "\"channels\": ", a.channels |> json, ", ")
    printfn(io, repeat(" ", indent), "\"data\": ", a.data |> json)
    print(io, "}")
end

function JSON.json(r::Union{Recording, Supervision}; compact = true)
function JSON.json(r::Union{Recording, Annotation}; compact = true)
    out = IOBuffer()
    show(IOContext(out, :compact => compact), MIME("application/json"), r)
    String(take!(out))
@@ -111,12 +96,12 @@ Recording(d::Dict) = Recording(
    d["samplerate"]
)

Supervision(d::Dict) = Supervision(
Annotation(d::Dict) = Annotation(
    d["id"],
    d["recording_id"],
    d["start"],
    d["duration"],
    d["channel"],
    d["channels"],
    d["data"]
)

@@ -139,13 +124,18 @@ function readmanifest(io::IO, T)
    manifest
end

manifestname(T::Type{<:Recording}, subset) = "recording-manifest-$(subset).jsonl"
manifestname(T::Type{<:Supervision}, subset) = "supervision-manifest-$(subset).jsonl"
# Some utilities
manifestname(::Type{<:Recording}, name) = "recordings.jsonl"
manifestname(::Type{<:Annotation}, name) = "annotations-$name.jsonl"

load(T::Type{<:Union{Recording,Supervision}}, path::AbstractString) =
    open(f -> readmanifest(f, T), path, "r")
load(corpus::SpeechCorpus, dir, T, subset) =
    load(T, joinpath(path(corpus, dir), manifestname(T, subset)))
load(corpus::SpeechCorpus, T, subset) =
    load(corpus, corporadir, T, subset)
"""
    load(Annotation, path)
    load(Recording, path)

Load Recording/Annotation manifest from `path`.
"""
load(T::Type{<:Union{Recording, Annotation}}, path) = open(f -> readmanifest(f, T), path, "r")

function checkdir(dir::AbstractString)
    isdir(dir) || throw(ArgumentError("$dir is not an existing directory"))
end
# SPDX-License-Identifier: CECILL-2.1

"""
    abstract type AbstractAudioSource end

Abstract type for all audio sources. Possible audio sources are:
* `FileAudioSource`
* `URLAudioSource`
* `CmdAudioSource`

You can load the data of an audio source with the internal function

    loadsource(s::AbstractAudioSource, subrange)

"""
abstract type AbstractAudioSource end

struct FileAudioSource <: AbstractAudioSource
    path::AbstractString
end

struct URLAudioSource <: AbstractAudioSource
    url::AbstractString
end

struct CmdAudioSource <: AbstractAudioSource
    cmd
end
CmdAudioSource(c::String) = CmdAudioSource(Cmd(String.(split(c))))

loadsource(s::FileAudioSource, subrange) = wavread(s.path; subrange)
loadsource(s::URLAudioSource, subrange) = wavread(IOBuffer(HTTP.get(s.url).body); subrange)
loadsource(s::CmdAudioSource, subrange) = wavread(IOBuffer(read(pipeline(s.cmd))); subrange)
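The `CmdAudioSource(::String)` convenience constructor simply tokenizes the string into a `Cmd` on whitespace. A standalone illustration of that conversion (hypothetical command, nothing is executed):

```julia
cmd = Cmd(String.(split("sph2pipe -f wav utt1.sph")))
cmd.exec  # ["sph2pipe", "-f", "wav", "utt1.sph"]
```

Note that splitting on whitespace does not handle quoted arguments, so paths containing spaces would be tokenized incorrectly.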

"""
    abstract type ManifestItem end

@@ -71,7 +39,7 @@ end

function Recording(uttid, s::AbstractAudioSource; channels = missing, samplerate = missing)
    if ismissing(channels) || ismissing(samplerate)
        x, sr = loadsource(s, :)
        x, sr = loadaudio(s)
        samplerate = ismissing(samplerate) ? Int(sr) : samplerate
        channels = ismissing(channels) ? collect(1:size(x,2)) : channels
    end
@@ -79,47 +47,49 @@ function Recording(uttid, s::AbstractAudioSource; channels = missing, samplerate
end

"""
    struct Supervision <: ManifestItem
    struct Annotation <: ManifestItem
        id::AbstractString
        recording_id::AbstractString
        start::Float64
        duration::Float64
        channel::Int
        channels::Union{Vector, Colon}
        data::Dict
    end

A "supervision" defines a segment of a recording on a single channel.
An "annotation" defines a segment of a recording on one or more channels.
The `data` field is an arbitrary dictionary holding the nature of the
supervision.
annotation. `start` and `duration` (in seconds) define
where the segment is located within the recording `recording_id`.

# Constructor

    Supervision(id, recording_id, start, duration, channel, data)
    Supervision(id, recording_id[; channel = missing, start = -1, duration = -1, data = missing])
    Annotation(id, recording_id, start, duration, channels, data)
    Annotation(id, recording_id[; channels = missing, start = -1, duration = -1, data = missing])

If `start` and/or `duration` are negative, the segment is considered to
be the whole sequence length of the recording.
"""
struct Supervision <: ManifestItem
struct Annotation <: ManifestItem
    id::AbstractString
    recording_id::AbstractString
    start::Float64
    duration::Float64
    channel::Int
    channels::Union{Vector, Colon}
    data::Dict
end

Supervision(id, recid; channel = missing, start = -1, duration = -1, data = missing) =
    Supervision(id, recid, start, duration, channel, data)
Annotation(id, recid; channels = missing, start = -1, duration = -1, data = missing) =
    Annotation(id, recid, start, duration, channels, data)


"""
    load(recording[; start = -1, duration = -1, channels = recording.channels])
    load(recording, supervision)
    load(recording, annotation)

Load the signal from a recording. `start`, `duration` (in seconds) can
be used to load only a segment. If a `supervision` is given, function
be used to load only a segment. If an `annotation` is given, the function
will return only the portion of the signal corresponding to the
supervision segment.
annotation segment.

The function returns a tuple `(x, sr)` where `x` is an ``N×C`` array,
``N`` being the length of the signal and ``C`` the number of channels,
@@ -134,10 +104,9 @@ function load(r::Recording; start = -1, duration = -1, channels = r.channels)
        subrange = (:)
    end

    x, sr, _, _ = loadsource(r.source, subrange)
    x, sr = loadaudio(r.source, subrange)
    x[:,channels], sr
end

load(r::Recording, s::Supervision) =
    load(r; start = s.start, duration = s.duration, channels = [s.channel])
load(r::Recording, a::Annotation) = load(r; start = a.start, duration = a.duration, channels = a.channels)
# SPDX-License-Identifier: CECILL-2.1


"""
    abstract type SpeechCorpus
    abstract type SpeechCorpus end

Abstract type for all speech corpora.
"""
abstract type SpeechCorpus end


"""
    path(corpus)
    lang(corpus)

Path to the directory where is stored the corpus' data.
Return the ISO 639-3 code of the language of the corpus.
"""
path(corpus::SpeechCorpus, dir) = joinpath(dir, corpus.lang, corpus.name)
lang


"""
    name(corpus)

Return the name identifier of the corpus.
"""
name


"""
    download(corpus[, dir = homedir()])
    download(corpus, rootdir)

Download the data of the corpus to `dir`.
"""
Base.download(corpus::SpeechCorpus) = download(corpus, SPEECH_CORPORA_ROOTDIR)
Base.download

"""
    prepare(corpus[, dir = homedir()])
    prepare(corpus, rootdir)

Prepare the manifests of corpus to `dir`.
Prepare the manifests of the corpus.
"""
prepare(corpus::SpeechCorpus) = prepare(corpus, SPEECH_CORPORA_ROOTDIR)
prepare