*outputdir/
Manifest.toml
notebook-test.jl
# Tags
## [0.15.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.15.0) - 19/06/2024
### Added
- Added support for the Speech2Tex dataset
## [0.14.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.14.0) - 11/06/2024
### Added
- Added support for the AVID dataset
## [0.13.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.13.0) - 10/06/2024
### Added
- Added support for the INA Diachrony dataset
### Fixed
- Fixed Mini LibriSpeech data preparation
## [0.12.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.12.0) - 21/05/2024
### Changed
- `SpeechDataset` is now a collection of `(Recording, Annotation)` tuples.
## [0.11.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.11.0) - 21/05/2024
### Added
- Filtering a speech dataset based on recording id.
### Improved
- Faster TIMIT preparation
## [0.10.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.10.0) - 22/02/2024
### Added
- Extract alignments from TIMIT
### Changed
- `Supervision` is now `Annotation`
## [0.9.4](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.4) - 22/02/2024
### Fixed
- TIMIT data preparation
## [0.9.3](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.3) - 12/02/2024
### Fixed
- `CMUDICT("dir/path")` fails if `dir` does not already exist.
## [0.9.2](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.2) - 09/02/2024
### Fixed
- Invalid type for field `channels` of `Recording`
- `MINILIBRISPEECH` broken
## [0.9.1](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.1) - 09/02/2024
### Fixed
- Not possible to use `:` as channel specifier
## [0.9.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.0) - 09/02/2024
### Changed
- `TIMIT` and `MINILIBRISPEECH` directly create the `dataset`
### Added
- CMU and TIMIT lexicons
## [0.8.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.8.0) - 02/02/2024
### Added
- New `dataset` function, which builds a `SpeechDataset` from manifest files
- Compatibility with `MLUtils.DataLoader`
## [0.7.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.7.0) - 14/12/2023
### Changed
- Refactored API; the TIMIT dataset works (but LibriSpeech does not anymore)
## [0.6.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.6.0) - 28/09/2023
### Added
- Raw audio data source
## [0.5.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.5.0) - 25/09/2023
### Added
- The data can be loaded directly from an audio source with the `load` function.
## [0.4.1](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.4.1) - 25/09/2023
### Added
......
name = "SpeechCorpora"
uuid = "3225a15e-d855-4a07-9546-2418058331ae"
authors = ["Lucas ONDEL YANG <lucas.ondel@cnrs.fr>"]
version = "0.4.1"
name = "SpeechDatasets"
uuid = "ae813453-fab8-46d9-ab8f-a64c05464021"
authors = ["Lucas ONDEL YANG <lucas.ondel@cnrs.fr>",
"Simon DEVAUCHELLE <simon.devauchelle@universite-paris-saclay.fr>",
"Nicolas DENIER <nicolas.denier@lisn.fr>"]
version = "0.15.0"
[deps]
Base64 = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
WAV = "8149f6b0-98f6-5db9-b78f-408fbbb8ef88"
MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
SpeechFeatures = "6f3487c4-5ca2-4050-bfeb-2cf56df92307"
[compat]
julia = "1.10"
JSON = "0.21"
WAV = "1.2"
julia = "1.8"
SpeechFeatures = "0.8"
# SpeechCorpora.jl
# SpeechDatasets.jl
A Julia package to download and prepare speech corpora.
......@@ -7,35 +7,44 @@ A Julia package to download and prepare speech corpora.
Make sure to add the [FAST registry](https://gitlab.lisn.upsaclay.fr/fast/registry)
to your Julia installation. Then, install the package as usual:
```
pkg> add SpeechCorpora
pkg> add SpeechDatasets
```
## Example
```
julia> using SpeechCorpora
julia> using SpeechDatasets
julia> corpus = MultilingualLibriSpeech("fra") |> download |> prepare
julia> dataset = MINILIBRISPEECH("outputdir", :train) # :dev | :test
...
# Load the recording manifest.
julia> recs = load(corpus, Recording, "dev") # use "train", "dev" or "test"
julia> dataset = TIMIT("/path/to/timit/dir", "outputdir", :train) # :dev | :test
...
# Load the supervision manifest.
julia> sups = load(corpus, Supervision, "dev") # use "train", "dev" or "test"
julia> dataset = INADIACHRONY("/path/to/ina_wav/dir", "outputdir", "/path/to/ina_csv/dir") # ina_csv dir optional
...
# Load the signal of the first supervision segment
julia> s = first(values(sups))
julia> x, samplerate = load(recs[s.recording_id], s)
julia> dataset = AVID("/path/to/avid/dir", "outputdir")
...
# Play the recording of the first supervision segment
julia> play(recs[s.recording_id], s)
julia> dataset = SPEECH2TEX("/path/to/speech2tex/dir", "outputdir")
...
```
## Author
* Lucas ONDEL YANG (LISN, CNRS)
julia> for ((signal, fs), supervision) in dataset
# do something
end
# Lexicons
julia> CMUDICT("outputfile")
...
julia> TIMITDICT("/path/to/timit/dir")
...
```
## License
This software is provided under the CeCILL 2.1 license (see the [`/LICENSE`](/LICENSE)
This software is provided under the CeCILL 2.1 license (see the [`/LICENSE`](/LICENSE))
# SPDX-License-Identifier: CECILL-2.1
module SpeechCorpora
module SpeechDatasets
using Base64
using HTTP
using JSON
using WAV
using SpeechFeatures
import MLUtils
export
# ManifestItem
FileAudioSource,
CmdAudioSource,
URLAudioSource,
Recording,
Supervision,
Annotation,
load,
# Manifest interface
......@@ -22,28 +18,34 @@ export
# Corpora interface
download,
lang,
name,
prepare,
# Corpora
MultilingualLibriSpeech,
MiniLibriSpeech
MINILIBRISPEECH,
TIMIT,
INADIACHRONY,
AVID,
SPEECH2TEX,
# Lexicon
CMUDICT,
TIMITDICT,
MFAFRDICT,
SPEECH_CORPORA_ROOTDIR = homedir()
"""
setrootdir(path)
Set the root directory where the datasets are stored. Defaults to the user's
home directory.
"""
setrootdir(path) = global SPEECH_CORPORA_ROOTDIR = path
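# A minimal usage sketch (the path is hypothetical): redirect where datasets
# are stored before creating any of them.
#
#     setrootdir("/data/speech")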
# Dataset
dataset
include("speechcorpus.jl")
include("manifest_item.jl")
include("manifest_io.jl")
include("corpora/multilingual_librispeech.jl")
include("corpora/mini_librispeech.jl")
include("dataset.jl")
# Supported corpora
include.("corpora/" .* filter(contains(r".jl$"), readdir(joinpath(@__DIR__, "corpora"))))
include("lexicons.jl")
end
# SPDX-License-Identifier: CECILL-2.1
function avid_recordings(dir::AbstractString)
checkdir(dir)
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
id = filename
path = joinpath(root, file)
audio_src = FileAudioSource(path)
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 16000
)
end
end
recordings
end
function load_metadata_files(dir::AbstractString)
tasksdict = Dict('s' => "SENT", 'p' => "PARA")
metadatadict = Dict(key =>
readlines(joinpath(dir, "Metadata_with_labels_$(tasksdict[key]).csv"))
for key in keys(tasksdict))
return metadatadict
end
function get_metadata(filename, metadatadict)
task = split(filename, "_")[3][1]
headers = metadatadict[task][1]
headers = split(headers, ",")
file_metadata = filter(x -> contains(x, filename), metadatadict[task])[1]
file_metadata = split(file_metadata, ",")
metadata = Dict(
headers[i] => file_metadata[i]
for i = 1:length(headers)
)
return metadata
end
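# Hypothetical illustration (the rows below are made up): the third
# underscore-separated field of the filename starts with 's' (sentence task)
# or 'p' (paragraph task), which selects the metadata file to search.
#
#     md = Dict('s' => ["id,task,label", "rec1_a_s01,read,modal"])
#     get_metadata("rec1_a_s01", md)
#     # -> Dict("id" => "rec1_a_s01", "task" => "read", "label" => "modal")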
function avid_annotations(dir)
checkdir(dir)
annotations = Dict()
metadatadict = load_metadata_files(dir)
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
# extract metadata from csv files
metadata = get_metadata(filename, metadatadict)
id = filename
# generate annotation
annotations[id] = Annotation(
id, # audio id
id, # annotation id
-1, # start and duration of -1 mean that we take
-1, # the whole recording
[1], # only 1 channel (mono recording)
metadata # additional information
)
end
end
annotations
end
function download_avid(dir)
@info "Directory $dir not found.\nDownloading AVID dataset (9.9 GB)"
url = "https://zenodo.org/records/10524873/files/AVID.zip?download=1"
filename = "AVID.zip"
filepath = joinpath(dir,filename)
run(`mkdir -p $dir`)
run(`wget $url -O $filepath`)
@info "Download complete, extracting files"
run(`unzip $filepath -d $dir`)
run(`rm $filepath`)
return joinpath(dir, "AVID")
end
function avid_prepare(datadir, outputdir)
# Validate the data directory
isdir(datadir) || (datadir = download_avid(datadir))
# Create the output directory.
outputdir = mkpath(outputdir)
rm(joinpath(outputdir, "recordings.jsonl"), force=true)
# Recordings
recordings = Array{Dict}(undef, 2)
recordings_path = joinpath(datadir, "Repository 2")
@info "Extracting recordings from $recordings_path"
recordings[1] = avid_recordings(recordings_path)
# Calibration tones
calibtones_path = joinpath(datadir, "Calibration_tones")
@info "Extracting recordings from $calibtones_path"
recordings[2] = avid_recordings(calibtones_path)
for (i, manifestpath) in enumerate([joinpath(outputdir, "recordings.jsonl"), joinpath(outputdir, "calibration_tones.jsonl")])
open(manifestpath, "w") do f
writemanifest(f, recordings[i])
end
end
# Annotations
annotations_path = recordings_path
@info "Extracting annotations from $annotations_path"
annotations = avid_annotations(annotations_path)
manifestpath = joinpath(outputdir, "annotations.jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, annotations)
end
end
function AVID(datadir, outputdir)
if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
isfile(joinpath(outputdir, "calibration_tones.jsonl")) &&
isfile(joinpath(outputdir, "annotations.jsonl")))
avid_prepare(datadir, outputdir)
end
dataset(outputdir, "")
end
# SPDX-License-Identifier: CECILL-2.1
function ina_diachrony_recordings(dir::AbstractString)
checkdir(dir)
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
id = "ina_diachrony§$filename"
path = joinpath(root, file)
audio_src = FileAudioSource(path)
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 16000
)
end
end
recordings
end
function ina_diachrony_get_metadata(filename)
metadata = split(filename, "§")
age, sex = split(metadata[2], "_")
Dict(
"speaker" => metadata[3],
"timeperiod" => metadata[1],
"age" => age,
"sex" => sex,
)
end
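# Hypothetical illustration (field values are made up): filenames are expected
# to follow the pattern "timeperiod§age_sex§speaker".
#
#     ina_diachrony_get_metadata("1950§45_f§dupont")
#     # -> Dict("timeperiod" => "1950", "age" => "45",
#     #         "sex" => "f", "speaker" => "dupont")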
function ina_diachrony_annotations_whole(dir)
checkdir(dir)
annotations = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
# extract metadata from filename
metadata = ina_diachrony_get_metadata(filename)
# extract transcription text (same filename but .txt)
textfilepath = joinpath(root, "$filename.txt")
metadata["text"] = isfile(textfilepath) ? join(readlines(textfilepath), "\n") : ""
id = "ina_diachrony§$filename"
annotation_id = id*"§0"
# generate annotation
annotations[annotation_id] = Annotation(
id, # audio id
annotation_id, # annotation id
-1, # start and duration of -1 mean that we take
-1, # the whole recording
[1], # only 1 channel (mono recording)
metadata # additional information
)
end
end
annotations
end
function ina_diachrony_annotations_csv(dir)
checkdir(dir)
annotations = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".csv" && continue
# extract metadata from filename
metadata = ina_diachrony_get_metadata(filename)
id = "ina_diachrony§$filename"
# generate annotation for each line in csv
open(joinpath(root, file)) do f
header = readline(f)
line = 1
# read till end of file
while ! eof(f)
current_line = readline(f)
start_time, end_time, text = split(current_line, ",", limit=3)
start_time = parse(Float64, start_time)
duration = parse(Float64, end_time)-start_time
metadata["text"] = text
annotation_id = id*"§$line"
annotations[annotation_id] = Annotation(
id, # audio id
annotation_id, # annotation id
start_time, # start
duration, # duration
[1], # only 1 channel (mono recording)
metadata # additional information
)
line += 1
end
end
end
end
annotations
end
function ina_diachrony_prepare(ina_wav_dir, ina_csv_dir, outputdir)
# Validate the data directory
for d in [ina_wav_dir, ina_csv_dir]
isnothing(d) || checkdir(d)
end
# Create the output directory.
outputdir = mkpath(outputdir)
rm(joinpath(outputdir, "recordings.jsonl"), force=true)
# Recordings
@info "Extracting recordings from $ina_wav_dir"
recordings = ina_diachrony_recordings(ina_wav_dir)
manifestpath = joinpath(outputdir, "recordings.jsonl")
open(manifestpath, "w") do f
writemanifest(f, recordings)
end
# Annotations
@info "Extracting annotations from $ina_wav_dir"
annotations = ina_diachrony_annotations_whole(ina_wav_dir)
if ! isnothing(ina_csv_dir)
@info "Extracting annotations from $ina_csv_dir"
csv_annotations = ina_diachrony_annotations_csv(ina_csv_dir)
annotations = merge(annotations, csv_annotations)
end
manifestpath = joinpath(outputdir, "annotations.jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, annotations)
end
end
function INADIACHRONY(ina_wav_dir, outputdir, ina_csv_dir=nothing)
if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
isfile(joinpath(outputdir, "annotations.jsonl")))
ina_diachrony_prepare(ina_wav_dir, ina_csv_dir, outputdir)
end
dataset(outputdir, "")
end
......@@ -11,15 +11,9 @@ const MINILS_SUBSETS = Dict(
"dev" => "dev-clean-2"
)
const MINILS_LANG = "eng"
const MINILS_NAME = "mini_librispeech"
#######################################################################
struct MiniLibriSpeech <: SpeechCorpus
lang
name
struct MINILIBRISPEECH <: SpeechCorpus
recordings
train
dev
......@@ -48,7 +42,7 @@ function minils_recordings(dir, subset)
recs
end
function minils_supervisions(dir, subset)
function minils_annotations(dir, subset)
subsetdir = joinpath(dir, "LibriSpeech", MINILS_SUBSETS[subset])
sups = Dict()
for d1 in readdir(subsetdir; join = true)
......@@ -58,8 +52,12 @@ function minils_supervisions(dir, subset)
open(joinpath(d2, "$(k1)-$(k2).trans.txt"), "r") do f
for line in eachline(f)
tokens = split(line)
s = Supervision(tokens[1], tokens[1]; channel = 1,
data = Dict("text" => join(tokens[2:end], " ")))
s = Annotation(
tokens[1], # annotation id
tokens[1]; # recording id
channels = [1],
data = Dict("text" => join(tokens[2:end], " "))
)
sups[s.id] = s
end
end
......@@ -89,7 +87,7 @@ end
function minils_prepare(dir)
# 1. Recording manifest.
out = joinpath(dir, "recording-manifest.jsonl")
out = joinpath(dir, "recordings.jsonl")
if ! isfile(out)
open(out, "w") do f
for subset in ["train", "dev"]
......@@ -100,12 +98,12 @@ function minils_prepare(dir)
end
end
# 2. Supervision manifests.
for subset in ["train", "dev"]
out = joinpath(dir, "supervision-manifest-$subset.jsonl")
# 2. Annotation manifests.
for (subset, name) in [("train", "train"), ("dev", "dev"), ("dev", "test")]
out = joinpath(dir, "annotations-$name.jsonl")
if ! isfile(out)
@debug "preparing supervision manifest ($subset) $out"
sups = minils_supervisions(dir, subset)
@debug "preparing annotation manifest ($subset) $out"
sups = minils_annotations(dir, subset)
open(out, "w") do f
writemanifest(f, sups)
end
......@@ -113,20 +111,10 @@ function minils_prepare(dir)
end
end
function MiniLibriSpeech(outdir)
dir = joinpath(outdir, MINILS_LANG, MINILS_NAME)
function MINILIBRISPEECH(dir, subset)
minils_download(dir)
minils_prepare(dir)
MiniLibriSpeech(
MINILS_LANG,
MINILS_NAME,
load(Recording, joinpath(dir, "recording-manifest.jsonl")),
load(Supervision, joinpath(dir, "supervision-manifest-train.jsonl")),
load(Supervision, joinpath(dir, "supervision-manifest-dev.jsonl")),
load(Supervision, joinpath(dir, "supervision-manifest-dev.jsonl")),
)
dataset(dir, subset)
end
MiniLibriSpeech() = MiniLibriSpeech(SPEECH_CORPORA_ROOTDIR)
......@@ -89,13 +89,13 @@ function recordings(corpus::MultilingualLibriSpeech, dir, subset)
recs
end
function supervisions(corpus::MultilingualLibriSpeech, dir, subset)
function annotations(corpus::MultilingualLibriSpeech, dir, subset)
trans = joinpath(dir, "mls_$(MLS_LANG_CODE[corpus.lang])", subset, "transcripts.txt")
sups = Dict()
open(trans, "r") do f
for line in eachline(f)
tokens = split(line)
s = Supervision(tokens[1], tokens[1]; channel = 1,
s = Annotation(tokens[1], tokens[1]; channels = [1],
data = Dict("text" => join(tokens[2:end], " ")))
sups[s.id] = s
end
......@@ -118,12 +118,12 @@ function prepare(corpus::MultilingualLibriSpeech, outdir)
end
end
# 2. Supervision manifests.
# 2. Annotation manifests.
for subset in ["train", "dev", "test"]
out = joinpath(dir, "supervision-manifest-$subset.jsonl")
@info "preparing supervision manifest ($subset) $out"
out = joinpath(dir, "annotation-manifest-$subset.jsonl")
@info "preparing annotation manifest ($subset) $out"
if ! isfile(out)
sups = supervisions(corpus, dir, subset)
sups = annotations(corpus, dir, subset)
open(out, "w") do f
writemanifest(f, sups)
end
......
# SPDX-License-Identifier: CECILL-2.1
function speech2tex_recordings(dir::AbstractString)
checkdir(dir)
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
id = filename
path = joinpath(root, file)
audio_src = FileAudioSource(path)
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 48000
)
end
end
recordings
end
extract_digits(str::AbstractString) = filter(isdigit, str)
isnumber(str::AbstractString) = extract_digits(str) == str
function speech2tex_get_metadata(filename)
# possible cases: line123_p1 line123_124_p1 line123_p1_part2 (not observed but also supported: line123_124_p1_part2)
split_name = split(filename, "_")
metadata = Dict()
if isnumber(split_name[2])
metadata["line"] = extract_digits(split_name[1])*"_"*split_name[2]
metadata["speaker"] = split_name[3]
else
metadata["line"] = extract_digits(split_name[1])
metadata["speaker"] = split_name[2]
end
if occursin("part", split_name[end])
metadata["part"] = extract_digits(split_name[end])
end
metadata
end
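# Hypothetical illustration of the cases listed above:
#
#     speech2tex_get_metadata("line123_p1")        # -> line "123", speaker "p1"
#     speech2tex_get_metadata("line123_124_p1")    # -> line "123_124", speaker "p1"
#     speech2tex_get_metadata("line123_p1_part2")  # additionally sets part "2"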
function speech2tex_annotations(audiodir, transcriptiondir, texdir)
checkdir.([audiodir, transcriptiondir, texdir])
annotations = Dict()
for (root, subdirs, files) in walkdir(audiodir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
# extract metadata from the filename
metadata = speech2tex_get_metadata(filename)
# extract transcription and tex (same filenames but .txt)
dirdict = Dict(transcriptiondir => "transcription", texdir => "latex")
for (d, label) in dirdict
textfilepath = joinpath(d, "$filename.txt")
metadata[label] = isfile(textfilepath) ? join(readlines(textfilepath), "\n") : ""
end
id = filename
# generate annotation
annotations[id] = Annotation(
id, # audio id
id, # annotation id
-1, # start and duration of -1 mean that we take
-1, # the whole recording
[1], # only 1 channel (mono recording)
metadata # additional information
)
end
end
annotations
end
function speech2tex_prepare(datadir, outputdir)
# Validate the data directory
checkdir(datadir)
# Create the output directory.
outputdir = mkpath(outputdir)
rm(joinpath(outputdir, "recordings.jsonl"), force=true)
# Recordings
recordings_path = joinpath(datadir, "audio")
@info "Extracting recordings from $recordings_path"
recordings = speech2tex_recordings(recordings_path)
manifestpath = joinpath(outputdir, "recordings.jsonl")
open(manifestpath, "w") do f
writemanifest(f, recordings)
end
# Annotations
transcriptiondir = joinpath(datadir, "sequences")
texdir = joinpath(datadir, "latex")
@info "Extracting annotations from $transcriptiondir and $texdir"
annotations = speech2tex_annotations(recordings_path, transcriptiondir, texdir)
manifestpath = joinpath(outputdir, "annotations.jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, annotations)
end
end
function SPEECH2TEX(datadir, outputdir)
if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
isfile(joinpath(outputdir, "annotations.jsonl")))
speech2tex_prepare(datadir, outputdir)
end
dataset(outputdir, "")
end
# SPDX-License-Identifier: CECILL-2.1
#######################################################################
const TIMIT_SUBSETS = Dict(
"train" => "train",
"dev" => "dev",
"test" => "test"
)
const TIMIT_DEV_SPK_LIST = Set([
"faks0",
"fdac1",
"fjem0",
"mgwt0",
"mjar0",
"mmdb1",
"mmdm2",
"mpdf0",
"fcmh0",
"fkms0",
"mbdg0",
"mbwm0",
"mcsh0",
"fadg0",
"fdms0",
"fedw0",
"mgjf0",
"mglb0",
"mrtk0",
"mtaa0",
"mtdt0",
"mthc0",
"mwjg0",
"fnmr0",
"frew0",
"fsem0",
"mbns0",
"mmjr0",
"mdls0",
"mdlf0",
"mdvc0",
"mers0",
"fmah0",
"fdrw0",
"mrcs0",
"mrjm4",
"fcal1",
"mmwh0",
"fjsj0",
"majc0",
"mjsw0",
"mreb0",
"fgjd0",
"fjmg0",
"mroa0",
"mteb0",
"mjfc0",
"mrjr0",
"fmml0",
"mrws1"
])
const TIMIT_TEST_SPK_LIST = Set([
"mdab0",
"mwbt0",
"felc0",
"mtas1",
"mwew0",
"fpas0",
"mjmp0",
"mlnt0",
"fpkt0",
"mlll0",
"mtls0",
"fjlm0",
"mbpm0",
"mklt0",
"fnlp0",
"mcmj0",
"mjdh0",
"fmgd0",
"mgrt0",
"mnjm0",
"fdhc0",
"mjln0",
"mpam0",
"fmld0"
])
TIMIT_PHONE_MAP48 = Dict(
"aa" => "aa",
"ae" => "ae",
"ah" => "ah",
"ao" => "ao",
"aw" => "aw",
"ax" => "ax",
"ax-h" => "ax",
"axr" => "er",
"ay" => "ay",
"b" => "b",
"bcl" => "vcl",
"ch" => "ch",
"d" => "d",
"dcl" => "vcl",
"dh" => "dh",
"dx" => "dx",
"eh" => "eh",
"el" => "el",
"em" => "m",
"en" => "en",
"eng" => "ng",
"epi" => "epi",
"er" => "er",
"ey" => "ey",
"f" => "f",
"g" => "g",
"gcl" => "vcl",
"h#" => "sil",
"hh" => "hh",
"hv" => "hh",
"ih" => "ih",
"ix" => "ix",
"iy" => "iy",
"jh" => "jh",
"k" => "k",
"kcl" => "cl",
"l" => "l",
"m" => "m",
"n" => "n",
"ng" => "ng",
"nx" => "n",
"ow" => "ow",
"oy" => "oy",
"p" => "p",
"pau" => "sil",
"pcl" => "cl",
"q" => "",
"r" => "r",
"s" => "s",
"sh" => "sh",
"t" => "t",
"tcl" => "cl",
"th" => "th",
"uh" => "uh",
"uw" => "uw",
"ux" => "uw",
"v" => "v",
"w" => "w",
"y" => "y",
"z" => "z",
"zh" => "zh"
)
TIMIT_PHONE_MAP39 = Dict(
"aa" => "aa",
"ae" => "ae",
"ah" => "ah",
"ao" => "aa",
"aw" => "aw",
"ax" => "ah",
"ax-h" => "ah",
"axr" => "er",
"ay" => "ay",
"b" => "b",
"bcl" => "sil",
"ch" => "ch",
"d" => "d",
"dcl" => "sil",
"dh" => "dh",
"dx" => "dx",
"eh" => "eh",
"el" => "l",
"em" => "m",
"en" => "n",
"eng" => "ng",
"epi" => "sil",
"er" => "er",
"ey" => "ey",
"f" => "f",
"g" => "g",
"gcl" => "sil",
"h#" => "sil",
"hh" => "hh",
"hv" => "hh",
"ih" => "ih",
"ix" => "ih",
"iy" => "iy",
"jh" => "jh",
"k" => "k",
"kcl" => "sil",
"l" => "l",
"m" => "m",
"n" => "n",
"ng" => "ng",
"nx" => "n",
"ow" => "ow",
"oy" => "oy",
"p" => "p",
"pau" => "sil",
"pcl" => "sil",
"q" => "",
"r" => "r",
"s" => "s",
"sh" => "sh",
"t" => "t",
"tcl" => "sil",
"th" => "th",
"uh" => "uh",
"uw" => "uw",
"ux" => "uw",
"v" => "v",
"w" => "w",
"y" => "y",
"z" => "z",
"zh" => "sh"
)
#######################################################################
function timit_prepare(timitdir, dir; audio_fmt="SPHERE")
# Validate the data directory
! isdir(timitdir) && throw(ArgumentError("invalid path $(timitdir)"))
# Create the output directory.
dir = mkpath(dir)
rm(joinpath(dir, "recordings.jsonl"), force=true)
## Recordings
@info "Extracting recordings from $timitdir/train"
train_recordings = timit_recordings(joinpath(timitdir, "train"); fmt=audio_fmt)
# We extract the name of speakers that are not in the dev set
TIMIT_TRAIN_SPK_LIST = Set()
for id in keys(train_recordings)
_, spk, _ = split(id, "_")
if spk ∉ TIMIT_DEV_SPK_LIST
push!(TIMIT_TRAIN_SPK_LIST, spk)
end
end
@info "Extracting recordings from $timitdir/test"
test_recordings = timit_recordings(joinpath(timitdir, "test"); fmt=audio_fmt)
recordings = merge(train_recordings, test_recordings)
manifestpath = joinpath(dir, "recordings.jsonl")
open(manifestpath, "a") do f
writemanifest(f, recordings)
end
# Annotations
@info "Extracting annotations from $timitdir/train"
train_annotations = timit_annotations(joinpath(timitdir, "train"))
@info "Extracting annotations from $timitdir/test"
test_annotations = timit_annotations(joinpath(timitdir, "test"))
annotations = merge(train_annotations, test_annotations)
train_annotations = filter(annotations) do (k, v)
stype = v.data["sentence type"]
spk = v.data["speaker"]
(
(stype == "compact" || stype == "diverse") &&
spk ∈ TIMIT_TRAIN_SPK_LIST
)
end
dev_annotations = filter(annotations) do (k, v)
stype = v.data["sentence type"]
spk = v.data["speaker"]
(
(stype == "compact" || stype == "diverse") &&
spk ∈ TIMIT_DEV_SPK_LIST
)
end
test_annotations = filter(annotations) do (k, v)
stype = v.data["sentence type"]
spk = v.data["speaker"]
(
(stype == "compact" || stype == "diverse") &&
spk ∈ TIMIT_TEST_SPK_LIST
)
end
for (x, y) in ("train" => train_annotations,
"dev" => dev_annotations,
"test" => test_annotations)
manifestpath = joinpath(dir, "annotations-$(x).jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, y)
end
end
end
function timit_recordings(dir::AbstractString; fmt="SPHERE")
! isdir(dir) && throw(ArgumentError("expected directory $dir"))
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
name, ext = splitext(file)
ext != ".wav" && continue
spk = basename(root)
path = joinpath(root, file)
id = "timit_$(spk)_$(name)"
audio_src = if fmt == "SPHERE"
CmdAudioSource(`sph2pipe -f wav $path`)
else
FileAudioSource(path)
end
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 16000
)
end
end
recordings
end
function timit_annotations(dir)
! isdir(dir) && throw(ArgumentError("expected directory $dir"))
splitline(line) = rsplit(line, limit=3)
annotations = Dict()
processed = Set()
for (root, subdirs, files) in walkdir(dir)
for file in files
name, ext = splitext(file)
_, dialect, spk = rsplit(root, "/", limit=3)
# Annotation files already processed (".wrd" and ".phn")
idtuple = (dialect, spk, name)
(idtuple in processed) && continue
push!(processed, (dialect, spk, name))
# Words
wpath = joinpath(root, name * ".wrd")
words = [last(split(line)) for line in eachline(wpath)]
# Phones
ppath = joinpath(root, name * ".phn")
palign = Tuple{Int,Int,String}[]
for line in eachline(ppath)
t0, t1, p = split(line)
push!(palign, (parse(Int, t0), parse(Int, t1), String(p)))
end
sentence_type = if startswith(name, "sa")
"dialect"
elseif startswith(name, "sx")
"compact"
else # startswith(name, "si")
"diverse"
end
id = "timit_$(spk)_$(name)"
annotations[id] = Annotation(
id, # recording id and annotation id are the same since we have
id, # a one-to-one mapping
-1, # start and duration of -1 mean that we take
-1, # the whole recording
[1], # only 1 channel (mono recording)
Dict(
"text" => join(words, " "),
"sentence type" => sentence_type,
"alignment" => palign,
"dialect" => dialect,
"speaker" => spk,
"sex" => string(first(spk)),
)
)
end
end
annotations
end
function TIMIT(timitdir, dir, subset)
if ! (isfile(joinpath(dir, "recordings.jsonl")) &&
isfile(joinpath(dir, "annotations-train.jsonl")) &&
isfile(joinpath(dir, "annotations-dev.jsonl")) &&
isfile(joinpath(dir, "annotations-test.jsonl")))
timit_prepare(timitdir, dir)
end
dataset(dir, subset)
end
# SPDX-License-Identifier: CECILL-2.1
struct SpeechDataset <: MLUtils.AbstractDataContainer
idxs::Vector{AbstractString}
annotations::Dict{AbstractString, Annotation}
recordings::Dict{AbstractString, Recording}
end
"""
dataset(manifestroot)
Load `SpeechDataset` from manifest files stored in `manifestroot`.
Each item of the dataset is a nested tuple `((samples, sampling_rate), Annotation.data)`.
See also [`Annotation`](@ref).
# Examples
```julia-repl
julia> ds = dataset("./manifests", :train)
SpeechDataset(
...
)
julia> ds[1]
(
(samples=[...], sampling_rate=16_000),
Dict(
"text" => "Annotation text here"
)
)
```
"""
function dataset(manifestroot::AbstractString, partition)
partition_name = partition == "" ? "" : "-$(partition)"
annot_path = joinpath(manifestroot, "annotations$(partition_name).jsonl")
rec_path = joinpath(manifestroot, "recordings.jsonl")
annotations = load(Annotation, annot_path)
recordings = load(Recording, rec_path)
dataset(annotations, recordings)
end
function dataset(annotations::AbstractDict, recordings::AbstractDict)
idxs = collect(keys(annotations))
SpeechDataset(idxs, annotations, recordings)
end
Base.getindex(d::SpeechDataset, key::AbstractString) = d.recordings[key], d.annotations[key]
Base.getindex(d::SpeechDataset, idx::Integer) = getindex(d, d.idxs[idx])
# Fix1 -> partial function with the 1st argument fixed
Base.getindex(d::SpeechDataset, idxs::AbstractVector) = map(Base.Fix1(getindex, d), idxs)
Base.length(d::SpeechDataset) = length(d.idxs)
function Base.filter(fn, d::SpeechDataset)
fidxs = filter(d.idxs) do i
fn((d.recordings[i], d.annotations[i]))
end
idset = Set(fidxs)
fannotations = filter(d.annotations) do (k, v)
k ∈ idset
end
frecs = filter(d.recordings) do (k, v)
k ∈ idset
end
SpeechDataset(fidxs, fannotations, frecs)
end
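# A minimal usage sketch (paths and ids are hypothetical): keep only the
# segments of one recording, then iterate mini-batches with MLUtils.
#
#     ds  = dataset("outputdir", "train")
#     sub = filter(((rec, ann),) -> ann.recording_id == "timit_faks0_sa1", ds)
#     for batch in MLUtils.DataLoader(sub; batchsize = 4)
#         # ...
#     end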
# SPDX-License-Identifier: CECILL-2.1
const CMUDICT_URL = "http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/sphinxdict/cmudict_SPHINX_40"
const FRMFA_DICT_URL = "https://raw.githubusercontent.com/MontrealCorpusTools/mfa-models/main/dictionary/french/mfa/french_mfa.dict"
function normalizeword(word)
String(uppercase(word))
end
function normalizephoneme(phoneme)
String(uppercase(phoneme))
end
"""
CMUDICT(path)
Return the pronunciation dictionary loaded from the CMU Sphinx dictionary.
The CMU dictionary will be downloaded and stored at `path`. Subsequent
calls will only read the file at `path` without downloading the data again.
"""
function CMUDICT(path)
if ! isfile(path)
mkpath(dirname(path))
dir = mktempdir()
run(`wget -P $dir $CMUDICT_URL`)
mv(joinpath(dir, "cmudict_SPHINX_40"), path)
end
lexicon = Dict()
open(path, "r") do f
for line in eachline(f)
word, pron... = split(line)
word = replace(word, "(1)" => "", "(2)" => "", "(3)" => "", "(4)" => "")
pronunciations = get(lexicon, word, [])
push!(pronunciations, pron)
lexicon[word] = pronunciations
end
end
lexicon
end
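# A short usage sketch (the path and the word are illustrative):
#
#     lexicon = CMUDICT("outputdir/cmudict.txt")  # downloads on the first call
#     lexicon["HELLO"]  # -> vector of pronunciations, each a vector of phonemes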
"""
TIMITDICT(timitdir)
Return the pronunciation dictionary provided with the TIMIT corpus (located
in `timitdir`).
"""
function TIMITDICT(timitdir)
dictfile = joinpath(timitdir, "doc", "timitdic.txt")
iscomment(line) = first(line) == ';'
lexicon = Dict{String,Vector{Vector{String}}}()
for line in eachline(dictfile)
iscomment(line) && continue
word, pron = split(line, limit=2)
pron = strip(pron, ['/', '\t', ' '])
word = '~' in word ? split(word, "~", limit=2)[1] : word
word = normalizeword(word)
pron = normalizephoneme.(split(pron))
pronunciations = get(lexicon, word, Vector{String}[])
push!(pronunciations, pron)
lexicon[word] = pronunciations
end
lexicon
end
"""
MFAFRDICT(path)
Return the French pronunciation dictionary as provided by MFA (french_mfa v2.0.0a).
"""
function MFAFRDICT(path)
if ! isfile(path)
mkpath(dirname(path))
dir = mktempdir()
run(`wget -P $dir $FRMFA_DICT_URL`)
mv(joinpath(dir, "french_mfa.dict"), path)
end
lexicon = Dict()
open(path, "r") do f
for line in eachline(f)
word, pron... = split(line)
pronunciations = get(lexicon, word, [])
push!(pronunciations, pron)
lexicon[word] = pronunciations
end
end
lexicon
end
\ No newline at end of file
# SPDX-License-Identifier: CECILL-2.1
#=====================================================================#
# HTML pretty display
function Base.show(io::IO, ::MIME"text/html", r::AbstractAudioSource)
print(io, "<audio controls ")
print(io, "src=\"data:audio/wav;base64,")
x, s, _ = loadsource(r, :)
iob64_encode = Base64EncodePipe(io)
wavwrite(x, iob64_encode, Fs = s, nbits = 8, compression = WAV.WAVE_FORMAT_PCM)
close(iob64_encode)
println(io, "\" />")
end
#=====================================================================#
# JSON serialization of a manifest item
......@@ -68,21 +53,21 @@ function Base.show(io::IO, m::MIME"application/json", r::Recording)
print(io, "}")
end
function Base.show(io::IO, m::MIME"application/json", s::Supervision)
function Base.show(io::IO, m::MIME"application/json", a::Annotation)
compact = get(io, :compact, false)
indent = compact ? 0 : 2
printfn = compact ? print : println
printfn(io, "{")
printfn(io, repeat(" ", indent), "\"id\": \"", s.id, "\", ")
printfn(io, repeat(" ", indent), "\"recording_id\": \"", s.recording_id, "\", ")
printfn(io, repeat(" ", indent), "\"start\": ", s.start, ", ")
printfn(io, repeat(" ", indent), "\"duration\": ", s.duration, ", ")
printfn(io, repeat(" ", indent), "\"channel\": ", s.channel, ", ")
printfn(io, repeat(" ", indent), "\"data\": ", s.data |> json)
printfn(io, repeat(" ", indent), "\"id\": \"", a.id, "\", ")
printfn(io, repeat(" ", indent), "\"recording_id\": \"", a.recording_id, "\", ")
printfn(io, repeat(" ", indent), "\"start\": ", a.start, ", ")
printfn(io, repeat(" ", indent), "\"duration\": ", a.duration, ", ")
printfn(io, repeat(" ", indent), "\"channels\": ", a.channels |> json, ", ")
printfn(io, repeat(" ", indent), "\"data\": ", a.data |> json)
print(io, "}")
end
function JSON.json(r::Union{Recording, Supervision}; compact = true)
function JSON.json(r::Union{Recording, Annotation}; compact = true)
out = IOBuffer()
show(IOContext(out, :compact => compact), MIME("application/json"), r)
String(take!(out))
......@@ -111,12 +96,12 @@ Recording(d::Dict) = Recording(
d["samplerate"]
)
Supervision(d::Dict) = Supervision(
Annotation(d::Dict) = Annotation(
d["id"],
d["recording_id"],
d["start"],
d["duration"],
d["channel"],
d["channels"],
d["data"]
)
......@@ -139,13 +124,18 @@ function readmanifest(io::IO, T)
manifest
end
manifestname(T::Type{<:Recording}, subset) = "recording-manifest-$(subset).jsonl"
manifestname(T::Type{<:Supervision}, subset) = "supervision-manifest-$(subset).jsonl"
# Some utilities
manifestname(::Type{<:Recording}, name) = "recordings.jsonl"
manifestname(::Type{<:Annotation}, name) = "annotations-$name.jsonl"
load(T::Type{<:Union{Recording,Supervision}}, path::AbstractString) =
open(f -> readmanifest(f, T), path, "r")
load(corpus::SpeechCorpus, dir, T, subset) =
load(T, joinpath(path(corpus, dir), manifestname(T, subset)))
load(corpus::SpeechCorpus, T, subset) =
load(corpus, corporadir, T, subset)
"""
load(Annotation, path)
load(Recording, path)
Load a `Recording`/`Annotation` manifest from `path`.
"""
load(T::Type{<:Union{Recording, Annotation}}, path) = open(f -> readmanifest(f, T), path, "r")
function checkdir(dir::AbstractString)
isdir(dir) || throw(ArgumentError("$dir is not an existing directory"))
end
# SPDX-License-Identifier: CECILL-2.1
"""
abstract type AbstractAudioSource end
Base type for all audio sources. Possible audio sources are:
* `FileAudioSource`
* `URLAudioSource`
* `CmdAudioSource`
You can load the data of an audio source with the internal function
loadsource(s::AbstractAudioSource, subrange)
"""
abstract type AbstractAudioSource end
struct FileAudioSource <: AbstractAudioSource
path::AbstractString
end
struct URLAudioSource <: AbstractAudioSource
url::AbstractString
end
struct CmdAudioSource <: AbstractAudioSource
cmd
end
CmdAudioSource(c::String) = CmdAudioSource(Cmd(String.(split(c))))
loadsource(s::FileAudioSource, subrange) = wavread(s.path; subrange)
loadsource(s::URLAudioSource, subrange) = wavread(IOBuffer(HTTP.get(s.url).body); subrange)
loadsource(s::CmdAudioSource, subrange) = wavread(IOBuffer(read(pipeline(s.cmd))); subrange)
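# A minimal sketch (the path is hypothetical): every source type answers the
# same loadsource call defined just above, mirroring WAV.wavread's return values.
#
#     src = FileAudioSource("/path/to/audio.wav")
#     x, sr, _, _ = loadsource(src, :)        # the whole file
#     y, _, _, _  = loadsource(src, 1:16000)  # only the first 16000 samples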
"""
abstract type ManifestItem end
......@@ -71,7 +39,7 @@ end
function Recording(uttid, s::AbstractAudioSource; channels = missing, samplerate = missing)
if ismissing(channels) || ismissing(samplerate)
x, sr = loadsource(s, :)
x, sr = loadaudio(s)
samplerate = ismissing(samplerate) ? Int(sr) : samplerate
channels = ismissing(channels) ? collect(1:size(x,2)) : channels
end
......@@ -79,47 +47,49 @@ function Recording(uttid, s::AbstractAudioSource; channels = missing, samplerate
end
"""
struct Supervision <: ManifestItem
struct Annotation <: ManifestItem
id::AbstractString
recording_id::AbstractString
start::Float64
duration::Float64
channel::Int
channels::Union{Vector, Colon}
data::Dict
end
A "supervision" defines a segment of a recording on a single channel.
An "annotation" defines a segment of a recording on a single channel.
The `data` field is an arbitrary dictionary holding the nature of the
supervision.
annotation. `start` and `duration` (in seconds) define
where the segment is located within the recording `recording_id`.
# Constructor
Supervision(id, recording_id, start, duration, channel, data)
Supervision(id, recording_id[; channel = missing, start = -1, duration = -1, data = missing])
Annotation(id, recording_id, start, duration, channels, data)
Annotation(id, recording_id[; channels = missing, start = -1, duration = -1, data = missing])
If `start` and/or `duration` are negative, the segment is considered to
be the whole sequence length of the recording.
"""
struct Supervision <: ManifestItem
struct Annotation <: ManifestItem
id::AbstractString
recording_id::AbstractString
start::Float64
duration::Float64
channel::Int
channels::Union{Vector, Colon}
data::Dict
end
Supervision(id, recid; channel = missing, start = -1, duration = -1, data = missing) =
Supervision(id, recid, start, duration, channel, data)
Annotation(id, recid; channels = missing, start = -1, duration = -1, data = missing) =
Annotation(id, recid, start, duration, channels, data)
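# A short illustration (ids and text are made up): annotate the whole first
# channel of recording "rec1".
#
#     ann = Annotation("rec1§0", "rec1"; channels = [1],
#                      data = Dict("text" => "hello world"))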
"""
load(recording[; start = -1, duration = -1, channels = recording.channels])
load(recording, supervision)
load(recording, annotation)
Load the signal from a recording. `start`, `duration` (in seconds) can
be used to load only a segment. If a `supervision` is given, function
be used to load only a segment. If an `annotation` is given, the function
will return only the portion of the signal corresponding to the
supervision segment.
annotation segment.
The function returns a tuple `(x, sr)` where `x` is an ``N×C`` array
- ``N`` is the length of the signal and ``C`` is the number of channels
......@@ -134,10 +104,9 @@ function load(r::Recording; start = -1, duration = -1, channels = r.channels)
subrange = (:)
end
x, sr, _, _ = loadsource(r.source, subrange)
x, sr = loadaudio(r.source, subrange)
x[:,channels], sr
end
load(r::Recording, s::Supervision) =
load(r; start = s.start, duration = s.duration, channels = [s.channel])
load(r::Recording, a::Annotation) = load(r; start = a.start, duration = a.duration, channels = a.channels)
# SPDX-License-Identifier: CECILL-2.1
"""
abstract type SpeechCorpus
abstract type SpeechCorpus end
Abstract type for all speech corpora.
"""
abstract type SpeechCorpus end
"""
lang(corpus)
Return the ISO 639-3 code of the language of the corpus.
"""
path(corpus)
lang
Path to the directory where the corpus' data is stored.
"""
name(corpus)
Return the name identifier of the corpus.
"""
path(corpus::SpeechCorpus, dir) = joinpath(dir, corpus.lang, corpus.name)
name
"""
download(corpus[, dir = homedir()])
download(corpus, rootdir)
Download the data of the corpus to `dir`.
"""
Base.download(corpus::SpeechCorpus) = download(corpus, SPEECH_CORPORA_ROOTDIR)
Base.download
"""
prepare(corpus[, dir = homedir()])
prepare(corpus, rootdir)
Prepare the manifests of the corpus in `dir`.
Prepare the manifests of the corpus.
"""
prepare(corpus::SpeechCorpus) = prepare(corpus, SPEECH_CORPORA_ROOTDIR)
prepare