*outputdir/
Manifest.toml
notebook-test.jl
# Tags
## [0.15.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.15.0) - 19/06/2024
### Added
- Support for the Speech2Tex dataset
## [0.14.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.14.0) - 11/06/2024
### Added
- Support for the AVID dataset
## [0.13.0](https://gitlab.lisn.upsaclay.fr/fast/speechdatasets.jl/-/tags/v0.13.0) - 10/06/2024
### Added
- Support for the INA Diachrony dataset
### Fixed
- MiniLibriSpeech data preparation
## [0.12.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.12.0) - 21/05/2024
### Changed
- `SpeechDataset` is now a collection of `(Recording, Annotation)` tuples
## [0.11.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.11.0) - 21/05/2024
### Added
- Filtering a speech dataset based on recording id
### Improved
- Faster TIMIT preparation
## [0.10.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.10.0) - 22/02/2024
### Added
- Extraction of alignments from TIMIT
### Changed
- `Supervision` is now `Annotation`
## [0.9.4](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.4) - 22/02/2024
### Fixed
- TIMIT data preparation
## [0.9.3](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.3) - 12/02/2024
### Fixed
- `CMUDICT("dir/path")` failed if `dir` did not already exist
## [0.9.2](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.2) - 09/02/2024
### Fixed
- Invalid type for field `channels` of `Recording`
- Broken `MINILIBRISPEECH`
## [0.9.1](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.1) - 09/02/2024
### Fixed
- Impossible to use `:` as a channel specifier
## [0.9.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.9.0) - 09/02/2024
### Changed
- `TIMIT` and `MINILIBRISPEECH` directly create the `dataset`
### Added
- CMU and TIMIT lexicons
## [0.8.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.8.0) - 02/02/2024
### Added
- New `dataset` function, which builds a `SpeechDataset` from manifest files
- Compatibility with `MLUtils.DataLoader`
## [0.7.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.7.0) - 14/12/2023
### Changed
- Refactored API; the TIMIT dataset works (but LibriSpeech no longer does)
## [0.6.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.6.0) - 28/09/2023
### Added
- Raw audio data source
## [0.5.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.5.0) - 25/09/2023
### Added
- Data can be loaded directly from an audio source with the `load` function
## [0.4.1](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.4.1) - 25/09/2023
### Added
- HTML display of `AudioSource` rather than `Recording`
### Fixed
- Creating a `Recording` from an audio source without specifying the channels and the sampling rate
## [0.4.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.4.0) - 08/03/2023
### Removed
- `play` function and the dependency on PortAudio
- Dependency on Fast
### Added
- HTML display of recordings (used in Pluto notebooks, for instance)
- `setrootdir` function to specify the location of the corpora
## [0.3.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.3.0) - 03/03/2023
### Added
- Users no longer need to specify the output directory; Fast.jl provides the default directory
- MiniLibriSpeech
## [0.2.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.2.0) - 03/03/2023
### Added
- MiniLibriSpeech
## [0.1.1](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.1.1) - 17/02/2023
### Fixed
- Do not regenerate the manifests if they have already been created
## [0.1.0](https://gitlab.lisn.upsaclay.fr/fast/speechcorpora.jl/-/tags/v0.1.0) - 17/02/2023
### Added
- Download and preparation of the multilingual LibriSpeech corpus
name = "SpeechDatasets"
uuid = "ae813453-fab8-46d9-ab8f-a64c05464021"
authors = ["Lucas ONDEL YANG <lucas.ondel@cnrs.fr>",
"Simon DEVAUCHELLE <simon.devauchelle@universite-paris-saclay.fr>",
"Nicolas DENIER <nicolas.denier@lisn.fr>"]
version = "0.15.0"
[deps]
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
SpeechFeatures = "6f3487c4-5ca2-4050-bfeb-2cf56df92307"
[compat]
julia = "1.10"
JSON = "0.21"
SpeechFeatures = "0.8"
# SpeechDatasets.jl
A Julia package to download and prepare speech corpora.
## Installation
Make sure to add the [FAST registry](https://gitlab.lisn.upsaclay.fr/fast/registry)
to your Julia installation. Then install the package as usual:
```
pkg> add SpeechDatasets
```
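If the FAST registry has not been added yet, this can be done from the same `pkg>` prompt (the registry URL is the one linked above):

```
pkg> registry add https://gitlab.lisn.upsaclay.fr/fast/registry
```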
## Example
```
julia> using SpeechDatasets
julia> dataset = MINILIBRISPEECH("outputdir", :train) # :dev | :test
...
julia> dataset = TIMIT("/path/to/timit/dir", "outputdir", :train) # :dev | :test
...
julia> dataset = INADIACHRONY("/path/to/ina_wav/dir", "outputdir", "/path/to/ina_csv/dir") # ina_csv dir optional
...
julia> dataset = AVID("/path/to/avid/dir", "outputdir")
...
julia> dataset = SPEECH2TEX("/path/to/speech2tex/dir", "outputdir")
...
julia> for ((signal, fs), supervision) in dataset
# do something
end
# Lexicons
julia> CMUDICT("outputfile")
...
julia> TIMITDICT("/path/to/timit/dir")
...
```
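Under the hood, a `SpeechDataset` is just a vector of utterance ids plus two dictionaries (one for recordings, one for annotations), and integer indexing resolves an id and returns the matching pair. A minimal, package-free sketch of that lookup pattern (toy data, hypothetical ids):

```julia
# Toy stand-ins for the recording and annotation manifests (hypothetical ids).
recordings = Dict("utt1" => "audio for utt1", "utt2" => "audio for utt2")
annotations = Dict("utt1" => "text one", "utt2" => "text two")
idxs = sort(collect(keys(annotations)))

# Integer indexing resolves the id first, then returns the
# (recording, annotation) pair; this id-based indirection is what makes
# the dataset usable with MLUtils-style data loaders.
getitem(i::Integer) = (recordings[idxs[i]], annotations[idxs[i]])

getitem(1)  # ("audio for utt1", "text one")
```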
## License
This software is provided under the CeCILL 2.1 license (see [`LICENSE`](/LICENSE)).
# SPDX-License-Identifier: CECILL-2.1
module SpeechDatasets
using JSON
using SpeechFeatures
import MLUtils
export
# ManifestItem
Recording,
Annotation,
load,
# Manifest interface
writemanifest,
readmanifest,
# Corpora interface
download,
lang,
name,
prepare,
# Corpora
MultilingualLibriSpeech,
MINILIBRISPEECH,
TIMIT,
INADIACHRONY,
AVID,
SPEECH2TEX,
# Lexicon
CMUDICT,
TIMITDICT,
MFAFRDICT,
# Dataset
dataset
include("speechcorpus.jl")
include("manifest_item.jl")
include("manifest_io.jl")
include("dataset.jl")
# Supported corpora
for file in filter(contains(r"\.jl$"), readdir(joinpath(@__DIR__, "corpora"); join = true))
    include(file)
end
include("lexicons.jl")
end
# SPDX-License-Identifier: CECILL-2.1
function avid_recordings(dir::AbstractString)
checkdir(dir)
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
id = filename
path = joinpath(root, file)
audio_src = FileAudioSource(path)
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 16000
)
end
end
recordings
end
function load_metadata_files(dir::AbstractString)
tasksdict = Dict('s' => "SENT", 'p' => "PARA")
metadatadict = Dict(key =>
readlines(joinpath(dir, "Metadata_with_labels_$(tasksdict[key]).csv"))
for key in keys(tasksdict))
return metadatadict
end
function get_metadata(filename, metadatadict)
task = split(filename, "_")[3][1]
headers = metadatadict[task][1]
headers = split(headers, ",")
file_metadata = filter(x -> contains(x, filename), metadatadict[task])[1]
file_metadata = split(file_metadata, ",")
metadata = Dict(
headers[i] => file_metadata[i]
for i = 1:length(headers)
)
return metadata
end
function avid_annotations(dir)
checkdir(dir)
annotations = Dict()
metadatadict = load_metadata_files(dir)
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
# extract metadata from csv files
metadata = get_metadata(filename, metadatadict)
id = filename
# generate annotation
annotations[id] = Annotation(
id, # audio id
id, # annotation id
-1, # a start and a duration of -1 mean that
-1, # the whole recording is used
[1], # only 1 channel (mono recording)
metadata # additional information
)
end
end
annotations
end
function download_avid(dir)
@info "Directory $dir not found.\nDownloading AVID dataset (9.9 GB)"
url = "https://zenodo.org/records/10524873/files/AVID.zip?download=1"
filename = "AVID.zip"
filepath = joinpath(dir,filename)
run(`mkdir -p $dir`)
run(`wget $url -O $filepath`)
@info "Download complete, extracting files"
run(`unzip $filepath -d $dir`)
run(`rm $filepath`)
return joinpath(dir, "AVID")
end
function avid_prepare(datadir, outputdir)
# Validate the data directory
isdir(datadir) || (datadir = download_avid(datadir))
# Create the output directory.
outputdir = mkpath(outputdir)
rm(joinpath(outputdir, "recordings.jsonl"), force=true)
# Recordings
recordings = Array{Dict}(undef, 2)
recordings_path = joinpath(datadir, "Repository 2")
@info "Extracting recordings from $recordings_path"
recordings[1] = avid_recordings(recordings_path)
# Calibration tones
calibtones_path = joinpath(datadir, "Calibration_tones")
@info "Extracting recordings from $calibtones_path"
recordings[2] = avid_recordings(calibtones_path)
for (i, manifestpath) in enumerate([joinpath(outputdir, "recordings.jsonl"), joinpath(outputdir, "calibration_tones.jsonl")])
open(manifestpath, "w") do f
writemanifest(f, recordings[i])
end
end
# Annotations
annotations_path = recordings_path
@info "Extracting annotations from $annotations_path"
annotations = avid_annotations(annotations_path)
manifestpath = joinpath(outputdir, "annotations.jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, annotations)
end
end
function AVID(datadir, outputdir)
if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
isfile(joinpath(outputdir, "calibration_tones.jsonl")) &&
isfile(joinpath(outputdir, "annotations.jsonl")))
avid_prepare(datadir, outputdir)
end
dataset(outputdir, "")
end
# SPDX-License-Identifier: CECILL-2.1
function ina_diachrony_recordings(dir::AbstractString)
checkdir(dir)
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
id = "ina_diachrony§$filename"
path = joinpath(root, file)
audio_src = FileAudioSource(path)
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 16000
)
end
end
recordings
end
function ina_diachrony_get_metadata(filename)
metadata = split(filename, "§")
age, sex = split(metadata[2], "_")
Dict(
"speaker" => metadata[3],
"timeperiod" => metadata[1],
"age" => age,
"sex" => sex,
)
end
function ina_diachrony_annotations_whole(dir)
checkdir(dir)
annotations = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
# extract metadata from filename
metadata = ina_diachrony_get_metadata(filename)
# extract transcription text (same filename but .txt)
textfilepath = joinpath(root, "$filename.txt")
metadata["text"] = isfile(textfilepath) ? join(readlines(textfilepath), "\n") : ""
id = "ina_diachrony§$filename"
annotation_id = id*"§0"
# generate annotation
annotations[annotation_id] = Annotation(
id, # audio id
annotation_id, # annotation id
-1, # a start and a duration of -1 mean that
-1, # the whole recording is used
[1], # only 1 channel (mono recording)
metadata # additional information
)
end
end
annotations
end
function ina_diachrony_annotations_csv(dir)
checkdir(dir)
annotations = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".csv" && continue
# extract metadata from filename
metadata = ina_diachrony_get_metadata(filename)
id = "ina_diachrony§$filename"
# generate annotation for each line in csv
open(joinpath(root, file)) do f
header = readline(f)
line = 1
# read till end of file
while ! eof(f)
current_line = readline(f)
start_time, end_time, text = split(current_line, ",", limit=3)
start_time = parse(Float64, start_time)
duration = parse(Float64, end_time) - start_time
annotation_id = id * "§$line"
annotations[annotation_id] = Annotation(
id, # audio id
annotation_id, # annotation id
start_time, # start
duration, # duration
[1], # only 1 channel (mono recording)
# copy the metadata so annotations do not share (and overwrite) one dict
merge(metadata, Dict("text" => String(text))) # additional information
)
line += 1
end
end
end
end
annotations
end
function ina_diachrony_prepare(ina_wav_dir, ina_csv_dir, outputdir)
# Validate the data directory
for d in [ina_wav_dir, ina_csv_dir]
isnothing(d) || checkdir(d)
end
# Create the output directory.
outputdir = mkpath(outputdir)
rm(joinpath(outputdir, "recordings.jsonl"), force=true)
# Recordings
@info "Extracting recordings from $ina_wav_dir"
recordings = ina_diachrony_recordings(ina_wav_dir)
manifestpath = joinpath(outputdir, "recordings.jsonl")
open(manifestpath, "w") do f
writemanifest(f, recordings)
end
# Annotations
@info "Extracting annotations from $ina_wav_dir"
annotations = ina_diachrony_annotations_whole(ina_wav_dir)
if ! isnothing(ina_csv_dir)
@info "Extracting annotations from $ina_csv_dir"
csv_annotations = ina_diachrony_annotations_csv(ina_csv_dir)
annotations = merge(annotations, csv_annotations)
end
manifestpath = joinpath(outputdir, "annotations.jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, annotations)
end
end
function INADIACHRONY(ina_wav_dir, outputdir, ina_csv_dir=nothing)
if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
isfile(joinpath(outputdir, "annotations.jsonl")))
ina_diachrony_prepare(ina_wav_dir, ina_csv_dir, outputdir)
end
dataset(outputdir, "")
end
# SPDX-License-Identifier: CECILL-2.1
#######################################################################
const MINILS_URL = Dict(
"dev" => "https://www.openslr.org/resources/31/dev-clean-2.tar.gz",
"train" => "https://www.openslr.org/resources/31/train-clean-5.tar.gz"
)
const MINILS_SUBSETS = Dict(
"train" => "train-clean-5",
"dev" => "dev-clean-2"
)
#######################################################################
struct MINILIBRISPEECH <: SpeechCorpus
recordings
train
dev
test
end
function minils_recordings(dir, subset)
subsetdir = joinpath(dir, "LibriSpeech", MINILS_SUBSETS[subset])
recs = Dict()
for d1 in readdir(subsetdir; join = true)
for d2 in readdir(d1; join = true)
for path in readdir(d2; join = true)
endswith(path, ".flac") || continue
id = replace(basename(path), ".flac" => "")
r = Recording(
id,
CmdAudioSource(`sox $path -t wav -`);
channels = [1],
samplerate = 16000
)
recs[r.id] = r
end
end
end
recs
end
function minils_annotations(dir, subset)
subsetdir = joinpath(dir, "LibriSpeech", MINILS_SUBSETS[subset])
sups = Dict()
for d1 in readdir(subsetdir; join = true)
for d2 in readdir(d1; join = true)
k1 = d1 |> basename
k2 = d2 |> basename
open(joinpath(d2, "$(k1)-$(k2).trans.txt"), "r") do f
for line in eachline(f)
tokens = split(line)
s = Annotation(
tokens[1], # annotation id
tokens[1]; # recording id
channels = [1],
data = Dict("text" => join(tokens[2:end], " "))
)
sups[s.id] = s
end
end
end
end
sups
end
function minils_download(dir)
donefile = joinpath(dir, ".download.done")
if ! isfile(donefile)
run(`mkdir -p $dir`)
@debug "downloading the corpus"
for subset in ["train", "dev"]
run(`wget --no-check-certificate -P $dir $(MINILS_URL[subset])`)
tarpath = joinpath(dir, "$(MINILS_SUBSETS[subset]).tar.gz")
@debug "extracting"
run(`tar -xf $tarpath -C $dir`)
run(`rm $tarpath`)
end
run(pipeline(`date`, stdout = donefile))
end
@debug "dataset in $dir"
end
function minils_prepare(dir)
# 1. Recording manifest.
out = joinpath(dir, "recordings.jsonl")
if ! isfile(out)
open(out, "w") do f
for subset in ["train", "dev"]
@debug "preparing recording manifest ($subset) $out"
recs = minils_recordings(dir, subset)
writemanifest(f, recs)
end
end
end
# 2. Annotation manifests.
for (subset, name) in [("train", "train"), ("dev", "dev"), ("dev", "test")]
out = joinpath(dir, "annotations-$name.jsonl")
if ! isfile(out)
@debug "preparing annotation manifest ($subset) $out"
sups = minils_annotations(dir, subset)
open(out, "w") do f
writemanifest(f, sups)
end
end
end
end
function MINILIBRISPEECH(dir, subset)
minils_download(dir)
minils_prepare(dir)
dataset(dir, subset)
end
# SPDX-License-Identifier: CECILL-2.1
struct MultilingualLibriSpeech <: SpeechCorpus
lang
name
function MultilingualLibriSpeech(lang)
new(lang, "multilingual_librispeech")
end
end
const MLS_LANG_CODE = Dict(
"deu" => "german",
"eng" => "english",
"esp" => "spanish",
"fra" => "french",
"ita" => "italian",
"nld" => "dutch",
"pol" => "polish",
"prt" => "portuguese"
)
const MLS_AUDIO_URLS = Dict(
"deu" => "https://dl.fbaipublicfiles.com/mls/mls_german.tar.gz",
"eng" => "https://dl.fbaipublicfiles.com/mls/mls_english.tar.gz",
"esp" => "https://dl.fbaipublicfiles.com/mls/mls_spanish.tar.gz",
"fra" => "https://dl.fbaipublicfiles.com/mls/mls_french.tar.gz",
"ita" => "https://dl.fbaipublicfiles.com/mls/mls_italian.tar.gz",
"nld" => "https://dl.fbaipublicfiles.com/mls/mls_dutch.tar.gz",
"pol" => "https://dl.fbaipublicfiles.com/mls/mls_polish.tar.gz",
"prt" => "https://dl.fbaipublicfiles.com/mls/mls_portuguese.tar.gz"
)
const MLS_LM_URLS = Dict(
"deu" => "https://dl.fbaipublicfiles.com/mls/mls_lm_german.tar.gz",
"eng" => "https://dl.fbaipublicfiles.com/mls/mls_lm_english.tar.gz",
"esp" => "https://dl.fbaipublicfiles.com/mls/mls_lm_spanish.tar.gz",
"fra" => "https://dl.fbaipublicfiles.com/mls/mls_lm_french.tar.gz",
"ita" => "https://dl.fbaipublicfiles.com/mls/mls_lm_italian.tar.gz",
"nld" => "https://dl.fbaipublicfiles.com/mls/mls_lm_dutch.tar.gz",
"pol" => "https://dl.fbaipublicfiles.com/mls/mls_lm_polish.tar.gz",
"prt" => "https://dl.fbaipublicfiles.com/mls/mls_lm_portuguese.tar.gz"
)
function Base.download(corpus::MultilingualLibriSpeech, outdir)
dir = path(corpus, outdir)
donefile = joinpath(dir, ".download.done")
if ! isfile(donefile)
run(`mkdir -p $dir`)
@info "downloading the corpus"
run(`wget -P $dir $(MLS_AUDIO_URLS[corpus.lang])`)
tarpath = joinpath(dir, "mls_$(MLS_LANG_CODE[corpus.lang]).tar.gz")
@info "extracting"
run(`tar -xf $tarpath -C $dir`)
run(`rm $tarpath`)
@info "downloading LM data"
run(`wget -P $dir $(MLS_LM_URLS[corpus.lang])`)
tarpath = joinpath(dir, "mls_lm_$(MLS_LANG_CODE[corpus.lang]).tar.gz")
@info "extracting"
run(`tar -xf $tarpath -C $dir`)
run(`rm $tarpath`)
run(pipeline(`date`, stdout = donefile))
end
@info "dataset in $dir"
corpus
end
function recordings(corpus::MultilingualLibriSpeech, dir, subset)
subsetdir = joinpath(dir, "mls_$(MLS_LANG_CODE[corpus.lang])", subset, "audio")
recs = Dict()
for d1 in readdir(subsetdir; join = true)
for d2 in readdir(d1; join = true)
for path in readdir(d2; join = true)
id = replace(basename(path), ".flac" => "")
r = Recording(
id,
CmdAudioSource(`sox $path -t wav -`);
channels = [1],
samplerate = 16000
)
recs[r.id] = r
end
end
end
recs
end
function annotations(corpus::MultilingualLibriSpeech, dir, subset)
trans = joinpath(dir, "mls_$(MLS_LANG_CODE[corpus.lang])", subset, "transcripts.txt")
sups = Dict()
open(trans, "r") do f
for line in eachline(f)
tokens = split(line)
s = Annotation(tokens[1], tokens[1]; channel = 1,
data = Dict("text" => join(tokens[2:end], " ")))
sups[s.id] = s
end
end
sups
end
function prepare(corpus::MultilingualLibriSpeech, outdir)
dir = path(corpus, outdir)
# 1. Recording manifests.
for subset in ["train", "dev", "test"]
out = joinpath(dir, "recording-manifest-$subset.jsonl")
@info "preparing recording manifest ($subset) $out"
if ! isfile(out)
recs = recordings(corpus, dir, subset)
open(out, "w") do f
writemanifest(f, recs)
end
end
end
# 2. Annotation manifests.
for subset in ["train", "dev", "test"]
out = joinpath(dir, "annotation-manifest-$subset.jsonl")
@info "preparing annotation manifest ($subset) $out"
if ! isfile(out)
sups = annotations(corpus, dir, subset)
open(out, "w") do f
writemanifest(f, sups)
end
end
end
corpus
end
# SPDX-License-Identifier: CECILL-2.1
function speech2tex_recordings(dir::AbstractString)
checkdir(dir)
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
id = filename
path = joinpath(root, file)
audio_src = FileAudioSource(path)
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 48000
)
end
end
recordings
end
extract_digits(str::AbstractString) = filter(isdigit, str)
isnumber(str::AbstractString) = extract_digits(str) == str
function speech2tex_get_metadata(filename)
# possible cases: line123_p1 line123_124_p1 line123_p1_part2 (not observed but also supported: line123_124_p1_part2)
split_name = split(filename, "_")
metadata = Dict()
if isnumber(split_name[2])
metadata["line"] = extract_digits(split_name[1])*"_"*split_name[2]
metadata["speaker"] = split_name[3]
else
metadata["line"] = extract_digits(split_name[1])
metadata["speaker"] = split_name[2]
end
if occursin("part", split_name[end])
metadata["part"] = extract_digits(split_name[end])
end
metadata
end
function speech2tex_annotations(audiodir, transcriptiondir, texdir)
checkdir.([audiodir, transcriptiondir, texdir])
annotations = Dict()
for (root, subdirs, files) in walkdir(audiodir)
for file in files
filename, ext = splitext(file)
ext != ".wav" && continue
# extract metadata from csv files
metadata = speech2tex_get_metadata(filename)
# extract transcription and tex (same filenames but .txt)
dirdict = Dict(transcriptiondir => "transcription", texdir => "latex")
for (d, label) in dirdict
textfilepath = joinpath(d, "$filename.txt")
metadata[label] = isfile(textfilepath) ? join(readlines(textfilepath), "\n") : ""
end
id = filename
# generate annotation
annotations[id] = Annotation(
id, # audio id
id, # annotation id
-1, # a start and a duration of -1 mean that
-1, # the whole recording is used
[1], # only 1 channel (mono recording)
metadata # additional information
)
end
end
annotations
end
function speech2tex_prepare(datadir, outputdir)
# Validate the data directory
checkdir(datadir)
# Create the output directory.
outputdir = mkpath(outputdir)
rm(joinpath(outputdir, "recordings.jsonl"), force=true)
# Recordings
recordings_path = joinpath(datadir, "audio")
@info "Extracting recordings from $recordings_path"
recordings = speech2tex_recordings(recordings_path)
manifestpath = joinpath(outputdir, "recordings.jsonl")
open(manifestpath, "w") do f
writemanifest(f, recordings)
end
# Annotations
transcriptiondir = joinpath(datadir, "sequences")
texdir = joinpath(datadir, "latex")
@info "Extracting annotations from $transcriptiondir and $texdir"
annotations = speech2tex_annotations(recordings_path, transcriptiondir, texdir)
manifestpath = joinpath(outputdir, "annotations.jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, annotations)
end
end
function SPEECH2TEX(datadir, outputdir)
if ! (isfile(joinpath(outputdir, "recordings.jsonl")) &&
isfile(joinpath(outputdir, "annotations.jsonl")))
speech2tex_prepare(datadir, outputdir)
end
dataset(outputdir, "")
end
# SPDX-License-Identifier: CECILL-2.1
#######################################################################
const TIMIT_SUBSETS = Dict(
"train" => "train",
"dev" => "dev",
"test" => "test"
)
const TIMIT_DEV_SPK_LIST = Set([
"faks0",
"fdac1",
"fjem0",
"mgwt0",
"mjar0",
"mmdb1",
"mmdm2",
"mpdf0",
"fcmh0",
"fkms0",
"mbdg0",
"mbwm0",
"mcsh0",
"fadg0",
"fdms0",
"fedw0",
"mgjf0",
"mglb0",
"mrtk0",
"mtaa0",
"mtdt0",
"mthc0",
"mwjg0",
"fnmr0",
"frew0",
"fsem0",
"mbns0",
"mmjr0",
"mdls0",
"mdlf0",
"mdvc0",
"mers0",
"fmah0",
"fdrw0",
"mrcs0",
"mrjm4",
"fcal1",
"mmwh0",
"fjsj0",
"majc0",
"mjsw0",
"mreb0",
"fgjd0",
"fjmg0",
"mroa0",
"mteb0",
"mjfc0",
"mrjr0",
"fmml0",
"mrws1"
])
const TIMIT_TEST_SPK_LIST = Set([
"mdab0",
"mwbt0",
"felc0",
"mtas1",
"mwew0",
"fpas0",
"mjmp0",
"mlnt0",
"fpkt0",
"mlll0",
"mtls0",
"fjlm0",
"mbpm0",
"mklt0",
"fnlp0",
"mcmj0",
"mjdh0",
"fmgd0",
"mgrt0",
"mnjm0",
"fdhc0",
"mjln0",
"mpam0",
"fmld0"
])
const TIMIT_PHONE_MAP48 = Dict(
"aa" => "aa",
"ae" => "ae",
"ah" => "ah",
"ao" => "ao",
"aw" => "aw",
"ax" => "ax",
"ax-h" => "ax",
"axr" => "er",
"ay" => "ay",
"b" => "b",
"bcl" => "vcl",
"ch" => "ch",
"d" => "d",
"dcl" => "vcl",
"dh" => "dh",
"dx" => "dx",
"eh" => "eh",
"el" => "el",
"em" => "m",
"en" => "en",
"eng" => "ng",
"epi" => "epi",
"er" => "er",
"ey" => "ey",
"f" => "f",
"g" => "g",
"gcl" => "vcl",
"h#" => "sil",
"hh" => "hh",
"hv" => "hh",
"ih" => "ih",
"ix" => "ix",
"iy" => "iy",
"jh" => "jh",
"k" => "k",
"kcl" => "cl",
"l" => "l",
"m" => "m",
"n" => "n",
"ng" => "ng",
"nx" => "n",
"ow" => "ow",
"oy" => "oy",
"p" => "p",
"pau" => "sil",
"pcl" => "cl",
"q" => "",
"r" => "r",
"s" => "s",
"sh" => "sh",
"t" => "t",
"tcl" => "cl",
"th" => "th",
"uh" => "uh",
"uw" => "uw",
"ux" => "uw",
"v" => "v",
"w" => "w",
"y" => "y",
"z" => "z",
"zh" => "zh"
)
const TIMIT_PHONE_MAP39 = Dict(
"aa" => "aa",
"ae" => "ae",
"ah" => "ah",
"ao" => "aa",
"aw" => "aw",
"ax" => "ah",
"ax-h" => "ah",
"axr" => "er",
"ay" => "ay",
"b" => "b",
"bcl" => "sil",
"ch" => "ch",
"d" => "d",
"dcl" => "sil",
"dh" => "dh",
"dx" => "dx",
"eh" => "eh",
"el" => "l",
"em" => "m",
"en" => "n",
"eng" => "ng",
"epi" => "sil",
"er" => "er",
"ey" => "ey",
"f" => "f",
"g" => "g",
"gcl" => "sil",
"h#" => "sil",
"hh" => "hh",
"hv" => "hh",
"ih" => "ih",
"ix" => "ih",
"iy" => "iy",
"jh" => "jh",
"k" => "k",
"kcl" => "sil",
"l" => "l",
"m" => "m",
"n" => "n",
"ng" => "ng",
"nx" => "n",
"ow" => "ow",
"oy" => "oy",
"p" => "p",
"pau" => "sil",
"pcl" => "sil",
"q" => "",
"r" => "r",
"s" => "s",
"sh" => "sh",
"t" => "t",
"tcl" => "sil",
"th" => "th",
"uh" => "uh",
"uw" => "uw",
"ux" => "uw",
"v" => "v",
"w" => "w",
"y" => "y",
"z" => "z",
"zh" => "sh"
)
#######################################################################
function timit_prepare(timitdir, dir; audio_fmt="SPHERE")
# Validate the data directory
! isdir(timitdir) && throw(ArgumentError("invalid path $(timitdir)"))
# Create the output directory.
dir = mkpath(dir)
rm(joinpath(dir, "recordings.jsonl"), force=true)
## Recordings
@info "Extracting recordings from $timitdir/train"
train_recordings = timit_recordings(joinpath(timitdir, "train"); fmt=audio_fmt)
# We extract the name of speakers that are not in the dev set
TIMIT_TRAIN_SPK_LIST = Set()
for id in keys(train_recordings)
_, spk, _ = split(id, "_")
if spk ∉ TIMIT_DEV_SPK_LIST
push!(TIMIT_TRAIN_SPK_LIST, spk)
end
end
@info "Extracting recordings from $timitdir/test"
test_recordings = timit_recordings(joinpath(timitdir, "test"); fmt=audio_fmt)
recordings = merge(train_recordings, test_recordings)
manifestpath = joinpath(dir, "recordings.jsonl")
open(manifestpath, "a") do f
writemanifest(f, recordings)
end
# Annotations
@info "Extracting annotations from $timitdir/train"
train_annotations = timit_annotations(joinpath(timitdir, "train"))
@info "Extracting annotations from $timitdir/test"
test_annotations = timit_annotations(joinpath(timitdir, "test"))
annotations = merge(train_annotations, test_annotations)
train_annotations = filter(annotations) do (k, v)
stype = v.data["sentence type"]
spk = v.data["speaker"]
(
(stype == "compact" || stype == "diverse") &&
spk ∈ TIMIT_TRAIN_SPK_LIST
)
end
dev_annotations = filter(annotations) do (k, v)
stype = v.data["sentence type"]
spk = v.data["speaker"]
(
(stype == "compact" || stype == "diverse") &&
spk ∈ TIMIT_DEV_SPK_LIST
)
end
test_annotations = filter(annotations) do (k, v)
stype = v.data["sentence type"]
spk = v.data["speaker"]
(
(stype == "compact" || stype == "diverse") &&
spk ∈ TIMIT_TEST_SPK_LIST
)
end
for (x, y) in ("train" => train_annotations,
"dev" => dev_annotations,
"test" => test_annotations)
manifestpath = joinpath(dir, "annotations-$(x).jsonl")
@info "Creating $manifestpath"
open(manifestpath, "w") do f
writemanifest(f, y)
end
end
end
function timit_recordings(dir::AbstractString; fmt="SPHERE")
! isdir(dir) && throw(ArgumentError("expected directory $dir"))
recordings = Dict()
for (root, subdirs, files) in walkdir(dir)
for file in files
name, ext = splitext(file)
ext != ".wav" && continue
spk = basename(root)
path = joinpath(root, file)
id = "timit_$(spk)_$(name)"
audio_src = if fmt == "SPHERE"
CmdAudioSource(`sph2pipe -f wav $path`)
else
FileAudioSource(path)
end
recordings[id] = Recording(
id,
audio_src;
channels = [1],
samplerate = 16000
)
end
end
recordings
end
function timit_annotations(dir)
! isdir(dir) && throw(ArgumentError("expected directory $dir"))
splitline(line) = rsplit(line, limit=3)
annotations = Dict()
processed = Set()
for (root, subdirs, files) in walkdir(dir)
for file in files
name, ext = splitext(file)
_, dialect, spk = rsplit(root, "/", limit=3)
# Annotation files already processed (".wrd" and ".phn")
idtuple = (dialect, spk, name)
(idtuple in processed) && continue
push!(processed, (dialect, spk, name))
# Words
wpath = joinpath(root, name * ".wrd")
words = [last(split(line)) for line in eachline(wpath)]
# Phones
ppath = joinpath(root, name * ".phn")
palign = Tuple{Int,Int,String}[]
for line in eachline(ppath)
t0, t1, p = split(line)
push!(palign, (parse(Int, t0), parse(Int, t1), String(p)))
end
sentence_type = if startswith(name, "sa")
"dialect"
elseif startswith(name, "sx")
"compact"
else # startswith(name, "si")
"diverse"
end
id = "timit_$(spk)_$(name)"
annotations[id] = Annotation(
id, # recording id and annotation id are the same since we have
id, # a one-to-one mapping
-1, # a start and a duration of -1 mean that
-1, # the whole recording is used
[1], # only 1 channel (mono recording)
Dict(
"text" => join(words, " "),
"sentence type" => sentence_type,
"alignment" => palign,
"dialect" => dialect,
"speaker" => spk,
"sex" => string(first(spk)),
)
)
end
end
annotations
end
function TIMIT(timitdir, dir, subset)
if ! (isfile(joinpath(dir, "recordings.jsonl")) &&
isfile(joinpath(dir, "annotations-train.jsonl")) &&
isfile(joinpath(dir, "annotations-dev.jsonl")) &&
isfile(joinpath(dir, "annotations-test.jsonl")))
timit_prepare(timitdir, dir)
end
dataset(dir, subset)
end
# SPDX-License-Identifier: CECILL-2.1
struct SpeechDataset <: MLUtils.AbstractDataContainer
idxs::Vector{AbstractString}
annotations::Dict{AbstractString, Annotation}
recordings::Dict{AbstractString, Recording}
end
"""
dataset(manifestroot, partition)
Load a `SpeechDataset` from the manifest files stored in `manifestroot`.
Each item of the dataset is a nested tuple `((samples, sampling_rate), Annotation.data)`.
See also [`Annotation`](@ref).
# Examples
```julia-repl
julia> ds = dataset("./manifests", :train)
SpeechDataset(
...
)
julia> ds[1]
(
(samples=[...], sampling_rate=16_000),
Dict(
"text" => "Annotation text here"
)
)
```
"""
function dataset(manifestroot::AbstractString, partition)
partition_name = partition == "" ? "" : "-$(partition)"
annot_path = joinpath(manifestroot, "annotations$(partition_name).jsonl")
rec_path = joinpath(manifestroot, "recordings.jsonl")
annotations = load(Annotation, annot_path)
recordings = load(Recording, rec_path)
dataset(annotations, recordings)
end
function dataset(annotations::AbstractDict, recordings::AbstractDict)
idxs = collect(keys(annotations))
SpeechDataset(idxs, annotations, recordings)
end
Base.getindex(d::SpeechDataset, key::AbstractString) = d.recordings[key], d.annotations[key]
Base.getindex(d::SpeechDataset, idx::Integer) = getindex(d, d.idxs[idx])
# Base.Fix1 -> partial function with the 1st argument fixed
Base.getindex(d::SpeechDataset, idxs::AbstractVector) = map(Base.Fix1(getindex, d), idxs)
Base.length(d::SpeechDataset) = length(d.idxs)
function Base.filter(fn, d::SpeechDataset)
fidxs = filter(d.idxs) do i
fn((d.recordings[i], d.annotations[i]))
end
idset = Set(fidxs)
fannotations = filter(d.annotations) do (k, v)
k ∈ idset
end
frecs = filter(d.recordings) do (k, v)
k ∈ idset
end
SpeechDataset(fidxs, fannotations, frecs)
end
# SPDX-License-Identifier: CECILL-2.1
const CMUDICT_URL = "http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/sphinxdict/cmudict_SPHINX_40"
const FRMFA_DICT_URL = "https://raw.githubusercontent.com/MontrealCorpusTools/mfa-models/main/dictionary/french/mfa/french_mfa.dict"
function normalizeword(word)
String(uppercase(word))
end
function normalizephoneme(phoneme)
String(uppercase(phoneme))
end
"""
CMUDICT(path)
Return the pronunciation dictionary loaded from the CMU Sphinx dictionary.
The CMU dictionary will be downloaded and stored at `path`. Subsequent
calls will only read the file at `path` without downloading the data again.
"""
function CMUDICT(path)
if ! isfile(path)
mkpath(dirname(path))
dir = mktempdir()
run(`wget -P $dir $CMUDICT_URL`)
mv(joinpath(dir, "cmudict_SPHINX_40"), path)
end
lexicon = Dict()
open(path, "r") do f
for line in eachline(f)
word, pron... = split(line)
word = replace(word, "(1)" => "", "(2)" => "", "(3)" => "", "(4)" => "")
pronunciations = get(lexicon, word, [])
push!(pronunciations, pron)
lexicon[word] = pronunciations
end
end
lexicon
end
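A quick lookup sketch (the cache path is hypothetical): each entry maps an upper-case word to a vector of pronunciation variants, each variant itself a vector of phones.

```julia
# Hypothetical sketch: the first call downloads the dictionary,
# later calls reuse the cached file at the given path.
lex = CMUDICT("outputdir/cmudict_SPHINX_40")
prons = get(lex, "HELLO", [])  # all pronunciation variants for "HELLO"
```

Note that variant markers such as `(2)` are stripped from the word, so all variants of a word accumulate under the same key.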
"""
TIMITDICT(timitdir)
Return the pronunciation dictionary provided by the TIMIT corpus (located
in `timitdir`).
"""
function TIMITDICT(timitdir)
dictfile = joinpath(timitdir, "doc", "timitdic.txt")
iscomment(line) = startswith(line, ';')
lexicon = Dict{String,Vector{Vector{String}}}()
for line in eachline(dictfile)
iscomment(line) && continue
word, pron = split(line, limit=2)
pron = strip(pron, ['/', '\t', ' '])
word = '~' in word ? split(word, "~", limit=2)[1] : word
word = normalizeword(word)
pron = normalizephoneme.(split(pron))
pronunciations = get(lexicon, word, Vector{String}[])
push!(pronunciations, pron)
lexicon[word] = pronunciations
end
lexicon
end
"""
MFAFRDICT(path)
Return the French pronunciation dictionary provided by MFA (french_mfa v2.0.0a).
"""
function MFAFRDICT(path)
if ! isfile(path)
mkpath(dirname(path))
dir = mktempdir()
run(`wget -P $dir $FRMFA_DICT_URL`)
mv(joinpath(dir, "french_mfa.dict"), path)
end
lexicon = Dict()
open(path, "r") do f
for line in eachline(f)
word, pron... = split(line)
pronunciations = get(lexicon, word, [])
push!(pronunciations, pron)
lexicon[word] = pronunciations
end
end
lexicon
end
# SPDX-License-Identifier: CECILL-2.1
#=====================================================================#
# JSON serialization of a manifest item
function Base.show(io::IO, m::MIME"application/json", s::FileAudioSource)
compact = get(io, :compact, false)
indent = get(io, :indent, 0)
printfn = compact ? print : println
printfn(io, "{")
printfn(io, repeat(" ", indent + (compact ? 0 : 2)), "\"type\": \"path\", ")
printfn(io, repeat(" ", indent + (compact ? 0 : 2)), "\"data\": \"", s.path, "\"")
print(io, repeat(" ", indent), "}")
end
function Base.show(io::IO, m::MIME"application/json", s::URLAudioSource)
compact = get(io, :compact, false)
indent = get(io, :indent, 0)
printfn = compact ? print : println
printfn(io, repeat(" ", indent), "{")
printfn(io, repeat(" ", indent + (compact ? 0 : 2)), "\"type\": \"url\", ")
printfn(io, repeat(" ", indent + (compact ? 0 : 2)), "\"data\": \"", s.url, "\"")
print(io, repeat(" ", indent), "}")
end
function Base.show(io::IO, m::MIME"application/json", s::CmdAudioSource)
compact = get(io, :compact, false)
indent = get(io, :indent, 0)
printfn = compact ? print : println
printfn(io, "{")
printfn(io, repeat(" ", indent + (compact ? 0 : 2)), "\"type\": \"cmd\", ")
strcmd = replace("$(s.cmd)", "`" => "")
printfn(io, repeat(" ", indent + (compact ? 0 : 2)), "\"data\": \"$(strcmd)\"")
print(io, repeat(" ", indent), "}")
end
function Base.show(io::IO, m::MIME"application/json", r::Recording)
compact = get(io, :compact, false)
indent = compact ? 0 : 2
printfn = compact ? print : println
printfn(io, "{")
printfn(io, repeat(" ", indent), "\"id\": \"", r.id, "\", ")
print(io, repeat(" ", indent), "\"src\": ")
show(IOContext(io, :indent => compact ? 0 : 2), m, r.source)
printfn(io, ", ")
print(io, repeat(" ", indent), "\"channels\": [")
for (i, c) in enumerate(r.channels)
print(io, c)
i < length(r.channels) && print(io, ",")
end
printfn(io, "], ")
printfn(io, repeat(" ", indent), "\"samplerate\": ", r.samplerate)
print(io, "}")
end
function Base.show(io::IO, m::MIME"application/json", a::Annotation)
compact = get(io, :compact, false)
indent = compact ? 0 : 2
printfn = compact ? print : println
printfn(io, "{")
printfn(io, repeat(" ", indent), "\"id\": \"", a.id, "\", ")
printfn(io, repeat(" ", indent), "\"recording_id\": \"", a.recording_id, "\", ")
printfn(io, repeat(" ", indent), "\"start\": ", a.start, ", ")
printfn(io, repeat(" ", indent), "\"duration\": ", a.duration, ", ")
printfn(io, repeat(" ", indent), "\"channels\": ", a.channels |> json, ", ")
printfn(io, repeat(" ", indent), "\"data\": ", a.data |> json)
print(io, "}")
end
function JSON.json(r::Union{Recording, Annotation}; compact = true)
out = IOBuffer()
show(IOContext(out, :compact => compact), MIME("application/json"), r)
String(take!(out))
end
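A round-trip sketch (the audio path is hypothetical): serialize a `Recording` to a one-line JSON string and rebuild the manifest item from the parsed dictionary.

```julia
rec = Recording("rec1", FileAudioSource("/data/rec1.wav"), [1], 16000)
line = json(rec)                    # compact, one-line JSON by default
rec2 = Recording(JSON.parse(line))  # rebuild the item from the parsed Dict
rec2.id == rec.id                   # fields survive the round trip
```

The `"type"` field of the serialized source selects which `AbstractAudioSource` subtype is reconstructed.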
#=====================================================================#
# Converting a dictionary to a manifest item.
function AudioSource(d::Dict)
if d["type"] == "path"
T = FileAudioSource
elseif d["type"] == "url"
T = URLAudioSource
elseif d["type"] == "cmd"
T = CmdAudioSource
else
throw(ArgumentError("invalid type: $(d["type"])"))
end
T(d["data"])
end
Recording(d::Dict) = Recording(
d["id"],
AudioSource(d["src"]),
convert(Vector{Int}, d["channels"]),
d["samplerate"]
)
Annotation(d::Dict) = Annotation(
d["id"],
d["recording_id"],
d["start"],
d["duration"],
d["channels"],
d["data"]
)
#=====================================================================#
# Writing / reading manifest from file.
function writemanifest(io::IO, manifest::Dict)
writefn = x -> println(io, x)
for item in values(manifest)
item |> json |> writefn
end
end
function readmanifest(io::IO, T)
manifest = Dict()
for line in eachline(io)
item = JSON.parse(line) |> T
manifest[item.id] = item
end
manifest
end
# Some utilities
manifestname(::Type{<:Recording}, name) = "recordings.jsonl"
manifestname(::Type{<:Annotation}, name) = "annotations-$name.jsonl"
"""
load(Annotation, path)
load(Recording, path)
Load Recording/Annotation manifest from `path`.
"""
load(T::Type{<:Union{Recording, Annotation}}, path) = open(f -> readmanifest(f, T), path, "r")
function checkdir(dir::AbstractString)
isdir(dir) || throw(ArgumentError("$dir is not an existing directory"))
end
# SPDX-License-Identifier: CECILL-2.1
"""
abstract type ManifestItem end
Base class for all manifest items. Every manifest item should have an
`id` attribute.
"""
abstract type ManifestItem end
"""
struct Recording{Ts<:AbstractAudioSource} <: ManifestItem
id::AbstractString
source::Ts
channels::Vector{Int}
samplerate::Int
end
A recording is an audio source associated with an id.
# Constructors
Recording(id, source, channels, samplerate)
Recording(id, source[; channels = missing, samplerate = missing])
If the channels or the sample rate are not provided, they will be
read from `source`.
!!! warning
When preparing a large corpus, not providing the channels and/or the
sample rate can drastically reduce the speed, as it forces reading
the source.
"""
struct Recording{Ts<:AbstractAudioSource} <: ManifestItem
id::AbstractString
source::Ts
channels::Vector{Int}
samplerate::Int
end
function Recording(uttid, s::AbstractAudioSource; channels = missing, samplerate = missing)
if ismissing(channels) || ismissing(samplerate)
x, sr = loadaudio(s)
samplerate = ismissing(samplerate) ? Int(sr) : samplerate
channels = ismissing(channels) ? collect(1:size(x,2)) : channels
end
Recording(uttid, s, channels, samplerate)
end
"""
struct Annotation <: ManifestItem
id::AbstractString
recording_id::AbstractString
start::Float64
duration::Float64
channels::Union{Vector, Colon}
data::Dict
end
An "annotation" defines a segment of a recording on one or more channels.
The `data` field is an arbitrary dictionary holding the content of the
annotation. `start` and `duration` (in seconds) define where the segment
is located within the recording `recording_id`.
# Constructor
Annotation(id, recording_id, start, duration, channels, data)
Annotation(id, recording_id[; channels = missing, start = -1, duration = -1, data = missing])
If `start` and/or `duration` are negative, the segment is considered to
"""
struct Annotation <: ManifestItem
id::AbstractString
recording_id::AbstractString
start::Float64
duration::Float64
channels::Union{Vector, Colon}
data::Dict
end
Annotation(id, recid; channels = missing, start = -1, duration = -1, data = missing) =
Annotation(id, recid, start, duration, channels, data)
"""
load(recording[; start = -1, duration = -1, channels = recording.channels])
load(recording, annotation)
Load the signal from a recording. `start` and `duration` (in seconds) can
be used to load only a segment. If an `annotation` is given, the function
returns only the portion of the signal corresponding to the annotation
segment.
The function returns a tuple `(x, sr)` where `x` is an ``N×C`` array
(``N`` is the length of the signal and ``C`` the number of channels)
and `sr` is the sampling rate of the signal.
"""
function load(r::Recording; start = -1, duration = -1, channels = r.channels)
if start >= 0 && duration >= 0
s = Int(floor(start * r.samplerate + 1))
e = Int(ceil((start + duration) * r.samplerate))  # end index is absolute within the recording
subrange = (s:e)
else
subrange = (:)
end
x, sr = loadaudio(r.source, subrange)
x[:,channels], sr
end
load(r::Recording, a::Annotation) = load(r; start = a.start, duration = a.duration, channels = a.channels)
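A loading sketch (file path hypothetical): reading a 2-second excerpt starting at 0.5 s. Since channels and sample rate are passed explicitly, the constructor does not need to open the file.

```julia
# Hypothetical sketch: load a 2 s segment starting at 0.5 s.
rec = Recording("rec1", FileAudioSource("/data/rec1.wav");
                channels = [1], samplerate = 16_000)
x, sr = load(rec; start = 0.5, duration = 2.0)
size(x)  # roughly (duration * samplerate, 1), i.e. (32_000, 1) at 16 kHz
```

The segment is converted to a sample range `s:e` computed from `start` and `start + duration`, so the row count is approximately `duration * samplerate`.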
# SPDX-License-Identifier: CECILL-2.1
"""
abstract type SpeechCorpus end
Abstract type for all speech corpora.
"""
abstract type SpeechCorpus end
"""
lang(corpus)
Return the ISO 639-3 code of the language of the corpus.
"""
lang
"""
name(corpus)
Return the name identifier of the corpus.
"""
name
"""
download(corpus, rootdir)
Download the data of the corpus to `rootdir`.
"""
Base.download
"""
prepare(corpus, rootdir)
Prepare the manifests of the corpus.
"""
prepare