readtextgrid
readtextgrid parses Praat textgrids into tidy R dataframes.
Features
- Simple: Minimal package with two core functions (read_textgrid() and read_textgrid_lines()).
- Tidy: Returns rectangular tibbles ready for downstream processing with dplyr and tidyr.
- Flexible: Supports both long and short textgrid file formats.
- Fast: Uses a compiled C++ tokenizer for high-throughput parsing.
Installation
Install readtextgrid from CRAN:
install.packages("readtextgrid")
Development version: install a precompiled build of readtextgrid from R-universe:
install.packages(
"readtextgrid",
repos = c("https://tjmahr.r-universe.dev", "https://cloud.r-project.org")
)
Basic usage
The example textgrid used below was created in Praat using
New > Create TextGrid... with default settings.
This textgrid is bundled with this R package. We can locate the file
with example_textgrid(). We read in the textgrid with
read_textgrid().
library(readtextgrid)
# Locates path to an example textgrid bundled with this package
tg <- example_textgrid()
read_textgrid(path = tg)
#> # A tibble: 3 × 10
#> file tier_num tier_name tier_type tier_xmin tier_xmax
#> <chr> <int> <chr> <chr> <dbl> <dbl>
#> 1 Mary_John_bell.TextGrid 1 Mary IntervalTier 0 1
#> 2 Mary_John_bell.TextGrid 2 John IntervalTier 0 1
#> 3 Mary_John_bell.TextGrid 3 bell TextTier 0 1
#> xmin xmax text annotation_num
#> <dbl> <dbl> <chr> <int>
#> 1 0 1 "" 1
#> 2 0 1 "" 1
#> 3 NA NA <NA> NA
The dataframe contains one row per annotation: one row for each interval
on an interval tier and one row for each point on a point tier. If a
point tier has no points, it is represented by a single row with NA
values.
The columns encode the following information:
- file: filename of the textgrid. By default this column uses the filename in path. A user can override this value by setting the file argument in read_textgrid(path, file), which can be useful if textgrids are stored in speaker-specific folders.
- tier_num: the number of the tier (as in the left margin of Praat’s textgrid editor)
- tier_name: the name of the tier (as in the right margin of Praat’s textgrid editor)
- tier_type: the type of the tier: "IntervalTier" for interval tiers and "TextTier" for point tiers (this is the terminology used inside of the textgrid file format)
- tier_xmin, tier_xmax: start and end times of the tier in seconds
- xmin, xmax: start and end times of the textgrid interval or point tier annotation in seconds
- text: the text in the annotation
- annotation_num: the number of the annotation in that tier (1 for the first annotation, etc.)
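As a small illustration of how these columns combine, the sketch below drops the placeholder row of an empty point tier and computes interval durations from xmin and xmax. It does not call read_textgrid(). Instead it uses a hand-built base R data frame with a subset of the columns, mixing values copied from two outputs in this README (the words tier and an empty point tier), so the names tg_data, annotated, words and duration are illustrative assumptions.

```r
# Hand-built stand-in for part of a read_textgrid() result (illustrative only)
tg_data <- data.frame(
  tier_name = c("words", "words", "words", "words", "bell"),
  tier_type = c(rep("IntervalTier", 4), "TextTier"),
  xmin = c(0, 0.297, 0.522, 0.972, NA),
  xmax = c(0.297, 0.522, 0.972, 1.35, NA),
  text = c("", "bird", "house", "", NA),
  annotation_num = c(1L, 2L, 3L, 4L, NA),
  stringsAsFactors = FALSE
)

# Drop the all-NA placeholder row that stands for an empty point tier
annotated <- tg_data[!is.na(tg_data$annotation_num), ]

# Keep non-empty intervals and compute their durations in seconds
words <- annotated[annotated$tier_type == "IntervalTier" & annotated$text != "", ]
words$duration <- words$xmax - words$xmin
words[, c("text", "xmin", "xmax", "duration")]
```

Filtering on annotation_num before comparing text avoids NA comparisons from the placeholder row.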
Reading in directories of textgrids
Suppose we have data on multiple speakers with one folder of textgrids
per speaker. As an example, this package has a folder called
speaker-data bundled with it containing five textgrids from each of two
speakers.
📂 speaker-data
├── 📂 speaker001
│ ├── s2T01.TextGrid
│ ├── s2T02.TextGrid
│ ├── s2T03.TextGrid
│ ├── s2T04.TextGrid
│ └── s2T05.TextGrid
└── 📂 speaker002
├── s2T01.TextGrid
├── s2T02.TextGrid
├── s2T03.TextGrid
├── s2T04.TextGrid
└── s2T05.TextGrid
First, we create a vector of file-paths to read into R.
# Get the path of the folder bundled with the package
data_dir <- system.file(package = "readtextgrid", "speaker-data")
# Get the full paths to all the textgrids
paths <- list.files(
path = data_dir,
pattern = "TextGrid$",
full.names = TRUE,
recursive = TRUE
)
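The pattern argument of list.files() is a regular expression, so "TextGrid$" keeps any file whose name ends in TextGrid. The sketch below checks that behavior on a throwaway directory. The folder layout and file names are made up for the demonstration and are not the files bundled with the package.

```r
# Build a temporary folder tree with made-up, empty files
demo_dir <- file.path(tempdir(), "textgrid-demo")
dir.create(file.path(demo_dir, "speaker001"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_dir, "speaker002"), recursive = TRUE, showWarnings = FALSE)
demo_files <- file.path(
  demo_dir,
  c("speaker001/a.TextGrid", "speaker001/a.wav", "speaker002/b.TextGrid")
)
invisible(file.create(demo_files))

# Same arguments as above: regex match on the name, recursing into subfolders
found <- list.files(
  path = demo_dir,
  pattern = "TextGrid$",
  full.names = TRUE,
  recursive = TRUE
)
basename(found)
```

Only the two .TextGrid files are returned; the .wav file is skipped. A stricter pattern such as "\\.TextGrid$" would also require the literal dot before the extension.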
We can use purrr::map() to map the read_textgrid() function over the
paths, which reads all these textgrids into R as a list, and then combine
the list into a single dataframe with purrr::list_rbind(). Note that this
approach doesn’t track any speaker information.
library(purrr)
paths |>
map(read_textgrid) |>
list_rbind()
#> # A tibble: 150 × 10
#> file tier_num tier_name tier_type tier_xmin tier_xmax xmin
#> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 s2T01.TextGrid 1 words IntervalTier 0 1.35 0
#> 2 s2T01.TextGrid 1 words IntervalTier 0 1.35 0.297
#> 3 s2T01.TextGrid 1 words IntervalTier 0 1.35 0.522
#> 4 s2T01.TextGrid 1 words IntervalTier 0 1.35 0.972
#> 5 s2T01.TextGrid 2 phones IntervalTier 0 1.35 0
#> 6 s2T01.TextGrid 2 phones IntervalTier 0 1.35 0.297
#> 7 s2T01.TextGrid 2 phones IntervalTier 0 1.35 0.36
#> 8 s2T01.TextGrid 2 phones IntervalTier 0 1.35 0.495
#> 9 s2T01.TextGrid 2 phones IntervalTier 0 1.35 0.522
#> 10 s2T01.TextGrid 2 phones IntervalTier 0 1.35 0.621
#> xmax text annotation_num
#> <dbl> <chr> <int>
#> 1 0.297 "" 1
#> 2 0.522 "bird" 2
#> 3 0.972 "house" 3
#> 4 1.35 "" 4
#> 5 0.297 "sil" 1
#> 6 0.36 "B" 2
#> 7 0.495 "ER1" 3
#> 8 0.522 "D" 4
#> 9 0.621 "HH" 5
#> 10 0.783 "AW1" 6
#> # ℹ 140 more rows
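The same map-then-bind pattern can be sketched in base R with lapply() and do.call(rbind, ...). To keep the snippet runnable without any textgrid files, fake_reader() below is a hypothetical stand-in that returns a one-row data frame per path; in real use read_textgrid would take its place. The paths are made up as well.

```r
# Hypothetical stand-in for read_textgrid(): one row per path,
# with the file column set to the basename, as read_textgrid() does by default
fake_reader <- function(path) {
  data.frame(
    file = basename(path),
    n_path_chars = nchar(path),
    stringsAsFactors = FALSE
  )
}

demo_paths <- c("speaker001/s2T01.TextGrid", "speaker002/s2T01.TextGrid")

# lapply() plays the role of purrr::map();
# do.call(rbind, ...) plays the role of purrr::list_rbind()
combined <- do.call(rbind, lapply(demo_paths, fake_reader))
combined$file
```

Both rows end up with the same file value, which shows why the basename alone cannot tell the two speakers apart.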
By default, read_textgrid() uses the file basename (the file-path
minus the directory part) for the file column. But we can manually set
the file value. Here, we use purrr::map2() to map
read_textgrid(path, file) over path and file pairs. Then we add
the speaker information with some dataframe manipulation functions.
library(dplyr)
# This tells read_textgrid() to set the file column to the full path
data <- map2(paths, paths, read_textgrid) |>
list_rbind() |>
mutate(
# basename() removes the folder part from a path,
# dirname() removes the file part from a path
speaker = basename(dirname(file)),
file = basename(file),
) |>
select(
speaker, everything()
)
data
#> # A tibble: 150 × 11
#> speaker file tier_num tier_name tier_type tier_xmin tier_xmax
#> <chr> <chr> <int> <chr> <chr> <dbl> <dbl>
#> 1 speaker001 s2T01.TextGrid 1 words IntervalTier 0 1.35
#> 2 speaker001 s2T01.TextGrid 1 words IntervalTier 0 1.35
#> 3 speaker001 s2T01.TextGrid 1 words IntervalTier 0 1.35
#> 4 speaker001 s2T01.TextGrid 1 words IntervalTier 0 1.35
#> 5 speaker001 s2T01.TextGrid 2 phones IntervalTier 0 1.35
#> 6 speaker001 s2T01.TextGrid 2 phones IntervalTier 0 1.35
#> 7 speaker001 s2T01.TextGrid 2 phones IntervalTier 0 1.35
#> 8 speaker001 s2T01.TextGrid 2 phones IntervalTier 0 1.35
#> 9 speaker001 s2T01.TextGrid 2 phones IntervalTier 0 1.35
#> 10 speaker001 s2T01.TextGrid 2 phones IntervalTier 0 1.35
#> xmin xmax text annotation_num
#> <dbl> <dbl> <chr> <int>
#> 1 0 0.297 "" 1
#> 2 0.297 0.522 "bird" 2
#> 3 0.522 0.972 "house" 3
#> 4 0.972 1.35 "" 4
#> 5 0 0.297 "sil" 1
#> 6 0.297 0.36 "B" 2
#> 7 0.36 0.495 "ER1" 3
#> 8 0.495 0.522 "D" 4
#> 9 0.522 0.621 "HH" 5
#> 10 0.621 0.783 "AW1" 6
#> # ℹ 140 more rows
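The basename() and dirname() step used above can be checked in isolation. This sketch uses a single made-up relative path shaped like the bundled folder layout; demo_path, speaker and file_name are illustrative names.

```r
# A made-up path with the same shape as the bundled speaker-data layout
demo_path <- "speaker-data/speaker001/s2T01.TextGrid"

# dirname() drops the file part, leaving the containing folder
folder <- dirname(demo_path)

# basename() of that folder gives the speaker ID
speaker <- basename(folder)

# basename() of the full path gives the textgrid filename
file_name <- basename(demo_path)

c(folder = folder, speaker = speaker, file_name = file_name)
```

This relies on the speaker folder being the immediate parent of each textgrid; a deeper or flatter layout would need a different extraction step.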
Another strategy would be to read the textgrid dataframes into a list
column and tidyr::unnest() them.
# Read dataframes into a list column
data_nested <- tibble(
speaker = basename(dirname(paths)),
data = map(paths, read_textgrid)
)
# We have one row per textgrid dataframe because `data` is a list column
data_nested
#> # A tibble: 10 × 2
#> speaker data
#> <chr> <list>
#> 1 speaker001 <tibble [13 × 10]>
#> 2 speaker001 <tibble [15 × 10]>
#> 3 speaker001 <tibble [16 × 10]>
#> 4 speaker001 <tibble [12 × 10]>
#> 5 speaker001 <tibble [19 × 10]>
#> 6 speaker002 <tibble [13 × 10]>
#> 7 speaker002 <tibble [15 × 10]>
#> 8 speaker002 <tibble [16 × 10]>
#> 9 speaker002 <tibble [12 × 10]>
#> 10 speaker002 <tibble [19 × 10]>
# promote the nested dataframes into the main dataframe
tidyr::unnest(data_nested, "data")
#> # A tibble: 150 × 11
#> speaker file tier_num tier_name tier_type tier_xmin tier_xmax xmin xmax
#> <chr> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 speaker001 s2T0… 1 words Interval… 0 1.35 0 0.297
#> 2 speaker001 s2T0… 1 words Interval… 0 1.35 0.297 0.522
#> 3 speaker001 s2T0… 1 words
