BabySLM

Behavioral probing of language acquisition models at the lexical and syntactic level


BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [paper link]

Welcome to this repository where you'll find all you need to evaluate your language model at:

- the lexical level, using a spot-the-word task (available in audio or phonetic form; see Table 1)
- the syntactic level, using a grammatical acceptability judgment task (available in audio, phonetic or orthographic form; see Table 2)

Getting started

You'll probably want to start from there:

Examples of stimuli

Stimuli examples can be listened to on this web page.

| Word | Pseudo-word | Word | Pseudo-word |
|--------|-------------------------------------|--------|------------------------------------------|
| hello | lello, pello, sero, dello, sello | cookie | kootie, koonie, roodie, rootie, boonie |

Table 1: Minimal pairs of real and pseudo-words used in the spot-the-word lexical task.
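As a rough illustration of how a spot-the-word score could be computed from stimuli like those in Table 1, here is a minimal sketch. It is not the repository's evaluation script. The function `spot_the_word_accuracy`, the `toy_score` stand-in, and the choice to average over pseudo-words before averaging over words are all assumptions made for this example; in practice `score` would be whatever (pseudo-)log-probability your model assigns to a stimulus.

```python
from statistics import mean
from typing import Callable, Dict, List

# Word -> pseudo-words, copied from Table 1.
STIMULI: Dict[str, List[str]] = {
    "hello": ["lello", "pello", "sero", "dello", "sello"],
    "cookie": ["kootie", "koonie", "roodie", "rootie", "boonie"],
}


def spot_the_word_accuracy(
    score: Callable[[str], float], stimuli: Dict[str, List[str]]
) -> float:
    """Fraction of (word, pseudo-word) pairs where the real word scores higher.

    Assumption: accuracy is first averaged over the pseudo-words of each word,
    then over words, so every word carries the same weight.
    """
    per_word = []
    for word, pseudo_words in stimuli.items():
        word_score = score(word)
        wins = [1.0 if word_score > score(p) else 0.0 for p in pseudo_words]
        per_word.append(mean(wins))
    return mean(per_word)


# Hypothetical stand-in for a model: known words get a higher score.
KNOWN_WORDS = {"hello": -2.0, "cookie": -3.0}


def toy_score(item: str) -> float:
    return KNOWN_WORDS.get(item, -10.0)


accuracy = spot_the_word_accuracy(toy_score, STIMULI)
print(f"spot-the-word accuracy: {accuracy:.2f}")
```

A scorer that cannot tell words from pseudo-words (for instance one returning a constant) gets 0 under this strict comparison, since ties count as errors here; that tie-handling rule is also an assumption of the sketch.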

| Phenomenon | Sentence example |
|---------------------------|-------------------------------------------------------------------|
| Adjective-noun order | ✓ The good mom. ✗ The mom good. |
| Noun-verb order | ✓ The dragon says. ✗ The says dragon. |
| Anaphor-gender agreement | ✓ The dad cuts himself. ✗ The dad cuts herself. |
| Anaphor-number agreement | ✓ The boys told themselves. ✗ The boys told himself. |
| Determiner-noun agreement | ✓ Each good sister. ✗ Many good sister. |
| Noun-verb agreement | ✓ The prince needs the princess. ✗ The prince need the princess. |

Table 2: Minimal pairs of grammatical (✓) and ungrammatical (✗) sentences used in the syntactic task.
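For the syntactic task, it can be useful to break accuracy down by phenomenon. The sketch below does this for the six example pairs of Table 2. It is only an illustration under stated assumptions: `accuracy_by_phenomenon` and `oracle_score` are hypothetical names, and the oracle scorer simply stands in for the probability a real model would assign to each sentence.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# (phenomenon, grammatical sentence, ungrammatical sentence), from Table 2.
PAIRS: List[Tuple[str, str, str]] = [
    ("Adjective-noun order", "The good mom.", "The mom good."),
    ("Noun-verb order", "The dragon says.", "The says dragon."),
    ("Anaphor-gender agreement", "The dad cuts himself.", "The dad cuts herself."),
    ("Anaphor-number agreement", "The boys told themselves.", "The boys told himself."),
    ("Determiner-noun agreement", "Each good sister.", "Many good sister."),
    ("Noun-verb agreement", "The prince needs the princess.", "The prince need the princess."),
]


def accuracy_by_phenomenon(
    score: Callable[[str], float], pairs: List[Tuple[str, str, str]]
) -> Dict[str, float]:
    """Per-phenomenon share of pairs where the grammatical sentence scores higher."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for phenomenon, grammatical, ungrammatical in pairs:
        total[phenomenon] += 1
        if score(grammatical) > score(ungrammatical):
            correct[phenomenon] += 1
    return {name: correct[name] / total[name] for name in total}


# Hypothetical oracle used only to make the example runnable:
# it scores the grammatical sentences of Table 2 above everything else.
GRAMMATICAL = {grammatical for _, grammatical, _ in PAIRS}


def oracle_score(sentence: str) -> float:
    return 0.0 if sentence in GRAMMATICAL else -1.0


results = accuracy_by_phenomenon(oracle_score, PAIRS)
for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```

Replacing `oracle_score` with a function that returns your model's sentence score is the only change needed to reuse the loop on real data.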

Reproduce the BabySLM benchmark

If you want to go further:

How to cite?

@inproceedings{lavechin2023baby,
title={BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models},
author={Lavechin, Marvin and Sy, Yaya and Titeux, Hadrien and Bland{\'o}n, Mar{\'\i}a Andrea Cruz and R{\"a}s{\"a}nen, Okko and Bredin, Herv{\'e} and Dupoux, Emmanuel and Cristia, Alejandrina},
year={2023},
booktitle = {Interspeech}
}

Additionally, if you use BabyBERTa, please cite:

@inproceedings{huebner2021babyberta,
  title={BabyBERTa: Learning more grammar with small-scale child-directed language},
  author={Huebner, Philip A and Sulem, Elior and Fisher, Cynthia and Roth, Dan},
  booktitle={Proceedings of the 25th conference on computational natural language learning},
  pages={624--646},
  year={2021}
}

If you use the Providence corpus, please cite:

@inproceedings{borschinger2013joint,
  title={A joint model of word segmentation and phonological variation for English word-final /t/-deletion},
  author={B{\"o}rschinger, Benjamin and Johnson, Mark and Demuth, Katherine},
  booktitle={Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={1508--1516},
  year={2013}
}

If you use the LibriVox corpus, please cite:

@article{kearns2014librivox,
  title={Librivox: Free public domain audiobooks},
  author={Kearns, Jodi},
  journal={Reference Reviews},
  volume={28},
  number={1},
  pages={7--8},
  year={2014},
  publisher={Emerald Group Publishing Limited}
}
