CoSEM

The Corpus of Singapore English Messages (CoSEM) is a monitor corpus of online text messages collected between 2016 and 2022.


The Corpus of Singapore English Messages (CoSEM)

Overview

The Corpus of Singapore English Messages (CoSEM) is a corpus of online text messages collected between 2016 and 2022, compiled and managed by a group of scholars who share an interest in Colloquial Singapore English (CSE) research.

Features

Contains the following metadata (as reflected in tags):
- age
- gender
- race
- nationality
- year of collection
- year of utterance

Provided in hierarchical text format, primed for concordance software including AntConc and CasualConc.

License

The CoSEM follows the CC BY-NC-SA 4.0 License, meaning that you are free to:

Share — copy and redistribute the material in any medium or format;
Adapt — remix, transform, and build upon the material.

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
Non-Commercial — You may not use the material for commercial purposes.
ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Disclaimer

The public version of the CoSEM (see 'Corpus Version' below) has some limitations. Based on our initial assessment, the scrubber does its work for the most part. However, it also scrubs non-name information that might be important for analyses; for example, (a) below becomes (b). Furthermore, we observed that it only scrubs the kinds of information covered by the scrubadub package. As such, we recommend that you exercise caution in using the public version of the corpus. We put forward our intention of contributing to the scholarly community by showing whatever we can at this time. We thank you for your patience.

(a) <COSEM:18MF02-9431-23CHF-2014> Why u still dunwan go homr
(b) <COSEM:18MF02-9431-23CHF-2014> Why u still {{NAME}} go {{NAME}}
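
One way to gauge how heavily a line was scrubbed is to count its placeholder tokens. The following is a minimal sketch and not part of the CoSEM tooling: the function name placeholder_counts is hypothetical, and it assumes that placeholders always appear as an uppercase label in double curly braces, as in example (b).

```python
import re
from collections import Counter

# Assumed placeholder shape: an uppercase label wrapped in double braces,
# e.g. {{NAME}} or {{EMAIL}}. Other shapes would not be counted.
PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")


def placeholder_counts(line):
    """Count each placeholder label that appears in one corpus line."""
    return Counter(PLACEHOLDER.findall(line))


example = "<COSEM:18MF02-9431-23CHF-2014> Why u still {{NAME}} go {{NAME}}"
counts = placeholder_counts(example)
print(counts["NAME"])  # 2
```

Lines with several placeholders in a short message, like (b), may be worth checking by hand before they are used in an analysis.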

Overview

Collection methodology

Please check out our overview paper (see 'Overview paper' below).

Personal Data Scrubbing

Ensuring the privacy of messaging data is paramount, and as such, we try our best to adopt a strict standard in dealing with sensitive data in the messages themselves. We use the scrubadub package in Python as well as a customized RegEx script to remove as much sensitive data as we can. For example, any detected email address is replaced by the code {{EMAIL}}. This process gives at least some protection against publishing sensitive data. Although we remove such confidential information, it is generally impossible to remove all sensitive data with any custom RegEx-based script or computational method, including one that relies on machine learning methods, like scrubadub. We did this to lower the anxiety of contributors and protect their privacy.
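
To illustrate the kind of replacement described above, here is a minimal sketch of a RegEx-based email scrubber. It is not the script used for CoSEM: the pattern and the function name scrub_emails are assumptions made for illustration, and the pattern is deliberately simple, so it will miss unusual addresses.

```python
import re

# Deliberately simple email pattern (an assumption for this sketch);
# real-world addresses can be more varied than this.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_emails(text):
    """Replace every detected email address with the code {{EMAIL}}."""
    return EMAIL.sub("{{EMAIL}}", text)


print(scrub_emails("send to jane.doe@example.com pls"))
# send to {{EMAIL}} pls
```

A pattern like this only catches what it was written to catch, which is one reason automated scrubbing cannot guarantee that all sensitive data is removed.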

We thank Tao Chen and Min-Yen Kan for the disclaimer template.

Corpus metadata or tag format

Every line of utterance has been tagged with an identifier tag. The tag format allows for easy identification of a line of utterance within the corpus, and for easy interpretation of relevant metadata. For example, the tag <17CF15-40341-20CHF-2016> shows that the utterance was collected in the year 2017 by a Chinese female collector with identification number 15; the utterance is line 40341 in the corpus; and the line was produced by a 20-year-old Chinese Singaporean female in 2016.
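
The tag structure described above can be split into its parts with a regular expression. This is a rough sketch under stated assumptions rather than an official parser: the names CosemTag and parse_tag are hypothetical, the two-digit collection year is assumed to belong to the 2000s, the optional COSEM: prefix is taken from the examples in the Disclaimer, and the letter codes (such as CF and CHF) are kept as raw strings because their full mapping to race, nationality, and gender is not spelled out here.

```python
import re
from typing import NamedTuple, Optional


class CosemTag(NamedTuple):
    collection_year: int  # e.g. 2017, expanded from the two-digit "17"
    collector_code: str   # e.g. "CF" (collector's demographic letters, undecoded)
    collector_id: str     # e.g. "15" (kept as text to preserve leading zeros)
    line_number: int      # e.g. 40341
    speaker_age: int      # e.g. 20
    speaker_code: str     # e.g. "CHF" (speaker's demographic letters, undecoded)
    utterance_year: int   # e.g. 2016


# Assumed tag shape, based only on the examples shown in this README.
TAG = re.compile(
    r"<\s*(?:COSEM:)?(\d{2})([A-Z]+)(\d+)-(\d+)-(\d+)([A-Z]+)-(\d{4})\s*>"
)


def parse_tag(line: str) -> Optional[CosemTag]:
    """Return the parsed tag found in a corpus line, or None if none matches."""
    m = TAG.search(line)
    if m is None:
        return None
    yy, ccode, cid, lineno, age, scode, uyear = m.groups()
    return CosemTag(2000 + int(yy), ccode, cid, int(lineno),
                    int(age), scode, int(uyear))


tag = parse_tag("<COSEM:17CF15-40341-20CHF-2016> example message")
print(tag.collection_year, tag.line_number, tag.speaker_age, tag.utterance_year)
# 2017 40341 20 2016
```

Lines whose tags deviate from this shape return None, so a real workflow would want to log and inspect those cases instead of silently dropping them.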

Corpus Version

The current public version stands at 6.9 million tokens (as of November 8, 2024). This version is cleaned for duplicates and does not treat punctuation as a token/word. It is anonymized using the scrubadub package as well as customized RegEx scripts. As with any type of automated scrubbing, there is a chance that some utterances are not properly scrubbed, in which case, we ask the corpus users to exercise tact in using the data, anonymizing unscrubbed private information if they intend to present examples that contain them.
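
For readers who want to compute their own token counts, the sketch below shows one plausible way to count tokens without treating punctuation as a token. The exact tokenization behind the published figure is not specified here, so this is an assumption: the hypothetical function count_tokens splits on whitespace and skips any item that contains no letter or digit.

```python
def count_tokens(message):
    """Count whitespace-separated items that contain at least one letter or digit.

    Items made up only of punctuation or symbols are skipped. This is an
    assumed approximation, not the counting method used for the corpus.
    """
    return sum(
        1 for item in message.split()
        if any(ch.isalnum() for ch in item)
    )


print(count_tokens("Why u still dunwan go homr ?"))  # 6
```

Under this rule a placeholder such as {{NAME}} counts as one token, and emoji-only items are skipped, so counts from this sketch may differ from the published total.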

The corpus can be accessed in the main directory of the repository. To download it, click 'Download' in the upper-right menu or 'View raw' in the center. The file is in zip format and needs to be decompressed (unzipped) before use.

CoSEM to AntConc Metadata Mapper

The CoSEM to AntConc Metadata Mapper, developed by Hugo Andersson, is a third-party tool designed to facilitate the integration of metadata from the Corpus of Singapore English Messages (CoSEM) into AntConc for enhanced linguistic analysis. This tool provides a streamlined approach to utilizing CoSEM metadata—such as gender, nationality, and age—for advanced searches within AntConc.

Purpose and Functionality

The CoSEM corpus comprises nearly 900,000 lines of text messages, accompanied by metadata that captures key demographic attributes of the sender. However, importing and utilizing this metadata within AntConc has been challenging. Andersson’s tool automates the reformatting required to load the corpus into AntConc while preserving demographic metadata, enabling users to filter search results based on sender characteristics.

Key Features

Automated Metadata Processing: Reformats CoSEM files for seamless AntConc integration.
SQL Query Generation: Simplifies the creation of queries to filter messages by age, nationality, race, gender, and year.
Shell Script Execution: A single command (sh run_formatting.sh) processes the corpus efficiently.
AntConc Compatibility: Ensures metadata can be utilized for advanced search functionalities.

This tool significantly improves metadata utilization, making CoSEM a more powerful resource for linguistic research.

Repository

For more details and access to the tool, visit the GitHub repository: CoSEM to AntConc Metadata Mapper (https://github.com/Roborhugo/CoSEM-to-AntConc-Metadata-Mapper).

Citation

Andersson, H. (n.d.). *CoSEM to AntConc Metadata Mapper* [Computer software]. GitHub. Retrieved from https://github.com/Roborhugo/CoSEM-to-AntConc-Metadata-Mapper

Team

Assoc. Prof. Mie HIRAMOTO (National University of Singapore, Singapore) - Principal Investigator
Prof. Jakob LEIMGRUBER (University of Regensburg, Germany)
Asst. Prof. Wilkinson Daniel Wong GONZALES (The Chinese University of Hong Kong, Hong Kong SAR, People's Republic of China) - Database Manager, Data Analyst
Jun Jie LIM (University of California, San Diego, USA)
Mohamed Hafiz Bin MOHAMED JURAIMI (National University of Singapore, Singapore)
Asst. Prof. Nick HUANG (National University of Singapore, Singapore)

Overview paper

We have published a paper that explains the motivations behind developing a new corpus for the investigation of CSE in 2021. It documents the process of compiling and organizing CoSEM and describes the corpus’s initial structure and composition. We further discuss the social variables used in tagging the data, as well as ethical challenges, advantages, and disadvantages unique to online message datasets. In addition, we present preliminary analyses of two selected CSE features: (1) the Hokkien-derived expression (bo)jio and (2) sentence-final adverbs (already, also, only). We conclude the article with notes on future directions.

The paper can be found at https://onlinelibrary.wiley.com/doi/10.1111/weng.12534.

To Cite

Please cite the overview paper if you use our corpus or mention it in your work.

Gonzales, Wilkinson Daniel Wong; Mie Hiramoto; Jakob R.E. Leimgruber; and Jun Jie Lim. 2023. The Corpus of Singapore English Messages (CoSEM). World Englishes 42.371–388. doi:10.1111/weng.12534.

@article{gonzales_corpus_2023,
	title = {The {Corpus} of {Singapore} {English} {Messages} ({CoSEM})},
	volume = {42},
	copyright = {CC0 1.0 Universal Public Domain Dedication},
	issn = {0883-2919, 1467-971X},
	url = {https://onlinelibrary.wiley.com/doi/10.1111/weng.12534},
	doi = {10.1111/weng.12534},
	language = {en},
	number = {2},
	urldate = {2022-02-19},
	journal = {World Englishes},
	author = {Gonzales, Wilkinson Daniel Wong and Hiramoto, Mie and Leimgruber, Jakob R.E. and Lim, Jun Jie},
	year = {2023},
	pages = {371--388},
}
