CoSEM

The Corpus of Singapore English Messages (CoSEM) is a monitor corpus of online text messages collected between 2016 and 2022.

The Corpus of Singapore English Messages (CoSEM)

Overview

The Corpus of Singapore English Messages (CoSEM) is a corpus of online text messages collected between 2016 and 2022, compiled and managed by a group of scholars who share an interest in Colloquial Singapore English (CSE) research.

Features

  • Contains the following metadata (as reflected in tags)
    • age
    • gender
    • race
    • nationality
    • year of collection
    • year of utterance
  • Distributed in hierarchical text format, ready for use with concordance software such as AntConc and CasualConc

License

The CoSEM follows the CC BY-NC-SA 4.0 License, meaning that you are free to:

  • Share — copy and redistribute the material in any medium or format;
  • Adapt — remix, transform, and build upon the material.

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • Non-Commercial — You may not use the material for commercial purposes.
  • ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.


Disclaimer

The public version of CoSEM (see 'Corpus Version' below) has some limitations. Based on our initial assessment, the scrubber works for the most part. However, it also scrubs non-name information that might be important for analyses: for example, (a) below becomes (b). We also observed that it only scrubs the kinds of information covered by the scrubadub package. We therefore recommend that you exercise caution when using the public version of the corpus. Our intention is to contribute to the scholarly community by sharing whatever we can at this time. We thank you for your patience.

(a) <COSEM:18MF02-9431-23CHF-2014> Why u still dunwan go homr
(b) <COSEM:18MF02-9431-23CHF-2014> Why u still {{NAME}} go {{NAME}}
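One way to gauge how heavily a file has been scrubbed is to count the placeholder codes it contains. The following is a minimal sketch, not part of the CoSEM tooling: the function name `count_placeholders` is hypothetical, and it assumes every placeholder is written as an uppercase label in double braces, as in {{NAME}} above.

```python
import re
from collections import Counter

# Assumption: placeholders look like {{NAME}} or {{EMAIL}},
# i.e. an uppercase label wrapped in double braces.
PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")


def count_placeholders(lines):
    """Count how often each placeholder label occurs across the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(PLACEHOLDER.findall(line))
    return counts


sample = ["<COSEM:18MF02-9431-23CHF-2014> Why u still {{NAME}} go {{NAME}}"]
print(count_placeholders(sample))
```

Lines with an unusually high number of {{NAME}} codes may be worth checking by hand, since ordinary words can be scrubbed by mistake, as in (b).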

Overview

Collection methodology

Please see our overview paper (described under 'Overview paper' below) for the collection methodology.

Personal Data Scrubbing

Protecting the privacy of messaging data is paramount, so we try our best to apply a strict standard to sensitive data in the messages themselves. We use the scrubadub package in Python together with a customized RegEx script to remove as much sensitive data as we can. For example, any detected email address is replaced by the code {{EMAIL}}. This process gives at least some protection against publishing sensitive data. However, it is generally impossible to remove all sensitive data with any custom RegEx-based script or computational tool, even one trained with machine learning methods, like scrubadub. We took these steps to lower the anxiety of contributors and protect their privacy.
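To illustrate the kind of replacement described above, here is a minimal sketch of RegEx-based email scrubbing using only Python's standard library. It is not the script used to build CoSEM: the pattern and the function name `scrub_emails` are assumptions made for illustration, and a simple pattern like this will miss unusual address formats.

```python
import re

# Assumption: a deliberately simple email pattern for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_emails(message: str) -> str:
    """Replace every detected email address with the code {{EMAIL}}."""
    return EMAIL_PATTERN.sub("{{EMAIL}}", message)


print(scrub_emails("send to jane.doe@example.com pls"))
```

The same approach extends to other categories by adding one pattern and one placeholder code per category, which also shows why pattern-based scrubbing cannot catch everything: anything the patterns do not anticipate passes through unchanged.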

We thank Tao Chen and Min-Yen Kan for the disclaimer template.


Corpus metadata or tag format

Every utterance line is tagged with an identifier. The tag format makes it easy to locate a line within the corpus and to read off its metadata. For example, the tag < 17CF15-40341-20CHF-2016 > shows that the utterance was collected in 2017 by a Chinese female collector with identification number 15; that the utterance is line 40341 in the corpus; and that the line was produced by a 20-year-old Chinese Singaporean female in 2016.
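The example tag can be split into its fields with a regular expression. The sketch below is inferred from that single example and from the tag shown in the Disclaimer, so the layout is an assumption rather than a specification: `CosemTag`, `parse_tag`, and all field names are hypothetical, two-digit collection years are expanded by adding 2000 (consistent with the 2016 to 2022 collection window), and tags with missing or differently coded values will simply not match.

```python
import re
from typing import NamedTuple, Optional


class CosemTag(NamedTuple):
    collection_year: int
    collector_race: str
    collector_gender: str
    collector_id: str
    line_number: int
    speaker_age: int
    speaker_race: str
    speaker_gender: str
    utterance_year: int


# Assumed layout: <[COSEM:]YYRGid-line-ageRACEG-YYYY>, spaces inside <> allowed.
TAG_PATTERN = re.compile(
    r"<\s*(?:COSEM:)?"
    r"(?P<cy>\d{2})(?P<cr>[A-Z])(?P<cg>[A-Z])(?P<cid>\d+)"
    r"-(?P<line>\d+)"
    r"-(?P<age>\d+)(?P<sr>[A-Z]+)(?P<sg>[A-Z])"
    r"-(?P<uy>\d{4})\s*>"
)


def parse_tag(text: str) -> Optional[CosemTag]:
    """Return the first tag found in text, or None if nothing matches."""
    m = TAG_PATTERN.search(text)
    if m is None:
        return None
    return CosemTag(
        collection_year=2000 + int(m["cy"]),
        collector_race=m["cr"],
        collector_gender=m["cg"],
        collector_id=m["cid"],  # kept as text to preserve leading zeros
        line_number=int(m["line"]),
        speaker_age=int(m["age"]),
        speaker_race=m["sr"],
        speaker_gender=m["sg"],
        utterance_year=int(m["uy"]),
    )


print(parse_tag("< 17CF15-40341-20CHF-2016 >"))
```

Returning None for unmatched lines, rather than raising an error, makes it easy to skip or log lines whose tags do not follow the assumed layout.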


Corpus Version

The current public version stands at 6.9 million tokens (as of November 8, 2024). This version has been deduplicated, and punctuation is not counted as a token/word. It is anonymized using the scrubadub package as well as customized RegEx scripts. As with any automated scrubbing, some utterances may not be properly scrubbed. In such cases, we ask corpus users to exercise tact in using the data and to anonymize any unscrubbed private information in examples they intend to present.


The corpus is available in the main directory of the repository. Click 'Download' in the upper-right menu or 'View raw' in the center of the page. The file is a zip archive and must be decompressed/unzipped before use.
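If you prefer to unzip the download from a script, Python's standard `zipfile` module is enough. This is a minimal sketch: the function name `extract_corpus` is hypothetical, and the archive name in the usage comment is a placeholder, since the actual file name depends on what you downloaded.

```python
import zipfile
from pathlib import Path


def extract_corpus(zip_path, dest_dir):
    """Unzip the archive into dest_dir and return the paths of its members."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]


# Usage (placeholder file name, replace with your downloaded archive):
# files = extract_corpus("cosem_download.zip", "cosem_corpus")
```

The extracted text files can then be loaded into AntConc or CasualConc, or read line by line in your own scripts.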


CoSEM to AntConc Metadata Mapper

The CoSEM to AntConc Metadata Mapper, developed by Hugo Andersson, is a third-party tool designed to facilitate the integration of metadata from the Corpus of Singapore English Messages (CoSEM) into AntConc for enhanced linguistic analysis. This tool provides a streamlined approach to utilizing CoSEM metadata—such as gender, nationality, and age—for advanced searches within AntConc.

Purpose and Functionality

The CoSEM corpus comprises nearly 900,000 lines of text messages, accompanied by metadata that captures key demographic attributes of the sender. However, importing and utilizing this metadata within AntConc has been challenging. Andersson’s tool automates the reformatting required to load the corpus into AntConc while preserving demographic metadata, enabling users to filter search results based on sender characteristics.

Key Features

  • Automated Metadata Processing: Reformats CoSEM files for seamless AntConc integration.
  • SQL Query Generation: Simplifies the creation of queries to filter messages by age, nationality, race, gender, and year.
  • Shell Script Execution: A single command (sh run_formatting.sh) processes the corpus efficiently.
  • AntConc Compatibility: Ensures metadata can be utilized for advanced search functionalities.

This tool significantly improves metadata utilization, making CoSEM a more powerful resource for linguistic research.

Repository

For more details and access to the tool, visit the GitHub repository: CoSEM to AntConc Metadata Mapper (https://github.com/Roborhugo/CoSEM-to-AntConc-Metadata-Mapper).

Citation

Andersson, H. (n.d.). *CoSEM to AntConc Metadata Mapper* [Computer software]. GitHub. Retrieved from https://github.com/Roborhugo/CoSEM-to-AntConc-Metadata-Mapper


Overview paper

We have published a paper that explains the motivations behind developing a new corpus for the investigation of CSE in 2021. It documents the process of compiling and organizing CoSEM and describes the corpus’s initial structure and composition. We further discuss the social variables used in tagging the data, as well as the ethical challenges, advantages, and disadvantages unique to online message datasets. In addition, we present preliminary analyses of two selected CSE features: (1) the Hokkien-derived expression (bo)jio and (2) sentence-final adverbs (already, also, only). We conclude the article with notes on future directions.

The paper can be found at https://onlinelibrary.wiley.com/doi/10.1111/weng.12534.

To Cite

Please cite the overview paper if you use our corpus or mention it in your work.


Gonzales, Wilkinson Daniel Wong; Mie Hiramoto; Jakob R.E. Leimgruber; and Jun Jie Lim. 2023. The Corpus of Singapore English Messages (CoSEM). World Englishes 42.371–388. doi:10.1111/weng.12534.

@article{gonzales_corpus_2023,
	title = {The {Corpus} of {Singapore} {English} {Messages} ({CoSEM})},
	volume = {42},
	copyright = {CC0 1.0 Universal Public Domain Dedication},
	issn = {0883-2919, 1467-971X},
	url = {https://onlinelibrary.wiley.com/doi/10.1111/weng.12534},
	doi = {10.1111/weng.12534},
	language = {en},
	number = {2},
	urldate = {2022-02-19},
	journal = {World Englishes},
	author = {Gonzales, Wilkinson Daniel Wong and Hiramoto, Mie and Leimgruber, Jakob R.E. and Lim, Jun Jie},
	year = {2023},
	pages = {371--388},
}

Related papers

If you decide to use or mention our corpus, we would highly appreciate it if you could cite these papers (available here) alongside the overview paper:

  1. Gonzales, Wilkinson Daniel Wong, Mie Hiramoto, Jakob Leimgruber, Jun Jie Lim. 2022. Is it in Colloquial Singapore English: What variation can tell us about its conventions and development. English Today, Cambridge University Press. https://www.doi.org/10.1017/S0266078422000141

  2. Leimgruber, Jakob, Jun Jie Lim, Wilkinson Daniel Wong Gonzales, Mie Hiramoto. 2020. Ethnic and gender variation in the use of Colloquial Singap
