Icealign
Search Tool for the Icelandic Parallel Corpus
Table of Contents
<!-- ⛔️ MD-MAGIC-EXAMPLE:START (TOC:collapse=true&collapseText=Click to expand) --> <details> <summary>Click to expand</summary> </details> <!-- ⛔️ MD-MAGIC-EXAMPLE:END -->
1 Introduction
IceAlign is a Django web application, which can easily be deployed to Heroku.
This web application serves as a corpus search tool for the Icelandic Parallel Corpus (IPC).
It improves upon the already existing corpus tool available at http://malheildir.arnastofnun.is/
<img src="docs/images/icealign.herokuapp.com_samhlida_fornritin_.png" alt="IceAlign Screenshot" align="middle"/> <p align="center"> IceAlign - Corpus Search Tool Screenshot </p>
2 The Dataset
The data for the IPC can be obtained here: ParIce-1.0
More details about the content of the corpus are available at malfong.is.
The preprocessed data entries for version ParIce-1.0 can be found in the parice_dataset directory.
2.1 Preprocess
Skip this section if you want to use the dataset that is already provided. Follow these steps if you want to generate your own .txt files (the same as those under parice_dataset) from the .tmx files in the IPC ParIce-1.0.zip.
Step 1
Unzip the IPC dataset you downloaded from malfong.is.
$ unzip ParIce-1.0.zip -d ParIce-1.0
Make sure the ParIce-1.0 directory is in the root of this project.
Step 2
Move our tmx2txt.sh script to the ParIce-1.0 directory.
~/icealign
$ mv scripts/tmx2txt.sh ParIce-1.0/
Step 3
Make the script executable and run it.
~/icealign
$ cd ParIce-1.0/
~/icealign/ParIce-1.0
$ chmod +x tmx2txt.sh
~/icealign/ParIce-1.0
$ bash tmx2txt.sh
The script extracts all the line entries from each .tmx file and then splits the larger files into smaller files of 50000 line entries each. This keeps a Heroku Free or Hobby server from running out of memory and crashing while the Heroku database is being populated.
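As a rough illustration of what the extraction and splitting involve, here is a minimal Python sketch. It is not the tmx2txt.sh script itself, whose internals are not shown here. It assumes standard TMX markup (`<tu>` units holding `<tuv>` variants with one `<seg>` each), and the names `iter_segment_pairs`, `chunked`, and `SAMPLE_TMX` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from itertools import islice

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_segment_pairs(tmx_source):
    """Yield one {language: text} dict per <tu> element of a TMX file."""
    for _event, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        pair = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                lang = tuv.get(XML_LANG) or tuv.get("lang", "")
                pair[lang] = "".join(seg.itertext()).strip()
        yield pair
        elem.clear()  # drop the processed unit so memory use stays flat


def chunked(items, size=50000):
    """Group an iterable into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical in-memory sample; a real run would pass a .tmx file path.
SAMPLE_TMX = (
    '<tmx version="1.4"><body><tu>'
    '<tuv xml:lang="en"><seg>Hello</seg></tuv>'
    '<tuv xml:lang="is"><seg>Halló</seg></tuv>'
    "</tu></body></tmx>"
).encode("utf-8")

for chunk in chunked(iter_segment_pairs(io.BytesIO(SAMPLE_TMX))):
    print(len(chunk), chunk[0])
```

Streaming with `iterparse` and grouping lazily means neither the whole .tmx file nor the whole list of entries has to sit in memory at once, which is the same concern that motivates the 50000-entry limit.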
Step 4
Rename the output directory to parice_dataset and move it to the root of this project.
~/icealign/ParIce-1.0
$ mv text parice_dataset
~/icealign/ParIce-1.0
$ mv parice_dataset ../
3 Setup
Make sure you have Python 3.7 installed locally. To push to Heroku, you'll need to install the Heroku CLI, as well as Postgres.
You will also need to set the environment variables SECRET_KEY and DEBUG_VALUE, both on your local machine and for your Heroku app. Read More Here
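How the project's settings module consumes these two variables is not shown in this README. The following is only a sketch of one common pattern, assuming DEBUG_VALUE carries the string "True" when debug mode should be on. The helper name `read_settings` is hypothetical.

```python
import os


def read_settings(environ=os.environ):
    """Return (secret_key, debug) read from environment variables."""
    # A missing SECRET_KEY raises KeyError, so misconfiguration fails early.
    secret_key = environ["SECRET_KEY"]
    # Assumption: only the exact string "True" enables debug mode.
    debug = environ.get("DEBUG_VALUE", "False") == "True"
    return secret_key, debug
```

Comparing against a string matters because environment variables are always text: a value such as "False" would be truthy if it were tested directly.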
~/icealign/
$ python -m venv getting-started
$ source getting-started/bin/activate
$ pip install -r requirements.txt
$ createdb icealign
$ python manage.py migrate
$ python manage.py collectstatic
$ heroku local
The IceAlign app should now be running on localhost:5000.
To populate our Postgres database with the line entries in the parice_dataset directory, run:
~/icealign/
$ mv scripts/populate_samhlida.py ./
$ python populate_samhlida.py
Populating Samhlida database!
This might take some time to run.
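The internals of populate_samhlida.py are not reproduced here. As a hedged sketch of why population can stay within a small memory budget, the snippet below streams lines from .txt files and hands them to a callback in fixed-size batches. The names `iter_lines` and `insert_in_batches`, the batch size, and the assumption of one entry per line are all illustrative rather than taken from the project.

```python
from pathlib import Path


def iter_lines(dataset_dir):
    """Yield (file name, line) for every non-empty line in the .txt files."""
    for path in sorted(Path(dataset_dir).rglob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line:
                    yield path.name, line


def insert_in_batches(rows, save_batch, batch_size=1000):
    """Pass rows to save_batch in lists of at most batch_size; return the row count."""
    batch, total = [], 0
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            save_batch(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        save_batch(batch)
        total += len(batch)
    return total
```

In a Django project, `save_batch` could wrap a call to a model manager's `bulk_create`, so that each batch becomes a single insert instead of one query per line.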
4 Authors
- Egill Anton Hlöðversson - MSc. Language Technology Student
5 License
This project is licensed under the MIT License - see the LICENSE file for details.
6 References
<p align="center"> 🌟 PLEASE STAR THIS REPO IF YOU FOUND SOMETHING INTERESTING 🌟 </p>
