Icealign

Search Tool for the Icelandic Parallel Corpus


Project 1
MLT201F Language resources for software development and research, 2019-1
University of Iceland - School of Humanities, Sæmundargata 2, IS-101 Reykjavik, Iceland

<img src="docs/images/header.png" alt="Reykjavik University Logo" align="middle"/>

http://icealign.herokuapp.com/

<details> <summary>Click to expand</summary>

Introduction
The Dataset
- Preprocess
Setup
Authors
License
References

</details>

1 Introduction

IceAlign is a Django web application, which can easily be deployed to Heroku.

This web application serves as a corpus search tool for the Icelandic Parallel Corpus (IPC).

It improves upon the existing corpus tool available at http://malheildir.arnastofnun.is/.

<img src="docs/images/icealign.herokuapp.com_samhlida_fornritin_.png" alt="IceAlign Screenshot" align="middle"/>

IceAlign - Corpus Search Tool Screenshot

2 The Dataset

The data for the IPC can be obtained here: ParIce-1.0

More details about the content of the corpus are available at malfong.is.

The preprocessed data entries for version ParIce-1.0 can be found in the parice_dataset directory.

2.1 Preprocess

Skip this section if you want to use the dataset that is already provided. Follow these steps if you want to generate your own .txt files (the same as those under parice_dataset) from the .tmx files in the IPC archive ParIce-1.0.zip.

Step 1

Unzip the IPC dataset you received from malfong.is:

$ unzip ParIce-1.0.zip -d ParIce-1.0

Make sure the ParIce-1.0 directory is in the root of this project.

Step 2

Move our tmx2txt.sh script to the ParIce-1.0 directory.

~/icealign
$ mv scripts/tmx2txt.sh ParIce-1.0/

Step 3

Make the script executable and run it.

~/icealign
$ cd ParIce-1.0/

~/icealign/ParIce-1.0
$ chmod +x tmx2txt.sh

~/icealign/ParIce-1.0
$ bash tmx2txt.sh

The script extracts all the line entries from each .tmx file and then splits the larger files into smaller files of 50,000 line entries each. This keeps a Heroku Free or Hobby server from running out of memory and crashing while the Heroku database is being populated.
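The two operations described above can be sketched in Python. This is not the tmx2txt.sh script itself, only a minimal illustration of the idea under stated assumptions: that each TMX translation unit (`tu`) holds one `seg` per language, and that the language codes are `en` and `is`. The function names `extract_pairs` and `chunked` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import islice

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(tmx_text, src_lang="en", tgt_lang="is"):
    """Yield (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the two-letter prefix so "en-US" matches "en".
                segs[lang[:2]] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]


def chunked(items, size=50000):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Made-up, simplified sample; real ParIce files are far larger.
SAMPLE_TMX = """<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="is"><seg>Góðan daginn.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="is"><seg>Takk fyrir.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Goodbye.</seg></tuv>
      <tuv xml:lang="is"><seg>Bless.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = list(extract_pairs(SAMPLE_TMX))
chunks = list(chunked(pairs, size=2))
```

With `size=2`, the three sample pairs are split into one chunk of two and one chunk of one. The real script uses 50,000 entries per file for the same reason: each chunk can be loaded and inserted on its own, so memory use stays bounded.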

Step 4

Rename the generated text directory to parice_dataset and move it to the root of this project.

~/icealign/ParIce-1.0
$ mv text parice_dataset

~/icealign/ParIce-1.0
$ mv parice_dataset ../

3 Setup

Make sure you have Python 3.7 installed locally. To push to Heroku, you'll need to install the Heroku CLI, as well as Postgres.

You will also need to set the environment variables SECRET_KEY and DEBUG_VALUE, both on your local machine and for your Heroku app. Read more here.
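For orientation, here is a minimal sketch of how a Django settings module might consume these two variables. The project's actual settings.py is not shown here, so treat the details as assumptions: the helper name `read_env_settings` is hypothetical, and so is the convention that debug mode is enabled only by the literal string `True`.

```python
import os


def read_env_settings(environ=os.environ):
    """Return the two settings derived from environment variables."""
    secret_key = environ.get("SECRET_KEY")
    if not secret_key:
        raise RuntimeError("SECRET_KEY is not set")
    # Environment values are strings, so compare against "True" explicitly;
    # anything else (including an unset variable) leaves debug mode off.
    debug = environ.get("DEBUG_VALUE", "False") == "True"
    return {"SECRET_KEY": secret_key, "DEBUG": debug}
```

Failing fast on a missing SECRET_KEY makes a misconfigured machine or Heroku app obvious at startup rather than at the first request.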

~/icealign/
$ python -m venv getting-started
$ source getting-started/bin/activate  # on Windows: getting-started\Scripts\activate
$ pip install -r requirements.txt

$ createdb icealign

$ python manage.py migrate
$ python manage.py collectstatic

$ heroku local

The IceAlign app should now be running on localhost:5000.

To populate the Postgres database with the line entries in the parice_dataset directory, run:

~/icealign/
$ mv scripts/populate_samhlida.py ./
$ python populate_samhlida.py
Populating Samhlida database!

This might take some time to run.
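To illustrate why the dataset is stored as many small files, the sketch below loads a directory of .txt files in fixed-size batches. It is not populate_samhlida.py: it uses the standard-library `sqlite3` module as a stand-in for Postgres and the Django ORM so that it runs anywhere. The table name `entry`, its columns, and the tab-separated line layout are all assumptions.

```python
import sqlite3
from pathlib import Path


def populate(conn, dataset_dir, batch_size=1000):
    """Insert every tab-separated line under dataset_dir; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entry (source TEXT, target TEXT, origin TEXT)"
    )
    total = 0
    for path in sorted(Path(dataset_dir).rglob("*.txt")):
        batch = []
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                parts = line.rstrip("\n").split("\t")
                if len(parts) != 2:
                    continue  # skip lines that do not match the assumed layout
                batch.append((parts[0], parts[1], path.name))
                if len(batch) >= batch_size:
                    conn.executemany("INSERT INTO entry VALUES (?, ?, ?)", batch)
                    total += len(batch)
                    batch = []
        # Flush whatever is left over from this file.
        if batch:
            conn.executemany("INSERT INTO entry VALUES (?, ?, ?)", batch)
            total += len(batch)
    conn.commit()
    return total
```

A call such as `populate(sqlite3.connect(":memory:"), "parice_dataset")` would load the whole directory. Only one batch is held in memory at a time, which is the property that matters on a memory-constrained server.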

4 Authors

Egill Anton Hlöðversson - MSc. Language Technology Student

5 License

This project is licensed under the MIT License - see the LICENSE file for details.

6 References

Malfong.is

🌟 PLEASE STAR THIS REPO IF YOU FOUND SOMETHING INTERESTING 🌟
