A Cantonese-English parallel corpus extracted from words.hk

Overview

Words.hk Cantonese-English Parallel Corpus

Design

TODO

Project Structure

all (41859) -> minus15 (29487)
            |
            -> plus15 -> train (9372)
                      |
                      -> dev (1500)
                      |
                      -> test (1500)
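
The partition in the diagram can be sketched in a few lines of Python. This is only an illustration, not the code in `split_15.py` or `split_train_dev_test.py`: the function name `split_corpus`, the random seed, and the reading of "15" as a character-length threshold on the Cantonese side are all assumptions made here for the example.

```python
import random


def split_corpus(pairs, threshold=15, dev_size=1500, test_size=1500, seed=42):
    """Partition (cantonese, english) pairs following the diagram.

    Assumption: "15" is a length threshold in characters on the
    Cantonese sentence; the real scripts may use a different criterion.
    """
    # Pairs below the threshold form minus15 and are not split further.
    minus15 = [p for p in pairs if len(p[0]) < threshold]
    plus15 = [p for p in pairs if len(p[0]) >= threshold]

    # Shuffle a copy with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    shuffled = plus15[:]
    rng.shuffle(shuffled)

    # Fixed-size dev and test sets; everything else goes to train.
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return {"minus15": minus15, "train": train, "dev": dev, "test": test}
```

The counts in the diagram are consistent with such a partition: 29487 + 9372 + 1500 + 1500 = 41859.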

Build

Download the latest version of words.hk data from the download page. Then run:

gzip -d all-*.csv.gz
python extract.py
python split_train_dev_test.py
python split_15.py
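
If the `gzip` command is not available (for example on Windows), the decompression step can be done with the Python standard library instead. The helper below is a hypothetical sketch and not part of this repository; the name `decompress_all` is an assumption. Unlike `gzip -d`, it leaves the original `.gz` files in place.

```python
import glob
import gzip
import os
import shutil


def decompress_all(directory="."):
    """Decompress every all-*.csv.gz file in `directory`.

    Returns the list of decompressed file paths. The compressed
    originals are kept, which differs from `gzip -d`.
    """
    outputs = []
    for src in sorted(glob.glob(os.path.join(directory, "all-*.csv.gz"))):
        dst = src[:-len(".gz")]  # strip the .gz suffix
        with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        outputs.append(dst)
    return outputs
```

Calling `decompress_all()` in the directory that holds the downloaded file takes the place of the first command; the three Python scripts are then run as listed above.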

Special Credits

Owner
Ayaka