REMERGE - Multi-Word Expression discovery algorithm

Overview

REMERGE is a Multi-Word Expression (MWE) discovery algorithm that began as a re-implementation and simplification of a similar algorithm called MERGE, detailed in a publication and PhD thesis [1][2]. The primary benefit of this algorithm is that it is non-parametric with respect to the size of the n-grams that constitute a MWE: you do not need to specify a priori how many tokens comprise a MWE, only the number of iterations you want the algorithm to run.

The code was originally derived from an existing implementation by the original author [3], which I reviewed, converted from Python 2 to 3, then modified and updated with the following:

  • a correction of the log-likelihood calculation; previously it was not using the correct values for the contingency table
  • the removal of gapsize / discontinuous bigrams (see below for issues with the prior implementation)
  • an overall reduction in codebase size and complexity
    • ~60% reduction in lines of code
    • removed pandas and nltk dependencies
  • type annotations
  • the inclusion of additional metrics (frequency, NPMI [4]) for selecting the winning bigram.
  • corrections for merging sequential bigrams greedily and completely.
    • e.g. 'ya ya ya ya' -> '(ya ya) (ya ya)' -> '(ya ya ya ya)'. Previously the merge order was non-deterministic, and you could end up with 'ya (ya ya) ya'
  • An overall simplification of the algorithm.
    • As a tradeoff, this version may be less efficient. In the original implementation, after a bigram was merged into a single lexeme, new bigrams were added to and conflicting (old) bigrams subtracted from a mutable counter of bigrams. The counts in that object were difficult to track and validate, and produced errors in certain cases, so I removed this step. Instead, only the lexeme data is updated with the newly merged lexemes; we then track which lines contain the merged lexeme, build an update counter that subtracts the old bigrams from the new bigrams, and update the bigram data with that counter.
  • Clarified license with the original author and licensed as MIT.
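The greedy, complete merging of sequential bigrams mentioned above can be sketched as a stand-alone function. This is a simplified illustration, not the package's internal code, which operates on lexeme objects rather than plain strings and tuples:

```python
from typing import List, Tuple, Union

# a token is either a raw string or a tuple of strings (a merged lexeme)
Token = Union[str, Tuple[str, ...]]


def merge_bigram(line: List[Token], winner: Tuple[Token, Token]) -> List[Token]:
    """Replace every non-overlapping occurrence of `winner`, scanning left to right."""
    merged: List[Token] = []
    i = 0
    while i < len(line):
        if i + 1 < len(line) and (line[i], line[i + 1]) == winner:
            a, b = line[i], line[i + 1]
            # flatten both halves so repeated merges yield one tuple,
            # e.g. ('ya', 'ya') + ('ya', 'ya') -> ('ya', 'ya', 'ya', 'ya')
            left = a if isinstance(a, tuple) else (a,)
            right = b if isinstance(b, tuple) else (b,)
            merged.append(left + right)
            i += 2  # skip past both merged tokens
        else:
            merged.append(line[i])
            i += 1
    return merged


line = ["ya", "ya", "ya", "ya"]
line = merge_bigram(line, ("ya", "ya"))
# -> [('ya', 'ya'), ('ya', 'ya')], deterministically, never ['ya', ('ya', 'ya'), 'ya']
line = merge_bigram(line, (("ya", "ya"), ("ya", "ya")))
# -> [('ya', 'ya', 'ya', 'ya')]
```

Scanning strictly left to right with a two-token skip is what makes the result deterministic for runs of repeated tokens.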

Usage

import remerge

corpus = [
    ["a", "list", "of", "already", "tokenized", "texts"],
    ["where", "each", "item", "is", "a", "list", "of", "tokens"],
    ["isn't", "a", "list", "nice"]
]

winners = remerge.run(
    corpus, iterations=1, method=remerge.SelectionMethod.frequency, progress_bar="all"
)
# winners[0].merged_lexeme.word == ('a', 'list')

There are 3 bigram winner selection methods: log-likelihood (G²) [5], NPMI [4], and raw frequency. They are available under the SelectionMethod enum. The default is log-likelihood, which was used in the original implementation.

If using NPMI (SelectionMethod.npmi), you likely want to provide a min_count parameter, "as infrequent word pairs tend to dominate the top of bigramme lists that are ranked after PMI" [4] (p. 44).

winners = remerge.run(corpus, 100, method=remerge.SelectionMethod.npmi, min_count=25)
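For reference, the two association measures can be sketched from a bigram's counts. This is my own reconstruction with assumed function names, not remerge's internals: G² is computed over the bigram's 2x2 contingency table, and NPMI normalizes PMI by -log p(x, y), which bounds it in [-1, 1]:

```python
import math


def g2(o11: int, o12: int, o21: int, o22: int) -> float:
    """Log-likelihood ratio (G^2) over a 2x2 contingency table.

    o11 = count(w1, w2); o12 = count(w1, not w2);
    o21 = count(not w1, w2); o22 = count(not w1, not w2).
    """
    n = o11 + o12 + o21 + o22
    total = 0.0
    for obs, row, col in [
        (o11, o11 + o12, o11 + o21),
        (o12, o11 + o12, o12 + o22),
        (o21, o21 + o22, o11 + o21),
        (o22, o21 + o22, o12 + o22),
    ]:
        expected = row * col / n
        if obs > 0:  # 0 * log(0) is taken as 0
            total += obs * math.log(obs / expected)
    return 2.0 * total


def npmi(count_xy: int, count_x: int, count_y: int, n: int) -> float:
    """Normalized PMI: 1.0 means the pair always co-occurs, 0 means independence."""
    p_xy = count_xy / n
    p_x, p_y = count_x / n, count_y / n
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)
```

Note that G² equals 0 when the observed table matches independence exactly, while NPMI hits 1.0 only for pairs that never occur apart, which is why infrequent-but-exclusive pairs dominate NPMI rankings without a min_count floor.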

API - remerge.run

  • corpus (List[List[str]]): A corpus of already tokenized texts.
  • iterations (int): The number of iterations to run the algorithm. Papers typically use >500.
  • method (SelectionMethod, optional): One of "frequency", "log_likelihood", or "npmi". Defaults to "log_likelihood".
  • min_count (int, optional): The minimum count required for a bigram to be included in the winner calculations. If choosing NPMI ("npmi") as the selection method, prefer setting min_count, because this measure is biased towards infrequent word pairs. Defaults to 0.
  • output (Optional[Path], optional): A file path to output the winning merged lexemes as JSON. Defaults to None.
  • progress_bar (ProgressBarOptions, optional): Verbosity of the progress bar. "all" displays the lexeme and bigram construction progress each iteration plus total iteration progress; "iterations" displays progress on the total number of iterations; "none" has no output. Defaults to "iterations".

Install

Latest release:

pip install -U remerge-mwe

For latest from github:

pip install git+https://github.com/pmbaumgartner/remerge-mwe.git 

How it works

The algorithm operates iteratively in two stages: first, it collects all bigrams of co-occurring lexemes in the corpus; then a measure is calculated over the set of all bigrams to determine a winner. The two lexemes that comprise the winning bigram are merged into a single lexeme, and instances of that bigram (lexeme pair) in the corpus are replaced with the merged lexeme. Outdated bigrams, i.e. those that no longer exist because one of their elements is now part of a merged lexeme, are subtracted from the bigram data; new bigrams, i.e. those where one element is now a merged lexeme, are added. With this new set of bigram data, the process repeats and a new winner is selected.
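The two stages can be sketched in a few lines. This is a minimal frequency-based iteration over plain string tokens, not the package's internals, which track lexeme objects and update bigram counts incrementally rather than recounting:

```python
from collections import Counter

corpus = [["you", "know", "i", "mean"], ["you", "know", "it"]]

# stage 1: collect all bigrams of adjacent lexemes
bigrams = Counter(
    (line[i], line[i + 1]) for line in corpus for i in range(len(line) - 1)
)

# stage 2: select the winner by the chosen measure (raw frequency here)
winner = max(bigrams, key=bigrams.get)  # ('you', 'know')

# replace occurrences of the winner with a single merged lexeme
merged_corpus = []
for line in corpus:
    out, i = [], 0
    while i < len(line):
        if tuple(line[i : i + 2]) == winner:
            out.append(winner)  # the merged lexeme, represented as a tuple
            i += 2
        else:
            out.append(line[i])
            i += 1
    merged_corpus.append(out)
# merged_corpus == [[('you', 'know'), 'i', 'mean'], [('you', 'know'), 'it']]
```

On the next iteration the merged tuple participates in bigrams like any other lexeme, which is how MWEs longer than two tokens emerge without a size parameter.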

At initialization, a lexeme consists of only a single token, but as the algorithm iterates lexemes become multi-word expressions formed from the winning bigrams. Lexemes contain two parts: a word which is a tuple of strings, and an index which represents the position of that specific token in a MWE. For example, if the winning bigram is (you, know), occurrences of that sequence of lexemes will be replaced with [(you, know), 0] and [(you, know), 1] in the corpus. When bigrams are counted, only a root lexeme (where the index is 0) can form a bigram, so merged tokens don't get double counted. For a more visual explanation of a few iterations assuming specific winners, see the image below.
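That representation can be sketched as follows. The word field matches the usage example earlier; the class shape and root-only counting are my reconstruction of the description, not remerge's actual classes:

```python
from typing import NamedTuple, Tuple


class Lexeme(NamedTuple):
    word: Tuple[str, ...]  # the (possibly multi-token) expression
    index: int             # position of this token within the expression


# after (you, know) wins, the corpus positions become:
line = [Lexeme(("you", "know"), 0), Lexeme(("you", "know"), 1), Lexeme(("i",), 0)]

# only root lexemes (index == 0) can participate in bigram counting,
# so the merged pair is counted once rather than once per constituent token
roots = [lex for lex in line if lex.index == 0]
bigrams = list(zip(roots, roots[1:]))
# -> one bigram: (Lexeme(('you', 'know'), 0), Lexeme(('i',), 0))
```

Keeping one Lexeme per original token position (rather than collapsing the line) preserves token offsets while still preventing double counting.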

[Image: An explanation of the remerge algorithm]

Limitations

No tie-breaking logic. I found this while testing and comparing against the original reference implementation. If two bigrams are tied on the winning metric, there is no tie-breaking mechanism: both this implementation and the original simply pick the first bigram at the index with the maximum value. The two implementations construct the bigram statistics table slightly differently (i.e., the ordering of the index), which makes direct comparisons between them difficult.
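A tiny demonstration of why the winner can depend on how the table was built when counts tie: max() returns the first maximal key in iteration order, and for a Counter that is insertion order.

```python
from collections import Counter

# same counts, different insertion order -> different "winner"
counts = Counter({("a", "b"): 5, ("c", "d"): 5})
counts_reversed = Counter({("c", "d"): 5, ("a", "b"): 5})

assert max(counts, key=counts.get) == ("a", "b")
assert max(counts_reversed, key=counts_reversed.get) == ("c", "d")
```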

Issues with Original Algorithm

Single Bigrams with discontinuities forming from distinct Lexeme positions

One issue with discontinuities or gaps in the original algorithm is that it did not distinguish whether a satellite lexeme occurred within the gap of a discontinuous bigram or outside of it.

Take for example these two sentences, using - to represent an arbitrary token:

a b c -
a - c b

Assume that in a prior iteration the winning bigram was (a, _, c): the token a, a gap of size 1, then the token c. Run on the above corpus, the past algorithm would count the token b twice towards the same n-gram (a, b, c), despite there being two distinct n-grams represented here: (a, b, c) and (a, _, c, b).

I think the algorithm was counting on it being very rare in a realistic corpus for the same word to appear both within the gap and after it; that it happens here is more an artifact of this specific example's unrealistically small vocabulary.
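The miscount can be reproduced with a naive, position-agnostic satellite count. This is my own reconstruction of the described behavior, not the original code:

```python
from collections import Counter

lines = [["a", "b", "c", "-"], ["a", "-", "c", "b"]]

counts = Counter()
for line in lines:
    # the previously merged discontinuous bigram (a, _, c), gapsize 1
    if line[0] == "a" and line[2] == "c":
        # naive satellite counting: a neighboring token extends the n-gram
        # without recording whether it sits inside the gap or after 'c'
        satellites = [tok for tok in (line[1], line[3]) if tok != "-"]
        for tok in satellites:
            counts[("a", tok, "c")] += 1

# two distinct patterns, (a, b, c) and (a, _, c, b), fold into one count
assert counts[("a", "b", "c")] == 2
```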

References

Footnotes

  1. A. Wahl and S. Th. Gries, “Multi-word Expressions: A Novel Computational Approach to Their Bottom-Up Statistical Extraction,” in Lexical Collocation Analysis, P. Cantos-Gómez and M. Almela-Sánchez, Eds. Cham: Springer International Publishing, 2018, pp. 85–109. doi: 10.1007/978-3-319-92582-0_5.

  2. A. Wahl, “The Distributional Learning of Multi-Word Expressions: A Computational Approach,” PhD thesis, p. 190.

  3. awahl1, MERGE. 2017. Accessed: Jul. 11, 2022. [Online]. Available: https://github.com/awahl1/MERGE

  4. G. Bouma, “Normalized (Pointwise) Mutual Information in Collocation Extraction,” p. 11.

  5. T. Dunning, “Accurate Methods for the Statistics of Surprise and Coincidence,” Computational Linguistics, vol. 19, no. 1, pp. 61–74, 1993.
