Regular Expressions for Syllabaries

Daniel Yacob

The Ge'ez Frontier Foundation

Overview

Review of Syllabaries

English as a Syllabary

  a e i o u
b (ba) (be) (bi) (bo) (bu)
c (ca) (ce) (ci) (co) (cu)
d (da) (de) (di) (do) (du)
f (fa) (fe) (fi) (fo) (fu)
g (ga) (ge) (gi) (go) (gu)
h (ha) (he) (hi) (ho) (hu)
  • 21 consonants x 5 vowels = 105 symbols.
  • More for European languages.
  • Not every CV intersection necessarily occurs.
  • The sound of a C or V may change between languages.
  • There may be more than one symbol per syllable.
  • The columns may have names.
  • Likewise the rows.
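
The grid above can be sketched in a few lines of Python. This is only an illustration of the row/column idea: the names `row_pattern` and `column_pattern` are made up for this sketch and do not come from any library.

```python
import re
import string

VOWELS = "aeiou"
# The 21 non-vowel letters form the rows of the toy English syllabary.
CONSONANTS = "".join(c for c in string.ascii_lowercase if c not in VOWELS)

def row_pattern(consonant):
    """All syllables in one row, e.g. every 'b' syllable."""
    return re.compile(consonant + "[" + VOWELS + "]")

def column_pattern(vowel):
    """All syllables in one column, e.g. every syllable ending in 'a'."""
    return re.compile("[" + CONSONANTS + "]" + vowel)

# Full grid: one list of five syllables per consonant.
grid = {c: [c + v for v in VOWELS] for c in CONSONANTS}
print(sum(len(row) for row in grid.values()))  # 105
print(row_pattern("b").findall("ba be bi bo bu ca"))
print(column_pattern("a").findall("ba be bi bo bu ca"))
```

Selecting a row or a column is the operation the rest of the talk wants from a regular expression class.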

Why Regex for Syllabaries?

For example...

We don't strictly need class expressions for digits, but they make writing REs easier.
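
A minimal sketch of the digit analogy, using Python's `re` module rather than the Perl used later in the talk. The sample text is borrowed from the earlier slide.

```python
import re

# Spelled-out class versus shorthand; with re.ASCII both cover the same ten characters.
spelled_out = re.compile(r"[0123456789]+")
shorthand = re.compile(r"\d+", re.ASCII)

text = "21 x 5 = 105 symbols"
print(spelled_out.findall(text))
print(shorthand.findall(text))
```

The two patterns find the same matches; the shorthand is simply easier to write and to read, which is the argument for giving syllabaries class expressions of their own.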

Amharic Use Cases

The Matrix

  [#1#] [#2#] [#3#] [#4#] [#5#] [#6#] [#7#]
[#ሀ#]
[#ለ#]
[#ሐ#]
[#መ#]
[#ሠ#]
[#ረ#]
[#ሰ#]
[#ሸ#]
[#ቀ#]
[#በ#]
  [#1#] ≡ [:ግዕዝ:]
  [#2#] ≡ [:ካዕብ:]
  [#3#] ≡ [:ሣልስ:]
  [#4#] ≡ [:ራዕብ:]
  [#5#] ≡ [:ኃምስ:]
  [#6#] ≡ [:ሳድስ:]
  [#7#] ≡ [:ሳብዕ:]
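
One way to picture a column class is to build it from code points. The sketch below assumes the regular Unicode layout, where the seven orders of a consonant family occupy consecutive code points, and covers only the ten families shown as row labels. `order_class` is a hypothetical helper, not part of Regexp::Ethiopic.

```python
import re

# Row labels of the matrix above: first-order forms of ten consonant families.
BASES = "ሀለሐመሠረሰሸቀበ"
# Traditional order names, in column order 1-7.
ORDER_NAMES = ["ግዕዝ", "ካዕብ", "ሣልስ", "ራዕብ", "ኃምስ", "ሳድስ", "ሳብዕ"]

def order_class(order, bases=BASES):
    """Bracket expression for one column (order 1-7, or its name) of the matrix."""
    if isinstance(order, str):
        order = ORDER_NAMES.index(order) + 1
    if not 1 <= order <= 7:
        raise ValueError("order must be between 1 and 7")
    # Assumption: the seven orders of a family are consecutive code points.
    return "[" + "".join(chr(ord(b) + order - 1) for b in bases) + "]"

print(order_class(1))
print(order_class("ሳብዕ"))
print(bool(re.fullmatch(order_class(6), "ም")))
```

A numeric order and its traditional name select the same column, mirroring the equivalences listed above.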

Find All The Houses

/ቤ[#ተ#]/

  • ቤተመንግሥት
  • ቤቱ
  • ቤታችን
  • ቤቴ
  • ቤት
  • ቤቶ
  • ቤቷ
/ቤ[#ተ#]/ vs /ቤ[ተ-ቷ]/

[#መ#] vs /[መ-ሟፙᎀ-ᎃ]/

[#ፈ#] vs /[ፈ-ፏፚᎈ-ᎋ]/

etc.
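
The comparisons above suggest how a row class could be expanded into an ordinary bracket expression. The sketch assumes each family fills eight consecutive code points and takes the out-of-row forms for መ and ፈ straight from the slide; `family_class` and `EXTRAS` are illustrative names only.

```python
import re

# Forms stored outside the regular eight-slot row, copied from the two
# comparisons on the slide; other families are assumed to have none here.
EXTRAS = {"መ": "ፙᎀ-ᎃ", "ፈ": "ፚᎈ-ᎋ"}

def family_class(base):
    """Bracket expression for every form of one consonant (one matrix row)."""
    last = chr(ord(base) + 7)  # assumption: eight consecutive code points per family
    return "[" + base + "-" + last + EXTRAS.get(base, "") + "]"

houses = re.compile("ቤ" + family_class("ተ"))
words = ["ቤተመንግሥት", "ቤቱ", "ቤታችን", "ቤቴ", "ቤት", "ቤቶ", "ቤቷ"]
print(family_class("ተ"), all(houses.match(w) for w in words))
print(family_class("መ"))
```

The point of the row class is that the author of the pattern does not have to remember where the irregular forms live.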

Find All Plurals

/[#7#]ች/


vs /[ሆሎሖሞሦሮሶሾቆቦቮቶቾኆኖኞኦኮዖዞዦዮዶጆጎጦጮጶጾፆፎፖ]ች/

Negation: /[^#7#]/

Equivalent to [#1-6#], and not to [^ሆሎሖሞሦሮሶሾቆቦቮቶቾኆኖኞኦኮዖዞዦዮዶጆጎጦጮጶጾፆፎፖ].
The negation stays within the syllabic context: it matches only syllables of the other orders, not arbitrary characters.
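
The difference between the two negations can be sketched as follows. The seventh-order list is copied from the slide; the helper `orders_class` is hypothetical and assumes that the orders of a family are consecutive code points.

```python
import re

# Seventh-order syllables, copied from the explicit class on the slide.
SEVENTH = "ሆሎሖሞሦሮሶሾቆቦቮቶቾኆኖኞኦኮዖዞዦዮዶጆጎጦጮጶጾፆፎፖ"

def orders_class(orders):
    """Bracket expression for the given orders (1-7) of the families above."""
    # Assumption: order n of a family sits 7 - n code points before its seventh order.
    return "[" + "".join(chr(ord(s) - 7 + n) for s in SEVENTH for n in orders) + "]"

plural = re.compile(orders_class([7]) + "ች")
syllabic_negation = re.compile(orders_class(range(1, 7)))  # [^#7#] read as [#1-6#]
naive_negation = re.compile("[^" + SEVENTH + "]")          # any other character at all

for sample in ["ቤ", "ቶ", "a", "!"]:
    print(sample,
          bool(syllabic_negation.fullmatch(sample)),
          bool(naive_negation.fullmatch(sample)))
```

Only the syllabic negation rejects Latin letters and punctuation, which is the behaviour the slide asks for.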

Match Phonetic Misspellings

/[#3,6#]ያ/

Match Phonological Misspellings

/[ምን][#በፈ#]/
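
Both misspelling patterns can be approximated with plain bracket expressions. In this sketch the family list is derived from the seventh-order class shown on the plurals slide, the helpers `order_list` and `family_list` are hypothetical, and the sample words are illustrations chosen for this example rather than taken from the talk.

```python
import re

# Seventh-order syllables from the plurals slide; first orders sit six code points earlier.
SEVENTH = "ሆሎሖሞሦሮሶሾቆቦቮቶቾኆኖኞኦኮዖዞዦዮዶጆጎጦጮጶጾፆፎፖ"
BASES = "".join(chr(ord(s) - 6) for s in SEVENTH)

def order_list(orders, bases=BASES):
    """Bracket expression for several orders at once, e.g. [3, 6]."""
    return "[" + "".join(chr(ord(b) + n - 1) for b in bases for n in orders) + "]"

def family_list(bases):
    """Bracket expression for orders 1-7 of several families at once."""
    return order_list(range(1, 8), bases)

iya = re.compile(order_list([3, 6]) + "ያ")       # sketch of /[#3,6#]ያ/
nasal = re.compile("[ምን]" + family_list("በፈ"))   # sketch of /[ምን][#በፈ#]/

for word in ["ኢትዮጵያ", "ኢትዮጲያ"]:
    print(word, bool(iya.search(word)))
for word in ["አንበሳ", "አምበሳ"]:
    print(word, bool(nasal.search(word)))
```

Each pattern accepts both spellings of its word, so a search does not depend on which variant the writer chose.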

Restricted Order Matches

/[መ-ቀ]{#6#}/

/[መ-ቀ]{#4,6#}/
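
One plausible reading of the restricted order match is: take every family touched by the range, then keep only the requested orders. The sketch below implements that reading under the assumption that families start on code points divisible by eight; it is not a statement of how Regexp::Ethiopic resolves the expression, and `restrict` is a made-up name.

```python
import re

def restrict(first, last, orders):
    """Forms of the given orders for every family touched by the range first-last."""
    forms = []
    for cp in range(ord(first), ord(last) + 1):
        base = cp - cp % 8  # assumption: each family starts on a multiple of 8
        for n in orders:
            form = chr(base + n - 1)
            if form not in forms:
                forms.append(form)
    return "[" + "".join(forms) + "]"

print(restrict("መ", "ቀ", [6]))      # sketch of [መ-ቀ]{#6#}
print(restrict("መ", "ቀ", [4, 6]))   # sketch of [መ-ቀ]{#4,6#}
print(re.findall(restrict("መ", "ቀ", [6]), "ምስር"))
```

Under this reading the range names consonants and the braces name vowels, so the two axes of the matrix are specified independently.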

Symbol Redundancy / Phonemic Decay

The Ge'ez Syllabary

  ə u i a e ɨ o
h
l
ħ
m
s
r
ʃ

The Amharic Syllabary

  ə u i a e ɨ o
h - ሁ,ሑ,ኁ,ኹ ሂ,ሒ,ኂ,ኺ ሀ,ሃ,ሐ,ሓ,ኃ,ኻ ሄ,ሔ,ኄ,ኼ ህ,ሕ,ኅ,ኽ ሆ,ሖ,ኆ,ኾ
l
ħ * * * * * * *
m
s ሠ,ሰ ሡ,ሱ ሢ,ሲ ሣ,ሳ ሤ,ሴ ሥ,ስ ሦ,ሶ
r
ʃ

Homophonic Equivalence

  Amharic   Tigrinya
[=ሀ=] [ሀሃሐሓኀኃኻ]   [ሀሃኀኃ]
[=ሰ=] [ሰሠ]   [ሰሠ]
[=አ=] [አኣዐዓ]   [አኣ]
[=ጸ=] [ጸፀ]   [ጸፀ]
[=ጎ=] [ጎጐ]   [ጎጐ]
[=ኮ=] [ኮኰ]   [ኮኰ]
[=ቍ=] [ቁቍ]   [ቁቍ]
etc.  

[=ሀ=] [ሀሃሐሓኀኃኻ]
[=ሁ=] [ሁሑኁኹ]
[=ሂ=] [ሂሒኂኺ]
[=ሃ=] [ሀሃሐሓኀኃኻ]
[=ሄ=] [ሄሔኄኼ]
[=ህ=] [ህሕኅኽ]
[=ሆ=] [ሆሖኆኾ]
[=#ሀ#=] [ሀ-ሆሐ-ሗኀ-ኍኸ-ዅ]

[=ጎ=]ንዳር ጎንዳር & ጐንዳር
[=ቍ=]ጥር ቍጥር & ቁጥር
ታ[=ህ=][=ሳ=][=ስ=]
  • ታህሳስ & ታህሣስ & ታህሳሥ & ታህሣሥ &
  • ታሕሳስ & ታሕሣስ & ታሕሳሥ & ታሕሣሥ &
  • ታኅሳስ & ታኅሣስ & ታኅሳሥ & ታኅሣሥ &
  • ታኽሳስ & ታኽሣስ & ታኽሳሥ & ታኽሣሥ
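
The sixteen spellings above are the Cartesian product of the three equivalence classes (4 x 2 x 2). A small sketch, with the Amharic classes copied from the slides and the helpers `to_regex` and `spellings` invented for this example:

```python
import itertools
import re

# Amharic equivalence classes used in the example, copied from the slides.
EQUIV = {"ህ": "ህሕኅኽ", "ሳ": "ሳሣ", "ስ": "ስሥ"}
TOKEN = re.compile(r"\[=(.)=\]")

def to_regex(pattern):
    """Replace each [=x=] token with an ordinary bracket expression."""
    return TOKEN.sub(lambda m: "[" + EQUIV[m.group(1)] + "]", pattern)

def spellings(pattern):
    """Enumerate every spelling the pattern accepts."""
    choices, pos = [], 0
    for m in TOKEN.finditer(pattern):
        if m.start() > pos:
            choices.append([pattern[pos:m.start()]])  # literal text between tokens
        choices.append(list(EQUIV[m.group(1)]))
        pos = m.end()
    if pos < len(pattern):
        choices.append([pattern[pos:]])
    return ["".join(parts) for parts in itertools.product(*choices)]

variants = spellings("ታ[=ህ=][=ሳ=][=ስ=]")
print(len(variants))  # 16
print(to_regex("ታ[=ህ=][=ሳ=][=ስ=]"))
```

One short pattern stands in for all sixteen forms, which is what makes the equivalence class practical for searching real text.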

ዓለምፀሐይ Patterns

Amharic: /[=አ=]ለም[=ጸ=][=ሃ=]ይ/
Tigrinya: /[=ዓ=]ለም[=ጸ=][=ሓ=]ይ/
Ge'ez: /[=ዓ=]ለምፀ[=ሐ=]ይ/

Regexp::Ethiopic Usage & Limitations

#!/usr/bin/perl -w

use Regexp::Ethiopic::Amharic 'overload';
⋮
	if ( /[#መየበከ#]{#1,4#}\w+[=ሀ=]\w+?[#7#]ች/ ) {
	⋮
	}
⋮
Anonymous:   m/$x{#5#}/;
Substitutions:   s/[#7#]([#ከ#])/[#2#]$1/g;
Transliterations:   tr/[#1-3#]/[#4-6#]/;
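
Independently of how the module treats the transliteration line, the mapping it expresses (orders 1-3 onto orders 4-6) can be sketched as a translation table. The sketch assumes consecutive code points per family, covers only the ten families from the matrix slide, and uses a hypothetical helper `order_shift`.

```python
# Row labels from the matrix slide: first-order forms of ten consonant families.
BASES = "ሀለሐመሠረሰሸቀበ"

def order_shift(src_orders, dst_orders, bases=BASES):
    """Translation table mapping each source order to the paired target order."""
    table = {}
    for b in bases:
        for s, d in zip(src_orders, dst_orders):
            # Assumption: order n of a family is n - 1 code points after its base.
            table[ord(b) + s - 1] = chr(ord(b) + d - 1)
    return table

table = order_shift([1, 2, 3], [4, 5, 6])
print("በቡቢ".translate(table))
```

Characters outside the table, including syllables of other families, pass through unchanged.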

The Men Who Say "ꑳ"

  • Yi is a language, a syllabary, and a people.
  • The syllabary is used by between 2 and 5.5 million people in the Chinese provinces of Yunnan and Sichuan.
  • Once had over 8,000 characters, now standardized to 819.
  • Has 44 consonants... (x-axis)
  • 10 vowels... (y-axis)
  • ...and 4 tones (z-axis)!
  • 10 x 44 x 4 = 1760 positions
 

High Tone (T) [#:t#]
Middle High Tone (X) [#:x#]
Middle Low Tone [#:z#]
Low Tone (P) [#:p#]

  [#1#] [#2#] [#3#] [#4#] [#5#] [#6#] [#7#] [#8#] [#9#] [#10#]
[#ꀀ#] ꀀ
[#ꀖ#]
[#ꀸ#]
[#ꁖ#]

Match all ꀖ in the orders 2-8 and the high and middle high tones

[#ꀖ#]{#2-8#} matches [ꀱ]
[#ꀖ#]{#:tx#} matches []
[#ꀖ#]{#2-8#}{#:tx#} matches []
[#ꀖ#]{#2-8:tx#} matches []
[#ꀖ:tx#] invalid

Unicode 4.1, 5.0 and Beyond

Unicode Technical Standard #18

// Folds katakana and hiragana together
class KanaFolder implements RegExFolder {
  // from RegExFolder, must be overridden in subclasses
  String fold(String source) {...}

  // from RegExFolder, may be overridden for efficiency

  RegExFolder clone(String parameter, Locale locale) {...}
  int fold(int source) {...}
  UnicodeSet fold(UnicodeSet source) {...}
}
  ...

  RegExFolder.registerFolding("k_h", new KanaFolder());

  ...

  p = Pattern.compile("(\F{k_h=argument}マルク (\s)* ダ (ヸ | ビ) ス \E : \s+)*");
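
The folding idea can be imitated without engine support by folding both the pattern and the text before matching. This is a rough sketch, not the UTS #18 mechanism: it relies on the fixed 0x60 offset between katakana U+30A1..U+30F6 and the corresponding hiragana, and `fold_kana` and `folded_search` are names invented here.

```python
import re

def fold_kana(text):
    """Map katakana U+30A1..U+30F6 onto hiragana; leave everything else alone."""
    return "".join(
        chr(ord(c) - 0x60) if 0x30A1 <= ord(c) <= 0x30F6 else c
        for c in text
    )

def folded_search(pattern, text):
    """Search after folding both sides, so either script matches the other."""
    return re.search(fold_kana(pattern), fold_kana(text))

pattern = r"マルク\s*ダ(?:ヸ|ビ)ス"
print(bool(folded_search(pattern, "マルク ダビス")))
print(bool(folded_search(pattern, "まるく だびす")))
```

Folding the pattern text directly is only safe here because the pattern's metacharacters are ASCII; a real implementation would fold inside the regex engine, as the `\F{...}` syntax proposes.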

Encoding of Vai (284 Characters)

Where do we go from here?

~ · fini · ~