Regular Expressions for Syllabaries

Daniel Yacob

The Ge'ez Frontier Foundation

Overview

Review of Syllabaries
Why Regex for Syllabaries?
Amharic Use Cases
The Men Who Say "ꑳ"
Unicode 4.1, 5.0 and Beyond

Review of Syllabaries

A syllabary is a writing system whose symbols represent syllables.
A syllable is a CVC pattern.
Most syllabaries use "open syllables" - CV only.
We are concerned with (CV) syllabaries not (C)(V).
For example 'ካ' vs 'കാ'.
Syllabaries of Unicode are: Canadian, Cherokee, Ethiopic, Hiragana, Katakana & Yi.

Review of Syllabaries

English as a Syllabary

	a	e	i	o	u
b	(ba)	(be)	(bi)	(bo)	(bu)
c	(ca)	(ce)	(ci)	(co)	(cu)
d	(da)	(de)	(di)	(do)	(du)
f	(fa)	(fe)	(fi)	(fo)	(fu)
g	(ga)	(ge)	(gi)	(go)	(gu)
h	(ha)	(he)	(hi)	(ho)	(hu)
⋮	⋮	⋮	⋮	⋮	⋮

21 x 5 = 105 symbols.
More for European languages.
Not all CV intersections necessarily occur.
Sounds of a C or V may change between languages.
There may be more than one symbol per syllable.
The columns may have names.
Likewise the rows.
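
The "English as a Syllabary" grid can be generated mechanically. The sketch below is only an illustration of the 21 x 5 count; the names `CONSONANTS`, `VOWELS` and `syllable_grid` are made up for this example.

```python
import string

VOWELS = "aeiou"
# The 21 consonant letters: every lowercase ASCII letter that is not a vowel.
CONSONANTS = [c for c in string.ascii_lowercase if c not in VOWELS]

def syllable_grid():
    """Return {consonant: [CV syllables]}: rows are consonants, columns are vowels."""
    return {c: [c + v for v in VOWELS] for c in CONSONANTS}

grid = syllable_grid()
total = sum(len(row) for row in grid.values())
print(total)       # 105
print(grid["b"])   # ['ba', 'be', 'bi', 'bo', 'bu']
```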

Why Regex for Syllabaries?

Support for thinking in syllabaries.
More concise, easier to read and maintain syntax for syllabic patterns.
Express and match inherent properties of syllabaries.
-need a way to match the "C" or "V" part of a (CV)
e.g. how do you match the "can" of (ca)(na)(da) - "ᗺᘇᑕ"

Why Regex for Syllabaries?

For example...

We don't really need class expressions for digits, but it makes writing REs easier:

(0|1|2|3|4|5|6|7|8|9)
[0123456789]
[0-9]
[:digit:]
\p{IsDigit}
\d
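
A quick check that the shorter forms really are interchangeable, sketched with Python's `re` module. Python's `re` has no POSIX bracket classes or `\p{...}` properties, so only four of the six spellings are compared; the `PATTERNS` table and `digits_found` helper are names invented for this example.

```python
import re

# Four spellings of "one ASCII digit". re.ASCII keeps \d from also
# matching non-ASCII decimal digits such as Arabic-Indic numerals.
PATTERNS = {
    "alternation": r"(?:0|1|2|3|4|5|6|7|8|9)",
    "enumerated": r"[0123456789]",
    "range": r"[0-9]",
    "shorthand": r"\d",
}

def digits_found(pattern, text):
    """Return every single-digit match of pattern in text."""
    return re.findall(pattern, text, flags=re.ASCII)

sample = "Unicode 4.1, 5.0 and beyond"
results = {name: digits_found(p, sample) for name, p in PATTERNS.items()}
print(results["range"])   # ['4', '1', '5', '0']
```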

Amharic Use Cases

Rather than devise an abstract syllabary, let's use a real one.
Examples are easily found.
Quite possibly the worst-case scenario for a syllabary.
We have Regexp::Ethiopic::Amharic to work with.

Amharic Use Cases

The Matrix

	[#1#]	[#2#]	[#3#]	[#4#]	[#5#]	[#6#]	[#7#]
[#ሀ#]	ሀ	ሁ	ሂ	ሃ	ሄ	ህ	ሆ
[#ለ#]	ለ	ሉ	ሊ	ላ	ሌ	ል	ሎ
[#ሐ#]	ሐ	ሑ	ሒ	ሓ	ሔ	ሕ	ሖ
[#መ#]	መ	ሙ	ሚ	ማ	ሜ	ም	ሞ
[#ሠ#]	ሠ	ሡ	ሢ	ሣ	ሤ	ሥ	ሦ
[#ረ#]	ረ	ሩ	ሪ	ራ	ሬ	ር	ሮ
[#ሰ#]	ሰ	ሱ	ሲ	ሳ	ሴ	ስ	ሶ
[#ሸ#]	ሸ	ሹ	ሺ	ሻ	ሼ	ሽ	ሾ
[#ቀ#]	ቀ	ቁ	ቂ	ቃ	ቄ	ቅ	ቆ
[#በ#]	በ	ቡ	ቢ	ባ	ቤ	ብ	ቦ
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮

  [#1#] ≡ [:ግዕዝ:]
  [#2#] ≡ [:ካዕብ:]
  [#3#] ≡ [:ሣልስ:]
  [#4#] ≡ [:ራብዕ:]
  [#5#] ≡ [:ኃምስ:]
  [#6#] ≡ [:ሳድስ:]
  [#7#] ≡ [:ሳብዕ:]
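
For the rows shown in the matrix, row and column can be recovered from the code point alone. This is a sketch under the assumption that each basic family occupies eight consecutive Unicode code points starting at U+1200 (ሀ); it ignores labialized and extended forms, and the helper names are hypothetical.

```python
ETHIOPIC_BASE = 0x1200   # code point of ሀ, the first syllable in the matrix
SLOTS_PER_FAMILY = 8     # assumption: eight code points per basic family

def family_of(ch):
    """Return the first-order syllable that heads ch's row."""
    offset = ord(ch) - ETHIOPIC_BASE
    return chr(ETHIOPIC_BASE + (offset // SLOTS_PER_FAMILY) * SLOTS_PER_FAMILY)

def order_of(ch):
    """Return the 1-based column of ch (1 = first order ... 7 = seventh)."""
    return (ord(ch) - ETHIOPIC_BASE) % SLOTS_PER_FAMILY + 1

print(family_of("ቤ"), order_of("ቤ"))   # በ 5
print(family_of("ሽ"), order_of("ሽ"))   # ሸ 6
```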

Amharic Use Cases

Find All The Houses

/ቤ[#ተ#]/

ቤተመንግሥት
ቤቱን
ቤታችን
ቤቴ
ቤት
ቤቶች
ቤቷ

[#ተ#]

ተ

ቱ

ቲ

ታ

ቴ

ት

ቶ

ቷ

vs /ቤ[ተ-ቷ]/

[#መ#] vs /[መ-ሟፙᎀ-ᎃ]/

[#ፈ#] vs /[ፈ-ፏፚᎈ-ᎋ]/

etc.
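
One way to picture what `[#ተ#]` stands for is to expand it into an ordinary bracket class. The sketch below is not how Regexp::Ethiopic is implemented; `family_class` is a hypothetical helper that assumes the eight forms of a family are consecutive code points, so it would miss out-of-run forms such as ፙ and ᎀ-ᎃ in `[#መ#]`.

```python
import re

def family_class(head):
    """Expand a first-order syllable into a bracket class of its eight forms."""
    start = ord(head)
    return "[" + "".join(chr(start + i) for i in range(8)) + "]"

# Stand-in for /ቤ[#ተ#]/
house = re.compile("ቤ" + family_class("ተ"))

words = ["ቤተመንግሥት", "ቤቱን", "ቤታችን", "ቤቴ", "ቤት", "ቤቶች", "ቤቷ", "በር"]
matches = [w for w in words if house.search(w)]
print(family_class("ተ"))   # [ተቱቲታቴትቶቷ]
print(matches)             # every word except በር
```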

Amharic Use Cases

Find All Plurals

/[#7#]ች/

vs [ሆሎሖሞሦሮሶሾቆቦቮቶቾኆኖኞኦኮዖዞዦዮዶጆጎጦጮጶጾፆፎፖ]ች

Negation: /[^#7#]/

Equivalent to [#1-6#],
and not to [^ሆሎሖሞሦሮሶሾቆቦቮቶቾኆኖኞኦኮዖዞዦዮዶጆጎጦጮጶጾፆፎፖ].
The negation keeps the syllabic context: it matches only syllables of the other orders, never non-syllabic characters.
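
The difference between the two negations can be sketched as follows. The seventh-order list is copied from the slide; everything else (`order_class`, the offset arithmetic, ignoring eighth-order forms) is an assumption made for illustration.

```python
import re

# Seventh-order syllables, copied from the slide.
SEVENTH = "ሆሎሖሞሦሮሶሾቆቦቮቶቾኆኖኞኦኮዖዞዦዮዶጆጎጦጮጶጾፆፎፖ"

def order_class(orders, negate=False):
    """Bracket class for the given 1-based orders across the listed families.

    With negate=True the class holds the other orders (1-7) of the same
    families, so it still matches syllables only.
    """
    wanted = set(range(1, 8)) - set(orders) if negate else set(orders)
    # A seventh-order form sits 6 code points after its family head.
    chars = [chr(ord(s) - 7 + o) for s in SEVENTH for o in sorted(wanted)]
    return "[" + "".join(chars) + "]"

plural = re.compile(order_class([7]) + "ች")       # stand-in for /[#7#]ች/
not_seventh = re.compile(order_class([7], negate=True))   # stand-in for /[^#7#]/

print(bool(plural.search("ቤቶች")))            # True
print(bool(not_seventh.fullmatch("ታ")))      # True
print(bool(not_seventh.fullmatch("x")))      # False: not a syllable at all
```

A plain `[^...]` over the same list would accept `x`, which is exactly what the syllabic negation avoids.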

Amharic Use Cases

Match Phonetic Misspellings

/[#3,6#]ያ/

አንባቢያን & አንባብያን
ሚያዚያ & ሚያዝያ
ኢትዮጵያዊያን & ኢትዮጵያውያን

Amharic Use Cases

Match Phonological Misspellings

/[ምን][#በፈ#]/

ላምፋ & ላንፋ
ግምፎ & ግንፎ
አምፋር & አንፋር
ወምበር & ወንበር
ዝምብ & ዝንብ
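
A rough equivalent of `/[ምን][#በፈ#]/` in plain Python, checking that both members of each pair above are caught. `family_chars` is a hypothetical helper that assumes eight consecutive code points per family.

```python
import re

def family_chars(head):
    """Return the eight consecutive forms that start at a first-order syllable."""
    return "".join(chr(ord(head) + i) for i in range(8))

# ም or ን followed by any form of በ or ፈ.
nasal_labial = re.compile("[ምን][" + family_chars("በ") + family_chars("ፈ") + "]")

pairs = [("ላምፋ", "ላንፋ"), ("ግምፎ", "ግንፎ"), ("አምፋር", "አንፋር"),
         ("ወምበር", "ወንበር"), ("ዝምብ", "ዝንብ")]
both_match = [bool(nasal_labial.search(a) and nasal_labial.search(b))
              for a, b in pairs]
print(both_match)   # [True, True, True, True, True]
```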

Amharic Use Cases

Restricted Order Matches

/[መ-ቀ]{#6#}/

ላምፋ
ቅኔ
ሳድስ

/[መ-ቀ]{#4,6#}/

መሬት
ደቂቃ
ሳድስ
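
The single-order pattern `/[መ-ቀ]{#6#}/` reads `[መ-ቀ]` as a range of rows, not of code points. A minimal sketch of that reading, assuming rows are eight code points apart (`rows_with_order` is an invented name):

```python
import re

def rows_with_order(first, last, order):
    """Class holding the given 1-based order of every row from first to last."""
    heads = range(ord(first), ord(last) + 1, 8)
    return "[" + "".join(chr(h + order - 1) for h in heads) + "]"

sixth = rows_with_order("መ", "ቀ", 6)
print(sixth)   # [ምሥርስሽቅ]

for word in ["ላምፋ", "ቅኔ", "ሳድስ"]:
    print(word, bool(re.search(sixth, word)))   # all True
```

A plain code point range behaves differently: `[መ-ቀ]` as an ordinary class stops at ቀ itself and so does not contain ቅ.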

Amharic Use Cases

Symbol Redundancy / Phonemic Decay

The Ge'ez Syllabary

	ə	u	i	a	e	ɨ	o
h	ሀ	ሁ	ሂ	ሃ	ሄ	ህ	ሆ
l	ለ	ሉ	ሊ	ላ	ሌ	ል	ሎ
ħ	ሐ	ሑ	ሒ	ሓ	ሔ	ሕ	ሖ
m	መ	ሙ	ሚ	ማ	ሜ	ም	ሞ
s	ሠ	ሡ	ሢ	ሣ	ሤ	ሥ	ሦ
r	ረ	ሩ	ሪ	ራ	ሬ	ር	ሮ
ʃ	ሰ	ሱ	ሲ	ሳ	ሴ	ስ	ሶ
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮

Amharic Use Cases

Symbol Redundancy / Phonemic Decay

The Amharic Syllabary

	ə	u	i	a	e	ɨ	o
h	ኸ	ሁ,ሑ,ኁ,ኹ	ሂ,ሒ,ኂ,ኺ	ሀ,ሃ,ሐ,ሓ,ኃ,ኻ	ሄ,ሔ,ኄ,ኼ	ህ,ሕ,ኅ,ኽ	ሆ,ሖ,ኆ,ኾ
l	ለ	ሉ	ሊ	ላ	ሌ	ል	ሎ
ħ	*	*	*	*	*	*	*
m	መ	ሙ	ሚ	ማ	ሜ	ም	ሞ
s	ሠ,ሰ	ሡ,ሱ	ሢ,ሲ	ሣ,ሳ	ሤ,ሴ	ሥ,ስ	ሦ,ሶ
r	ረ	ሩ	ሪ	ራ	ሬ	ር	ሮ
ʃ	ሸ	ሹ	ሺ	ሻ	ሼ	ሽ	ሾ
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮

Amharic Use Cases

Homophonic Equivalence

	Amharic	Tigrinya
[=ሀ=]	[ሀሃሐሓኀኃኻ]	[ሀሃኀኃ]
[=ሰ=]	[ሰሠ]	[ሰሠ]
[=አ=]	[አኣዐዓ]	[አኣ]
[=ጸ=]	[ጸፀ]	[ጸፀ]
[=ጎ=]	[ጎጐ]	[ጎጐ]
[=ኮ=]	[ኮኰ]	[ኮኰ]
[=ቍ=]	[ቁቍ]	[ቁቍ]
etc.	⋮	⋮

Amharic Use Cases

Homophonic Equivalence

[=ሀ=]	[ሀሃሐሓኀኃኻ]
[=ሁ=]	[ሁሑኁኹ]
[=ሂ=]	[ሂሒኂኺ]
[=ሃ=]	[ሀሃሐሓኀኃኻ]
[=ሄ=]	[ሄሔኄኼ]
[=ህ=]	[ህሕኅኽ]
[=ሆ=]	[ሆሖኆኾ]
[=#ሀ#=]	[ሀ-ሆሐ-ሗኀ-ኍኸ-ዅ]

Amharic Use Cases

Homophonic Equivalence

[=ጎ=]ንዳር	⟹	ጎንዳር & ጐንዳር
[=ቍ=]ጥር	⟹	ቍጥር & ቁጥር
ታ[=ህ=][=ሳ=][=ስ=]	⟹	ታህሳስ & ታህሣስ & ታህሳሥ & ታህሣሥ & ታሕሳስ & ታሕሣስ & ታሕሳሥ & ታሕሣሥ & ታኅሳስ & ታኅሣስ & ታኅሳሥ & ታኅሣሥ & ታኽሳስ & ታኽሣስ & ታኽሳሥ & ታኽሣሥ
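
The 16 spellings of ታህሳስ are simply the Cartesian product of the three equivalence classes (4 x 2 x 2). A sketch, with the classes taken from the Amharic tables above and a hypothetical `spellings` helper:

```python
from itertools import product

# Amharic equivalence classes needed for ታ[=ህ=][=ሳ=][=ስ=].
EQUIV = {"ህ": "ህሕኅኽ", "ሳ": "ሳሣ", "ስ": "ስሥ"}

def spellings(parts):
    """Expand literals and ('=', key) tokens into every possible spelling."""
    choices = [EQUIV[p[1]] if isinstance(p, tuple) else [p] for p in parts]
    return ["".join(combo) for combo in product(*choices)]

tahsas = spellings(["ታ", ("=", "ህ"), ("=", "ሳ"), ("=", "ስ")])
print(len(tahsas))   # 16
print(tahsas[0])     # ታህሳስ
```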

Amharic Use Cases

Homophonic Equivalence

ዓለምፀሐይ Patterns

Amharic has 56 Possible Spellings
-only 18 are probable: ዓለምፀሐይ ዓለምጸሐይ ዓለምጸሃይ ዓለምፀሃይ ዓለምፀሀይ ዓለምጸሀይ አለምፀሐይ አለምጸሐይ አለምጸሃይ አለምፀሃይ አለምፀሀይ አለምጸሀይ ዐለምፀሐይ ዐለምጸሐይ ዐለምጸሃይ ዐለምፀሃይ ዐለምፀሀይ ዐለምጸሀይ
Tigrinya has 8 Possible Spellings
-only 4 are probable: ዓለምፀሐይ ዓለምጸሐይ ዓለምጸሓይ ዓለምፀሓይ
Ge'ez has 4 Possible Spellings
-only 2 are probable: ዓለምፀሐይ ዐለምፀሐይ

Amharic Use Cases

Homophonic Equivalence

ዓለምፀሐይ Patterns

Amharic:		/[=አ=]ለም[=ጸ=][=ሃ=]ይ/
Tigrinya:		/[=ዓ=]ለም[=ጸ=][=ሓ=]ይ/
Ge'ez:		/[=ዓ=]ለምፀ[=ሐ=]ይ/
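
The Amharic pattern can be checked against the count of 56 possible spellings (4 x 2 x 7). This sketch rewrites each `[=X=]` token as an ordinary class using the Amharic classes listed on the Homophonic Equivalence slides; `expand` is a hypothetical helper, not part of Regexp::Ethiopic.

```python
import re
from itertools import product

# Amharic classes as listed on the Homophonic Equivalence slides.
AMHARIC = {"አ": "አኣዐዓ", "ጸ": "ጸፀ", "ሃ": "ሀሃሐሓኀኃኻ"}

def expand(pattern, classes):
    """Rewrite each [=X=] token as an ordinary bracket class."""
    return re.sub(r"\[=(.)=\]", lambda m: "[" + classes[m.group(1)] + "]", pattern)

amharic_re = re.compile(expand("[=አ=]ለም[=ጸ=][=ሃ=]ይ", AMHARIC))

possible = [a + "ለም" + ts + h + "ይ"
            for a, ts, h in product(AMHARIC["አ"], AMHARIC["ጸ"], AMHARIC["ሃ"])]
print(len(possible))                                       # 56
print(all(amharic_re.fullmatch(w) for w in possible))      # True
```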

Regexp::Ethiopic Usage & Limitations

#!/usr/bin/perl -w

use Regexp::Ethiopic::Amharic 'overload';
⋮
	if ( /[#መየበከ#]{#1,4#}\w+[=ሀ=]\w+?[#7#]ች/ ) {
	⋮
	}
⋮

Anonymous:		`m/$x{#5#}/;`
Substitutions:		`s/[#7#]([#ከ#])/[#2#]$1/g;`
Transliterations:		`tr/[#1-3#]/[#4-6#]/;`

The Men Who Say "ꑳ"

Yi is a language, syllabary and people.
The syllabary is used by between 2 and 5.5 million people in the Chinese provinces of Yunnan and Sichuan.
Once had over 8,000 characters, now standardized to 819.
Has 44 consonants... (x-axis)
10 vowels... (y-axis)
...and 4 tones (z-axis)!
10 x 44 x 4 = 1760 positions

The Men Who Say "ꑳ"

High Tone (T)	[#:t#]
Middle High Tone (X)	[#:x#]
Middle Low Tone	[#:z#]
Low Tone (P)	[#:p#]

	[#1#]	[#2#]	[#3#]	[#4#]	[#5#]	[#6#]	[#7#]	[#8#]	[#9#]	[#10#]
[#ꀀ#]	ꀀ ꀁ ꀂ ꀃ	ꀄ ꀅ ꀆ ꀇ	ꀈ ꀉ ꀊ ꀋ	ꀌ ꀍ ꀎ	ꀏ ꀐ ꀑ ꀒ	ꀓ ꀔ
[#ꀖ#]	ꀖ ꀗ ꀘ ꀙ	ꀚ ꀛ ꀜ ꀝ	ꀞ ꀟ ꀠ ꀡ	ꀢ ꀣ ꀤ	ꀥ ꀦ ꀧ ꀨ	ꀩ ꀪ ꀫ	ꀬ ꀭ ꀮ ꀯ	ꀰ ꀱ	ꀲ ꀳ ꀴ ꀵ	ꀶ ꀷ
[#ꀸ#]	ꀸ ꀹ ꀺ ꀻ	ꀼ ꀽ ꀾ	ꀿ ꁀ ꁁ ꁂ	ꁃ ꁄ ꁅ	ꁆ ꁇ ꁈ ꁉ	ꁊ ꁋ ꁌ ꁍ	ꁎ ꁏ	ꁐ ꁑ ꁒ ꁓ	ꁔ ꁕ
[#ꁖ#]	ꁖ ꁗ ꁘ ꁙ	ꁚ ꁛ ꁜ ꁝ	ꁞ ꁟ ꁠ ꁡ	ꁢ ꁣ ꁤ	ꁥ ꁦ ꁧ ꁨ	ꁩ ꁪ ꁫ	ꁬ ꁭ ꁮ ꁯ	ꁰ ꁱ	ꁲ ꁳ ꁴ ꁵ
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮

The Men Who Say "ꑳ"

Match all ꀖ in the orders 2-8 and the high and middle high tones

[#ꀖ#]{#2-8#}	matches	[ꀚꀛꀜꀝꀞꀟꀠꀡꀢꀣꀤꀥꀦꀧꀨꀩꀪꀫꀬꀭꀮꀯꀰꀱ]
[#ꀖ#]{#:tx#}	matches	[ꀖꀙꀚꀝꀞꀡꀤꀥꀨꀫꀬꀯꀲꀵ]
~~[#ꀖ#]{#2-8#}{#:tx#}~~	~~matches~~	~~[ꀚꀝꀞꀡꀤꀥꀨꀫꀬꀯ]~~
[#ꀖ#]{#2-8:tx#}	matches	[ꀚꀝꀞꀡꀤꀥꀨꀫꀬꀯ]
[#ꀖ:tx#]	invalid
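
The combined selector `{#2-8:tx#}` behaves like the intersection of the order selection and the tone selection. The sketch below only intersects the two classes copied from the slide; it derives nothing from Unicode data, and `combined_class` is an invented name.

```python
# Both classes are copied from the slide.
ORDERS_2_TO_8 = set("ꀚꀛꀜꀝꀞꀟꀠꀡꀢꀣꀤꀥꀦꀧꀨꀩꀪꀫꀬꀭꀮꀯꀰꀱ")
TONES_T_X = set("ꀖꀙꀚꀝꀞꀡꀤꀥꀨꀫꀬꀯꀲꀵ")

def combined_class(*selections):
    """Intersect the selections and return them as a sorted bracket class."""
    common = set.intersection(*selections)
    return "[" + "".join(sorted(common)) + "]"

both = combined_class(ORDERS_2_TO_8, TONES_T_X)
print(both)
```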

Unicode 4.1, 5.0 and Beyond