Counting Syllables

Two days ago, I released syllables, a Go package that counts the syllables in a string of text. lynx, the static generator that builds this website, imports the readability package from BluntSporks. It is used to calculate the “rough” Flesch-Kincaid reading level of the article text in each post. The formula that BluntSporks’s readability package uses to count the syllables in a text string is: letter_count / 3 + digit_count. I thought that this formula could be easily improved. I thought wrong.
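To make the formula concrete, here is a minimal sketch of that heuristic in Go. It is not the readability package's actual code: the function name roughSyllables is hypothetical, and truncating integer division is my assumption about how letter_count / 3 is evaluated.

```go
package main

import (
	"fmt"
	"unicode"
)

// roughSyllables estimates syllables as letter_count / 3 + digit_count.
// Hypothetical helper for illustration only; integer division is an assumption.
func roughSyllables(text string) int {
	letters, digits := 0, 0
	for _, r := range text {
		switch {
		case unicode.IsLetter(r):
			letters++
		case unicode.IsDigit(r):
			digits++
		}
	}
	return letters/3 + digits
}

func main() {
	// Print the estimate for a few inputs of different shapes.
	for _, s := range []string{"syllable", "through", "idea", "route 66"} {
		fmt.Printf("%q -> %d\n", s, roughSyllables(s))
	}
}
```

Under these assumptions, "through" (one syllable) is estimated at two and "idea" (three syllables) at one, which is the kind of error that made the formula look easy to beat.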

It turns out that counting syllables accurately is quite difficult. English letter combinations follow irregular patterns. Some combinations that parse as one syllable should count as two, and others that parse as two should count as one. The irregularities are not limited to a single syllable, either: a combination can run to perhaps four. One that parses as two syllables may really be one or three, and one that parses as three may really be two or four.

Most syllable counters return the same results for common words. For strings of more than 50 characters, however, they often gave different results. Therefore, the test cases included in the current syllables package do not check syllable-counting accuracy beyond a certain input length. Past that length, they can only check that the regular-expression matching stays stable as the code changes.

I ended up turning to a JavaScript package by GitHub user wooorm for the syllable-counting algorithm. wooorm’s work on syllable counting is the most complete I have encountered so far. His package also contains a list of “problematic” words that the defined regular-expression patterns do not handle. I don’t think it’s a complete list. Some words in there can take prefixes or suffixes, and the current algorithm only accounts for the plural forms of those words. The final syllables package is essentially a translation of that algorithm from JavaScript into Go. The JavaScript package’s regular expressions used a back-reference, which Go’s standard regexp package does not support. I replaced it with a curly-brace quantifier, which may need to be fine-tuned.
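The back-reference problem can be sketched as follows. The patterns here are hypothetical stand-ins that I made up for illustration, not the ones from wooorm’s package or from syllables. The strict pattern asks for a doubled consonant before an "l"; the loose one uses a {2} quantifier and therefore accepts any two consonants.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

// strictPattern is a hypothetical pattern with a back-reference:
// the same consonant twice, followed by "l".
const strictPattern = `([^aeiouy])\1l`

// loose is a hypothetical replacement using a curly-brace quantifier.
// It matches any two consonants before "l", not only a doubled one.
var loose = regexp.MustCompile(`[^aeiouy]{2}l`)

func main() {
	// Go's regexp uses RE2 syntax, which has no back-references.
	fmt.Println("strict pattern compiles:", compiles(strictPattern))

	// "bottle" contains a doubled consonant; "candle" does not.
	fmt.Println("bottle:", loose.MatchString("bottle"))
	fmt.Println("candle:", loose.MatchString("candle"))
}
```

The strict pattern fails to compile, and the loose pattern matches both words even though only "bottle" has a doubled consonant. That widening is why a quantifier substituted for a back-reference may need fine-tuning.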

Swift has made me pay more attention to function-calling syntax. I’ve tried to make calling the functions in syllables as clear and expressive as possible. The exported function func In(string) int returns the number of syllables in the input string. Assume a string variable text that contains the input text, and that the syllables package is imported. Calling In then looks like this: syllables.In(text). Go does not support function overloading, so the same function name cannot be reused for different parameter types (which would be its own topic for a different day). Thus, a convenience function is provided to count the syllables in a byte slice. Its signature is func InBytes([]byte) int, and syllables.InBytes(buffer) invokes it on a byte slice buffer through the package. This convenience function just converts the bytes into a string and returns the result of invoking In on that string. This syntax is documented at godoc.org.
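The shape of the two functions can be sketched like this. Only the signatures and the delegation from InBytes to In come from the package; the body of In below is a deliberately naive vowel-group counter that I am using as a placeholder, not the real algorithm. The sketch lives in package main, so the calls appear without the syllables. prefix.

```go
package main

import (
	"fmt"
	"regexp"
)

// vowelGroups is a naive placeholder, not the package's algorithm:
// it treats every run of vowels as one syllable.
var vowelGroups = regexp.MustCompile(`(?i)[aeiouy]+`)

// In returns the (placeholder) number of syllables in text.
func In(text string) int {
	return len(vowelGroups.FindAllString(text, -1))
}

// InBytes converts the bytes into a string and delegates to In.
func InBytes(b []byte) int {
	return In(string(b))
}

func main() {
	text := "counting syllables"
	buffer := []byte(text)

	// Both calls go through the same counting logic.
	fmt.Println(In(text))
	fmt.Println(InBytes(buffer))
}
```

Because InBytes only converts and delegates, the two calls always agree for the same input, whatever counting logic In uses.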

After the release, I found the syllables package written by ernestas-poskus, which does much the same thing, incrementing and decrementing the syllable count depending on specially matched expressions. Translating an algorithm from JavaScript into Go was still worth it as a learning experience. I wonder what it would take to transpile JavaScript into Go source code. I am curious, considering the recent release of Grumpy.