How does our Malay grammar checker work?

Developing a grammar checker was a lot harder than developing a spell checker. A spell checker only needs to check that the letters of each word are in the correct order. Making a grammar checker work, on the other hand, requires knowledge of sentence structures, as well as the classification of each and every word in the lexicon. How should these words fit in a sentence? How do we know what sort of syntactical mistake it is when a word is out of sequence?

Tagging ALL the words

To build a grammar checker, we first had to tag all the words in our lexicon, more than 100,000 of them. Some of the work can be automated. For example, we know that words with the prefix (imbuhan) me- are verbs. The rest is mostly hard manual work.
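The automated part of the tagging could look roughly like the sketch below. It is only an illustration, not our actual tagger: the prefix variants, the tiny root list and the function names are all assumptions made for the example. It checks that a known root remains after the prefix is removed, because a plain "starts with me" test would also catch words such as "meja", which is a noun.

```python
# Hypothetical variants of the me- prefix, longest first so "meng" is tried before "me".
ME_PREFIXES = ("meng", "meny", "mem", "men", "me")

# Hypothetical sample of root words; a real lexicon would be far larger.
KNOWN_ROOTS = {"beli", "baca", "lihat", "ajar"}


def auto_tag(word):
    """Return "verb" if the word looks like me- plus a known root, else None."""
    lower = word.lower()
    for prefix in ME_PREFIXES:
        if lower.startswith(prefix) and lower[len(prefix):] in KNOWN_ROOTS:
            return "verb"
    return None  # no confident guess: leave the word for manual tagging


def split_lexicon(words):
    """Separate words that can be tagged automatically from those needing manual work."""
    tagged, manual = {}, []
    for word in words:
        tag = auto_tag(word)
        if tag is not None:
            tagged[word] = tag
        else:
            manual.append(word)
    return tagged, manual


tagged, manual = split_lexicon(["membeli", "meja", "payung"])
print(tagged)  # {'membeli': 'verb'}
print(manual)  # ['meja', 'payung']
```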

It is not a straightforward job either. Words can have multiple meanings. They can be ambiguous, and have to be classified differently depending on the context. Consider this:

Dia membeli dua kaki payung

“Kaki”, by itself, belongs to “Penjodoh Bilangan” and “Am – Benda” word classes. How should it be treated in this example? To find out, we must analyze the sentence.
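To make the ambiguity concrete, here is a small sketch of a lexicon in which one word carries several tags, plus one context check that settles "kaki" in the example sentence. It is a simplified illustration under stated assumptions: only "Penjodoh Bilangan" and "Am – Benda" come from the discussion above, while the other tag names, the sample lexicon and the rule itself (a possible classifier that follows a number and precedes a noun is treated as a classifier) are made up for the example.

```python
CLASSIFIER = "Penjodoh Bilangan"
NOUN = "Am – Benda"

# Hypothetical mini lexicon: each word maps to every class it can belong to.
LEXICON = {
    "dia": {"Ganti Nama"},    # assumed tag name for pronouns
    "membeli": {"Kerja"},     # assumed tag name for verbs
    "dua": {"Bilangan"},      # assumed tag name for numbers
    "kaki": {CLASSIFIER, NOUN},
    "payung": {NOUN},
}


def resolve_tags(sentence):
    """Pick one tag per word, using neighbouring words to settle ambiguous ones."""
    words = sentence.lower().split()
    candidates = [LEXICON.get(word, set()) for word in words]
    resolved = []
    for i, tags in enumerate(candidates):
        if len(tags) == 1:
            resolved.append(next(iter(tags)))
            continue
        after_number = i > 0 and "Bilangan" in candidates[i - 1]
        before_noun = i + 1 < len(words) and NOUN in candidates[i + 1]
        if CLASSIFIER in tags and after_number and before_noun:
            resolved.append(CLASSIFIER)
            continue
        remaining = tags - {CLASSIFIER}
        # Exactly one other reading left: use it. Otherwise leave the word unresolved.
        resolved.append(next(iter(remaining)) if len(remaining) == 1 else None)
    return list(zip(words, resolved))


print(resolve_tags("Dia membeli dua kaki payung")[3])  # ('kaki', 'Penjodoh Bilangan')
print(resolve_tags("kaki dia")[0])                     # ('kaki', 'Am – Benda')
```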

So, how do we analyze a sentence?

First of all, we must know what we are looking for: the types of grammatical mistakes that we want to detect. We don’t have to cover all possibilities.

We did our research and identified 40 types of mistakes that can be detected with confidence. For each type of sentence, we built a rule, expressed as a Recursive Transition Network (RTN). These rules form a Rule Base.

Examples of Recursive Transition Network

The sequence of words in a sentence is matched against these rules. The grammar-checking engine performs a syntactical analysis by attempting to traverse each RTN in the Rule Base.

If the sentence is able to traverse an RTN completely, it is considered syntactically and grammatically correct. If a sentence fails to match one of the nodes in the RTN, it is considered incorrect at the point of that node. The error description and suggestion can then be determined from the position of the node and the tagging of the words.
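The traversal described above can be sketched in a few lines. This is a toy illustration, not the engine in the product: the two networks, the English tag names and the function names are hypothetical. Each arc is labelled either with a word tag or with the name of another network, which is what makes the network recursive. When no complete path exists, the sketch reports the furthest word position reached, which stands in for "the point of the node" where the sentence stopped matching.

```python
# Each network maps a state to its outgoing arcs: (label, next_state).
# A label is either a word tag or the name of another network (a recursive call).
# "END" is the accepting state of every network. All of this is a made-up example.
NETWORKS = {
    "S": {0: [("NP", 1)], 1: [("VERB", 2)], 2: [("NP", "END")]},
    "NP": {
        0: [("PRONOUN", "END"), ("NUMBER", 1), ("NOUN", "END")],
        1: [("CLASSIFIER", 2)],
        2: [("NOUN", "END")],
    },
}


def traverse(net, state, tags, pos, furthest):
    """Yield every position at which `net` can finish when started at `pos`."""
    furthest[0] = max(furthest[0], pos)
    if state == "END":
        yield pos
        return
    for label, next_state in NETWORKS[net][state]:
        if label in NETWORKS:
            # The arc names a sub-network: traverse it, then continue from where it ended.
            for mid in traverse(label, 0, tags, pos, furthest):
                yield from traverse(net, next_state, tags, mid, furthest)
        elif pos < len(tags) and tags[pos] == label:
            # The arc names a tag: consume one word if its tag matches.
            yield from traverse(net, next_state, tags, pos + 1, furthest)


def check(tags):
    """Return (True, None) if the tags traverse network "S", else (False, error_position)."""
    furthest = [0]
    complete = any(end == len(tags) for end in traverse("S", 0, tags, 0, furthest))
    return (True, None) if complete else (False, furthest[0])


# "Dia membeli dua kaki payung" with one tag per word
print(check(["PRONOUN", "VERB", "NUMBER", "CLASSIFIER", "NOUN"]))  # (True, None)
# Same sentence with the classifier left out: matching stops at word index 3
print(check(["PRONOUN", "VERB", "NUMBER", "NOUN"]))                # (False, 3)
```

In the second call the failing position, together with the tag the network expected there (a classifier), is the kind of information from which an error description and a suggestion could be built.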

Rules and beyond

We must ensure the above process happens as fast as the user types. Efficiency is of the utmost importance. Apart from the core grammar checker engine, the interface with MS Office also had to be built. Furthermore, we went through rigorous testing to ensure accuracy of detection, aiming to keep false positives below 10%.

This was the first attempt in the world to develop a grammar checking engine for the Malay language using our methodology. It is equivalent to defining the whole structure of the Malay language in a computer-interpretable way. We have patented the technology. It is a basis for further development of the technology into the semantic level and artificial intelligence, and it can be combined with other language-related technologies such as speech recognition and handwriting recognition.

The resulting product is Dewan Eja Pro, the first grammar checker for Malay. It is free to try. Download it now.