A Hybrid Deep Neural Network and Morphological Knowledge to Enhance Arabic Lemmatization
DOI:
https://doi.org/10.15849/ijasca.v18i1.33
Keywords:
NLP, Arabic Language, Morphological Analysis, Lemmatization, Bidirectional Long Short-Term Memory network
Abstract
Lemmatization is performed during the preprocessing stage of many natural language processing applications, enabling more efficient text analysis and the extraction of relevant information. Arabic poses three challenges for lemmatization: the morphological richness of the language, the heavy use of concatenation, and the omission of diacritical marks. To overcome these issues, we introduce a new lemmatizer that improves the functionality of a deep learning framework. The framework is a bidirectional long short-term memory (BLSTM) network with an additional filtering layer based on morphological features. To alleviate the effect of out-of-vocabulary words, a statistical layer built on Hidden Markov Models is added. Tests on a reference corpus showed that using the two layers (filtering and statistical) improved accuracy by more than nine percentage points. In addition, a comparison with state-of-the-art lemmatizers on an independent corpus showed that the proposed approach was superior.
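
The abstract does not spell out how the filtering layer combines the network output with morphological knowledge. One plausible reading, offered only as a minimal sketch, is that the lemmas scored by the network are restricted to those a morphological analyzer licenses for the word, with a fallback for words the analyzer cannot handle. The function name `select_lemma`, its parameters, the toy scores, and the transliterated lemmas below are all hypothetical illustrations, not the authors' implementation; the `fallback` callable merely stands in for the statistical layer.

```python
from typing import Callable, Dict, Optional, Set


def select_lemma(
    network_scores: Dict[str, float],
    analyzer_candidates: Set[str],
    fallback: Optional[Callable[[], str]] = None,
) -> str:
    """Return the best-scoring lemma among those the analyzer licenses.

    network_scores: hypothetical lemma -> score map produced by the network.
    analyzer_candidates: hypothetical set of lemmas a morphological
        analyzer proposes for the current word.
    fallback: optional callable used when no scored lemma is licensed
        (a stand-in for a statistical layer; an assumption of this sketch).
    """
    # Keep only the lemmas that the morphological analyzer allows.
    allowed = {
        lemma: score
        for lemma, score in network_scores.items()
        if lemma in analyzer_candidates
    }
    if allowed:
        return max(allowed, key=allowed.get)
    # No licensed candidate, e.g. an out-of-vocabulary word.
    if fallback is not None:
        return fallback()
    # Last resort: trust the unfiltered network scores.
    return max(network_scores, key=network_scores.get)


# Toy example with made-up scores and transliterated lemmas.
scores = {"kataba": 0.40, "kAtib": 0.35, "maktab": 0.25}
print(select_lemma(scores, {"kAtib", "maktab"}))
```

In this toy case the network's top choice is rejected because the analyzer does not license it, so the next-best licensed lemma is returned. Whether the published system filters in this way, or applies the morphological features earlier in the network, is not stated in the abstract.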