The Language Processing Unit
The National Institute for Testing and Evaluation (NITE) launched the Language Processing Unit in 2000 to develop computerized tools for analyzing and assessing texts in Hebrew and Arabic. To date, many tools and applications have been developed under the auspices of the Unit, and they are continuously updated and improved.
The principal tools are listed below (a detailed description of the tools, in Hebrew, may be found on the Unit’s website).
(1) Language Repositories
- Morphological dictionary including about 31,000 base entries and about 1,109,000 derived forms with morphological analysis.
- Manually tagged corpus containing about 250,000 text strings.
- Computer-tagged corpora containing about 60 million words.
- N-gram models representing statistical information on Hebrew word sequences.
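To illustrate the kind of statistical information an n-gram model holds, here is a minimal bigram sketch in Python. It is not the Unit's implementation: the function names, the whitespace tokenization, and the `<s>` / `</s>` boundary markers are assumptions made for illustration only, and the model uses plain relative frequencies with no smoothing.

```python
from collections import Counter, defaultdict


def build_bigram_counts(sentences):
    """Count how often each token follows each other token.

    Assumes tokens are separated by whitespace; a real tokenizer
    would be needed for actual Hebrew or Arabic text.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Hypothetical boundary markers for sentence start and end.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts


def bigram_probability(counts, prev, curr):
    """Relative-frequency estimate of P(curr | prev), without smoothing."""
    followers = counts.get(prev)
    if not followers:
        return 0.0
    return followers[curr] / sum(followers.values())


# Placeholder corpus; the tokens carry no linguistic meaning.
corpus = ["a b", "a c", "a b"]
bigram_counts = build_bigram_counts(corpus)
p_b_given_a = bigram_probability(bigram_counts, "a", "b")
```

In this toy corpus the token `a` is followed by `b` in two of its three occurrences, so the estimate for `b` given `a` is two thirds. A production model would normally add smoothing so that unseen word sequences do not receive a probability of zero.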
(2) Linguistic Analysis Tools
- Tokenizer used to distinguish and define language strings.
- Computerized morphological analyzer that produces all the possible morphological interpretations for a given string based on the morphological dictionary and a collection of rules for combinations not given in the dictionary.
- Computerized morphological tagger that selects the most likely morphological interpretation of a given string based on a statistical model.
- Rule-based spell checker.
- Content analyzer based on LSA (Latent Semantic Analysis) that assists in resolving semantic ambiguity and tests semantic categories in different text registers.
- Statistical linguistic analyzer that outputs about 200 linguistic features of a text (surface features, morphological and morpho-syntactic features, lexical features, and semantic features).
- Language model analysis tool that allows the creation and analysis of an n-gram model, and production of statistical data for new texts.
- Corpus analysis tool that allows linguistic analysis and processing of a collection of tagged texts.
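The difference between the morphological analyzer (which returns every possible interpretation of a string) and the morphological tagger (which selects the most likely one) can be sketched as follows. This is a simplified illustration, not the Unit's tools: the dictionary entries, the transliterated lemmas, and the frequency counts are all invented for the example, and the selection step uses context-free frequencies, whereas a real statistical tagger would typically also take the surrounding words into account.

```python
from collections import namedtuple

# One possible interpretation of a surface string.
Analysis = namedtuple("Analysis", ["lemma", "pos"])

# Toy dictionary (hypothetical entries): one unvocalized Hebrew string
# that can be read in several ways.
TOY_DICTIONARY = {
    "ספר": [
        Analysis("sefer", "noun"),  # "book"
        Analysis("sapar", "noun"),  # "barber"
        Analysis("safar", "verb"),  # "counted"
    ],
}

# Invented frequencies standing in for a trained statistical model.
TOY_COUNTS = {
    Analysis("sefer", "noun"): 50,
    Analysis("sapar", "noun"): 5,
    Analysis("safar", "verb"): 20,
}


def analyze(token):
    """Analyzer step: return all interpretations listed for the token."""
    return list(TOY_DICTIONARY.get(token, []))


def tag(token):
    """Tagger step: pick the most frequent interpretation, or None."""
    candidates = analyze(token)
    if not candidates:
        return None
    return max(candidates, key=lambda a: TOY_COUNTS.get(a, 0))


all_readings = analyze("ספר")
best_reading = tag("ספר")
```

Here `analyze` returns three candidate readings, and `tag` chooses the one with the highest toy count. Keeping the two steps separate mirrors the list above: the analyzer enumerates the options, and the tagger ranks them.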
(3) System for Computerized Text Assessment (NiteRater)
NiteRater is a system for the computerized assessment of texts. It is based on the linguistic tools mentioned above, together with additional specialized components.
Researchers in linguistics who are interested in using these applications for research purposes are welcome to contact the Unit team by emailing firstname.lastname@example.org.