HNK continues the tradition of compiling Croatian corpora at the Institute of Linguistics, Faculty of Philosophy, University of Zagreb. This tradition started with the first Croatian computer corpus (Concordance of Gundulić's Osman by Željko Bujas, 1967, published 1974). It continued with the first use of computer corpora in contrastive linguistic research in the history of linguistics (the Zagreb contrastive projects led by Rudolf Filipović from 1968) and with the project Computer processing of old Croatian writers by Milan Moguš (1970s and 1980s).
After the completion of the One-million Corpus of Croatian Literary Language (the so-called "Moguš's Corpus") by Milan Moguš (1976-1996), and after the publication of Hrvatski čestotni rječnik (Croatian Frequency Dictionary) (Moguš, Bratanić, Tadić 1999) on its basis, the need for a multimillion representative Croatian corpus emerged. It was proposed that this corpus should serve as the primary source of linguistic data for lexicographic, orthographic, morphological, syntactic and semantic research on contemporary Croatian.
Although the first ideas about a multimillion Croatian corpus were mentioned at the end of the 1980s (Tadić 1990), they reached maturity in several fundamental papers that discussed and proposed the outline of the structure of HNK, its size, time span, textual and genre typology, and its relation to already existing Croatian corpora (Tadić 1996, 1997, 1998, 1999, 2002).
HNK v 1.0
Although the initial idea was to compile HNK as a representative corpus of contemporary Croatian, it soon became obvious that older texts also needed corpus-linguistic processing. Therefore the first version of HNK, realized within the project Computational Processing of Croatian Language (MZT RH 130718) from December 1998, was divided into two constituents:
- 30-million corpus of contemporary Croatian (30m)
- Croatian Electronic Text Archive (HETA)
With these two constituents, HNK was available for querying via a web interface from the end of 1998 until 2004, with limited search options.
HNK v 2.0
At the beginning of 2005, within the project Development of Croatian Linguistics Resources (MZOŠ RH 0130418), HNK was transferred to a new server platform running Manatee, the corpus manager by Pavel Rychlý. This server is coupled with a freely available client program (Bonito), used for accessing and querying corpora, with the following features:
- queries with more than one word
- queries with additional linguistic data (e.g. lemma, POS, MSD)
- regular expressions (including query design using graphs)
- generation of ad hoc subcorpora according to selected criteria
- selective concordances from a word list, or their reduction by part, percentage, etc.
- automatic collocation detection
- statistical data ranging from simple frequency up to frequency distribution within the corpus, subcorpus, sample etc.
- user-friendly interface for Windows, Linux/Unix, MacOS
HNK v 2.5 (announced!)
The new version of HNK (spring 2008) will support queries by lemma and MSD (e.g. [lemma="glava"], or [msd="S.*"] [msd="Nc.*"] to retrieve all combinations of preposition + common noun).
This type of querying can already be tested on a limited test corpus (in Bonito, select the subcorpus cw2000).
The Croatian MSD tagset used in HNK is fully MULTEXT-East compliant.
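To make the lemma and MSD queries more concrete, the following is a minimal Python sketch of how such a query could be evaluated over an annotated token sequence. It is not the Manatee or Bonito implementation. The token list, the MSD tags in it and the function names are hypothetical illustrations, and the tags are written only in the style of MSD tags, not checked against the tagset. The sketch assumes that each pattern must match the whole attribute value, so the noun pattern is written as Nc.* to cover tags longer than three characters.

```python
import re

# Hypothetical toy sentence with illustrative (not tagset-verified) annotations.
TOKENS = [
    {"word": "Stavio", "lemma": "staviti", "msd": "Vmp-sm"},
    {"word": "je", "lemma": "biti", "msd": "Var3s"},
    {"word": "kapu", "lemma": "kapa", "msd": "Ncfsa"},
    {"word": "na", "lemma": "na", "msd": "Sa"},
    {"word": "glavu", "lemma": "glava", "msd": "Ncfsa"},
    {"word": ".", "lemma": ".", "msd": "Z"},
]


def token_matches(token, constraints):
    """True if every attribute of the token fully matches its regex."""
    return all(
        re.fullmatch(pattern, token[attr]) is not None
        for attr, pattern in constraints.items()
    )


def find_matches(tokens, query):
    """Return start positions where consecutive tokens satisfy the query.

    `query` is a list with one constraint dict per token position,
    mirroring a sequence such as [msd="S.*"] [msd="Nc.*"].
    """
    n = len(query)
    return [
        start
        for start in range(len(tokens) - n + 1)
        if all(token_matches(tokens[start + k], query[k]) for k in range(n))
    ]


# Single-token query by lemma, analogous to [lemma="glava"].
lemma_hits = find_matches(TOKENS, [{"lemma": "glava"}])

# Two-token query: preposition followed by a common noun.
prep_noun_hits = find_matches(TOKENS, [{"msd": "S.*"}, {"msd": "Nc.*"}])

for start in prep_noun_hits:
    print(" ".join(t["word"] for t in TOKENS[start:start + 2]))
```

On this toy data the lemma query finds the inflected form "glavu", and the two-token query finds the pair "na glavu". A real corpus manager would use indexes rather than a linear scan; the scan is used here only to keep the matching logic visible.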