Corpus

Previous research

HNK continues the tradition of compiling Croatian corpora at the Institute of Linguistics, Faculty of Philosophy, University of Zagreb. This tradition began with the first Croatian computer corpus (Concordance of Gundulić's Osman by Željko Bujas, 1967, published 1974). It continued with the first use of computer corpora in contrastive linguistic research in the history of linguistics (the Zagreb contrastive projects led by Rudolf Filipović from 1968) and with the project Computer processing of old Croatian writers by Milan Moguš (1970s and 1980s).

After the completion of the One-million Corpus of Croatian Literary Language (the so-called "Moguš's Corpus") by Milan Moguš (1976-1996), and after the publication of Hrvatski čestotni rječnik (Croatian Frequency Dictionary) (Moguš, Bratanić, Tadić 1999) on its basis, the need for a multimillion representative Croatian corpus emerged. It was proposed that this corpus should serve as the primary source of linguistic data for lexicographic, orthographic, morphological, syntactic and semantic research on contemporary Croatian.

Theoretical foundations

Although the first ideas about a multimillion Croatian corpus were mentioned at the end of the 1980s (Tadić 1990), they reached maturity in several fundamental papers that discussed and proposed the outline of the structure of HNK, its size, time span, textual and genre typology, and its relation to already existing Croatian corpora (Tadić 1996, 1997, 1998, 1999, 2002).

HNK v 1.0

Although the initial idea was to compile HNK as a representative corpus of contemporary Croatian, it soon became obvious that older texts also needed corpus-linguistic processing. Therefore the first version of HNK, realized within the project Computational Processing of Croatian Language (MZT RH 130718) from December 1998, was divided into two constituents:

With these two constituents, HNK was available for querying via a web interface, with limited search options, from the end of 1998 until 2004.

HNK v 2.0

At the beginning of 2005, within the project Development of Croatian Linguistics Resources (MZOŠ RH 0130418), HNK was transferred to a new server platform that uses the Manatee corpus manager by Pavel Rychlý. This server is coupled with a freely available client program (Bonito), used for accessing and querying corpora, with the following features:

HNK v 2.5 (announced!)

The new version of HNK (spring 2008) will support querying by lemmas and MSDs (morphosyntactic descriptions), e.g. [lemma="glava"], or [msd="S.*"] [msd="Nc."] to retrieve all combinations of preposition + common noun.

This type of querying can already be tried out on a limited test corpus (in Bonito, select the subcorpus cw2000).
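To illustrate what such a token-sequence query expresses, the following is a minimal Python sketch and not the Manatee or Bonito implementation. It assumes that each bracketed item constrains one token, that consecutive items must match consecutive tokens, and that each pattern is a regular expression matched against the whole attribute value. The function name find_matches, the sample tokens and their MSD values are all hypothetical, made up for the example rather than taken from HNK. Because the sketch anchors patterns to the whole tag, the noun pattern is written here as "Nc.*" so that it covers full-length tags.

```python
import re

# Hand-made illustrative tokens (NOT taken from HNK); MSD values are assumed.
TOKENS = [
    {"word": "Glava", "lemma": "glava", "msd": "Ncfsn"},
    {"word": "je", "lemma": "biti", "msd": "Vcr3s"},
    {"word": "na", "lemma": "na", "msd": "Sl"},
    {"word": "stolu", "lemma": "stol", "msd": "Ncmsl"},
]


def find_matches(tokens, query):
    """Return the start indices where consecutive tokens satisfy the query.

    `query` is a list with one dict per token position; each dict maps an
    attribute name (e.g. "lemma", "msd") to a regular expression that must
    match the whole attribute value (an assumption of this sketch).
    """
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        if all(
            re.fullmatch(pattern, token[attr])
            for token, constraints in zip(window, query)
            for attr, pattern in constraints.items()
        ):
            hits.append(start)
    return hits


# Counterpart of [lemma="glava"]: every token whose lemma is "glava".
lemma_hits = find_matches(TOKENS, [{"lemma": "glava"}])

# Counterpart of a two-position query: a preposition followed by a common noun.
prep_noun_hits = find_matches(TOKENS, [{"msd": "S.*"}, {"msd": "Nc.*"}])

for start in prep_noun_hits:
    print(" ".join(t["word"] for t in TOKENS[start:start + 2]))
```

On these sample tokens the lemma query finds the first token, and the two-position query finds the pair "na stolu". A real corpus manager answers such queries from indexes rather than by scanning every position, so the sketch shows only the matching logic.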

The Croatian MSD tagset used in HNK is fully MULTEXT-East compliant.