AnCora Basic Kit
AnCora consists of a Catalan corpus and a Spanish corpus, each of 500,000 words. The corpora are annotated at several levels:
- Morphological categories (PoS) (Penn Treebank format [.tbf] [EAGLES-PAROLE-FREELING]) (tree brackets, tgrep parsing)
- Syntactic constituents and functions
- Argument structure and thematic roles (PropBank) (semi-automatic annotation, may contain errors)
- Semantic classes of the verb
- Nouns related to WordNet synsets
- Named Entities
Two verbal lexicons are also available as a result of this annotation process. The Spanish verbal lexicon consists of 2,580 entries and the Catalan lexicon of 2,142. Each verb sense is described with the following information: semantic classes, syntactic subcategorization, argument structure and thematic roles.
History
1. [Origin] Corpus ClicTalp -> 100,000 words, PoS
2. [Old version] Corpus 3LB -> 200,000 words, PoS, syntax, WN
3. [Old version] Corpus CESS -> 300,000 words, + semantics
4. [Current version] Corpus AnCora -> 500,000 words, + roles + anaphora
Annotation
morphological: N, A, …, Z (number), W (date), F (punctuation).
NE: date, person, organization, location, misc.
syntactic: (S (sn (conj…
semantic: roles (argN – tmp…
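The morphological layer can be read by the first letter of each tag. The following is a minimal sketch, assuming that the first character of a tag identifies its category, as in the EAGLES convention. Only the letters listed above are included; the elided categories are left out rather than guessed. The names POS_CATEGORIES and pos_category are hypothetical, as are the sample tags in the usage note.

```python
# Coarse categories taken from the list above. N (noun) and A (adjective)
# follow the EAGLES convention; the elided categories are intentionally absent.
POS_CATEGORIES = {
    "N": "noun",
    "A": "adjective",
    "Z": "number",
    "W": "date",
    "F": "punctuation",
}


def pos_category(tag: str) -> str:
    """Return the coarse category of a tag, based on its first letter.

    Assumption: the first character of the tag encodes the category.
    Tags are matched case-insensitively; anything else maps to "unknown".
    """
    if not tag:
        return "unknown"
    return POS_CATEGORIES.get(tag[0].upper(), "unknown")
```

For example, pos_category("NCMS000") gives "noun" and pos_category("Fp") gives "punctuation"; a tag whose first letter is not in the table gives "unknown".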
Tools
Query the bracketed trees with tgrep. A precompiled version of tgrep2 is available.
Example use
Extract the sentences tagged with temporal roles.
arg=argM (adjuncts)
tem=tmp (temporal role)
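The same search can be sketched in plain Python, as a stand-in for a tgrep query rather than a replacement for it. This is a rough sketch under a strong assumption: that each sentence is one bracketed tree on one line, and that a node label carries its argument and role as hyphen-separated parts (for instance a label ending in "-argM-tmp"). That label layout, the sample trees and all function names are hypothetical and must be checked against the actual .tbf files.

```python
def tokenize(text):
    """Split a bracketed tree into parentheses and symbols."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse one tree starting at tokens[pos]; return (tree, next position).

    A tree is either a leaf string or a list [label, child, ...].
    """
    if tokens[pos] != "(":
        return tokens[pos], pos + 1
    node = []
    pos += 1
    while tokens[pos] != ")":
        child, pos = parse(tokens, pos)
        node.append(child)
    return node, pos + 1


def split_node(tree):
    """Return (label, children) for a list node; label is "" if absent."""
    if tree and isinstance(tree[0], str):
        return tree[0], tree[1:]
    return "", tree


def has_temporal_adjunct(tree):
    """True if some node label contains both argM and tmp (assumed layout)."""
    if isinstance(tree, str):
        return False
    label, children = split_node(tree)
    parts = label.split("-")
    if "argM" in parts and "tmp" in parts:
        return True
    return any(has_temporal_adjunct(child) for child in children)


def leaves(tree):
    """Collect leaf tokens from left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = split_node(tree)
    return [word for child in children for word in leaves(child)]


def sentences_with_temporal_roles(lines):
    """Return the text of every tree that contains an argM-tmp node."""
    found = []
    for line in lines:
        tree, _ = parse(tokenize(line))
        if has_temporal_adjunct(tree):
            found.append(" ".join(leaves(tree)))
    return found


# Hypothetical sample trees, not taken from the corpus.
SAMPLE = [
    "(S (sn-SUJ-arg0-agt (np Ana)) (grup.verb (v llegó)) (sadv-CC-argM-tmp (rg ayer)))",
    "(S (sn-SUJ-arg0-agt (np Ana)) (grup.verb (v duerme)))",
]
```

Calling sentences_with_temporal_roles(SAMPLE) keeps only the first sample sentence, the one whose adverbial node is marked argM and tmp. The sketch does not handle unbalanced brackets or tokens that are themselves parentheses.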