Cognitionis
Hector Llorens Portfolio

NLP


Natural language processing (NLP) is a subfield of artificial intelligence and computational linguistics. It studies the problems of automated generation and understanding of natural human languages (Wikipedia).

Major tasks in NLP:

An interesting approach to standardization: LMF

Interesting resources:

Wikipedia – WikiPrep (see also on SourceForge).

WordNet Similarity – WNSimilarity

Links resources (Stanford University)

Links resources (Proxem)

NLP Companies:

Bitext

Interesting manuals:

Unix for Poets (Kenneth Church) shows how to use this OS for NLP tasks.

CONFERENCES

http://aclweb.org/

Journals CL, AI, JAIR, LRE, IPM, IJIS, IS

 

Evaluation forums

TREC
CLEF
TEMPEVAL http://www.timeml.org/tempeval/ (TEMPEVAL-2 2010)

Deadlines list

  • http://www.wikicfp.com/ (excellent)
  • http://citeseer.ist.psu.edu/impact.html

Ranking

JOURNAL RANKINGS / IMPACT FACTORS

Journal Citation Report (JCR) http://www.accesowok.fecyt.es/jcr/

Search by SUBJECT CATEGORY: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE sort by IMPACT FACTOR

Lexical Level / Word level / Morphological…

Stem is the invariable part of a word (volver –> volv).

Lemma is the canonical form of a lexeme (volvieron –> volver). In English, lexeme means more or less the same thing, but not in Spanish. Therefore, do not use lexeme to refer to either stem or lemma.

Syntactic Level

Chunk: Phrase chunking is a natural language process that segments a sentence into its subconstituents, i.e. noun, verb and prepositional phrases.

Syntactic phrases: NP, ADJP, ADVP, PP.

Semantic Level

WSD Semantic roles

Corpus

XML. Many corpora use this format.

PTF/TBF (Penn Treebank Format). One of the most common NLP tree formats.

Use tgrep to query it. See the TBF tags.
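When tgrep is not at hand, a bracketed tree can also be read with a few lines of Python. The following is only a minimal sketch: the function names and the sample sentence are invented, and it assumes every bracket has a label (real Penn Treebank files often wrap each tree in an extra unlabeled pair of parentheses, which this sketch does not handle).

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        # tokens[pos] is "(" and tokens[pos + 1] is the node label
        label = tokens[pos + 1]
        children = []
        pos += 2
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = read(pos)
                children.append(child)
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        return (label, children), pos + 1

    tree, _ = read(0)
    return tree


def leaves(tree):
    """Return the words of the tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


tree = parse_tree("(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))")
print(tree[0], leaves(tree))
```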

Other common tags are parenthetical.

Finally, IOB tags, particularly IOB2, are widely used. The difference between the versions is the usage of the B- tag: in the first version (IOB1) it is only used when a chunk is immediately followed by another chunk of the same type, with no O tokens between them, while in the second (IOB2) B- is used on the first word of every chunk. Nowadays the most common format is IOB2. Spanish people also call it the BIO format, but it is better to keep the IOB2 name in publications.
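The difference between the two versions is easiest to see in a conversion. This is a minimal sketch, assuming tags are plain strings such as "I-NP", "B-NP" and "O"; the function name and the sample sequence are invented.

```python
def iob1_to_iob2(tags):
    """Convert a sequence of IOB1 tags to IOB2."""
    converted = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            chunk_type = tag[2:]
            # In IOB2 every chunk starts with B-, so an I- tag that does not
            # continue a chunk of the same type is rewritten as B-.
            continues = previous != "O" and previous[2:] == chunk_type
            converted.append(tag if continues else "B-" + chunk_type)
        else:
            # O and B- tags are identical in both versions.
            converted.append(tag)
        previous = tag
    return converted


iob1 = ["I-NP", "I-NP", "O", "I-NP", "B-NP", "I-VP"]
iob2 = iob1_to_iob2(iob1)
print(iob2)
```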

Evaluation measures

T-test is a measure that indicates whether an improvement is statistically significant (example).

Precision is the number of correct instances divided by the number of instances annotated by the system.

Recall is the number of correct instances divided by the total number of real instances.

F1-measure is F1 = (2*precision*recall)/(precision+recall). It combines the previous two measures into a single score, their harmonic mean (the 1 is because β=1).
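The role of β can be sketched with the general F-measure, of which F1 is the case β=1. The function name is invented, and returning 0 when both inputs are 0 is a convention chosen for this sketch.

```python
def f_measure(precision, recall, beta=1.0):
    """General F-measure; beta=1 gives F1, the harmonic mean of P and R."""
    if precision == 0 and recall == 0:
        return 0.0  # convention for this sketch, avoids division by zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


f1 = f_measure(0.30, 0.60)            # precision 30%, recall 60%
f2 = f_measure(0.30, 0.60, beta=2.0)  # beta > 1 gives more weight to recall
print(f1, f2)
```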

F-measure

Imagine you have 1,000 words (100 are marked as relevant for you and 900 as irrelevant) and an algorithm that marks words. Calculating precision and recall is actually quite easy. As a worked example with different numbers, imagine there are 100 positive cases among 10,000 cases. You want to predict which ones are positive, and you pick 200 to have a better chance of catching many of the 100 positive cases. You record the IDs of your predictions, and when you get the actual results you sum up how many times you were right or wrong. There are four ways of being right or wrong (TN, TP, FN, FP), listed below after the totals they are computed from:

  1. Total cases: total elements in the sample to make predictions over
  2. Real positives: total elements in the sample pre-evaluated as positive
  3. Real negatives: total elements in the sample pre-evaluated as negative
  4. Total positives: total elements your algorithm evaluated as positive
  5. Total negatives: total elements your algorithm evaluated as negative
  6. TN / True Negative: the word wasn’t marked (case negative) and your algorithm didn’t mark it (predicted negative)
  7. TP / True Positive: the word was marked (case positive) and your algorithm marked it (predicted positive)
  8. FN / False Negative: the word was marked (case positive) but the algorithm didn’t mark it (predicted negative)
  9. FP / False Positive: the word wasn’t marked (case negative) but the algorithm marked it (predicted positive)

Now, your boss asks you three questions:

  1. What percent of your predictions were correct? You answer: the “accuracy” was (9,760+60) out of 10,000 = 98.2%
  2. What percent of the positive cases did you catch? You answer: the “recall” was 60 out of 100 = 60%
  3. What percent of positive predictions were correct? You answer: the “precision” was 60 out of 200 = 30%

Other names are also used:

Missing (FN): could also be called not classified or unknown.
Spurious (FP): could also be called incorrect.
Correct (TP): no doubt about them.
Incorrect (other level)…

As we can see, the true negatives are sometimes not counted. Therefore this always has to be stated explicitly, for example: “I am detecting TEs (temporal expressions), but I do not count the cases where my algorithm says that a word is not a TE and it is indeed not a TE.”

Precision: TP / (TP + FP) –> TP / Total positives

Recall: TP / (TP + FN) –> TP / Real positives

A further measure can be calculated if we take into account all the predictions made, including the negative ones.

Accuracy: (TP [+ TN]) / Total cases (note that true negatives do not always have to be considered). In QA this happens when the correct answer can be “unknown”. Some studies state this explicitly, and then they calculate recall as accuracy and, for precision, consider positives only when there is an answer.
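The boss example can be reproduced from the four counts. This is a minimal sketch using the numbers of the example (100 real positives, 200 positive predictions, 10,000 cases, 60 correct positive predictions); the function and variable names are invented.

```python
def evaluate(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


total_cases, real_positives, total_positives = 10000, 100, 200
tp = 60                           # positive predictions that were right
fp = total_positives - tp         # positive predictions that were wrong
fn = real_positives - tp          # positive cases that were missed
tn = total_cases - tp - fp - fn   # everything else

precision, recall, accuracy = evaluate(tp, fp, fn, tn)
print(tn, precision, recall, accuracy)
```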

MRR (Mean Reciprocal Rank) is a measure for lists of ranked predictions, only useful for algorithms that output more than one prediction for each element. It is useful if the application is not sensitive to errors, that is to say, if the correct answer is ranked second, another agent could still pick it up and make the final correct decision. MRR is a measure used in TREC for QA: the score for an individual question was the reciprocal of the rank at which the first correct answer was returned, or 0 if no correct response was returned.

More generally, it is a statistic for evaluating any process that produces a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the correct answer. The mean reciprocal rank is the average of the reciprocal ranks of a sample of queries.

Example: suppose we have the following three sample queries for a system that tries to translate English words to their plurals. In each case, the system makes three guesses, with the first one being the one it thinks is most likely correct:

Query    Results                  Correct response    Rank    Reciprocal rank
cat      catten, cati, cats       cats                3       1/3
torus    torii, tori, toruses     tori                2       1/2
virus    viruses, virii, viri     viruses             1       1

Given those three samples, we could calculate the mean reciprocal rank as (1/3 + 1/2 + 1)/3 = 11/18, or about 0.61. This basic definition does not specify what to do if (1) none of the proposed results are correct (use reciprocal rank 0), or if (2) there are multiple correct answers in the list (consider using Mean Average Precision: MAP).
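The plural example can be checked with a short sketch. The function names are invented; a query with no correct answer in its list scores 0, as described above.

```python
def reciprocal_rank(ranked_results, correct):
    """1/rank of the correct answer, or 0 if it is not in the list."""
    for rank, candidate in enumerate(ranked_results, start=1):
        if candidate == correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """queries is a list of (ranked_results, correct_answer) pairs."""
    return sum(reciprocal_rank(r, c) for r, c in queries) / len(queries)


queries = [
    (["catten", "cati", "cats"], "cats"),         # rank 3
    (["torii", "tori", "toruses"], "tori"),       # rank 2
    (["viruses", "virii", "viri"], "viruses"),    # rank 1
]
mrr = mean_reciprocal_rank(queries)
print(round(mrr, 2))
```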

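Improvement Measure: The improvement of one value over another is calculated as (better*100)/worse. This expresses the better value as a percentage of the worse one; subtract 100 to obtain the relative gain.

Error Reduction Measure: The reduction of the error (100-score) of one value over another is calculated as (((100-worse)-(100-better))*100)/(100-worse)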
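Both formulas can be tried on a pair of scores. This is a minimal sketch; the function names and the scores 80 and 90 (on a 0 to 100 scale) are invented for illustration.

```python
def improvement(better, worse):
    """The better score as a percentage of the worse one."""
    return (better * 100) / worse


def error_reduction(better, worse):
    """Percentage of the worse system's error removed by the better one."""
    return (((100 - worse) - (100 - better)) * 100) / (100 - worse)


worse, better = 80.0, 90.0
print(improvement(better, worse))      # percentage of the worse score
print(error_reduction(better, worse))  # error goes from 20 to 10
```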

K-fold cross-validation: divide the corpus into K parts and evaluate K times, switching the train/test sets. In K-fold cross-validation, the original sample is partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used.
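The partitioning can be sketched as follows. The function name is invented, and assigning every K-th item to the same fold is just one simple way to split; shuffling the corpus first is often preferable.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; each item is in the test set exactly once."""
    folds = [items[i::k] for i in range(k)]  # every k-th item goes to fold i
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


documents = list(range(10))  # stand-ins for 10 corpus documents
for train, test in k_fold_splits(documents, 5):
    print(test, len(train))
```

The score of each fold would be computed inside the loop and averaged at the end.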

kappa (Wikipedia): Cohen’s kappa coefficient is a statistical measure of inter-rater agreement for qualitative (categorical) items. It is generally thought to be a more robust measure than a simple percent agreement calculation, since κ takes into account the agreement occurring by chance. However, some researchers have expressed concern over κ’s tendency to take the observed categories’ frequencies as givens, which can have the effect of underestimating agreement for a category that is also commonly used; for this reason, κ is considered an overly conservative measure of agreement. Cohen’s kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The first evidence of Cohen’s kappa in print can be attributed to Galton (1892). The equation for κ is:

\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}

where Pr(a) is the relative observed agreement among raters, and Pr(e) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement then κ = 1. If there is no agreement among the raters (other than what would be expected by chance) then κ ≤ 0.

κ Interpretation
< 0 No agreement
0.0 — 0.20 Slight agreement
0.21 — 0.40 Fair agreement
0.41 — 0.60 Moderate agreement
0.61 — 0.80 Substantial agreement
0.81 — 1.00 Almost perfect agreement
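The equation can be sketched for two raters as follows. The function name and the counts (50 items, two categories) are invented for illustration; Pr(e) is estimated from each rater's observed category frequencies, as described above.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long lists of category labels."""
    n = len(rater_a)
    # Pr(a): relative observed agreement
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Pr(e): chance agreement from each rater's category frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters always use the same single category
    return (observed - expected) / (1 - expected)


# 20 items both "yes", 5 yes/no, 10 no/yes, 15 both "no"
rater_a = ["yes"] * 25 + ["no"] * 25
rater_b = ["yes"] * 20 + ["no"] * 5 + ["yes"] * 10 + ["no"] * 15
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 2))
```

Here Pr(a) = 0.7 and Pr(e) = 0.5, so κ = 0.4, which the table above reads as fair agreement.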

(See the Wikipedia example.)

Monte Carlo methods: useful for determining whether something is affected or not by something else, by running tests on randomly generated sets. (…read more, go further…) There should be other measures…

Monte Carlo methods (Wikipedia) are a class of computational algorithms that rely on repeated random sampling to compute their results. Monte Carlo methods are often used when simulating physical and mathematical systems. Because of their reliance on repeated computation and random or pseudo-random numbers, Monte Carlo methods are most suited to calculation by a computer. Monte Carlo methods tend to be used when it is infeasible or impossible to compute an exact result with a deterministic algorithm.
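One way such random tests are applied to evaluation is a paired randomization test: if two systems were equivalent, swapping their scores on any item should not matter, so the observed difference is compared against many random swaps. This is only a sketch of that idea, not a method taken from these notes; the function name, the number of trials and the scores are invented.

```python
import random


def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Estimate how often random swapping gives a difference at least as
    large as the observed one (a Monte Carlo p-value estimate)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(trials):
        # Swapping the two systems on an item flips the sign of its difference.
        shuffled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


# Per-item correctness (1 = correct) of two hypothetical systems on 10 items
system_a = [1] * 10
system_b = [0] * 10
p_value = randomization_test(system_a, system_b)
print(p_value)
```

A small value suggests that the difference is unlikely to be due to chance; with identical scores the estimate is exactly 1.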

Corpora

Why make and use corpora? (empiricism vs rationalism)

The making and usage of a corpus is related to the empirical view of language study.

  • Empiricism (McEnery and Wilson): an approach to a subject (linguistics) which is based upon analysis of external data (text/corpora). (Sampson, 2001)
  • Rationalism (Chomsky): an approach to a subject (linguistics) which is based upon introspection rather than external data analysis.

Rationalism favors writing rules based on individuals’ knowledge, and empiricism favors inferring these rules from annotated examples. Why not make classified examples of general language rules rather than just leave unsorted annotated examples?

What is a corpus?

Wilson: Any collection of more than one text can be called a corpus. But… more strictly, a finite collection of machine-readable text in electronic form, sampled to be maximally representative of a language or variety.

  • Support: Electronic
  • Representativity: maximum
  • Size: finite, but enough to be representative
  • Aim: Make linguistic information explicit for linguistic and computational applications.

Sinclair (EAGLES):

  • The corpus should be as large as the technology of the time permits.
  • It has to be representative enough (samples from a broad range of material)

Typology

oral – textual
annotated – not annotated
general – specific
monolingual – multilingual

Applications

Cognitively implausible systems (at all linguistic levels)

Annotation: manual, automatic, semi-automatic

Books

Handbook of Knowledge Representation (Foundations of Artificial Intelligence) (2008)
Commonsense Reasoning, Mueller (2006)
AI, Russell and Norvig (2nd ed., 2003)
NLU, Allen (2nd ed., 1995)

http://www.amazon.com/Foundations-Statistical-Natural-Language-Processing/dp/0262133601/ref=pd_sim_b_img_2

http://www.amazon.com/Natural-Language-Processing-Knowledge-Representation/dp/0262590212

http://www.amazon.com/Knowledge-Representation-Semantics-Cognitive-Technologies/dp/3540244611

http://www.amazon.com/Probabilistic-Reasoning-Intelligent-Systems-Plausible/dp/1558604790

http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325

TimeML analysis (CaVaT)

The CaVaT software by Leon Derczynski is a Python script that checks TimeML annotations (including link consistency by temporal closure).

Before running CaVaT, install:

sudo apt-get install python-nltk
sudo apt-get install python-pyparsing
sudo apt-get install python-sqlite
sudo apt-get install python-chardet

Then start python and download the NLTK data:

python
import nltk
nltk.download()

In the downloader, use the identifier “all” to get all packages…
Text Categorization (Language Detection)

TextCat: An open source language categorizer (guesser/detector). Fast, written in Perl, covers 70 languages (but not useful for similar languages and short texts). SourceForge. (Trenkle) (Implemented in Corpus Interface). READ [1994.CANVAR-TRENKLE]
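The idea behind this kind of guesser is to compare ranked lists of frequent character n-grams (an “out-of-place” distance). The following is a simplified sketch of that idea, not TextCat’s actual code: the function names, the n-gram length limit, the profile size and the two tiny training texts are all assumptions for illustration, and real profiles are built from much more text.

```python
from collections import Counter


def ngram_profile(text, max_n=3, size=300):
    """Most frequent character n-grams of the text, in rank order."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by frequency, then alphabetically so that ties are deterministic.
    return sorted(counts, key=lambda g: (-counts[g], g))[:size]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unseen n-grams get the maximum penalty."""
    rank = {g: r for r, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(r - rank[g]) if g in rank else penalty
               for r, g in enumerate(doc_profile))


def guess_language(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


samples = {
    "en": "the cat sat on the mat and the dog ate the bone",
    "es": "el gato se sentó en la alfombra y el perro se comió el hueso",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
print(guess_language("the dog and the cat", profiles))
```

With profiles this small the guess is fragile, which matches the limitation noted above for short texts and similar languages.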

Stemming/Stemmer/Stem/Lemmatizer-Inflector

Lovins 1968, Porter 1980, Lancaster (Paice/Husk) 1990, Snowball (programming language), Porter Stemmer. See Wikipedia.

A lemmatizer is much more convenient and NLP aware (e.g., FreeLing for es/ca/en, TreeTagger/Minipar/MaltParser for en).

Part-of-Speech taggers (POS TAGGERS)

TreeTagger: multilingual POS-tagger/lemmatizer/inflector (English, German, French, Spanish,…). TreeTagger is a famous POS-tagger by Schmid [1994.Schmid] which uses decision trees to learn taggers/lemmatizers/inflectors for any language. A tagger is not a big deal to implement; try it before getting the doctorate (you need a corpus, morphological information, rules, a lexicon…). TreeTagger How-to.
Stanford (Java), CLAWS, JTextPro (Java, see & copy), Jitar (Java)
There are many Java implementations based on different ML techniques (HMM [trigrams], CRF,…) trained over different corpora (WSJ,…). I can get one for ES/CA using AnCora.
Download some of them and build my own NLP tool.
Others that include POS: FreeLing, MINIPAR (see parsers and dependency parsers)

Syntactic parsers (PARSERS)

FreeLing: An open source lexical/morphosyntactic analyzer and POS-tagger (EAGLES based, PAROLE tags) written in C++ (ES/CA/EN). It uses ISO-8859-1 encoding; if you have UTF-8, use iconv before and after calling FreeLing.
Installation (NEW: use the .deb package): sudo apt-get install libdb4.6++ libpcre3 libboost-filesystem1.34.1; sudo dpkg -i FreeLing-2.1.deb
Execution: echo "Me gustan los gatos" | analyze -f /usr/share/FreeLing/config/es.cfg
Edit the configuration files to get Named Entity classification and, if needed, a better NE recognition [corpus-trained ML-bio...].
(DEPRECATED install how-to) (userman) (FreeLing Basic Kit)
Dependencies: libpcre, libdb, libcfg+0, libfries & omelet (deprecated: at the moment 0.98 requires g++-3.4 because of the deprecated namespaces support).

How to disable the splitting of AL, DEL into a+el and de+el:

The only way is to change these entries in the dictionary:
1.- Edit /usr/local/share/FreeLing/es/dicc.src and locate the entries
al a+el SP+DA
del de+el SP+DA
2.- Replace the entries with
al al SPCMS
del del SPCMS
3.- Re-index the dictionary with
indexdict mydicc.db <dicc.src
chmod ugo+r mydicc.db
4.- Adjust your config file to use this new dictionary, or overwrite your old maco.db file with your new mydicc.db dictionary

This is useful when you already have tokenized input; do not apply it for plain text.
Charniak PARSER: Morphosyntactic parser for English (one of the best in 2005). Full syntactic parsing, not shallow. Better than a simple chunker. (install how-to).

Dependency parsers (PARSERS)

MINIPAR: An open source syntactic dependency analyzer (C++) (also gives POS and lemmas) Minipar Basic Kit – (Mirror download)
MaltParser: Included in NLTK, a very good and widely used dependency parser. But it is ONLY a parser; it works over tagged/lemmatized text (e.g., over TreeTagger output). I can get one for ES/CA using AnCora.

Named Entity Recognizers (NERs)

FreeLing – Multilingual. Best for ES/CA, but not too good for EN.
LingPipe: Best for English. An open source Named Entity Recognizer (Java). NOTE: It also detects coreference between entities through the id attribute (a kind of anaphora resolution). LingPipe Basic Kit
GATE: A General Architecture for Text Engineering. NER. IDE. … (interesting, have a look)

Semantic Role Labelers (SRLs)

SRL tool (CCG Roth): Best SRL tool for English (winner of SemEval). PropBank role set. (install how-to).

TimeML Software

TARSQI TOOL KIT (TTK): Annotate all TimeML elements (ttk how-to)

ML tools

CRF++: Regular installation (configure, make, sudo make install)

Yamcha+TinySVM: TinySVM has a regular installation; for Yamcha, downgrade the compiler to 4.1 (./configure CXX=g++-4.1) and then do a regular installation. If it crashes on strange symbols, just RE-INSTALL.

Toolkits and large tools

NLTK (Natural Language Tool Kit): An open source natural language toolkit in Python (python-nltk). Tagger, parser, … Requires python-pywordnet. (http://www.nltk.org/book).

GATE

BART tool

Open-NLP tools

Ubuntu NLP Repository: Packages for NLP (excellent).

GuiTAR (Anaphora resolution)

Web as corpus toolkit

WEKA 3

Other Tools

InTime: Use version > 0.0.30. Set the server in the configuration file intimeC.xml (server: http://piolin.dlsi.ua.es:8080/intime/InTiMe?wsdl). Make a symbolic link in /usr/bin: intime –> /localinstallation/intime (sudo ln -s /home/hector/InTiMe-0.0.30/intime intime)


To explore

http://www.wolframalpha.com/, Wikipedia, Google…

Useful corpus parsers

XML: Xerces, JAXP, XPath
Penn Treebank Format: tgrep

Afner (NER) + AnswerFinder (QA system) (Diego Molla, Pizzato).

 

Ontologies/DB/Semi-structured

Something to do here in this mess…

Sowa has a good summary

http://www.jfsowa.com/

http://www.jfsowa.com/ontology/

http://www.jfsowa.com/ontology/ontoshar.htm

He also has a good book to buy (Borja)

Cyc

SUMO

WordNet

Try text2onto software

WordNet: Lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
(Fellbaum book)

Wikipedia: An open source Web Encyclopedia (TODO: LINK TO cognitionis.com/it/wikipedia/)
(How to use) (mirror download)

Corpora

Available corpus

TreeBank Corpus

Brown

Multext?? (to download…)

Available corpora for Spanish

Annotated:

  • Ancora (AnCora Basic Kit)(Official web)
  • Europarl (European Parallel corpora)(http://www.statmt.org/europarl/)
  • TimeML?
  • TERN-2004 (English & Spanish)
  • CREA and CORDE (RAE)
  • UAM-treebank
  • LexESP CLiC-TALP
  • Cast3LB

Unannotated:

  • ECI/MCI Corpus (European Corpus) (Many languages) www.elsnet.org/resourecs/eciCorpus.html
  • Elaleph (literary texts) www.elaleph.com
  • BEC (religion/Christianity)
  • TimeBank Basic Kit (The same as PropBank in LDC same as Penn TreeBank same as WSJ and Brown corpus)
  • AnCora Basic Kit
  • Wikipedia Basic Kit
  • http://infomotions.com/alex/ (electronic documents of english classic literature)
  • The Holy Bible (zip, link)
  • Fairy tales and kids’ stories
    • Little Red Riding Hood
    • Lily and the lion
  • http://www.inf.ed.ac.uk/resources/corpora/ (Edinburgh, many corpora)
  • Brown corpus, Penn Treebank (http://www.cis.upenn.edu/~treebank/)… make specialized pages for these.

Question sets

  • Common name based answer questions (zip, link)
  • TREC (link): Question sets with answers and human judgements
  • CLEF (link): Question sets with answers and human judgements (source Wikipedia… interesting)
  • OpenTrivia.com (link)

PoS (http://en.wikipedia.org/wiki/Part-of-speech_tagging) tags EAGLES, PAROLE …

A treebank or parsed corpus is a text corpus in which each sentence has been parsed, i.e. annotated with syntactic structure. Syntactic structure is commonly represented as a tree structure, hence the name Treebank. The term Parsed Corpus is often used interchangeably with Treebank: with the emphasis on the primacy of sentences rather than trees.

Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information.

Penn Treebank II tags: http://bulba.sdsu.edu/jeanette/thesis/PennTags.html

Detects numeric quantities, even ones spelled out in words.

Alphabetical list of part-of-speech tags used in the Penn Treebank, Charniak Project:

Number Tag Description
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb

PAROLE TAGSET (FREELING)

ABBREVIATION ABBREVIATED WORD
ADJ Adjective
ADP Adposition
ADV Adverb
ART Article
CON Conjunction
DET Determiner
INT Interjection
NOU Noun
NUM Numeral
PRN Pronoun
RES Residual
UNIQUE Unique Membership Class
VRB Verb

GEO INFO

In order to conduct geo-retrieval well, you may need resources such as gazetteers or ontologies. Here is a brief list of resources that we know about. Please contact Mark Sanderson, if you have other resources you want added to this list.