NLP Basic Concepts
Lexical Level / Word Level / Morphological… (TODO: do this section properly)
Stem is the invariable part of a word (volver –> volv). Lemma is the canonical form of a lexeme (volvieron –> volver). In English, "lexeme" means more or less the same thing, but not in Spanish. Therefore, do not use "lexeme" to refer to either the stem or the lemma. It is also very useful to know what a sentence is.
Syntactic Level
Chunk: phrase chunking is a natural language process that segments a sentence into its subconstituents, i.e. noun, verb and prepositional phrases.
Syntactic tree: NP, ADJP, ADVP, PP … (TODO: what are the differences among them? What is a PP?)
Prepositional phrases modify nouns and verbs while indicating various relationships between subjects and verbs. They are used to color and inform sentences in powerful ways.
Parts of PP
In simplest terms, prepositional phrases consist of a preposition and an object of a preposition. Prepositions are indeclinable words that introduce the object of a prepositional phrase. Indeclinable words are words that have only one possible form. For example, "below" is a preposition, but "belows" and "belowing" are not possible forms of "below". The noun phrase or pronoun that follows the preposition is called the object of the preposition. For example, "behind the couch" is a prepositional phrase where "behind" is the preposition and the noun phrase "the couch" acts as the object of the preposition. Sometimes adjectives are used to further modify the object of the preposition, as in "behind the big old smelly green couch".
- Noun Phrases (TIMEX3 UNIT)
- Adjective Phrases (TIMEX3 UNIT)
- Adverbial Phrases (TIMEX3 UNIT)
- Prepositional Phrases
Semantic Level
WSD (word sense disambiguation)
Semantic roles
Corpus
XML: many corpora use this format.
PTF/TBF (Penn Treebank Format): one of the most common NLP tree formats. Use tgrep to search it. See TBF tags.
Parenthetical (bracketed) tags are another common format.
Finally, IOB tags, particularly IOB2, are widely used. The two versions differ in how they use the B- tag. In the first version (IOB1), B- is used only when a chunk immediately follows another chunk of the same type, with no O tokens between them. In the second (IOB2), B- marks the first word of every chunk. IOB2 is nowadays the most commonly used format. It is also known as the BIO format (notably among Spanish researchers), but it is better to keep the name IOB2 in publications.
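The difference between the two tagging schemes can be made concrete with a small conversion routine. This is only a sketch: the function name `iob1_to_iob2` and the sample tag sequence are illustrative, and tags are assumed to have the form `O`, `B-TYPE` or `I-TYPE`.

```python
def iob1_to_iob2(tags):
    """Convert a sequence of IOB1 tags to IOB2 (illustrative helper)."""
    converted = []
    prev_type = None  # chunk type of the previous token, None after an "O"
    for tag in tags:
        if tag == "O":
            converted.append(tag)
            prev_type = None
            continue
        prefix, chunk_type = tag.split("-", 1)
        # A chunk starts at an explicit B-, after an O, or when the type changes.
        starts_chunk = prefix == "B" or prev_type != chunk_type
        converted.append(("B-" if starts_chunk else "I-") + chunk_type)
        prev_type = chunk_type
    return converted


# Hypothetical IOB1 sequence: two adjacent NP chunks need a B- even in IOB1.
iob1 = ["I-NP", "I-NP", "O", "I-NP", "B-NP", "I-NP", "I-VP"]
iob2 = iob1_to_iob2(iob1)
print(iob2)
```

In the IOB2 output every chunk begins with a B- tag, so chunk boundaries can be read off without looking at the previous token.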
Evaluation measures
T-test is a measure that indicates whether an improvement is statistically significant (TODO: extend from Wikipedia and add an example). Precision is the number of correct answer instances divided by all answered instances.
Recall is the number of correct answer instances divided by all instances. Here I have an interesting dilemma: defined this way, recall cannot be higher than precision (at best it is the same). Then how can I measure the real recall? Suppose an algorithm may or may not classify each instance, and each classification may or may not be correct. How can I measure whether the proportion of classified instances is high, regardless of whether they are correct? I think this is the important thing (my personal definition of recall). With that definition you can have a high recall but a low precision (obviously lower than that recall). Maybe I was doing it wrong? Answer: the answer begins with another question. Why are these measures of little use for QA tests? Because you cannot have more answered instances than total instances (questions), unless some questions have "none" (unknown) as their answer. Otherwise precision is always at least as high as recall.
TODO (review): there must be a measure that accounts for the algorithm's classification rate, that is, given X elements, how many are classified (right or wrong) –> Total positives / Total elements,
and also a measure of the corpus population –> Real positives / Total elements.
Conclusion –> it depends on the task and has to be clearly defined in the evaluation section of the paper.
It is confusing because "positive" sometimes means classified and sometimes means correct (TODO: clarify in the examples).
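F1-measure is F1 = (2*precision*recall)/(precision+recall). It combines the two previous measures into a single score, their harmonic mean. The 1 is because β = 1 in the general F-measure, Fβ = ((1+β²)*precision*recall)/(β²*precision+recall).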
Solution: make a more general definition. Imagine you have 1000 words (100 are marked as relevant for you and 900 as irrelevant), and an algorithm that marks words. Calculating precision and recall is actually quite easy. As a larger example, imagine there are 100 positive cases among 10,000 cases. You want to predict which ones are positive, and you pick 200 to have a better chance of catching many of the 100 positive cases. You record the IDs of your predictions, and when you get the actual results you sum up how many times you were right or wrong. The following counts describe the sample, ending with the four ways of being right or wrong (TN, TP, FN, FP):
- Total cases: total elements to make prediction over in the sample
- Real positives: total elements in the sample pre-evaluated as positive
- Real negatives: total elements in the sample pre-evaluated as negative
- Total positives: total elements your algorithm evaluated as positive
- Total negatives: total elements your algorithm evaluated as negative
- TN / True Negative: the word wasn’t marked (case negative) and your algorithm did not mark it (predicted negative)
- TP / True Positive: the word was marked (case positive) and your algorithm marked it (predicted positive)
- FN / False Negative: the word was marked (case positive) but the algorithm did not mark it (predicted negative)
- FP / False Positive: the word wasn’t marked (case negative) but the algorithm marked it (predicted positive)
Now, your boss asks you three questions:
- What percent of your predictions were correct? You answer: the “accuracy” was (9,760+60) out of 10,000 = 98.2%
- What percent of the positive cases did you catch? You answer: the “recall” was 60 out of 100 = 60%
- What percent of positive predictions were correct? You answer: the “precision” was 60 out of 200 = 30%
Then:
- Missing (FN): could also be called not classified or unknown
- Spurious (FP): could also be called incorrect
- Correct (TP): no doubt about them
- Incorrect (other level)…
As we can see, true negatives are sometimes not counted, so this always has to be stated explicitly. For example: I am detecting TEs, but I do not count the cases where my algorithm says that a word is not a TE and it is indeed not a TE.
Precision: TP / (TP + FP) –> TP / Total positives
Recall: TP / (TP + FN) –> TP / Real positives
A new measure can be calculated if we take into account all the predictions made (including negative predictions).
Accuracy: (TP [+ TN]) / Total cases (note that true negatives do not always have to be considered). In QA, true negatives occur when the correct answer can be "unknown". Some studies state this explicitly and then calculate recall as accuracy, and for precision consider as positives only the cases where there is an answer.
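The definitions above can be checked against the 10,000-case example with a few lines of Python. This is a minimal sketch: the function names are illustrative, the counts (TP=60, FP=140, FN=40, TN=9,760) are derived from the example, and all denominators are assumed to be non-zero.

```python
def precision(tp, fp):
    """TP / Total positives (everything the algorithm marked)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """TP / Real positives (everything that should have been marked)."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Correct predictions (positive and negative) over all cases."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# 100 real positives among 10,000 cases; 200 predicted positive, 60 of them right.
tp, fp, fn, tn = 60, 140, 40, 9760

p = precision(tp, fp)
r = recall(tp, fn)
a = accuracy(tp, tn, fp, fn)
print(f"precision={p:.3f} recall={r:.3f} accuracy={a:.3f} f1={f1(p, r):.3f}")
```

Note how accuracy is dominated by the 9,760 true negatives, which is why it says little about a task where positives are rare.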
MRR (Mean Reciprocal Rank) is a measure for ranked lists of predictions. It is only useful for algorithms that output more than one prediction per element, and when the application is not sensitive to errors: if the correct answer is ranked second, another agent can still notice it and make the correct final decision. MRR is used in TREC for QA: the score for an individual question is the reciprocal of the rank at which the first correct answer was returned, or 0 if no correct response was returned.
More generally, MRR is a statistic for evaluating any process that produces a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the correct answer. The mean reciprocal rank is the average of the reciprocal ranks over a sample of queries.
Example: suppose we have the following three sample queries for a system that tries to translate English words to their plurals. In each case, the system makes three guesses, with the first one being the one it thinks is most likely correct:

| Query | Results | Correct response | Rank | Reciprocal rank |
|---|---|---|---|---|
| cat | catten, cati, cats | cats | 3 | 1/3 |
| torus | torii, tori, toruses | tori | 2 | 1/2 |
| virus | viruses, virii, viri | viruses | 1 | 1 |

Given those three samples, we could calculate the mean reciprocal rank as (1/3 + 1/2 + 1)/3 = 11/18, or about 0.61. This basic definition does not specify what to do if (1) none of the proposed results is correct (use reciprocal rank 0), or if (2) there are multiple correct answers in the list (consider using Mean Average Precision, MAP).
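The plural-guessing example can be reproduced with a short sketch. The function names are illustrative, each query is assumed to have exactly one correct response, and a query with no correct guess scores 0.

```python
def reciprocal_rank(ranked_guesses, correct):
    """1/rank of the first correct guess, or 0.0 if none is correct."""
    for rank, guess in enumerate(ranked_guesses, start=1):
        if guess == correct:
            return 1 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_guesses, correct) pairs."""
    return sum(reciprocal_rank(g, c) for g, c in queries) / len(queries)


queries = [
    (["catten", "cati", "cats"], "cats"),        # rank 3
    (["torii", "tori", "toruses"], "tori"),      # rank 2
    (["viruses", "virii", "viri"], "viruses"),   # rank 1
]

mrr = mean_reciprocal_rank(queries)
print(round(mrr, 2))  # 0.61
```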
Improvement measure: the improvement of one value over another is calculated as (better*100)/worse, i.e. the better value expressed as a percentage of the worse one.
Error reduction measure: the reduction in error (100-score) of one value over another is calculated as (((100-worse)-(100-better))*100)/(100-worse).
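A quick sketch of both formulas, with scores assumed to be percentages between 0 and 100 (and the worse score assumed to be neither 0 nor 100, to avoid dividing by zero). The function names and the sample scores 80 and 90 are illustrative.

```python
def improvement(better, worse):
    """Better score expressed as a percentage of the worse score."""
    return (better * 100) / worse


def error_reduction(better, worse):
    """Percentage of the worse system's error (100 - score) that was removed."""
    worse_error = 100 - worse
    better_error = 100 - better
    return ((worse_error - better_error) * 100) / worse_error


# Hypothetical scores: going from 80 to 90 halves the error (20 -> 10).
print(improvement(90, 80))      # 112.5
print(error_reduction(90, 80))  # 50.0
```

The two numbers tell different stories: the score itself only grew to 112.5% of its previous value, while half of the remaining errors were eliminated.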
K-fold cross-validation: divide the corpus into K parts and evaluate K times, switching the train/test sets. In K-fold cross-validation, the original sample is partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used.
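The procedure can be sketched in plain Python. Everything here is illustrative: `k_fold_splits`, `cross_validate` and `majority_baseline` are hypothetical names, the folds are taken in corpus order (in practice the data is usually shuffled first), and the "model" is just a toy majority-label baseline standing in for a real train/evaluate step.

```python
from collections import Counter


def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(items, k, evaluate):
    """Average evaluate(train, test) over the k folds."""
    scores = [
        evaluate([items[i] for i in train], [items[i] for i in test])
        for train, test in k_fold_splits(len(items), k)
    ]
    return sum(scores) / len(scores)


def majority_baseline(train, test):
    """Toy evaluation: accuracy of always predicting the majority training label."""
    majority = Counter(train).most_common(1)[0][0]
    return sum(1 for label in test if label == majority) / len(test)


# Hypothetical label sequence: 3 "pos" and 7 "neg".
labels = ["pos", "neg", "neg", "pos", "neg", "neg", "neg", "pos", "neg", "neg"]
mean_score = cross_validate(labels, 5, majority_baseline)
print(mean_score)
```

Every item lands in exactly one test fold, which is the property that distinguishes this scheme from repeated random sub-sampling.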
Kappa (Wikipedia): Cohen’s kappa coefficient is a statistical measure of inter-rater agreement for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement, since κ takes into account the agreement occurring by chance. However, some researchers have expressed concern over κ’s tendency to take the observed categories’ frequencies as givens, which can have the effect of underestimating agreement for a category that is also commonly used; for this reason, κ is considered an overly conservative measure of agreement. Cohen’s kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The first evidence of Cohen’s kappa in print can be attributed to Galton (1892). The equation for κ is:
κ = (Pr(a) − Pr(e)) / (1 − Pr(e))
where Pr(a) is the relative observed agreement among raters, and Pr(e) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement then κ = 1. If there is no agreement among the raters (other than what would be expected by chance) then κ ≤ 0.
| κ | Interpretation |
|---|---|
| < 0 | No agreement |
| 0.00 – 0.20 | Slight agreement |
| 0.21 – 0.40 | Fair agreement |
| 0.41 – 0.60 | Moderate agreement |
| 0.61 – 0.80 | Substantial agreement |
| 0.81 – 1.00 | Almost perfect agreement |
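As a rough illustration of how κ is obtained from two raters' labels, here is a sketch in plain Python. The function name `cohen_kappa` and the yes/no data are illustrative, and the code assumes the raters do not both use a single identical category for every item (which would make the chance agreement 1 and the denominator 0).

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long lists of categorical labels."""
    n = len(rater_a)
    # Pr(a): proportion of items on which the two raters agree.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Pr(e): chance agreement from each rater's own category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)


# Hypothetical annotation of 50 items: 20 yes/yes, 5 yes/no, 10 no/yes, 15 no/no.
rater_a = ["yes"] * 20 + ["yes"] * 5 + ["no"] * 10 + ["no"] * 15
rater_b = ["yes"] * 20 + ["no"] * 5 + ["yes"] * 10 + ["no"] * 15

kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 2))  # 0.4
```

Here the raters agree on 70% of the items, but half of that agreement is expected by chance, so κ only reaches 0.4 ("fair agreement" in the table above).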
For a worked kappa example, see the Wikipedia article.
Monte Carlo methods: useful for determining whether something is affected by something else, by running tests on randomly generated sets. (TODO: read more, go further. There should be other measures.) According to Wikipedia, Monte Carlo methods are a class of computational algorithms that rely on repeated random sampling to compute their results. They are often used when simulating physical and mathematical systems. Because of their reliance on repeated computation and random or pseudo-random numbers, Monte Carlo methods are most suited to calculation by a computer. They tend to be used when it is infeasible or impossible to compute an exact result with a deterministic algorithm.
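One common way to apply this idea in evaluation is a Monte Carlo permutation test: shuffle the scores of two systems many times and count how often a random split looks at least as different as the observed one. The sketch below is one possible reading of the note above, not a prescribed method; the function name, the number of trials, the fixed seed and the sample scores are all assumptions.

```python
import random


def permutation_test(scores_a, scores_b, trials=10000, seed=0):
    """Estimate how often a random regrouping differs as much as the observed groups."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(scores_a) - mean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        group_a = pooled[:len(scores_a)]
        group_b = pooled[len(scores_a):]
        if abs(mean(group_a) - mean(group_b)) >= observed:
            hits += 1
    # Proportion of random splits at least as extreme as the real one.
    return hits / trials


# Hypothetical per-fold scores (in %) for two systems.
system_a = [10, 10, 10, 10, 10, 10, 10, 10]
system_b = [0, 0, 0, 0, 0, 0, 0, 0]
p_value = permutation_test(system_a, system_b)
print(p_value)
```

A small value suggests the difference between the two systems is unlikely to be an accident of how the scores happened to be grouped.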
Corpora
Why make and use corpora? (empiricism vs rationalism)
The making and usage of a corpus is related to the empiricist view of language study.
- Empiricism (McEnery and Wilson): an approach to a subject (linguistics) which is based upon analysis of external data (text/corpora). (Sampson, 2001)
- Rationalism (Chomsky): an approach to a subject (linguistics) which is based upon introspection rather than external data analysis.
Rationalism favors writing rules based on an individual's knowledge, while empiricism favors inferring these rules from annotated examples. Why not build classified examples of general language rules rather than just leaving unsorted annotated examples?
What is a corpus?
Wilson: any collection of more than one text can be called a corpus. But, more strictly, a corpus is a finite collection of machine-readable text in electronic form, sampled to be maximally representative of a language or variety.
- Support: Electronic
- Representativity: maximum
- Size: finite, but enough to be representative
- Aim: make linguistic information explicit for linguistic and computational applications.
Sinclair (EAGLES):
- The corpus should be as large as the technology of the time permits.
- It has to be representative enough (samples from a broad range of material).
Typology
- oral – textual
- annotated – not annotated
- general – specific
- monolingual – multilingual
Applications
Cognitively implausible systems (at all linguistic levels)
Annotation
Manual, automatic, semi-automatic
