Extracting NLP relationships is a tough topic

University of Potsdam Institute for Computer Science Chair of Machine Learning. NLP pipeline. Tobias Scheffer Christoph Sawade


1 University of Potsdam Institute for Computer Science Chair of Machine Learning NLP Pipeline Tobias Scheffer Christoph Sawade

2 NLP pipeline Sequence of processing steps for information extraction, translation, summary, some other language processing operations. 2

3 NLP (Natural Language Processing) pipeline 1. Tokenization, 2. Part-of-Speech-Tagging, 3. Named Entity Recognition, 4. Shallow parsing, 5. Chunking, 6. Information extraction 3

4 Natural Language Processing (NLP) pipeline Sequential processing of language. Many variants of the pipeline, depending on the problem. Each level builds on the output of the previous level. Some methods are used for several of the stages. 4th

5 NLP (Natural Language Processing) Pipeline Problem: Errors add up over the processing steps. Trend: Formulate the entire processing as a single optimization / learning problem. Individual levels can thus influence each other better. Sometimes it only emerges from the semantic analysis that a certain part of speech is implausible for a term. Problem: The only optimization problem is extremely complex. State of research: two processing steps at the same time. 5

6 POS tagging Assignment of the part of speech of terms. Mostly input for further processing steps (parser, extraction rules, translation). Tagset: Definition of the parts of speech to be distinguished. In principle: nouns, verbs, pronouns, preposition, adverb, conjunction, participle, article, John / NN saw / v her / pn duck / v or John / NN saw / v her / pn duck / n 95-97% correctly set Common during the day. 6th

7 Common Tags for English AT Article BEZ is IN Preposition JJ Adjective NN Noun NNP Proper Noun NNS Noun, plural PERIOD.,?,! PN Personal Pronouns RB Adverb TO to VB Verb VBD Verb, Past 7

8 Part of speech recognition: influencing factors Some POS sequences are commonly used AT, YY, NN. Others are less likely to be AT, PN VB. Most words appear in only a few parts of speech, e.g. Lauf / VB or Lauf / NN, but not Lauf / YY. Prediction of the word form based on the neighboring tags: approx. 77% accuracy. Prediction of the most likely word form of any word without context: approx. 90% accuracy. 8th

9 Part of Speech Recognition: Methods Historical: Transformation-based (rule-based) taggers. N-Gram Models HMMs Discriminative Sequence Models (Conditional Random Fields, Hidden Markov Support Vector Machines). 9

10 Part of speech recognition with Markov models One state per POS tag. Training data: tagged corpus. States are visible. Applying the tagger to a new sentence: states are not yet visible. 10

11 Training the POS tagger Start probabilities: Counting the first days of a sentence. Transition probabilities: counting the frequency of day transitions. Observation probabilities: count frequencies of all words in all states. 11

12 Using the POS tagger Sequence of the most probable individual states given a word sequence. Through the forward-backward algorithm. Most likely overall sequence of states given word sequence. Determined by Viterbi algorithm. 12th

13 Trigram Tagger HMM = Bigram Tagger. Trigram Tagger: Each pair of tags forms a state. State transitions correspond to tag trigrams. 13th

14 Transformation-based taggers E.g. Brill Tagger Starting point: tag each word with the most frequent tag for it. Application of transformation rules: If certain conditions are met in the context of X t, then replace the tag of X t with X t = x. Construction of the rules through manual labor, supported by learning. Less robust, historical technology. 14th

15 Proper name recognition Named Entity Recognition Recognition of: person, company, place names (NAMEX), dates, times (TIMEX), percentages, amounts of money (NUMEX), genes, proteins, loci, viruses, diseases. Standardization of proper names (Named Entity Resolution): Identification of variants: DCAG = Daimler = DaimlerChrysler AG, illustration in taxonomy: Food / Fruit / Apple Company / ComputerHardware / Apple. 15th

16 NER - example 16

17 Difficulties The number of possible proper names is in principle infinitely large. A recognizer sees many proper names for the first time (e.g. 3,3'-diaminobenzidine tetrahydrochlorine). New proper names are constantly being invented (e.g. for new chemical compounds, new companies). Non-uniform variants of the name of an entity (HU, Humboldt-Uni, HU Berlin Humboldt University, Humboldt University Berlin). 17th

18 Named Entity Recognition: Methods Generation of Characteristics for Tokens: Usually a large number of characteristics are generated. Word itself, POS tag, orthographic features, gazetteer features. Recognition of proper names based on the following characteristics: Sliding Window, Hidden Markov Models, Conditional Random Fields. 18th

19 Named Entity Recognition Core: Gazetteer's lists of well-known names, tailored to the area of ​​application. Gazetteer is by no means sufficient. Apple can also mean fruit. Newly invented names do not appear. With Ca, fat genes can be meant, but do not have to be. 19th

20 Generation of features Bag-of-Word-Features: word is eel ,, word is zutzeln. Letter n-gram characteristics, word contains aaa ,, word contains zzz). Orthographic features: begins with capital letters ,, contains special characters, contains number, Gazetteer features: word appears in proper noun list 1. 20th

21 Methods: Sliding Window Features are generated in a window of the set that is placed around a token to be classified. With the help of these features, the classifier decides whether the central token is a proper name. Classifier: Mapping of feature vector to NER classes. Window is pushed over text. Training: The classifier is trained from pairs (vector, NER class). Examples are obtained from tagged example texts. 21

22 methods: Sliding Window, PER,, ... One feature vector per window position. Characteristics: One characteristic per word in the dictionary, POS, orthographic characteristics, element of the Gazetteer ?, 22

23 Methods: HMMs and CRFs b i (O t): Distribution over vector of the previously calculated features. Training by counting from tagged sample texts (no Baum-Welch required). See last lecture. 23

24 Parsing: PCFGs Probabilistic context-free grammar: Statistical language model. Hidden Markov model: probabilistic finite automaton probabilistic regular language Hidden Markov models are special PCFG. Underlying finite automaton replaced by context-free grammar. 24

25 PCFG: Definition of terminal symbols {w k}, k = 1 ... v. Nonterminal {N i}, i = 1 ... n. Start symbol N 1. Rules {N i ξ j}, where ξ j is a sequence of terminals and non-terminals. i j probabilities such that i: j P (N ξ) = 1 P (N i ξ j) stands for P (ξ j N i). Sentence to be parsed: w 1 ,, w m. W ab stands for w a ,, w b. N i * w a w b, if sentence w a w b can be derived from nonterminal N i. N i pq: w pq is derived from nonterminal N i. 25th

26 PCFG: Probability of a sentence Across all pars trees t: Example: P (S bla) = 1/3. P (S S S) = 2/3. P (blah blah blah) =? P (w1 m) = P (w1 m, t) t 26

27 PCFG: Elementary Questions PCFGs (like HMMs) answer three questions: 1. What is the probability of a sentence w 1m given a grammar? 2. What is the most likely pars tree given sentence and grammar? 3. Which probabilities for the grammar rules can we learn from a given training corpus? 27

28 Chomsky normal form Simplification here: Rules have one of the forms: N i N j N k, N i w j. For any context-free grammar, one can find a Chomsky normal grammar that produces the same language. (However, the same pars trees are not created). 28

29 Proabilistic Regular Grammars Special case: Every rule has one of the following forms: N i w j N k, N i w j. For each HMM there is a PCFG (special case PRG) that models the same probability distribution. Proof? 29

30 Forward / Backward Inside / Outside α (i) = P (O1, ..., O, qi λ) ttt = t β (i) P (O 1, ..., O qi, λ) t = t + T t = 30

31 Forward / Backward Inside / Outside j α j (p, q) = P (w1 (p 1), N pq, w (q + 1) m G) j β j (p, q) = P (w N, G ) pq pq 31

32 32 Inside / Outside Outside probability: Inside probability Probability of a word chain:) ,, (), () 1 (1) (1 G w N w P qpmqj pq pj + = α), (), (GN w P qpj pq pq β j =) (1,), (), () (* 1 1 TGN w PG w NPG w PTTTT = β = =

33 Calculation of the inside probability anchoring: Induction step: j β (pq,) = Pw (N, G) j pq pq = = = q 1 rs, d = pq 1 rs, d = pj β (kk,) = Pw ( N, G) jk kk = P (N w G) Pw (, N, w, NN, G) rsj pd pd (d + 1) q (d + 1) q pq jrs ((d + 1) q pq, pd, (d + 1) q, pq,) jk PN (, NN, GPw) (N, N, N, G) rsjjrs pd (d + 1) q pq pq pq pd (d + 1) q P w NNN w G q 1 rs, d = pq 1 rs, d = ps PN (, NN, GPw) (N, G) P w N, G rsjr pd (d + 1) q pq pd pd jrs PN (NN) β (pd, ) β (d + 1, q) rs ((d + 1) q (d + 1) q) 33

34 Calculation of the outside probability anchoring: Induction step: j α j = 1 (p 1) pq (q + 1) m (pq,) Pw (, N, w G) α (1, T) = 1; α (1, T) = 0 for j 1 1 mfjg = Pw (1 (p 1), w (q + 1) m, Npe, Npq, N (q + 1) e) jg, je = q + 1 p 1 fgj + Pw (1 (p 1), w (q + 1) m, Neq, Ne (p 1), Npq) jg, e = 1 = jmfjgfg Pw (1 (p 1), w (e + 1) m, Npe) PN (pqn (q + 1) e Npe) Pw (q + 1) e N (q + 1) e jg, je = q + 1 () p 1 fgjfg + Pw (1 (e 1), w (q + 1) m, Neq) PN (e (p 1), Npq Neq) Pw (e (pq) Ne (p 1)) jg, e = 1 mfjg = α f (pepn,) (NN) βg (q + 1, e) jg, je = q + 1 p 1 fgj + α f (, eqpn) (NN) βg (, ep 1) jg, e = 1 34

35 Most likely pars tree Viterbi algorithm for PCFG. O (m 3 n 3) δ i (p, q) = highest inside probability for a pars of the subtree N i pq. Initialization: Induction: i δ (p, p) = P (N w) Store best path: iijkiqj, kn, pr

36 PCFG: Learning from example corpora Problem: Grammar (CFG) must already be given, the learning algorithm only learns the probabilities PN (i NN j k) Learning from CFG: difficult (research topic). Learning of CFG + probabilities (complete learning of a PCFG from corpus: research topic. 1. If corpus is annotated with pars tree: counting the rule probabilities. 2. If corpus is not annotated: EM algorithm. 36

37 Penn Treebank S simple clause SBARQ Wh- question SQ Yes / no question ADJP adjective phrase NP noun phrase PP prepositional phrase QP quantifier phrase VP verb phrase HWNP Wh- noun phrase -SBJ subject -LOC location CONJP conjunction phrase FRAG fragment INTJ interjection LST list marker NAC not a constituent grouping NX nominal constituent inside NP PRN parenthitical PRT particle RRC reduced relative clause 37

38 Penn Treebank ((S (NP-SBJ The move (VP followed (NP (NP a round) (PP of (NP (NP similar increases) (PP by (NP other lenders))) (PP against (NP Arizona real estate loans ))))) 4.5 million words in total Building tree banks and learning PCFGs is more effective than developing grammars by hand

39 University of Potsdam Institute for Computer Science Chair of Machine Learning Machine Translation Tobias Scheffer Christoph Sawade

40 Greetings Hello dear course participants. Today we will deal with the topic of "machine translation" in the lecture. Translation processes make it possible for us to understand foreign language texts; However, they are not so useful for translating official documents, because they do not just work so furiously. This text has been translated from German into English and back into German. 40

41 Machine translation Given text in source language, generate equivalent text in target language. Applications: translation of websites (accuracy does not have to be very high). Enables understanding of a foreign language text, even if the translation is not accurate. Assistance, proposal for human translator. Correspondence from EU authorities is translated into all EU languages, but the level of accuracy required is higher than that of today's translation software. 41

42 Statistical translation Bayesian formula: Find arg max P (sentence sentence) sentence target source = arg max sentence P (sentence) () target source sentence target P sentence target translation model Analogy speech recognition: target language model sentence source signal; Sentence target searched phrase. Translation model acoustic model. Translation model: word-based, syntax-based, semantics-based Language model: N-gram model, PCFG, ... 42

43 Statistical Translation Decoding: Given the sentence source, find the sentence target. Search in space of all possible sentences, often beam search. Forms of the translation model: Word-based P (apple apple), Syntactic, P (English syntax tree, German syntax tree) Semantic, P (semantic attributes of the English sentence, semantic attributes of the German sentence). Learning the translation model: From a parallel corpus. Translated texts with explicit assignment of terms. 43

44 Decoding Definition of a search area: Every possible word sequence in the target language is an element of the search area. The empty sentence is the starting point. Evaluation function: P (sentence source sentence target) P (sentence target) Find the element in the search space with the highest evaluation! 44

45 Search Search is one of the central topics of AI. Complete / optimal algorithms always find / always the best solution (if it exists). Effort i.a. exponentially in search depth. A *, Branch & Bound. Heuristic, greedy algorithms do not always find the best solution, the search runs faster, and a good solution is often found quickly with good search heuristics. Beam search, gradient search. 45

46 Decoding with beam search. Initialize L = 0. Beam [L] = Empty block. Repeat: For all elements in Beam [L] and for the k most likely next words according to the translation model: Add word to sentence, add in Beam [L + 1]. L = L + 1. For all elements in Beam [L], compute P (SentenceSource SentenceTarget) P (SentenceTarget) Keep only the b best sentences in the beam, the others get lost. As long as the most likely last word = end of sentence. or other termination criterion. 46

47 translation models Interlingua semantic (knowledge representation) translation model language 1 (semantic interpretation) semantic translation model language 1 (syntactically parsed) language 1 (sequence of words) syntactic translation model word-based translation model semantic translation model language 2 (semantic interpretation) language 2 (syntactically parsed) language 2 ( Sequence of words) 47

48 Word-based translation model Model for P (words in source sentence, words in target sentence). P (sentence sentence) = P (Q sentence sentence) Source target j = 1 j Source target Which word in the source sentence corresponds to which word in the target sentence? Total over all assignments. l l m 1 P (sentence sentence) ... P (Q Z) = source destination j a Z a1 = 0 am = 0 j = 0 probabilities P (Q J Z i) must be learned from a parallel corpus. T j 48

49 N-gram translation model P (N-grams in the source sentence N-grams in the target sentence): T source target = j = 1 jj 1 j N + 1 target P (sentence sentence) P (QQ, ..., Q, sentence) Assignment of the N-grams: llm 1 P (sentence sentence) ... P (QQ, ..., Q, Z, ..., Z) = source destination jj 1 j N + 1 ajaj N + 1 Z a = 0 a = 0 j = 0 model must be learned from a parallel corpus. 1 m 49

50 Syntactic translation P (Syntactic construct in sentence source Syntactic construct in sentence target). P (sentence sentence) = P (sentence sentence, P, P) P (P, P) problems: source target P, P source target source target source target source ambiguities in parsing Goal Many syntactic constructs do not have equivalents in every language. It doesn't make the decoding problem any easier. 50

51 Semantic translation Semantic interpretation through parsing. Sentence in target language is generated from semantic representation. Problem: Semantic representation difficult term, parsing difficult problem. Direct translation not necessarily correct (idioms, metaphors, ...) 51

52 Training the translation model Training data: parallel corpora, e.g. Parliamentary speeches from Canada, EU, Hong Kong. Multilingual newspaper reports (somewhat less good). Religious texts (not good, texts inhomogeneous, translations rather free). The parallel corpus must be aligned, assignment of text positions as long as the terms correspond to each other. It can theoretically be done by hand, but there are also processes that align texts; so we get more training data. 52

53 Text-Alignment [According to] [our survey,] [1988] [sales of] [mineral water] [and soft drinks] [were much higher than in 1987], reflecting [the growing popularity of these products.] [Cola drink manufacturers] [in particular] achieved [above average] growth rates. [Our study] [according to] [sales of] [mineral water] [and soft drinks] [in 1988] [increased significantly compared to the previous year.] [Especially] [manufacturers of caffeinated soft drinks] benefited from [the growing popularity of their products] benefit [above average]. 53

54 Text alignment Sentence alignment: 1: 1 sentence in the source document corresponds to a sentence in the target document. n: m n sentences in the source document correspond to m sentences in the target document. n, m in the example area on slide 16 shows 2: 2 alignment. Different approaches to estimating the likelihood of an alignment: Length-based Letter-N-gram-based Lexical 54

55 Text Alignment Length-based Find alignment ((Q 1, Z 1), ..., (Q n, Z n)), where Q i and Z i contain 0, 1 or 2 sentences such that Q i and Z i are as equal as possible. The best alignment is determined with dynamic programming. Letter N-grams Find alignment in which the Q 1, Z 1 contain as many identical letter N-grams as possible. Lexically Based on word-to-word translation; Z i should contain as many literal translations from Q i as possible. 55

56 Learning the translation model Probabilities P (Q J Z i) must now be learned from a parallel corpus with aligned sentences. Q J, Z i: words, word n-grams, ... Problem: Estimating the translation probabilities requires word alignment. For word alignment one needs translation probabilities. EM algorithm. 56

57 Learning the translation model EM algorithm. Start with P (Q J Z i) = # sentence alignments with Q J in the source sentence and Z i in the target sentence / # sentence alignments with Z i in the target sentence. Repeat Compute probabilities for all possible word alignments for given sentence alignments. Estimate P (Q J Z i) based on the probabilities for word alignments. 57

58 Problems with translation Assumptions of independence in the language and translation model. Efficiency: Finding the Most Likely Target Set is a problem, despite heuristic searching. Non-local dependencies (context information) are a problem. No good translation model found yet. Language models are not yet good enough. Automatic translators are still not the best. 58

59 Evaluation of translators NIST (National Institute for Standards and Technology) machine translation evaluation plan. Annual competitions Evaluation Human translators look at the machine translations and award points. N-gram co-occurrence statistics: How many N-grams of a reference translation appear in the machine translation? 59