Overview of the parsing process
In order to parse a sentence, LEOPAR needs a lexicon and a grammar.
For Leopar, a lexicon is a function which maps a wordform to a set of hypertags which describe syntactic and morphological information. Due to both syntactic and morphological ambiguity, the lexicon can give an high number of hypertags (mainly for verbs). For instance, for the wordform donnes, the lexicon return 13 hypertags (one for the noun and 12 for the verbs). Here are two hypertags for the noun and for the verb:
A grammar is set of EPTD which are used to build parse trees during the parsing process. The link with wordforms of the lexicon is handle trough the interface. Each EPTD is given with an hypertag (names its interface) which describes the syntactic and morphological constraints that a wordform must verifies to be associated with the EPTD.
An example of EPTD from ditransitive verb:
The link between lexicon and grammar is controlled by feature structures unification: if the lexicon hypertag unifies with some interface, then an anchored EPTD is produced with the wordform as anchor. For instance, the verbal hypertag for donnes above and the interface are unifiable. Hence the wordform donnes is associated with the EPTD grammatical EPTD below. You can notice that with value sharing (numbers between square brackets), some feature values are updated during the anchoring step (in the yellow anchor node for instance). The resulting EPTD is:
When LEOPAR receives an input string, it is tokenized by the toktok tool which returns a DAG (directed acyclic graph) where each path represents a possible tokenization of the input sentence.
In fact, Toktok returns a DAG of hypertag. So with the anchoring mechanism described above, the first step in LEOPAR is to produce an EPTD DAG. For the sentence,
<font color="green"><i>"Au cours du matin, il neige."</i></font>, the EPTD DAG (with more than 200000 paths) is:
In the raw DAG above, the number of paths grows very quickly. So before entering deep parse, we try to detect as many bad paths as possible, mainly using two kinds of method: polarity filtering and companion filtering.
This method consists in removing paths such that global the global sum of active polarity is not neutral. A path where to EPTD provides an NP and only one NP is expected can not be parsed. This method is described in Coling04.
After polarity filtering, there are 150 paths left:
For each active polarized feature in an EPTD, we call its companions the set of EPTD in the grammar which are able to saturate it. Companions can be computed statically for the whole grammar.
Now, at parsing time, for a given sentence, if some polarized feature cannot find at least one of these companions, we can trough it away. This method is described in details in IWPT 2009.
For our example sentence, the companion filter takes the number of paths from 200000 down to 1752.
Make them collaborate
Fortunately, the two kinds of filtering are different enough and it is interesting to use them both. On the same sentence as above. Combining the two filtering methods keep only 11 paths for the deep parsing step. Those path are given by the DAG below:
on the DAG obtained above, the deep parsing is done by successive use of the node merging operation. Starting from an empty structure, the model is built step by step with two atomic steps:
- shift: add the next word to the current structure
- reduce: if the number of active polarity is above a given bound or if there is no more words to shift, try to saturate some of the active polarity in the current structure.
Finally, the deep parsing produces parse trees like the one below:
Form the parsing process, a dependency structure is extracted (see here for more information). The structures produced are with or without epsilon: