Protocols
Module for implementation-agnostic interface definitions that are shared across submodules.
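As a rough illustration of what "implementation-agnostic" means here, the sketch below uses Python's `typing.Protocol` for structural typing. The names `HasText`, `SimpleToken` and `shout` are hypothetical and not part of this module; `HasText` is only a cut-down stand-in for the protocols documented below.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class HasText(Protocol):
    """Hypothetical, cut-down stand-in for a token protocol (text only)."""

    @property
    def text(self) -> str: ...


class SimpleToken:
    """Hypothetical backend token; note that it does not inherit from HasText."""

    def __init__(self, text: str) -> None:
        self._text = text

    @property
    def text(self) -> str:
        return self._text


def shout(token: HasText) -> str:
    # Depends only on the interface, not on a concrete NLP library.
    return token.text.upper()


token = SimpleToken("limes")
result = shout(token)
is_token_like = isinstance(token, HasText)
```

Any class that exposes the required members satisfies the protocol, so code written against the protocol should not need to change when the underlying NLP backend does.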
TokenProtocol
Bases: Protocol
A token in a text. The token contains both its string literal (i.e. the word) and its metadata (e.g. its part of speech, its position in the sentence etc.).
Source code in src/limes/protocols.py
text
property
The text of the token.
morph
property
The morphological analysis of the token.
pos_
property
The part-of-speech tag of the given token (based on the Universal POS tagset).
fine_pos
property
The part-of-speech tag of the given token based on a language-specific tagset, if one is available for the given language.
dep_
property
The dependency tag of the given token.
lemma_
property
The lemma of the word contained within the given token.
i
property
The index of the token within the context of the document that contains it, where a document can be considered as a list of tokens.
is_punct
property
Whether or not the given token is punctuation.
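The flat, per-token properties above can be combined in simple filters. The following is a minimal sketch that assumes a hypothetical `TokenRecord` dataclass with hand-written values; it is not an implementation shipped with this module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRecord:
    """Hypothetical, hand-annotated token; the values are illustrative only."""

    text: str
    pos_: str
    lemma_: str
    i: int
    is_punct: bool


def content_lemmas(tokens):
    """Return the lemmas of all non-punctuation tokens, in document order."""
    return [tok.lemma_ for tok in tokens if not tok.is_punct]


tokens = [
    TokenRecord("Cats", "NOUN", "cat", 0, False),
    TokenRecord("sleep", "VERB", "sleep", 1, False),
    TokenRecord(".", "PUNCT", ".", 2, True),
]
lemmas = content_lemmas(tokens)
```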
head
property
The syntactic parent of the given token.
children
property
All tokens that constitute descendants of the given token in the dependency tree of the document.
ancestors
property
All tokens that constitute ancestors of the given token in the dependency tree of the document.
subtree
property
The given token as well as all its descendants in the dependency tree of the given document.
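The tree-related properties (`head`, `children`, `ancestors`, `subtree`) can be pictured with a toy dependency tree. This sketch rests on several assumptions that the text above does not confirm: `DepToken` and `attach` are hypothetical, the root's `head` is `None`, `children` yields direct dependents only, and `subtree` is returned in document order.

```python
from __future__ import annotations

from typing import Iterator, Optional


class DepToken:
    """Hypothetical token holding just enough state for a dependency tree."""

    def __init__(self, text: str, i: int) -> None:
        self.text = text
        self.i = i
        self.head: Optional[DepToken] = None  # assumption: the root has no head
        self._children: list[DepToken] = []

    def attach(self, child: DepToken) -> None:
        """Make `child` a direct dependent of this token."""
        child.head = self
        self._children.append(child)

    @property
    def children(self) -> Iterator[DepToken]:
        # Assumption: direct dependents only.
        yield from self._children

    @property
    def ancestors(self) -> Iterator[DepToken]:
        # Walk up the chain of heads until the root is passed.
        node = self.head
        while node is not None:
            yield node
            node = node.head

    @property
    def subtree(self) -> list[DepToken]:
        # The token itself plus all descendants, sorted by document index.
        nodes = [self]
        for child in self._children:
            nodes.extend(child.subtree)
        return sorted(nodes, key=lambda tok: tok.i)


# "She reads old books": "reads" is the root, "old" modifies "books".
she, reads, old, books = (
    DepToken(word, idx) for idx, word in enumerate(["She", "reads", "old", "books"])
)
reads.attach(she)
reads.attach(books)
books.attach(old)
```

With this tree, `old.ancestors` walks from `books` up to `reads`, and `books.subtree` contains `old` and `books`.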
__str__()
__eq__(other)
Evaluate equality between the given TokenProtocol instance and an arbitrary other object.
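One common way to support comparison against arbitrary objects is to return `NotImplemented` for foreign types. The sketch below assumes that two tokens are equal when their text and index match; the documentation above does not specify the criterion, and `ComparableToken` is a hypothetical class.

```python
class ComparableToken:
    """Hypothetical token; equality is assumed to depend on text and index."""

    def __init__(self, text: str, i: int) -> None:
        self.text = text
        self.i = i

    def __str__(self) -> str:
        return self.text

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ComparableToken):
            # Defer to Python's fallback, which evaluates to False here.
            return NotImplemented
        return (self.text, self.i) == (other.text, other.i)

    def __hash__(self) -> int:
        # Defining __eq__ disables the inherited __hash__, so provide a matching one.
        return hash((self.text, self.i))


same = ComparableToken("limes", 0) == ComparableToken("limes", 0)
different_index = ComparableToken("limes", 0) == ComparableToken("limes", 1)
against_string = ComparableToken("limes", 0) == "limes"
```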
SpanProtocol
Bases: Protocol
A span of tokens in a text. The span contains references to the tokens it contains, as well as information about noun chunks that are part of the given span.
text
property
The actual text of the tokens contained within the span.
noun_chunks
property
All noun chunks contained in the given span; a noun chunk is another span consisting of one or more nouns and, optionally, adjectives and/or auxiliary verbs.
__str__()
__iter__()
Iterate over the Span, one token at a time. Iteration happens in the direction common in reading the language (e.g. "left to right" in German or English).
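Token-wise iteration over a span might look like the following sketch. `ToySpan` is hypothetical and stores plain strings instead of token objects; joining the tokens with single spaces for `text` is likewise an assumption.

```python
from typing import Iterator, Sequence


class ToySpan:
    """Hypothetical span backed by a plain sequence of token strings."""

    def __init__(self, tokens: Sequence[str]) -> None:
        self._tokens = list(tokens)

    @property
    def text(self) -> str:
        # Assumption: tokens are joined by single spaces.
        return " ".join(self._tokens)

    def __str__(self) -> str:
        return self.text

    def __iter__(self) -> Iterator[str]:
        # Yields tokens in their stored (reading) order.
        return iter(self._tokens)


span = ToySpan(["the", "Roman", "frontier"])
tokens_in_order = [tok for tok in span]
```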
DocumentProtocol
Bases: Protocol
A document of text. The document contains both its string literal (i.e. its text) as well as related morphosyntactic metadata.
text
property
The text contained in the given document.
noun_chunks
property
All noun chunks contained in the given document; a noun chunk is a span consisting of one or more nouns and, optionally, adjectives and/or auxiliary verbs.
sents
property
All sentences contained in the provided document. The class implementing DocumentProtocol must provide the sentence segmentation logic itself. Each sentence is itself considered to be a document.
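To illustrate the idea that each sentence is again a document, here is a deliberately naive sketch. `SentDocument` is hypothetical, and splitting after ".", "!" and "?" tokens is an assumed placeholder rule; a real implementation would more likely delegate to its NLP backend.

```python
from __future__ import annotations

from typing import Iterator


class SentDocument:
    """Hypothetical document backed by a list of token strings."""

    def __init__(self, tokens: list[str]) -> None:
        self._tokens = tokens

    @property
    def text(self) -> str:
        # Assumption: tokens are joined by single spaces.
        return " ".join(self._tokens)

    @property
    def sents(self) -> Iterator[SentDocument]:
        start = 0
        for idx, tok in enumerate(self._tokens):
            if tok in {".", "!", "?"}:
                # Each sentence is wrapped in a document of its own.
                yield SentDocument(self._tokens[start : idx + 1])
                start = idx + 1
        if start < len(self._tokens):
            # Trailing tokens without a terminator form a final sentence.
            yield SentDocument(self._tokens[start:])


doc = SentDocument(["It", "rains", ".", "We", "stay", "in", "."])
sentences = list(doc.sents)
sentence_texts = [sent.text for sent in sentences]
```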
__str__()
__iter__()
Iterate over the document, one token at a time. Iteration happens in the direction common in reading the language (e.g. "left to right" in German or English).
__getitem__(i)
__len__()
The length of the document, as counted by the number of tokens (i.e. distinct words or punctuation marks) contained within it.
span(start_idx, end_idx)
Create a span of all tokens between the start index and the end index provided.
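The sequence-like methods of a document can be sketched as below. `IndexedDocument` is hypothetical, a plain list stands in for the returned span, and the half-open range `[start_idx, end_idx)` is an assumption borrowed from Python slicing; the description above does not say whether the end index is inclusive.

```python
from typing import Iterator, List


class IndexedDocument:
    """Hypothetical document backed by a list of token strings."""

    def __init__(self, tokens: List[str]) -> None:
        self._tokens = list(tokens)

    def __iter__(self) -> Iterator[str]:
        return iter(self._tokens)

    def __getitem__(self, i: int) -> str:
        return self._tokens[i]

    def __len__(self) -> int:
        # Counts tokens, i.e. words and punctuation marks alike.
        return len(self._tokens)

    def span(self, start_idx: int, end_idx: int) -> List[str]:
        # ASSUMPTION: end_idx is exclusive, as in Python slicing.
        return self._tokens[start_idx:end_idx]


doc = IndexedDocument(["Limes", "marks", "the", "border", "."])
length = len(doc)
third_token = doc[2]
noun_phrase = doc.span(2, 4)
```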