Skip to content

spaCy-based Parser Implementation

This section covers the spaCy-based implementation of the Parser interface, as well as the required implementation of the respective text element Protocols.

Bases: Parser

A Parser that performs morphosyntactic analysis on a raw string and returns an instance of a concrete implementation of the DocumentProtocol.

Source code in src/limes/parsers/spacy_parser.py
class SpacyParser(Parser):
    """
    A Parser that performs morphosyntactic analysis on a raw string and returns
    an instance of a concrete implementation of the `DocumentProtocol`.
    """

    def __init__(self, model: Language | str = "de_dep_news_trf"):
        if isinstance(model, str):
            try:
                self._nlp = spacy.load(model)
            except OSError as error:
                raise ValueError(
                    f"Model '{model}' not found. Try downloading it by running "
                    f"`python -m spacy download {model}`."
                ) from error
        else:
            self._nlp = model
        if "sentencizer" not in [step[0] for step in self._nlp.pipeline]:
            self._nlp.add_pipe("sentencizer")

    def __call__(self, text: str) -> DocumentProtocol:
        """
        Create a SpacyDocument out of the provided string.
        """
        return SpacyDocumentAdapter(self._nlp(text))

__call__(text)

Create a SpacyDocument out of the provided string.

Source code in src/limes/parsers/spacy_parser.py
def __call__(self, text: str) -> DocumentProtocol:
    """
    Create a SpacyDocument out of the provided string.
    """
    return SpacyDocumentAdapter(self._nlp(text))

Protocol Adapters

Bases: DocumentProtocol

Source code in src/limes/parsers/spacy_parser.py
class SpacyDocumentAdapter(DocumentProtocol):
    def __init__(self, doc: SpacyDoc):
        self._doc = doc

    def __repr__(self) -> str:
        return f"SpacyDocumentAdapter({self.text})"

    def __str__(self) -> str:
        return self.text

    @property
    def text(self) -> str:
        """The text contained in the given document."""
        return self._doc.text

    @property
    def noun_chunks(self) -> Iterable[SpacySpanAdapter]:
        """
        All noun chunks contained in the given document; a noun chunk is a span
        consisting of one or more nouns and - optionally - adjectives and/or
        auxiliary verbs.
        """
        return (SpacySpanAdapter(nc) for nc in self._doc.noun_chunks)

    @property
    def sents(self) -> Iterator["SpacyDocumentAdapter"]:
        """
        All sentences contained in the provided document. Any sentence is
        considered to be a document.
        """
        yield from (
            SpacyDocumentAdapter(sent.as_doc()) for sent in self._doc.sents
        )

    def __iter__(self) -> Iterator[SpacyTokenAdapter]:
        """
        Iterate over the document, one token at a time. Iteration happens in
        the direction common in reading the language (e.g. "left to right" in
        German or English).
        """
        yield from (SpacyTokenAdapter(tok) for tok in self._doc)

    def __getitem__(self, i) -> SpacyTokenAdapter:
        """Retrieve the i-th token in the provided document."""
        return SpacyTokenAdapter(self._doc[i])

    def __len__(self) -> int:
        """
        The length of the document, as counted by the number of tokens (i.e.
        distinct words or punctuation marks) contained within it.
        """
        return len(self._doc)

    def span(self, start_idx: int, end_idx: int) -> SpacySpanAdapter:
        """
        Create a span of all tokens between the start index and the end index
        provided.
        """
        return SpacySpanAdapter(self._doc[start_idx:end_idx])

text property

The text contained in the given document.

noun_chunks property

All noun chunks contained in the given document; a noun chunk is a span consisting of one or more nouns and - optionally - adjectives and/or auxiliary verbs.

sents property

All sentences contained in the provided document. Any sentence is considered to be a document.

__iter__()

Iterate over the document, one token at a time. Iteration happens in the direction common in reading the language (e.g. "left to right" in German or English).

Source code in src/limes/parsers/spacy_parser.py
def __iter__(self) -> Iterator[SpacyTokenAdapter]:
    """
    Iterate over the document, one token at a time. Iteration happens in
    the direction common in reading the language (e.g. "left to right" in
    German or English).
    """
    yield from (SpacyTokenAdapter(tok) for tok in self._doc)

__getitem__(i)

Retrieve the i-th token in the provided document.

Source code in src/limes/parsers/spacy_parser.py
def __getitem__(self, i) -> SpacyTokenAdapter:
    """Retrieve the i-th token in the provided document."""
    return SpacyTokenAdapter(self._doc[i])

__len__()

The length of the document, as counted by the number of tokens (i.e. distinct words or punctuation marks) contained within it.

Source code in src/limes/parsers/spacy_parser.py
def __len__(self) -> int:
    """
    The length of the document, as counted by the number of tokens (i.e.
    distinct words or punctuation marks) contained within it.
    """
    return len(self._doc)

span(start_idx, end_idx)

Create a span of all tokens between the start index and the end index provided.

Source code in src/limes/parsers/spacy_parser.py
def span(self, start_idx: int, end_idx: int) -> SpacySpanAdapter:
    """
    Create a span of all tokens between the start index and the end index
    provided.
    """
    return SpacySpanAdapter(self._doc[start_idx:end_idx])

Bases: SpanProtocol

Source code in src/limes/parsers/spacy_parser.py
class SpacySpanAdapter(SpanProtocol):
    def __init__(self, span: SpacySpan):
        self._span = span

    def __repr__(self) -> str:
        return f"SpacySpanAdapter({self.text})"

    def __str__(self) -> str:
        return self.text

    @property
    def text(self) -> str:
        """The actual text of the tokens contained within the span."""
        return self._span.text

    @property
    def noun_chunks(self) -> Iterable["SpacySpanAdapter"]:
        """
        All noun chunks contained in the given span; a noun chunk is another
        span consisting of one or more nouns and - optionally - adjectives
        and/or auxiliary verbs.
        """
        # spaCy Span.noun_chunks yields raw SpacySpan
        return (SpacySpanAdapter(nc) for nc in self._span.noun_chunks)

    def __iter__(self) -> Iterator[SpacyTokenAdapter]:
        """
        Iterate over the Span, one token at a time. Iteration happens in
        the direction common in reading the language (e.g. "left to right" in
        German or English).
        """
        # iterate over TokenAdapter
        yield from (SpacyTokenAdapter(tok) for tok in self._span)

    def __getitem__(self, i: int) -> SpacyTokenAdapter:
        """Retrieve the i-th token in the provided span."""
        return SpacyTokenAdapter(self._span[i])

text property

The actual text of the tokens contained within the span.

noun_chunks property

All noun chunks contained in the given span; a noun chunk is another span consisting of one or more nouns and - optionally - adjectives and/or auxiliary verbs.

__iter__()

Iterate over the Span, one token at a time. Iteration happens in the direction common in reading the language (e.g. "left to right" in German or English).

Source code in src/limes/parsers/spacy_parser.py
def __iter__(self) -> Iterator[SpacyTokenAdapter]:
    """
    Iterate over the Span, one token at a time. Iteration happens in
    the direction common in reading the language (e.g. "left to right" in
    German or English).
    """
    # iterate over TokenAdapter
    yield from (SpacyTokenAdapter(tok) for tok in self._span)

__getitem__(i)

Retrieve the i-th token in the provided span.

Source code in src/limes/parsers/spacy_parser.py
def __getitem__(self, i: int) -> SpacyTokenAdapter:
    """Retrieve the i-th token in the provided span."""
    return SpacyTokenAdapter(self._span[i])

Bases: TokenProtocol

A token in a text. The token contains both of its string literal (i.e. the word) as well as its metadata (e.g. its part of speech, its position in the sentence etc.). The TokenAdapter wraps around a spaCy Token object to only expose attributes that adhere to the

Source code in src/limes/parsers/spacy_parser.py
class SpacyTokenAdapter(TokenProtocol):
    """
    A token in a text. The token contains both of its string literal (i.e. the
    word) as well as its metadata (e.g. its part of speech, its position in the
    sentence etc.).
    The TokenAdapter wraps around a spaCy Token object to only expose
    attributes that adhere to the
    """

    def __init__(self, token: SpacyToken):
        self._token = token

    def __repr__(self) -> str:
        return f"SpacyToken({str(self)})"

    def __str__(self) -> str:
        """The text of the token."""
        return self._token.text

    @property
    def text(self) -> str:
        """The text of the token."""
        return self._token.text

    @property
    def morph(self) -> dict[str, str]:
        """The morphological analysis of the string token."""
        return self._token.morph.to_dict()

    @property
    def pos_(self) -> str:
        """
        The part-of-speech tag of the given token (based on the Universal POS
        tagset).
        """
        return self._token.pos_

    @property
    def fine_pos(self) -> str:
        """
        The part-of-speech tag of the given token based on a language-specific
        tagset - if available for the given language.
        """
        return self._token.tag_

    @property
    def dep_(self) -> str:
        """The dependency tag of the given token."""
        return self._token.dep_

    @property
    def lemma_(self) -> str:
        """The lemma of the word contained within the given token."""
        # Added robustness in cases where spacy fails to correctly identify the
        # lemma.
        if self._token.lemma_ == "--":
            return str(self._token).lower()
        return self._token.lemma_.lower()

    @property
    def i(self) -> int:
        """
        The index of the token within the context of the document that contains
        it, where a document can be considered as a list of tokens.
        """
        return self._token.i

    @property
    def head(self) -> "SpacyTokenAdapter":
        """The syntactic parent of the given token."""
        return SpacyTokenAdapter(self._token.head)

    @property
    def is_punct(self) -> bool:
        """Whether or not the given token is punctuation."""
        return self._token.is_punct

    @property
    def children(self) -> Iterable["SpacyTokenAdapter"]:
        """
        All tokens that constitutes descendants of the given token in the
        dependency tree of the document.
        """
        return (SpacyTokenAdapter(c) for c in self._token.children)

    @property
    def ancestors(self) -> Iterable["SpacyTokenAdapter"]:
        """
        All tokens that constitute ancestors of the given token in the
        dependency tree of the document.
        """
        return (SpacyTokenAdapter(c) for c in self._token.ancestors)

    @property
    def subtree(self) -> Iterable["SpacyTokenAdapter"]:
        """
        The given token as well as all its descendants in the dependency tree
        of the given document.
        """
        return (SpacyTokenAdapter(c) for c in self._token.subtree)

text property

The text of the token.

morph property

The morphological analysis of the string token.

pos_ property

The part-of-speech tag of the given token (based on the Universal POS tagset).

fine_pos property

The part-of-speech tag of the given token based on a language-specific tagset - if available for the given language.

dep_ property

The dependency tag of the given token.

lemma_ property

The lemma of the word contained within the given token.

i property

The index of the token within the context of the document that contains it, where a document can be considered as a list of tokens.

head property

The syntactic parent of the given token.

is_punct property

Whether or not the given token is punctuation.

children property

All tokens that constitutes descendants of the given token in the dependency tree of the document.

ancestors property

All tokens that constitute ancestors of the given token in the dependency tree of the document.

subtree property

The given token as well as all its descendants in the dependency tree of the given document.

__str__()

The text of the token.

Source code in src/limes/parsers/spacy_parser.py
def __str__(self) -> str:
    """The text of the token."""
    return self._token.text