Universal Parser

Note

You must install the reader, pdf and codes dependencies from the Optional dependencies section.

The first step in using the Universal Parser to automatically extract information from PDF files is to create a new class that inherits from UniversalParser.

Next, if the parser does not work properly or you would like it to behave differently, you can implement an extension, see UniversalParserExtension. Extensions are passed to the constructor of your Universal Parser and are then selected automatically based on the context value.

Finally, implement the parse_body() method, which contains the overall parsing logic.

class aiviro.modules.universal_parser.UniversalParser(extension_classes: List[Type[UniversalParserExtension]] | None = None)

Main class to inherit from for defining your parsing logic of Universal Parser.

Parameters:: extension_classes – List of universal-parser extensions, see UniversalParserExtension
Example:

>>> import datetime
>>> from aiviro.modules.universal_parser import UniversalParser, DocumentType
>>> from dataclasses import dataclass
>>> from decimal import Decimal
>>> from typing import Optional
>>>
>>> @dataclass
... class ParserOutput:
...     vat_supplier: Optional[str] = None
...     invoice_date: Optional[datetime.date] = None
...     total_amount: Optional[Decimal] = None
...
>>> class MyParser(UniversalParser[ParserOutput]):
...     def parse_body(self, pdf_r: "PDFRobot") -> ParserOutput:
...         output = ParserOutput()
...         output.vat_supplier = self.parse_vat_supplier_number(DocumentType.INVOICE)
...         output.total_amount = self.parse_total_amount().total_amount
...         # for a specific vat-number, change parsing logic to other extension-parser,
...         # which is defined by a context value: "my-extension"
...         if output.vat_supplier == "CZ1234567890":
...             self.context = "my-extension"
...
...         output.invoice_date = self.parse_invoice_date()
...         return output

>>> import aiviro
>>> from aiviro.modules.pdf import create_pdf_robot
>>> from aiviro.modules.universal_parser import CSVWriter, QRData
>>> # use defined Universal Parser in a scenario
>>> if __name__ == "__main__":
...     aiviro.init_logging()
...
...     r = create_pdf_robot("path/to/file.pdf")
...     csv_writer = CSVWriter(r)
...     # MyParser is defined above, MyExtension in the UniversalParserExtension example
...     uni_parser = MyParser([MyExtension])
...     res = uni_parser.parse(
...         pdf_r=r,
...         qr_data=QRData(r).extract(),
...         csv_writer=csv_writer,
...     )
...     print(res)
...     # ParserOutput(
...     #   vat_supplier='CZ1234567890',
...     #   invoice_date=datetime.date(2020, 1, 21),
...     #   total_amount=Decimal('12100')
...     # )

property default_parser_extension_class: Type[UniversalParserExtension]

Extension class that provides the default behaviour when no context is set.

Override this property to change the default behaviour.

property context: str

Currently set context for selection of a universal-parser extension.

Getter:: Returns current context
Setter:: Sets new context and selects appropriate parser
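
The selection mechanism can be pictured as a lookup from context name to extension class. The sketch below is not Aiviro's implementation: `ExtensionSketch`, `FixedDateExtension` and `ParserSketch` are hypothetical stand-ins that only mirror the documented behaviour (extensions passed to the constructor, a default used when no context is set, and a `context` setter that switches the active extension). The fallback to the default for an unknown context name is an assumption.

```python
from typing import Dict, List, Optional, Type


class ExtensionSketch:
    """Hypothetical stand-in for UniversalParserExtension."""

    @classmethod
    def get_context_name(cls) -> str:
        return "default"

    def parse_invoice_date(self) -> Optional[str]:
        return None  # the default logic finds nothing in this toy example


class FixedDateExtension(ExtensionSketch):
    @classmethod
    def get_context_name(cls) -> str:
        return "my-extension"

    def parse_invoice_date(self) -> Optional[str]:
        return "2020-01-21"


class ParserSketch:
    """Hypothetical stand-in for UniversalParser, showing only the dispatch."""

    def __init__(self, extension_classes: Optional[List[Type[ExtensionSketch]]] = None):
        # Index the provided extension classes by their context name.
        self._registry: Dict[str, Type[ExtensionSketch]] = {
            cls.get_context_name(): cls for cls in (extension_classes or [])
        }
        self._context = ""
        self._active: ExtensionSketch = ExtensionSketch()

    @property
    def context(self) -> str:
        return self._context

    @context.setter
    def context(self, value: str) -> None:
        # Assumption: unknown context names fall back to the default extension.
        self._context = value
        self._active = self._registry.get(value, ExtensionSketch)()

    def parse_invoice_date(self) -> Optional[str]:
        # Every parse_* call is delegated to the currently active extension.
        return self._active.parse_invoice_date()


parser = ParserSketch([FixedDateExtension])
before = parser.parse_invoice_date()
parser.context = "my-extension"
after = parser.parse_invoice_date()
```

Setting the context inside parse_body(), as in the MyParser example, therefore only affects the parse_* calls made after the assignment.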

parse_currency() → str | None: Parse type of currency.

parse_invoice_date() → date | None: Parse invoice date.

parse_invoice_number() → str | None: Parse invoice number.

parse_order_number(primary_regex: List[str] | None = None, use_predefined_formats: bool = True, primary_keywords: List[str] | None = None) → List[str]

Parse order number.

Parameters:

primary_regex – Primary format options of order-number
use_predefined_formats – If True, predefined formats are also considered as possible order-number formats
primary_keywords – Keywords that are used at first to search for the order number.

Example:

>>> from aiviro.modules.universal_parser.constants.keywords_regex import OrderNumberRegex
>>> from aiviro.modules.universal_parser.constants.keywords_invoice import OrderNumberKeywords
>>> parser = MyParser()
>>> parser.parse_order_number(
...     primary_regex=OrderNumberRegex.PRIMARY_REGEX_9SLASH3 + OrderNumberRegex.PRIMARY_REGEX_3SLASH9,
...     primary_keywords=OrderNumberKeywords.ORDER_NUMBER_DIRECT_SUBSCRIBER_SUBSTRING,
... )
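
To illustrate how primary patterns and predefined formats might interact, here is a minimal sketch using only the standard `re` module. The two pattern lists and the function `find_order_numbers` are hypothetical; the actual constants in OrderNumberRegex and the library's matching order may differ.

```python
import re
from typing import List, Optional

# Hypothetical patterns; the real constants in OrderNumberRegex may differ.
PRIMARY_9SLASH3 = [r"\b\d{9}/\d{3}\b"]
PREDEFINED_FORMATS = [r"\bPO-\d{6,10}\b"]


def find_order_numbers(
    text: str,
    primary_regex: Optional[List[str]] = None,
    use_predefined_formats: bool = True,
) -> List[str]:
    """Collect matches of the primary patterns first, then of the predefined ones."""
    patterns = list(primary_regex or [])
    if use_predefined_formats:
        patterns += PREDEFINED_FORMATS
    found: List[str] = []
    for pattern in patterns:
        for match in re.findall(pattern, text):
            if match not in found:  # keep order, drop duplicates
                found.append(match)
    return found


text = "Order no. 123456789/001, reference PO-45000123"
only_primary = find_order_numbers(text, PRIMARY_9SLASH3, use_predefined_formats=False)
with_predefined = find_order_numbers(text, PRIMARY_9SLASH3)
```

With `use_predefined_formats=False` only the primary format is reported; otherwise the predefined matches follow the primary ones in the returned list.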

parse_subscriber_id(doc_type: DocumentType | None = None, known_identifiers: OptionalListType = None) → str | None

Parse subscriber id.

Parameters:

doc_type – Type of document, see DocumentType
known_identifiers – List of expected subscriber ids

parse_supplier_id(doc_type: DocumentType | None = None, known_identifiers: OptionalListType = None) → str | None

Parse supplier id.

Parameters:

doc_type – Type of document, see DocumentType
known_identifiers – List of ids to exclude from search

parse_taxable_date() → date | None: Parse taxable date.

parse_due_date() → date | None: Parse due date.

parse_total_amount(include_amount_without_vat: bool = True) → AmountValues

Parse total amount with and without VAT.

Parameters:: include_amount_without_vat – Option to also parse total amount without vat
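
The two amounts are related through the VAT rate, which can be handy for sanity-checking parsed values. The sketch below only shows that arithmetic; it does not describe how the library obtains the values. `AmountSketch` is a hypothetical stand-in for AmountValues, and the field name `total_amount_without_vat` is an assumption (only `total_amount` appears in this documentation).

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class AmountSketch:
    """Hypothetical stand-in for AmountValues."""

    total_amount: Optional[Decimal] = None
    total_amount_without_vat: Optional[Decimal] = None  # assumed field name


def amount_without_vat(total_with_vat: Decimal, vat_rate_percent: int) -> Decimal:
    """Derive the net amount from the gross amount and a VAT rate in percent."""
    net = total_with_vat * Decimal(100) / Decimal(100 + vat_rate_percent)
    return net.quantize(Decimal("0.01"))


amounts = AmountSketch(total_amount=Decimal("12100"))
amounts.total_amount_without_vat = amount_without_vat(amounts.total_amount, 21)
```

For a gross amount of 12100 and a 21 % rate, the derived net amount is 10000.00.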

parse_variable_symbol() → str | None: Parse variable symbol.

parse_vat_rate() → int | None: Parse vat rate.

parse_vat_subscriber_number(doc_type: DocumentType | None = None, known_identifiers: OptionalListType = None) → str | None

Parse subscriber vat number.

Parameters:

doc_type – Type of document, see DocumentType
known_identifiers – List of expected subscriber vat numbers

parse_vat_supplier_number(doc_type: DocumentType | None = None, known_identifiers: OptionalListType = None) → str | None

Parse supplier vat number.

Parameters:

doc_type – Type of document, see DocumentType
known_identifiers – List of vat numbers to exclude from search
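
Note that known_identifiers has opposite meanings for the supplier and subscriber methods: for suppliers it lists values to exclude, for subscribers it lists expected values. The following sketch illustrates the difference on plain strings. Both functions are hypothetical, and the fallback to the first candidate in `pick_subscriber` is an assumption, not documented behaviour.

```python
from typing import List, Optional


def pick_supplier(candidates: List[str], known_identifiers: Optional[List[str]] = None) -> Optional[str]:
    """Return the first candidate that is NOT one of the excluded identifiers."""
    excluded = set(known_identifiers or [])
    return next((c for c in candidates if c not in excluded), None)


def pick_subscriber(candidates: List[str], known_identifiers: Optional[List[str]] = None) -> Optional[str]:
    """Prefer a candidate that IS among the expected identifiers, else the first one."""
    expected = set(known_identifiers or [])
    preferred = next((c for c in candidates if c in expected), None)
    if preferred is not None:
        return preferred
    return candidates[0] if candidates else None  # assumed fallback


found_on_page = ["CZ11111111", "CZ22222222"]  # made-up VAT numbers
own_company = ["CZ11111111"]
supplier = pick_supplier(found_on_page, known_identifiers=own_company)
subscriber = pick_subscriber(found_on_page, known_identifiers=own_company)
```

Passing your own company's VAT number to both calls thus steers the supplier search away from it and the subscriber search towards it.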

parse_bank_account() → BankAccountValues: Parse bank account information. It parses bank account number, IBAN, SWIFT and bank name.

parse_company_name(company_ico: str = '') → str | None

Parse company name from company IČO, using ARES. Therefore, only Czech companies are supported.

Parameters:: company_ico – Czech IČO

parse_document_items(split_condition: BaseSplitCondition | None = None) → Tuple[DocumentItemsHeader | None, List[DocItem]]

Detects and parses items in the document.

Parameters:: split_condition – Condition to split items, see BaseSplitCondition
Returns:: Tuple of the detected items header (or None) and a list of items containing all their text-boxes and overall area, see DocItem

Figure: Example of document items detection

abstract parse_body(pdf_r: PDFRobot) → T: Method containing main logic of the universal parser.

final parse(pdf_r: PDFRobot, qr_data: QRData | None = None, csv_writer: CSVWriter | None = None) → T

Main method to call to parse a provided pdf-file.

Parameters:

pdf_r – PDF Robot with a loaded pdf-file
qr_data – QR data extracted from the pdf-file, see QRData
csv_writer – Writer to automatically save parsed values
**kwargs – Any additional arguments that are passed into a parse_body() method
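
The split between a final parse() and an abstract parse_body() resembles the template-method pattern. A minimal sketch of that structure follows, assuming nothing about Aiviro's internals: `ParserTemplateSketch`, `Output` and `InvoiceParser` are hypothetical names, and a plain string stands in for the PDF robot.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar, final

T = TypeVar("T")


class ParserTemplateSketch(ABC, Generic[T]):
    """Hypothetical stand-in for UniversalParser, reduced to parse()/parse_body()."""

    @final
    def parse(self, pdf_r: str) -> T:
        # Shared setup would happen here (extensions, QR data, CSV writer).
        self._source = pdf_r
        return self.parse_body(pdf_r)

    @abstractmethod
    def parse_body(self, pdf_r: str) -> T:
        """Subclasses put their document-specific logic here."""


@dataclass
class Output:
    invoice_number: Optional[str] = None


class InvoiceParser(ParserTemplateSketch[Output]):
    def parse_body(self, pdf_r: str) -> Output:
        # A real parser would call self.parse_invoice_number() and similar methods.
        return Output(invoice_number=f"from:{pdf_r}")


result = InvoiceParser().parse("invoice.pdf")
```

Callers always go through parse(), while subclasses only override parse_body(); the type parameter carries the output type through to the return value of parse().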

class aiviro.modules.universal_parser.UniversalParserExtension(robot: PDFRobot, qr_data: QRData | None = None, csv_writer: CSVWriter | None = None)

Extension class to inherit from for expanding or changing parsing logic of Universal Parser. Every extension is identified by the value returned from its get_context_name() method.

Example:

>>> import datetime
>>> from typing import Optional
>>> from aiviro.modules.universal_parser import UniversalParserExtension
>>> # create custom parser logic for a specific item (invoice_date)
>>> class MyExtension(UniversalParserExtension):
...     @classmethod
...     def get_context_name(cls) -> str:
...         return "my-extension"
...
...     def parse_invoice_date(self) -> Optional[datetime.date]:
...         return datetime.date(year=2020, month=1, day=21)

abstract classmethod get_context_name() → str: Unique extension name, used for selecting correct extension class.

class aiviro.modules.universal_parser.DocumentType(value)

Type of documents to parse, used in Universal Parser.

INVOICE = 'invoice': Invoice document

ORDER = 'order': Order document

class aiviro.modules.universal_parser.DocItem(rows: List[List[BoundBox]], area: BoundBox, page_index: int)

A document item is a group of text boxes organized into rows.

Example:

>>> # MyParser is a subclass of UniversalParser, see the UniversalParser example
>>> header, items = MyParser().parse_document_items()
>>> for item in items:
...     # access item one by one
...     for r in item.rows:
...         # process item by rows
...         print(len(r))
...     # access all boxes at once
...     boxes = item.boxes

rows: List[List[BoundBox]]: Text boxes grouped into separate rows

area: BoundBox: The overall area of the item

page_index: int: The page index where the item was detected

property boxes: List[BoundBox]: Return all text-boxes in the item

class aiviro.modules.universal_parser.BaseSplitCondition

Base class for implementing custom split condition for the document items.

Example:

>>> from typing import List
>>> from aiviro.modules.universal_parser.item_parsing import BaseSplitCondition
>>> class MySplitCondition(BaseSplitCondition):
...     def should_split(self, data_row: List["BoundBox"]) -> bool:
...         first_text = self._select_first_box_text(data_row)
...         return first_text.text == "some text"

abstract should_split(data_row: List[BoundBox]) → bool

Implements the logic for splitting the data into a new document’s item.

Returns:: True if the provided row is the first row of the new document’s item, False otherwise.
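
How a split condition turns a sequence of rows into document items can be sketched as follows. This is an illustration on plain strings, not the library's algorithm: `ItemSketch`, `group_rows` and `starts_with_position_number` are hypothetical, and text boxes are reduced to their text.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Row = List[str]  # a row of text boxes, reduced to plain strings for this sketch


@dataclass
class ItemSketch:
    """Hypothetical stand-in for DocItem (no area or page index)."""

    rows: List[Row] = field(default_factory=list)

    @property
    def boxes(self) -> List[str]:
        # Flatten all rows into one list of boxes.
        return [box for row in self.rows for box in row]


def group_rows(rows: List[Row], should_split: Callable[[Row], bool]) -> List[ItemSketch]:
    """Start a new item whenever should_split() reports the first row of an item."""
    items: List[ItemSketch] = []
    for row in rows:
        if should_split(row) or not items:
            items.append(ItemSketch())
        items[-1].rows.append(row)
    return items


def starts_with_position_number(row: Row) -> bool:
    # An item starts on a row whose first box is a position number such as "1." or "2."
    return bool(row) and row[0].rstrip(".").isdigit()


table_rows = [
    ["1.", "Laptop", "25 000"],
    ["", "serial no. ABC123"],
    ["2.", "Mouse", "500"],
]
items = group_rows(table_rows, starts_with_position_number)
```

Here the second row does not satisfy the condition, so it is attached to the first item as a continuation row.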

class aiviro.modules.universal_parser.QRData(robot: PDFRobot)

Detects, decodes and maps QR data from a pdf-robot using CzechQRInvoiceDecoder, making it accessible in UniversalParser. See section QR, Bar Codes for more info.

extract() → QRData: Extracts QR data from provided pdf-robot.

get(key: DataColumnNames) → Any | None: Returns extracted value based on the key, or None if it doesn’t exist.

class aiviro.modules.universal_parser.CSVWriter(robot: PDFRobot, file_prefix: str = 'reader_export', output_folder: Path | str | None = None)

Writer to save parsed values from Universal Parser into a csv-file. Name of the file is constructed as {prefix}_{year}_{month}.csv.

Parameters:

robot – PDF-Robot containing parsed pdf-file
file_prefix – Prefix of the csv-file
output_folder – Folder where to save csv-files, by default log-folder from Aiviro config-file is used
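
As a rough illustration of the naming scheme, the sketch below builds a path of the form {prefix}_{year}_{month}.csv. The function `csv_path` is hypothetical, and both the zero-padded month and the `logs` default folder are assumptions; the real writer takes its default folder from the Aiviro config-file.

```python
from datetime import date
from pathlib import Path
from typing import Optional, Union


def csv_path(
    file_prefix: str = "reader_export",
    output_folder: Optional[Union[Path, str]] = None,
    today: Optional[date] = None,
) -> Path:
    """Build {prefix}_{year}_{month}.csv inside the output folder."""
    today = today or date.today()
    # Assumed default folder, standing in for the configured log-folder.
    folder = Path(output_folder) if output_folder is not None else Path("logs")
    # Whether the month is zero-padded is an assumption of this sketch.
    return folder / f"{file_prefix}_{today.year}_{today.month:02d}.csv"


path = csv_path(output_folder="exports", today=date(2020, 1, 21))
```

Because the name depends only on the prefix, year and month, all values saved within the same month end up in the same file under this scheme.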

save_value(column_name: Enum, value: Any) → None

Saves value into a csv-file.

Parameters:

column_name – Type of the column
value – Value to save