Reader

Note

It’s required to install reader, pdf and codes from Optional dependencies section.

The first step for using an Universal Parser, to automatically extract information from the pdf-files, is to create a new class that inherits from a UniversalParser.

The next step, in a case that the parser is not working properly or you would like it to work differently, you can implement an Extension, see UniversalParserExtension. The extensions are then provided in the constructor of your Universal Parser, they are automatically selected then based on the context value.

At the end, you just need to implement parse_body() method, that contains overall parsing logic.

class aiviro.modules.universal_parser.UniversalParser(extension_classes: Optional[List[Type[aiviro.modules.universal_parser.parser.UniversalParserExtension]]] = None)

Main class to inherit from for defining your parsing logic of Universal Parser.

Parameters

extension_classes – List of universal-parser extensions, see UniversalParserExtension

Example

>>> from aiviro.modules.universal_parser import UniversalParser, DocumentType
>>> from dataclasses import dataclass
>>> from decimal import Decimal
>>> from typing import Optional
>>>
>>> @dataclass
... class ParserOutput:
...     vat_supplier: Optional[str] = None
...     invoice_date: Optional[str] = None
...     total_amount: Optional[Decimal] = None
...
>>> class MyParser(UniversalParser[ParserOutput]):
...     def parse_body(self, pdf_r: "PDFRobot") -> ParserOutput:
...         output = ParserOutput()
...         output.vat_supplier = self.parse_vat_supplier_number(DocumentType.INVOICE)
...         output.total_amount = self.parse_total_amount().total_amount
...         # for a specific vat-number, change parsing logic to other extension-parser,
...         # which is defined by a context value: "my-extension"
...         if output.vat_supplier == "CZ1234567890":
...             self.context = "my-extension"
...
...         output.invoice_date = self.parse_invoice_date()
...         return output
>>> import aiviro
>>> from aiviro.modules.pdf import create_pdf_robot
>>> from aiviro.modules.universal_parser import CSVWriter, QRData
>>> # use defined Universal Parser in a scenario
>>> if __name__ == "__main__":
...     aiviro.init_logging()
...
...     r = create_pdf_robot("path/to/file.pdf")
...     csv_writer = CSVWriter(r)
...     uni_parser = MyParser([MyExtension])  # defined in the example above
...     res = uni_parser.parse(
...         pdf_r=r,
...         qr_data=QRData(r).extract(),
...         csv_writer=csv_writer,
...     )
...     print(res)
...     # ParserOutput(
...     #   vat_supplier='CZ1234567890',
...     #   invoice_date=datetime.date(2020, 1, 21),
...     #   total_amount=Decimal('12100')
...     # )
property default_parser_extension_class: Type[aiviro.modules.universal_parser.parser.UniversalParserExtension]

The default behaviour class when no context is set.

Override this method to change default behaviour

property context: str

Currently set context for selection of a universal-parser extension.

Getter

Returns current context

Setter

Sets new context and selects appropriate parser

parse_currency() Optional[str]

Parse type of currency.

parse_invoice_date() Optional[datetime.date]

Parse invoice date.

parse_invoice_number() Optional[str]

Parse invoice number.

parse_order_number(primary_regex: Optional[List[str]] = None, use_predefined_formats: bool = True, primary_keywords: Optional[List[str]] = None) List[str]

Parse order number.

Parameters
  • primary_regex – Primary format options of order-number

  • use_predefined_formats – If True, also predefined formats are used as possible order-number

  • primary_keywords – Keywords that are used at first to search for the order number.

Example

>>> from aiviro.modules.universal_parser.constants.keywords_regex import OrderNumberRegex
>>> from aiviro.modules.universal_parser.constants.keywords_invoice import OrderNumberKeywords
>>> parser = MyParser()
>>> parser.parse_order_number(
...     primary_regex=OrderNumberRegex.PRIMARY_REGEX_9SLASH3 + OrderNumberRegex.PRIMARY_REGEX_3SLASH9,
...     primary_keywords=OrderNumberKeywords.ORDER_NUMBER_DIRECT_SUBSCRIBER_SUBSTRING,
... )
parse_subscriber_id(doc_type: Optional[aiviro.modules.universal_parser.item_parsing.id_vat.utils.determinator.DocumentType] = None, known_identifiers: OptionalListType = None) Optional[str]

Parse subscriber id.

Parameters
  • doc_type – Type of document, see DocumentType

  • known_identifiers – List of expected subscriber ids

parse_supplier_id(doc_type: Optional[aiviro.modules.universal_parser.item_parsing.id_vat.utils.determinator.DocumentType] = None, known_identifiers: OptionalListType = None) Optional[str]

Parse supplier id.

Parameters
  • doc_type – Type of document, see DocumentType

  • known_identifiers – List of ids to exclude from search

parse_taxable_date() Optional[datetime.date]

Parse tax data.

parse_due_date() Optional[datetime.date]

Parse due date.

parse_total_amount(include_amount_without_vat: bool = True) aiviro.modules.universal_parser.item_parsing.amount.total_amount_manager.AmountValues

Parse total amount with & without vat.

Parameters

include_amount_without_vat – Option to also parse total amount without vat

parse_variable_symbol() Optional[str]

Parse variable symbol.

parse_vat_rate() Optional[int]

Parse vat rate.

parse_vat_subscriber_number(doc_type: Optional[aiviro.modules.universal_parser.item_parsing.id_vat.utils.determinator.DocumentType] = None, known_identifiers: OptionalListType = None) Optional[str]

Parse subscriber vat number.

Parameters
  • doc_type – Type of document, see DocumentType

  • known_identifiers – List of expected subscriber vat numbers

parse_vat_supplier_number(doc_type: Optional[DocumentType] = None, known_identifiers: OptionalListType = None) Optional[str]

Parse supplier vat number.

Parameters
  • doc_type – Type of document, see DocumentType

  • known_identifiers – List of vat numbers to exclude from search

parse_bank_account() aiviro.modules.universal_parser.item_parsing.bank_account.manager.BankAccountValues

Parse bank account information. It parses bank account number, IBAN, SWIFT and bank name.

parse_company_name(company_ico: str = '') Optional[str]

Parse company name from company IČO, using ARES. Therefore, only Czech companies are supported.

Parameters

company_ico – Czech IČO

parse_document_items(split_condition: Optional[aiviro.modules.universal_parser.item_parsing.document_items.storage.BaseSplitCondition] = None) Tuple[Optional[aiviro.modules.universal_parser.item_parsing.document_items.storage.DocumentItemsHeader], List[aiviro.modules.universal_parser.item_parsing.document_items.storage.DocItem]]

Detects and parse items in the document.

Parameters

split_condition – Condition to split items, see BaseSplitCondition

Returns

List of items containing all their text-boxes and overall area, see DocItem

../_images/order_items_detection.png

Example of document items detection

abstract parse_body(pdf_r: PDFRobot) aiviro.modules.universal_parser.parser.T

Method containing main logic of the universal parser.

final parse(pdf_r: PDFRobot, qr_data: Optional[QRData] = None, csv_writer: Optional[CSVWriter] = None) aiviro.modules.universal_parser.parser.T

Main method to call to parse a provided pdf-file.

Parameters
  • pdf_r – PDF Robot with a loaded pdf-file

  • allow_qr_codes – If True, qr-code will be automatically detected and parsed in the pdf-file

  • csv_writer – Writer to automatically save parsed values

  • **kwargs – Any additional arguments that are passed into a parse_body() method

class aiviro.modules.universal_parser.UniversalParserExtension(robot: PDFRobot, qr_data: Optional[QRData] = None, csv_writer: Optional[CSVWriter] = None)

Extension class to inherit from for expanding or changing parsing logic of Universal Parser. Every extension needs to be specified by its context_name value.

Example

>>> import datetime
>>> from aiviro.modules.universal_parser import UniversalParserExtension
>>> # create custom parser logic for a specific item (invoice_date)
>>> class MyExtension(UniversalParserExtension):
...     @classmethod
...     def get_context_name(cls) -> str:
...         return "my-extension"
...
...     def parse_invoice_date(self) -> Optional[date]:
...         return datetime.date(year=2020, month=1, day=21)
abstract classmethod get_context_name() str

Unique extension name, used for selecting correct extension class.

class aiviro.modules.universal_parser.DocumentType(value)

Type of documents to parse, used in Universal Parser.

INVOICE = 'invoice'

Invoice document

ORDER = 'order'

Order document

class aiviro.modules.universal_parser.DocItem(rows: List[List[BoundBox]], area: BoundBox, page_index: int)

A document item is a group of text boxes that are grouped by rows together.

Example

>>> from aiviro.modules.universal_parser import UniversalParser
>>> header, items = UniversalParser().parse_document_items()
>>> for item in items:
...     # access item one by one
...     for r in item.rows:
...         # process item by rows
...     # access all boxes at once
...     boxes = item.boxes
rows: List[List[BoundBox]]

Text boxes grouped into separate rows

area: BoundBox

The overall area of the item

page_index: int

The page index where the item was detected

property boxes: List[BoundBox]

Return all text-boxes in the item

class aiviro.modules.universal_parser.BaseSplitCondition

Base class for implementing custom split condition for the document items.

Example

>>> from aiviro.modules.universal_parser.item_parsing import BaseSplitCondition
>>> class MySplitCondition(BaseSplitCondition):
...     def should_split(self, data_row: List["BoundBox"]) -> bool:
...         first_text = self._select_first_box_text(data_row)
...         return first_text.text == "some text"
abstract should_split(data_row: List[BoundBox]) bool

Implements the logic for splitting the data into a new document’s item.

Returns

True if the provided row is the first row of the new document’s item, False otherwise.

class aiviro.modules.universal_parser.QRData(robot: PDFRobot)

Detects, decodes and map QR data from pdf-robot using CzechQRInvoiceDecoder, to be accessible in UniversalParser. See section QR, Bar Codes for more info.

extract() aiviro.modules.universal_parser.utils.qr_code.QRData

Extracts QR data from provided pdf-robot.

get(key: aiviro.modules.reader.constants.DataColumnNames) Optional[Any]

Returns extracted value based on the key, or None if it doesn’t exist.

class aiviro.modules.universal_parser.CSVWriter(robot: PDFRobot, file_prefix: str = 'reader_export', output_folder: Optional[Union[pathlib.Path, str]] = None)

Writer to save parsed values from Universal Parser into a csv-file. Name of the file is constructed as {prefix}_{year}_{month}.csv.

Parameters
  • robot – PDF-Robot containing parsed pdf-file

  • file_prefix – Prefix of the csv-file

  • output_folder – Folder where to save csv-files, by default log-folder from Aiviro config-file is used

save_value(column_name: aiviro.modules.reader.constants.DataColumnNames, value: Any) None

Saves value into a csv-file.

Parameters
  • column_name – Type of the column

  • value – Value to save