Reader

Note

You must install reader, pdf and codes from the Optional dependencies section.

The first step in using a Universal Parser to automatically extract information from pdf-files is to create a new class that inherits from UniversalParser.

Next, if the parser does not work properly or you would like it to behave differently, you can implement an extension, see UniversalParserExtension. Extensions are passed to the constructor of your Universal Parser and are then selected automatically based on the context value.

Finally, you only need to implement the parse_body() method, which contains the overall parsing logic.

class aiviro.modules.universal_parser.UniversalParser(extension_classes: Optional[List[Type[aiviro.modules.universal_parser.parser.UniversalParserExtension]]] = None)

Main class to inherit from when defining the parsing logic of your Universal Parser.

Parameters

extension_classes – List of universal-parser extensions, see UniversalParserExtension

Example

>>> import datetime
>>> from aiviro.modules.universal_parser import UniversalParser, DocumentType
>>> from dataclasses import dataclass
>>> from decimal import Decimal
>>> from typing import Optional
>>>
>>> @dataclass
... class ParserOutput:
...     vat_supplier: Optional[str] = None
...     invoice_date: Optional[datetime.date] = None  # parse_invoice_date() returns a date
...     total_amount: Optional[Decimal] = None
...
>>> class MyParser(UniversalParser[ParserOutput]):
...     def parse_body(self, pdf_r: "PDFRobot") -> ParserOutput:
...         output = ParserOutput()
...         output.vat_supplier = self.parse_vat_supplier_number(DocumentType.INVOICE)
...         output.total_amount = self.parse_total_amount().total_amount
...         # for a specific vat-number, change parsing logic to other extension-parser,
...         # which is defined by a context value: "my-extension"
...         if output.vat_supplier == "CZ1234567890":
...             self.context = "my-extension"
...
...         output.invoice_date = self.parse_invoice_date()
...         return output
>>> import aiviro
>>> from aiviro.modules.pdf import create_pdf_robot
>>> from aiviro.modules.universal_parser import CSVWriter, QRData
>>> # use defined Universal Parser in a scenario
>>> if __name__ == "__main__":
...     aiviro.init_logging()
...
...     r = create_pdf_robot("path/to/file.pdf")
...     csv_writer = CSVWriter(r)
...     uni_parser = MyParser([MyExtension])  # see the UniversalParserExtension example
...     res = uni_parser.parse(
...         pdf_r=r,
...         qr_data=QRData(r).extract(),
...         csv_writer=csv_writer,
...     )
...     print(res)
...     # ParserOutput(
...     #   vat_supplier='CZ1234567890',
...     #   invoice_date=datetime.date(2020, 1, 21),
...     #   total_amount=Decimal('12100')
...     # )
property default_parser_extension_class: Type[aiviro.modules.universal_parser.parser.UniversalParserExtension]

The default behaviour class when no context is set.

Override this property to change the default behaviour.

property context: str

Currently set context for selection of a universal-parser extension.

Getter

Returns current context

Setter

Sets new context and selects appropriate parser
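The getter and setter pair can be pictured with a stand-alone sketch that does not import aiviro. SketchParser, DefaultExtension and FixedDateExtension are hypothetical stand-ins, and this reference does not say how the real class reacts to an unknown context; raising KeyError is an assumption of the sketch.

```python
import datetime
from typing import Dict, List, Optional, Type


class DefaultExtension:
    """Hypothetical stand-in for the default extension class."""

    context_name = "default"

    def parse_invoice_date(self) -> Optional[datetime.date]:
        return None  # pretend nothing was found in the document


class FixedDateExtension(DefaultExtension):
    """Hypothetical extension selected by the context "my-extension"."""

    context_name = "my-extension"

    def parse_invoice_date(self) -> Optional[datetime.date]:
        return datetime.date(2020, 1, 21)


class SketchParser:
    """Not the aiviro implementation, only the selection idea."""

    def __init__(self, extension_classes: Optional[List[Type[DefaultExtension]]] = None):
        classes = [DefaultExtension] + list(extension_classes or [])
        self._extensions: Dict[str, DefaultExtension] = {c.context_name: c() for c in classes}
        self._context = DefaultExtension.context_name

    @property
    def context(self) -> str:
        return self._context

    @context.setter
    def context(self, value: str) -> None:
        if value not in self._extensions:  # assumption: unknown contexts are rejected
            raise KeyError(f"no extension registered for context {value!r}")
        self._context = value

    def parse_invoice_date(self) -> Optional[datetime.date]:
        # delegate to whichever extension the current context selects
        return self._extensions[self._context].parse_invoice_date()


parser = SketchParser([FixedDateExtension])
before = parser.parse_invoice_date()
parser.context = "my-extension"
after = parser.parse_invoice_date()
```

Here before is None because the default stand-in finds nothing, while after comes from the extension selected through the context.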

parse_currency() -> Optional[str]

Parse type of currency.

parse_invoice_date() -> Optional[datetime.date]

Parse invoice date.

parse_invoice_number() -> Optional[str]

Parse invoice number.

parse_order_number(primary_regex: Optional[List[str]] = None, built_in_primary_regex: Optional[List[str]] = None) -> List[str]

Parse order number.

Parameters
  • primary_regex – List of primary regexes to look for as a format of order-number

  • built_in_primary_regex – Predefined format options of order-number

Example

>>> from aiviro.modules.universal_parser.constants.keywords_regex import OrderNumberRegex
>>> parser = MyParser()
>>> parser.parse_order_number(
...     built_in_primary_regex=OrderNumberRegex.PRIMARY_REGEXTEXT_9SLASH3
... )
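To give a feel for what a primary regex might look like, the sketch below uses only the standard re module. The pattern (nine digits, a slash, three digits), the sample text and the helper find_order_numbers are illustrative assumptions; they are not the definition of any built-in OrderNumberRegex member.

```python
import re
from typing import List

# Hypothetical order-number format: nine digits, a slash, three digits.
ORDER_NUMBER_PATTERN = r"\b\d{9}/\d{3}\b"


def find_order_numbers(text: str, patterns: List[str]) -> List[str]:
    """Return unique matches of all patterns, in order of first appearance."""
    found: List[str] = []
    for pattern in patterns:
        for match in re.findall(pattern, text):
            if match not in found:
                found.append(match)
    return found


sample = "Order no. 123456789/001, see also 123456789/001 and 987654321/010."
order_numbers = find_order_numbers(sample, [ORDER_NUMBER_PATTERN])
```

Since primary_regex is typed as a list of strings, a pattern like this would presumably be passed as primary_regex=[ORDER_NUMBER_PATTERN].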
parse_subscriber_id(doc_type: Optional[aiviro.modules.universal_parser.item_parsing.id_vat.utils.determinator.DocumentType] = None, known_identifiers: OptionalListType = None) -> Optional[str]

Parse subscriber id.

Parameters
  • doc_type – Type of document, see DocumentType

  • known_identifiers – List of expected subscriber ids

parse_supplier_id(doc_type: Optional[aiviro.modules.universal_parser.item_parsing.id_vat.utils.determinator.DocumentType] = None, known_identifiers: OptionalListType = None) -> Optional[str]

Parse supplier id.

Parameters
  • doc_type – Type of document, see DocumentType

  • known_identifiers – List of ids to exclude from search

parse_taxable_date() -> Optional[datetime.date]

Parse taxable date.

parse_total_amount(include_amount_without_vat: bool = True) -> aiviro.modules.universal_parser.item_parsing.amount.total_amount_manager.AmountValues

Parse total amount with & without vat.

Parameters

include_amount_without_vat – Option to also parse total amount without vat
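The relation between the two amounts is plain arithmetic, sketched below with the standard decimal module. This reference does not describe how the parser obtains the amount without vat, so the helper amount_without_vat, the 21 % rate and the rounding to two decimal places are assumptions made only for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def amount_without_vat(total_with_vat: Decimal, vat_rate_percent: int) -> Decimal:
    """Derive the net amount from a gross amount and a VAT rate in percent."""
    divisor = Decimal(1) + Decimal(vat_rate_percent) / Decimal(100)
    return (total_with_vat / divisor).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Example values: gross amount 12100 at an assumed 21 % VAT rate.
net = amount_without_vat(Decimal("12100"), 21)
vat = Decimal("12100") - net
```

Using Decimal rather than float avoids binary rounding noise, which matters when the values end up in accounting exports.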

parse_variable_symbol() -> Optional[str]

Parse variable symbol.

parse_vat_rate() -> Optional[int]

Parse vat rate.

parse_vat_subscriber_number(doc_type: Optional[aiviro.modules.universal_parser.item_parsing.id_vat.utils.determinator.DocumentType] = None, known_identifiers: OptionalListType = None) -> Optional[str]

Parse subscriber vat number.

Parameters
  • doc_type – Type of document, see DocumentType

  • known_identifiers – List of expected subscriber vat numbers

parse_vat_supplier_number(doc_type: Optional[DocumentType] = None, known_identifiers: OptionalListType = None) -> Optional[str]

Parse supplier vat number.

Parameters
  • doc_type – Type of document, see DocumentType

  • known_identifiers – List of vat numbers to exclude from search
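Note that known_identifiers means different things on the two sides: expected values for the subscriber methods, excluded values for the supplier methods. The stand-alone sketch below illustrates only that difference. The regex, the sample text and both helper functions are hypothetical; the real matching logic is not shown in this reference.

```python
import re
from typing import List, Optional

# Simplified, hypothetical pattern for Czech-style VAT numbers (CZ + 8 to 10 digits).
VAT_PATTERN = re.compile(r"\bCZ\d{8,10}\b")


def pick_supplier_vat(text: str, known_identifiers: Optional[List[str]] = None) -> Optional[str]:
    """Supplier side: skip candidates listed in known_identifiers."""
    excluded = set(known_identifiers or [])
    for candidate in VAT_PATTERN.findall(text):
        if candidate not in excluded:
            return candidate
    return None


def pick_subscriber_vat(text: str, known_identifiers: Optional[List[str]] = None) -> Optional[str]:
    """Subscriber side: prefer candidates listed in known_identifiers."""
    candidates = VAT_PATTERN.findall(text)
    expected = set(known_identifiers or [])
    for candidate in candidates:
        if candidate in expected:
            return candidate
    return candidates[0] if candidates else None


text = "Subscriber VAT: CZ12345678  Supplier VAT: CZ87654321"
own_vat_numbers = ["CZ12345678"]  # e.g. the VAT number of your own company
supplier = pick_supplier_vat(text, own_vat_numbers)
subscriber = pick_subscriber_vat(text, own_vat_numbers)
```

Passing the same list of your own company's numbers to both helpers yields the other party as supplier and your own number as subscriber.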

abstract parse_body(pdf_r: PDFRobot) -> aiviro.modules.universal_parser.parser.T

Method containing main logic of the universal parser.

final parse(pdf_r: PDFRobot, qr_data: Optional[QRData] = None, csv_writer: Optional[CSVWriter] = None) -> aiviro.modules.universal_parser.parser.T

Main method to call to parse a provided pdf-file.

Parameters
  • pdf_r – PDF Robot with a loaded pdf-file

  • qr_data – QR data extracted from the pdf-file, see QRData

  • csv_writer – Writer to automatically save parsed values

  • **kwargs – Any additional arguments that are passed into a parse_body() method
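The split between a final parse() and an abstract parse_body() is the classic template-method pattern. The sketch below shows the pattern on its own, without aiviro: TemplateParser and WordCounter are hypothetical names, a plain string stands in for the PDF robot, and the min_length keyword only illustrates how extra arguments could be forwarded.

```python
from abc import ABC, abstractmethod
from typing import Any, Generic, TypeVar, final

T = TypeVar("T")


class TemplateParser(ABC, Generic[T]):
    """Hypothetical stand-in; not the aiviro class."""

    @final
    def parse(self, document: str, **kwargs: Any) -> T:
        # shared setup would go here before the user-defined logic runs
        self.document = document
        return self.parse_body(document, **kwargs)

    @abstractmethod
    def parse_body(self, document: str, **kwargs: Any) -> T:
        """Subclasses put their parsing logic here."""


class WordCounter(TemplateParser[int]):
    def parse_body(self, document: str, **kwargs: Any) -> int:
        # extra keyword arguments given to parse() arrive here unchanged
        min_length = kwargs.get("min_length", 1)
        return sum(1 for word in document.split() if len(word) >= min_length)


count = WordCounter().parse("Invoice no. 2020001 total 12100", min_length=5)
```

Callers only ever use parse(), so the base class stays in control of setup, while subclasses change behaviour solely through parse_body().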

class aiviro.modules.universal_parser.UniversalParserExtension(robot: PDFRobot, qr_data: Optional[QRData] = None, csv_writer: Optional[CSVWriter] = None)

Extension class to inherit from for expanding or changing the parsing logic of Universal Parser. Every extension is identified by the value returned from get_context_name().

Example

>>> import datetime
>>> from typing import Optional
>>> from aiviro.modules.universal_parser import UniversalParserExtension
>>> # create custom parser logic for a specific item (invoice_date)
>>> class MyExtension(UniversalParserExtension):
...     @classmethod
...     def get_context_name(cls) -> str:
...         return "my-extension"
...
...     def parse_invoice_date(self) -> Optional[datetime.date]:
...         return datetime.date(year=2020, month=1, day=21)
abstract classmethod get_context_name() -> str

Unique extension name, used for selecting correct extension class.

class aiviro.modules.universal_parser.DocumentType(value)

Type of documents to parse, used in Universal Parser.

INVOICE = 'invoice'

Invoice document

ORDER = 'order'

Order document

class aiviro.modules.universal_parser.QRData(robot: PDFRobot)

Detects, decodes and maps QR data from a pdf-robot using CzechQRInvoiceDecoder, making it accessible in UniversalParser. See the QR, Bar Codes section for more info.

extract() -> aiviro.modules.universal_parser.item_parsing.utilities.qr_code.QRData

Extracts QR data from provided pdf-robot.

get(key: aiviro.modules.reader.constants.DataColumnNames) -> Optional[Any]

Returns extracted value based on the key, or None if it doesn’t exist.
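The "value or None" behaviour can be pictured with a small stand-alone sketch. It assumes a payload made of `*`-separated KEY:VALUE fields in the style of Czech QR invoice strings; the sample payload, the key names and both helper functions are illustrative, and this is not how CzechQRInvoiceDecoder is implemented.

```python
from typing import Dict, Optional


def decode_payload(payload: str) -> Dict[str, str]:
    """Split a '*'-separated payload into a key/value mapping."""
    fields: Dict[str, str] = {}
    for part in payload.split("*"):
        key, sep, value = part.partition(":")
        if sep and key:  # skip header parts such as the format tag and version
            fields[key] = value
    return fields


def get(fields: Dict[str, str], key: str) -> Optional[str]:
    """Return the value for key, or None if the key is missing."""
    return fields.get(key)


# Illustrative payload only; real QR invoices carry more fields.
fields = decode_payload("SID*1.0*ID:2020001*DD:20200121*AM:12100.00*CC:CZK")
amount = get(fields, "AM")
missing = get(fields, "X-UNKNOWN")
```

Returning None for a missing key lets the calling code fall back to parsing the value from the document text instead.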

class aiviro.modules.universal_parser.CSVWriter(robot: PDFRobot, file_prefix: str = 'reader_export', output_folder: Optional[Union[pathlib.Path, str]] = None)

Writer to save parsed values from Universal Parser into a csv-file. Name of the file is constructed as {prefix}_{year}_{month}.csv.

Parameters
  • robot – PDF-Robot containing parsed pdf-file

  • file_prefix – Prefix of the csv-file

  • output_folder – Folder in which to save csv-files; by default, the log-folder from the Aiviro config-file is used
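The file naming scheme can be reproduced with the standard library, as in the sketch below. It is not the CSVWriter implementation: zero-padding of the month, the date used to pick year and month, the column names and the append-with-header behaviour are all assumptions, and build_csv_path and append_row are hypothetical helpers.

```python
import csv
import datetime
import tempfile
from pathlib import Path


def build_csv_path(output_folder: Path, file_prefix: str, day: datetime.date) -> Path:
    # Assumption: the month is zero-padded; the reference only gives {prefix}_{year}_{month}.csv.
    return output_folder / f"{file_prefix}_{day.year}_{day.month:02d}.csv"


def append_row(path: Path, row: dict) -> None:
    """Append one row, writing the header first if the file is new."""
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)


with tempfile.TemporaryDirectory() as folder:
    csv_path = build_csv_path(Path(folder), "reader_export", datetime.date(2020, 1, 21))
    append_row(csv_path, {"invoice_date": "2020-01-21", "total_amount": "12100"})
    csv_name = csv_path.name
    csv_lines = csv_path.read_text(encoding="utf-8").splitlines()
```

With one file per month, each export stays small and can be handed over as a monthly batch.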

save_value(column_name: aiviro.modules.reader.constants.DataColumnNames, value: Any) -> None

Saves value into a csv-file.

Parameters
  • column_name – Type of the column

  • value – Value to save