Reader

Warning

The previous implementation of the Reader has moved to the Universal Parser section. This is a new, improved version of the Reader; the old version is deprecated and will be removed in the future.

Note

You must install the reader, pdf and codes dependencies listed in the Optional dependencies section.

Aiviro Reader allows you to process PDF files and extract crucial information. Whether it’s vendor IDs, customer IDs, total amounts, or more, it simplifies data extraction from invoices.

Invoice Processing

Supported field extraction

Name                        Type                      Description
--------------------------  ------------------------  ---------------------------------------------
language                    Language                  Language of the invoice
customer_id                 str                       Customer reference ID
customer_ico                str                       The Czech company ID number of the customer
customer_tax_id             str                       The taxpayer number associated with the customer
customer_address            InvoiceAddress            Mailing address for the customer
customer_address_recipient  str                       Name associated with the customer address
customer_name               str                       Name of the customer
vendor_id                   str                       Vendor reference ID
vendor_ico                  str                       The Czech company ID number of the vendor
vendor_tax_id               str                       The taxpayer number associated with the vendor
vendor_address              InvoiceAddress            Vendor mailing address
vendor_address_recipient    str                       Name associated with the vendor address
vendor_name                 str                       Name of the vendor
invoice_id                  str                       Invoice number
invoice_date                datetime.date             Date the invoice was issued
due_date                    datetime.date             Date payment for this invoice is due
tax_date                    datetime.date             Date the tax was applied to the invoice
order_number                list, str                 Order reference number
total_amount                decimal.Decimal           Total amount of the invoice
total_amount_without_tax    decimal.Decimal           Total amount of the invoice without tax
total_tax                   decimal.Decimal           Total tax amount of the invoice
amount_due                  decimal.Decimal           Amount due for the invoice
currency                    str                       Currency of the invoice
variable_symbol             str                       Variable symbol of the invoice
payment_terms               str                       The terms of payment for the invoice
bank_accounts               list, InvoiceBankAccount  List of bank accounts
items                       list, InvoiceItem         List of invoice items, filtered by total_amount and total_amount_without_tax
raw_items                   list, InvoiceItem         List of unfiltered invoice items
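
Because the amount fields are decimal.Decimal values, they can be cross-checked exactly after extraction. The following is a minimal sketch and not part of Aiviro: `Field` is a hypothetical stand-in that mimics only the `.value` attribute of an extracted field, and `totals_are_consistent` and its tolerance are illustrative choices.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Field:
    """Hypothetical stand-in exposing only `.value`, like an extracted field."""
    value: Optional[Decimal] = None


def totals_are_consistent(total: Field, net: Field, tax: Field,
                          tolerance: Decimal = Decimal("0.01")) -> bool:
    """Return True when net + tax matches the gross total within `tolerance`.

    Returns False if any of the three values was not extracted (is None).
    """
    if total.value is None or net.value is None or tax.value is None:
        return False
    return abs(total.value - (net.value + tax.value)) <= tolerance


# Illustrative values only, not taken from a real invoice.
ok = totals_are_consistent(
    Field(Decimal("121.00")), Field(Decimal("100.00")), Field(Decimal("21.00"))
)
missing = totals_are_consistent(
    Field(Decimal("121.00")), Field(), Field(Decimal("21.00"))
)
print(ok, missing)
```

With real data, the three arguments would be the total_amount, total_amount_without_tax and total_tax fields of the parsed invoice.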

Invoice Bank Account

Name           Type              Description
-------------  ----------------  -----------------------------------------------------
iban           str               IBAN of the bank account
swift          str               SWIFT (BIC) code of the bank account
bank_name      str               Name of the bank
country_code   str               Country code of the bank account, e.g., CZ, DE, etc.
local_account  LocalBankAccount  Contains bank code, account number and account prefix
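
An extracted IBAN can be sanity-checked before it is used for payment. The sketch below applies the standard ISO 13616 mod-97 check to a plain string; `is_valid_iban` is a hypothetical helper, not an Aiviro function, and it does not verify the country-specific length.

```python
def is_valid_iban(iban: str) -> bool:
    """Validate an IBAN with the mod-97 check (country-specific length is not checked)."""
    compact = iban.replace(" ", "").upper()
    if not (5 <= len(compact) <= 34) or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Replace letters with numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


# Widely published example IBAN, and the same IBAN with its last digit changed.
print(is_valid_iban("GB82 WEST 1234 5698 7654 32"))
print(is_valid_iban("GB82 WEST 1234 5698 7654 33"))
```

With extracted data, the string passed in would be the value of a bank account's iban field.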

Invoice items

Name                Type             Description
------------------  ---------------  ----------------------------------------------------------------------
index               int              Line item index, starting from 0
item                str              Full text of the line item
description         str              The text description for the invoice line item
quantity            decimal.Decimal  The quantity of the item
unit_price          decimal.Decimal  The net or gross price of one unit of this item, primarily net
unit                str              The unit of the line item, e.g., kg, lb, etc.
product_code        str              Product code, product number, SKU, etc.
tax                 decimal.Decimal  Tax associated with the line item
tax_rate            decimal.Decimal  Tax rate associated with the line item
amount              decimal.Decimal  Total gross amount of the line item
amount_without_tax  decimal.Decimal  Total net amount of the line item
identifier          str              Identifier of the item, found in product code, description or item’s content
tags                list, str        Tags assigned to the item based on its identifier
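
Line items lend themselves to simple arithmetic checks. The sketch below is an illustration only: `Line` is a hypothetical stand-in holding plain values rather than extracted fields, and the check assumes that unit_price is a net price, which the table above says is the usual but not the only case.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional


@dataclass
class Line:
    """Hypothetical stand-in for an invoice item, holding plain values."""
    quantity: Optional[Decimal]
    unit_price: Optional[Decimal]
    amount_without_tax: Optional[Decimal]


def line_is_consistent(line: Line, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Check quantity * unit_price against the net amount (assumes a net unit price)."""
    values = (line.quantity, line.unit_price, line.amount_without_tax)
    if any(v is None for v in values):
        return False
    return abs(line.quantity * line.unit_price - line.amount_without_tax) <= tolerance


def net_sum(lines: List[Line]) -> Decimal:
    """Sum the net amounts, skipping lines whose amount is missing."""
    return sum(
        (l.amount_without_tax for l in lines if l.amount_without_tax is not None),
        Decimal("0"),
    )


# Illustrative values only.
lines = [
    Line(Decimal("2"), Decimal("50.00"), Decimal("100.00")),
    Line(Decimal("3"), Decimal("52.41"), Decimal("157.23")),
]
print(all(line_is_consistent(l) for l in lines), net_sum(lines))
```

The sum of net amounts can then be compared with the invoice's total_amount_without_tax.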

class aiviro.modules.reader.Language(value)

Supported languages for invoice reader.

CZ = 'cs'

Czech

SK = 'sk'

Slovak

EN = 'en'

English

DE = 'de'

German

PL = 'pl'

Polish
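
Note that the member name and the language code differ for Czech (CZ versus 'cs'). The sketch below uses a local mirror of the documented members to show lookup by value and by name; `parse_language` is a hypothetical helper, and the mirror assumes the real class behaves like a standard Python Enum.

```python
from enum import Enum
from typing import Optional


class Language(Enum):
    """Local mirror of the documented members, for illustration only."""
    CZ = "cs"
    SK = "sk"
    EN = "en"
    DE = "de"
    PL = "pl"


def parse_language(code: str) -> Optional[Language]:
    """Map a language code such as 'cs' to a member, or None if unsupported."""
    try:
        return Language(code.strip().lower())
    except ValueError:
        return None


print(parse_language("cs"))   # lookup by value (language code)
print(Language["CZ"].value)   # lookup by member name
print(parse_language("fr"))   # not in the supported list
```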

class aiviro.modules.reader.InvoiceReader(pdf_r: PDFRobot, reader_config: ReaderConfig | None = None)

Reads a PDF invoice file and extracts data from it.

Parameters:
  • pdf_r – PDFRobot instance.

  • reader_config – Configuration for InvoiceReader.

Note

If you receive a 401 Unauthorized Error, please contact our support team. This error typically indicates that you may be missing the necessary permissions for the API.

Example:

>>> from aiviro.modules.reader import InvoiceReader
>>> from aiviro.modules.pdf import create_pdf_robot
>>>
>>> if __name__ == "__main__":
...     pdf_r = create_pdf_robot("path/to/invoice.pdf")
...     reader = InvoiceReader(pdf_r)
...     extracted_data = reader.parse()
...
...     # print value of invoice-id
...     print(extracted_data.invoice_id.value)
...     # '123456789'
...
...     # print items
...     for item in extracted_data.items:
...         print(item.product_code.value)
...         print(item.amount.value)
...     # "ACD-123"
...     # Decimal('100.00')
...     # "DC-456"
...     # Decimal('157.23')

>>> from aiviro.modules.reader import InvoiceReader, ReaderConfig
>>> from aiviro.modules.pdf import create_pdf_robot
>>>
>>> if __name__ == "__main__":
...     pdf_r = create_pdf_robot("path/to/invoice.pdf")
...     reader_config = ReaderConfig([r"(\d{4,6})", r"([A-Z]{2}\d{4})"])
...     reader = InvoiceReader(pdf_r, reader_config)
...     extracted_data = reader.parse()
...
...     # print value of order-numbers
...     for order_number in extracted_data.order_number:
...         print(order_number.value)
...     # '4654'
...     # '123456'
...     # 'AC1234'
property invoice_data: InvoiceData

Return merged extracted data from all processors. Call parse() first to extract data.

Returns:

Extracted data from invoice.

parse(offline_cloud_data: Dict | None = None, isdoc_file: Path | None = None) -> InvoiceData

Parse invoice and return extracted data.

Parameters:
  • offline_cloud_data – If set, cloud processor will not be used and this data will be used instead.

  • isdoc_file – Path to an .isdoc file that is not embedded in the PDF.

Returns:

Extracted data from invoice.

add_items_tag(items: List[InvoiceItem], tags_identifiers: Dict[str, Set[str]], overwrite: bool = False) -> List[InvoiceItem]

Add tag to items based on their identifiers.

Parameters:
  • items – List of InvoiceItems.

  • tags_identifiers – Dictionary where key is tag name and value is set of identifiers.

  • overwrite – If True, tags will be overwritten, otherwise they will be appended.

Returns:

List of items with added tags.
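
The append and overwrite behaviour described above can be pictured with plain Python values. This is a sketch of the documented semantics only, not the library's implementation: `assign_tags` is a hypothetical function, and exact matching of identifiers is an assumption.

```python
from typing import Dict, List, Optional, Set


def assign_tags(identifier: Optional[str], existing: List[str],
                tags_identifiers: Dict[str, Set[str]],
                overwrite: bool = False) -> List[str]:
    """Return the tags for one item: every tag whose identifier set contains `identifier`."""
    matched = sorted(tag for tag, ids in tags_identifiers.items() if identifier in ids)
    if overwrite:
        return matched
    # Append new tags, keeping existing ones and avoiding duplicates.
    return existing + [t for t in matched if t not in existing]


# Illustrative mapping: tag name -> set of identifiers.
tags_identifiers = {"hardware": {"CODE1234", "DESC1234"}, "service": {"CONTENT1234"}}

print(assign_tags("CODE1234", ["manual"], tags_identifiers))
print(assign_tags("CODE1234", ["manual"], tags_identifiers, overwrite=True))
```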

class aiviro.modules.reader.ISDOCProcessor(config: GlobalConfig, reader_config: ReaderConfig)

Processor for ISDOC files. If you are processing invoices, we recommend using InvoiceReader; use this class only if you need to process ISDOC files on their own.

Example:

>>> import pathlib
>>> from collections import OrderedDict
>>> from aiviro.modules.reader import ISDOCProcessor, ReaderConfig, InvoiceData
>>> from aiviro.core.utils.configuration import get_global_config
>>> from aiviro.modules.pdf import create_pdf_robot
>>>
>>> # Process .isdoc file directly, without PDFRobot
>>> def process_isdoc(isdoc_path: pathlib.Path) -> InvoiceData:
...     processor = ISDOCProcessor(get_global_config(), ReaderConfig())
...     processor.isdoc_path = isdoc_path
...     return processor.process(None, OrderedDict())
...
>>> # Process .isdoc file from PDF, file is included as pdf attachment
>>> def process_isdoc_pdf(pdf_path: pathlib.Path) -> InvoiceData:
...     pdf_robot = create_pdf_robot(pdf_path)
...     processor = ISDOCProcessor(get_global_config(), ReaderConfig())
...     return processor.process(pdf_robot, OrderedDict())
class aiviro.modules.reader.ReaderConfig(order_number_formats: List[str] = <factory>, order_number_ignore_keywords: bool = False, document_language: Language | None = None, item_identifiers: Set[str] | Dict[str, Set[str]] = <factory>)

Configuration for InvoiceReader.

Parameters:
  • order_number_formats – List of regex patterns for order number, if not provided, default patterns will be used.

  • order_number_ignore_keywords – If set, order-number keywords are ignored, so the reader searches for the order number on every page. A keyword is a word or phrase that marks where the order number is located; for example, in “Reference: OD1234”, “Reference” is the keyword and “OD1234” is the order number.

  • document_language – If set, automatic language detection is skipped.

  • item_identifiers – If provided, the reader tries to find an identifier in the item’s product_code, description or content. If an identifier is found, it is stored in the item’s identifier field. If a dictionary is provided, each key is a tag name and each value is a set of possible identifiers.

Example:

>>> from aiviro.modules.reader import OrderNumberFormats, ReaderConfig
>>>
>>> if __name__ == "__main__":
...     # use predefined patterns (helios pattern '123456789/123')
...     reader_config = ReaderConfig(
...         order_number_formats=OrderNumberFormats.PRIMARY_REGEX_9SLASH3,
...         order_number_ignore_keywords=True,
...         item_identifiers={"CODE1234", "DESC1234", "CONTENT1234"},
...     )
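
Before passing custom patterns to order_number_formats, it can help to try them against sample order numbers with the standard `re` module. The sketch below is a plain-Python illustration: `matching_patterns` is a hypothetical helper, the third pattern is only a guess at the '123456789/123' shape (the actual value of PRIMARY_REGEX_9SLASH3 is not shown here), and using `re.fullmatch` is a choice made for this example rather than a statement about how the reader applies patterns.

```python
import re
from typing import List

# The first two patterns appear in the InvoiceReader example; the third is hypothetical.
patterns = [r"(\d{4,6})", r"([A-Z]{2}\d{4})", r"(\d{9}/\d{3})"]


def matching_patterns(candidate: str, patterns: List[str]) -> List[str]:
    """Return the patterns that match the whole candidate string."""
    return [p for p in patterns if re.fullmatch(p, candidate)]


for sample in ["4654", "AC1234", "123456789/123", "123"]:
    print(sample, matching_patterns(sample, patterns))
```

A sample that matches no pattern, such as "123" here, yields an empty list.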
class aiviro.modules.reader.InvoiceData(
    language: Optional[Language] = None,
    customer_id: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    customer_tax_id: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    customer_address: InvoiceField[InvoiceAddress] = InvoiceField(value=None, bound_box=None, page_index=-1),
    customer_address_recipient: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    customer_ico: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    customer_name: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    vendor_id: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    vendor_tax_id: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    vendor_address: InvoiceField[InvoiceAddress] = InvoiceField(value=None, bound_box=None, page_index=-1),
    vendor_address_recipient: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    vendor_ico: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    vendor_name: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    shipping_address: InvoiceField[InvoiceAddress] = InvoiceField(value=None, bound_box=None, page_index=-1),
    shipping_address_recipient: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    invoice_id: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    invoice_date: InvoiceField[datetime.date] = InvoiceField(value=None, bound_box=None, page_index=-1),
    due_date: InvoiceField[datetime.date] = InvoiceField(value=None, bound_box=None, page_index=-1),
    tax_date: InvoiceField[datetime.date] = InvoiceField(value=None, bound_box=None, page_index=-1),
    order_number: List[InvoiceField[str]] = <factory>,
    total_amount: InvoiceField[decimal.Decimal] = InvoiceField(value=None, bound_box=None, page_index=-1),
    total_amount_without_tax: InvoiceField[decimal.Decimal] = InvoiceField(value=None, bound_box=None, page_index=-1),
    total_tax: InvoiceField[decimal.Decimal] = InvoiceField(value=None, bound_box=None, page_index=-1),
    amount_due: InvoiceField[decimal.Decimal] = InvoiceField(value=None, bound_box=None, page_index=-1),
    currency: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    variable_symbol: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    payment_terms: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1),
    bank_accounts: List[InvoiceBankAccount] = <factory>,
    items: List[InvoiceItem] = <factory>,
    raw_items: List[InvoiceItem] = <factory>,
)

property page_boxes: Dict[int, List[BoundBox]]

Returns a dictionary of page index and list of BoundBox objects on that page.

class aiviro.modules.reader.InvoiceAddress(house_number: str, road: str, city: str, postal_code: str, street_address: str, country_code_A3: str)

class aiviro.modules.reader.InvoiceBankAccount(iban: InvoiceField[str], swift: InvoiceField[str], bank_name: InvoiceField[str], country_code: InvoiceField[str], local_account: LocalBankAccount)

class aiviro.modules.reader.InvoiceItem(index: int, item: InvoiceField[str], description: InvoiceField[str], quantity: InvoiceField[decimal.Decimal], unit_price: InvoiceField[decimal.Decimal], unit: InvoiceField[str], product_code: InvoiceField[str], tax: InvoiceField[decimal.Decimal], tax_rate: InvoiceField[decimal.Decimal], amount: InvoiceField[decimal.Decimal], amount_without_tax: InvoiceField[decimal.Decimal] = InvoiceField(value=None, bound_box=None, page_index=-1), identifier: InvoiceField[str] = InvoiceField(value=None, bound_box=None, page_index=-1), tags: List[InvoiceField[str]] = <factory>)

class aiviro.modules.reader.InvoiceField(value: T | None = None, bound_box: BoundBox | None = None, page_index: int = -1)
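
Since every field defaults to value=None, code that consumes extracted data usually needs a fallback for fields that were not found. The sketch below uses `FieldSketch`, a local stand-in that mirrors only the documented InvoiceField signature, and `value_or`, a hypothetical helper; treating a None value as "not extracted" is an assumption based on the defaults shown above.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class FieldSketch(Generic[T]):
    """Local stand-in mirroring the documented InvoiceField signature."""
    value: Optional[T] = None
    bound_box: Optional[object] = None  # a BoundBox in the real class
    page_index: int = -1


def value_or(field: FieldSketch[T], default: T) -> T:
    """Return the extracted value, or `default` when the value is None."""
    return default if field.value is None else field.value


found = FieldSketch(value="123456789", page_index=0)
empty: FieldSketch[str] = FieldSketch()

print(value_or(found, "n/a"), value_or(empty, "n/a"))
```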