Athento SE (Athento Smart Engine) is the intelligent engine of Athento, focused on document analysis and document capture. It is a web application designed to process documents, extract information from them, and automate their processing online.
Athento SE can analyze many characteristics of documents, such as predominant colors, histograms, OCR, HOCR, white analysis, logo detection, tables and text orientation, among other features.
This analysis makes it possible to automate tasks such as document classification, organization and categorization in document repositories such as SharePoint, Documentum, FileNet, Alfresco, Nuxeo or OpenText, as well as the extraction of data from that content.
Athento is a web application developed in Python and designed to run in cloud environments. It incorporates more than 100 operations for document analysis. The Tesseract OCR engine is used by default, but other engines can be used, including OCR, ICR and OMR engines.
Loading documents to Athento
The system can capture content from multiple sources, including:
- User interface: Form and Drag & Drop.
- Hot folders: This feature is ideal for bulk loading of documents. A daemon regularly checks the folder at the interval configured in a Celery task.
- By email.
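The hot-folder polling described in the list above can be sketched in plain Python. This is only an illustration under stated assumptions, not Athento's implementation: the function name `scan_hot_folder` and the `seen` set are hypothetical, and in a real deployment a periodic job (the text mentions a Celery task) would call the function at the configured interval.

```python
import os


def scan_hot_folder(folder, seen):
    """Return paths of files in `folder` not reported by earlier scans.

    `seen` is a set of already-reported paths; it is updated in place.
    Hypothetical helper: a periodic task would call this on each run.
    """
    new_files = []
    for entry in sorted(os.scandir(folder), key=lambda e: e.name):
        if entry.is_file() and entry.path not in seen:
            seen.add(entry.path)
            new_files.append(entry.path)
    return new_files
```

Each call returns only the files that appeared since the previous call, so the same document is not loaded twice.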
Extracting the text of a document or content
Athento can convert scanned PDFs, images and other digital formats into editable text output formats. This is possible because Athento applies OCR and other technologies to documents and digital content.
The following are the file formats allowed in Athento:
|Images||Indexable File Formats||Audio|
|GIF (.gif)||TXT (.txt)||MP3, mono only (.mp3)|
|JPG (.jpg and .jpeg)||Microsoft Word (.doc and .docx)||OGG (.ogg)|
| ||HTML (.html and .htm)||Microsoft WAV (.wav)|
|PNG (.png)||Microsoft Excel (.xls and .xlsx)|| |
|TIFF (.tiff)||PowerPoint (.pptx)|| |
| ||OpenOffice (.odt)|| |
Athento can also process audio files by applying Speech Recognition technologies.
Athento extracts the text from each page of a document.
Athento is able to automatically identify document types. This task can take place by several methods, for example:
1. Keywords contained in the document: For example, we can use the keyword "invoice" to identify an invoice.
2. Image Anchor Templates: Athento allows you to define images from documents to act as anchor templates. This makes it possible, for example, to classify a document by looking for a logo.
3. Barcodes and QR codes: Athento can classify a document depending on a value contained in a barcode or QR code.
4. Blank percentage within an image: Combined with other methods, this can improve classification results.
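As an illustration of the keyword method, the following minimal sketch returns the first document type whose keywords appear in the extracted text. The `rules` format and the function name are assumptions made for this example; they do not describe how Athento stores its classification rules.

```python
def classify_by_keywords(text, rules):
    """Return the first document type whose keywords appear in `text`.

    `rules` maps a document type to a list of keywords (hypothetical format).
    Matching is case-insensitive; returns None when nothing matches.
    """
    lowered = text.lower()
    for doc_type, keywords in rules.items():
        if any(keyword.lower() in lowered for keyword in keywords):
            return doc_type
    return None


rules = {"invoice": ["invoice", "amount due"], "contract": ["agreement"]}
print(classify_by_keywords("INVOICE No. 2024-001", rules))  # invoice
```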
Classification and separation of batches
Athento classifies, page by page, a group of documents contained in a single PDF. The same classification operations that apply to individual documents can be applied to each page of the batch.
Once the pages are classified, Athento creates individual documents from the different document types it has found in the batch.
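The separation step can be sketched as grouping consecutive pages that received the same classification. This is a simplified illustration: it assumes a new document starts whenever the page type changes, so two consecutive documents of the same type would be merged. The function name and the data layout are hypothetical.

```python
from itertools import groupby


def split_batch(page_types):
    """Group consecutive pages of the same type into documents.

    `page_types` is the per-page classification result, in page order.
    Returns a list of (doc_type, [page_numbers]) with 1-based page numbers.
    """
    documents = []
    numbered = enumerate(page_types, start=1)
    for doc_type, group in groupby(numbered, key=lambda item: item[1]):
        documents.append((doc_type, [page for page, _ in group]))
    return documents
```

For a batch classified as invoice, invoice, contract, invoice, the function yields three documents: pages 1 and 2, page 3, and page 4.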
Athento allows the automatic extraction of a metadata schema associated with a document type. The metadata fields can be defined from the user interface.
Athento provides multiple methods for data extraction.
Regular Expressions: Athento looks for patterns within the textual content of the document. This feature is useful for extracting dates, ID cards, bank accounts, or any data with a clearly defined format. With this mechanism, we need not define the exact area where the data will appear.
Strings delimited by start and end expressions: Athento allows extraction of information that lies between two strings. This method is useful when the data we are looking for always appears in the same textual context. With this mechanism, we need not define the exact area where the data will appear.
Zonal OCR: Allows you to extract information when we know the exact area in which it appears. Athento allows us to configure this area with the area selector in the document preview. This feature is very useful for forms processing.
Bar codes and QR codes: Athento detects standard bar code and QR code formats.
HOCR: Locates the coordinates of a specific word contained within the text of the document.
Extracting tabular information: Athento is able to extract tables from documents. A table-type metadata field exports to Excel the tabular information enclosed within the specified coordinates.
Auto-increment number: Athento assigns an auto-incrementing number to documents that belong to a particular document type.
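The first two methods in this section, regular expressions and strings delimited by a start and an end, can be sketched with Python's standard `re` module. The date pattern and the function names below are assumptions chosen for illustration, not Athento's own configuration.

```python
import re

# Illustrative pattern for dates such as 31/12/2024 (an assumption, not Athento's).
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def extract_by_regex(text, pattern=DATE_PATTERN):
    """Return every match of `pattern` in the text, wherever it appears."""
    return pattern.findall(text)


def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    match = re.search(re.escape(start) + r"(.*?)" + re.escape(end), text, re.DOTALL)
    return match.group(1).strip() if match else None
```

Neither function needs page coordinates: both work on the extracted text alone, which is why these methods do not require defining the area where the data appears.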
For more information on configuring metadata, see the metadata extraction guide of Athento Smart Engine.
Handwritten Signatures and Seals Detection
Athento SE can detect the presence or absence of an object in a given area. However, it cannot directly recognize a signature or a seal. Detection works by determining whether the area is empty, based on the percentage of white within it: the higher the percentage of white, the emptier the area.
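The white-percentage check can be sketched as follows. The image is modeled as a plain list of rows of grayscale values so that the example needs no imaging library; the thresholds (`250` for "white" and `98.0` percent for "empty") and the function names are assumptions for illustration only.

```python
def white_percentage(pixels, box, threshold=250):
    """Percentage of near-white pixels inside `box`.

    `pixels` is a list of rows of grayscale values (0 = black, 255 = white).
    `box` is (left, top, right, bottom), with right and bottom exclusive.
    """
    left, top, right, bottom = box
    values = [v for row in pixels[top:bottom] for v in row[left:right]]
    if not values:
        raise ValueError("box selects no pixels")
    white = sum(1 for v in values if v >= threshold)
    return 100.0 * white / len(values)


def area_has_content(pixels, box, max_white=98.0):
    """Treat the area as signed or stamped when it is not almost entirely white."""
    return white_percentage(pixels, box) < max_white
```

A signature box that is 100 percent white is reported as empty, while a box containing ink falls below the threshold and is reported as having content.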
PDF Compression: Athento allows you to compress PDF files. Compression is performed with Ghostscript, so that a good level of compression is achieved without much loss of quality in the document. Depending on the type of document, compression rates of up to 50% can be reached.
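The text states that compression relies on Ghostscript but does not list the options used. The sketch below builds a command line from a commonly used set of `pdfwrite` options; the function names and the choice of the `/ebook` preset are assumptions, and running `compress_pdf` requires the `gs` executable to be installed.

```python
import subprocess


def ghostscript_command(src, dst, quality="/ebook"):
    """Build a Ghostscript command line that rewrites `src` as a smaller PDF."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS={quality}",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={dst}",
        src,
    ]


def compress_pdf(src, dst, quality="/ebook"):
    """Run Ghostscript; requires the `gs` executable on PATH."""
    subprocess.run(ghostscript_command(src, dst, quality), check=True)
```

Lower-resolution presets such as `/screen` produce smaller files at the cost of image quality, so the preset is the main trade-off to tune.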
In addition, Athento can automate the "archiving" of documents, so that once a certain condition of quality and time is met, the documents are saved in an archive repository. This makes storage management more efficient.
Quality: Athento can apply different image filters that improve the quality of documents. Some of these filters are listed below.
- Salt & pepper
- Bayesian blur
Corrections: Athento allows you to correct images. For instance, it detects page rotations of 90, 180 and 270 degrees and uses the detected angle to straighten the page. In addition, scanned pages are often a few degrees off from straight, so that the edges of the scanned page do not align with the edges of the PDF document; a "deskewing" feature corrects these small rotations.
Athento can also remove blank pages from a PDF. This is possible because the software measures the percentage of white on a page and deletes the page when it is empty.
Detecting Predominant Colors on a Document
Athento can detect the colors appearing more frequently in a document. This information can be used for classifying documents.
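Counting the most frequent colors can be sketched with the standard library. Pixels are modeled as `(r, g, b)` tuples so that no imaging library is needed; the function name and the return format are assumptions for illustration.

```python
from collections import Counter


def predominant_colors(pixels, top=3):
    """Return the `top` most frequent colors as (color, share) pairs.

    `pixels` is an iterable of (r, g, b) tuples; share is a fraction of all pixels.
    """
    counts = Counter(pixels)
    total = sum(counts.values())
    return [(color, count / total) for color, count in counts.most_common(top)]
```

The resulting shares could then feed a classification rule, for example treating a page dominated by a particular brand color as a candidate for a given document type.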
Athento SE uses Semantics to find key terms in documents and turn them into labels or tags that connect to other documents which include the same terms. By default, Athento uses Open Calais' ontology. Using Open Calais, Athento can detect and tag documents with dates, people, companies, cities and countries.
In addition, Athento allows customers to use their own ontology. You will be able to automagically tag documents with terms that are relevant to your business. This is possible through integration with Apache Stanbol.
Detecting and Deleting Human Faces from Documents
Athento can recognize human faces in documents and delete them automatically from the image.
In order to protect privacy and personal data, Athento allows you to anonymize certain areas within the document. This is typically used to hide data from certain user profiles when they consult documents.
The areas may be indicated by exact coordinates or by using text anchors within the text of the document. Anonymization of metadata can be set up from Athento's interface.
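Anonymization by exact coordinates can be sketched as overwriting a rectangle of the page image. As a simplification, the image is a list of rows of grayscale values; the function name and the `box` convention are hypothetical.

```python
def redact_area(pixels, box, fill=0):
    """Return a copy of `pixels` with the rectangle `box` overwritten by `fill`.

    `pixels` is a list of rows of grayscale values.
    `box` is (left, top, right, bottom), with right and bottom exclusive.
    """
    left, top, right, bottom = box
    redacted = [list(row) for row in pixels]
    for row in redacted[top:bottom]:
        row[left:right] = [fill] * len(row[left:right])
    return redacted
```

The function works on a copy, so the original image stays available for profiles that are allowed to see the full document.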
Athento can get documents from a CMIS repository in order to process them. Likewise, it is able to save processed documents into a particular external repository. This is possible by using CMIS. Load configuration in both directions is set up from the software interface.
Extracting data from external databases
Athento can obtain data from external databases. These databases are loaded into the system as CSV files.
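A lookup against data loaded from a CSV file can be sketched as follows. The column names and the sample row are invented for illustration, and the function name is hypothetical; the example only shows the general idea of indexing CSV rows by a key extracted from a document.

```python
import csv
import io


def load_lookup(csv_file, key_column):
    """Index the rows of a CSV file by the value of `key_column`."""
    return {row[key_column]: row for row in csv.DictReader(csv_file)}


# Hypothetical supplier table; column names are illustrative only.
data = io.StringIO("tax_id,name,city\nB12345678,Acme SL,Malaga\n")
suppliers = load_lookup(data, "tax_id")
print(suppliers["B12345678"]["name"])  # Acme SL
```

A value extracted from a document, such as a tax identifier, can then be used as the key to fill in the remaining metadata fields.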
Other operations on files
Renaming files: Athento allows you to automatically change a filename after processing.
File Compression: Athento allows downsizing of PDF documents.
Merge of Documents: Athento allows merging several documents into a single PDF document.