Synthetic Document Generation Pipeline for Training Artificial Intelligence Models

Abstract. Embodiments described herein are directed towards a synthetic document generation pipeline for training artificial intelligence models. One embodiment includes a method including a device that receives an instruction to generate a document to be used as a training instance for a first machine learning model, the instruction including an element configuration, a document class configuration, a format configuration, an augmentation configuration, and data bias and fairness. The device can receive an element from an interface based at least in part on the element configuration, the element can simulate a real-world image, real-world text, or real-world machine-readable visual code. The device can generate metadata describe a layout for the element on the document based on the document class configuration. The device can generate the document by arranging the element on the document based on the metadata, wherein the document is generated in a format based on the format configuration.

Links: Patent

Updated: