An entity can be a word or a series of words that consistently refers to the same thing. Every detected entity is classified into a predefined category. For example, an NER model might detect the word "London" in a text and classify it as 'Geography'.
In our case, however, the entities have to be found in images. For this purpose, we first need to extract the text from the images, and to extract that text we use a technique called OCR.
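As a quick illustration, here is a minimal sketch of NER with spaCy's pre-trained English model (it assumes the "en_core_web_sm" model has been downloaded; spaCy's own label for geography-type entities is GPE):

```python
# Minimal NER sketch with a pre-trained spaCy model.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("London is the capital of the United Kingdom.")

for ent in doc.ents:
    # spaCy tags "London" as GPE (geopolitical entity),
    # its equivalent of a geography-type category.
    print(ent.text, ent.label_)
```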
What is Optical Character Recognition (OCR)?
OCR stands for Optical Character Recognition. It is a widespread technology for recognizing text inside images, such as scanned documents and photos. OCR converts any kind of image containing text (typed, handwritten, or printed) into a machine-readable text format.
For extracting the text we use the open-source Tesseract engine, which can be accessed from Python through the Pytesseract package.
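A minimal sketch of text extraction with Pytesseract is shown below; it assumes Tesseract is installed on the system, and "invoice.png" is a hypothetical sample image, not a file from this project:

```python
# Extract raw text from an image with Pytesseract.
from PIL import Image
import pytesseract

image = Image.open("invoice.png")   # hypothetical sample image
text = pytesseract.image_to_string(image)
print(text)
```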
Techniques used
- OpenCV
- spaCy
- NLTK
- OCR
- Regex
- Pandas
How does Pytesseract work?
Pytesseract analyzes an image in five stages, so the text can be collected step by step at increasing levels of detail:
Step 1: detect the complete page
Step 2: detect individual blocks of the image
Step 3: detect paragraphs
Step 4: detect lines
Step 5: detect words
For detecting entities, we collected the individual words from each image using step 5 and drew a rectangular box around each word using geometric transformations, as sketched below.
We applied the same procedure to the entire dataset, collecting the individual words and saving them to a CSV file.
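The sketch below shows one way to do this with Pytesseract's word-level output (level 5) and OpenCV; "invoice.png" and the output filename are hypothetical, and the box color and thickness are illustrative choices:

```python
# Detect words (level 5) and draw a box around each one.
import cv2
import pytesseract
from pytesseract import Output

image = cv2.imread("invoice.png")   # hypothetical sample image
data = pytesseract.image_to_data(image, output_type=Output.DATAFRAME)

# Keep only word-level rows (level 5) that actually contain text.
words = data[(data["level"] == 5) & (data["text"].notna())]

for _, row in words.iterrows():
    x, y = int(row["left"]), int(row["top"])
    w, h = int(row["width"]), int(row["height"])
    # Draw a rectangle around each detected word.
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("invoice_boxes.png", image)
```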
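A hedged sketch of building that word-level CSV for the whole dataset follows; the "images/" folder and "words.csv" filename are hypothetical paths, not the project's actual layout:

```python
# Collect word-level OCR output from every image into one CSV.
import glob
import cv2
import pandas as pd
import pytesseract
from pytesseract import Output

frames = []
for path in glob.glob("images/*.png"):      # hypothetical image folder
    image = cv2.imread(path)
    data = pytesseract.image_to_data(image, output_type=Output.DATAFRAME)
    words = data[(data["level"] == 5) & (data["text"].notna())].copy()
    words["image"] = path
    frames.append(words[["image", "text", "left", "top", "width", "height"]])

pd.concat(frames).to_csv("words.csv", index=False)
```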
For detecting the entities we need a class label for each word, so to build custom entity recognition on images we used the BIO tagging scheme, where B marks a token that begins an entity, I marks a token inside an entity, and O marks a token outside any entity. Using five different labels we turned the unstructured data into a structured format. The class labels we used to train the models are:
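For illustration, here is how the BIO scheme tags one image's word sequence; "B-NAME" and "I-NAME" are hypothetical label names used only for this sketch, not the project's actual label set:

```python
# BIO tagging of one word sequence (hypothetical labels).
words = ["John", "Smith", "Software", "Engineer"]
labels = ["B-NAME", "I-NAME", "O", "O"]

for word, label in zip(words, labels):
    print(f"{word}\t{label}")
```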
To train the custom entity recognizer, we selected a spaCy pre-trained model and converted the data into spaCy's training format: for each image, the complete text together with its corresponding words and labels is stored as a dictionary. We applied this conversion to the entire dataset and split it into a training part and a testing part. We trained for 50 epochs, and at the end of training the model reached 94% accuracy, 91% precision, and 90.6% recall. We then tested around 20 images with the trained model and checked the results.
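The sketch below shows the shape of that training data and a 50-epoch loop using the classic spaCy v2-style API. It is only an illustration under assumptions: the example text, entity spans, label names, dropout value, and the use of a blank English pipeline instead of the project's pre-trained model are all hypothetical.

```python
# Hedged sketch of spaCy v2-style NER training (illustrative data and labels).
import random
import spacy

# Each example: full text of one image plus character-offset entity spans.
TRAIN_DATA = [
    ("John Smith Software Engineer",
     {"entities": [(0, 10, "NAME"), (11, 28, "DESIGNATION")]}),
]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for epoch in range(50):                      # 50 epochs, as in this project
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], drop=0.2, sgd=optimizer, losses=losses)
    print(epoch, losses)
```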
However, the predictions come back only as the BIO labels we assigned to each word during training, so we needed a way to place them back on the image. Our solution was to collect, for each predicted word, the index position and the left and right positions that Pytesseract gives us; whenever a predicted word matches a word in the BIO-formatted data with the same name, we attach the detected label to it.
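One way this mapping could look is sketched below; it is an assumption-laden illustration, where `words` is the word-level DataFrame from the OCR step, `doc` is the spaCy prediction for that image's text, and the font and color settings are arbitrary:

```python
# Sketch: write predicted entity labels next to matching OCR word boxes.
import cv2

def annotate_predictions(image, words, doc):
    """image: OpenCV array; words: word-level DataFrame from image_to_data;
    doc: spaCy Doc produced by the trained model on the same image's text."""
    for ent in doc.ents:
        for token in ent.text.split():
            # Match predicted tokens against OCR words by their text.
            matches = words[words["text"] == token]
            for _, row in matches.iterrows():
                x, y = int(row["left"]), int(row["top"])
                cv2.putText(image, ent.label_, (x, max(y - 5, 0)),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1)
    return image
```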
Finally, after adding their index positions, the approach works well for detecting entities in the images.
A Few Predicted Images: