Recognize and “read” the text embedded in an image

This example demonstrates the use of Python's pytesseract library to extract and analyze text embedded in an image.

Statement

The input is a .png image file selected via a File Reader from the local system. This image will be processed using pytesseract to extract any embedded text.
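
To try the example outside the service, the input can be approximated by base64-encoding the file's bytes yourself. The sketch below assumes the File Reader delivers a list of records whose "Data" field holds the base64 string, which is the shape the example reads through INPUT_DATA[0]["Data"]. The helper name build_input_data is hypothetical, and the sample bytes are only the PNG signature, not a complete image.

```python
import base64


def build_input_data(image_bytes: bytes) -> list:
    """Wrap raw image bytes in the record shape the example reads (assumed)."""
    # b64encode returns bytes; decode to str so the value is JSON-friendly.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [{"Data": encoded}]


# Stand-in bytes: the 8-byte PNG signature, not a complete image.
sample_bytes = b"\x89PNG\r\n\x1a\n"
INPUT_DATA = build_input_data(sample_bytes)

# Decoding the field gives back the original bytes unchanged.
round_trip = base64.b64decode(INPUT_DATA[0]["Data"])
```

With a real file, image_bytes would come from open(path, "rb").read().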

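import base64
import pytesseract
from PIL import Image
from langdetect import detect
from io import BytesIO

# INPUT_DATA is supplied by the service; the first record carries the image as a base64 string.
image_data = base64.b64decode(INPUT_DATA[0]["Data"])
image_pil = Image.open(BytesIO(image_data))

# Convert to grayscale ("L" mode), then run OCR with six Tesseract language codes.
ocr_result = pytesseract.image_to_string(image_pil.convert("L"), lang="ces+eng+pol+deu+spa+fra")

# Return the recognized text together with the language code guessed by langdetect.
return [{
  "Text": ocr_result,
  "Lang": detect(ocr_result)
}]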

This code decodes a base64-encoded image and runs pytesseract on it to extract text. The extracted text is then passed to langdetect to determine its language.
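
One caveat: langdetect.detect() raises LangDetectException when the text has no usable features, which is what happens when OCR returns an empty or whitespace-only string. A minimal sketch of a guard follows. The helper name detect_or_none is hypothetical, and the detector is passed in as a parameter so the helper can be exercised without langdetect installed; fake_detector is a stand-in for illustration only.

```python
from typing import Callable, Optional


def detect_or_none(text: str, detector: Callable[[str], str]) -> Optional[str]:
    """Return the detector's language code, or None when detection is not possible."""
    # Skip detection when OCR produced nothing but whitespace.
    if not text.strip():
        return None
    try:
        return detector(text)
    except Exception:
        # langdetect signals "no features" with LangDetectException. Catching
        # Exception keeps this sketch independent of that import; real code
        # would catch the specific exception instead.
        return None


def fake_detector(text: str) -> str:
    """Stand-in for langdetect.detect, used only for illustration."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return "en"
```

In the example, "Lang": detect_or_none(ocr_result, detect) would then yield None for images without recognizable text instead of failing the whole call.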

Explanation

Our Python service processes image input in base64 format. Upon receiving the input, it automatically decodes the base64 string and attempts to open it as an image using the Pillow library.
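
Decoding and opening can each fail on malformed input, so it may help to validate the payload before handing it to Pillow. The following is a sketch using only the standard library. The helper name decode_png and the decision to reject anything that is not a PNG are assumptions for illustration, not part of the service.

```python
import base64
import binascii

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def decode_png(encoded: str) -> bytes:
    """Decode a base64 string and check that the result starts like a PNG."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        raw = base64.b64decode(encoded, validate=True)
    except binascii.Error as exc:
        raise ValueError("input is not valid base64") from exc
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not look like a PNG image")
    return raw
```

The returned bytes can then be passed to Image.open(BytesIO(raw)) as in the example.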

The example then uses two library functions as follows:

  • pytesseract.image_to_string(): Runs Tesseract OCR on the image, which the example first converts to grayscale with convert("L"), and returns any readable text. The lang argument enables six languages: Czech, English, Polish, German, Spanish, and French.
  • langdetect.detect(): Analyzes the extracted text to determine the most probable language.
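
The lang argument is a list of Tesseract language codes joined with "+". Below is a small sketch of how the string used in the example could be assembled from readable names. The mapping and the helper name build_lang_argument are illustrative, and each code only works if the corresponding Tesseract language data is installed.

```python
# Language names from the list above mapped to the codes used in the example.
TESSERACT_CODES = {
    "Czech": "ces",
    "English": "eng",
    "Polish": "pol",
    "German": "deu",
    "Spanish": "spa",
    "French": "fra",
}


def build_lang_argument(languages):
    """Join Tesseract codes with '+' in the order the languages are given."""
    unknown = [name for name in languages if name not in TESSERACT_CODES]
    if unknown:
        raise KeyError(f"no Tesseract code known for: {', '.join(unknown)}")
    return "+".join(TESSERACT_CODES[name] for name in languages)


lang = build_lang_argument(
    ["Czech", "English", "Polish", "German", "Spanish", "French"]
)
```

Here lang holds the same value the example passes to image_to_string().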

For more information, refer to the Pytesseract documentation.

Conclusion

This example shows how to extract text from an image with Python’s pytesseract library: base64-encoded image data is decoded, opened with Pillow, run through Tesseract OCR, and passed to langdetect for language detection. The same approach can automate text extraction from scanned documents and multilingual sources for further processing or analysis.