Recognize and “read” the text embedded in an image
This example demonstrates the use of Python's pytesseract library to extract and analyze text embedded in an image.
Statement
The input is a .png image file selected via a File Reader from the local system. This image will be processed using pytesseract to extract any embedded text.
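import base64
from io import BytesIO

import pytesseract
from langdetect import detect
from PIL import Image

# Decode the base64 payload supplied by the File Reader and open it with Pillow.
image_data = base64.b64decode(INPUT_DATA[0]["Data"])
image_pil = Image.open(BytesIO(image_data))

# Convert the image to grayscale ("L" mode), then run OCR with six language models.
ocr_result = pytesseract.image_to_string(
    image_pil.convert("L"), lang="ces+eng+pol+deu+spa+fra"
)

# Return the extracted text together with its most probable language.
return [{
    "Text": ocr_result,
    "Lang": detect(ocr_result)
}]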
In this code, a base64-encoded image is decoded and processed using pytesseract to extract text. The detected text is then analyzed to determine its language.
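One caveat with the language step: langdetect.detect() raises an exception when the text contains no usable features, for example when OCR returns an empty string for an image without text. A minimal sketch of a guard follows. The helper name detect_language, the injected detector parameter, and the "unknown" fallback value are illustrative assumptions and are not part of the original script.

```python
from typing import Callable


def detect_language(
    text: str, detector: Callable[[str], str], fallback: str = "unknown"
) -> str:
    """Return the detected language code, or `fallback` if detection is not possible.

    `detector` is assumed to behave like langdetect.detect: it takes a string,
    returns a language code, and may raise on unusable input.
    """
    cleaned = text.strip()
    if not cleaned:
        # OCR found nothing, so skip detection instead of letting it raise.
        return fallback
    try:
        return detector(cleaned)
    except Exception:
        # langdetect signals "no features in text" with an exception; the broad
        # catch keeps this sketch independent of the library import.
        return fallback
```

In the script above, this could be called as detect_language(ocr_result, detect) in place of the direct detect(ocr_result) call.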
Explanation
Our Python service processes image input in base64 format. Upon receiving the input, it automatically decodes the base64 string and attempts to open it as an image using the Pillow library.
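Because the input is expected to be a .png file, the payload can be checked before it reaches Pillow. The sketch below decodes the base64 string and verifies the standard 8-byte PNG signature. The function name decode_png_payload and the error messages are hypothetical, and the strict validate=True mode is an assumption that the File Reader delivers base64 without line breaks.

```python
import base64
import binascii

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def decode_png_payload(encoded: str) -> bytes:
    """Decode a base64 string and verify that the result starts with the PNG signature."""
    try:
        # validate=True rejects any character outside the base64 alphabet,
        # including whitespace and line breaks.
        raw = base64.b64decode(encoded, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("input is not valid base64") from exc
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG image")
    return raw
```

The returned bytes can then be wrapped in BytesIO and passed to Image.open() exactly as in the script.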
The script then relies on two library functions:
pytesseract.image_to_string(): Extracts any readable text from the image using Tesseract OCR with multiple language models (Czech, English, Polish, German, Spanish, and French). The image is converted to grayscale beforehand with Pillow's convert("L").
langdetect.detect(): Analyzes the extracted text and returns the code of the most probable language.
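The two libraries use different language codes: the lang argument in the script uses Tesseract's three-letter codes (ces, eng, ...), while langdetect returns two-letter ISO 639-1 codes (cs, en, ...). A small sketch of bridging the two for the six languages used here follows. The names ISO_TO_TESSERACT and build_lang_argument are hypothetical helpers, not part of either library.

```python
# Tesseract codes from the script, keyed by the ISO 639-1 codes that
# langdetect returns for the same languages.
ISO_TO_TESSERACT = {
    "cs": "ces",  # Czech
    "en": "eng",  # English
    "pl": "pol",  # Polish
    "de": "deu",  # German
    "es": "spa",  # Spanish
    "fr": "fra",  # French
}


def build_lang_argument(iso_codes):
    """Build the '+'-separated string passed as `lang` to image_to_string."""
    unknown = [code for code in iso_codes if code not in ISO_TO_TESSERACT]
    if unknown:
        raise ValueError(f"no Tesseract mapping for: {', '.join(unknown)}")
    return "+".join(ISO_TO_TESSERACT[code] for code in iso_codes)
```

For example, build_lang_argument(["cs", "en"]) yields "ces+eng", which restricts OCR to Czech and English.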
For more information, refer to the Pytesseract documentation.
Conclusion
This example shows how to extract text from images using Python’s pytesseract library: base64-encoded image data is decoded, converted to grayscale, passed through OCR, and analyzed with language detection. The same approach can be used to automate text extraction from scanned documents and multilingual sources for further processing or analysis.