3. The Hard Problems: Where Pipelines Bleed

3.1. Tables and Multi-column Layouts

Consider a two-column scientific PDF in French, with a sidebar in German and footnotes in Latin. A naive extractor reads straight across the columns, producing interleaved nonsense. Robust solutions combine line clustering with whitespace analysis and column detection (e.g., camelot or pdfplumber's table heuristics), as in the sketch below. But true generalization requires training on multilingual table corpora, which are extremely scarce.
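A minimal column-aware sketch using pdfplumber's crop-and-extract API. The hard-coded midline split is an assumption for illustration; real layouts need whitespace-gap analysis or learned column detection.

```python
# Naive two-column extraction: read each column in full instead of
# interleaving lines across the gutter.
import pdfplumber

def extract_two_column(path):
    texts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            mid = page.width / 2  # assumed fixed gutter position
            left = page.crop((0, 0, mid, page.height))
            right = page.crop((mid, 0, page.width, page.height))
            texts.append(left.extract_text() or "")
            texts.append(right.extract_text() or "")
    return "\n".join(texts)
```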
3.2. Embedded Fonts and Missing Glyphs

Many PDFs subset their fonts to reduce file size, discarding unused Unicode codepoints. When extracting, the engine may see glyph ID 42 but have no mapping back to U+0F67 (Tibetan). The fallback is a .notdef character or an empty string. A multilingual system must either maintain a font cache or use OCR as a secondary channel.

3.3. Right-to-Left and Mixed Direction

In PDF, Arabic text is often stored in logical order (left-to-right as typed) but rendered by the viewer using the Arabic shaping engine. The text extraction layer must reorder the characters for display: what is stored as [h, e, l, l, o, space, a, l, e, f] must become [f, e, l, a, space, h, e, l, l, o] after RTL runs are detected. Most extractors (e.g., pdftotext 4.00+) now handle this via the Unicode Bidirectional Algorithm, but errors still appear when numbers or embedded Latin words interrupt the flow.
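The reshape-and-reorder round trip is commonly wired up with arabic_reshaper and python-bidi. A minimal sketch; whether you need it, and in which direction, depends on how the particular PDF stored its text:

```python
# Convert logical-order Arabic (as produced by a well-behaved extractor)
# into visual order for display.
import arabic_reshaper
from bidi.algorithm import get_display

logical = "مرحبا world 123"                 # logical order, mixed direction
shaped = arabic_reshaper.reshape(logical)   # pick initial/medial/final forms
visual = get_display(shaped)                # apply the Unicode BiDi algorithm
print(visual)
```

Mixed runs like the Latin word and the digits above are exactly where BiDi implementations tend to disagree, so spot-check such lines.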
3.4. Historical and OCRed PDFs

Scanned, image-only PDFs have no text layer. A multilingual extractor must invoke OCR (Tesseract, EasyOCR, PaddleOCR) with automatic script detection. A single page may mix Fraktur (German blackletter) with modern Latin, or Ottoman Turkish in Arabic script. OCR confidence must be reported per region, and downstream NLP must tolerate character error rates above 20%.
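A sketch of script-aware OCR with pytesseract and pdf2image. The script-to-model mapping is an illustrative assumption (traineddata names such as deu_latf vary across tessdata versions), and Tesseract's OSD pass needs osd.traineddata installed:

```python
# Per-page OCR: detect the dominant script first, then OCR with a
# matching model and report mean word confidence per page.
import pytesseract
from pdf2image import convert_from_path

SCRIPT_TO_LANG = {"Latin": "eng", "Arabic": "ara", "Fraktur": "deu_latf"}

for page_no, image in enumerate(convert_from_path("scanned.pdf", dpi=300)):
    osd = pytesseract.image_to_osd(image)  # orientation + script guess
    script = next((line.split(":")[1].strip()
                   for line in osd.splitlines()
                   if line.startswith("Script:")), "Latin")
    lang = SCRIPT_TO_LANG.get(script, "eng")
    data = pytesseract.image_to_data(image, lang=lang,
                                     output_type=pytesseract.Output.DICT)
    confs = [float(c) for c in data["conf"] if float(c) >= 0]
    mean_conf = sum(confs) / len(confs) if confs else 0.0
    print(f"page {page_no}: script={script} lang={lang} conf={mean_conf:.1f}")
```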
4. Landscape: Existing Tools and Their Blind Spots

| Tool | Strengths | Multilingual Weaknesses |
|------|-----------|-------------------------|
| pdfminer.six (Python) | Precise layout extraction | No built-in RTL reordering; broken for many Arabic PDFs |
| pdftotext (Poppler) | Fast, reliable for Latin/Cyrillic | Limited complex-script support; no table detection |
| Adobe Extract API | Cloud-based; handles ligatures and tables | Proprietary; costly for bulk use |
| GROBID | Excellent for scientific references (any language) | Requires training data per layout; not a general PDF extractor |
| Tesseract + PDF | OCR fallback for scanned docs | Requires manual script selection unless wrapped |

No open-source tool currently handles all scripts with high accuracy. The state of the art remains a hybrid: pdfminer for vector PDFs + langdetect + arabic_reshaper + bidi.algorithm + a pytesseract fallback. It is a fragile pipeline.

5. Architectural Deep Dive: A Robust Pipeline Design

A production-grade multilingual PDF-to-text system should implement the following stages, with failure recovery at each step:

Stage 1: Glyph-to-character mapping (ICU, HarfBuzz). For complex scripts (Devanagari, Thai, Arabic), PDFs may store precomposed glyphs (e.g., क + ् + त → क्त) or separate components that must be reordered and ligated. A multilingual engine must reverse the shaping process. For Arabic, it must recover the base character from initial/medial/final glyph forms. For Tamil, it must reorder vowel signs that appear to the left or right of the consonant in print but must follow the consonant in logical Unicode order.

Stage 2: Reading-order resolution (heuristics + ML). PDFs lack a DOM tree. Text blocks must be clustered by Y-coordinate (lines), then by X-coordinate (words), then sorted. For Latin scripts, a simple top-to-bottom, left-to-right rule works roughly 80% of the time. But for Mongolian (vertical), traditional Japanese (top-to-bottom, right-to-left columns), or mixed scripts (Arabic text with Latin numbers), static heuristics fail. Modern systems (e.g., Adobe's Extract API, Google's DocAI) use layout-aware transformers (LayoutLM, Donut) trained on millions of document pages to infer logical spans.

Stage 3: Language identification (CLD3, fastText, or BERT). A single page may contain three languages. The extractor must identify each word's script and language to apply the correct Unicode normalization and reordering. Misidentification (treating Polish “ł” as a Latin-1 glyph, or Bengali as Devanagari) propagates errors downstream.
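A per-block confidence gate, sketched with langdetect standing in for CLD3; the library choice and the 0.7 threshold (mirroring the pipeline below) are assumptions:

```python
# Identify a block's language; flag low-confidence blocks for OCR fallback.
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # langdetect is nondeterministic without a seed

def identify_block(text, threshold=0.7):
    try:
        best = detect_langs(text)[0]   # candidates sorted by probability
    except Exception:                  # too short, or no alphabetic features
        return None, 0.0
    if best.prob < threshold:
        return None, best.prob         # caller should re-OCR this block
    return best.lang, best.prob

print(identify_block("Die Würde des Menschen ist unantastbar."))  # ('de', ~1.0)
```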
These stages compose into a single conceptual pipeline:

```python
# Conceptual pipeline (pseudo-code: the helper functions are placeholders)
import unicodedata

class MultilingualPDFExtractor:
    def extract(self, path):
        # Stage 0: Render to images and pull the raw text layer
        images = pdf2images(path, dpi=150)
        raw_textruns = pdfminer_extract(path)

        # Stage 1: Glyph-to-character (reverse HarfBuzz shaping)
        char_sequence = harfbuzz_shape(raw_textruns, font=extract_fonts(path))

        # Stage 2: Reading order (detect columns / vertical text)
        blocks = cluster_by_position(char_sequence)
        ordered = resolve_reading_order(blocks)  # ML or heuristic

        # Stage 3: Language ID per block (CLD3)
        for i, block in enumerate(ordered):
            lang, confidence = detect_language(block.text)
            if confidence < 0.7:
                # Fall back to OCR for this block, then re-detect
                block = ocr_region(images, block.bbox)
                ordered[i] = block
                lang, confidence = detect_language(block.text)
            block.lang = lang

            # Stage 4: BiDi reordering if RTL
            if script_is_rtl(lang):
                block.text = bidi_reshape(block.text)

        # Stage 5: Normalization (NFKC for compatibility)
        return unicodedata.normalize('NFKC', ' '.join(b.text for b in ordered))
```
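Why Stage 5 uses NFKC: compatibility normalization folds the presentation forms that PDF text layers frequently emit back into plain codepoints. A quick, runnable illustration:

```python
import unicodedata

print(unicodedata.normalize("NFKC", "\ufb01"))          # 'fi' (U+FB01 ligature)
print(unicodedata.normalize("NFKC", "\ufe91"))          # 'ب' (BEH initial form)
print(unicodedata.normalize("NFKC", "e\u0301") == "é")  # True: composed form
```

Note that NFKC is lossy by design: it also folds distinctions such as superscripts ("²" becomes "2"), which may matter for scientific text.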