Excited to announce I’m building Docuglean - an open-source, privacy-focused document intelligence layer with three key foundations:
• Agentic OCR for parsing documents into high-quality, ready-to-use data, powered by SOTA VLMs for complex tables, formats, and layouts
• An open-source PII engine that aims to beat Microsoft's Presidio in accuracy and speed
• A suite of Document AI tools including classify, extract (layout, bounding box support, structured schema), review, document splitting, chunking, multi-doc type conversion, multilingual support, and token counting

Why Build This?
Data is the fuel for AI innovation. Over the past couple of years, I've helped several traditional companies integrate AI workflows into their systems, and the biggest bottleneck has been extracting insights from existing data - millions of unstructured documents in PDFs and images (scanned, handwritten, etc.). Garbage document processing = garbage AI outputs.
Plus:
• Privacy concerns block enterprise AI adoption
• Complex PDFs with tables/layouts break traditional OCR
It doesn't matter how sophisticated your AI workflows, prompts, RAG, or agents are; getting the foundational document processing pipeline right is the only way to maximize the value of your data. That starts with parsing documents into ready-to-use, high-quality data.

Why Agentic OCR?
Over the past few months, there's been a huge influx of amazing SOTA VLMs from Qwen, Allen AI, and Hugging Face. But documents (especially PDFs) are tricky. Complex layouts, tables, and formats mean using just VLMs isn't enough.
Agentic OCR:
✅ Reviews outputs for errors
✅ Enforces layout/structure awareness
✅ Adds precise bounding boxes
✅ Handles complex tables & formats
Initially, Docuglean will work with existing VLMs; however, I plan to build and release custom domain-specific models that provide fast document intelligence without compromising privacy.
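To make the "reviews outputs for errors" idea concrete, here is a minimal sketch of a parse-review-retry loop. It is an illustration only, not Docuglean's actual API: `Block`, `review`, `agentic_parse`, and the `parse` callable (standing in for a call to whatever VLM is used) are all hypothetical names, and the checks shown (bounding-box sanity, consistent table column counts, non-empty text) are just examples of what a reviewer might enforce.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed page coordinates


@dataclass
class Block:
    """One parsed region of a page (hypothetical structure)."""
    kind: str   # e.g. "text" or "table"
    text: str
    bbox: BBox
    rows: List[List[str]] = field(default_factory=list)  # table cells, if any


def review(blocks: List[Block]) -> List[str]:
    """Return human-readable problems found in the parsed blocks."""
    problems: List[str] = []
    for i, block in enumerate(blocks):
        x0, y0, x1, y1 = block.bbox
        if not (x0 < x1 and y0 < y1):
            problems.append(f"block {i}: degenerate bounding box {block.bbox}")
        if block.kind == "table":
            widths = {len(row) for row in block.rows}
            if len(widths) > 1:
                problems.append(
                    f"block {i}: inconsistent column counts {sorted(widths)}"
                )
        elif not block.text.strip():
            problems.append(f"block {i}: empty text")
    return problems


def agentic_parse(
    page: object,
    parse: Callable[[object, List[str]], List[Block]],
    max_rounds: int = 3,
) -> Tuple[List[Block], List[str]]:
    """Parse a page, review the result, and re-parse with feedback.

    `parse` is a placeholder for a VLM call that accepts the page plus the
    reviewer's feedback from the previous round. Returns the last parse and
    any problems that remain unresolved after `max_rounds` attempts.
    """
    blocks: List[Block] = []
    feedback: List[str] = []
    for _ in range(max_rounds):
        blocks = parse(page, feedback)
        feedback = review(blocks)
        if not feedback:
            break
    return blocks, feedback
```

The point of the design is that the model's output is never trusted on the first pass: each round feeds the reviewer's findings back into the next parse, and anything still unresolved is returned to the caller instead of being silently dropped.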

Privacy shouldn't be an afterthought 🔒
The biggest blocker for enterprise AI adoption has always been privacy concerns. For an AI-first document processing pipeline, privacy should be the default, not an afterthought to check compliance boxes. I'm fascinated by this challenge and have written extensively on the topic. My goal is to create an AI-first PII engine that matches and surpasses Microsoft's Presidio in accuracy and speed.