Excited to announce I’m building Docuglean - an open-source, privacy-focused document intelligence layer with three key foundations:

• Agentic OCR for parsing documents into high-quality, ready-to-use data, powered by SOTA vision-language models (VLMs) for complex tables, formats, and layouts
• An open-source PII engine that beats Microsoft's Presidio in accuracy and speed
• A suite of Document AI tools, including classify, extract (layout, bounding box support, structured schema), review, document splitting, chunking, multi-doc type conversion, multilingual support, and token counting

Why Build This?

Data is the fuel for AI innovation. Over the past couple of years, I've helped several traditional companies integrate AI workflows into their systems, and the biggest bottleneck has been extracting insights from existing data: millions of unstructured documents in PDFs and images (scanned, handwritten, etc.). Garbage document processing = garbage AI outputs.

Plus:
• Privacy concerns block enterprise AI adoption
• Complex PDFs with tables and layouts break traditional OCR

It doesn't matter how sophisticated your AI workflows, prompts, RAG, or agents are; getting the foundational document processing pipeline right is the only way to maximize the value of your data. That starts with parsing documents into ready-to-use, high-quality data.