• Moving documents from paper to bits to text



    BEST SOFTWARE


    Moving documents from paper to bits to text

    By Lincoln Spector

    Remember the paperless office? Never happened; we still have piles of paper that take up too much room, can be difficult to search, and can’t be encrypted. OCR software lets you scan important documents and turn them into searchable PDFs. But the technology is still far from perfect.
    On the other hand, you’ll miss a lot of junk, too.


    The full text of this column is posted at windowssecrets.com/best-software/moving-documents-from-paper-to-bits-to-text/ (paid content, opens in a new window/tab).

    Columnists typically cannot reply to comments here, but do incorporate the best tips into future columns.


    • #1430193

      Thanks, Lincoln.
      Another option you didn’t mention is getting the software bundled with a scanner. I got Fujitsu’s ScanSnap S1300i. It’s a full duplex scanner (both sides) that makes scanning office docs, from business cards to 8.5×14 pages, fast and easy. Much easier than a flatbed. It auto-corrects and adjusts all the typical things (colour or not, duplex or not, straightening, etc.). You can set it to default to various formats, with or without OCR. I typically have it scan to PDF, then run OCR as a separate step on those documents that would benefit from it. (I also scan photos, notes, etc.) It came with ABBYY – the only limitation being that it checks the Meta tags to ensure the document was scanned with the ScanSnap.

      I’ve now processed thousands of pages with it – the old file cabinets, archives, shurlock books, and binders of grad work. Vastly easier than a page-by-page flatbed. And now it’s all fully searchable and quotable.

      I’ve also used ABBYY and Fujitsu scanners professionally in a shop that processed thousands of pages to PDF daily, so I knew they were both excellent and high quality.

    • #1430198

      How accurate are your scans? In my experience, you don’t get perfect accuracy with the scans, and you always need to proofread after scanning.

      I haven’t done it in a good while, so maybe things have improved.
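
      For anyone who wants to put a number on "how accurate", here is a minimal sketch that compares raw OCR output against a proofread copy of the same page. It uses only Python's standard `difflib` module; the function names and the sample strings are made up for illustration, and the ratio it reports is a rough similarity score, not a formal character error rate.

```python
from difflib import SequenceMatcher


def ocr_similarity(ocr_text: str, reference_text: str) -> float:
    """Return a 0..1 similarity ratio between OCR output and a proofread reference."""
    # autojunk=False stops difflib from ignoring frequent characters in long texts.
    matcher = SequenceMatcher(None, ocr_text, reference_text, autojunk=False)
    return matcher.ratio()


def differing_spans(ocr_text: str, reference_text: str) -> list[tuple[str, str]]:
    """List (ocr_fragment, reference_fragment) pairs wherever the two texts disagree."""
    matcher = SequenceMatcher(None, ocr_text, reference_text, autojunk=False)
    return [
        (ocr_text[i1:i2], reference_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Hypothetical example: the OCR engine read "m" as "rn".
    ocr = "the rnodern office"
    proofread = "the modern office"
    print(f"similarity: {ocr_similarity(ocr, proofread):.3f}")
    print(differing_spans(ocr, proofread))
```

      Proofreading a sample page or two by hand and running something like this gives a feel for whether the rest of a batch needs a full read-through.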

      Group "L" (Linux Mint)
      with Windows 10 running in a remote session on my file server
      • #1430424

        I think you’re wrong about “Window’s own search tool” not finding words inside PDFs. I just checked that with Windows 7 SP1 by typing an unusual word, which occurs in a dozen of my PDFs, into the Start Button’s search box. It instantly popped up all of those PDFs in the search results window under “documents”. (That said, I still greatly prefer X1 Search over Windows search – it is far more flexible, and it instantly displays the files it finds, with the search term highlighted, if they are a common file type (.doc, .xls, .pdf, etc.).) It will also index and search Outlook, although I don’t use that feature.

        Also, for those applications that require proofing, editing, etc., you might also consider Omnipage, from Nuance. I haven’t used ABBYY FineReader, but from reading your description, it sounds like Omnipage is about the same price ($150 list) and will do everything you describe, plus more. Omnipage may be a bit more complicated to use because of the extra capabilities. It includes the ability to scan multipage documents, including large books, where it is worthwhile to optimize the recognition accuracy for the specific peculiarities of the font used in the document being scanned. I’ve found this particularly valuable when scanning old genealogical documents found in libraries or on the web, where the documents may have already been copied or scanned in a sub-optimal manner, with relatively poor quality.

        Did you check how ABBYY FineReader handles multi-column documents? If a document is to undergo further editing, it’s really important (and not easy) for the OCR app to properly interpret the column-to-column flow.
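
        To show why column flow matters, here is a toy sketch of the reading-order problem. It assumes the OCR step has already produced text blocks with page coordinates and that the page has exactly two columns split at a known x position. The `Block` type, the `reading_order` function, and the coordinates are all hypothetical; real OCR packages do much more elaborate layout analysis, and this is not a description of how ABBYY or Omnipage work.

```python
from typing import NamedTuple


class Block(NamedTuple):
    x: float   # left edge of the text block
    y: float   # top edge of the text block (larger values are lower on the page)
    text: str


def reading_order(blocks: list[Block], column_split_x: float) -> list[str]:
    """Order blocks down the left column first, then down the right column."""
    def sort_key(block: Block) -> tuple[int, float, float]:
        column = 0 if block.x < column_split_x else 1
        return (column, block.y, block.x)

    return [block.text for block in sorted(blocks, key=sort_key)]


if __name__ == "__main__":
    # Blocks listed in plain top-to-bottom order, which interleaves the columns.
    page = [
        Block(10, 10, "left, paragraph 1"),
        Block(300, 10, "right, paragraph 1"),
        Block(10, 50, "left, paragraph 2"),
        Block(300, 50, "right, paragraph 2"),
    ]
    for line in reading_order(page, column_split_x=200):
        print(line)
```

        A tool that simply reads across the page would emit the blocks in the order they are listed above, which is the scrambled text you end up fixing by hand.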

        And a final comment: If a document is to undergo further editing after OCR, I recommend using plain text output from the OCR application, rather than .doc. I’ve found that, while the .doc file may look just fine, it is very difficult to do additional formatting, because the styles applied by Omnipage (and I assume also ABBYY FineReader – I think this is a fundamental problem) are very complex and also somewhat haphazard, changing to fit the local context.

    • #1434404

      I also think Lincoln is wrong about “Window’s own search tool” not finding words inside PDFs.
      However, getting it working may require installing the “Adobe PDF iFilter”.
      I think the latest is version 11, and there are 32-bit and 64-bit variants, but a Google search will reveal all.
