OCR with PowerShell

I wrote a little function that uses Microsoft Office Document Imaging (MODI) to retrieve text from images with OCR.

I have put a few notes in-line in the script and have dummy-proofed it somewhat, but YMMV! Below the snippet I’ll show an example that compares how well this technique recognizes several fonts at 12 pt.
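
For reference, here is a minimal sketch of what a MODI-based function along these lines could look like. It is not the original script. It assumes the `MODI.Document` COM class is registered (Office 2003/2007) and that the session is 32-bit. The `-Path` parameter and its validation attributes follow the suggestion in the comments below, and the language ID 9 (`miLANG_ENGLISH`) and the file name in the usage comment are assumptions for illustration.

```powershell
function Get-TextFromImage {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $true)]
        [ValidateScript({ Test-Path -LiteralPath $_ -PathType Leaf })]
        [ValidatePattern('\.(jpg|jpeg|bmp)$')]
        [string]$Path
    )

    # Assumption: MODI is only available as a 32-bit COM component,
    # so stop early when running in a 64-bit session.
    if ([IntPtr]::Size -ne 4) {
        throw 'Get-TextFromImage needs a 32-bit PowerShell session.'
    }

    $document = $null
    try {
        $document = New-Object -ComObject MODI.Document
        # MODI wants a full file system path, not a relative one.
        $document.Create((Resolve-Path -LiteralPath $Path).ProviderPath)
        # 9 = miLANG_ENGLISH (assumed); the booleans turn on auto-orient and straighten.
        $document.OCR(9, $true, $true)
        # Emit the recognized text of every page in the document.
        for ($i = 0; $i -lt $document.Images.Count; $i++) {
            $document.Images.Item($i).Layout.Text
        }
    }
    finally {
        if ($document) {
            $document.Close($false)   # do not save changes back to the image
            [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($document)
        }
    }
}

# Hypothetical usage (file name is an example only):
# Get-TextFromImage -Path .\ocr-test.jpg
```

The validation attributes reject a missing file or an unsupported extension before any COM object is created, so the body only has to deal with MODI itself.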

Here’s an example:

Image: OCR Test Image

Get-TextFromImage output:

Windows Powershell NODI OCR Test Image
l2pt COURIER NEW ABCDEFGHIJKLMNOPORSTUVLJXYZ
abcdefghijklmnopqrstuvwxyz
01234567890 

12 pt TAHOMA
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789

12 pt TERMINAL
ABCDEFCH I JICLMNOPQRSTUUIIXYZ
abc de f gh ii Ic inn o pqrs t tw wxyz
0123456789

12 pt VERDANA
ABCD EFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789

12 pt CONSOLAS
ABCDE FCHIJ KL KNOPQRSTUVHXYZ
abcdefghij kirnnopqrst uvwxyz
0123456789

12 pt Times New Roman
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklnmopqrstuvwxyz
0123456739

] pt OCR?A Extended ABCDEFGHI JKL1NOPQRSTUVLdXYZ
abcde fghij kimnopqrstuvwxy z
O]3456789

The OCR-specific font failed miserably. Funny, huh?

It appears that for 12 pt text in a JPG, Times New Roman is the best candidate if you want accurate OCR results from MODI via PowerShell!
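
One way to put a number on comparisons like these is the edit distance between the text in the image and the text the OCR returned. The sketch below is a plain Levenshtein implementation. `Get-EditDistance` is a hypothetical helper name, not part of the original script, and the expected string assumes the test image contained the full uppercase alphabet.

```powershell
function Get-EditDistance {
    param(
        [string]$Expected,
        [string]$Actual
    )

    # Two-row Levenshtein distance: $previous holds the costs for the
    # prefix of $Expected processed so far, $current the row being built.
    $previous = @(0..$Actual.Length)
    for ($i = 1; $i -le $Expected.Length; $i++) {
        $current = New-Object 'int[]' ($Actual.Length + 1)
        $current[0] = $i
        for ($j = 1; $j -le $Actual.Length; $j++) {
            # Case-sensitive comparison, since OCR case errors should count.
            $cost = if ($Expected[$i - 1] -ceq $Actual[$j - 1]) { 0 } else { 1 }
            $deleteOrInsert = [Math]::Min($previous[$j] + 1, $current[$j - 1] + 1)
            $current[$j] = [Math]::Min($deleteOrInsert, $previous[$j - 1] + $cost)
        }
        $previous = $current
    }
    $previous[$Actual.Length]
}

# Uppercase line of the Courier New result from the output above.
$expected = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
$ocr      = 'ABCDEFGHIJKLMNOPORSTUVLJXYZ'
Get-EditDistance -Expected $expected -Actual $ocr   # 3
```

A distance of 0 means the line was read perfectly. Here the three edits are Q read as O, plus W read as the two characters LJ.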

Relevant links:
http://stackoverflow.com/questions/316068/what-is-the-ideal-font-for-ocr
http://cerealnumber.livejournal.com/47638.html
http://stackoverflow.com/questions/9277571/how-can-i-retrieve-modi-reference-from-com-in-my-application


7 thoughts on “OCR with Powershell”

  1. Jeffrey Snover[MSFT]

    Howdy Rex!

    I didn’t realize that you could do this – that is cool. I was looking at your script and you might consider making a change. The PowerShell convention for passing in files is the -Path parameter. We also have a number of [VALIDATE…] attributes which do the work for you, and then you get standardized error messages that we’ll translate when your script is running in other countries. Here is my suggestion:

    [Parameter(Mandatory=$true)]
    [ValidateScript({Test-Path $_})]
    [ValidatePattern('\.jpg$|\.jpeg$|\.bmp$')]
    [string]$Path

    Give it a try and see if you like it.

    Jeffrey Snover [MSFT]
    Distinguished Engineer and Lead Architect for Windows Server and System Center Datacenter

    1. Rex Hardin Post author

      Hi Jeffrey –

      That’s a very valid suggestion! I’ll update the post in the next day or two. I suppose I ought to production-ize scripts/snippets I publicize, huh? 😛

      I’ll be posting more interesting stuff on /r/PowerShell – keep an eye out! I have a small backlog of nifty things I’ve learned/encountered and have been meaning to blog about.

      Thanks!
      Rex

  2. tostaky

    Thanks a lot, I wanted something like this. It could be perfect, but sometimes it doesn’t work because of colors.
    Is it possible to customize the precision so that it can read text in color pictures?

  3. Alex

    I’m running Windows 10 with Office 2007. I tried running this with 32-bit PowerShell but still got errors. Any ideas?

    1. Alex

      Sorry, meant to include some output. This was from the ISE (with exits commented out) but I also tried from the console.

      Get-TextFromImage : Could not load MODI Com Object. Make sure you are running a 32bit Powershell sessio and have MS Office 2003/2007 installed.
      At line:41 char:1
      + Get-TextFromImage
      + ~~~~~~~~~~~~~~~~~
      + CategoryInfo : NotSpecified: (:) [Write-Error], WriteErrorException
      + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Get-TextFromImage

      Get-TextFromImage : If you have MS Office 2010, try this:
      http://support.microsoft.com/kb/982760
      At line:41 char:1
      + Get-TextFromImage
      + ~~~~~~~~~~~~~~~~~
      + CategoryInfo : NotSpecified: (:) [Write-Error], WriteErrorException
      + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Get-TextFromImage

      Get-TextFromImage : Retrieving the COM class factory for component with CLSID {00000000-0000-0000-0000-000000000000} failed due to the following error: 80040154 Class not registered (Exception from HRESULT: 0x80040154 (REGDB_E_CLASSNOTREG)).
      At line:41 char:1
      + Get-TextFromImage
      + ~~~~~~~~~~~~~~~~~
      + CategoryInfo : NotSpecified: (:) [Write-Error], WriteErrorException
      + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Get-TextFromImage

      Get-TextFromImage : Failed to process file:
      At line:41 char:1
      + Get-TextFromImage
      + ~~~~~~~~~~~~~~~~~
      + CategoryInfo : NotSpecified: (:) [Write-Error], WriteErrorException
      + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Get-TextFromImage

      Get-TextFromImage : You cannot call a method on a null-valued expression.
      At line:41 char:1
      + Get-TextFromImage
      + ~~~~~~~~~~~~~~~~~
      + CategoryInfo : NotSpecified: (:) [Write-Error], WriteErrorException
      + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Get-TextFromImage

