I wrote a little function that uses Microsoft Office Document Imaging (MODI) to extract text from images with OCR.
I’ve put a few notes inline in the script and dummy-proofed it somewhat, but YMMV! Below the snippet I’ll show an example comparing how well this technique recognizes several fonts at 12pt.
    Function Get-TextFromImage {
        param (
            [Parameter(Mandatory=$true)][string]$File
        )
        # Requires a 32-bit PowerShell session
        # Requires an Office 2003 or 2007 installation

        Try {
            # -ErrorAction Stop turns a missing file into an error that Catch can see
            $FileObj = Get-Item $File -ErrorAction Stop
            $FilePath = $FileObj.FullName
            # Compare the extension against each known-good value
            If (@(".jpg", ".jpeg", ".bmp") -notcontains $FileObj.Extension) {
                Write-Warning "$File does not have a known working extension, trying anyway..."
            }
        }
        Catch {
            # ${File} stops the trailing colon from being parsed as a scope qualifier
            Write-Error "Error retrieving ${File}:"
            Write-Error $_
            return  # return rather than Exit, which would close the whole session
        }

        Try {
            $MODIObj = New-Object -ComObject MODI.Document
        }
        Catch {
            Write-Error "Could not load MODI COM object. Make sure you are running a 32-bit PowerShell session and have MS Office 2003/2007 installed."
            Write-Error "If you have MS Office 2010, try this: http://support.microsoft.com/kb/982760"
            Write-Error $_
            return
        }

        Try {
            $MODIObj.Create($FilePath)
            $MODIObj.OCR()
            $Output = $MODIObj.Images.Item(0).Layout.Text.ToString().Trim()
            # Emit the text and also save a copy next to the image
            Write-Output $Output | Tee-Object -FilePath ($FilePath + ".ocr.txt")
        }
        Catch {
            Write-Error "Failed to process file:"
            Write-Error $_
            return
        }
    }
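Since a few people have asked how to run this, here is a minimal usage sketch. It assumes you saved the function above in a script named `Get-TextFromImage.ps1` and that there is an image named `test.jpg` in the current directory; both file names are placeholders I made up for illustration.

```powershell
# MODI is a 32-bit COM component, so confirm the session is 32-bit first.
$is32Bit = ([IntPtr]::Size -eq 4)
if (-not $is32Bit) {
    Write-Warning "64-bit session detected; start $env:windir\SysWOW64\WindowsPowerShell\v1.0\powershell.exe instead."
}

# Hypothetical file names: adjust both to match your own setup.
$scriptPath = '.\Get-TextFromImage.ps1'
$imagePath  = '.\test.jpg'

if ($is32Bit -and (Test-Path $scriptPath) -and (Test-Path $imagePath)) {
    . $scriptPath                               # dot-source so the function is defined in this session
    $text = Get-TextFromImage -File $imagePath  # runs the OCR
    $text                                       # recognized text; a copy is written to test.jpg.ocr.txt
}
```

If the session is 64-bit, the sketch only warns and skips the OCR call, because the MODI COM object is not expected to load there.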
Here’s an example:
Image | Get-TextFromImage Output |
---|---|
![]() | Windows Powershell NODI OCR Test Image l2pt COURIER NEW ABCDEFGHIJKLMNOPORSTUVLJXYZ abcdefghijklmnopqrstuvwxyz 01234567890 12 pt TAHOMA 12 pt TERMINAL 12 pt VERDANA 12 pt CONSOLAS 12 pt Times New Roman ] pt OCR?A Extended ABCDEFGHI JKL1NOPQRSTUVLdXYZ |
The OCR-specific font failed miserably. Funny, huh?
It appears that at 12pt in a JPG, Times New Roman is the best candidate for OCR using MODI via PowerShell if you want accurate results!
Relevant links:
http://stackoverflow.com/questions/316068/what-is-the-ideal-font-for-ocr
http://cerealnumber.livejournal.com/47638.html
http://stackoverflow.com/questions/9277571/how-can-i-retrieve-modi-reference-from-com-in-my-application
Howdy Rex!
I didn’t realize that you could do this – that is cool. I was looking at your script and you might consider making a change. The PowerShell convention for passing in files is the -PATH parameter. We also have a number of [VALIDATE…] attributes which do the work for you; you get standardized error messages, and we’ll translate them when your script is running in other countries. Here is my suggestion:
    [Parameter(Mandatory=$true)]
    [ValidateScript({Test-Path $_})]
    [ValidatePattern("\.jpg$|\.jpeg$|\.bmp$")]
    [string]$Path
Give it a try and see if you like it.
Jeffrey Snover [MSFT]
Distinguished Engineer and Lead Architect for Windows Server and System Center Datacenter
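To see how those validation attributes behave, here is a small sketch. The function name `Resolve-ImagePath` and the file name `notes.txt` are made up for illustration; only the attributes themselves come from the suggestion above.

```powershell
function Resolve-ImagePath {   # hypothetical helper, not part of the original script
    param (
        [Parameter(Mandatory=$true)]
        [ValidateScript({ Test-Path $_ })]            # the path must exist
        [ValidatePattern('\.jpg$|\.jpeg$|\.bmp$')]    # the extension must be a known-good one
        [string]$Path
    )
    # Only reached when both validations pass
    (Get-Item $Path).FullName
}

# A path that fails validation never reaches the function body;
# PowerShell raises a standard parameter-binding error instead.
try {
    Resolve-ImagePath -Path '.\notes.txt'
}
catch {
    Write-Warning "Rejected: $($_.Exception.Message)"
}
```

One trade-off to keep in mind: the original function only warned about unknown extensions and tried anyway, whereas `ValidatePattern` rejects them outright. The pattern match is case-insensitive by default, so `.JPG` passes too.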
Hi Jeffrey,
That’s a very valid suggestion! I’ll update the post in the next day or two. I suppose I ought to production-ize scripts/snippets I publicize, huh? 😛
I’ll be posting more interesting stuff on /r/PowerShell – keep an eye out! I have a small backlog of nifty things I’ve learned/encountered and have been meaning to blog about.
Thanks!
Rex
How do you use this? I would love to try this!
I have TIFs in Portuguese that I want to OCR. How can I change the language?
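On the language question, a possible approach is to pass a language ID to MODI's `OCR` method, which as far as I know accepts a language ID plus two booleans (auto-orient and straighten) instead of being called with no arguments. The sketch below is untested and rests on assumptions: the function name is made up, the value 22 for Portuguese (`miLANG_PORTUGUESE` in the `MiLANGUAGES` enumeration) is from memory and should be checked against the MODI documentation, and the matching OCR language support probably has to be installed with Office.

```powershell
function Get-TextFromImageInLanguage {   # hypothetical name
    param (
        [Parameter(Mandatory=$true)][string]$Path,
        [int]$LanguageId = 22            # assumed MiLANGUAGES value for Portuguese; verify before relying on it
    )
    $doc = $null
    try {
        $fullPath = (Get-Item $Path -ErrorAction Stop).FullName
        $doc = New-Object -ComObject MODI.Document
        $doc.Create($fullPath)
        # Arguments: language ID, auto-orient the image, straighten the image
        $doc.OCR($LanguageId, $true, $true)
        $doc.Images.Item(0).Layout.Text.Trim()
    }
    finally {
        # Close without saving changes back to the image file
        if ($doc) { $doc.Close($false) }
    }
}

# Example call (hypothetical path):
# Get-TextFromImageInLanguage -Path 'C:\scans\page1.tif'
```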
Thanks a lot, I wanted something like that. It could be perfect, but sometimes it doesn’t work because of colors.
Is it possible to customize the precision and read picture text in color?
I’m running Windows 10 with Office 2007. Tried running this with 32-bit PowerShell but still got errors. Any ideas?
Sorry, meant to include some output. This was from the ISE (with exits commented out) but I also tried from the console.
    Get-TextFromImage : Could not load MODI Com Object. Make sure you are running a 32bit Powershell sessio and have MS Office 2003/2007 installed.
    At line:41 char:1
    + Get-TextFromImage
    + ~~~~~~~~~~~~~~~~~
    + CategoryInfo : NotSpecified: (:) [Write-Error], WriteErrorException
    + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Get-TextFromImage

    Get-TextFromImage : If you have MS Office 2010, try this: http://support.microsoft.com/kb/982760
    At line:41 char:1
    + Get-TextFromImage
    + ~~~~~~~~~~~~~~~~~
    + CategoryInfo : NotSpecified: (:) [Write-Error], WriteErrorException
    + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Get-TextFromImage

    Get-TextFromImage : Retrieving the COM class factory for component with CLSID {00000000-0000-0000-0000-000000000000} failed due to the following error: 80040154 Class not registered (Exception from HRESULT: 0x80040154 (REGDB_E_CLASSNOTREG)).
    At line:41 char:1
    + Get-TextFromImage
    + ~~~~~~~~~~~~~~~~~
    + CategoryInfo : NotSpecified: (:) [Write-Error], WriteErrorException
    + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Get-TextFromImage

    Get-TextFromImage : Failed to process file:
    At line:41 char:1
    + Get-TextFromImage
    + ~~~~~~~~~~~~~~~~~
    + CategoryInfo : NotSpecified: (:) [Write-Error], WriteErrorException
    + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Get-TextFromImage

    Get-TextFromImage : You cannot call a method on a null-valued expression.
    At line:41 char:1
    + Get-TextFromImage
    + ~~~~~~~~~~~~~~~~~
    + CategoryInfo : NotSpecified: (:) [Write-Error], WriteErrorException
    + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Get-TextFromImage
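The "Class not registered" (REGDB_E_CLASSNOTREG) error means the `MODI.Document` ProgID could not be resolved in that session. A quick diagnostic sketch, assuming a Windows machine, is to check the session bitness and whether the ProgID is present in the registry view that session sees. If it is missing even in a 32-bit session, one likely cause is that the Document Imaging component was not selected when Office was installed, though I have not verified that for every Office edition.

```powershell
# 4-byte pointers mean a 32-bit process; MODI needs a 32-bit session.
$is32Bit = ([IntPtr]::Size -eq 4)

# COM resolves 'MODI.Document' through this key; Test-Path reports whether it exists
# in the registry view of the current (32- or 64-bit) session.
$modiRegistered = Test-Path 'Registry::HKEY_CLASSES_ROOT\MODI.Document'

"32-bit session: $is32Bit"
"MODI.Document ProgID registered: $modiRegistered"
```

If the second line reports False in a 32-bit session, `New-Object -ComObject MODI.Document` is expected to fail with the same error shown above.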
Please give an example about how to run the command