
Tesseract has a very irritating problem with a capital "I" when it is the only character in the word. Yes, plenty of other things can make recognition worse, but I've run it against PDF documents scanned at 300 and 600 dpi, and I made sure I had all the latest tesseract libraries. Tesseract works great and recognizes everything that isn't actually smudged or distorted, but it never handles a capital "I" reliably. It gets some right. But if the "I" is preceded or followed by a non-alphabetic character, such as a quotation mark in front of it or the apostrophe in "I've", it interprets the I as, variously, l, L, 1, ], or T.

I think that list makes the problem obvious: all of those characters are an "I" with a bit of other stuff added, but not stuff that turns it into another character. That tolerance is what you normally want. If you have an R with some curvy flourishes on the sharp points, you want it recognized as an R, and that is exactly what tesseract is doing when it renders I as 1. The T is the exception. Maybe it's OK for a character to be missing bits but not to have bits added, with respect to the "doesn't make it another letter" comment I made just now.

My workaround is to have vi find (start of line, or space)(T)(space) and replace the T with I. The parentheses are only there to show what goes together; they are not typed as part of the vi command. Then I repeat that for each of the other characters listed above.

It would be nice to have a real solution for that, of course. Any bright ideas? Human pattern recognition is still pretty hard to beat, even with the best AI, so I think this is asking a bit much from tesseract. Hope this helps someone who was as frustrated with it as I was before I realized that it actually was working, just not doing what I needed done.
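For anyone who would rather script the substitution than repeat it in vi, here is a rough Python sketch of the same idea. It is not what I actually ran, and the function name `fix_standalone_i` is made up for illustration. It assumes the misread character stands alone: it comes after the start of the text, whitespace, a double quote, or an opening parenthesis, and is followed by whitespace or an apostrophe (to catch "I've"). The quote, parenthesis, and apostrophe cases go beyond the plain space-delimited vi version.

```python
import re

# Characters the post reports tesseract producing in place of a lone "I".
MISREADS = "lL1]T"


def fix_standalone_i(text, misreads=MISREADS):
    """Replace a one-character 'word' made of a known misread with 'I'.

    Hypothetical helper. Assumed boundaries: the character must not be
    preceded by anything other than whitespace, '"' or '(' and must be
    followed by whitespace or an apostrophe.
    """
    pattern = re.compile(
        r'(?<![^\s"(])'                 # start of text, whitespace, '"' or '(' before
        r"[" + re.escape(misreads) + r"]"  # exactly one misread character
        r"(?=\s|')"                     # whitespace or apostrophe after
    )
    return pattern.sub("I", text)


if __name__ == "__main__":
    sample = "T think l've had enough, and ] am done."
    print(fix_standalone_i(sample))
```

Be aware that this will also turn a genuine standalone 1 (as in "page 1 of 3") into an I, exactly as the vi substitution would. If your documents contain numbers, pass a narrower set such as `misreads="l]T"` and check the result by eye.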
