OT: Adobe Experts?


Cat3roadracer
01-17-2019, 11:40 AM
I am looking for a way to electronically read a PDF. Does anyone know of any applications that do this?

Example: a customer uploads a PDF, and I need to be able to use automation to search the document for specific data points.

Thanks.

Dude
01-17-2019, 12:07 PM
Do you want this to happen with no human involvement at all? Or do you mean a user uploads a PDF and a person then needs to be able to search the document?

If it's the latter, what you're looking for is OCR technology. Google Docs will let you search a PDF for word(s).

Hope that helps.

tuscanyswe
01-17-2019, 12:18 PM
I'm sure there is a cheaper way, but I use this service for similar tasks, though I don't read PDFs myself. I think I pay about $30/month for a basic account. They do, however, read data from PDFs and extract info from them, and I've been happy with the service thus far. It actually works really well for what I use it for.

https://mailparser.io/blog/read-parse-process-email-file-attachments/

OtayBW
01-17-2019, 01:31 PM
Why not get Acrobat Pro? You can convert the PDF to a *.doc and edit/search as you please. I've also used Foxit PhantomPDF, a cheaper alternative to Adobe Acrobat, that has worked well for me. Or, in whatever Acrobat version you've got, go to the menu bar and choose Edit > Find, and it will step through each occurrence of your keywords.

palincss
01-17-2019, 01:36 PM
Not all PDFs are created equal. Some are just image files with some wrapper around them; others have image plus text. If the files contain text you could run them through a PDF->text converter and search the text for patterns, but if it's just images you'd have to OCR them, and results can vary a lot. Actually, even if there's hidden text the results can vary a lot because of formatting.
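
A minimal sketch of the "search the text for patterns" step, assuming the PDF has already been converted to plain text. The data point names and regular expressions below (invoice number, date, total) are hypothetical and made up for illustration; real documents will need their own patterns, and formatting quirks in the converted text can easily break them.

```python
import re

# Hypothetical data points and patterns, invented for illustration only.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|#)\s*:?\s*(\w[\w-]*)", re.IGNORECASE
    ),
    "date": re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b"),
    "total": re.compile(r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def find_data_points(text):
    """Return the first match for each pattern, or None if it is absent."""
    results = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        results[name] = match.group(1) if match else None
    return results


# Made-up sample text standing in for converter output.
sample = "Invoice No: A-1042\nDate: 01/17/2019\nTotal: $1,250.00\n"
print(find_data_points(sample))
```

Returning None for a missing data point, rather than raising an error, makes it easy to flag documents that need a human to look at them.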

If, on the other hand, you were looking for something simple - say, wanting to know if that stuffed cabbage recipe you saved as a PDF file contains the word "toothpick" - you could type this at the command line in a Linux terminal window:

pdftotext stuffedcabbage.pdf - | fgrep toothpick

That runs the program pdftotext with the file stuffedcabbage.pdf as input, sends the output to STDOUT, and pipes it to fgrep, which searches for the character pattern "toothpick" and sends any lines containing that string to STDOUT. You'd get back this line:

mixture in each cabbage leaf. Roll up and secure with wooden toothpick.
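
If this needs to run from a program rather than by hand, the same pipeline can be driven from a script. This is a rough Python sketch, assuming pdftotext (from Poppler or Xpdf) is installed and on the PATH; the function names are my own, not part of any library.

```python
import subprocess


def pdf_to_text(pdf_path):
    """Run pdftotext and return the PDF's text layer as a string."""
    # The trailing "-" tells pdftotext to write to standard output.
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout


def lines_containing(text, needle):
    """Return the lines that contain needle as a fixed string, like fgrep."""
    return [line for line in text.splitlines() if needle in line]


# Example use (needs pdftotext installed and a real file):
# for line in lines_containing(pdf_to_text("stuffedcabbage.pdf"), "toothpick"):
#     print(line)
```

Keeping the search separate from the conversion means the matching logic can be tried on plain strings without having a PDF on hand.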

HenryA
01-17-2019, 01:59 PM
You can do OCR in Acrobat on a scanned image of a paper document. It's pretty accurate if you feed it good scans. Acrobat converts the scan to text, which is then searchable and exportable as well. Not perfect, but good. Things like invoices or forms with lines and boxes come out not so good.

There used to be (probably still is) a setup with online completion/submission of a PDF to a server app that then gathered the info and placed it into desired fields. May have been called Acrobat Server or something imaginative like that. This was from Adobe.

If the PDFs are documents saved from a word processor, then all you do is open and search. If you work on a Mac, you might be able to build and run an Automator workflow where a named folder gets run through predetermined search criteria. A Mail action moves the attachments into folder XXXXXX. The routine might look like: Go to Folder XXXXXX, FIND “refund”, SAVE in Folder YYYYYY. That is a very general, off-the-top-of-my-head, oversimplified example. I am sure other platforms make this available, but I know almost nothing about them.
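
For platforms without Automator, the folder routine above could be approximated with a short script. This is only a sketch of the idea in Python, not an Automator equivalent: the function name is hypothetical, and extract_text stands for whatever PDF-to-text step you use (a wrapper around a converter such as pdftotext, for instance).

```python
import shutil
from pathlib import Path


def file_matching_pdfs(inbox, outbox, keyword, extract_text):
    """Copy every PDF in inbox whose text contains keyword into outbox.

    extract_text is any function that takes a path and returns the
    document's text; it is passed in so the converter can be swapped.
    Returns the names of the files that matched.
    """
    inbox, outbox = Path(inbox), Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)
    matched = []
    # Note: "*.pdf" will not match "*.PDF" on case-sensitive filesystems.
    for pdf in sorted(inbox.glob("*.pdf")):
        text = extract_text(pdf)
        if keyword.lower() in text.lower():  # case-insensitive match
            shutil.copy2(pdf, outbox / pdf.name)
            matched.append(pdf.name)
    return matched
```

A call might look like file_matching_pdfs("XXXXXX", "YYYYYY", "refund", my_converter), where my_converter is your own extraction function. Copying rather than moving leaves the original attachments in place, which is the safer choice while the search criteria are still being tuned.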

And how automated is automated? Do you want to do this one at a time on your computer after you receive the pdf, or do you want a system that reads the pdf and then reports its finding for certain words or strings without your intervention?

NickR
01-17-2019, 11:43 PM
X2 on Adobe Acrobat Pro. Had it at my old job and it was very useful for work-related and personal stuff.

ariw
01-18-2019, 06:02 AM
Take a look at FineReader from ABBYY. You can configure it to automatically OCR PDF files as they come into a folder, then move them into another folder or do something else with the file. We have used this for our legal customers for years; it's not hard to set up and not too expensive.

Ari

Weneed
10-06-2020, 07:58 AM
As far as I can see, you can use this free xlsx to pdf converter: https://onlineconvertfree.com/convert-format/xlsx-to-pdf/ What does everyone think of this idea? I came to the conclusion that it's the only right decision.

tuscanyswe
10-06-2020, 08:02 AM
That only saves the file as a PDF, right? It does not search for or extract data automatically. If you just want an xlsx file saved as a PDF, I think you can just use the "save as pdf" option; at least that's available in Google Docs.

paredown
10-06-2020, 08:29 AM
As others have said, there are two kinds: a plain *.pdf, or a searchable *.pdf that has been run through an OCR program.

If they send you the former, with Acrobat Pro you can open it and convert it with the internal OCR program to make it searchable.

I'm doing a bunch of scanning now, and even the cheapo HP OfficeJet we have (with a document feeder) has the Adobe print driver built in, so you can click 'searchable *.pdf' as the file type and it will scan, run the included OCR program, and save the result as a searchable *.pdf. Although it is less reliable than the version of Acrobat Pro I have (v. 8), it is faster because it feeds the sheets consecutively; Acrobat (as I have it) makes you click 'next page' to continue with a document. (There are 'batch process' capabilities that I could not get to work reliably, though.)

Other than going file by file, I have never searched for key words, although Pro does have ways of indexing documents.

If you want a usable version for less than new retail, Acrobat X Pro has most of the bells and whistles, and will run reliably on Windows 10. You can pick it up on eBay with the usual proviso--watch out for the idiots selling counterfeit software and purchase the 'full retail box' only IMO.

benb
10-06-2020, 09:42 AM
You said automation so I'd assume you're trying to automate this as part of some sort of software you're building or cobbling together.

I'd look into the various open source software libraries. There are many.

There have long been libraries out there for software development that let you work with PDF without paying anything to Adobe.

Practically every piece of enterprise software I've worked on in my career has required either generating or reading/parsing PDF at some point and I don't think any piece of software I worked on has ever used any Adobe software.

It is going to be a lot harder if the PDFs are scans than if they're a form you generated that the customer has filled in with PDF software though.

If it's a scan, the PDF is often just a wrapper around image data and the OCR stuff hasn't been done for you. But there are Free/Open Source OCR libraries out there as well.
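
One way an automated pipeline might tell the two cases apart is to try text extraction first and fall back to OCR only when almost nothing comes back. The sketch below is in Python and assumes the open source OCRmyPDF command-line tool (which builds on Tesseract) is installed; the function names and the 20-character threshold are arbitrary choices for illustration, not anything from a library.

```python
import subprocess


def has_text_layer(extracted_text, min_chars=20):
    """Guess whether a PDF had real text, based on what a converter returned.

    Whitespace (including the form feeds some converters emit between
    pages) is ignored. min_chars is an arbitrary threshold for this sketch.
    """
    return len("".join(extracted_text.split())) >= min_chars


def ocr_pdf(src, dst):
    """Add a text layer to a scanned PDF by calling the ocrmypdf CLI."""
    # Assumes ocrmypdf is installed; raises CalledProcessError on failure.
    subprocess.run(["ocrmypdf", str(src), str(dst)], check=True)
```

A scan that yields little or no text would be passed through ocr_pdf and then converted to text again before searching. OCR quality still depends heavily on the scan, so results from this path deserve more checking than text from a generated form.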