exporting pdf to text (accessible)

Question

Level 1

29 points

I regularly export a PDF to 'text (accessible)', but many of the words are lacking spaces between them.

Is there a way to avoid this? For example: "Wegive thanks for those throughout the worldengaged in works of justice, mercy and peace;

and for those blessings we now name, eithersilently or aloud."

Mac mini, macOS 13.3

Posted on Aug 17, 2023 3:12 AM

Answer 1

Top-ranking reply

VikingOSX

Level 10

122,988 points

Aug 18, 2023 9:47 AM in response to rlcpd

I have tested Adobe Acrobat Reader DC (v.2023.003.20269) on macOS Monterey 12.6.8, and Ventura 13.5. This is the current version as of 2023-08-18. The identical PDF files were used on both instances of macOS.

What I discovered on both platforms is that the Adobe product may generate an empty text file, whether the PDF originated in Pages, LibreOffice Writer, or TeXShop. It may split the text into one word per line, or it may extract the words correctly but randomly concatenate some of them. This is a bug for Adobe to fix.

Worse, on macOS Monterey, saving one PDF as text produced concatenated words mixed with individual words, each followed by a trailing space and a carriage return (^M). Although one can remove the carriage returns, it does raise the question of why Adobe is injecting carriage returns on a UNIX machine where linefeeds are the norm. This particular PDF was generated by TeXShop and appeared normally in Acrobat Reader and Preview.
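For a text file that already contains those stray carriage returns, a few lines of Swift can normalize the line endings. This is only a minimal sketch: the helper name `normalizeLineEndings` is hypothetical (it is not part of the script in this reply), and it assumes the text has already been read into a String.

```swift
import Foundation

// Hypothetical helper, not part of the pdf2text script in this thread:
// convert CRLF and lone CR line endings to UNIX linefeeds.
func normalizeLineEndings(_ text: String) -> String {
    // Replace CRLF first so the lone-CR pass does not double the newlines.
    return text
        .replacingOccurrences(of: "\r\n", with: "\n")
        .replacingOccurrences(of: "\r", with: "\n")
}

// Made-up sample text with both kinds of carriage return.
let sample = "We give \r\nthanks \rfor those"
let cleaned = normalizeLineEndings(sample)
print(cleaned)
```

Note that this leaves the trailing space before each line break in place; it only addresses the carriage returns, not the concatenated words.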

So I got fed up with this nonsense and wrote a brief Swift script that generates a correct text file regardless of the PDF it ingests, with none of the Adobe misdeeds. It worked perfectly, with no concatenated words, on all of the PDFs I had tested against Adobe's product. It can process multiple PDFs given on the command line into their text file equivalents.

#!/usr/bin/swift

/*
Script to read one or more PDFs provided on the command line and extract
their text to file[s] in the same location with the ".txt" extension. Works
on PDF documents correctly where Adobe Acrobat Reader DC mangles the result.

Tested: Ventura 13.5 (Swift 5.8.1), Monterey 12.6.8 (Swift 5.7.2)
Compiled: swiftc -Osize -o pdf2text pdf2text.swift -framework Foundation -framework AppKit -framework PDFKit

Usage: ./pdf2text.swift ~/Desktop/foo.pdf ~/Desktop/bar.pdf

Author: VikingOSX, 2023-08-18, Apple Support Communities, No warranties of any kind.
*/

import Foundation
import AppKit
import PDFKit

func readPDF(urlpath: URL) -> String? {
    // PDFDocument(url:) returns nil when the file cannot be opened as a PDF,
    // and .string is optional, so avoid force-unwrapping either one
    guard let pdf = PDFDocument(url: urlpath) else { return nil }
    return pdf.string
}

let fileManager = FileManager.default
let inputArgs: [URL] = CommandLine.arguments.dropFirst().map { URL(fileURLWithPath: $0).absoluteURL }

inputArgs.forEach { elem in

    guard fileManager.fileExists(atPath: elem.path) else {
        let notFound = NSString.init(string:elem.path).abbreviatingWithTildeInPath
        print("Error: \(notFound): File not found.")
        return  // the .forEach equivalent of continue
    }
    // print("File: \(elem.path)")
    // write the outfile text contents to the same location as the PDF
    let outfile = elem.deletingPathExtension().appendingPathExtension("txt")
    guard let text = readPDF(urlpath: elem) else {
        print("Error: \(elem.path): unable to extract text from PDF.")
        return  // skip this file and move on to the next one
    }

    do {
        try text.write(to: outfile, atomically: true, encoding: String.Encoding.utf8)
    } catch {
        print("Error: unable to write text file.")
        return
    }
    let tildeFile = NSString.init(string: outfile.path).abbreviatingWithTildeInPath
    print("Written: \(tildeFile)")
}
exit(EXIT_SUCCESS)
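To see where the text file ends up, the naming step can be tried on its own. The sketch below uses the same two URL calls as the script; the function name `textURL(for:)` and the path `/Users/me/Desktop/foo.pdf` are made up for illustration and do not appear in the script.

```swift
import Foundation

// Hypothetical wrapper around the same URL calls the script uses
// to name its output file: swap the last extension for "txt".
func textURL(for pdfURL: URL) -> URL {
    return pdfURL.deletingPathExtension().appendingPathExtension("txt")
}

// Example path only; no file needs to exist for the URL arithmetic.
let example = URL(fileURLWithPath: "/Users/me/Desktop/foo.pdf")
print(textURL(for: example).path)   // /Users/me/Desktop/foo.txt
```

Because only the extension changes, the text file lands in the same folder as the PDF, and an existing file of the same name there would be overwritten.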

If Swift is not installed, one can install the Apple Command Line Tools for Xcode (~3 GB), and swift and swiftc will then be in /usr/bin, which is already in your Terminal PATH.

Launch the Terminal application and, at the Terminal prompt, enter each of the following commands (not the # comment lines), pressing the return key after each one:

# make the swift script that you saved executable
chmod +x ./pdf2text.swift 

# now install the Xcode command line tools
xcode-select --install

That will not install Xcode itself; it prompts with a series of installer dialogs, as shown in this article. When this is done, you can invoke pdf2text.swift from the Terminal and specify one or more PDFs from which you want to extract text, as shown in the script comments.

Answer 2

dialabrain

Level 10

135,287 points

Aug 17, 2023 3:38 AM in response to rlcpd

Not seeing this problem with Adobe Acrobat Reader in Monterey and above. Make sure macOS and your app are up to date. Your signature indicates you are running Ventura 13.3. The current version is 13.5.

Answer 3

VikingOSX

Level 10

122,988 points

Aug 18, 2023 11:39 AM in response to dialabrain

I took the same content in that PDF image from my prior post and saved it as a PDF from Pages v13.1. Before I did that, I disabled hyphenation and ligatures. Using that exported PDF as input to my Swift application produced text file output identical to what you see above. Acrobat Reader DC generated an empty text file.

Answer 4

dialabrain

Level 10

135,287 points

Aug 18, 2023 11:26 AM in response to VikingOSX

The only PDFs I could find that caused a problem were those created by Pages.

Answer 5

dialabrain

Level 10

135,287 points

Aug 18, 2023 2:44 AM in response to rlcpd

rlcpd wrote:

I thought I was still Monterey as I have not opted to upgrade to the Venturas....

That's possible. The signature could be wrong. You can check in About This Mac under the Apple menu. The current version of Monterey is 12.6.8.

however: I am glad you do not see this problem. But I do! Just another dumb Mac glitch, I guess.

Which version of Adobe Acrobat or Acrobat Reader are you using? The current version of Reader is 23.003.20269

Answer 6

rlcpd Author

Level 1

29 points

Aug 17, 2023 11:20 PM in response to dialabrain

I thought I was still Monterey as I have not opted to upgrade to the Venturas....however: I am glad you do not see this problem. But I do! Just another dumb Mac glitch, I guess.

Answer 7

dialabrain

Level 10

135,287 points

Oct 29, 2025 10:07 AM in response to VikingOSX

Strange. I tried about 5 different PDF files and had no such issues. Just tried another in Monterey.

[Edited by Moderator]

Answer 8

dialabrain

Level 10

135,287 points

Aug 18, 2023 10:26 AM in response to dialabrain

Another test

Answer 9

dialabrain

Level 10

135,287 points

Aug 18, 2023 10:47 AM in response to dialabrain

Lastly, if I create a PDF with Pages, I do get a blank page when exporting it as text with Reader. However, if I create a PDF in LibreOffice and then export it as text from Reader, it exports normally.
