exporting pdf to text (accessible)

I regularly export a PDF to "Text (Accessible)", but many of the words are missing the spaces between them.

Is there a way to avoid this? For example: "Wegive thanks for those throughout the worldengaged in works of justice, mercy and peace;

and for those blessings we now name, eithersilently or aloud."


Mac mini, macOS 13.3

Posted on Aug 17, 2023 3:12 AM

Question marked as Top-ranking reply

Posted on Aug 18, 2023 9:47 AM

I have tested Adobe Acrobat Reader DC (v.2023.003.20269) on macOS Monterey 12.6.8, and Ventura 13.5. This is the current version as of 2023-08-18. The identical PDF files were used on both instances of macOS.


What I discovered on both platforms is that the Adobe product may generate an empty text file, whether the PDF originated in Pages, LibreOffice Writer, or TeXShop. It may split the text into one word per line, or it may extract the words correctly but randomly concatenate some of them. This is a bug for Adobe to fix.


Worse, on macOS Monterey, one PDF saved as text came out as concatenated words mixed with individual words, each followed by a trailing space and a carriage return (^M). Although one can remove the carriage returns, it does raise the question of why Adobe is injecting carriage returns on a UNIX machine where linefeeds are the norm. This particular PDF was generated by TeXShop and appeared normally in Acrobat Reader and Preview.
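For anyone who wants to clean up such a file after the fact, here is a minimal sketch of removing the carriage returns in Terminal. The function name strip_cr and the file names are made up for illustration; it only deletes the ^M characters and does nothing about the concatenated words.

```shell
#!/bin/sh
# Hypothetical helper: print a text file with every carriage return (^M) removed.
# The linefeeds are left alone, so the line structure stays the same.
strip_cr() {
    tr -d '\r' < "$1"
}
```

It could be used as: strip_cr exported.txt > cleaned.txt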


So I got fed up with this nonsense and wrote a brief Swift script that generated a correct text file from every PDF I had tested against Adobe's product, with no concatenated words and none of the Adobe misdeeds. It can process multiple PDFs on the command line into their text file equivalents.


#!/usr/bin/swift

/*
Script to read one or more PDFs provided on the command line and extract
their text to file[s] in the same location with the ".txt" extension. Works
on PDF documents correctly where Adobe Acrobat Reader DC mangles the result.

Tested: Ventura 13.5 (Swift 5.8.1), Monterey 12.6.8 (Swift 5.7.2)
Compiled: swiftc -Osize -o pdf2text pdf2text.swift -framework Foundation -framework AppKit -framework PDFKit

Usage: ./pdf2text.swift ~/Desktop/foo.pdf ~/Desktop/bar.pdf

Author: VikingOSX, 2023-08-18, Apple Support Communities, No warranties of any kind.
*/

import Foundation
import AppKit
import PDFKit

// Return the document's text, or nil if the file cannot be opened as a PDF
// or PDFKit returns no text for it. Avoids crashing on a force-unwrap.
func readPDF(urlpath: URL) -> String? {
    guard let pdf = PDFDocument(url: urlpath) else { return nil }
    return pdf.string
}

// Shorten a path under the home directory to its ~ form for messages.
func tildePath(_ url: URL) -> String {
    return NSString(string: url.path).abbreviatingWithTildeInPath
}

let fileManager = FileManager.default
let inputArgs = CommandLine.arguments.dropFirst().map { URL(fileURLWithPath: $0).absoluteURL }

inputArgs.forEach { elem in

    guard fileManager.fileExists(atPath: elem.path) else {
        print("Error: \(tildePath(elem)): File not found.")
        return  // the .forEach equivalent of continue
    }
    guard let text = readPDF(urlpath: elem) else {
        print("Error: \(tildePath(elem)): Not a readable PDF, or no text found.")
        return
    }
    // write the outfile text contents to the same location as the PDF
    let outfile = elem.deletingPathExtension().appendingPathExtension("txt")

    do {
        try text.write(to: outfile, atomically: true, encoding: .utf8)
    } catch {
        print("Error: unable to write text file: \(error.localizedDescription)")
        return
    }
    print("Written: \(tildePath(outfile))")
}
exit(EXIT_SUCCESS)



If Swift is not installed, one can install the Apple Command Line Tools for Xcode (~3 GB), after which swift and swiftc will be in /usr/bin, which is already in your Terminal PATH.
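One possible way to check whether the Command Line Tools are already present is sketched here. It assumes that xcode-select -p, which prints the active developer directory, exits with a non-zero status when nothing is installed. The function name clt_status is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report whether a developer directory is configured.
# Prints "installed" if xcode-select exists and -p succeeds, else "missing".
clt_status() {
    if command -v xcode-select >/dev/null 2>&1 && xcode-select -p >/dev/null 2>&1; then
        echo "installed"
    else
        echo "missing"
    fi
}

clt_status
```

If this prints "missing", the xcode-select --install step is still needed.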


Launch the Terminal application and at the Terminal prompt, enter the following (not the # lines) and then press the return key after each entry:


# make the swift script that you saved executable
chmod +x ./pdf2text.swift 

# now install the Xcode command line tools
xcode-select --install


This does not install Xcode itself; it prompts with a series of installer dialogs. When it finishes, you can invoke pdf2text.swift from the Terminal and specify one or more PDFs whose text you want to extract, as shown in the script comments.
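Since the script accepts several paths at once, a whole folder can be converted in one call. The following is only a sketch of a convenience wrapper: the function name convert_all and the PDF2TEXT variable are hypothetical, and PDF2TEXT is assumed to point at wherever you saved the script.

```shell
#!/bin/sh
# Assumed location of the saved script; override by setting PDF2TEXT first.
PDF2TEXT="${PDF2TEXT:-./pdf2text.swift}"

# convert_all DIR: pass every *.pdf in DIR to the converter in a single call.
convert_all() {
    dir="$1"
    # Replace this function's arguments with the matching PDF paths.
    set -- "$dir"/*.pdf
    # In sh/bash an unmatched glob stays literal, so test that a file exists
    # (in zsh an unmatched glob is reported as an error instead).
    [ -e "$1" ] || { echo "No PDFs found in $dir" >&2; return 1; }
    "$PDF2TEXT" "$@"
}
```

For example, convert_all ~/Desktop would write a .txt file next to each PDF on the Desktop.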

Aug 18, 2023 2:44 AM in response to rlcpd

rlcpd wrote:

I thought I was still Monterey as I have not opted to upgrade to the Venturas....

That's possible. The signature could be wrong. You can check in About This Mac under the Apple menu. The current version of Monterey is 12.6.8.

however: I am glad you do not see this problem. But I do! Just another dumb Mac glitch, I guess.

Which version of Adobe Acrobat or Acrobat Reader are you using? The current version of Reader is 23.003.20269
