OT: Copying text from a PDF

Terry Pinnell
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?

--
Terry Pinnell
Hobbyist, West Sussex, UK
 
"Terry Pinnell" <terrypinDELETE@THESEdial.pipex.com> wrote in message
news:oajq911s3q95buepdthrl0ekpc0jnfrmm7@4ax.com...
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?
I just tried it and it worked OK for me when I pasted the text into the PFE
editor. Here are a couple of lines:

Drain to Source Breakdown Voltage (Note 1) . . . . . . . . . . . . . . . . .
.. . . . . . . . . . . . . . . . . . . . . .V
DS
50 V
Drain to Gate Voltage (R
GS
= 20k
Ů
) (Note 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.. . . . V
DGR
50 V
Continuous Drain Current T
C

It's not perfect, but I haven't got a CR after every character.

I often extract text from PDFs when creating PCB parts, and don't have many
problems.

Leon
 
Terry Pinnell <terrypinDELETE@thesedial.pipex.com> wrote:
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
What's your text editor? Assuming you're under Windows, perhaps the
problem is trying to paste Unicode into an editor that can't handle it.
You might try pasting the text into Word or Wordpad to see what happens.
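
For anyone wondering how an ohm sign can turn into something else on the way to the editor, here is a minimal Python sketch of the kind of mismatch described above. It assumes the character was written in one 8-bit code page and read back in another. The two code pages are picked purely for illustration and are not known to be what Acrobat or any editor in this thread actually used.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text in one code page, then decode the same bytes in another."""
    return text.encode(written_as).decode(read_as)


# Assumption: Greek code page on the way out, Central European on the way in.
print(misread("Ω", "iso8859_7", "cp1250"))  # prints Ů

# An editor limited to plain ASCII has no slot for the character at all.
print("20kΩ".encode("ascii", errors="replace").decode("ascii"))  # prints 20k?
```

If the pasted symbols change depending on the target application, comparing the result in a Unicode-aware editor is a quick way to tell whether this is the cause.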

You might also look at xpdf, http://www.foolabs.com/xpdf/ . I don't
think you can run the PDF viewer under Windows, but the command-line
utilities, including a PDF-to-text converter, will work.

Matt Roberds
 
Leon Heller wrote:
[...]

It's not perfect, but I haven't got a CR after every character.

I often extract text from PDFs when creating PCB parts, and don't have many
problems.

Leon
Don'cha love it when the author turns off the "Text Copy" tool on the
document so you can't copy and paste? Why they do that is beyond me. You
could print as many copies as you wish, or make infinite copies on a
Xerox machine. Why make it difficult to copy a couple of lines of text?

Another moan is when the author uses some weird font that produces
garbage characters when you paste into a text editor. I often end up
shrinking the editor to a small window that overlays the pdf file, and do
a manual copy.

Then there's the text in a scanned image format. No copying, no searches,
and it takes a lot of room on the disk.

Hopefully, in 50 years or so, paper will be found only in museums, and
everyone will have flexible electronic displays. Since there will be no
need to print anything, searches will be easy, and there won't be a need
to use special fonts or lock the document for any reason. Life will be
easy for engineers.

Sure...

Mike Monett
 
On Wed, 01 Jun 2005 07:00:21 +0100, the renowned Terry Pinnell
<terrypinDELETE@THESEdial.pipex.com> wrote:

Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?
One thing I notice that's amiss is that there is a carriage return
before and after subscripted text. So:

V 50 V
DS

Comes out as V<CR>DS<CR> 50 V

The symbol characters (degrees and ohms) also tend to get
translated/screwed up, depending on where you're pasting to. There are
also some lines screwed up, so the ends of some lines end up together
on later lines.

Problems in extracting text are mostly a function of the application
that created the PDF (Framemaker 5.5 for the Power PC set to
LaserWriter 8 8.7 and Acrobat Distiller 4.0 for Macintosh in this
case). In this case, if you open the document in Illustrator you can
see many individual blocks of text, some of which the copy operation
strings together, and others which it misses.

This stuff is fairly easily fixed by a bit of editing-- those dot
leaders are irritating to fix. I tried pasting into a text-only
application (Ultraedit), Excel, the Open Office text editor and into
MS Word, and all came out pretty much the same except for the symbols.
It might even be faster than re-typing everything.
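
The hand edits described above can be roughed out in a few lines of Python. This is only a sketch: the function name `tidy_paste` and both patterns are made up for illustration, and the rule that any line of one to four capital letters is a stranded subscript is a guess that will misfire on other datasheets.

```python
import re

SUBSCRIPT = re.compile(r"[A-Z]{1,4}")         # e.g. DS, GS, DGR on a line of their own
DOT_LEADER = re.compile(r"(?:\s*\.){3,}\s*")  # ". . . . ." or "....."


def tidy_paste(raw: str) -> str:
    """Re-attach stranded subscripts and collapse dot leaders in pasted text."""
    out = []
    join_next = False
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        if out and SUBSCRIPT.fullmatch(line):
            # Glue the subscript straight onto the symbol it belongs to.
            out[-1] += line
            join_next = True
        elif out and join_next:
            # The text after a subscript continues the same row.
            out[-1] += " " + line
            join_next = False
        else:
            out.append(line)
    return "\n".join(DOT_LEADER.sub(" ... ", row).strip() for row in out)


print(tidy_paste("V\nDS\n50 V"))
```

Rows that also contain a mangled symbol on its own line (the ohm sign, for instance) still need fixing by hand.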

Extracting text using GSView in "normal" mode is only slightly better.


Best regards,
Spehro Pefhany
--
"it's the network..." "The Journey is the reward"
speff@interlog.com Info for manufacturers: http://www.trexon.com
Embedded software/hardware/analog Info for designers: http://www.speff.com
 
Spehro Pefhany <speffSNIP@interlogDOTyou.knowwhat> wrote:

On Wed, 01 Jun 2005 07:00:21 +0100, the renowned Terry Pinnell
<terrypinDELETE@THESEdial.pipex.com> wrote:

Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?

One thing I notice that's amiss is that there is a carriage return
before and after subscripted text. So:

V 50 V
DS

Comes out as V<CR>DS<CR> 50 V

The symbol characters (degrees and ohms) also tend to get
translated/screwed up, depending on where you're pasting to. There are
also some lines screwed up, so the ends of some lines end up together
on later lines.

Problems in extracting text are mostly a function of the application
that created the PDF (Framemaker 5.5 for the Power PC set to
LaserWriter 8 8.7 and Acrobat Distiller 4.0 for Macintosh in this
case). In this case, if you open the document in Illustrator you can
see many individual blocks of text, some of which the copy operation
strings together, and others which it misses.

This stuff is fairly easily fixed by a bit of editing-- those dot
leaders are irritating to fix. I tried pasting into a text-only
application (Ultraedit), Excel, the Open Office text editor and into
MS Word, and all came out pretty much the same except for the symbols.
It might even be faster than re-typing everything.

Extracting text using GSView in "normal" mode is only slightly better.


Best regards,
Spehro Pefhany
Thanks for all those prompt responses. I'll follow up the suggestions.

Using TextPad here - great editor.

Same result when pasting into various other apps. I shouldn't have
said returns after *every* character, but still pretty bad:
http://www.terrypin.dial.pipex.com/Images/PDFText1.gif

--
Terry Pinnell
Hobbyist, West Sussex, UK
Wed 1 June 2005, 08:36 UK time
 
"Terry Pinnell" <terrypinDELETE@THESEdial.pipex.com> wrote in message
news:oajq911s3q95buepdthrl0ekpc0jnfrmm7@4ax.com...
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?

--
Terry Pinnell
Hobbyist, West Sussex, UK
Couple of options.

Under Adobe Reader 6 use the snapshot tool to copy and paste into Word or
Excel.

or 2.

download an alternative and quicker to open pdf reader from
www.foxitsiftware.com and use the text tool and paste into Excel. This will
give you a more coherent display but still not perfect.

Cheers
 
"Chris" <not@work.com> wrote in message
news:rKene.9031$BR4.6459@news-server.bigpond.net.au...
"Terry Pinnell" <terrypinDELETE@THESEdial.pipex.com> wrote in message
news:oajq911s3q95buepdthrl0ekpc0jnfrmm7@4ax.com...
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?

--
Terry Pinnell
Hobbyist, West Sussex, UK


Couple of options.

Under Adobe Reader 6 use the snapshot tool to copy and paste into Word or
Excel.

or 2.

download an alternative and quicker to open pdf reader from
www.foxitsiftware.com and use the text tool and paste into Excel. This will
give you a more coherent display but still not perfect.

Cheers
ooops
www.foxitsoftware.com
 
"Chris" <not@work.com> wrote:

"Terry Pinnell" <terrypinDELETE@THESEdial.pipex.com> wrote in message
news:oajq911s3q95buepdthrl0ekpc0jnfrmm7@4ax.com...
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?

--
Terry Pinnell
Hobbyist, West Sussex, UK


Couple of options.

Under Adobe Reader 6 use the snapshot tool to copy and paste into Word or
Excel.

or 2.

download an alternative and quicker to open pdf reader from
www.foxitsiftware.com and use the text tool and paste into Excel. This will
give you a more coherent display but still not perfect.

Cheers
Thanks. Yes, that is arguably an improvement:
http://www.terrypin.dial.pipex.com/Images/PDFText2.gif
compared to Adobe Acrobat Reader (5 in my case; each version seems to
get worse to me!):
http://www.terrypin.dial.pipex.com/Images/PDFText1.gif
but I see PDF Reader has pasted a fixed size font rather than the
original proportional?

--
Terry Pinnell
Hobbyist, West Sussex, UK
 
Terry Pinnell <terrypinDELETE@THESEdial.pipex.com> wrote:


Thanks. Yes, that is arguably an improvement:
http://www.terrypin.dial.pipex.com/Images/PDFText2.gif
compared to Adobe Acrobat Reader (5 in my case; each version seems to
get worse to me!):
http://www.terrypin.dial.pipex.com/Images/PDFText1.gif
but I see PDF Reader has pasted a fixed size font rather than the
original proportional?
....but guess I must have used WordPad for the first! Don't recall
doing so - but can't think of any other explanation. So that makes pdf
reader definitely an improvement.

--
Terry Pinnell
Hobbyist, West Sussex, UK
 
On Wed, 01 Jun 2005 03:17:53 -0400, Spehro Pefhany
<speffSNIP@interlogDOTyou.knowwhat> wrote:

On Wed, 01 Jun 2005 07:00:21 +0100, the renowned Terry Pinnell
<terrypinDELETE@THESEdial.pipex.com> wrote:

Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?

One thing I notice that's amiss is that there is a carriage return
before and after subscripted text. So:

V 50 V
DS

Comes out as V<CR>DS<CR> 50 V

The symbol characters (degrees and ohms) also tend to get
translated/screwed up, depending on where you're pasting to. There are
also some lines screwed up, so the ends of some lines end up together
on later lines.

Problems in extracting text are mostly a function of the application
that created the PDF (Framemaker 5.5 for the Power PC set to
LaserWriter 8 8.7 and Acrobat Distiller 4.0 for Macintosh in this
case). In this case, if you open the document in Illustrator you can
see many individual blocks of text, some of which the copy operation
strings together, and others which it misses.

This stuff is fairly easily fixed by a bit of editing-- those dot
leaders are irritating to fix. I tried pasting into a text-only
application (Ultraedit), Excel, the Open Office text editor and into
MS Word, and all came out pretty much the same except for the symbols.
It might even be faster than re-typing everything.

Extracting text using GSView in "normal" mode is only slightly better.


Best regards,
Spehro Pefhany
I'm using Adobe Acrobat 4... I have version 5, but it's been screwed
over by zealot programmers, so I only use it to read some stuff that
version 4 lacks font capability for.

With version 4 I get spaces with subscripted text, no <CR>; otherwise
looks OK.

...Jim Thompson
--
| James E.Thompson, P.E. | mens |
| Analog Innovations, Inc. | et |
| Analog/Mixed-Signal ASIC's and Discrete Systems | manus |
| Phoenix, Arizona Voice:(480)460-2350 | |
| E-mail Address at Website Fax:(480)460-2142 | Brass Rat |
| http://www.analog-innovations.com | 1962 |

I love to cook with wine. Sometimes I even put it in the food.
 
On Wed, 01 Jun 2005 03:17:53 -0400, Spehro Pefhany
<speffSNIP@interlogDOTyou.knowwhat> wrote:

On Wed, 01 Jun 2005 07:00:21 +0100, the renowned Terry Pinnell
<terrypinDELETE@THESEdial.pipex.com> wrote:

Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?

One thing I notice that's amiss is that there is a carriage return
before and after subscripted text. So:

V 50 V
DS

Comes out as V<CR>DS<CR> 50 V

The symbol characters (degrees and ohms) also tend to get
translated/screwed up, depending on where you're pasting to. There are
also some lines screwed up, so the ends of some lines end up together
on later lines.

Problems in extracting text are mostly a function of the application
that created the PDF (Framemaker 5.5 for the Power PC set to
LaserWriter 8 8.7 and Acrobat Distiller 4.0 for Macintosh in this
case). In this case, if you open the document in Illustrator you can
see many individual blocks of text, some of which the copy operation
strings together, and others which it misses.

This stuff is fairly easily fixed by a bit of editing-- those dot
leaders are irritating to fix. I tried pasting into a text-only
application (Ultraedit), Excel, the Open Office text editor and into
MS Word, and all came out pretty much the same except for the symbols.
It might even be faster than re-typing everything.

Extracting text using GSView in "normal" mode is only slightly better.


Best regards,
Spehro Pefhany
I use Clipmate http://www.thornsoft.com/ which has nice text cleanup.
Apparently it was not necessary for:

30A, 50V, 0.040 Ohm, N-Channel Power
MOSFET

It showed up as WYSIWYG

--

Boris Mohar
 
mroberds@worldnet.att.net wrote:
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
You might also look at xpdf, http://www.foolabs.com/xpdf/ . I don't
think you can run the PDF viewer under Windows,
Just downloaded it. Thanks. Wouldn't want to run it under 'doze
anyway. :)

BTW, Ghost Script/Ghost View extracts it with no problem. So does
Acrobat but it's easier with Ghost.

Ted
 
Terry Pinnell wrote:
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?
Three suggestions:
Get PMView and use the screen capture => convert to 16 color => Save as
a .PNG. The file size for the max ratings is <6KB.
Install a virtual PostScript printer set to print to file.

You can grab anything with these tools.

Ted
 
Ted Edwards <Ted_Espamless@telus.net> wrote:

Terry Pinnell wrote:
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.
Each individual character gets a return inserted. Typical example is
at http://www.fairchildsemi.com/ds/BU/BUZ11.pdf, where I just wanted
to extract the details under 'Absolute Maximum Ratings'.

What's the deal here please? If the document is proprietorially
protected, wouldn't the Text tool be inaccessible?

Three suggestions:
Get PMView and use the screen capture => convert to 16 color => Save as
a .PNG. The file size for the max ratings is <6KB.
Install a virtual PostScript printer set to print to file.

You can grab anything with these tools.

Ted
Thanks. I took a look at PMView but it seems to be just a (versatile)
image viewer, rather like several others (e.g. IrfanView), which can
also Print to File. Maybe I should explore the second part of your
recommendation; what 'virtual PostScript printer' do you use please?

BTW, I have Snagit, which can also capture *text* from many windows,
although it fails in the PDF example under discussion.

--
Terry Pinnell
Hobbyist, West Sussex, UK
 
Terry Pinnell wrote:
Thanks. I took a look at PMView but it seems to be just a (versatile)
image viewer, rather like several others (e.g. IrfanView), which can
also Print to File.
It is that but it also has a capture facility that allows capturing the
whole screen, a selected area of the screen, a window or the interior of
a window.

Maybe I should explore the second part of your
recommendation; what 'virtual PostScript printer' do you use please?
From your headers, I guess you are running 'doze. I'm not so I can
only give you general guidelines for what I did. Since I am printing to
file the physical printer does not need to be present at all. I picked
a high end colour laser printer and downloaded the postscript driver for
it. I installed it but checked the box that says "Print to file". I
also have a real Canon i850 on my system so whenever I send something
to the printer, I am given the choice of which of the two printers is to
be used. If I want real hard copy, I select the i850. If I want a file
I select the PostScript printer. With the latter, I'm then asked for a
file, e.g. G:\downloads\glurp.ps. I can then convert that to PDF, PNG
or a choice of several other formats including "extract text" with Ghost
View.

Perhaps someone here who is a 'doze user can clarify this for you.

Ted
 
On Wed, 01 Jun 2005 07:08:01 +0000, mroberds wrote:

Terry Pinnell <terrypinDELETE@thesedial.pipex.com> wrote:
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.

What's your text editor? Assuming you're under Windows, perhaps the
problem is trying to paste Unicode into an editor that can't handle it.
You might try pasting the text into Word or Wordpad to see what happens.

You might also look at xpdf, http://www.foolabs.com/xpdf/ . I don't
think you can run the PDF viewer under Windows, but the command-line
utilities, including a PDF-to-text converter, will work.
Barely a day goes by that Slackware doesn't pleasantly surprise me!
It seems I got xpdf along with it, and lo and behold:
------------------------
30A, 50V, 0.040 Ohm, N-Channel Power
MOSFET
This is an N-Channel enhancement mode silicon gate power field effect
transistor designed for applications such as switching regulators,
switching converters, motor drivers, relay drivers and drivers for high
power bipolar switching transistors requiring high speed and low gate
drive power. This type can be operated directly from integrated circuits.
Formerly developmental type TA9771.
Ordering Information
PART NUMBER PACKAGE BRAND
BUZ11 TO-220AB BUZ11
NOTE: When ordering, use the entire part number.

Features
ˇ 30A, 50V
ˇ rDS(ON) = 0.040
ˇ SOA is Power Dissipation Limited
ˇ Nanosecond Switching Speeds
ˇ Linear Transfer Characteristics
ˇ High Input Impedance
ˇ Majority Carrier Device
ˇ Related Literature
- TB334 "Guidelines for Soldering Surface Mount
Components to PC Boards"
Symbol
D
G
S
--------------

Cheers!
Rich
 
Rich Grise <richgrise@example.net> wrote:

On Wed, 01 Jun 2005 07:08:01 +0000, mroberds wrote:

Terry Pinnell <terrypinDELETE@thesedial.pipex.com> wrote:
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.

What's your text editor? Assuming you're under Windows, perhaps the
problem is trying to paste Unicode into an editor that can't handle it.
You might try pasting the text into Word or Wordpad to see what happens.

You might also look at xpdf, http://www.foolabs.com/xpdf/ . I don't
think you can run the PDF viewer under Windows, but the command-line
utilities, including a PDF-to-text converter, will work.

Barely a day goes by that Slackware doesn't pleasantly surprise me!
It seems I got xpdf along with it, and lo and behold:
[...]

Cheers!
Rich
Thanks for the text paste.

Must say I'm a bit lost on that site
http://www.foolabs.com/xpdf/home.html
Can you help me locate specifically the PDF to text converter please?
I'm wallowing in files with off-putting and Windows-alien names like
't1lib-1.3.tar.gz'.

--
Terry Pinnell
Hobbyist, West Sussex, UK
 
On Thu, 02 Jun 2005 23:16:06 +0100, Terry Pinnell wrote:

Rich Grise <richgrise@example.net> wrote:

On Wed, 01 Jun 2005 07:08:01 +0000, mroberds wrote:

Terry Pinnell <terrypinDELETE@thesedial.pipex.com> wrote:
Quite often I have trouble extracting text from a PDF. I use the Text
tool, copy, but on then pasting into my text editor I get garbage.

What's your text editor? Assuming you're under Windows, perhaps the
problem is trying to paste Unicode into an editor that can't handle it.
You might try pasting the text into Word or Wordpad to see what happens.

You might also look at xpdf, http://www.foolabs.com/xpdf/ . I don't
think you can run the PDF viewer under Windows, but the command-line
utilities, including a PDF-to-text converter, will work.

Barely a day goes by that Slackware doesn't pleasantly surprise me!
It seems I got xpdf along with it, and lo and behold:
[...]

Cheers!
Rich

Thanks for the text paste.

Must say I'm a bit lost on that site
http://www.foolabs.com/xpdf/home.html
Can you help me locate specifically the PDF to text converter please?
I'm wallowing in files with off-putting and Windows-alien names like
't1lib-1.3.tar.gz'.
Click "Download" to get to http://www.foolabs.com/xpdf/download.html ,
then scroll down to "Precompiled binaries" and it's probably either
xpdf-3.00pl3-win32.zip (1142558 bytes) for Win32 or
xpdf-3.00pl3-dos6.zip (1775202 bytes) for DOS.

Here's a chunk of the README for the win32 version, copied without
permission:
--------<begin excerpt>----------
Xpdf
====

version 3.00
2004-jan-22

The Xpdf software and documentation are
copyright 1996-2004 Glyph & Cog, LLC.

Email: derekn@foolabs.com
WWW: http://www.foolabs.com/xpdf/

The PDF data structures, operators, and specification are
copyright 1985-2003 Adobe Systems Inc.


What is Xpdf?
-------------

Xpdf is an open source viewer for Portable Document Format (PDF)
files. (These are also sometimes also called 'Acrobat' files, from
the name of Adobe's PDF software.) The Xpdf project also includes a
PDF text extractor, PDF-to-PostScript converter, and various other
utilities.

Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X
components (pdftops, pdftotext, etc.) also run on Win32 systems and
should run on pretty much any system with a decent C++ compiler.

Xpdf is designed to be small and efficient. It can use Type 1 or
TrueType fonts.


Distribution
------------

Xpdf is licensed under the GNU General Public License (GPL), version
2. In my opinion, the GPL is a convoluted, confusing, ambiguous mess.
But it's also pervasive, and I'm sick of arguing. And even if it is
confusing, the basic idea is good.

In order to cut down on the confusion a little bit, here are some
informal clarifications:

- I don't mind if you redistribute Xpdf in source and/or binary form,
as long as you include all of the documentation: README, man pages
(or help files), and COPYING. (Note that the README file contains a
pointer to a web page with the source code.)

- Selling a CD-ROM that contains Xpdf is fine with me, as long as it
includes the documentation. I wouldn't mind receiving a sample
copy, but it's not necessary.

- If you make useful changes to Xpdf, please make the source code
available -- post it on a web site, email it to me, whatever.

If you're interested in commercial licensing, please see the Glyph &
Cog web site:

http://www.glyphandcog.com/


Compatibility
-------------

Xpdf is developed and tested on a Linux 2.4 x86 system.

In addition, it has been compiled by others on Solaris, AIX, HP-UX,
Digital Unix, Irix, and numerous other Unix implementations, as well
as VMS and OS/2. It should work on pretty much any system which runs
X11 and has Unix-like libraries. You'll need ANSI C++ and C compilers
to compile it.

The non-X components of Xpdf (pdftops, pdftotext, pdfinfo, pdffonts,
pdftoppm, and pdfimages) can also be compiled on Win32 systems. See
the Xpdf web page for details.

If you compile Xpdf for a system not listed on the web page, please
let me know. If you're willing to make your binary available by ftp
or on the web, I'll be happy to add a link from the Xpdf web page. I
have decided not to host any binaries I didn't compile myself (for
disk space and support reasons).

If you can't get Xpdf to compile on your system, send me email and
I'll try to help.

Xpdf has been ported to the Acorn, Amiga, BeOS, and EPOC. See the
Xpdf web page for links.


Getting Xpdf
------------

The latest version is available from:

http://www.foolabs.com/xpdf/

or:

ftp://ftp.foolabs.com/pub/xpdf/

Source code and several precompiled executables are available.

Announcements of new versions are posted to several newsgroups
(comp.text.pdf, comp.os.linux.announce, and others) and emailed to a
list of people. If you'd like to receive email notification of new
versions, just let me know.


Running Xpdf
------------

To run xpdf, simply type:

xpdf file.pdf

To generate a PostScript file, hit the "print" button in xpdf, or run
pdftops:

pdftops file.pdf

To generate a plain text file, run pdftotext:

pdftotext file.pdf

There are four additional utilities (which are fully described in
their man pages):

pdfinfo -- dumps a PDF file's Info dictionary (plus some other
useful information)
pdffonts -- lists the fonts used in a PDF file along with various
information for each font
pdftoppm -- converts a PDF file to a series of PPM/PGM/PBM-format
bitmaps
pdfimages -- extracts the images from a PDF file

Command line options and many other details are described in the man
pages (xpdf.1, etc.) and the VMS help files (xpdf.hlp, etc.).
-------<end excerpt>-----

Good Luck!
Rich
 
Rich Grise <richgrise@example.net> wrote:

Click "Download" to get to http://www.foolabs.com/xpdf/download.html ,
then scroll down to "Precompiled binaries" and it's probably either
xpdf-3.00pl3-win32.zip (1142558 bytes) for Win32 or
xpdf-3.00pl3-dos6.zip (1775202 bytes) for DOS.
[snip useful extract]

Good Luck!
Rich
Thanks, got it, and it works a treat. Particularly impressed with its
speed. Glad you introduced me to it.

Mind you, I'm never too keen on leaving the GUI and getting into a DOS
Command Prompt window <g>. Seem to be two ways of doing it:

1) Open the DOS window in the XPDF program folder and then (assuming
defaults are OK) enter:

pdftotext "D:\long path\probably with some blanks\so needs quotes\filename.pdf"


2) Open the DOS window in the folder containing the PDF file and
enter:

"D:\Program Files\xpdf-3.00pl3\pdftotext" filename.pdf

Am I right? Any other methods?
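
A third way, sketched here in Python, is to let a small script build the command so the quoting takes care of itself. The function names are invented for this example, it assumes pdftotext is on the PATH (or that its full location is passed as `exe`), and `-layout` is the pdftotext option that tries to keep the original physical layout.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, exe="pdftotext", keep_layout=False):
    """Return the argument list for converting one PDF to a .txt beside it."""
    pdf = Path(pdf_path)
    cmd = [exe]
    if keep_layout:
        cmd.append("-layout")
    # Each argument is its own list item, so blanks in the path need no quotes.
    cmd += [str(pdf), str(pdf.with_suffix(".txt"))]
    return cmd


def pdf_to_text(pdf_path, exe="pdftotext", keep_layout=False):
    """Run pdftotext and return the path of the text file it was asked to write."""
    if shutil.which(exe) is None:
        raise FileNotFoundError(f"cannot find {exe}; pass its full path as exe")
    cmd = build_pdftotext_command(pdf_path, exe, keep_layout)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])


print(build_pdftotext_command("BUZ11 data.pdf", keep_layout=True))
```

Calling `pdf_to_text` with the full path to the PDF works from any folder, which sidesteps the choice between methods 1 and 2.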

--
Terry Pinnell
Hobbyist, West Sussex, UK
 
