Archive formats...

On 02/12/2021 09:52, Don Y wrote:
On 12/2/2021 2:24 AM, Martin Brown wrote:
On 01/12/2021 22:31, Don Y wrote:
On 12/1/2021 11:52 AM, Dave Platt wrote:
In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
Jasen Betts  <usenet@revmaps.no-ip.org> wrote:
On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).

If by compressed you mean made smaller, that's obviously false.

If we interpret "compressed" to mean "compressed without information
loss", Jasen is correct.  This can't be done.

No, you are assuming there is no other (implicit) source of
information that the compressor can rely upon.

He is stating a well known and general result.

That only applies in the general case.  The fact that most compressors
achieve *some* compression means the general case is RARE in the wild;
typically encountered when someone tries to compress already compressed
content.

That used to be true but most content these days apart from HTML and
flat text files is already compressed with something like a crude ZIP.
MS Office 2007 onwards files are thinly disguised ZIP files.

My compressor obviously relies on the fact that 0xFE does not
occur in ascii text.  (If it did, I'd have to encode *it* in some
other manner)

[Unapplicable "proof" elided]

His general point is true though.

It isn't important to the issues I'm addressing.

It isn't clear to me what issue you are trying to address. Recognising
the header info for each compression method and then using it will get
you a long way, but ISTR it has already been done. A derivative of the
IFL DOS utility programme that used to do that, for example.

AV programs that can see into (some) archive files also use this method.

*If* compression is used, IT WILL ALREADY HAVE BEEN APPLIED BEFORE I
ENCOUNTER THE (compressed) FILE(s).  Any increase or decrease in file
size will already have been "baked in".  There is no value to my being
able to "lecture" the content creator that his compression actually
INCREASED the size of his content.  (caps are for emphasis, not shouting).

But if you have control of what is being done you can detect files where
the method you are using cannot make any improvements and just copy it.

[Compression also affords other features that are absent in its absence.
In particular, most compressors include checksums -- either implied
or explicit -- that further act to vouch for the integrity of the
content.  Can you tell me if "foo.txt" is corrupted?  What about
"foo.zip"?]

*My* concern is being able to recover the original file(s).  REGARDLESS
OF THE COMPRESSORS AND ARCHIVERS USED TO GET THEM INTO THEIR CURRENT FORM.

You definitely want to use the byte entropy to classify them then. That
will tell you fairly quickly whether or not a given block of disk is
JPEG, HTML or EXE without doing much work at all.
A user can use an off-the-shelf archiver to "bundle" multiple files
into a single "archive file".  So, I need to be able to "unbundle"
them, regardless of the archiver he happened to choose -- hence my
interest in "archive formats".

There were utilities of this sort back in the 1990s; why reinvent the
wheel? It is harder now since there are even more rival formats.

A user can often opt to "compress" that resulting archive (or, the archive
program may offer that as an option applied while the archive is built).
(Or, an individual file without "bundling")

So, in order to unbundle the archive (or recover the singleton), I need
to be able to UNcompress it.  Hence my interest in compressors.

A user *could* opt to encrypt the contents.  If so, I won't even attempt
to access the original files.  I have no desire to expend resources
"guessing" secrets!

He can also opt to apply some other (wacky, home-baked) encoding or
compression scheme (e.g., when sending executables through mail, I
routinely change the file extension to "xex" and prepend some gibberish
at the front of the file to obscure its signature -- because some mail
scanners will attempt to decompress compressed files to "protect" the
recipients, otherwise wrapping it in a ZIP would suffice).  If so, I
won't even attempt to access the original file(s).

That was a part of what I used to do to numeric data files way back
exploiting the fact that spreadsheets treat a blank cell as 0. Then
compress by RLE and then hit that with ZIP. It got quite close to what a
modern compression algorithm can do.
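
A toy illustration of that kind of pipeline, assuming long runs of zero bytes and an arbitrary escape convention: run-length encode the zeros, then hand the result to a general-purpose compressor.

import zlib

def rle_zeros(data):
    """Encode each run of 0x00 bytes as the pair (0x00, run_length)."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0:
            run = 1
            while i + run < len(data) and data[i + run] == 0 and run < 255:
                run += 1
            out += bytes((0, run))
            i += run
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

packed = zlib.compress(rle_zeros(b"\x01" + b"\x00" * 5000 + b"\x02"))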

One can argue that a user might do some other "silly" transform (ROT13?)
so I could cover those bases with (equally silly) inversions.  I want to
identify the sorts of *likely* "processes" to which some (other!) user
could have subjected a file's (or group of files') content and be able
to reverse them.

Transforms that only alter the value of the symbols but not their
frequency should make no difference at all to the compressibility with
the entropy based variable bit length encoding methods being used today.
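
A quick check of that claim: any bijective remapping of byte values, ROT13 included, permutes the histogram without changing its shape, so the zeroth-order entropy -- and what an entropy coder can achieve -- is identical.

import codecs
from collections import Counter

text = b"Attack at dawn"                      # any ASCII sample
rot = codecs.encode(text.decode("ascii"), "rot13").encode("ascii")

# Same multiset of symbol frequencies -> same entropy -> same
# Huffman/arithmetic coding performance on both versions.
assert sorted(Counter(text).values()) == sorted(Counter(rot).values())
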
[I recently encountered some dictionaries that were poorly disguised ZIP
archives]

If the user *chose* to encode his content in BNPF, then I want to be able
to *decode* that content.  (as long as I don't have to "guess secrets"
or try to reverse engineer some wacky coding/packing scheme)

It's a relatively simple problem to solve -- once you've identified the
range of *common* archivers/encoders/compressors that might be used!
(e.g., SIT is/was common on Macs)

Trying every one is a bit too brute force for my taste. Trying the ones
that might be appropriate for the dataset would be my preference.

Unless there is some other redundant structure in the file you cannot
compress a file where the bytewise entropy is ln(256) or nearly so.

You also have to work much harder to get that very last 1% of
additional compression too - most algorithms don\'t even try.

PNG is one of the better lossless image ones and gets ~ln(190)
ZIP on larger files gets very close indeed ~ln(255.7)
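
For reference, in the same natural-log units: the ceiling quoted above is just 8 bits/byte expressed in nats, and Shannon's source-coding bound is what rules out losslessly beating it on maximum-entropy input:

\[
H_{\max} = \ln 256 \approx 5.545 \text{ nats/byte} = 8 \text{ bits/byte},
\qquad
\bar{\ell} \ge \frac{H(X)}{\ln 2} \text{ bits/symbol}.
\]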

I think you will find this gets complicated and slow.
I do something not unlike what you are proposing to recognise fragments
of damaged JPEG files and then splice a plausible header on the front to
take a look at what it looks like. Enough systems use default Huffman
tables that there is a fair chance of getting a piece of an image back.

Most modern lossless compressors optimise their symbol tree heavily and
so you have no such crib to deal with a general archive file. The
impossible ones are the backups using proprietary unpublished algorithms
- I can tell from byte entropy that they are pretty good though.

BTW do you have time to run that (now rather large benchmark) through
the Intel C/C++ compiler? I have just got it to compile with Clang for
the M1 and am hoping to run the benchmarks on Friday. My new target
Intel CPU to test is 12600K which looks like it could be a real winner.
(ie slam it through the compiler and send me back the error msgs)

I'm sure there will be some since every port of notionally "Portable"
software to a new compiler uncovers coding defects (or compiler
defects). Clang for instance doesn't honour I64 length in printf.

--
Regards,
Martin Brown
 
Martin Brown wrote:
On 01/12/2021 22:31, Don Y wrote:
On 12/1/2021 11:52 AM, Dave Platt wrote:
In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
Jasen Betts  <usenet@revmaps.no-ip.org> wrote:
On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).

If by compressed you mean made smaller, that's obviously false.

If we interpret "compressed" to mean "compressed without information
loss", Jasen is correct.  This can't be done.

No, you are assuming there is no other (implicit) source of
information that the compressor can rely upon.

He is stating a well known and general result.

One that sometimes catches people out. We had offline compression for
bulk data over a phone line that could break some telecom modems' realtime
compression back in the day. Internal buffer overflow, because the data
expanded quite a bit when their simplistic "compression" algorithm tried
to process it in realtime. If it is still around, I created a document
called fullfile which epitomised the maximally incompressible file.
There were already a test file of sample ASCII text and an empty file
(which essentially tests the baud rate of the modems at each end).

I, for example, have JUST designed a compressor that compresses
all occurrences of the string "No, you are assuming there is no
other (implicit) source of information that the compressor can
rely upon." into the hex constant 0xFE.

As such, the first paragraph in my reply, here, can be compressed
to a single byte!  The remaining characters in this message are
not affected by my compressor.  So, the message ends up SMALLER
as a result of the elided characters in that first paragraph.

My compressor obviously relies on the fact that 0xFE does not
occur in ascii text.  (If it did, I'd have to encode *it* in some
other manner)
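
Spelled out as code (a minimal sketch, assuming Python): the scheme works only because 0xFE never occurs in 7-bit ASCII input, which is exactly the implicit side information being relied upon.

PHRASE = (b"No, you are assuming there is no other (implicit) source of "
          b"information that the compressor can rely upon.")
TOKEN = b"\xfe"   # never appears in 7-bit ASCII text

def compress(msg):
    assert TOKEN not in msg           # the scheme's single precondition
    return msg.replace(PHRASE, TOKEN)

def decompress(msg):
    return msg.replace(TOKEN, PHRASE)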

[Unapplicable "proof" elided]

His general point is true though.

Unless there is some other redundant structure in the file you cannot
compress a file where the bytewise entropy is ln(256) or nearly so.

You also have to work much harder to get that very last 1% of additional
compression too - most algorithms don\'t even try.

PNG is one of the better lossless image ones and gets ~ln(190)
ZIP on larger files gets very close indeed ~ln(255.7)

A piece of ancient programming wisdom appears relevant:

\"As everybody knows, all programs have bugs, and all programs can be
made smaller.

Therefore all programs can be reduced to a single incorrect instruction.\"

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
On 12/2/2021 3:47 AM, Martin Brown wrote:
On 02/12/2021 09:52, Don Y wrote:
On 12/2/2021 2:24 AM, Martin Brown wrote:
On 01/12/2021 22:31, Don Y wrote:
On 12/1/2021 11:52 AM, Dave Platt wrote:
In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
Jasen Betts <usenet@revmaps.no-ip.org> wrote:
On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).

If by compressed you mean made smaller, that's obviously false.

If we interpret "compressed" to mean "compressed without information
loss", Jasen is correct. This can't be done.

No, you are assuming there is no other (implicit) source of
information that the compressor can rely upon.

He is stating a well known and general result.

That only applies in the general case. The fact that most compressors
achieve *some* compression means the general case is RARE in the wild;
typically encountered when someone tries to compress already compressed
content.

That used to be true but most content these days apart from HTML and flat text
files is already compressed with something like a crude ZIP. MS Office 2007
onwards files are thinly disguised ZIP files.

Again, it doesn't matter to *my* use. If someone chooses to ZIP a RAR of a UUE
of a BZ2 and then repeat the entire chain of compressors a *second* time,
resulting in a godawful mess, there's nothing I could have done to prevent
that from being done.

*My* interest is being able to unravel those different layers -- regardless
of *which* compressors and archivers had been applied to get at the "juicy
nougat" inside.

My compressor obviously relies on the fact that 0xFE does not
occur in ascii text. (If it did, I'd have to encode *it* in some
other manner)

[Unapplicable "proof" elided]

His general point is true though.

It isn't important to the issues I'm addressing.

It isn't clear to me what issue you are trying to address. Recognising the
header info for each compression method and then using it will get you a long
way, but ISTR it has already been done. A derivative of the IFL DOS utility
programme that used to do that, for example.

I want to know the "extent" of the problem before posing a solution.
I have many compressors, decompressors, archivers, dearchivers, etc.
already. Many have been written to try to address multiple "forms"
of these actions.

But, none (IMO) have actually addressed *all*. And, I don't *need*
a single-executable-solution; I need to know *which* executable to
apply based on the compression and archiving format "detected"
in the file in question. In the classic UN*X fashion, I can
build a new tool from a *set* of existing tools -- instead of the
Windows solution of rewriting (re-bugging?) all of those existing
tools into a new version of same. The advantage being that I can
add whatever new/exotic format comes along instead of waiting for
someone to build a new (bug free!) executable.

AV programs that can see into (some) archive files also use this method.

*If* compression is used, IT WILL ALREADY HAVE BEEN APPLIED BEFORE I
ENCOUNTER THE (compressed) FILE(s). Any increase or decrease in file
size will already have been "baked in". There is no value to my being
able to "lecture" the content creator that his compression actually
INCREASED the size of his content. (caps are for emphasis, not shouting).

But if you have control of what is being done you can detect files where the
method you are using cannot make any improvements and just copy it.

Again, I'm not going to alter the original archive/compressed file/etc.
I'm going to leave it however it was. *But* apply whatever inverse
transforms are needed to examine its internal contents in the form
that those contents were originally intended to take.

E.g., if someone builds a tarball of a bunch of RAW images and then compresses
with StuffIt, I'm going to preserve that mess -- but extract copies of those
images for my own use. There's no point in my "repackaging" them; they
are already "packaged" AND I HAVE THE TOOLS TO UNPACK THEM, again, if I
so desire.

[Compression also affords other features that are absent in its absence.
In particular, most compressors include checksums -- either implied
or explicit -- that further act to vouch for the integrity of the
content. Can you tell me if "foo.txt" is corrupted? What about
"foo.zip"?]

*My* concern is being able to recover the original file(s). REGARDLESS
OF THE COMPRESSORS AND ARCHIVERS USED TO GET THEM INTO THEIR CURRENT FORM.

You definitely want to use the byte entropy to classify them then. That will
tell you fairly quickly whether or not a given block of disk is JPEG, HTML or
EXE without doing much work at all.

Or SIT, DMG, UUE, BZ2, ...

My plan was to use the file extension as a *hint* and then file(1) (or similar)
to verify signature(s) before processing with whatever tool (and set of command
line switches) is required -- assuming the tool may require "further direction"
than just being thrown at the file.
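
A minimal sketch of that check, assuming Python; the magic numbers are the published ones, and file(1)/libmagic would of course cover far more.

# Format sniffing by signature; the extension is only a hint.
MAGIC = [
    (0,   b"PK\x03\x04",          "zip"),
    (0,   b"\x1f\x8b",            "gzip"),
    (0,   b"BZh",                 "bzip2"),
    (0,   b"\xfd7zXZ\x00",        "xz"),
    (0,   b"7z\xbc\xaf\x27\x1c",  "7z"),
    (0,   b"Rar!\x1a\x07",        "rar"),
    (0,   b"\x28\xb5\x2f\xfd",    "zstd"),
    (257, b"ustar",               "tar"),  # pre-POSIX tarballs have no magic at all
]

def sniff(path):
    """Return a format name, or None if no known signature matches."""
    with open(path, "rb") as f:
        head = f.read(512)
    for offset, sig, name in MAGIC:
        if head[offset:offset + len(sig)] == sig:
            return name
    return None   # unknown: fall back to extension, file(1), or entropy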

7z is fairly comprehensive. But, I'm not sure it would recognize "legacy"
tarballs. (this is also an argument against re-encoding the files; I'd
be annoyed if a Sun patch archive wouldn't be deployable on a Slowaris box
because I opted to reencode it in a "denser" form!)

A user can use an off-the-shelf archiver to "bundle" multiple files
into a single "archive file". So, I need to be able to "unbundle"
them, regardless of the archiver he happened to choose -- hence my
interest in "archive formats".

There were utilities of this sort back in the 1990s; why reinvent the wheel? It
is harder now since there are even more rival formats.

If there's such a beast, then it will clearly enumerate EVERY such format,
right? So, all I'd need to do is look at its spec sheet to answer the
question posed by this post...

If it is foolish enough to rely on file names (extensions), then it's
already likely doomed -- as anyone can name any file anything they
want!

A user can often opt to "compress" that resulting archive (or, the archive
program may offer that as an option applied while the archive is built).
(Or, an individual file without "bundling")

So, in order to unbundle the archive (or recover the singleton), I need
to be able to UNcompress it. Hence my interest in compressors.

A user *could* opt to encrypt the contents. If so, I won't even attempt
to access the original files. I have no desire to expend resources
"guessing" secrets!

He can also opt to apply some other (wacky, home-baked) encoding or compression
scheme (e.g., when sending executables through mail, I routinely change the
file extension to "xex" and prepend some gibberish at the front of the file
to obscure its signature -- because some mail scanners will attempt to
decompress compressed files to "protect" the recipients, otherwise wrapping
it in a ZIP would suffice). If so, I won't even attempt to access the
original file(s).

That was a part of what I used to do to numeric data files way back exploiting
the fact that spreadsheets treat a blank cell as 0. Then compress by RLE and
then hit that with ZIP. It got quite close to what a modern compression
algorithm can do.

One can argue that a user might do some other "silly" transform (ROT13?)
so I could cover those bases with (equally silly) inversions. I want to
identify the sorts of *likely* "processes" to which some (other!) user
could have subjected a file's (or group of files') content and be able
to reverse them.

Transforms that only alter the value of the symbols but not their frequency
should make no difference at all to the compressibility with the entropy based
variable bit length encoding methods being used today.

Again, I don't care about how effective a compressor is. Or, how much
overhead an archive carries with it (do I really want/need to be
able to observe the *permissions* of the files that were packed
into the archive?)

There are countless other "hazards" in the process (e.g., tarbombs
that don't unpack cooperatively into the "new" file system) that I've
got to be sure I address.

How many times have you tried to unpack something (under Windows)
only to discover you've hit the MAXPATHLEN limit for their shell?
Or, that "Makefile" overwrote "makefile" in the same folder?
Or, "object::method" couldn't be stored in the file system?
Or, "What's up with all these file formats?.txt"?

[I recently encountered some dictionaries that were poorly disguised ZIP
archives]

If the user *chose* to encode his content in BNPF, then I want to be able
to *decode* that content. (as long as I don't have to "guess secrets"
or try to reverse engineer some wacky coding/packing scheme)

It's a relatively simple problem to solve -- once you've identified the
range of *common* archivers/encoders/compressors that might be used!
(e.g., SIT is/was common on Macs)

Trying every one is a bit too brute force for my taste. Trying the ones that
might be appropriate for the dataset would be my preference.

Look *inside* the envelope for a signature. Try whatever tool *that*
suggests.
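
Concretely, the dispatch can be a small table from detected format to an existing external unpacker -- a sketch, assuming Python, a detector like the sniff() above, and that the usual command-line tools are installed (treat the exact invocations as placeholders for whatever is actually available):

import subprocess

# Detected format -> command line for an off-the-shelf unpacker.
UNPACKERS = {
    "tar":   lambda src, dst: ["tar", "-xf", src, "-C", dst],
    "zip":   lambda src, dst: ["unzip", src, "-d", dst],
    "7z":    lambda src, dst: ["7z", "x", src, "-o" + dst],
    "gzip":  lambda src, dst: ["gzip", "-dk", src],    # decompresses beside src
    "bzip2": lambda src, dst: ["bzip2", "-dk", src],   # ditto; dst unused
}

def unpack(src, dst, fmt):
    if fmt not in UNPACKERS:
        raise ValueError(f"no tool registered for {fmt!r}")
    subprocess.run(UNPACKERS[fmt](src, dst), check=True)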

Unless there is some other redundant structure in the file you cannot
compress a file where the bytewise entropy is ln(256) or nearly so.

You also have to work much harder to get that very last 1% of additional
compression too - most algorithms don\'t even try.

PNG is one of the better lossless image ones and gets ~ln(190)
ZIP on larger files gets very close indeed ~ln(255.7)

I think you will find this gets complicated and slow.
I do something not unlike what you are proposing to recognise fragments of
damaged JPEG files and then splice a plausible header on the front to take a
look at what it looks like. Enough systems use default Huffman tables that
there is a fair chance of getting a piece of an image back.

I have file recovery tools that try to do that. Typically, targeting
some *type* of file (e.g., MSOffice, photos, etc.).

My preferred solution is not to "lose" the file, in the first place! :>

Most modern lossless compressors optimise their symbol tree heavily and so you
have no such crib to deal with a general archive file. The impossible ones are
the backups using proprietary unpublished algorithms - I can tell from byte
entropy that they are pretty good though.

Yes, I received the sources for a compiler on such a backup. Convenient for
the client to deliver (ages ago) but a nuisance for me as I had to then
install the "restore" utility to gain access to the contents of that (ahem)
"image".

As new compressors (and archivers) come out, I often find myself scrambling
to find the appropriate tool to gain access. It seems a wasted effort
as I don't see HUGE differences in their capabilities or performance.

Many seem opportunistic. Is there a reason VMDKs couldn't have been
implemented as "OS-neutral" ISOs? Yes, there's always an advantage
cited for the choice -- but, rarely have I seen a "quantified"
explanation rationalizing the cost it imposes on users. Imagine
a more efficient POP4 or IMAP7 -- what cost to avail yourself of
those efficiencies? <frown>

I am unclear as to how I shall address Jan's mention of multimedia files
and the various associated containers. I wasn't prepared for that
possibility... :<

BTW do you have time to run that (now rather large benchmark) through the Intel
C/C++ compiler? I have just got it to compile with Clang for the M1 and am
hoping to run the benchmarks on Friday. My new target Intel CPU to test is
12600K which looks like it could be a real winner.
(ie slam it through the compiler and send me back the error msgs)

I'm overwhelmed, currently. I made a commitment to release six designs by
year end (so folks could slip them into production after the holidays)
but ended up "playing" for a few weeks (fun, but it comes with a cost).
I'm hoping to get the second of the six done this week. Which means it
will be a real squeeze to get all six done, given the presence of the
holiday (despite earlier play time, I'm not keen on working THROUGH
the holiday to meet those commitments)

[It seems like it's been taking about a week to do all the release
engineering chores for each. ~4 weeks left would make it a nail-biter.
Losing one of those to the holidays makes it almost impossible. I've
already seriously curtailed my baking -- normally 50 pounds of flour
runs through the kitchen for the holidays; this year, I doubt I'll use 25!]

I can try sometime after new years (assuming no problems creep up
with my product releases once other folks get their hands on them)
I'll try to make a point to verify the compiler is installed and
running (I've been working from a different workstation, recently,
and haven't had need of the tools on "that" workstation)

I'm sure there will be some since every port of notionally "Portable" software
to a new compiler uncovers coding defects (or compiler defects). Clang for
instance doesn't honour I64 length in printf.

One reason I keep so many compilers on hand is to be able to do
these kinds of "checks". Only, in my case, it's typically to ensure
I'm not relying on features that may be too new for legacy tools
(some of the processors I've had to target have "single source"
toolchains; if you want to run the code on that processor, you'd
better make sure THAT compiler won't choke on it cuz there's
no alternative available!)

It's also handy for verifying the code is free of endian-ness issues,
etc. (I really don't like to commit to a hardware implementation
until just before product release -- so I can see where the most
cost efficient hardware is, or appears to be headed!)
 
On 12/2/2021 8:31 AM, Don Y wrote:

BTW do you have time to run that (now rather large benchmark) through the
Intel C/C++ compiler? I have just got it to compile with Clang for the M1 and
am hoping to run the benchmarks on Friday. My new target Intel CPU to test is
12600K which looks like it could be a real winner.
(ie slam it through the compiler and send me back the error msgs)

I'm overwhelmed, currently. I made a commitment to release six designs by
year end (so folks could slip them into production after the holidays)
but ended up "playing" for a few weeks (fun, but it comes with a cost).
I'm hoping to get the second of the six done this week. Which means it
will be a real squeeze to get all six done, given the presence of the
holiday (despite earlier play time, I'm not keen on working THROUGH
the holiday to meet those commitments)

I can try sometime after new years (assuming no problems creep up
with my product releases once other folks get their hands on them)
I'll try to make a point to verify the compiler is installed and
running (I've been working from a different workstation, recently,
and haven't had need of the tools on "that" workstation)

OK. I was planning on baking last night but got my booster and
made the mistake of falling asleep immediately after. :< As
such, I didn't get a chance to "work the muscle" (which has
always helped me avoid any injection site soreness). The
prospect of kneading 10 pounds of bread dough didn't seem very
appealing....

So, I spent the night clearing the (physical!) crap off that
workstation (with only ~50 sq ft of bench space -- and maybe
two or three of those truly "free" -- anything that is not
actively being used tends to attract clutter!)

Found the compiler. But, I'd apparently uninstalled VS at
some point (likely preparing to install an upgrade in its
place -- I don't trust "overwrite installs").

Reinstalled that (which is a bitch -- 20GB?? -- especially
if doing so "offline".)

Ran a few test cases and *think* it is operational and configured
for \"Intel64\". So, we can try your code, if still interested.

[I likely won't be able to bake anything "of effort" tonight as I'm
still sore. Maybe some Benne Wafers... they're easy! (though I
see SWMBO has left her EMPTY biscotti on the kitchen counter as
a not-so-subtle hint...)]
 
