'\" t
.\" ident "@(#)bzip2.1.sunman 1.7 10/03/16 SMI"
.\"
.\" modified to reference existing Solaris man pages, and to add note
.\" about source availability ([email protected])
.\"
.PU
.TH bzip2 1
.SH NAME
bzip2, bunzip2 \- a block-sorting file compressor, v1.0.5
.br
bzcat \- decompresses files to stdout
.br
bzip2recover \- recovers data from damaged bzip2 files
.SH SYNOPSIS
.ll +8
.B bzip2
.RB [ " \-cdfkqstvzVL123456789 " ]
[
.I "filenames \&..."
]
.ll -8
.br
.B bunzip2
.RB [ " \-fkvsVL " ]
[
.I "filenames \&..."
]
.br
.B bzcat
.RB [ " \-s " ]
[
.I "filenames \&..."
]
.br
.B bzip2recover
.I "filename"
.SH DESCRIPTION
.I bzip2
compresses files using the Burrows-Wheeler block-sorting
text compression algorithm, and Huffman coding. Compression is
generally considerably better than that achieved by more conventional
LZ77/LZ78-based compressors, and approaches the performance of the PPM
family of statistical compressors.

The command-line options are deliberately very similar to
those of
.I GNU gzip,
but they are not identical.

.I bzip2
expects a list of file names to accompany the
command-line flags. Each file is replaced by a compressed version of
itself, with the name "original_name.bz2".
Each compressed file
has the same modification date, permissions, and, when possible,
ownership as the corresponding original, so that these properties can
be correctly restored at decompression time. File name handling is
naive in the sense that there is no mechanism for preserving original
file names, permissions, ownerships or dates in filesystems which lack
these concepts, or have serious file name length restrictions, such as
MS-DOS.

.I bzip2
and
.I bunzip2
do not overwrite existing files by default.
To overwrite them, specify the \-f flag.

If no file names are specified,
.I bzip2
compresses from standard
input to standard output. In this case,
.I bzip2
will decline to
write compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.

.I bunzip2
(or
.I bzip2 \-d)
decompresses all
specified files. Files which were not created by
.I bzip2
will be detected and ignored, and a warning issued.
.I bzip2
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:

       filename.bz2    becomes   filename
       filename.bz     becomes   filename
       filename.tbz2   becomes   filename.tar
       filename.tbz    becomes   filename.tar
       anyothername    becomes   anyothername.out

If the file does not end in one of the recognised endings,
.I .bz2,
.I .bz,
.I .tbz2
or
.I .tbz,
.I bzip2
complains that it cannot
guess the name of the original file, and uses the original name
with
.I .out
appended.
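.PP
The naming rules above can be sketched in a few lines of Python.
This is only an illustration of the table, not the code used by
.I bzip2;
the helper name guess_output_name and the SUFFIX_MAP list are
hypothetical.
.nf
```python
# Sketch of the decompressed-name rules listed above.
# guess_output_name is a hypothetical helper, not part of bzip2.

SUFFIX_MAP = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(name):
    """Return the output name that the rules above would produce."""
    for suffix, replacement in SUFFIX_MAP:
        if name.endswith(suffix):
            return name[: -len(suffix)] + replacement
    # No recognised ending: keep the name and append .out.
    return name + ".out"


for example in ("filename.bz2", "filename.tbz", "anyothername"):
    print(example, "becomes", guess_output_name(example))
```
.fi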
.PP
As with compression, supplying no
filenames causes decompression from
standard input to standard output.

.I bunzip2
will correctly decompress a file which is the
concatenation of two or more compressed files. The result is the
concatenation of the corresponding uncompressed files. Integrity
testing (\-t)
of concatenated
compressed files is also supported.
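.PP
The behaviour with concatenated files can be tried without the
command-line tools.
The sketch below uses the bz2 module from the Python standard library
rather than
.I bunzip2
itself, on the assumption that the module (which is built on the same
compression library) treats concatenated streams the same way.
.nf
```python
import bz2

# Two independently compressed streams, joined byte for byte.
first = bz2.compress(b"alpha ")
second = bz2.compress(b"beta")
joined = first + second

# bz2.decompress accepts multi-stream input (Python 3.3 and later)
# and returns the concatenation of the uncompressed data.
restored = bz2.decompress(joined)
print(restored)
```
.fi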
.PP
You can also compress or decompress files to the standard output by
giving the \-c flag. Multiple files may be compressed and
decompressed like this. The resulting outputs are fed sequentially to
stdout. Compression of multiple files
in this manner generates a stream
containing multiple compressed file representations. Such a stream
can be decompressed correctly only by
.I bzip2
version 0.9.0 or
later. Earlier versions of
.I bzip2
will stop after decompressing
the first file in the stream.

.I bzcat
(or
.I bzip2 \-dc)
decompresses all specified files to
the standard output.

.I bzip2
will read arguments from the environment variables
.I BZIP2
and
.I BZIP,
in that order, and will process them
before any arguments read from the command line. This gives a
convenient way to supply default arguments.
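.PP
The processing order described above can be pictured with a small
Python sketch.
The helper effective_arguments is hypothetical, and splitting the
variable values on whitespace without any shell-style quoting is an
assumption of this sketch, not something this page specifies.
.nf
```python
import os


def effective_arguments(argv, environ=None):
    """Hypothetical helper: order in which arguments would be processed."""
    env = os.environ if environ is None else environ
    args = []
    for variable in ("BZIP2", "BZIP"):
        # Assumption: the value is split on whitespace, with no quoting.
        args.extend(env.get(variable, "").split())
    # Command-line arguments come after everything from the environment.
    return args + list(argv)


print(effective_arguments(["-k", "data.txt"], {"BZIP2": "-9 -v", "BZIP": "-q"}))
```
.fi
.PP
With these example values the environment flags come first and the
command-line arguments last, which is why the variables work as
defaults.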
.PP
Compression is always performed, even if the compressed
file is slightly
larger than the original. Files of less than about one hundred bytes
tend to get larger, since the compression mechanism has a constant
overhead in the region of 50 bytes. Random data (including the output
of most file compressors) is coded at about 8.05 bits per byte, giving
an expansion of around 0.5%.
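.PP
Both effects are easy to observe.
The following sketch uses the Python standard library bz2 module as a
stand-in for
.I bzip2,
with os.urandom output standing in for already-compressed data; the
exact figures it prints vary from run to run.
.nf
```python
import bz2
import os

tiny = b"hello"
noise = os.urandom(100_000)   # stands in for already-compressed data

tiny_out = bz2.compress(tiny)
noise_out = bz2.compress(noise)

# A very short input grows because of the fixed overhead.
print(len(tiny), "->", len(tiny_out))

# Random input cannot be compressed, so it comes out slightly larger.
print("bits per input byte:", 8 * len(noise_out) / len(noise))
```
.fi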
.PP
As a self-check for your protection,
.I bzip2
uses 32-bit CRCs to
make sure that the decompressed version of a file is identical to the
original. This guards against corruption of the compressed data, and
against undetected bugs in
.I bzip2
(hopefully very unlikely). The
chance of data corruption going undetected is microscopic, about one
chance in four billion for each file processed. Be aware, though, that
the check occurs upon decompression, so it can only tell you that
something is wrong. It can't help you
recover the original uncompressed
data. You can use
.I bzip2recover
to try to recover data from
damaged files.
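.PP
As an illustration of the self-check, the sketch below flips a single
bit in the middle of a compressed stream and then tries to decompress
it.
It uses the Python standard library bz2 module rather than
.I bzip2
itself; the helper is_intact is hypothetical, and the random input is
chosen only so that the middle of the stream falls inside coded data.
.nf
```python
import bz2
import os

original = os.urandom(50_000)
clean = bz2.compress(original)

# Flip one bit roughly in the middle of the compressed stream.
damaged = bytearray(clean)
damaged[len(damaged) // 2] ^= 0x01


def is_intact(data):
    """Return True if data decompresses without a reported error."""
    try:
        bz2.decompress(bytes(data))
    except (OSError, ValueError):
        return False
    return True


print("clean stream intact:", is_intact(clean))
print("damaged stream intact:", is_intact(damaged))
```
.fi
.PP
As the text above says, the failure only reports that something is
wrong; it does not recover the data.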
.PP
Return values: 0 for a normal exit, 1 for environmental problems (file
not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
compressed file, 3 for an internal consistency error (e.g., a bug) which
caused
.I bzip2
to panic.
.SH OPTIONS
.TP
.B \-c --stdout
Compress or decompress to standard output.
.TP
.B \-d --decompress
Force decompression.
.I bzip2,
.I bunzip2
and
.I bzcat
are
really the same program, and the decision about what actions to take is
done on the basis of which name is used. This flag overrides that
mechanism, and forces
.I bzip2
to decompress.
.TP
.B \-z --compress
The complement to \-d: forces compression, regardless of the
invocation name.
.TP
.B \-t --test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.TP
.B \-f --force
Force overwrite of output files. Normally,
.I bzip2
will not overwrite
existing output files. Also forces
.I bzip2
to break hard links
to files, which it otherwise wouldn't do.

bzip2 normally declines to decompress files which don't have the
correct magic header bytes. If forced (\-f), however, it will pass
such files through unmodified. This is how GNU gzip behaves.
.TP
.B \-k --keep
Keep (don't delete) input files during compression
or decompression.
.TP
.B \-s --small
Reduce memory usage, for compression, decompression and testing. Files
are decompressed and tested using a modified algorithm which only
requires 2.5 bytes per block byte. This means any file can be
decompressed in 2300k of memory, albeit at about half the normal speed.

During compression, \-s selects a block size of 200k, which limits
memory use to around the same figure, at the expense of your compression
ratio. In short, if your machine is low on memory (8 megabytes or
less), use \-s for everything. See MEMORY MANAGEMENT below.
.TP
.B \-q --quiet
Suppress non-essential warning messages. Messages pertaining to
I/O errors and other critical events will not be suppressed.
.TP
.B \-v --verbose
Verbose mode -- show the compression ratio for each file processed.
Further \-v's increase the verbosity level, spewing out lots of
information which is primarily of interest for diagnostic purposes.
.TP
.B \-L --license -V --version
Display the software version, license terms and conditions.
.TP
.B \-1 (or \-\-fast) to \-9 (or \-\-best)
Set the block size to 100 k, 200 k .. 900 k when compressing. Has no
effect when decompressing. See MEMORY MANAGEMENT below.
The \-\-fast and \-\-best aliases are primarily for GNU gzip
compatibility. In particular, \-\-fast doesn't make things
significantly faster.
And \-\-best merely selects the default behaviour.
.TP
.B \--
Treats all subsequent arguments as file names, even if they start
with a dash. This is so you can handle files with names beginning
with a dash, for example: bzip2 \-- \-myfilename.
.TP
.B \--repetitive-fast --repetitive-best
These flags are redundant in versions 0.9.5 and above. They provided
some coarse control over the behaviour of the sorting algorithm in
earlier versions, which was sometimes useful. 0.9.5 and above have an
improved algorithm which renders these flags irrelevant.
.SH MEMORY MANAGEMENT
.I bzip2
compresses large files in blocks. The block size affects
both the compression ratio achieved, and the amount of memory needed for
compression and decompression. The flags \-1 through \-9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively. At decompression time, the block size used for
compression is read from the header of the compressed file, and
.I bunzip2
then allocates itself just enough memory to decompress
the file. Since block sizes are stored in compressed files, it follows
that the flags \-1 to \-9 are irrelevant to and so ignored
during decompression.
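.PP
The stored block size can be inspected directly.
The sketch below relies on the Python standard library bz2 module,
whose compresslevel argument plays the role of the flags \-1 to \-9,
and on an assumption about the file layout that this page does not
spell out: that a stream begins with the ASCII bytes BZh followed by
one digit giving the block size in units of 100,000 bytes.
.nf
```python
import bz2

data = b"example data " * 1000

for level in range(1, 10):
    packed = bz2.compress(data, compresslevel=level)
    # Assumed layout: three magic bytes, then the block-size digit.
    magic, digit = packed[:3], packed[3:4]
    print(level, magic, digit)
```
.fi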
.PP
Compression and decompression requirements,
in bytes, can be estimated as:

       Compression:   400k + ( 8 x block size )

       Decompression: 100k + ( 4 x block size ), or
                      100k + ( 2.5 x block size )
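.PP
The estimates above translate directly into a small calculation.
In this sketch the helper memory_estimate is hypothetical, and "k" is
taken to mean 1000 bytes, which is an assumption based on the block
sizes being given as 100,000 to 900,000 bytes.
.nf
```python
K = 1000  # the k used in the estimates above, taken here as 1000 bytes


def memory_estimate(level):
    """Estimated bytes needed at block-size flag 1..9 (hypothetical helper)."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    block = level * 100 * K
    return {
        "compress": 400 * K + 8 * block,
        "decompress": 100 * K + 4 * block,
        "decompress_small": 100 * K + (5 * block) // 2,
    }


print(memory_estimate(9))
```
.fi
.PP
For the default block size this gives 7600k for compression, 3700k for
normal decompression and 2350k for decompression with \-s.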
.PP
Larger block sizes give rapidly diminishing marginal returns. Most of
the compression comes from the first two or three hundred k of block
size, a fact worth bearing in mind when using
.I bzip2
on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.

For files compressed with the default 900k block size,
.I bunzip2
will require about 3700 kbytes to decompress. To support decompression
of any file on a 4 megabyte machine,
.I bunzip2
has an option to
decompress using approximately half this amount of memory, about 2300
kbytes. Decompression speed is also halved, so you should use this
option only where necessary. The relevant flag is \-s.

In general, try to use the largest block size memory constraints allow,
since that maximises the compression achieved. Compression and
decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size. The
amount of real memory touched is proportional to the size of the file,
since the file is smaller than a block. For example, compressing a file
20,000 bytes long with the flag \-9 will cause the compressor to
allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
kbytes of it. Similarly, the decompressor will allocate 3700k but only
touch 100k + 20000 * 4 = 180 kbytes.
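.PP
The arithmetic in the example above can be repeated for other file
sizes.
The helper touched_bytes below is hypothetical and again takes "k" to
mean 1000 bytes; it is a sketch of the estimate, not a measurement.
.nf
```python
def touched_bytes(file_size, level=9, k=1000):
    """Hypothetical helper: memory actually touched for one small file."""
    block = level * 100 * k
    # A file smaller than a block fills only part of the block.
    used = min(file_size, block)
    return {
        "compress": 400 * k + 8 * used,
        "decompress": 100 * k + 4 * used,
    }


print(touched_bytes(20_000))
```
.fi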
.PP
Here is a table which summarises the maximum memory usage for different
block sizes. Also recorded is the total compressed size for 14 files of
the Calgary Text Compression Corpus totalling 3,141,622 bytes. This
column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes for
larger files, since the Corpus is dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1200k       500k         350k      914704
     -2      2000k       900k         600k      877703
     -3      2800k      1300k         850k      860338
     -4      3600k      1700k        1100k      846899
     -5      4400k      2100k        1350k      845160
     -6      5200k      2500k        1600k      838626
     -7      6100k      2900k        1850k      834096
     -8      6800k      3300k        2100k      828642
     -9      7600k      3700k        2350k      828642
.SH RECOVERING DATA FROM DAMAGED FILES
.I bzip2
compresses files in blocks, usually 900kbytes long. Each
block is handled independently. If a media or transmission error causes
a multi-block .bz2
file to become damaged, it may be possible to
recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty. Each block also carries its own 32-bit CRC, so
damaged blocks can be distinguished from undamaged ones.
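.PP
Searching for block boundaries can be sketched as a bit-by-bit scan.
The example below is not the code of
.I bzip2recover;
the function find_block_bit_offsets is hypothetical, and the marker
value 0x314159265359 is an assumption about the file format that this
page does not state.
Because blocks are not aligned to byte boundaries, the scan has to
examine every bit position.
.nf
```python
import bz2

BLOCK_MAGIC = 0x314159265359   # assumed 48-bit block marker
MASK_48 = (1 << 48) - 1


def find_block_bit_offsets(data):
    """Return the bit offset at which each block marker starts."""
    offsets = []
    window = 0      # the most recent 48 bits seen
    bit_index = 0   # number of bits consumed so far
    for byte in data:
        for shift in range(7, -1, -1):
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            bit_index += 1
            if bit_index >= 48 and window == BLOCK_MAGIC:
                offsets.append(bit_index - 48)
    return offsets


sample = bz2.compress(b"some text to compress")
print(find_block_bit_offsets(sample))
```
.fi
.PP
A pure Python loop like this is slow on large files; it is meant only
to show the idea.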
.PP
.I bzip2recover
is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into its own .bz2
file. You can then use
.I bzip2
\-t
to test the
integrity of the resulting files, and decompress those which are
undamaged.

.I bzip2recover
takes a single argument, the name of the damaged file,
and writes a number of files "rec00001file.bz2",
"rec00002file.bz2", etc, containing the extracted blocks.
The output filenames are designed so that the use of
wildcards in subsequent processing -- for example,
"bzip2 \-dc rec*file.bz2 > recovered_data" -- processes the files in
the correct order.
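.PP
The reason the names sort correctly is the zero-padded block number.
The sketch below imitates the naming pattern with a hypothetical
helper, recovered_name, and assumes that a shell wildcard expands in
plain lexicographic order.
.nf
```python
def recovered_name(index, original):
    """Hypothetical helper imitating the rec00001file.bz2 pattern."""
    return "rec{:05d}{}".format(index, original)


names = [recovered_name(i, "file.bz2") for i in (12, 2, 1, 100)]

# Zero padding makes lexicographic order agree with numeric order.
for name in sorted(names):
    print(name)
```
.fi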
.PP
.I bzip2recover
is most useful when dealing with large .bz2
files, as these will contain many blocks. It is clearly
futile to use it on damaged single-block files, since a
damaged block cannot be recovered. If you wish to minimise
any potential data loss through media or transmission errors,
you might consider compressing with a smaller
block size.

.SH PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in the
file. Because of this, files containing very long runs of repeated
symbols, like "aabaabaabaab ..." (repeated several hundred times) may
compress more slowly than normal. Versions 0.9.5 and above fare much
better than previous versions in this respect. The ratio between
worst-case and average-case compression time is in the region of 10:1.
For previous versions, this figure was more like 100:1. You can use the
\-vvvv option to monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

.I bzip2
usually allocates several megabytes of memory to operate
in, and then charges all over it in a fairly random fashion. This means
that performance, both for compressing and decompressing, is largely
determined by the speed at which your machine can service cache misses.
Because of this, small changes to the code to reduce the miss rate have
been observed to give disproportionately large performance improvements.
I imagine
.I bzip2
will perform best on machines with very large caches.
.SH CAVEATS
I/O error messages are not as helpful as they could be.
.I bzip2
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0.5 of
.I bzip2.
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2, 1.0.3 and 1.0.4 but with the
following exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files. 0.1pl2 cannot do this; it will stop
after decompressing just the first file in the stream.

.I bzip2recover
versions prior to 1.0.2 used 32-bit integers to represent
bit positions in compressed files, so they could not handle compressed
files more than 512 megabytes long. Versions 1.0.2 and above use
64-bit ints on some platforms which support them (GNU supported
targets, and Windows). To establish whether or not bzip2recover was
built with such a limitation, run it without arguments. In any event
you can build yourself an unlimited version if you can recompile it
with MaybeUInt64 set to be an unsigned 64-bit integer.
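.PP
The 512 megabyte figure follows from counting bit positions rather
than byte positions.
The short calculation below assumes unsigned integers and takes a
megabyte to be 2^20 bytes; both are assumptions of this sketch.
.nf
```python
BITS_PER_BYTE = 8
MEGABYTE = 2 ** 20

# Largest file whose bit positions fit in an unsigned 32-bit integer.
limit_32 = 2 ** 32 // BITS_PER_BYTE
print(limit_32 // MEGABYTE, "megabytes")

# The same calculation for unsigned 64-bit bit positions.
limit_64 = 2 ** 64 // BITS_PER_BYTE
print(limit_64 // MEGABYTE, "megabytes")
```
.fi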
.SH AUTHOR
Julian Seward, jseward@bzip.org.

http://www.bzip.org

The ideas embodied in
.I bzip2
are due to (at least) the following
people: Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder), Peter
Fenwick (for the structured coding model in the original
.I bzip,
and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
(for the arithmetic coder in the original
.I bzip).
I am much
indebted for their help, support and advice. See the manual in the
source distribution for pointers to sources of documentation. Christian
von Roques encouraged me to look for faster sorting algorithms, so as to
speed up compression. Bela Lubkin encouraged me to improve the
worst-case compression performance.
Donna Robinson XMLised the documentation.
The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped
with portability problems, lent machines, gave advice and were generally
helpful.
.SH ATTRIBUTES
See
.BR attributes (5)
for descriptions of the following attributes:
.sp
.TS
box;
cbp-1 | cbp-1
l | l .
ATTRIBUTE TYPE	ATTRIBUTE VALUE
=
Availability	compress/bzip2
=
Interface Stability	Committed
.TE
.PP
.SH NOTES
Source for bzip2 is available on http://opensolaris.org.