'\" t
.\" ident "@(#)bzip2.1.sunman 1.6 08/04/21 SMI"
.\"
.\" modified to reference existing Solaris man pages, and to add note
.\" about source availability ([email protected])
.\"
.PU
.TH bzip2 1
.SH NAME
bzip2, bunzip2 \- a block-sorting file compressor, v1.0.5
.br
bzcat \- decompresses files to stdout
.br
bzip2recover \- recovers data from damaged bzip2 files

.SH SYNOPSIS
.ll +8
.B bzip2
.RB [ " \-cdfkqstvzVL123456789 " ]
[
.I "filenames \&..."
]
.ll -8
.br
.B bunzip2
.RB [ " \-fkvsVL " ]
[
.I "filenames \&..."
]
.br
.B bzcat
.RB [ " \-s " ]
[
.I "filenames \&..."
]
.br
.B bzip2recover
.I "filename"

.SH DESCRIPTION
.I bzip2
compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding. Compression is
generally considerably better than that achieved by more conventional
LZ77/LZ78-based compressors, and approaches the performance of the PPM
family of statistical compressors.

The command-line options are deliberately very similar to
those of
.I GNU gzip,
but they are not identical.

.I bzip2
expects a list of file names to accompany the
command-line flags. Each file is replaced by a compressed version of
itself, with the name "original_name.bz2".
Each compressed file
has the same modification date, permissions, and, when possible,
ownership as the corresponding original, so that these properties can
be correctly restored at decompression time. File name handling is
naive in the sense that there is no mechanism for preserving original
file names, permissions, ownerships or dates in filesystems which lack
these concepts, or have serious file name length restrictions, such as
MS-DOS.

.I bzip2
and
.I bunzip2
will by default not overwrite existing
files. If you want this to happen, specify the \-f flag.

If no file names are specified,
.I bzip2
compresses from standard
input to standard output. In this case,
.I bzip2
will decline to
write compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.

.I bunzip2
(or
.I bzip2 \-d)
decompresses all
specified files. Files which were not created by
.I bzip2
will be detected and ignored, and a warning issued.
.I bzip2
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:

       filename.bz2    becomes   filename
       filename.bz     becomes   filename
       filename.tbz2   becomes   filename.tar
       filename.tbz    becomes   filename.tar
       anyothername    becomes   anyothername.out

If the file does not end in one of the recognised endings,
.I .bz2,
.I .bz,
.I .tbz2
or
.I .tbz,
.I bzip2
complains that it cannot
guess the name of the original file, and uses the original name
with
.I .out
appended.
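.PP
The naming rules above can be sketched in Python. This is an
illustration only, not the code that
.I bzip2
itself uses; the helper name guess_output_name is hypothetical.
.nf
```python
# Recognised endings and what replaces them, as listed in the table above.
SUFFIX_MAP = [
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
    (".bz2", ""),
    (".bz", ""),
]

def guess_output_name(name):
    """Return the decompressed file name implied by the rules above."""
    for suffix, replacement in SUFFIX_MAP:
        if name.endswith(suffix):
            return name[: -len(suffix)] + replacement
    # Unrecognised ending: fall back to appending ".out".
    return name + ".out"

for example in ["notes.bz2", "notes.bz", "src.tbz2", "src.tbz", "data"]:
    print(example, "becomes", guess_output_name(example))
```
.fi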
.sp
As with compression, supplying no
filenames causes decompression from
standard input to standard output.

.I bunzip2
will correctly decompress a file which is the
concatenation of two or more compressed files. The result is the
concatenation of the corresponding uncompressed files. Integrity
testing (\-t)
of concatenated
compressed files is also supported.
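.PP
The concatenation property can be observed with a short Python sketch.
It assumes Python 3.3 or later, whose standard bz2 module accepts
multi-stream input; it exercises that module rather than the
.I bunzip2
program itself, and the sample byte strings are made up.
.nf
```python
import bz2

first = b"first file contents, "
second = b"second file contents"

# Compress each input separately, then join the two compressed streams.
joined = bz2.compress(first) + bz2.compress(second)

# Decompressing the joined stream yields the joined original data.
restored = bz2.decompress(joined)
print(restored == first + second)
```
.fi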
.sp
You can also compress or decompress files to the standard output by
giving the \-c flag. Multiple files may be compressed and
decompressed like this. The resulting outputs are fed sequentially to
stdout. Compression of multiple files
in this manner generates a stream
containing multiple compressed file representations. Such a stream
can be decompressed correctly only by
.I bzip2
version 0.9.0 or
later. Earlier versions of
.I bzip2
will stop after decompressing
the first file in the stream.

.I bzcat
(or
.I bzip2 -dc)
decompresses all specified files to
the standard output.

.I bzip2
will read arguments from the environment variables
.I BZIP2
and
.I BZIP,
in that order, and will process them
before any arguments read from the command line. This gives a
convenient way to supply default arguments.

Compression is always performed, even if the compressed
file is slightly
larger than the original. Files of less than about one hundred bytes
tend to get larger, since the compression mechanism has a constant
overhead in the region of 50 bytes. Random data (including the output
of most file compressors) is coded at about 8.05 bits per byte, giving
an expansion of around 0.5%.
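.PP
The expansion of very small and of random inputs can be checked with
the following Python sketch. It uses the standard bz2 module as a
stand-in for the
.I bzip2
program; the sample sizes and the fixed seed are arbitrary choices,
and the exact byte counts printed may vary between library versions.
.nf
```python
import bz2
import random

# A very small input: the fixed overhead outweighs any saving.
tiny = b"hello"
tiny_packed = bz2.compress(tiny)
print(len(tiny), "->", len(tiny_packed))

# Pseudo-random bytes stand in for incompressible data.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(100000))
noise_packed = bz2.compress(noise)
print(len(noise), "->", len(noise_packed))
print("bits per byte:", 8 * len(noise_packed) / len(noise))
```
.fi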
.sp
As a self-check for your protection,
.I bzip2
uses 32-bit CRCs to
make sure that the decompressed version of a file is identical to the
original. This guards against corruption of the compressed data, and
against undetected bugs in
.I bzip2
(hopefully very unlikely). The
chance of data corruption going undetected is microscopic, about one
chance in four billion for each file processed. Be aware, though, that
the check occurs upon decompression, so it can only tell you that
something is wrong. It can't help you
recover the original uncompressed
data. You can use
.I bzip2recover
to try to recover data from
damaged files.
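.PP
A minimal Python sketch of the self-check, using the standard bz2
module rather than the
.I bzip2
program. It relies on an assumption about the stream layout that is
not stated on this page: that the 4-byte stream header and the 6-byte
block marker are followed directly by the stored block CRC.
.nf
```python
import bz2

original = b"some data worth protecting" * 100
packed = bytearray(bz2.compress(original))

# Assumption: byte 10 is the first byte of the stored block CRC.
# Flip one bit in it so the stored and computed CRCs disagree.
packed[10] ^= 0x01

try:
    bz2.decompress(bytes(packed))
    detected = False
except (OSError, ValueError) as err:
    # The library reports the damage but cannot repair it.
    detected = True
    print("corruption detected:", err)
```
.fi
As the text says, the error only signals that something is wrong; the
original data is not recovered.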
.sp
Return values: 0 for a normal exit, 1 for environmental problems (file
not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
compressed file, 3 for an internal consistency error (e.g., a bug) which
caused
.I bzip2
to panic.

.SH OPTIONS
.TP
.B \-c --stdout
Compress or decompress to standard output.
.TP
.B \-d --decompress
Force decompression.
.I bzip2,
.I bunzip2
and
.I bzcat
are
really the same program, and the decision about what actions to take is
done on the basis of which name is used. This flag overrides that
mechanism, and forces
.I bzip2
to decompress.
.TP
.B \-z --compress
The complement to \-d: forces compression, regardless of the
invocation name.
.TP
.B \-t --test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.TP
.B \-f --force
Force overwrite of output files. Normally,
.I bzip2
will not overwrite
existing output files. Also forces
.I bzip2
to break hard links
to files, which it otherwise wouldn't do.

bzip2 normally declines to decompress files which don't have the
correct magic header bytes. If forced (-f), however, it will pass
such files through unmodified. This is how GNU gzip behaves.
.TP
.B \-k --keep
Keep (don't delete) input files during compression
or decompression.
.TP
.B \-s --small
Reduce memory usage, for compression, decompression and testing. Files
are decompressed and tested using a modified algorithm which only
requires 2.5 bytes per block byte. This means any file can be
decompressed in 2300k of memory, albeit at about half the normal speed.

During compression, \-s selects a block size of 200k, which limits
memory use to around the same figure, at the expense of your compression
ratio. In short, if your machine is low on memory (8 megabytes or
less), use \-s for everything. See MEMORY MANAGEMENT below.
.TP
.B \-q --quiet
Suppress non-essential warning messages. Messages pertaining to
I/O errors and other critical events will not be suppressed.
.TP
.B \-v --verbose
Verbose mode -- show the compression ratio for each file processed.
Further \-v's increase the verbosity level, spewing out lots of
information which is primarily of interest for diagnostic purposes.
.TP
.B \-L --license -V --version
Display the software version, license terms and conditions.
.TP
.B \-1 (or \-\-fast) to \-9 (or \-\-best)
Set the block size to 100 k, 200 k .. 900 k when compressing. Has no
effect when decompressing. See MEMORY MANAGEMENT below.
The \-\-fast and \-\-best aliases are primarily for GNU gzip
compatibility. In particular, \-\-fast doesn't make things
significantly faster.
And \-\-best merely selects the default behaviour.
.TP
.B \--
Treats all subsequent arguments as file names, even if they start
with a dash. This is so you can handle files with names beginning
with a dash, for example: bzip2 \-- \-myfilename.
.TP
.B \--repetitive-fast --repetitive-best
These flags are redundant in versions 0.9.5 and above. They provided
some coarse control over the behaviour of the sorting algorithm in
earlier versions, which was sometimes useful. 0.9.5 and above have an
improved algorithm which renders these flags irrelevant.

.SH MEMORY MANAGEMENT
.I bzip2
compresses large files in blocks. The block size affects
both the compression ratio achieved, and the amount of memory needed for
compression and decompression. The flags \-1 through \-9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively. At decompression time, the block size used for
compression is read from the header of the compressed file, and
.I bunzip2
then allocates itself just enough memory to decompress
the file. Since block sizes are stored in compressed files, it follows
that the flags \-1 to \-9 are irrelevant to and so ignored
during decompression.

Compression and decompression requirements,
in bytes, can be estimated as:

       Compression:   400k + ( 8 x block size )

       Decompression: 100k + ( 4 x block size ), or
                      100k + ( 2.5 x block size )
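.PP
The estimates above can be turned into a small Python calculator. This
is a sketch only: the function names are hypothetical, and it assumes
that "k" means 1000 bytes, which matches the 100,000 to 900,000 byte
block sizes given above.
.nf
```python
K = 1000  # assumption: "k" in the formulas means 1000 bytes

def block_size(level):
    """Block size in bytes selected by the flags -1 to -9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return level * 100 * K

def compress_bytes(level):
    """Estimated memory needed to compress at the given level."""
    return 400 * K + 8 * block_size(level)

def decompress_bytes(level, small=False):
    """Estimated memory needed to decompress; small=True models -s."""
    factor = 2.5 if small else 4
    return int(100 * K + factor * block_size(level))

for level in (1, 5, 9):
    print(level,
          compress_bytes(level) // K,
          decompress_bytes(level) // K,
          decompress_bytes(level, small=True) // K)
```
.fi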
.sp
Larger block sizes give rapidly diminishing marginal returns. Most of
the compression comes from the first two or three hundred k of block
size, a fact worth bearing in mind when using
.I bzip2
on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.

For files compressed with the default 900k block size,
.I bunzip2
will require about 3700 kbytes to decompress. To support decompression
of any file on a 4 megabyte machine,
.I bunzip2
has an option to
decompress using approximately half this amount of memory, about 2300
kbytes. Decompression speed is also halved, so you should use this
option only where necessary. The relevant flag is -s.

In general, try to use the largest block size memory constraints allow,
since that maximises the compression achieved. Compression and
decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size. The
amount of real memory touched is proportional to the size of the file,
since the file is smaller than a block. For example, compressing a file
20,000 bytes long with the flag -9 will cause the compressor to
allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
kbytes of it. Similarly, the decompressor will allocate 3700k but only
touch 100k + 20000 * 4 = 180 kbytes.

Here is a table which summarises the maximum memory usage for different
block sizes. Also recorded is the total compressed size for 14 files of
the Calgary Text Compression Corpus totalling 3,141,622 bytes. This
column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes for
larger files, since the Corpus is dominated by smaller files.
.sp
           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1200k       500k         350k      914704
     -2      2000k       900k         600k      877703
     -3      2800k      1300k         850k      860338
     -4      3600k      1700k        1100k      846899
     -5      4400k      2100k        1350k      845160
     -6      5200k      2500k        1600k      838626
     -7      6100k      2900k        1850k      834096
     -8      6800k      3300k        2100k      828642
     -9      7600k      3700k        2350k      828642

.SH RECOVERING DATA FROM DAMAGED FILES
.I bzip2
compresses files in blocks, usually 900kbytes long. Each
block is handled independently. If a media or transmission error causes
a multi-block .bz2
file to become damaged, it may be possible to
recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty. Each block also carries its own 32-bit CRC, so
damaged blocks can be distinguished from undamaged ones.
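.PP
Searching for block boundaries can be sketched in Python as follows.
The 48-bit value 0x314159265359 is the block marker of the bzip2 file
format; it is not given on this page. The function name
find_block_starts is hypothetical, and the bit-by-bit scan is a simple
illustration, not the method used by
.I bzip2recover.
.nf
```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit block marker of the bzip2 format
MASK = (1 << 48) - 1

def find_block_starts(data):
    """Return the bit offsets at which the block marker begins."""
    window = 0
    starts = []
    for i, byte in enumerate(data):
        for bit in range(7, -1, -1):
            # Shift the next bit into a sliding 48-bit window.
            window = ((window << 1) | ((byte >> bit) & 1)) & MASK
            consumed = i * 8 + (8 - bit)
            if consumed >= 48 and window == BLOCK_MAGIC:
                starts.append(consumed - 48)
    return starts

sample = bz2.compress(b"hello world " * 20)
print(find_block_starts(sample))
```
.fi
Blocks need not start on a byte boundary, which is why the scan moves
one bit at a time.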
.sp
.I bzip2recover
is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into its own .bz2
file. You can then use
.I bzip2
\-t
to test the
integrity of the resulting files, and decompress those which are
undamaged.

.I bzip2recover
takes a single argument, the name of the damaged file,
and writes a number of files "rec00001file.bz2",
"rec00002file.bz2", etc, containing the extracted blocks.
The output filenames are designed so that the use of
wildcards in subsequent processing -- for example,
"bzip2 -dc rec*file.bz2 > recovered_data" -- processes the files in
the correct order.
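.PP
Why zero-padded names keep the blocks in order can be shown with a
short Python sketch. The helper recovered_name is hypothetical and
only mirrors the "rec00001file.bz2" pattern quoted above; the sketch
also assumes that wildcard expansion sorts names lexicographically.
.nf
```python
def recovered_name(index, name):
    """Hypothetical helper mirroring the naming pattern shown above."""
    return "rec{:05d}{}".format(index, name)

padded = [recovered_name(i, "file.bz2") for i in range(1, 13)]
unpadded = ["rec{}file.bz2".format(i) for i in range(1, 13)]

# Zero-padded names sort in block order; unpadded names do not,
# because "rec10file.bz2" sorts before "rec2file.bz2".
print(sorted(padded) == padded)
print(sorted(unpadded) == unpadded)
```
.fi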
.sp
.I bzip2recover
should be of most use dealing with large .bz2
files, as these will contain many blocks. It is clearly
futile to use it on damaged single-block files, since a
damaged block cannot be recovered. If you wish to minimise
any potential data loss through media or transmission errors,
you might consider compressing with a smaller
block size.

.SH PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in the
file. Because of this, files containing very long runs of repeated
symbols, like "aabaabaabaab ..." (repeated several hundred times) may
compress more slowly than normal. Versions 0.9.5 and above fare much
better than previous versions in this respect. The ratio between
worst-case and average-case compression time is in the region of 10:1.
For previous versions, this figure was more like 100:1. You can use the
\-vvvv option to monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

.I bzip2
usually allocates several megabytes of memory to operate
in, and then charges all over it in a fairly random fashion. This means
that performance, both for compressing and decompressing, is largely
determined by the speed at which your machine can service cache misses.
Because of this, small changes to the code to reduce the miss rate have
been observed to give disproportionately large performance improvements.
I imagine
.I bzip2
will perform best on machines with very large caches.

.SH CAVEATS
I/O error messages are not as helpful as they could be.
.I bzip2
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0.5 of
.I bzip2.
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2, 1.0.3 and 1.0.4 but with the
following exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files. 0.1pl2 cannot do this; it will stop
after decompressing just the first file in the stream.

.I bzip2recover
versions prior to 1.0.2 used 32-bit integers to represent
bit positions in compressed files, so they could not handle compressed
files more than 512 megabytes long. Versions 1.0.2 and above use
64-bit ints on some platforms which support them (GNU supported
targets, and Windows). To establish whether or not bzip2recover was
built with such a limitation, run it without arguments. In any event
you can build yourself an unlimited version if you can recompile it
with MaybeUInt64 set to be an unsigned 64-bit integer.

.SH AUTHOR
Julian Seward, jseward@bzip.org.

http://www.bzip.org

The ideas embodied in
.I bzip2
are due to (at least) the following
people: Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder), Peter
Fenwick (for the structured coding model in the original
.I bzip,
and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
(for the arithmetic coder in the original
.I bzip).
I am much
indebted for their help, support and advice. See the manual in the
source distribution for pointers to sources of documentation. Christian
von Roques encouraged me to look for faster sorting algorithms, so as to
speed up compression. Bela Lubkin encouraged me to improve the
worst-case compression performance.
Donna Robinson XMLised the documentation.
The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped
with portability problems, lent machines, gave advice and were generally
helpful.
.SH ATTRIBUTES
See
.BR attributes (5)
for descriptions of the following attributes:
.sp
.TS
box;
cbp-1 | cbp-1
l | l .
ATTRIBUTE TYPE	ATTRIBUTE VALUE
=
Availability	SUNWbzip
=
Interface Stability	Committed
.TE
.PP
.SH NOTES
Source for bzip2 is available on http://opensolaris.org.