usr/src/cmd/bzip2/bzip2.1.sunman
changeset 0 b34509ac961f
child 11 87960ed158f9
       
     1 '\" t
       
     2 .\" ident	"@(#)bzip2.1.sunman	1.6	08/04/21 SMI"
       
     3 .\"
       
     4 .\" modified to reference existing Solaris man pages, and to add note
       
     5 .\" about source availability ([email protected])
       
     6 .\"
       
     7 .PU
       
     8 .TH bzip2 1
       
     9 .SH NAME
       
    10 bzip2, bunzip2 \- a block-sorting file compressor, v1.0.5
       
    11 .br
       
    12 bzcat \- decompresses files to stdout
       
    13 .br
       
    14 bzip2recover \- recovers data from damaged bzip2 files
       
    15 
       
    16 .SH SYNOPSIS
       
    17 .ll +8
       
    18 .B bzip2
       
    19 .RB [ " \-cdfkqstvzVL123456789 " ]
       
    20 [
       
    21 .I "filenames \&..."
       
    22 ]
       
    23 .ll -8
       
    24 .br
       
    25 .B bunzip2
       
    26 .RB [ " \-fkvsVL " ]
       
    27 [ 
       
    28 .I "filenames \&..."
       
    29 ]
       
    30 .br
       
    31 .B bzcat
       
    32 .RB [ " \-s " ]
       
    33 [ 
       
    34 .I "filenames \&..."
       
    35 ]
       
    36 .br
       
    37 .B bzip2recover
       
    38 .I "filename"
       
    39 
       
    40 .SH DESCRIPTION
       
    41 .I bzip2
       
    42 compresses files using the Burrows-Wheeler block sorting
       
    43 text compression algorithm, and Huffman coding.  Compression is
       
    44 generally considerably better than that achieved by more conventional
       
    45 LZ77/LZ78-based compressors, and approaches the performance of the PPM
       
    46 family of statistical compressors.
       
    47 
       
    48 The command-line options are deliberately very similar to 
       
    49 those of 
       
    50 .I GNU gzip, 
       
    51 but they are not identical.
       
    52 
       
    53 .I bzip2
       
    54 expects a list of file names to accompany the
       
    55 command-line flags.  Each file is replaced by a compressed version of
       
    56 itself, with the name "original_name.bz2".  
       
    57 Each compressed file
       
    58 has the same modification date, permissions, and, when possible,
       
    59 ownership as the corresponding original, so that these properties can
       
    60 be correctly restored at decompression time.  File name handling is
       
    61 naive in the sense that there is no mechanism for preserving original
       
    62 file names, permissions, ownerships or dates in filesystems which lack
       
    63 these concepts, or have serious file name length restrictions, such as
       
    64 MS-DOS.
       
    65 
       
    66 .I bzip2
       
    67 and
       
    68 .I bunzip2
       
    69 will by default not overwrite existing
       
    70 files.  If you want this to happen, specify the \-f flag.
       
    71 
       
    72 If no file names are specified,
       
    73 .I bzip2
       
    74 compresses from standard
       
    75 input to standard output.  In this case,
       
    76 .I bzip2
       
    77 will decline to
       
    78 write compressed output to a terminal, as this would be entirely
       
    79 incomprehensible and therefore pointless.
       
    80 
       
    81 .I bunzip2
       
    82 (or
       
    83 .I bzip2 \-d) 
       
    84 decompresses all
       
    85 specified files.  Files which were not created by 
       
    86 .I bzip2
       
    87 will be detected and ignored, and a warning issued.  
       
    88 .I bzip2
       
    89 attempts to guess the filename for the decompressed file 
       
    90 from that of the compressed file as follows:
       
    91 
       
    92        filename.bz2    becomes   filename
       
    93        filename.bz     becomes   filename
       
    94        filename.tbz2   becomes   filename.tar
       
    95        filename.tbz    becomes   filename.tar
       
    96        anyothername    becomes   anyothername.out
       
    97 
       
    98 If the file does not end in one of the recognised endings, 
       
    99 .I .bz2, 
       
   100 .I .bz, 
       
   101 .I .tbz2
       
   102 or
       
   103 .I .tbz, 
       
   104 .I bzip2 
       
   105 complains that it cannot
       
   106 guess the name of the original file, and uses the original name
       
   107 with
       
   108 .I .out
       
   109 appended.
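
The renaming rules above can be sketched in Python as follows. This is only an illustration: guess_output_name is a hypothetical helper, not part of bzip2, and it assumes a plain, case-sensitive suffix match.

```python
# Hypothetical helper illustrating the suffix rules listed above.
# Each recognised ending maps to the text that replaces it.
SUFFIX_MAP = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(name: str) -> str:
    """Return the guessed name of the decompressed file."""
    for suffix, replacement in SUFFIX_MAP:
        if name.endswith(suffix):
            return name[: -len(suffix)] + replacement
    # No recognised ending: keep the name and append ".out".
    return name + ".out"


if __name__ == "__main__":
    for example in ("filename.bz2", "filename.tbz", "anyothername"):
        print(example, "becomes", guess_output_name(example))
```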
       
   110 
       
   111 As with compression, supplying no
       
   112 filenames causes decompression from 
       
   113 standard input to standard output.
       
   114 
       
   115 .I bunzip2 
       
   116 will correctly decompress a file which is the
       
   117 concatenation of two or more compressed files.  The result is the
       
   118 concatenation of the corresponding uncompressed files.  Integrity
       
   119 testing (\-t) 
       
   120 of concatenated 
       
   121 compressed files is also supported.
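
The concatenation behaviour can be tried out with a short Python sketch. It uses the bz2 module from the Python standard library (which reads and writes the same format) rather than the bunzip2 program itself, and assumes Python 3.3 or later, where multi-stream input is supported.

```python
import bz2

# Compress two pieces of data independently, as if they were two files.
first = bz2.compress(b"hello, ")
second = bz2.compress(b"world\n")

# Concatenate the two compressed streams byte for byte.
combined = first + second

# Decompressing the concatenation yields the concatenation of the originals.
restored = bz2.decompress(combined)

if __name__ == "__main__":
    print(restored)
```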
       
   122 
       
   123 You can also compress or decompress files to the standard output by
       
   124 giving the \-c flag.  Multiple files may be compressed and
       
   125 decompressed like this.  The resulting outputs are fed sequentially to
       
   126 stdout.  Compression of multiple files 
       
   127 in this manner generates a stream
       
   128 containing multiple compressed file representations.  Such a stream
       
   129 can be decompressed correctly only by
       
   130 .I bzip2 
       
   131 version 0.9.0 or
       
   132 later.  Earlier versions of
       
   133 .I bzip2
       
   134 will stop after decompressing
       
   135 the first file in the stream.
       
   136 
       
   137 .I bzcat
       
   138 (or
       
    139 .I bzip2 \-dc) 
       
   140 decompresses all specified files to
       
   141 the standard output.
       
   142 
       
   143 .I bzip2
       
   144 will read arguments from the environment variables
       
   145 .I BZIP2
       
   146 and
       
   147 .I BZIP,
       
   148 in that order, and will process them
       
   149 before any arguments read from the command line.  This gives a 
       
   150 convenient way to supply default arguments.
       
   151 
       
   152 Compression is always performed, even if the compressed 
       
   153 file is slightly
       
   154 larger than the original.  Files of less than about one hundred bytes
       
   155 tend to get larger, since the compression mechanism has a constant
       
   156 overhead in the region of 50 bytes.  Random data (including the output
       
   157 of most file compressors) is coded at about 8.05 bits per byte, giving
       
   158 an expansion of around 0.5%.
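
The constant overhead on very small inputs can be observed with a minimal Python sketch. It uses the bz2 module from the Python standard library, not the bzip2 program, so the exact byte counts are not asserted here and may differ slightly from those of the command-line tool.

```python
import bz2

# A five-byte input is far smaller than the fixed overhead of the format.
tiny = b"hello"
packed = bz2.compress(tiny)

# Overhead is the number of bytes added by compression.
overhead = len(packed) - len(tiny)

if __name__ == "__main__":
    print("original:", len(tiny), "bytes")
    print("compressed:", len(packed), "bytes")
    print("overhead:", overhead, "bytes")
```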
       
   159 
       
   160 As a self-check for your protection, 
       
   161 .I 
       
   162 bzip2
       
   163 uses 32-bit CRCs to
       
   164 make sure that the decompressed version of a file is identical to the
       
   165 original.  This guards against corruption of the compressed data, and
       
   166 against undetected bugs in
       
   167 .I bzip2
       
   168 (hopefully very unlikely).  The
       
    169 chance of data corruption going undetected is microscopic, about one
       
   170 chance in four billion for each file processed.  Be aware, though, that
       
   171 the check occurs upon decompression, so it can only tell you that
       
   172 something is wrong.  It can't help you 
       
   173 recover the original uncompressed
       
   174 data.  You can use 
       
   175 .I bzip2recover
       
   176 to try to recover data from
       
   177 damaged files.
       
   178 
       
   179 Return values: 0 for a normal exit, 1 for environmental problems (file
       
    180 not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
       
    181 compressed file, 3 for an internal consistency error (e.g., a bug) which
       
   182 caused
       
   183 .I bzip2
       
   184 to panic.
       
   185 
       
   186 .SH OPTIONS
       
   187 .TP
       
   188 .B \-c --stdout
       
   189 Compress or decompress to standard output.
       
   190 .TP
       
   191 .B \-d --decompress
       
   192 Force decompression.  
       
   193 .I bzip2, 
       
   194 .I bunzip2 
       
   195 and
       
   196 .I bzcat 
       
   197 are
       
   198 really the same program, and the decision about what actions to take is
       
    199 made on the basis of which name is used.  This flag overrides that
       
   200 mechanism, and forces 
       
   201 .I bzip2
       
   202 to decompress.
       
   203 .TP
       
   204 .B \-z --compress
       
   205 The complement to \-d: forces compression, regardless of the
       
   206 invocation name.
       
   207 .TP
       
   208 .B \-t --test
       
   209 Check integrity of the specified file(s), but don't decompress them.
       
   210 This really performs a trial decompression and throws away the result.
       
   211 .TP
       
   212 .B \-f --force
       
   213 Force overwrite of output files.  Normally,
       
   214 .I bzip2 
       
   215 will not overwrite
       
   216 existing output files.  Also forces 
       
   217 .I bzip2 
       
   218 to break hard links
       
   219 to files, which it otherwise wouldn't do.
       
   220 
       
   221 bzip2 normally declines to decompress files which don't have the
       
    222 correct magic header bytes.  If forced (\-f), however, it will pass
       
   223 such files through unmodified.  This is how GNU gzip behaves.
       
   224 .TP
       
   225 .B \-k --keep
       
   226 Keep (don't delete) input files during compression
       
   227 or decompression.
       
   228 .TP
       
   229 .B \-s --small
       
   230 Reduce memory usage, for compression, decompression and testing.  Files
       
   231 are decompressed and tested using a modified algorithm which only
       
   232 requires 2.5 bytes per block byte.  This means any file can be
       
   233 decompressed in 2300k of memory, albeit at about half the normal speed.
       
   234 
       
   235 During compression, \-s selects a block size of 200k, which limits
       
   236 memory use to around the same figure, at the expense of your compression
       
   237 ratio.  In short, if your machine is low on memory (8 megabytes or
       
   238 less), use \-s for everything.  See MEMORY MANAGEMENT below.
       
   239 .TP
       
   240 .B \-q --quiet
       
   241 Suppress non-essential warning messages.  Messages pertaining to
       
   242 I/O errors and other critical events will not be suppressed.
       
   243 .TP
       
   244 .B \-v --verbose
       
   245 Verbose mode -- show the compression ratio for each file processed.
       
   246 Further \-v's increase the verbosity level, spewing out lots of
       
   247 information which is primarily of interest for diagnostic purposes.
       
   248 .TP
       
   249 .B \-L --license -V --version
       
   250 Display the software version, license terms and conditions.
       
   251 .TP
       
   252 .B \-1 (or \-\-fast) to \-9 (or \-\-best)
       
    253 Set the block size to 100 k, 200 k ... 900 k when compressing.  Has no
       
   254 effect when decompressing.  See MEMORY MANAGEMENT below.
       
   255 The \-\-fast and \-\-best aliases are primarily for GNU gzip 
       
   256 compatibility.  In particular, \-\-fast doesn't make things
       
   257 significantly faster.  
       
   258 And \-\-best merely selects the default behaviour.
       
   259 .TP
       
   260 .B \--
       
   261 Treats all subsequent arguments as file names, even if they start
       
   262 with a dash.  This is so you can handle files with names beginning
       
   263 with a dash, for example: bzip2 \-- \-myfilename.
       
   264 .TP
       
   265 .B \--repetitive-fast --repetitive-best
       
   266 These flags are redundant in versions 0.9.5 and above.  They provided
       
   267 some coarse control over the behaviour of the sorting algorithm in
       
   268 earlier versions, which was sometimes useful.  0.9.5 and above have an
       
   269 improved algorithm which renders these flags irrelevant.
       
   270 
       
   271 .SH MEMORY MANAGEMENT
       
   272 .I bzip2 
       
   273 compresses large files in blocks.  The block size affects
       
   274 both the compression ratio achieved, and the amount of memory needed for
       
   275 compression and decompression.  The flags \-1 through \-9
       
   276 specify the block size to be 100,000 bytes through 900,000 bytes (the
       
   277 default) respectively.  At decompression time, the block size used for
       
   278 compression is read from the header of the compressed file, and
       
   279 .I bunzip2
       
   280 then allocates itself just enough memory to decompress
       
   281 the file.  Since block sizes are stored in compressed files, it follows
       
   282 that the flags \-1 to \-9 are irrelevant to and so ignored
       
   283 during decompression.
       
   284 
       
    285 Compression and decompression memory requirements, 
       
   286 in bytes, can be estimated as:
       
   287 
       
   288        Compression:   400k + ( 8 x block size )
       
   289 
       
   290        Decompression: 100k + ( 4 x block size ), or
       
    291                       100k + ( 2.5 x block size ) with \-s
       
   292 
       
   293 Larger block sizes give rapidly diminishing marginal returns.  Most of
       
   294 the compression comes from the first two or three hundred k of block
       
   295 size, a fact worth bearing in mind when using
       
   296 .I bzip2
       
   297 on small machines.
       
   298 It is also important to appreciate that the decompression memory
       
   299 requirement is set at compression time by the choice of block size.
       
   300 
       
   301 For files compressed with the default 900k block size,
       
   302 .I bunzip2
       
   303 will require about 3700 kbytes to decompress.  To support decompression
       
   304 of any file on a 4 megabyte machine, 
       
   305 .I bunzip2
       
   306 has an option to
       
   307 decompress using approximately half this amount of memory, about 2300
       
   308 kbytes.  Decompression speed is also halved, so you should use this
       
    309 option only where necessary.  The relevant flag is \-s.
       
   310 
       
    311 In general, try to use the largest block size memory constraints allow,
       
   312 since that maximises the compression achieved.  Compression and
       
   313 decompression speed are virtually unaffected by block size.
       
   314 
       
   315 Another significant point applies to files which fit in a single block
       
   316 -- that means most files you'd encounter using a large block size.  The
       
   317 amount of real memory touched is proportional to the size of the file,
       
   318 since the file is smaller than a block.  For example, compressing a file
       
    319 20,000 bytes long with the flag \-9 will cause the compressor to
       
   320 allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
       
   321 kbytes of it.  Similarly, the decompressor will allocate 3700k but only
       
   322 touch 100k + 20000 * 4 = 180 kbytes.
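
The estimates above can be written as a small Python sketch. The function names are hypothetical, and the sketch assumes that "k" means 1000 bytes, which matches the 100,000 to 900,000 byte block sizes and the 560 kbyte and 180 kbyte figures quoted in the text.

```python
K = 1000  # assumption: "k" in this page means 1000 bytes


def block_size(level: int) -> int:
    """Block size in bytes selected by the flags -1 through -9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return level * 100 * K


def compress_memory(level: int) -> int:
    """Estimated bytes allocated when compressing: 400k + 8 x block size."""
    return 400 * K + 8 * block_size(level)


def decompress_memory(level: int, small: bool = False) -> int:
    """Estimated bytes allocated when decompressing (small=True models -s)."""
    factor = 2.5 if small else 4
    return int(100 * K + factor * block_size(level))


def touched_compressing(file_size: int, level: int) -> int:
    """Bytes actually touched when the file may be smaller than one block."""
    return 400 * K + 8 * min(file_size, block_size(level))


def touched_decompressing(file_size: int, level: int) -> int:
    """Bytes actually touched by the normal (non -s) decompressor."""
    return 100 * K + 4 * min(file_size, block_size(level))


if __name__ == "__main__":
    print(compress_memory(9), decompress_memory(9), decompress_memory(9, small=True))
    print(touched_compressing(20000, 9), touched_decompressing(20000, 9))
```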
       
   323 
       
   324 Here is a table which summarises the maximum memory usage for different
       
   325 block sizes.  Also recorded is the total compressed size for 14 files of
       
   326 the Calgary Text Compression Corpus totalling 3,141,622 bytes.  This
       
   327 column gives some feel for how compression varies with block size.
       
   328 These figures tend to understate the advantage of larger block sizes for
       
   329 larger files, since the Corpus is dominated by smaller files.
       
   330 
       
   331            Compress   Decompress   Decompress   Corpus
       
   332     Flag     usage      usage       -s usage     Size
       
   333 
       
   334      -1      1200k       500k         350k      914704
       
   335      -2      2000k       900k         600k      877703
       
   336      -3      2800k      1300k         850k      860338
       
   337      -4      3600k      1700k        1100k      846899
       
   338      -5      4400k      2100k        1350k      845160
       
   339      -6      5200k      2500k        1600k      838626
       
    340      -7      6000k      2900k        1850k      834096
       
   341      -8      6800k      3300k        2100k      828642
       
   342      -9      7600k      3700k        2350k      828642
       
   343 
       
   344 .SH RECOVERING DATA FROM DAMAGED FILES
       
   345 .I bzip2
       
    346 compresses files in blocks, usually 900 kbytes long.  Each
       
   347 block is handled independently.  If a media or transmission error causes
       
   348 a multi-block .bz2
       
   349 file to become damaged, it may be possible to
       
   350 recover data from the undamaged blocks in the file.
       
   351 
       
   352 The compressed representation of each block is delimited by a 48-bit
       
   353 pattern, which makes it possible to find the block boundaries with
       
   354 reasonable certainty.  Each block also carries its own 32-bit CRC, so
       
   355 damaged blocks can be distinguished from undamaged ones.
       
   356 
       
   357 .I bzip2recover
       
   358 is a simple program whose purpose is to search for
       
   359 blocks in .bz2 files, and write each block out into its own .bz2 
       
   360 file.  You can then use
       
   361 .I bzip2 
       
   362 \-t
       
   363 to test the
       
   364 integrity of the resulting files, and decompress those which are
       
   365 undamaged.
       
   366 
       
   367 .I bzip2recover
       
   368 takes a single argument, the name of the damaged file, 
       
   369 and writes a number of files "rec00001file.bz2",
       
    370 "rec00002file.bz2", etc., containing the extracted blocks.
       
    371 The output filenames are designed so that the use of
       
    372 wildcards in subsequent processing -- for example,
       
    373 "bzip2 \-dc rec*file.bz2 > recovered_data" -- processes the files in
       
   374 the correct order.
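
Why the wildcard yields the right order can be illustrated with a short Python sketch. It assumes, based only on the example names above, that the block number is zero-padded to five digits; under that assumption a plain lexicographic sort (which is how a shell typically expands a wildcard) matches numeric order.

```python
# Build names in the style shown above for a few block numbers,
# deliberately listed out of order.
names = [f"rec{number:05d}file.bz2" for number in (10, 2, 1, 100)]

# Lexicographic sorting matches numeric order because of the zero padding.
ordered = sorted(names)

if __name__ == "__main__":
    for name in ordered:
        print(name)
```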
       
   375 
       
   376 .I bzip2recover
       
    377 should be of most use dealing with large .bz2
       
    378 files, as these will contain many blocks.  It is clearly
       
    379 futile to use it on damaged single-block files, since a
       
    380 damaged block cannot be recovered.  If you wish to minimise
       
    381 any potential data loss through media or transmission errors,
       
   382 you might consider compressing with a smaller
       
   383 block size.
       
   384 
       
   385 .SH PERFORMANCE NOTES
       
   386 The sorting phase of compression gathers together similar strings in the
       
   387 file.  Because of this, files containing very long runs of repeated
       
   388 symbols, like "aabaabaabaab ..."  (repeated several hundred times) may
       
   389 compress more slowly than normal.  Versions 0.9.5 and above fare much
       
   390 better than previous versions in this respect.  The ratio between
       
   391 worst-case and average-case compression time is in the region of 10:1.
       
   392 For previous versions, this figure was more like 100:1.  You can use the
       
   393 \-vvvv option to monitor progress in great detail, if you want.
       
   394 
       
   395 Decompression speed is unaffected by these phenomena.
       
   396 
       
   397 .I bzip2
       
   398 usually allocates several megabytes of memory to operate
       
   399 in, and then charges all over it in a fairly random fashion.  This means
       
   400 that performance, both for compressing and decompressing, is largely
       
   401 determined by the speed at which your machine can service cache misses.
       
   402 Because of this, small changes to the code to reduce the miss rate have
       
   403 been observed to give disproportionately large performance improvements.
       
   404 I imagine 
       
   405 .I bzip2
       
   406 will perform best on machines with very large caches.
       
   407 
       
   408 .SH CAVEATS
       
   409 I/O error messages are not as helpful as they could be.
       
   410 .I bzip2
       
   411 tries hard to detect I/O errors and exit cleanly, but the details of
       
   412 what the problem is sometimes seem rather misleading.
       
   413 
       
   414 This manual page pertains to version 1.0.5 of
       
   415 .I bzip2.  
       
   416 Compressed data created by this version is entirely forwards and
       
   417 backwards compatible with the previous public releases, versions
       
    418 0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2, 1.0.3 and 1.0.4, but with the
       
   419 following exception: 0.9.0 and above can correctly decompress multiple
       
   420 concatenated compressed files.  0.1pl2 cannot do this; it will stop
       
   421 after decompressing just the first file in the stream.
       
   422 
       
   423 .I bzip2recover
       
   424 versions prior to 1.0.2 used 32-bit integers to represent
       
   425 bit positions in compressed files, so they could not handle compressed
       
   426 files more than 512 megabytes long.  Versions 1.0.2 and above use
       
   427 64-bit ints on some platforms which support them (GNU supported
       
   428 targets, and Windows).  To establish whether or not bzip2recover was
       
   429 built with such a limitation, run it without arguments.  In any event
       
   430 you can build yourself an unlimited version if you can recompile it
       
   431 with MaybeUInt64 set to be an unsigned 64-bit integer.
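
The 512 megabyte figure follows from the width of the bit-position counter, as this short Python sketch shows. It assumes an unsigned 32-bit counter and takes a megabyte to be 2**20 bytes.

```python
# An unsigned 32-bit integer can address this many distinct bit positions.
max_bits = 2 ** 32

# Eight bits per byte gives the largest file size that can be indexed.
max_bytes = max_bits // 8

# Express the limit in megabytes of 2**20 bytes each.
max_megabytes = max_bytes // (1024 * 1024)

if __name__ == "__main__":
    print(max_bytes, "bytes, or", max_megabytes, "megabytes")
```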
       
   432 
       
   433 
       
   434 
       
   435 .SH AUTHOR
       
    436 Julian Seward, jseward@bzip.org.
       
   437 
       
   438 http://www.bzip.org
       
   439 
       
   440 The ideas embodied in
       
   441 .I bzip2
       
   442 are due to (at least) the following
       
   443 people: Michael Burrows and David Wheeler (for the block sorting
       
   444 transformation), David Wheeler (again, for the Huffman coder), Peter
       
   445 Fenwick (for the structured coding model in the original
       
   446 .I bzip,
       
   447 and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
       
   448 (for the arithmetic coder in the original
       
   449 .I bzip).  
       
   450 I am much
       
   451 indebted for their help, support and advice.  See the manual in the
       
   452 source distribution for pointers to sources of documentation.  Christian
       
   453 von Roques encouraged me to look for faster sorting algorithms, so as to
       
   454 speed up compression.  Bela Lubkin encouraged me to improve the
       
   455 worst-case compression performance.  
       
   456 Donna Robinson XMLised the documentation.
       
   457 The bz* scripts are derived from those of GNU gzip.
       
   458 Many people sent patches, helped
       
   459 with portability problems, lent machines, gave advice and were generally
       
   460 helpful.
       
   461 .SH ATTRIBUTES
       
   462 See
       
   463 .BR attributes (5)
       
   464 for descriptions of the following attributes:
       
   465 .sp
       
   466 .TS
       
   467 box;
       
   468 cbp-1 | cbp-1
       
   469 l | l .
       
   470 ATTRIBUTE TYPE	ATTRIBUTE VALUE
       
   471 =
       
   472 Availability	SUNWbzip
       
   473 =
       
   474 Interface Stability	Committed
       
   475 .TE 
       
   476 .PP
       
   477 .SH NOTES
       
   478 Source for bzip2 is available on http://opensolaris.org.