OPENSORT
NAME
opensort - sort text or binary files
SYNOPSIS
opensort [OPTIONS]
DESCRIPTION
Opensort writes sorted output to disk, in one or more files, from all
the input files. It recognises two kinds of input files: binary
fixed-length record (FLR) and delimited text (TEXT). For TEXT files, the
record delimiter is always the standard EOL sequence (LF or CRLF) of the
operating system, and the column/field delimiter is defined by the user.
OPTIONS
-h
Display the help screen.
-i path
Input file path. If there is more than one input file, this option may
be used multiple times. This option is mandatory and must appear at
least once on the command line.
-o path
Output file path. This option is mandatory.
-t path
Temporary file path. If this option is absent, opensort creates a
temporary file in the standard temporary directory. This file is removed
at the end of the program's execution.
-m megabytes
Amount of RAM, in megabytes, that opensort is allowed to use. If this
option is absent, opensort uses 16 megabytes.
-b kilobytes
I/O buffer size in kilobytes. If this option is absent, opensort uses
512 kilobytes for each FLR I/O buffer, or 4 kilobytes for each TEXT I/O
buffer. Values above 65536 are reduced to this limit. Under Linux, if
opensort has been built with the -DSPLICE flag, this option is ignored
for FLR files.
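As a rough illustration of the sizing rules above, the following Python
sketch computes the buffer size that would be in effect for a given -b
value. The function name effective_buffer_kb is hypothetical and is not
taken from the opensort source; the sketch also ignores the -DSPLICE case.

```python
MAX_BUFFER_KB = 65536       # upper limit stated for -b
DEFAULT_FLR_KB = 512        # default for each FLR I/O buffer
DEFAULT_TEXT_KB = 4         # default for each TEXT I/O buffer


def effective_buffer_kb(requested_kb=None, flr=True):
    """Return the I/O buffer size in kilobytes (illustrative helper only)."""
    if requested_kb is None:
        # -b absent: the default depends on the I/O kind.
        return DEFAULT_FLR_KB if flr else DEFAULT_TEXT_KB
    # -b present: values above the limit are reduced to the limit.
    return min(requested_kb, MAX_BUFFER_KB)
```

For example, effective_buffer_kb(100000) yields 65536, while
effective_buffer_kb(flr=False) yields the TEXT default of 4.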
-z bytes
Record size in bytes. This option is mandatory for FLR input, implies
FLR I/O and is mutually exclusive with -delim (see below).
-delim (space, tab, column, semicolumn, pipe or comma)
Specify the delimiter character that separates fields/columns in a TEXT
line. Valid delimiter options are: space, tab, column, semicolumn, pipe
or comma. This option is mandatory for TEXT input, implies TEXT I/O and
is mutually exclusive with -z.
-v bytes
Volume size in bytes. If the size of the output file, in bytes, exceeds
this number, multiple volumes are created. The minimum is 256 kilobytes
(256*1024). If this option is absent, opensort uses the maximum size
allowed by the file system. No record is split across two volumes.
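Because no record is split across volumes, the number of FLR volumes can
be estimated with simple arithmetic. The Python sketch below assumes that
each volume is filled with as many whole records as fit under the -v
limit; that packing rule and the name plan_volumes are assumptions made
for illustration, not documented opensort behaviour.

```python
MIN_VOLUME_BYTES = 256 * 1024  # minimum -v value stated above


def plan_volumes(total_records, record_size, volume_bytes):
    """Return (records_per_volume, volume_count) for FLR output.

    Assumption: every volume holds as many whole records as fit.
    """
    if volume_bytes < MIN_VOLUME_BYTES:
        raise ValueError("volume size is below the 256*1024 byte minimum")
    per_volume = volume_bytes // record_size    # whole records only
    volumes = -(-total_records // per_volume)   # ceiling division
    return per_volume, volumes


print(plan_volumes(100_000, 78, 1_000_000))
```

With 78-byte records and -v 1000000, each volume would hold 12820 records
(999960 bytes), so 100000 records would need 8 volumes under this
assumption.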
-sorters n
Number of sorting threads. Default is 1, maximum is 8.
-directio
Use direct (unbuffered) I/O. This is a hint that opensort may ignore,
falling back to buffered I/O. Under Linux, if opensort has been built
with the -DSPLICE flag, this option is ignored.
-single
Disable parallel I/O. This option is useful when the temporary file
resides on the same disk/array as the input or output files. Requires -z.
-statistics
Display statistics.
-k [+,-]start,end,(t,n,i8,u8,i16,u16,i32,u32,i64,u64,float,double)
Specify a sort key in the FLR. Start is the offset of the first byte of
the key, counting from 0. End is the offset of the last byte of the key,
counting from 0. The optional plus or minus sign in front of the start
offset indicates the order of the key: plus means ascending order and
minus means descending order. If the sign is absent, plus is implied.
The final, mandatory field indicates the data type of the key:
t: ASCII character or string.
n: Numeric string (a string sorted according to its numerical value).
i8, i16, i32, i64: 8/16/32/64 bit native signed integer.
u8, u16, u32, u64: 8/16/32/64 bit native unsigned integer.
float: Native float (usually 32 bit).
double: Native double (usually 64 bit).
This option is mandatory and must appear at least once on the command
line if -z has been used.
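To make the offset convention concrete, the Python sketch below builds
78-byte records with a text key at bytes 0 to 9 and a native 64-bit
signed integer at bytes 31 to 38, then orders them as the keys
-0,9,t and 31,38,i64 would. It assumes that keys are applied in the order
they are given, with the first key as the primary one; that ordering, and
the helper names make_record and sort_records, are assumptions for
illustration rather than a description of opensort internals.

```python
import struct

RECORD_SIZE = 78  # matches -z 78


def make_record(name: bytes, value: int) -> bytes:
    """Build one fixed-length record (hypothetical layout)."""
    rec = bytearray(RECORD_SIZE)
    rec[0:10] = name[:10].ljust(10)          # bytes 0..9: space-padded text
    struct.pack_into("=q", rec, 31, value)   # bytes 31..38: native-order i64
    return bytes(rec)


def int_key(rec: bytes) -> int:
    """Read the i64 key whose first byte is at offset 31."""
    return struct.unpack_from("=q", rec, 31)[0]


def sort_records(records):
    """Order by text key descending, then integer key ascending."""
    # Python's sort is stable, so sort by the secondary key first
    # and by the primary key afterwards.
    by_value = sorted(records, key=int_key)
    return sorted(by_value, key=lambda r: r[0:10], reverse=True)
```

Note that both offsets are inclusive: 0,9 covers 10 bytes and 31,38
covers the 8 bytes of an i64.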
-k [+,-]pos,(t,n)
Specify a sort key in the TEXT line. Pos is the position of the key, as
defined by the delimiter character, counting from 1. The optional plus
or minus sign in front of the position indicates the order of the key:
plus means ascending order and minus means descending order. If the sign
is absent, plus is implied. The final, mandatory field indicates the
data type of the key:
t: ASCII character or string.
n: Numeric string (a string sorted according to its numerical value).
This option is mandatory and must appear at least once on the command
line if -delim has been used.
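The position convention for TEXT keys can be sketched in Python as
follows. The example orders pipe-delimited lines as the keys 7,t and -2,n
would: ascending text in field 7, then descending numeric value in
field 2. It assumes that the first key is the primary one and that
numeric strings can be parsed as floating-point numbers; neither detail
is stated above, and the name sort_lines is hypothetical.

```python
def sort_lines(lines, delim="|"):
    """Sort by field 7 (text, ascending), then field 2 (numeric, descending)."""
    def key(line):
        fields = line.split(delim)
        text_key = fields[7 - 1]             # positions count from 1
        numeric_key = float(fields[2 - 1])   # assumed numeric parsing
        return (text_key, -numeric_key)      # negate for descending order
    return sorted(lines, key=key)


sample = [
    "a|10|c|d|e|f|x",
    "a|9|c|d|e|f|x",
    "a|1|c|d|e|f|w",
]
for line in sort_lines(sample):
    print(line)
```

With type n, the value 10 sorts above 9; with type t the same field would
be compared character by character, so "10" would sort below "9".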
-detached
Use the detached sorter. This is the default sorter. It handles all
combinations of key type and order, and performs parallel I/O.
Requires -z.
-pipeline
Use the pipeline sorter. This option is a hint. Opensort may ignore it
and fall back to the -detached option. It implements a three-stage
pipeline: read, sort and write. This sorter performs parallel I/O and is
optimized for multi-disk/array systems, where the temporary file resides
on a different physical medium than the input and output files.
Requires -z.
-earlyflush
Use the earlyflush sorter. This option is a hint. Opensort may ignore it
and fall back to the -detached option. This sorter is optimized for RAM
or CPU bound hardware and performs limited parallel I/O. It performs
better on single-disk/array systems, or when the temporary file resides
on the same physical medium as the input and output files. Combines well
with the -single option. Requires -z.
-monoblock
Use the monoblock sorter. This option is a hint. Opensort may ignore it
and fall back to the -detached option. With this sorter no parallel I/O
is performed, unless more than one sorter is used. It targets systems
with deep multithreading capabilities, vast amounts of RAM and high
storage throughput. Requires -z.
-conservative
Use the conservative prefetch policy during merge. 20% of the RAM
available to opensort is used as a microblock prefetch heap. Requires -z.
-standard
Use the standard prefetch policy during merge. 33% of the RAM available
to opensort is used as a microblock prefetch heap. Requires -z.
-aggressive
Use the aggressive prefetch policy during merge. Almost 50% of the RAM
available to opensort is used as a microblock prefetch heap. Requires -z.
-mt
Use multithreaded implementations of quicksort or radix sort. This
option is ignored when combined with -earlyflush.
-debug
On runtime error, print additional information for debugging. This
option does not affect the program's speed.
COMMENTS
There is no default prefetch option. If none of the options
-conservative, -standard or -aggressive is present, opensort disables
prefetch completely.
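The effect of the three prefetch policies on memory can be estimated from
the percentages given in OPTIONS. The Python sketch below is only
arithmetic on those figures: it treats "almost 50%" as an upper bound of
50%, and the name prefetch_heap_mb is hypothetical.

```python
# Share of the -m budget used as the microblock prefetch heap, in percent.
PREFETCH_PERCENT = {
    "-conservative": 20,
    "-standard": 33,
    "-aggressive": 50,   # documented as "almost 50%", so an upper bound
}


def prefetch_heap_mb(ram_mb=16, policy=None):
    """Estimate the prefetch heap size in megabytes.

    ram_mb defaults to 16, the documented default for -m.
    policy=None models the case where no prefetch option is given.
    """
    if policy is None:
        return 0.0                       # prefetch is disabled
    return ram_mb * PREFETCH_PERCENT[policy] / 100


for name in (None, "-conservative", "-standard", "-aggressive"):
    print(name, prefetch_heap_mb(256, name))
```

With -m 256, for instance, -standard would set aside roughly 84 megabytes
for prefetch, leaving the remainder for sorting.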
EXAMPLES
The following example creates the sorted, concatenated output file
three.dat from the FLR input files one.dat and two.dat. There are two
sort keys: descending text at offsets 0-9, and an ascending 64-bit signed
integer at offsets 31-38. The record size is 78 bytes:
opensort -i one.dat -i two.dat -o three.dat -k -0,9,t -k 31,38,i64 -z 78
The following example creates the sorted output yourtext.txt, split into
multiple files of at most 1000000 bytes each, from the input file
mytext.txt. There are two sort keys: ascending text in position 7, and a
descending numeric string in position 2. Fields are delimited by the pipe
character:
opensort -i mytext.txt -o yourtext.txt -v 1000000 -k 7,t -k -2,n -delim pipe
AUTHOR
Written by Lucas Tsatiris.
REPORTING BUGS
You may report bugs or request features using the project's bug tracker
at SourceForge:
http://sourceforge.net/tracker/?group_id=295617
Or by email to: opensort.project@gmail.com.
COPYRIGHT
Copyright © 2009, 2010, 2011 Lucas Tsatiris. License GPLv2+: GNU GPL
version 2 or later:
http://www.gnu.org/licenses/gpl-2.0.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Opensort 0.3.0 - April 2011