OPENSORT
NAME
opensort - sort lines of
text files or records of binary files
SYNOPSIS
opensort [OPTIONS]
DESCRIPTION
Sort from standard input to
standard output. Alternatively, disk files may be used as
input and/or output.
Inputs and outputs may be binary fixed length record (FLR)
or delimited text (TEXT), the default. For TEXT
inputs, the record delimiter is always the standard EOL
sequence (LF or CRLF) of the operating system,
and the column/field delimiter is defined by the user.
OPTIONS
-h, -help
Displays the help screen
-i path
Input file path. If
there is more than one input file, this option may
be used multiple times.
Default is standard input. If this option is present,
standard input is ignored.
-o path
Output file path.
Default is standard output. If this option is present,
standard output is ignored.
-t path
Temporary file path. If
this option is absent, opensort will create a
temporary file in the standard
temporary directory. This file will be removed at the
end of the program's execution.
-m megabytes
Amount of RAM, in
megabytes, opensort is allowed to use. If this option
is absent, opensort will use 16
megabytes.
-b kilobytes
I/O buffer size in
kilobytes. If this option is absent, opensort will use
512 kilobytes for each FLR I/O buffer,
or 64 kilobytes for each TEXT I/O buffer. Values above
65536 are clamped to 65536. Under Linux, if
opensort has been built with the -DSPLICE flag, this option
is ignored for FLR files.
-z bytes
Record size in bytes. This
option implies FLR I/O and is mutually exclusive with
-delim (see below).
-delim (space, tab,
column, semicolumn, pipe or comma)
Specify a delimiter
character which separates fields/columns in a TEXT
line. Valid delimiter options are:
space, tab, column, semicolumn, pipe or comma. The
default behaviour is not to use any delimiter at all,
so the whole line is treated as a single field. This
option implies TEXT I/O and is mutually exclusive
with -z.
-v bytes
Volume size in bytes. If
the size of the output file, in bytes, exceeds this
number, multiple volumes will be
created. The minimum is 256 kilobytes
(256*1024). If this option is absent, opensort
will use the
maximum size allowed by the file system. No record
will be split between two volumes. If standard
output is used, this option is ignored.
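The rule that no record is split between volumes can be modelled with a short Python sketch. It is an illustration only, not opensort's implementation: the function name split_into_volumes is hypothetical, and it assumes each volume is filled greedily before the next one is started, which the text above does not state. With 78-byte records and the minimum volume size of 262144 bytes, 3360 records fit in a volume and 64 bytes are left unused.

```python
def split_into_volumes(records, volume_size):
    """Group records into volumes of at most volume_size bytes.

    A record is never split: if it does not fit in the current volume,
    it starts the next one. Greedy filling is an assumption of this sketch.
    """
    volumes, current, used = [], [], 0
    for rec in records:
        if current and used + len(rec) > volume_size:
            volumes.append(current)
            current, used = [], 0
        current.append(rec)
        used += len(rec)
    if current:
        volumes.append(current)
    return volumes

# 7000 hypothetical records of 78 bytes each, minimum volume size 256*1024.
records = [b"x" * 78 for _ in range(7000)]
volumes = split_into_volumes(records, 256 * 1024)
counts = [len(v) for v in volumes]
sizes = [sum(len(r) for r in v) for v in volumes]
print(counts, sizes)
```

Every volume size in the sketch is a multiple of the record size, which is the visible consequence of never splitting a record.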
-sorters n
Number of sorting
threads. Default is 1, maximum is 8. Opensort may
ignore this option and fall back to the default.
-directio
Use direct (unbuffered)
I/O. This is a hint opensort may ignore, and fall back
to buffered I/O. Under Linux, if opensort
has been built with -DSPLICE flag, this option is
ignored.
-single
Disable parallel I/O.
This option is useful when the temporary file resides on
the same disk/array as the input or
output files. Requires -z. If standard input or
standard output is in use, this option is ignored.
-statistics
Display statistics.
-k
[+,-]start,end,(t,tic,n,i8,u8,i16,u16,i32,u32,i64,u64,float,double)
Specify a sort key in
the FLR. Start is the offset of the first byte of the key,
counting from 0. End is the offset of the last byte
of the key, counting from 0. The optional plus or minus
sign in front of the start offset indicates the order
of the key, where
plus means ascending order and minus means descending
order. If the sign is absent, plus is implied. The
last field is mandatory,
indicates the data type of the key, and is
interpreted as follows:
t: ASCII character or string.
tic: Same as t but ignores case.
n: Numeric string (String to sort according to
its numerical value).
i8, i16, i32, i64: 8/16/32/64 bit native signed
integer.
u8, u16, u32, u64: 8/16/32/64 bit native
unsigned integer.
float: Native float (usually 32 bit)
double: Native double (usually 64 bit)
If -z is used, this option is mandatory and must appear
at least once on the command line.
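The offset convention (0-based, with both start and end inclusive) is easy to get wrong by one byte. The Python sketch below models it for a hypothetical 16-byte record layout; flr_key and make_record are illustrative names, not part of opensort, and only the t and i64 types are modelled. Native integers are decoded with the struct module in native byte order.

```python
import struct

RECORD_SIZE = 16  # hypothetical record size, as would be given with -z 16

def flr_key(record, start, end, kind):
    """Extract the key stored in bytes start..end (0-based, both inclusive)."""
    raw = record[start:end + 1]  # +1 because the end offset is inclusive
    if kind == "t":
        return raw                          # compare as raw ASCII bytes
    if kind == "i64":
        return struct.unpack("=q", raw)[0]  # native byte order, signed 64 bit
    raise ValueError("type not modelled in this sketch: " + kind)

def make_record(name, value):
    # Offsets 0-7: space-padded ASCII name. Offsets 8-15: native i64.
    return name.ljust(8).encode("ascii") + struct.pack("=q", value)

records = [make_record("bob", 5), make_record("alice", -3), make_record("bob", 1)]

# Models the key specification: -z 16 -k 0,7,t -k -8,15,i64
ordered = sorted(
    records,
    key=lambda r: (flr_key(r, 0, 7, "t"), -flr_key(r, 8, 15, "i64")),
)
result = [(r[:8].decode("ascii").strip(), flr_key(r, 8, 15, "i64")) for r in ordered]
print(result)
```

A key given as 8,15 therefore covers eight bytes, which is exactly the width required by i64.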
-k [+,-]pos,(t,tic,n)
Specify a sort key in
the TEXT line. Pos is the position of the key, as
defined by the delimiter character, counting from 1.
The optional plus or minus sign in front of the
position indicates the order of the key, where plus
means ascending order
and minus means descending order. If the sign is
absent, plus is implied. The last field is mandatory,
indicates the data type
of the key, and is interpreted as follows:
t: ASCII character or string.
tic: Same as t but ignores case.
n: Numeric string (String to sort according to
its numerical value).
Default is -k 1,t.
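The difference between the t, tic and n key types can be shown with a small Python model. This is a sketch under assumptions: text_key is a hypothetical helper, the delimiter is fixed to the pipe character, and numeric strings are parsed with float(), which may differ from how opensort parses them.

```python
def text_key(line, pos, kind, delim="|"):
    """Return the key held in field pos (counted from 1) of a delimited line."""
    field = line.split(delim)[pos - 1]
    if kind == "t":
        return field           # plain character comparison
    if kind == "tic":
        return field.lower()   # same as t, ignoring case
    if kind == "n":
        return float(field)    # assumption: parse the numeric string as a float
    raise ValueError("unknown key type: " + kind)

lines = ["pear|10", "apple|100", "Apple|9"]

by_text = sorted(lines, key=lambda l: text_key(l, 1, "t"))       # models -k 1,t
by_tic = sorted(lines, key=lambda l: text_key(l, 1, "tic"))      # models -k 1,tic
by_num_desc = sorted(lines, key=lambda l: text_key(l, 2, "n"),   # models -k -2,n
                     reverse=True)
by_num_as_text = sorted(lines, key=lambda l: text_key(l, 2, "t"))  # models -k 2,t

print(by_text)
print(by_tic)
print(by_num_desc)
print(by_num_as_text)
```

Sorting field 2 as text places "100" before "9", which is why the n type exists for numeric strings.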
-detached
Use the detached sorter.
This is the default sorter option, able to cover all
possible key type and order combinations,
performing parallel I/O. Requires -z.
-pipeline
Use the pipeline sorter.
This option is a hint. Opensort may ignore it and
fall back to the -detached option. It implements
a three stage pipeline: read, sort and write. This
sorter will perform parallel I/O, optimized for multi
disk/array systems,
where the temporary file resides on different
physical media than the input and output files. Requires
-z.
-segmented
Use the segmented
sorter. This option is a hint. Opensort may ignore it
and fall back to the -detached option. This sorter is
optimized for RAM or CPU bound hardware and will
perform limited parallel I/O. Performs better in
single disk/array systems
or when the temporary file resides on the same physical
media as the input and output files. Combines well with
the -single
option. Requires -z.
-monoblock
Use the monoblock
sorter. This option is a hint. Opensort may ignore it
and fall back to the -detached option. With this sorter
no parallel I/O is performed, unless more than one
sorter is used. Targets systems with deep
multithreading capabilities,
vast amounts of RAM and high storage throughput.
Requires -z.
-conservative
Use conservative
prefetch policy during merge. 20% of the RAM available
to opensort will be used as a microblock
prefetch heap. Requires -z.
-standard
Use standard prefetch
policy during merge. 33% of the RAM available to
opensort will be used as a microblock
prefetch heap. Requires -z.
-aggressive
Use aggressive prefetch
policy during merge. Almost 50% of the RAM
available to opensort will be used as a microblock
prefetch heap. Requires -z.
-mt
Use multithreaded
implementations of quicksort or radix sort.
-singletmp
Use one temporary file
for all blocks. This is the default for FLR.
-multitmp
Use one temporary file
for each block. This is the default for TEXT.
-stable
Keep duplicates in
order.
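What "keep duplicates in order" means can be seen with Python's built-in sort, which is stable. The sketch only illustrates the property; it says nothing about how opensort implements -stable, and the sample records are hypothetical.

```python
# Each record is (key, label); the labels record the input order.
records = [
    ("smith", "first"),
    ("jones", "second"),
    ("smith", "third"),
    ("jones", "fourth"),
]

# sorted() is stable: records with equal keys keep their input order.
stable = sorted(records, key=lambda r: r[0])
print(stable)
```

Within each key the labels appear in the same relative order as in the input.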
-merge
Do not sort. Only merge
already sorted inputs. This option disables STDIN.
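Merging already sorted inputs is cheaper than sorting because each input only has to be read once, front to back. The Python sketch below models the idea with heapq.merge from the standard library; the three in-memory lists stand in for sorted input files and are hypothetical.

```python
import heapq

# Three inputs, each already sorted in ascending order.
run_a = ["apple", "kiwi", "pear"]
run_b = ["banana", "lemon"]
run_c = ["cherry", "zucchini"]

# heapq.merge repeatedly emits the smallest head element among the inputs,
# so the result is sorted without re-sorting any input.
merged = list(heapq.merge(run_a, run_b, run_c))
print(merged)
```

If an input is not actually sorted, a merge of this kind produces unsorted output without reporting an error, so the inputs should be checked beforehand.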
-debug
On runtime error, print
additional information for debugging. This option does
not affect the program's speed.
COMMENTS
There is no default
prefetch option. If none of the options -conservative,
-standard, -aggressive is present, opensort will disable
prefetch completely.
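As a worked example of the percentages quoted for the three prefetch policies, the Python sketch below computes the approximate size of the prefetch heap for a given -m value. The function name prefetch_heap_mb is hypothetical, and the 0.50 used for -aggressive is an upper bound, since the documented figure is "almost 50%".

```python
# Fraction of the RAM given with -m that each policy reserves for prefetch.
PREFETCH_FRACTION = {
    "conservative": 0.20,
    "standard": 0.33,
    "aggressive": 0.50,  # upper bound: documented as "almost 50%"
}

def prefetch_heap_mb(ram_mb, policy=None):
    """Approximate prefetch heap size in megabytes; 0 when no policy is given."""
    if policy is None:
        return 0.0  # no prefetch option present: prefetch is disabled
    return ram_mb * PREFETCH_FRACTION[policy]

# With the default of 16 megabytes:
for policy in (None, "conservative", "standard", "aggressive"):
    print(policy, prefetch_heap_mb(16, policy))
```

With the default -m value the heap is therefore about 3.2, 5.28 or just under 8 megabytes, and 0 when no policy is selected.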
NOTES
I/O statistics for TEXT are
broken.
EXAMPLES
The following example
creates the sorted concatenated output file three.dat,
using the FLR files one.dat and two.dat as inputs.
There are two sort keys: one at offsets 0-9, descending text,
and another at offsets 31-38, ascending 64-bit signed
integer.
The record size is 78:
opensort -i one.dat -i two.dat -o three.dat -k -0,9,t -k 31,38,i64 -z 78
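To try this example, input files with the right layout are needed. The Python sketch below writes two small FLR files with 78-byte records, a text key at offsets 0-9 and a native 64-bit signed integer at offsets 31-38. The filler bytes, the sample values and the helper names make_record and write_flr are assumptions made for illustration; the files are written to a fresh temporary directory.

```python
import os
import struct
import tempfile

RECORD_SIZE = 78

def make_record(text_key, int_key):
    """Build one 78-byte record matching the key offsets used in the example."""
    text = text_key.encode("ascii")[:10].ljust(10)  # offsets 0-9, space padded
    filler_a = b" " * 21                            # offsets 10-30, unused
    number = struct.pack("=q", int_key)             # offsets 31-38, native i64
    filler_b = b" " * 39                            # offsets 39-77, unused
    return text + filler_a + number + filler_b

def write_flr(path, rows):
    with open(path, "wb") as f:
        for text_key, int_key in rows:
            f.write(make_record(text_key, int_key))

workdir = tempfile.mkdtemp()
one = os.path.join(workdir, "one.dat")
two = os.path.join(workdir, "two.dat")
write_flr(one, [("delta", 7), ("alpha", -1)])
write_flr(two, [("charlie", 3), ("alpha", -9)])

sizes = (os.path.getsize(one), os.path.getsize(two))
print(workdir, sizes)
```

Each file size is a multiple of 78, as expected for a file that will be read with -z 78.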
The following example creates sorted output yourtext.txt,
split in multiple files with maximum size 1000000 bytes
each, using
input file mytext.txt. There are two sort keys: one in
position 7, ascending text, and another in position 2,
descending numeric
string. Fields are delimited by the pipe character.
opensort -i mytext.txt -o yourtext.txt -v 1000000 -k 7,t -k -2,n -delim pipe
AUTHOR
Written by Lucas Tsatiris.
REPORTING BUGS
You may report bugs or
request features using the project's bug tracker at
SourceForge:
http://sourceforge.net/tracker/?group_id=295617
Or by email to: opensort.project@gmail.com.
COPYRIGHT
Copyright © 2009, 2010,
2011 Lucas Tsatiris. License GPLv2+: GNU
GPL version 2 or later:
http://www.gnu.org/licenses/gpl-2.0.html
This is free software: you are free to
change and redistribute it. There is NO WARRANTY,
to the extent permitted by law.
WARNING
This software is under
development and may contain serious bugs. We do not
recommend it for production use.
Opensort 0.5.1 - January 2012