Libopensort reference manual
|
for libopensort
0.2.0,
revision 1.
Table of contents
Overview
- What is libopensort
Compiling
- How to compile libopensort
API
- The application program interface
Utilities
Overview - What is libopensort
Libopensort is a general purpose library, which provides to other
programs an easy way to sort fast, data stored in one or multiple
files. Using different sorting algorithms, like the stock qsort
function provided by the standard C library, or implementations
of multithreaded qsort, as well as monolithic and multithreaded MSB
radix sort, gives to the developer the ability to use the most
efficient way to sort the data in
memory. In addition, libopensort gives the option of direct
(unbuffered) I/O, or the zero copy mechanism of the operating system,
like the Linux splice (2) system call, follows implementation
paths related
to
the hardware configuration, in order to push the I/O system to its
limits, and finally exploits the hardware to the full.
Libopensort is free and open-source software, distributed under the
terms of the GNU
GPL version
2 or later.
Compiling
-
How
to
compile
libopensort
Build instructions about libopensort can be found here.
API
- The
application
program interface
Basic concepts
Libopensort recognises two kinds of input files:
FLR:
Binary
fixed
length
record
where
the
record
size
is
known
in
advance.
TEXT:
Delimited
text
(ASCII)
where
the
record
delimiter
is
always
the
standard
EOL
sequence
(usually
LF
or CRLF) of the operating system. The column/field
delimiter character is defined by the user.
Libopensort sorts the data in two phases, the sort phase and the merge
phase. During the sort phase, it creates one thread for reading (the
reader), one for writing (the writer) and one or multiple threads for
data sorting (the sorters). Each sorter receives unsorted data blocks
from the reader and sends sorted data blocks to the writer. For the
communication between the reader, the sorters and the writer,
libopensort uses message queues. By the end of the sort phase, there is
one temporary file which hosts all the records of the input files,
sorted
in blocks.
During the merge phase of FLR
inputs, libopensort creates one thread for prefetching microblocks from
the temporary file (the prefetcher), one for merging the records (the
merger) and one for writing (the writer). The merger fills a binary
tree requesting records from the prefetcher, one for each block,
extracts the records from the tree in an order complied with the key
definitions, and sends them to the writer in blocks. Thus, by the end
of the merge phase, there is one or multiple files which host all the
records of the input files, fully sorted. Again, message queues are
used
for the communication between the prefetecher, the merger and the
writer.
The merge phase of the TEXT
inputs is more simple. There is only one thread for reading and
merging, and no microblock prefetch, but only record fetch. The rest of
the merge mechanism remains the same.
Types
Types
defined in libopensort header files.
Synopsis
Description
Types defined in opensort.h
Details
OPENSORT
Type of object holding information necessary
to
control the
sorting process.
os_error_t
typedef
struct
{
const char *function;
char *opensort_error_str;
char *os_error_str;
char *source_file;
int opensort_errno;
int os_errno;
int error_line;
} os_error_t;
|
Type of object holding information about a runtime error.
function:
opensort_error_str:
os_error_str:
source_file:
opensort_errno:
os_errno:
error_line:
|
The function name
the error occured.
The string describing the error.
The string returned by strerror (3).
The source filename.
A value from opensort_errors.
The errno the operating system returned (may be zero).
The line of the source file the error occured.
|
Macros
Macros defined in libopensort header files.
Synopsis
#include
<opensort.h>
#define OPENSORT_MAJOR
#define OPENSORT_MINOR
#define OPENSORT_PATCH
|
Description
Commonly used macros in programs that use libopensort.
Details
OPENSORT_MAJOR
OPENSORT_MINOR
OPENSORT_PATCH
#define
OPENSORT_MAJOR
0
#define OPENSORT_MINOR 2
#define OPENSORT_PATCH 0
|
Version information about libopensort.
Enumerated
types
Enums defined in libopensort header files.
Synopsis
#include
<opensort.h>
typedef enum
{
OS_DUMMY,
OS_FLR,
OS_TEXT,
OS_STRING,
OS_NUMSTRING,
OS_FLOAT,
OS_DOUBLE,
OS_INT8,
OS_UINT8,
OS_INT16,
OS_UINT16,
OS_INT32,
OS_UINT32,
OS_INT64,
OS_UINT64,
OS_ASCENDING,
OS_DESCENDING,
OS_BUFFERED,
OS_UNBUFFERED,
OS_QSORT,
OS_RADIX,
OS_SORTER_PIPELINE,
OS_SORTER_DETACHED,
OS_SORTER_EARLYFLUSH,
OS_SORTER_MONOBLOCK,
OS_SORTMODE_MONO,
OS_SORTMODE_MT,
OS_PREFETCH_NONE,
OS_PREFETCH_CONSERVATIVE,
OS_PREFETCH_STANDARD,
OS_PREFETCH_AGGRESSIVE,
OS_IOMODE_SINGLE,
OS_IOMODE_MULTI,
OS_DELIM_SPACE,
OS_DELIM_TAB,
OS_DELIM_COLUMN,
OS_DELIM_SEMICOLUMN,
OS_DELIM_PIPE,
OS_DELIM_COMMA
} opensort_flags;
typedef enum
{
OS_ERR_NONE,
OS_ERR_INV_PARAM,
OS_ERR_MEM,
OS_ERR_FILENAME,
OS_ERR_NOFLR,
OS_ERR_NOTEXT,
OS_ERR_VOLSIZE,
OS_ERR_INPUT_FILE,
OS_ERR_OUTPUT_FILE,
OS_ERR_OFFSET,
OS_ERR_TEXTKEY,
OS_ERR_KEYTYPE,
OS_ERR_DELIM,
OS_ERR_KEYORDER,
OS_ERR_STAT,
OS_ERR_READQUEUE,
OS_ERR_WRITEQUEUE,
OS_ERR_CREATEQUEUE,
OS_ERR_READ,
OS_ERR_WRITE,
OS_ERR_DEADTHREAD,
OS_ERR_NOKEYS
} opensort_errors;
|
Description
Libopensort defines two enumerated types: opensort_flags
and opensort_errors.
The
elements
in opensort_flags
are used as
function
arguments and their meaning is explained below:
OS_DUMMY
OS_FLR
OS_TEXT
OS_STRING
OS_NUMSTRING
OS_FLOAT
OS_DOUBLE
OS_INT8
OS_UINT8
OS_INT16
OS_UINT16
OS_INT32
OS_UINT32
OS_INT64
OS_UINT64
OS_ASCENDING
OS_DESCENDING
OS_BUFFERED
OS_UNBUFFERED
OS_SORTER_PIPELINE
OS_SORTER_DETACHED
OS_SORTER_EARLYFLUSH
OS_SORTER_MONOBLOCK
OS_SORTMODE_MONO
OS_SORTMODE_MT
OS_PREFETCH_CONSERVATIVE
OS_PREFETCH_STANDARD
OS_PREFETCH_AGGRESSIVE
OS_IOMODE_SINGLE
OS_IOMODE_MULTI
OS_DELIM_SPACE
OS_DELIM_TAB
OS_DELIM_COLUMN
OS_DELIM_SEMICOLUMN
OS_DELIM_PIPE
OS_DELIM_COMMA |
Dummy argument.
Input of fixed length records.
Input of delimited text lines.
Key type is string.
Key type is numeric string. String sorted according to its numeric
value.
Key type is native float.
Key type is native double.
Key type is native 8 bit signed integer.
Key type is native 8 bit unsigned integer.
Key type is native 16 bit signed integer.
Key type is native 16 bit unsigned integer.
Key type is native 32 bit signed integer.
Key type is native 32 bit unsigned integer.
Key type is native 64 bit signed integer.
Key type is native 64 bit unsigned integer.
Sort order is ascending.
Sort order is descending.
Buffered I/O.
Unbuffered (direct) I/O.
Pipeline sorter.
Detached sorter.
Earlyflush sorter.
Monoblock sorter.
Single threaded sorting algorithm.
Multithreaded sorting algorithm.
Conservative prefetch during merge.
Standard prefetch during merge.
Aggressive prefetch during merge.
Disable parallel I/O.
Enable parallel I/O.
Column delimiter is space character.
Column delimiter is tab character.
Column delimiter is column character.
Column delimiter is semicolumn character.
Column delimiter is pipe character.
Column delimiter is comma character.
|
The elements in opensort_errors
are used as
error codes and returned by the functions of
integer type. Their meaning is explained below:
OS_ERR_NONE
OS_ERR_INV_PARAM
OS_ERR_MEM
OS_ERR_FILENAME
OS_ERR_NOFLR
OS_ERR_NOTEXT
OS_ERR_VOLSIZE
OS_ERR_INPUT_FILE
OS_ERR_OUTPUT_FILE
OS_ERR_OFFSET
OS_ERR_TEXTKEY
OS_ERR_KEYTYPE
OS_ERR_DELIM
OS_ERR_KEYORDER
OS_ERR_STAT
OS_ERR_READQUEUE
OS_ERR_WRITEQUEUE
OS_ERR_CREATEQUEUE
OS_ERR_READ
OS_ERR_WRITE
OS_ERR_DEADTHREAD
OS_ERR_NOKEYS
|
Function
executed
normaly. No
error found.
Invalid argument.
Not enough memory.
File name too long.
FLR key definition for TEXT initialized process.
TEXT key definition for FLR initialized process.
Volume size is negative.
Cannot open file for read.
Cannot open file for write.
Start offset greater than end offset in FLR key definition.
Negative position in TEXT key definition.
Invalid key type.
Invalid delimiter.
Invalid key order.
Cannot stat file.
Cannot receive from message queue.
Cannot send to message queue.
Cannot create message queue.
Error during file read.
Error during file write.
Thread exited too early.
No keys defined.
|
Functions
The functions needed to use libopensort within a C program.
Synopsis
#include
<opensort.h>
|
|
|
|
OPENSORT
*opensort_init
|
(unsigned long
reservedbytes,
int recordtype,
size_t recordsize);
|
int opensort_addinputfile
|
(OPENSORT
* data,
char *filename,
int oflag);
|
int opensort_flraddkey
|
(OPENSORT
* data,
size_t startoff,
size_t endoff,
int keytype,
int order,
int dummy);
|
int opensort_textaddkey
|
(OPENSORT
* data,
int fieldpos,
int keytype,
int order);
|
int opensort_settextfielddelim
|
(OPENSORT
* data,
int delim);
|
int opensort_setoutputfile
|
(OPENSORT
* data,
char *filename,
off_t maxvolumesize,
int oflag);
|
int opensort_settempfile
|
(OPENSORT
* data,
char *filename,
int oflag);
|
void opensort_setnsorters
|
(OPENSORT
* data,
int nsorters);
|
void opensort_selectsorter
|
(OPENSORT
* data,
int hint);
|
void opensort_setprefetchpolicy
|
(OPENSORT
* data,
int policy);
|
void opensort_setiomode
|
(OPENSORT
* data,
int mode);
|
void opensort_setsortmode
|
(OPENSORT
* data,
int mode);
|
void opensort_setiobufsize
|
(OPENSORT
* data,
size_t iosize);
|
int opensort_sortmerge
|
(OPENSORT
* data);
|
void opensort_statistics
|
(OPENSORT
* data);
|
void opensort_destroy
|
(OPENSORT
* data);
|
os_error_t
*opensort_geterror
|
(OPENSORT
*
data);
|
const char *opensort_strerror
|
(int OS_ERR);
|
Description
This section describes a number of functions implemented in libopensort.
Details
opensort_init
()
OPENSORT
* |
opensort_init |
(unsigned long
reservedbytes,
int recordtype,
size_t recordsize); |
Initializes the sort/merge process. This function must be called first
and only once before opensort_destroy
().
reservedbytes:
recordtype:
recordsize:
Returns:
|
The total RAM,
in
bytes,
dedicated for the sort/merge process. Zero means 16 MB, the default.
OS_FLR
or OS_TEXT.
The record size for FLR,
ignored
for TEXT.
An OPENSORT
pointer or NULL on failure.
|
opensort_addinputfile
()
int
|
opensort_addinputfile |
(OPENSORT
* data,
char *filename,
int oflag); |
Adds an input file in the input file list. This function may be used
multiple times if there are more than one input files. All the files
will be processed FIFO.
opensort_setoutputfile ()
int
|
opensort_setoutputfile |
(OPENSORT
* data,
char *filename,
off_t maxvolumesize,
int oflag); |
Sets the output file. If the size of the output file is greater than maxvolumesize, libopensort will
create multiple files (volumes) with size not greater than maxvolumesize. No record will split
between two volumes. Each volume cannot be less than 256*1024 bytes. Maxvolumesize equal to 0 means no
maximum.
data:
filename:
maxvolumesize:
oflag:
Returns:
|
The OPENSORT
pointer returned by opensort_init
().
The full path of the output file.
The maximum size of each volume in bytes, or 0 if there is no maximum.
Values between 1 and 262143 will be increased to 262144 (256K) silently.
OS_BUFFERED
or OS_UNBUFFERED.
OS_ERR_NONE
or an error code
if an error occured.
|
opensort_settempfile ()
int
|
opensort_settempfile |
(OPENSORT
* data,
char *filename,
int oflag); |
Sets the temporary file. If a call to this function is absent,
libopensort will create a temporary file in the standard temporary
directory. The temporary file will be removed by the end of the sort
process.
opensort_flraddkey
()
int
|
opensort_flraddkey |
(OPENSORT
* data,
size_t startoff,
size_t endoff,
int keytype,
int order,
int dummy); |
Specifies a key in a FLR
and adds it
in the key
list. This function may
be used multiple times if there are more than one keys. All the keys
will be processed FIFO.
data:
startoff:
endoff:
keytype:
order:
dummy:
Returns:
|
The OPENSORT pointer
returned by opensort_init
().
The offset the key starts, count from 0.
The offset the key ends, count from 0.
The data type of the declared key. Possible values are OS_STRING, OS_NUMSTRING,
OS_FLOAT, OS_DOUBLE, OS_INT8, OS_UINT8, OS_INT16, OS_UINT16, OS_INT32,
OS_UINT32, OS_INT64, OS_UINT64.
The sort order of the key. Possible values are OS_ASCENDING
or OS_DESCENDING.
A dummy argument. This is always OS_DUMMY.
OS_ERR_NONE
or an error code
if an error occured.
|
opensort_settextfielddelim
()
int
|
opensort_settextfielddelim |
(OPENSORT
* data,
int delim); |
Sets the delimiter character which
defines the fields/columns in a TEXT
line.
opensort_textaddkey
()
int
|
opensort_textaddkey |
(OPENSORT
* data,
int fieldpos,
int keytype,
int order); |
Specifies a key in a TEXT
line and
adds it in the
key list. This function may
be used multiple times if there are more than one keys. All the keys
will be processed FIFO.
opensort_setnsorters
()
void
|
opensort_setnsorters |
(OPENSORT
* data,
int nsorters); |
Sets the number of sorting threads. Any number above 8, will be
truncated to this limit.
data:
nsorters:
Returns:
|
The OPENSORT pointer
returned by opensort_init
().
The number of sorting threads. Maximum is 8.
Nothing.
|
opensort_selectsorter
()
void
|
opensort_selectsorter |
(OPENSORT
* data,
int hint); |
Gives a hint about the sorter to be used. Libopensort may ignore it and
fall back to the default. If a call to this function is absent, the
default sorter will be used. The possible hint values are the following:
OS_SORTER_DETACHED: |
Use the
detached
sorter. This
is the default sorter, able
to cover all the possible key type and order combinations, performing
parallel I/O.
|
OS_SORTER_PIPELINE: |
Use the
pipeline sorter. This is a three stage pipeline: read, sort and
write. This sorter will perform
parallel I/O, optimized for multi disk/array systems, where
the temporary file lays on a different physical media than
the
input and the output file.
|
OS_SORTER_EARLYFLUSH: |
Use the
earlyflush sorter. This sorter is optimized for RAM or CPU
bound hardware and will perform
limited parallel I/O. Performs better in single disk/array systems or
when the temporary file lays on the same physical
media with the
input and the output file.
|
OS_SORTER_MONOBLOCK:
|
Use the
monoblock sorter. With this sorter no parallel I/O is
performed, unless opensort_setnsorters
() has set more than one sorters. Targets to systems with deep
multithreading capabilities, vast amounts of RAM and high storage
throughput.
|
opensort_setprefetchpolicy
()
void
|
opensort_setprefetchpolicy |
(OPENSORT
* data,
int policy); |
Sets the microblock prefetch policy to be used during merge. If a call
to this function is absent, microblock prefetching is disabled.
opensort_setiomode
()
void
|
opensort_setiomode |
(OPENSORT
* data,
int mode); |
Sets the operating mode for single or multi disc/array systems. When
set to OS_IOMODE_SINGLE, read blocks write and vice versa. If a call to
this function is absent, OS_IOMODE_MULTI is used.
opensort_setsortmode
()
void
|
opensort_setsortmode |
(OPENSORT
* data,
int mode); |
Orders the sorter to use a monolithic or a multithreaded implementation
of the sort
function, if this is possible. If a call to this function is absent,
the single threaded implementation is used.
opensort_setiobufsize
()
void
|
opensort_setiobufsize |
(OPENSORT
* data,
size_t iosize); |
Sets the size of the read and write buffers. If a call to
this function is absent, the default is
4096 bytes.
data:
iosize:
Returns:
|
The OPENSORT pointer
returned by opensort_init
().
The size of each I/O buffer in bytes.
Nothing.
|
opensort_sortmerge
()
int
|
opensort_sortmerge |
(OPENSORT
* data); |
Performs the sort/merge process.
opensort_statistics
()
void
|
opensort_statistics |
(OPENSORT
* data); |
Prints on screen useful statistics about the sort/merge process. This
function will be removed in the future and will be replaced by a more
efficient mechanism for statistics collection.
opensort_destroy
()
void
|
opensort_destroy |
(OPENSORT
* data); |
Finalizes the process, releasing all the memory and disc space
used.
opensort_geterror
()
Gives access to the error list. This is useful if opensort_sortmerge
() returned with an error, in any other case, it returns NULL.
data:
Returns:
|
The OPENSORT pointer
returned by opensort_init
().
The next element of the error list, or NULL at end.
|
opensort_strerror
()
const
char
*
|
opensort_geterror
|
(int
OS_ERR);
|
Returns a string describing a value from opensort_errors.
OS_ERR:
Returns:
|
A value from opensort_errors.
A string describing the error number or NULL if OS_ERR is invalid.
|
Examples
Example for binary FLR
sort.
#include
<stdio.h>
#include <opensort.h>
#define MEGABYTE (1024*1024)
#define KILOBYTE 1024
int
main ()
{
OPENSORT
*sort;
int result;
/*
Initialize for 32 MB available RAM and 298 bytes record size */
sort = opensort_init
(32 *
MEGABYTE, OS_FLR,
298);
if (sort == NULL)
{
printf ("Opensort initialization
failed\n");
return 0;
}
/* Add an
input file for direct I/O */
result = opensort_addinputfile
(sort, "input.dat", OS_UNBUFFERED);
if (result != OS_ERR_NONE)
{
printf ("Cannot add input file\n");
goto end;
}
/* Set the
output file for direct I/O. It will split in volumes of 10,000 records
each */
result =
opensort_setoutputfile
(sort, "output.dat", 2980000, OS_UNBUFFERED);
if (result != OS_ERR_NONE)
{
printf ("Cannot set output file\n");
goto end;
}
/* Add the
first key. From offset 37 to 40, ascending character */
result =
opensort_flraddkey
(sort, 37, 40, OS_STRING, OS_ASCENDING, OS_DUMMY);
if (result != OS_ERR_NONE)
{
printf ("Cannot add the first key\n");
goto end;
}
/* Add the
second key. From offset 97 to 111, descending numeric string */
result =
opensort_flraddkey
(sort, 97, 111, OS_NUMSTRING, OS_DESCENDING, OS_DUMMY);
if (result != OS_ERR_NONE)
{
printf ("Cannot add the second key\n");
goto end;
}
/* Set the
prefetch policy and the size of I/O buffers */
opensort_setprefetchpolicy
(sort, OS_PREFETCH_CONSERVATIVE);
opensort_setiobufsize
(sort, 64 * KILOBYTE);
/* Proceed
to the sort/merge process */
result = opensort_sortmerge
(sort);
if (result != OS_ERR_NONE)
printf ("Runtime Error: %s\n", opensort_strerror
(result));
end:
/* Release
the allocated RAM and delete the temporary file(s) */
opensort_destroy
(sort);
return 0;
}
|
Example for TEXT
sort.
#include
<stdio.h>
#include <opensort.h>
#define MEGABYTE (1024*1024)
#define KILOBYTE 1024
int
main ()
{
OPENSORT
*sort;
int result;
/*
Initialize for 32 MB available RAM */
sort = opensort_init
(32 *
MEGABYTE, OS_TEXT,
0);
if (sort == NULL)
{
printf ("Opensort initialization
failed\n");
return 0;
}
/* Add an
input file */
result = opensort_addinputfile
(sort, "input.txt", OS_BUFFERED);
if (result != OS_ERR_NONE)
{
printf ("Cannot add input file\n");
goto end;
}
/* Set the
output file */
result = opensort_setoutputfile
(sort, "output.txt", 0, OS_BUFFERED);
if (result != OS_ERR_NONE)
{
printf ("Cannot set output file\n");
goto end;
}
/* The
delimiter is the pipe character */
result = opensort_settextfielddelim
(sort, OS_DELIM_PIPE);
if (result != OS_ERR_NONE)
{
printf ("Cannot set delimiter\n");
goto end;
}
/* Add the
first key. Column 3, ascending character */
result = opensort_textaddkey
(sort, 3, OS_STRING, OS_ASCENDING);
if (result != OS_ERR_NONE)
{
printf ("Cannot add the first key\n");
goto end;
}
/* Add the
second key. Column 9, descending numeric */
result = opensort_textaddkey
(sort, 9, OS_NUMSTRING, OS_DESCENDING);
if (result != OS_ERR_NONE)
{
printf ("Cannot add the second key\n");
goto end;
}
/* Set the
size of I/O buffers */
opensort_setiobufsize
(sort, 8 * KILOBYTE);
/* Proceed
to the sort/merge process */
result = opensort_sortmerge
(sort);
if (result != OS_ERR_NONE)
{
os_error_t
*err;
printf ("Runtime Error: %s\n", opensort_strerror
(result));
/* Print the
contents of the error list */
err = opensort_geterror
(sort);
while (err != NULL)
{
printf (">Error in function:
%s\n", err->function);
printf (" From source: %s at line:
%d\n", err->source_file, err->error_line);
printf (" Operating system
returned: %s\n", err->os_error_str);
printf (" libopensort returned:
%s\n", err->opensort_error_str);
err = opensort_geterror
(sort);
}
}
end:
/* Release
the allocated RAM and delete the temporary file(s) */
opensort_destroy
(sort);
return 0;
}
|
Utilities
Opensort
Opensort is a command line utility built around libopensort. More
information about opensort can be found here.
|
|