Book HomeManaging and Using MySQLSearch this book

Chapter 14. Extending MySQL

Contents:

User-Defined Functions
Alternative Character Sets

Within the spectrum of relational database management systems, MySQL has always occupied a unique position. Its open source nature and fast performance have made it the most popular database server for small-to-medium web-based applications. However, its lack of support for features such as transactions, foreign keys, row-level locking, and stored procedures have made many pass over it for other applications, though its performance would otherwise be beneficial.

As MySQL has matured over time, various solutions have emerged for all these problems. Through alternative table types such as Berkeley DB and InnoDB, MySQL now supports transactions and row-level locking. Thanks to the InnoDB table type, the MySQL team has added support for foreign keys.

One of MySQL's virtues is that it is intended to be extended. For example, stored procedures remain the most requested Oracle/SQL Server/Sybase feature lacking in MySQL. Though MySQL currently has no support for stored procedures (it probably will by the time you read this book), you can extend MySQL by writing your own user-defined functions (UDFs). In this chapter, we examine how to write your own functions and add character sets to MySQL.

14.1. User-Defined Functions

UDFs are simply native programming functions. MySQL links to the libraries containing them and executes them as SQL functions in the same way you have used the PASSWORD( ) and NOW( ) functions. MySQL gives these functions access to the internals of MySQL and empowers them to manipulate data.

You will generally write UDFs in C. The examples in this chapter are all in C. However, because MySQL accesses your functions through shared libraries, you can write them in any language that you can compile into a native-code shared library. In other words, you can write a UDF in C, C++, Objective C, C#, Java, VisualBasic, or even Pascal.

MySQL supports two types of UDFs: standard and aggregate. MySQL applies standard functions to each row in the result set. PASSWORD( ) and DAYNAME( ) are examples of two common standard functions built into MySQL. If you execute the SQL in the mysql database:

SELECT UPPER(User) FROM user;

you will see the name of each user in your database in all capital letters.

An aggregate function operates on groups of rows. You generally execute aggregate functions in conjunction with the SQL GROUP BY clause. This clause causes MySQL to group result sets into sections, in which each section is a single row. An aggregate function operates on all of the values of each group. AVG( ) and MAX( ) are examples of two common aggregate functions built into MySQL:

SELECT AVG(age) FROM People GROUP BY city;

The result of this query is a single column containing the average of all age values for people who live in the same city.

You create aggregate and standard functions in much the same manner—aggregate functions simply have two extra routines.

14.1.1. Standard Functions

Create a standard UDF in three steps, two of which are optional:

init
When an SQL query calls a UDF, the init routine for that function is called first, if it exists. This routine is responsible for allocating data and setting certain parameters for the main routine.

main
After the init routine is called, the main routine for the desired UDF is called. This is the only routine required to exist to define a UDF. This routine is responsible for processing the data and returning the result of the function.

deinit
Once the main routine is finished and the result is returned to the user, MySQL will call a deinit routine for the function, if it exists. The purpose of this routine is to deallocate any memory allocated by the init routine and clean up as necessary.

When an SQL query calls a standard function, these routines are called in the following manner:

init
   main
   main
   main
   ...
deinit

MySQL calls the main routine once for each row returned by the SQL query. The return values of the main routine comprise the results of the UDF as a whole.

When creating a UDF, all three routines must reside in the same library. You can, however, bundle multiple functions in the same library.

TIP: We use the terms "routine" and "function" in a specific manner in this section to reduce the possibility of confusion. The problem is that multiple native functions make up a single UDF. When we refer to a routine, we mean one of the native functions that make up the UDF. When we refer to a function, we mean the UDF itself.

14.1.1.1. The init routine

The init routine is the first routine called when a user attempts to execute a UDF within a SQL query. For a function with the name remove_html( ), the init routine must have the name remove_html_init( ). The init function must have the following signature:

my_bool func_name_init(UDF_INIT *initid, UDF_ARGS *args, char *message);

The three parameters passed to the init routine are all pointers to modifiable data that will be made available to the main function.

initid
This parameter is the main data structure used by the UDF engine. This structure will be made available to all three routines for a particular UDF. Any changes that need to be made to the default values of this structure should be made here in the init routine so they will be available for the main routine. The structure has the following members:

maybe_null
This Boolean value indicates whether the function can return an SQL NULL value. By default, this value is false unless any of the function arguments are nullable, in which case the default is true. To be safe, if you want the function to return an SQL NULL value, set the value of this attribute to true within the init routine.

decimals
This integer value indicates the maximum number of decimals in the result returned by this function. This attribute is used only if the function returns a numeric result. The default value of this attribute is determined by the arguments passed into the function. The number of decimals in the numeric argument with the most decimal places is used as the default. Since there is no way of knowing which values will be passed into the function, if you want to limit the number of decimals returned, set this attribute within the init routine.

max_length
This integer value indicates the maximum number of characters (or digits) in the result returned by this function. The default value depends on the type of function and the arguments passed to it. For a string function, the default value is the length of the longest string argument. For an integer function, the default is 21 digits (one of which will be a sign character, if necessary). For a decimal function, the default is 13 digits or other characters (including any sign character or decimal point) plus the value of the decimals attribute (see above). There is no performance penalty for setting this value to a large number (in case the function returns a BLOB or other large value).

ptr
This generic pointer can be used arbitrarily by the routines for a UDF. It is most useful to pass some data between the three routines of the function or to include a reference to some data from another part of the application. If this pointer will be used for new data (that is, not a reference to some existing data), the memory should be allocated here in the init routine. The main routine will then receive the allocated pointer to use as it will.

args
This array contains the arguments passed in from the SQL query. Because the actual processing of the arguments is performed in the main routine, this structure is rarely touched during the init routine. There are two attributes of this structure that are sometimes accessed within the init routine:

arg_type
This attribute is an array of values that correspond to the constants STRING_RESULT, INT_RESULT, and REAL_RESULT. Within the main routine, this array indicates the types of arguments the routine is receiving. These types may not always be what the routine is expecting. Instead of returning an error, it is possible to force the arguments passed by the user into the types that the main routine will need. This is done by setting the value within the arg_type array in the init routine. For example, if the first argument needs to be a STRING-type argument, set the first element of this array to STRING_RESULT. This will cause any value passed in by the user (even an integer or real value) to be coerced into a string value before the main routine is called.

lengths
This attribute is an array of values that correspond to the length of each argument passed in by the user. Within the main routine, the attribute describes the actual length values of the arguments being passed in, row by row. Within the init routine, this attribute describes maximum possible length values for string arguments.

message
This value is a character pointer used to store a user-level message about any error that occurs during the execution of the function. This preallocated character buffer is 200 characters wide by default. However, convention dictates keeping the message short enough to fit on one line. The message string must be terminated by a null byte.

The return value of the init routine is a my_bool (Boolean) type. If the initialization completes successfully, the routine should return 0 (false). If an error occurs, the routine should return 1 (true). In addition, in the case of an error, a user-level error message should be placed into the message argument.

14.1.1.2. The main routine

The main routine is the heart of the UDF. In fact, if no special initialization or cleanup is required, this routine is the only one needed to create a UDF. For a function with the name remove_html( ), the main routine should have the same name: remove_html( ).

The signature of the main routine depends on the return value of the UDF. There are three different SQL return values possible for UDFs: STRING, INTEGER, and REAL. The main routine for each of these types has the following signatures.

For a STRING:

char *func_name(UDF_INIT *initid, UDF_ARGS *args, 
 char *result, unsigned long *length,
 char *is_null, char *error)

For an INTEGER:

long long func_name(UDF_INIT *initid, UDF_ARGS *args,
 char *is_null, char *error)

For a REAL:

double func_name(UDF_INIT *initid, UDF_ARGS *args,
 char *is_null, char *error)

All the different routine signatures handle incoming arguments the same way. The differences lie in how each routine returns the result. We will consider the four parameters that are identical among the signatures first:

initid
This is the main data structure used by the UDF engine. It is a UDF_INIT structure, which is described in the previous section. Any of the attributes set during the init routine (including the generic data ptr attribute) are available during the main routine. Although there are no hard rules on the matter, the attributes of these structures should be accessed read-only from within the main routine. This is because all the initialization is done during the init routine, and all the deinitialization is done during the deinit routine. Therefore, in the middle, there is no need to modify these values.

args
These are the arguments passed in from the SQL query as a UDF_ARGS structure with the following attributes:

arg_count
This is the number of arguments passed into the UDF from the SQL query. If you want the UDF to have a fixed number of arguments, you can check the value here and return an error if the number is not correct. Otherwise, UDFs can support variable numbers of arguments.

arg_type
This is an array of type identifiers that indicate the SQL type of each argument. The number of elements in this array is equal to the arg_count attribute. Checking the values of the arg_type array in the main routine allows strict type checking of the incoming arguments. However, as mentioned in the previous section, it is generally more flexible to set the desired types in the init routine. This will transform them from whatever the user passed in to whatever types the main routine is expecting. The values of the arg_type attribute correspond to the constants STRING_RESULT, INT_RESULT, and REAL_RESULT, defined in the MySQL development headers.

args
This is an array of the actual arguments passed to the function. As mentioned above, the number of elements in this array is given by the arg_count property. The values of the arguments depend on two factors: the type of the argument and whether the argument was a constant value. Table 14-1 contains a mapping of types to C types. To accommodate these different possibilities, the elements of the args array are passed around as generic void pointers. You must therefore cast them to the appropriate C type.

Table 14-1. The mapping of MySQL result types to C types

Result type

C type

STRING_RESULT

char *

INT_RESULT

long long

REAL_RESULT

double

lengths
This value is an array of the lengths of the arguments passed to the UDF. As with the other array attributes (arg_type and args), the length of the array is given by the arg_count attribute. Each value of the array is the length of that argument (and hence the length of the corresponding element in the args array). For string arguments, the value is the accurate length of the string for that particular call to the function. Because the function will be called once for each row in the SQL query, this value will change to reflect the length of a string argument each time. Integer and decimal arguments have fixed lengths that are either defaults or set within the init routine.

is_null
This argument is a pointer to an integer value. If the UDF is returning a NULL value for a particular call to the function, set the value of this integer to 1. Otherwise, leave it at the default of 0.

error
This argument is a pointer to an integer value. If an error occurs during the main routine of the UDF, set the value of this integer to 1. This will return a NULL value for this particular call to the function. In addition, the main routine will not be called for any subsequent rows of data in the query that called the UDF, and NULL will be returned for each row.

There are two additional arguments specific to UDF that return STRING results:

result
This argument is a pointer to a character array. The return value of the UDF should be placed in the string referenced by this pointer. In addition, the pointer should be returned as the return value of the main routine (explained later). The MySQL UDF engine allocates 255 bytes as the buffer for a string return value. If the result fits within this limit, simply place the result into this buffer and return the pointer. If the result is bigger, you must use a string buffer that you have allocated. This allocation should be done by using the generic ptr attribute within the init routine. The result string should then be placed into this pointer, which should be returned as the return value of the function. If you take this route, the memory allocated for the pointer within the init routine should be deallocated in the deinit routine.

length
This argument is a pointer to an integer value. The value referenced by this pointer should be set to the length of the return value of the UDF. This does not include any trailing null characters. That is, if the return value of the function is the string make, the integer referenced by length should be set to 6.

The return value of the main routine depends upon the return value of the UDF as a whole. For INTEGER_RETURN functions, the main routine should return the return value of the function as a long long integer value. For REAL_RETURN functions, the main routine should return the return value of the function as a double value. For STRING_RETURN functions, the main routine should return the return value of the function as a pointer to a character array. As mentioned above, if the resulting value is less than 255 characters, this pointer should be the result pointer passed into the main routine. Otherwise, it should be a pointer allocated within the init routine.

14.1.1.3. The deinit routine

The deinit routine is responsible for freeing any memory allocated by the init routine and performing any other necessary cleanup. For a UDF with the name remove_html( ), the deinit routine must have the name remove_html_deinit( ). It has the following signature:

void func_name_deinit(UDF_INIT *initid)

The deinit routine takes one argument, the main UDF_INIT structure used by all three routines. This structure contains the generic ptr pointer that is available for generic use throughout the routines. If this pointer was allocated during the init routine, this deinit routine is responsible for freeing that memory. The deinit routine returns no value.

Example 14-1 shows a UDF that returns the contents of an image file, given the filename of that image. This function could be used to efficiently access image data without storing that data as a BLOB in the database.

Example 14-1. A UDF that returns the contents of an image file based on its filename

#include <mysql/my_global.h>
#include <mysql/my_sys.h>
#include <mysql/mysql.h>
#include <mysql/m_ctype.h>
#include <mysql/m_string.h>
   
#define MAX_IMAGE_SIZE 1048576
#define MAX_FILE_NAME 1024

/* Our 'init' routine is responsible for some simple set-up:
      - Check the argument count. This function must have exactly
             one argument, the filename, so we return an error 
             if there are any other number.
      - Coerce the types of the arguments. Theoretically, the 
              image names could be stored as integer IDs 
              (and thus stored in files with simple numbers 
              for names in the filesystem). Therefore, we want
              to allow any type of argument, and coerce it to a string.
     - Allocate the data buffer to hold the image data. This is a fairly large
              buffer to accommodate images of differing sizes.
*/
my_bool getImage_init(UDF_INIT *initid, UDF_ARGS *args, char *message) {
    if (args->arg_count != 1) {
        strmov(message,"Usage: getImage( fileName )");
        return 1;
    }
    args->arg_type[0] = STRING_RESULT;
    if ( !(initid->ptr = 
        (char *) malloc( sizeof(char) * MAX_IMAGE_SIZE ) ) ) {
        strmov(message, "Couldn't allocate memory!");
        return 1;
    }
    bzero( initid->ptr, sizeof(char) * MAX_IMAGE_SIZE );
    return 0;
}

/* This routine frees the memory allocated by the image buffer */
void getImage_deinit(UDF_INIT *initid) {
    if (initid->ptr)
        free(initid->ptr);
}

/* The main routine does most of the work for this function. The argument
   passed to the function is the filename of an image file. This function first
   uses 'chdir' to set the directory to the directory of the image file. This
   prevents misuse of this function by using it to read an arbitrary file on
   the filesystem. (We also remember the old working directory so that we can
   reset it at the end.)

   Once we've set the current directory, we open the image file using the
   Filename passed in as the argument. If the image opens successfully, we read
   the contents of the file into the empty data buffer. 

   Before returning the data, we set the length pointer to tell MySQL how
   Large our image buffer is. Finally, we return the pointer to the data 
   as the string result of this function.
*/
char *getImage(UDF_INIT *initid, UDF_ARGS *args, char *result,
               unsigned long *length, char *is_null, char *error) {
    FILE *img_file;
    size_t img_size;
    char filename[MAX_FILE_NAME];
    char old_dir[MAX_FILE_NAME];

    bzero(filename, MAX_FILE_NAME);
    bzero(old_dir, MAX_FILE_NAME);

    getcwd(old_dir, MAX_FILE_NAME);
    chdir("/path/to/my/pictures");
    strncpy( filename, args->args[0], args->lengths[0] );
    img_file = fopen( filename, "r" );
    if (! img_file ) {
        *is_null = 1;
        return 0;    
    }
    
    img_size =
        fread( (void *)initid->ptr, sizeof(char),
            MAX_IMAGE_SIZE, img_file );
    
    *length = (unsigned long)img_size;
    chdir(old_dir);
    fclose(img_file);
    return initid->ptr;
}

14.1.2. Aggregate Functions

As mentioned earlier, aggregate functions work just like standard functions except that they have a couple of extra routines. Those routines drive the aggregation for each group of data:

reset
This routine is called at the beginning of each group of data. It resets any local data used to aggregate the data so that each group starts with an empty aggregation.

add
This routine is called for each row of data within a group. The purpose of the function is to perform any data aggregation required for the group.

The order in which routines are called illustrates the differences between standard and aggregate functions:

init
  reset
      add
      add
      add
      ...
      main
  reset
      add
      add
      add
      ...
      main
   reset
      add
      add
      add
      ...
      main
  ...
deinit

While a standard function is comprised of a single loop, an aggregate function is a loop of loops. For a standard function, the main routine is called for each row of data, which corresponds directly with each row of the result set. For an aggregate function, the rows of the result set are made up of groups, in which each row represents a group of rows of data. The add routine is called for each row of data (except the first), while the main routine is called only once for each group (resulting in one row of the result set).

While the signatures of the routines called within a standard function do not differ for an aggregate function, their roles are somewhat different.

14.1.2.6. Aggregate example

Example 14-2 is a UDF that calculates the average distance from the center of a range of points. This function takes two arguments, which correspond to the x and y coordinates of points on a graph. The function then aggregates all the points given to it by the SQL query and returns the distance from the center of the graph (x=0, y=0) to the average of all the points.

Example 14-2. An aggregate UDF that calculates the distance from the center of a graph

#include <mysql/my_global.h>
#include <mysql/my_sys.h>
#include <mysql/mysql.h>
#include <mysql/m_ctype.h>
#include <mysql/m_string.h>
   
/* This is the structure that defines the aggregate of all of the points
 * It contains the sum of all 'x' values, the sum of all 'y' values,
 * and the total number of points counted (for a single group).
 */
struct point {
	double x;
	double y;
	long count;
};

/* Our 'init' routine has several duties, this time:
      - Check the argument count. This function must have exactly
             two arguments, x and y, so we return an error if there are
             any other number.
      - Coerce the types of the arguments. Since we are dealing with decimal
              values, we don't want to penalize the user for passing in
              integers, or even numeric values stored within a string, so
              we coerce the incoming arguments into a 'real' type.
      - Set the 'maybe_null' flag to zero. This function will never return
              a null value. Maybe a 'zero' value, but not null.
      - Set the 'decimals' flag to 5 and the 'max_length' flag to 10.
              This will cause the results to be in the format +/-###.#####
              with a guarantee of 5 decimal places and up to three integer
              places.
     - Allocate the point structure. This structure will be used to house
              the aggregate point data as we accumulate it.
*/
my_bool distanceFromCenter_init( UDF_INIT* initid, UDF_ARGS* args, 
	char* message ) {

	/* Check number of arguments */
    if (args->arg_count != 2) {
		strmov(message,"Usage: distanceFromCenter( x, y )");
		return 1;
	}

    /* Coerce arguments to 'real' */
	args->arg_type[0] = REAL_RESULT;
	args->arg_type[1] = REAL_RESULT;

   /* Set flags */
	initid->maybe_null = 0;
	initid->decimals = 5;
	initid->max_length = 10;

    /* Allocate pointer */
	initid->ptr = malloc( sizeof( struct point ) );
    if (! initid->ptr ) {
        strmov(message, "Cannot allocate memory for 'point' structure");
        return 1;
    }
	return 0;
}

/* The 'deinit' routine only has to de-allocate the 'point' structure */
void distanceFromCenter_deinit( UDF_INIT* initid ) {
	free(initid->ptr);
}

/* The 'add' routine is called once for each row of data. It is responsible
   for aggregating the point information. It simply adds the 'x' and 'y'
   values to the totals stored in the point structure and then increments
   the count in the same structure.
*/
void distanceFromCenter_add( UDF_INIT* initid, UDF_ARGS* args, 
	char* is_null, char *error ) {
	((struct point *)initid->ptr)->x += *((double *)args->args[0]);
	((struct point *)initid->ptr)->y += *((double *)args->args[0]);
	((struct point *)initid->ptr)->count++;
}

/* The 'reset' routine is called at the beginning of each group. It simply
   zeros out the information  in the point structure in preperation for a new
   group of data. As mentioned earlier, this is the only routine called on the
   first row of data in a group. Therefore, we must also call the 'add' routine
   from this routine, to include the first row of data in the aggregate set.
*/
void distanceFromCenter_reset( UDF_INIT* initid, UDF_ARGS* args, 
	char* is_null, char *error ) {
	((struct point *)initid->ptr)->x = 0.0;
	((struct point *)initid->ptr)->y = 0.0;
	((struct point *)initid->ptr)->count = 0;
	distanceFromCenter_add( initid, args, is_null, error );
}

/* The 'main' routine has a very simple role here. It just has to take the
   aggregate data from the point structure and calculate a point (x,y) value
   for that data. It then runs the classic a^2 + b^2 = c^2 formula to determine
   the distance to the center of the grid for that point.
double distanceFromCenter( UDF_INIT* initid, UDF_ARGS* args, 
	char* is_null, char *error ) {
	double avg_x, avg_y;

    /* Calculate the average 'x' and the average 'y'
	avg_x = ((struct point *)initid->ptr)->x / 
		(double)((struct point *)initid->ptr)->count;
	avg_y = ((struct point *)initid->ptr)->y / 
		(double)((struct point *)initid->ptr)->count;
	
    /* Return the square root of a^2 + b^2 */
	return sqrt( avg_x * avg_x + avg_y * avg_y );

}

14.1.3. Calling a UDF

Now that we have successfully created the routines necessary for a UDF, the next step is to call those functions from an SQL query. This is accomplished in three steps:

  1. Compiling the function routines

  2. Loading the UDF

  3. Running the UDF

As discussed earlier, the routines that make up a UDF can be implemented in any programming language that can be compiled into an operating system-level shared library. On most Unix-based systems, a shared library is indicated by a .so extension. In Microsoft Windows, shared libraries have .dll extensions.

Once the routines are written, use whatever procedure is generally followed for compiling source code into a shared library. For example, compiling a shared library from a C or C++ file using the GNU C Compiler on a Unix-based system can be accomplished with the following command:

gcc -shared -o myfunc.so myfunc.c

The -shared flag indicates that a shared library should be created from the functions within the source file. Different compilers, operating systems, and programming languages will have different methods for creating a shared library.

As mentioned earlier, multiple UDFs can be defined and implemented within a single shared library file. However, all the routines necessary for each function must be present within that file. That is, you can place multiple functions within a single shared library, but you cannot split the routines necessary for a single function between multiple libraries.

Once a shared library has been created, it should be placed somewhere on the operating system's filesystem that is designated for shared libraries. On Unix-based systems, this generally means the /lib and /usr/lib directories as well as any directories specified in the LD_LIBRARY_PATH environment variable (although the name of the environment variable differs on some Unix-based systems). For Microsoft Windows systems, this generally means the WINDOWS\System32 directory (where WINDOWS is the main Windows directory, often C:\Winnt for Windows NT and 2000 systems).

Finally, it may be necessary to restart the MySQL server for it to find the shared library file. This may not be necessary, though, so if restarting the MySQL server is an issue, first try to load the function library without restarting.

Once the shared library file is in a place that is accessible to the MySQL server, it is possible to access the UDF directly from MySQL. To do this, the UDF must be loaded, which generally means loading the shared library into MySQL's memory so it can access the function routines.

Loading a new UDF is accomplished through the SQL CREATE FUNCTION statement:

CREATE [AGGREGATE] FUNCTION function_name 
RETURNS {STRING|REAL|INTEGER}
SONAME shared_library_name

Basically, this statement indicates the name of the UDF, its return type, and the shared library where it is located. The optional AGGREGATE keyword indicates that the function is an aggregate function.

The name of the function must be the name of the main routine used to implement the function. The name is case sensitive, so it must be capitalized the same way it is in the shared library.

The return type of the function is specified by one of the keywords STRING, REAL, or INTEGER. This value must match the return type of the main routine.

The shared library is given as an SQL literal string. It should be only the filename (not a directory path), and it is case sensitive. The shared library given here must be accessible to MySQL.

Behind the scenes, loading a UDF involves adding and removing data from the func table of the system mysql database. Therefore, to execute the CREATE FUNCTION statement, a user must have INSERT and DELETE privileges for that table (or for the whole database). In general, this should be restricted to administrative users, because loading arbitrary UDFs can give a user almost complete control over the MySQL system. (Imagine loading a function that kills the main MySQL process. Because the function is run from MySQL itself, nothing will stop it from shutting down the system abruptly.)

The following SQL calls add two UDFs, one standard and the other aggregate:

CREATE FUNCTION getImage RETURNS STRING SONAME 'myfuncs.so';
CREATE AGGREGATE FUNCTION distanceFromCenter 
                  RETURNS REAL SONAME 'myfuncs.so';

Once a UDF has been loaded into MySQL, it can be accessed just as if it were a built-in MySQL function, because MySQL functions are case insensitive. Even though it was necessary to specify the correct case when loading the function, once loaded, the function can be executed with any case:

SELECT GETIMAGE( image_name ) AS image FROM Product;
SELECT distanceFromCenter( x, y ) FROM Point GROUP BY shape_id


Library Navigation Links

Copyright © 2003 O'Reilly & Associates. All rights reserved.