Friday, February 29, 2008

PROCEDURE ANALYSE()

MySQL DBA Certification candidates need to know about a little known but important function.

PROCEDURE ANALYSE() is a rarely used function in these days of inexpensive disks. It is mainly used to suggest optimal column sizes. Forty years ago you needed to worry how to properly encode data so that it would fit on an eighty column punch card. Thirty years ago when hard disks were the size of stove or dishwasher, there was a need to conserve as much space as possible. Today people carry gigs of data in their pocket and a terabyte of disk is available at local stores next to other consumer products. So you can be less than optimal in your storage of data for the most part. There are exceptions. Remember that a smaller width index is going to be a faster index.

Let us pretend that we need to create a table to track item numbers and our data architect, boss, or SWAG tells us there will only be twenty item numbers, ranging from 1 to 20. And by the way, the boss wants efficiency.

mysql> CREATE table x1 (item INT(4) NOT NULL);

No insert the values 1 to 20 and analyze.


mysql> SELECT item FROM x1 PROCEDURE ANALYSE()\G

You will get information on the minimum and maximal values, the minimal and maximal data length, and a suggested optimal field type. In this case a TINYINT(2) UNSIGNED NOT NULL is recommended.

mysql> ALTER TABLE x1 MODIFY item TINYINT(2) UNSIGNED NOT NULL;

You can show your boss that the table is optimized and the rest of the company is so inspired by the news that they start creating more items. Then one day you get a call from two very upset project managers. It seems that item #255 is being used for two items. What happened?

Well, TINYINT UNSIGNED holds the values 0 to 255 and will mangle values over 255 to 255. So you rerun the PROCEDURE ANALYSE() and receive the recommendation to change the filed in question to TINYINT(3) UNSIGNED. But remember TINYINT is not going to let you store anything bigger than 255. So you check the numeric types manual page and go to the next larger data type, SMALLINT.

PROCEDURE ANALYSE() can be a double edged tool. In the first case it helped. In the second case it had no idea you needed bigger values and gave you a recommendation based on bad data. Like most tools, PROCEDURE ANALYSE() needs to be used properly to get the desired results.

Running PROCEDURE ANALYSE() on your tables from time to time can help you spot potential fields getting squeezed for space.

2 comments:

Sheeri K. Cabral said...

I've also found PROCEDURE ANALYSE() very helpful when I'm migrating a legacy system where very few people know what the appropriate value ranges are. Because it's legacy there's already data in there, so it's a matter of analyzing it to figure out what the data range currently is, and then figure out what it should be from that.

Rob said...

I have been hosed a number of times tiny and medium ints. I have never lost data to a int. It is my opinion that using a too restrictive data type is a far worse sin than using an inefficient data type. Premature optimizations of data types often produce tiny gains, can hose some poor bastard at a later date, and in my opinion waste time.

1. Get it working.
2. Get it working right.
3. Get it working fast.

Optimizing data types might be necessary for #3, but may well not, and the time may well be better spent on unit testing.

Data lost to a crappy optimization sucks.