Tuesday, November 3, 2009

Cardinality

Last night I was asked about index cardinality. One of the members of the North Texas MySQL Users Group was using phpMyAdmin, noticed an element marked 'cardinality', and asked me what it meant. I will admit I was stumped.

The manual says:

ANALYZE TABLE determines index cardinality (as displayed in the Cardinality column of SHOW INDEX output) by doing ten random dives to each of the index trees and updating index cardinality estimates accordingly. Because these are only estimates, repeated runs of ANALYZE TABLE may produce different numbers. This makes ANALYZE TABLE fast on InnoDB tables but not 100% accurate because it does not take all rows into account.

MySQL uses index cardinality estimates only in join optimization. If some join is not optimized in the right way, you can try using ANALYZE TABLE. In the few cases that ANALYZE TABLE does not produce values good enough for your particular tables, you can use FORCE INDEX with your queries to force the use of a particular index, or set the max_seeks_for_key system variable to ensure that MySQL prefers index lookups over table scans. See Section 5.1.3, “Server System Variables”, and Section B.5.6, “Optimizer-Related Issues”.


Clear, huh? Well, not 100% for me, so I went back to searching. Other finds on the net say that cardinality is a measure of how many distinct values an index holds and that a UNIQUE index would have the highest cardinality. So the more unique the index entries are, the higher the cardinality.

Which brings up another question: How often do you maintain your indexes? What clues tell you to do maintenance? Please share your recommendations!


BTW we will have pizza at the next meeting of the North Texas Users Group, so see you December 7 at 7PM at the Sun offices, 16000 Dallas Tollway in suite 700!

2 comments:

Bradley C. Kuszmaul said...

The cardinality of an index is the number of unique values in the indexed field.

For example, if you were a car manufacturer and you had a database containing one row for every car that rolls off the assembly line, the primary key might be the VIN, and a secondary key might be the color. If you make cars that are red, blue, or green, then the cardinality of the color index would be three. If you are Henry Ford building model Ts, then the cardinality would be one, since any customer can have a car painted any colour that he wants so long as it is black.
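To make the car example concrete, here is a minimal sketch in plain Python that counts distinct values per column. The rows, VINs, and the `cardinality` helper are made up for illustration; this computes an exact count, whereas MySQL reports an estimate.

```python
# Hypothetical rows for the car example: (vin, color).
cars = [
    ("VIN0001", "red"),
    ("VIN0002", "blue"),
    ("VIN0003", "green"),
    ("VIN0004", "red"),
    ("VIN0005", "blue"),
    ("VIN0006", "red"),
]

# Hypothetical Model T production run: every car is black.
model_ts = [
    ("VIN1001", "black"),
    ("VIN1002", "black"),
    ("VIN1003", "black"),
]


def cardinality(rows, column):
    """Return the number of distinct values found at position `column`."""
    return len({row[column] for row in rows})


vin_cardinality = cardinality(cars, 0)          # one distinct VIN per row
color_cardinality = cardinality(cars, 1)        # red, blue, green
model_t_cardinality = cardinality(model_ts, 1)  # black only

print("VIN:", vin_cardinality)
print("color:", color_cardinality)
print("Model T color:", model_t_cardinality)
```

The VIN column has as many distinct values as there are rows, which is the best a key can do, while the color column tops out at three no matter how many cars are built.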

Timothy Little said...

I use the example of Gender.

If you have a table with the demographic data on 180,000 hacker conference attendees, then having an index on Gender (by itself) would be less than useful (or even smart).

Assume that you have 60,000 females and 120,000 males in the database table; then you will probably get a full-table scan no matter which value you select for gender.

The reason is that Gender has a LOW cardinality.

Cardinality is a measure of specificity.

If you use State, then it's probably a better choice, since you are much more likely to have more than merely two distinct values (in the US there are 50 states, but that's the maximum).

If you choose their names, then you can see that each choice generates an increasingly specific or unique association per row.

Of course, a primary key guarantees the best cardinality -- being 1:1... but something like first-name+last-name or social-security number or seat-number would be equally good (even if there are maybe 300 John Smiths, then it's still 300 out of 180,000 so that's pretty darned specific).
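The arithmetic behind these numbers can be sketched in a few lines of Python. The `fraction_matched` helper is a made-up name for illustration, and the row counts are the ones used above; the smaller the fraction of the table a lookup matches, the more specific the value is.

```python
from fractions import Fraction

TOTAL_ROWS = 180_000  # attendees in the example table


def fraction_matched(matching_rows, total_rows=TOTAL_ROWS):
    """Return the share of the table that a single lookup value matches."""
    return Fraction(matching_rows, total_rows)


female = fraction_matched(60_000)    # 1/3 of the table
male = fraction_matched(120_000)     # 2/3 of the table
john_smith = fraction_matched(300)   # 1/600 of the table

for label, frac in [
    ("Gender = female", female),
    ("Gender = male", male),
    ("Name = John Smith", john_smith),
]:
    print(f"{label}: {frac} of rows ({float(frac):.2%})")
```

Either gender value touches a third or more of the table, while even a very common name touches a tiny slice of it.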

Does that clear things up?