Let's look into selectivity, as this is an important topics when looking at index performance. (Oooh, I said "performance", watch everyone's ears perk up!).
This will probably answer the questions "Why isn't MySQL using my index?" or "Why is my query so slow when I have an index on it?"
Selectivity describes how different values of a field are. It is a number from 0-1, although you can also think of it as a percentage. A value of 1, or 100%, means that each value in the field is unique. This happens with UNIQUE and PRIMARY keys, although non-unique fields may have a selectivity of 1 -- for example, a timestamp value in a not-often-used table.
To calculate this, you take the total number of DISTINCT records and divide by the total number of records.
My company has a large Users table, so I grabbed some statistics off of that:
+----------+
| count(*) |
+----------+
| 817666 |
+----------+
1 row in set (0.63 sec)
+--------------------------+
| count(distinct username) |
+--------------------------+
| 817666 |
+--------------------------+
1 row in set (1.63 sec)
So the selectivity is 81766/81766, or 1. If this were not a UNIQUE KEY already, it's a good candidate for one.
the "created" field is a timestamp for when the user record was created
+-------------------------+
| count(distinct created) |
+-------------------------+
| 811227 |
+-------------------------+
1 row in set (2.04 sec)
As I expect, there are *some* duplicates, but for the most part, everyone has a different creation time. 811227/817666 = 0.99.
+--------------------+
| count(distinct IP) |
+--------------------+
| 544694 |
+--------------------+
1 row in set (2.35 sec)
This is interesting -- lots of people logon from public places, or use their friends' computers, so there are duplicate IP's (the last IP used to login is associated with a user, so it's a 1-to-1 relationship). The selectivity here is 0.67.
+-------------------------+
| count(distinct browser) |
+-------------------------+
| 25699 |
+-------------------------+
1 row in set (1.70 sec)
This is what the server reports the user's browser is. It records the last browser used by the user. This gives us about a 0.03 for selectivity.
+---------------------+
| count(distinct age) |
+---------------------+
| 83 |
+---------------------+
1 row in set (0.63 sec)
There are only 83 different reported ages on our site. That makes the selectivity of age 0.000101508. That is very low, effectively zero.
So why is this important? I'm glad you asked....
MySQL has a cost-based optimizer. This means that MySQL calculates the costs of different ways of performing a query and then chooses the cheapest one. Sounds reasonable, right? Well, calculating the costs is an inexact science. In order to calculate the exact cost, the optimizer would actually have to run the query. So an estimate is taken, and the estimate is wrong sometimes. Most of the time the estimate is correct.
In contrast, some database systems allow a rule-based optimizer. This means that no matter what the data state, the database uses rules to figure out the "optimal" path to the query. In most enterprise-level database systems, a cost-based optimizer performs better than a rule-based optimizer. In other words, there are so many exceptions to the rules that the calculation overhead is worth it.
(Just to clarify, in both systems, the correct result set will be generated. The optimizer determines the path to the information.)
This cost-based optimizer uses selectivity information when it decides whether or not to use an index.
But what does this mean for me?
Well, in the example above, this means if you want to query folks by age or age group, it's useless to put an index on it. It's a waste of cpu time and disk I/O to have an index for something with such a low selectivity. The optimizer will NEVER use it.
I've heard that the optimizer will do a full table scan if it calculates it will return more than 30% of the table. Why is that number so low, why not more like 50 or 75%? Well, first the server has to go to the index, search the index, and find if the index record matches. Then it needs to follow the index record's pointer to the real record on disk. And the MySQL gurus have decided that around 30% is the place where using the index is slower than just doing a full table scan.
(Note: I'm not exactly sure what the # is but I've heard it's around 30%. As well, at the user conference I saw graphs that showed that for the most part this was true. I thought it was in Jay Pipes' Performance Tuning presentation, but the graphs are not in the slides. Pointers are appreciated.)
So in this case, should we put an index on browser? Well, this is one of those cases where I'd think about how often we'd be doing queries and how much we care about server performance doing a report versus server performance while inserting or updating a record. If we really care one way or another, go that way. And document!
Another thing to consider is the nature of the data. For something like age, that's not going to change. Sure, we might have some 120 year olds eventually, but there's not going to be that much variance. For browsers, there will only be more and more types put out, considering different version numbers and OS configurations of standard browsers as well as mobile phone browsers.
However, if it does not matter, or if it's too difficult to decide which is more important (your boss says "we can't be slow during normal usage OR during reporting!") I default to MySQL -- it's better at optimization than I am. I'd probably put an index so MySQL could decide whether to do a full table scan or use an index, particularly given that I expect the number of browsers to keep increasing.
For something like IP, that has a selectivity of 0.67, so putting an index on it is worth it if we query on IPs a lot.
I hope this article has been helpful!
Jay Pipes has a great article on using the INFORMATION_SCHEMA to find out selectivity:
http://tinyurl.com/kfffp
What selectivity value
What selectivity value corresponds to MySQL thinking that 30% or more of the table will be returned by a query? (e.g. At what selectivity value or below would we say "MySQL won't use this, so I may not bother indexing"?)
Thanks.
Marc, data is stored in
Marc, data is stored in pages in the typically used engines. Estimate the probability that a page can be skipped when 30% of all records are estimated to match and there are say 20 records per page. 0.7 ^ 20 = 0.00078 or 1 in 1300 roughly. Reading the index pages would be very likely to add more work than could be saved.
[...] Sheeri has a great
[...] Sheeri has a great post on Selectivity and Index Performance [...]
Hmm, I suppose my suggestion
Hmm, I suppose my suggestion was more along the line of "in the event that the cost optimizer determines that a table scan would be more efficient than an [standard] index scan, due to >30% of the data matching, instead of just first performing a full table scan, why not use the index to find the data we DON'T want [using the index], so we know what portions of the table to EXCLUDE while performing a full table scan". Effectively saying "look, we know from the index that this 20% of the data will NOT match. Knowing that, we only need to do a full table scan on the remaining 80%."
Just a thought. Maybe I'm way off here. Or maybe that would still result in excessive calculations/comparisons... :-)
Hi Marc, So, I believe each
Hi Marc,
So, I believe each row of data has a pointer to the "next" row of data. A full table scan wouldn't use an index at all, just start with the first row, get the data, check the value, go to the next row, get the data, etc.
Going to the index and then retrieving a row for 80% of the data is still tedious. Remember that calculations or functions or even more comparisons may need to be done on the data. Why do this:
go to index
see if it's the value we want
get the data using the pointer in the index
perform calculations and additional comparisons
instead of this
get data
see if it has the values we want
perform calculations and additional comparisons
get the next data using the pointer from the data
The index is stored in a different place on disk than the data, so there's a lot of seeking involved when you're going back and forth like you suggest.
Yes, I also herd the 30%
Yes, I also herd the 30% from various people inside MySQL AB. That value is a lot lower for Oracle AFAIK .. like half.
One thing to note though is that in some situations databases can use a covering index. That is they do not need to follow the row pointer in the index if the index already contains all data that is queried upon. In that case I would assume that the optimizer realizes this before discounting the use of an index on a column with low selectivity.
I should clarify.. This
I should clarify.. This would be in the context of a WHERE condition returning in excess of 30% of the rows from the table (and therefore, the optimizer deciding to do a full table scan instead of using the index).
In the scenario where it returns in excess of 30% of the rows -- perhaps even more than that -- say, 80% of the rows from the table, would it be efficient for the optimizer to examine the index to eliminate the 20% that we don't want? Upon scanning the table to return data, would knowing in advance (from the index, based on the row pointers) which rows we don't want, allow MySQL to skip over the unwanted data more quickly rather than examining the actual data in each row?
I have no idea of database
I have no idea of database internals, so perhaps this is a silly question...but.... in cases where the optimizer decides that the WHERE condition is not selective enough to use the index on that column, would there be any opportunity for a performance gain by considering using the index to determine the rows that we *don't want* (i.e. the opposite of what it normally does on an index), and then proceed from there?
In such a case, by reversing the meaning of the WHERE condition to the opposite meaning -- ONLY when examining the index -- (i.e. change WHERE x=1 to WHERE NOT x=1), wouldn't this help it to know which rows to skip reading, resulting in higher selectivity?