Given the following scenario:
| Addresses | ||
|---|---|---|
| name | group | city |
| Johnny | Friends | Cambridge |
| Kevin | Friends | Allston |
| Justin | Coworkers | Easton |
| Aya | Coworkers | North Andover |
| Josh | Family | Groton |
| Connie | Family | Easton |
In a database, that could be represented as such:
CREATE TABLE GroupNum (
groupId TINYINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
groupName VARCHAR(20)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE AddressNum (
firstName VARCHAR(20),
groupId TINYINT UNSIGNED NOT NULL,
city VARCHAR(20),
KEY (groupId),
FOREIGN KEY (groupId) REFERENCES GroupNum (groupId)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO GroupNum (groupName) VALUES ('Friends'),('Coworkers'),('Family');
INSERT INTO AddressNum (firstName,groupId,city) VALUES
('Johnny',1,'Cambridge'),
('Kevin',1,'Allston'),
('Justin',2,'Easton'),
('Aya',2,'North Andover'),
('Josh',3,'Groton'),
('Connie',3,'Easton');
SELECT firstName,groupName,city
FROM AddressNum INNER JOIN GroupNum USING (groupId);
firstName groupName city Johnny Friends Cambridge Kevin Friends Allston Justin Coworkers Easton Aya Coworkers North Andover Josh Family Groton Connie Family Easton
or, using descriptive foreign keys:
CREATE TABLE GroupText (
groupName VARCHAR(20) NOT NULL PRIMARY KEY
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE AddressText (
firstName VARCHAR(20),
groupName VARCHAR(20) NOT NULL,
city VARCHAR(20),
KEY (groupName),
FOREIGN KEY (groupName) REFERENCES GroupText (groupName)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO GroupText (groupName) VALUES ('Friends'),('Coworkers'),('Family');
INSERT INTO AddressText (firstName,groupName,city) VALUES
('Johnny','Friends','Cambridge'),
('Kevin','Friends','Allston'),
('Justin','Coworkers','Easton'),
('Aya','Coworkers','North Andover'),
('Josh','Family','Groton'),
('Connie','Family','Easton');
SELECT firstName,groupName,city
FROM AddressText;
firstName groupName city Johnny Friends Cambridge Kevin Friends Allston Justin Coworkers Easton Aya Coworkers North Andover Josh Family Groton Connie Family Easton
So what are the pros and cons of using descriptive foreign keys, as opposed to surrogate keys?
Con:
1) An auto-increment surrogate key (a number) is referenced faster than strings, so INSERTS will be slightly slower.
2) Data redundancy. Foreign keys need to be updated (and/or deleted) whenever the data is updated.
Pro:
1) Using a surrogate key means that to get user-readable data (ie, the text) a JOIN must be made. Using a descriptive foreign key, no JOIN needs to be made to get user-readable data. This includes reading with SELECTs and when INSERTing or modifying data. For example, to insert into AddressNum the groupId is needed.
2) Updates/deletes on a foreign key can be set to cascade.
3) Human-readable data in the database. This will make it easier for the DBA to debug data issues, as they can use a SHOW CREATE TABLE to find the foreign key references, and then get a sample of data from one table and understand the scenario, without having to JOIN.
My thinking right now is that I would use a descriptive foreign key where the forign key is "static" data, or the foreign key references are used in a small number of tables.
For instance, in a user-based application, I would not use "username" as a descriptive foreign key, I would stick with a surrogate key of "uid", because the foreign key is ubiquitous, and if a user changed their username, every single table would need to cascade -- or else the update would fail.
However, I would use descriptive foreign keys for "static" data like locale (language) data, country and state names, because they likely will not change, even though they might affect many changes. Athough note that in these places, I'm also using standards -- "locale" for language, ISO country and state/province codes, etc.
I would also use it in the scenario above, where the data is not static, but also does not touch many tables -- the 'group name' only affects the 'listings' table if it gets changed.
So, what are the pros and cons of this method? When would you and would you not use it?
Sorry, that's Sheeri. Never
Sorry, that's Sheeri. Never trust spell checkers.
Sherri, Maybe I'm just
Sherri,
Maybe I'm just splitting hairs, but I think there is a difference between taking up more space (workaround: get bigger drives), and the effect that has on query performance. For large tables, query performance is largely dependant on the number of pages physically read from the disk. That in turn is largely dependent on the number of rows that can fit on a page (ignoring page splits and holes created by deletes).
So for a ten million row table with 50 bytes per row, you can get about 160 rows per 8K page (assuming a bit of overhead used by the system) which means 62500 pages in a table scan. Figure about 150 reads/sec on a decent drive, and that takes about 420 seconds. If your rows are 45 bytes, you get about 177 rows per page, and 56480 pages or 370 seconds which is a noticeable improvement. This is similar to going from a 7-8 character "pretty" key to an int.
For indexes it gets worse. Since only the index key is stored, going from an 8 character key to a 4 byte int key means twice as many index keys per trunk page which will ends up eliminating half the leaf nodes (again assuming no holes) which means one read saved per seek. For 30,000 indexed lookups in that table, you save about 15,000 page reads or about 100 seconds. I'm assuming the worst case of the leaf nodes not being cached (since the entire page cache is taken up by the data pages from the scan we just did in the previous example), but in my experience, the real world difference is still usually fairly dramatic.
John, You're going to have
John,
You're going to have to give some more concrete data for that. I already noted that there's more space taken up in "con #1".
If you have a lot of rows,
If you have a lot of rows, the space saved by the surogate key can make a significate difference. You get more rows per page which reduces disk access, increases the percent of the table in kept in the cache, and reduces the depth of indexes. Also integer comparisions are faster than string comparisions especially with case-insensitive strings, so joins are improved. All of these can have a dramatic effect on perfomance in large tables.
One way around the e.g.
One way around the e.g. Country question is to use an acronym - we have used the inernational 3-character airline code for this. Then the data in the database isn't meaningless (the acronyms are quite easy to decipher), you don't waste much space, and you can have a single-table update if the name-acronym does change
[...] Pros and Cons of
[...] Pros and Cons of Descriptive Foreign Keys? - My-ess-queue-ell vs. My-see-quell [...]
James, you bring up a great
James, you bring up a great point -- knowing your data is extremely important. For instance, if your ports application, it was better to use surrogate keys because the data changed frequently.
However, for someone for whom the data does not change -- let's say they're keeping the data for provinces in a few countries, key to their business....if the data does not change often, then the foreign keys work well.
So that's one instance where what's "static" for one person may not be "static" for another.
Country and locale are
Country and locale are pretty bad choices for your descriptive key example. Much better to use surrogate keys there because it helps when the country changes name or splits. That's harder to imagine in the US than in Europe where it's been fairly common recently. :) In a past job I had the joy of keeping up to date the country of several thousand ports. Changes were routine and the application had originally made the mistake of using country names in the main row data. Language changes included those in the Baltic states, where Russian ceased being the local language.
Even official character short names for countries have changed, unfortunately. Was a pain to update some of those.
If a politician can change it, it's not really static. :)
Lukas, why are you so big on
Lukas, why are you so big on portability? Isn't the point to use the features of the DBMS you have? none of the DBMS are true relational database systems, really, so why not use the advantages? Migration will always be a pain, even if you try to make it as portable as possible.....
My view is skewed, because I deal with an application with a web front-end with a backend that we own. Obviously if you're distributing an application to customers that they can use their own backend, then of course you care about portability....but I'd think you'd still use the features of each individual system, to make your application the best possible on all systems, instead of just mediocre because you're trying to be portable......
Also, you're information is incorrect. You can ALTER TABLE MODIFY enumCol enum('oldval1','oldval2','oldval3','newval1' . . . ) without losing data. If you forget one of the old values (or want to 'delete' all values on purpose), the ALTER TABLE will fail.
If you want to change the name of an enum, you have to ALTER TABLE twice -- once to add the new name, then change the data to the new name, and then alter again to remove the old name.
Much easier to have a foreign key with ON UPDATE and/or DELETE CASCADE.
Well I am big on
Well I am big on portability. So when I use SHOW in application its nicely hidden behind a portable API.
As for ENUM and ALTER TABLE. IIRC you can only add new values at the end, otherwise all data is deleted. Which goes to show again its a non-standard hack *g* (though very convinient at times)
Sheeri, your YesNo table
Sheeri, your YesNo table reminds me of this great post: What Is Truth?
Lukas -- standards,
Lukas -- standards, shmandards. Do you never use SHOW statements? No one database is fully standard compliant. I'm hesitant to use ENUM and SET because there is a very very very limited use for them, just as there's a very very very limited use for the "Boolean" type (as MySQL doesn't index them very well).
Jan, you bring up a good
Jan, you bring up a good point. But the problem with that is that if the ENUM isn't *really* static -- if you ever need to change it, you have to ALTER the table, which might affect many records and lock a table for a long time.
Check out http://sheeri.net/archives/13 and all the comments.
In fact, very little data is static "enough" to require ENUMs. Even "Yes" and "No" rarely are the end of it -- sometimes you have "Maybe" or "Partially" or "Sometimes".
And yes, I have a silly table called "YesNo" with 3 values - "Yes","No" and "".
And the ENUM doesn't help when you have lots of data, such as static location data (postal codes, city codes, etc).
Another important reason to
Another important reason to not use surrogate keys is that if anything in the application screws up you are not left with rather meaningless foreign keys. A generated integer tells you nothing about the data. The same issue arises if you look at logs.
@Jan: ENUM (and SET) is not standard. So I would be hesitant to use it.
For the case in which you
For the case in which you like descriptive foreign keys, it seems an ENUM would be appropriate. It will have the speed of an integer, and the readability of a string.
I always try to make my static foreign keys descriptive by making them ENUMs.