Most developers are used to programming in procedural or object-oriented languages. SQL, as a declarative language, is quite different. In declarative languages like SQL, you program what you want the result to be, not the procedure to get it. For instance, "give me all the people with the first name starting with the letter S from a certain table." Unlike procedural programming (or even methods in object-oriented languages), you do not say how to get the information. This is, I believe, why many developers want to give the query optimizer "hints" on how to do its job.
That being said, I will list the top 8 Basic SQL Practices I live by, and attempt to enforce. Please feel free to comment adding your own (or post your own, linking back here).
In no particular order:
1) Always use explicit joins. If I mean INNER JOIN, then I use INNER JOIN. No use of just plain "JOIN". Never, ever, ever use a comma join -- I consider that a mistake. If I explicitly state "CROSS JOIN" then I know I have consciously made that decision. Also, keep join conditions in an ON or USING clause; they should not go in the WHERE clause. I also put my join conditions in parentheses; for whatever reason, I find:
ON (foo=bar AND baz=bop) WHERE a=b
is easier to see that the join condition contains 2 conditions than
ON foo=bar AND baz=bop WHERE a=b
2) Always define field names. No using SELECT * or INSERT INTO table VALUES. It's a pain, and more so of a pain given that mysqldump does not specify INSERT fields. However, if it's important enough to save in a text file (ie, it's seed data or a migration script) then it gets explicit field names.
3) Always use the database server's timestamp. Web servers may have disparate times. Reports may come from different servers than the inserted data.
4) Store IPs as integers with INET_ATON and retrieve them with INET_NTOA.
5) When doing reports, the network traffic is usually the biggest bottleneck. If you're going to receive information, it's better to receive in chunks, which will likely be larger than a logical piece. For instance, state reporting -- instead of making 50 connections for states in the US, get them all at once. If the dataset is very large and folks do not want to stare at a blank page while the report is loading, use paging with LIMIT to grab, say, 1000 entries at a time and display them on the screen so people can start looking at the data while the rest is being grabbed.
6) Running a query in a loop is usually a bad idea. If you are executing the same query with different data, consider building a query string using UNION and executing it at the end of the loop, so you can execute multiple queries with only one trip across the network to the database.
7) Do not be afraid of JOINs. They are not necessarily resource intensive, given good indexing. Most of the time a denormalized schema without a join ends up being worse than a normalized one using a join. When there is redundant data, ensuring data integrity takes up more cycles than providing a framework for data integrity in the first place.
8) Limit the use of correlated subqueries; often they can be replaced with a JOIN.
(I also try to put SQL commands in capital letters to help me easily spot fields and variables I use).
I just wanted to discuss
I just wanted to discuss point number 6. Has anyone tried executing multiple un-related select queries on Oracle database using JDBC. In the API(Statement class), there is something called getMoreResults() that fetches multiple result sets. I tried it but its not working.
[...] Top 8 SQL Best
[...] Top 8 SQL Best Practices (tags: sql) 查看 涂鸦 分类的所有文章. [...]
[...] Sheeri Kritzer » Blog
[...] Sheeri Kritzer » Blog Archive » Top 8 SQL Best Practices - My-ess-queue-ell vs. My-see-quell (tags: mysql) Share and Enjoy:These icons link to social bookmarking sites where readers can share and discover new web pages. [...]
Sheeri, Nice list! For #1,
Sheeri,
Nice list!
For #1, I (personally) am comfortable with not specifying "INNER" as my mind considers it the default -- but that's a personal thing.
I completely agree with you that old-style "comma joins" should be avoided at all costs. There's some documented examples published for MS SQL Server in which old-style joins generate ambiguous results. I haven't been able to replicate them in Oracle (and haven't tried the examples in MySQL), but why leave any chance for ambiguity? Also, having recently come from an organization where the report-writers were switching from Focus to SQL, it's my experience that newbie sql writers are FAR less likely to inadvertently cross-join if they're stating their joins explicitly and separating them from their where conditions.
Peter -- 1 is definitely
Peter -- 1 is definitely syntax, but I find it easier to read, and if people use JOINs explicitly then I know they really meant it. Particularly with a comma join.
I feel that specifying INNER vs. LEFT or RIGHT or CROSS helps them understand that they're not saying "I want you to join the tables", ie, take an action, they're saying, "I want the result if you join the tables this way," ie, "this is what I want back." It's more a 'leading by example' thing.
As for 5, mostly I'm talking about medium-sized datasets. If you're getting thousands of rows from one group and hundreds (or dozens) from another, with a total of a few hundred thousand rows, lots of times it does not make sense to go across the network and make 50 queries when 1 will suffice. I have found that, for a few hundred thousand rows, it's faster to go across the network 10 times, even with throwing the data away, than to go 50 times where many of the times result in just a few hundred listings.
There's also perception -- we're doing web-based reporting, so waiting for 30 seconds for 1 very long query to run to aggregate into 50 different groups is not "acceptable" (except, of course, where it is. :) ) However, it is acceptable for it to take a long time if some data shows up right away, so while the queries are technically wasteful (using an OFFSET to LIMIT), the users are far happier. And again, far happier than they were with one query per group.
7 - Note that the rule isn't "never used denormalized data". Most people take what you're saying and say "well, we have lots of data, therefore we need to denormalize." Denormalization should happen as a LAST resort, when you already know other things won't work. Everything you say is correct, JOINs are more expensive for an access type, but when you need to enforce data integrity with extra queries, you realize that maybe a JOIN is better than 3-4 queries to make sure all the data gets into the right places (and stays that way. And we do not even use triggers for that stuff).
So yes, a JOIN is more expensive than looking at one table. However, denormalized data causes many problems. We use it in certain places -- we have an online dating site, where most of the people look for "who's online NOW" to chat with them, and thus denormalizing some of the data is useful for us -- we can put their interests and location in with their session information, and easily find out how many people are online in what locations, with what interests, etc. Again, it's not "NEVER use denormalized data" but I find that most people denormalize while creating their tables, without having any evidence that denormalization is the right thing to do.
In other words, #7 could be "Keep the data in 3NF unless you have good reason to do otherwise." Again, anyone reading #7 might think, "JOINs are the most expensive type of access, so I'll never do them," which is not the right way of thinking.
Sheeri, Just curious where
Sheeri,
Just curious where does your 1 item comes from ? I personally feel this is syntax matter, unless you use LEFT JOIN. Some people find it more readable, some not - I would for example like to see clauses grouped by tables as for inner join there is no difference between where and on clauses for optimizer and I prefer to see how query can be executed easily :)
5 - I find this reporting unusual. Unless you run actoss WAN network speed should rarely be the issue, especially for group by with join. Also positioning with LIMIT is very suboptimal for large result sets - LIMIT 10000,10000 means first 10000 rows will also be generated and thrown away. When you come to LIMIT 1000000,10000 you may have your queries doing 100 times more job then you need.
7 - Joins are expensive on large data sets (which do not fit in memory) and denormalized data actually performs much better. Even for in memory data joins are the most expensive access type - try to see how much rows/sec you can generate from eq join of 5 tables vs how much rows you get doing full table scan of single table :)