Web Performance and Scalability with MySQL

Some of these tips may conflict with each other, and not all of them apply to everyone.

1) think horizontal -- everything, not just the web servers. Micro optimizations are boring, as are other details.
2) benchmarking techniques: not "how fast" but "how many". test force, not speed.
3) bigger and faster vertical scaling is the enemy.
4) horizontal scaling = add another box
5) implementation: scale your system a few times, but scale your ARCHITECTURE dozens or hundreds of times.
6) start from the beginning with architecture implementation.
7) don't have "The server" for anything
8) stateless good, stateful bad
9) "shared nothing" good
10) don't keep state within app server
11) caching good.
12) generate static pages periodically; works well when there aren't millions of pages or changes.
13) cache full output in application
14) include cookies in the "cache key" so diff browsers can get diff info too
15) use cache when this, not when that
16) use regexp to insert customized content into the cached page
17) set Expires header to control cache times, or rewrite rule to generate page if the cached file does not exist (rails does this)
18) if content is dynamic this does not work, but great for caching "dynamic" images
19) partial pages -- pre-generate static page snippets, have handler just assemble pieces.
20) cache little snippets, ie sidebar
21) don't spend more time managing the cache than you save
22) cache data that's too slow to query, fetch, calc.
23) generate page from cached data
24) use same data to generate api responses
25) moves load to web servers
26) start with things you hit all the time
27) if you don't use it, don't cache it, check db logs
28) don't depend on MySQL Query cache unless it actually helps
29) local file system not so good because you copy page for every server
30) use process memory, not shared
31) mysql cache table -- fields: id is the "cache key", type is the "namespace", metadata for things like headers for cached http responses, purge_key to make it easier to delete data from cache (make it an index, too; primary index on id,type, expire index on expire field)
32) why 31 fails: how do you load balance? what if the mysql server dies? then there's no cache
33) but you can use mysql scaling techniques to deal, like dual-master replication
34) use memcached, like lj, slashdot, wikipedia -- memory based, linux 2.6 (epoll) or FreeBSD (kqueue), low overhead for lots of cxns, no master, simple!
35) how to scale the db horizontally, use MySQL, use replication to share the load, write to one master, read from many slaves, good for heavy read apps (or insert delayed, if you don't need to write right away) -- check out "High Performance MySQL"
36) relay slave replication if too much bandwidth on the master, use a replication slave to replicate to other slaves.
37) writing does not scale with replication -- all servers need to do the same writes. 5.1's row-level replication might help.
38) so partition the data, divide and conquer. separate cluster for different data sets
39) if you can't divide, use flexible partitioning: a global server keeps track of which "cluster" has what info. auto_increment columns only in the "global master". Aggressively cache "global master" data.
40) If you use a master-master setup like 39, then you don't have replication slaves, no latency from commit to data being available. if you are careful you can write to both masters. Make each user always use the same master, so primary keys won't be messed up. If one master fails, use the other one.
41) don't be afraid of the data duplication monster. use summary tables, to avoid things like COUNT(*) and GROUP BY. do it once, put result into a table -- do this periodically, or do it when the data is inserted. Or data affecting a "user" and a "group" goes into both the "user" and "group" partitions (clusters). so it's duplicating data.
42) but you can go further, and use summary dbs! copy data into special dbs optimized for special queries, ie FULLTEXT searches, anything spanning more than one or all clusters, different dbs for different latency requirements, ie RSS feeds from a replicated slave db (RSS feeds can be late).
43) save data to multiple "partitions" like the application doing manual replication -- app writes to 2 places OR last_updated and deleted columns, use triggers to add to "replication_queue" table, background program to copy data based on queue table or last_updated column
44) if you're running oracle, move read operations to MySQL with this manual replication idea. Good way to sneak MySQL into an oracle shop.
45) make everything repeatable, build summary and load scripts so they can restart or run again -- also have one trusted data place, so summaries and copies can be (re)created from there.
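
A rough sketch of the cache table from tip 31, to make the field list concrete. This is not from the slides: SQLite stands in for MySQL so it runs anywhere, the "data" column and the function names are my own assumptions, and on MySQL you would write REPLACE INTO instead of INSERT OR REPLACE.

```python
import sqlite3
import time

# Hypothetical schema following tip 31; SQLite stands in for MySQL here.
SCHEMA = """
CREATE TABLE cache (
    id        TEXT    NOT NULL,   -- the "cache key"
    type      TEXT    NOT NULL,   -- the "namespace"
    purge_key TEXT,               -- groups entries that get deleted together
    data      BLOB,               -- assumed payload column (not named in the notes)
    metadata  TEXT,               -- e.g. headers for cached http responses
    expire    INTEGER NOT NULL,   -- unix timestamp
    PRIMARY KEY (id, type)
);
CREATE INDEX cache_purge_key ON cache (purge_key);
CREATE INDEX cache_expire ON cache (expire);
"""


def open_cache():
    """Create an in-memory cache database with the schema above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def cache_set(conn, key, namespace, data, ttl, purge_key=None, metadata=None, now=None):
    """Store (or overwrite) an entry that expires ttl seconds from now."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO cache (id, type, purge_key, data, metadata, expire) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (key, namespace, purge_key, data, metadata, int(now + ttl)),
    )


def cache_get(conn, key, namespace, now=None):
    """Return the cached data, or None if it is missing or expired."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT data FROM cache WHERE id = ? AND type = ? AND expire > ?",
        (key, namespace, int(now)),
    ).fetchone()
    return row[0] if row else None


def cache_purge(conn, purge_key):
    """Delete every entry that shares the given purge_key."""
    conn.execute("DELETE FROM cache WHERE purge_key = ?", (purge_key,))
```

The purge_key is what makes invalidation cheap: a page and a sidebar snippet for the same user can both carry something like "user:42" and be dropped with one indexed DELETE.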

BREATHE! HALFWAY THERE!!

46) use innodb because it's more robust. except for big read-only tables, high volume streaming tables (logging), locked tables or INSERT DELAYED, specialized engines for special needs, and more engines in the future -- but for now, InnoDB
47) Multiple MySQL instances -- run diff instances for diff workloads, even if they share the same server. moving to separate hardware is easier, of course. optimize the server instance for the workload. easy to set up with instance manager or mysqld_multi, and there are init scripts that support the instance manager.
48) asynchronous data loading when you can -- if you're updating counts or loading logs, send updates through Spread (or whatever messaging something) to a daemon loading data. Don't update for each request (ie, counts), do it every 1000 updates, or every few minutes. This helps if db loses net connection, the frontend keeps running! or if you want to lock tables, etc.
49) preload, dump and process -- let the servers pre-process, as much as possible. dump never changing data structures to js files for the client to cache (postal data maybe), or dump to memory, or use SQLite, or BerkeleyDB and rsync to each webserver, or mysql replica on webserver
50) stored procedures are dangerous because they're not horizontal, more work than just adding a webserver -- only use if it saves the db work (ie send 5 rows to app instead of 5,000 and parsing in app)
51) reconsider persistent db connections because it requires a thread = memory, all httpd processes talk to all dbs, lots of caching might mean you don't need main db, mysql cxns are fast so why not just reopen?
52) innodb_file_per_table, so OPTIMIZE TABLE clears unused space. innodb_buffer_pool_size set to 80% of total mem (dedicated mysql server). innodb_flush_log_at_trx_commit, innodb_log_file_size
53) have metadata in db, store images in filesystem, but then how do you replicate? or store images in myisam tables, split up so tables don't get bigger than 4G, so if gets corrupt fewer problems. metadata table might specify what table it's in. include last modified date in metadata, and use in URLs to optimize caching, ie with squid: /images/$timestamp/$id.jpg
54) do everything in unicode
55) UTC for everything
56) STRICT_TRANS_TABLES so MySQL is picky about bad input and does not just turn it to NULL or zero.
57) Don't overwork the DB -- dbs don't easily scale like web servers
58) STATELESS. don't make cookie id's easy to guess, or sequential, etc. don't save state on one server only, save it on every one. put the data in the db, don't put it in the cookie, that duplicates efforts. important data goes into the db, so it gets saved; unimportant transient data goes in memcache; SMALL data goes in the cookie. a shopping cart would go in db, background color goes in cookie, and last viewed items go in memcache
59) to make cookies safer, use checksums and timestamps to validate cookies. Encryption usually a waste of cycles.
60) use resources wisely. balance how you use hardware -- use memory to save I/O or CPU, don't swap memory to disk EVER.
61) do the work in parallel -- split work into smaller pieces and run on different boxes. send sub-requests off as soon as possible and do other stuff in the meantime.
62) light processes for light tasks -- thin proxy servers for "network buffers", goes between the user and your heavier backend application. Use httpd with mod_proxy, mod_backhand. the proxy does the 'net work, and fewer httpd processes are needed to do the real work, this saves memory and db connections. proxies can also serve static files and cache responses. Avoid starting main app as root. Load balancing, and very important if your background processes are "heavy". Very EASY to set up a light process. ProxyPreserveHost On in apache 2
63) job queues -- use queues, AJAX can make this easy. webserver submits job to database "queue", first avail worker picks up first job, and sends result to queue. or use gearman, Spread, MQ/Java Messaging Service(?)
64) log http requests to a database! log all 4xx and 5xx requests, great to see which requests are slow or fast. but only log 1-2% of all requests. Time::HiRes in Perl, microseconds from gettimeofday system call.
65) get good deals on servers http://www.siliconmechanics.com, server vendor of lj and others.
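
A small sketch of tip 59 (checksum plus timestamp on a cookie instead of encrypting it). The talk only said "checksums"; I'm assuming an HMAC-SHA256 keyed checksum here, and the secret, the "|" separator, and the function names are all made up for illustration.

```python
import hashlib
import hmac
import time

# Hypothetical server-side secret, shared by every web server (tip 58: stateless).
SECRET = b"change-me"


def _checksum(value, ts):
    """Keyed checksum over the cookie value and its timestamp."""
    return hmac.new(SECRET, f"{value}|{ts}".encode(), hashlib.sha256).hexdigest()


def sign_cookie(value, now=None):
    """Return 'value|timestamp|checksum' for sending to the browser."""
    ts = str(int(time.time() if now is None else now))
    return f"{value}|{ts}|{_checksum(value, ts)}"


def verify_cookie(cookie, max_age=3600, now=None):
    """Return the original value, or None if tampered with, malformed, or too old."""
    try:
        value, ts, mac = cookie.rsplit("|", 2)
        issued = int(ts)
    except ValueError:
        return None  # wrong number of parts or a non-numeric timestamp
    # Constant-time comparison, on bytes so odd characters can't raise.
    if not hmac.compare_digest(mac.encode(), _checksum(value, ts).encode()):
        return None
    now = time.time() if now is None else now
    if now - issued > max_age:
        return None
    return value
```

Any web server holding the same secret can validate the cookie without a db or memcache lookup, which fits the "small data in the cookie" advice in tip 58.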

IN SUMMARY: HORIZONTAL GOOD, VERTICAL BAD

for jobs: ask@develooper.com (jobs, moonlighters, perl/mysql etc)
slides will be up at http://develooper.com/talks/
Phew! That was a lot of fast typing (60 words per minute, baby!). Ask is smart, but QUICK!!!! His slides will be VERY useful when they appear. He said there were 53 tips, but I numbered each new line (and not smartly with OL and LI) and I have more than that...
