06.26.08
Posted in Technology, Linux, MySQL at 4:07 pm by Stoner
A bit of advice to anyone wanting to write an article on MySQL that includes setting up users: familiarize yourself with the concept of Least Privileges. That is, only grant those privileges absolutely necessary to do a job and nothing more. I just finished reading an article on how to set up RSyslog to log to a MySQL database. Halfway through the article is a listing showing the grant statement. I’ll share just the fun part:
grant ALL ON Syslog.* …
My first reaction when I see a “grant all” is to ask: why? Why does an application need every database privilege? Well, I finished the article, then went to the RSyslog web site and spent all of 4 minutes researching why the app needs so much privilege. As it turns out, it doesn’t. Right there in blank and white:
“It is sufficient to grant it INSERT privileges to the systemevents table, only.”
Whew. Okay. The app doesn’t need all those privileges. When I checked the author’s credentials, he presents himself as someone who manages firewalls, antispam and antivirus systems - someone you would hope would be security-conscious and aware of privileges. Apparently not - or at least, not when it comes to databases. I couldn’t find an email address to write to the author nor does the web site have a form to post feedback on articles so I can’t point this out nor the fact that he completely left out how to configure RSyslog to connect to MySQL.
<sigh>
Permalink
06.24.08
Posted in Technology, Linux at 5:51 pm by Stoner
The Redhat Summit was a success for me. I got a lot of useful information from the presentations I attended, especially the “Performance tuning Jboss Enterprise Application Platform” session given by Andy Miller. Andy talked about tuning many of the pools in Jboss, like the thread pool and database connection pools. The juiciest piece of info was running Jboss with HugeTLB. We’re going to hit that one hard in terms of testing, validating and deploying in our production environments.
The “Optimizing the SOA Enterprise: Using Jboss and Redhat Enterprise Linux virtualization” session, given by Isaac Christofersen of Booz Allen Hamilton, was also very interesting. The built a cluster of Xen VMs, then build a cluster of Jboss instances in it. They used GFS on top of iSCSI LUNs for shared mounts. This got them away from HBA cards and deploying a separate SAN fabric. The trickiest part they encountered was setting up VLANs on the VM bridges. They had to modify a script or two to get it to work correctly. The benefit they got was being able to provision a Jboss app server in minutes on a VM infrastructure. Everything was highly available (a cluster in a cluster) and they actually realized greater performance then Jboss app servers sitting on real hardware.
The “What’s the fuss about fastboot and the new kernel crash dumping?” session, given by Vivek Goyal of Redhat, was another very interesting session. It talked about kdump and kexec, booting another kernel without rebooting the hardware and more. The ability to jump from one kernel into another (kexec) provides some really good benefits. kdump is built on kexec and provides a great framework for kernel developers and dealing with some really tricky kernel crashes. If you have unexplained kernel crashe, look into this as it stands to provide more and better information for resolving issues.
Finally, the “Augeas: A Linux configuration API” session, given by David Lutterkort of Redhat, was good. Augeas will fill a hole in Linux nicely, if they can come up a way to effectively manage “complex” configuration files, like DHCP, Apache and any XML-formatted file. Augeas uses regular language to pick apart and piece back together a config file and those mentioned do not lend themselves to simple regular expressions.
I went to other sessions but they were either a wash (mostly things I already knew but was deceived by a poorly written description) or a dog & pony show trying to sell me products/services.
Other highlights of the Summit:
- IBM sponsored Wednesday night’s dinner at Fenway Park. Fenway is a great baseball stadium. I got some grainy photos on my cell phone and sat on the Green Monster.
- Redhat provided lunch during the conference, which meant I could spend more on dinner.
- Schwag I got: a Redhat backpack, a baseball with Redhat and IBM’s logo on it, a plushTux penguin from Trusted Computer Systems, Inc., a tee shirt from QLogic, a foam Tux penguin rom R1Soft, a 1 GB USB drive from the Fedora Project, a cheesy monitor duster from BakBone Software
- The seafood in Boston is great. I had several varieties from chowder, which were all better than anythig you’ll get from a can. Oyster bars rule! Free wifi rules over $10.95/day hotel wired connections (always use a VPN for any network connections, regardless.)
Permalink
04.17.08
Posted in Technology, MySQL at 4:31 pm by Stoner
I was curious about some of the databases I get to work with. “How do my clients store their data? What data types are most prevalent?” Well, a few keystrokes later, I had my answers:
SELECT c.data_type, count(c.data_type) AS frequency
FROM information_schema.columns AS c
INNER JOIN information_schema.tables AS t
ON c.table_schema = t.table_schema AND
c.table_name = t.table_name
WHERE c.table_schema NOT IN ('information_schema','mysql') AND
t.table_type = 'base table'
GROUP BY data_type;
Which gave me a nice “data type distribution” table:
| data_type |
frequency |
| blob |
7 |
| char |
611 |
| date |
85 |
| datetime |
125 |
| decimal |
133 |
| double unsigned |
1 |
| enum |
677 |
| float |
5 |
| int |
2334 |
| mediumblob |
21 |
| mediumint |
3 |
| set |
9 |
| smallint |
7 |
| text |
57 |
| time |
551 |
| timestamp |
20 |
| tinyint |
6 |
| varchar |
946 |
Not content with just that, I whipped up another statement to show me the top 10 tables that used a particular data type (replace the WHATEVER with the data type you’re interested in) :
SELECT c.table_schema, c.table_name, count(c.data_type) AS count
FROM information_schema.columns c
INNER JOIN information_schema.tables AS t
ON c.table_schema = t.table_schema AND
c.table_name = t.table_name
WHERE c.table_schema NOT IN ('information_schema','mysql') AND
t.table_type = 'base table' AND
c.data_type = 'WHATEVER'
GROUP BY c.table_schema, c.table_name
ORDER BY count DESC
LIMIT 10;
Which showed me:
| table_schema |
table_name |
count |
| database_1 |
object_payment |
14 |
| database_2 |
object_payment |
12 |
| database_3 |
object_payment |
11 |
| database_2 |
object_space |
9 |
| database_1 |
object_space |
9 |
| database_1 |
aggregation_table_1 |
8 |
| database_2 |
aggregation_table_1 |
8 |
| database_2 |
aggregation_table_3 |
7 |
| database_1 |
aggregation_table_4 |
7 |
| database_1 |
object_user |
7 |
Finally, I wrote them up into stored procedure form (and included a 3rd that gives the data type distribution for specified databases) and committed them to my project on Sourceforge.
Permalink
03.18.08
Posted in Technology, MySQL at 8:48 pm by Stoner
I downloaded Maatkit 1877 and installed it on a data archive server, then did a parallel dump of the data. I also did a regular dump of the same data using mysqldump. This isn’t a benchmark or even a real comparison, just me playing with Maatkit to get a feel for it and see how it can make my life easier. First, some stats about the database:
mysql> select table_name, table_rows from information_schema.tables where table_schema = 'somedb';
+------------+------------+
| table_name | table_rows |
+------------+------------+
| table1 | 24758808 |
| table2 | 33713412 |
| table3 | 221633443 |
| table4 | 56869734 |
| table5 | 14017081 |
| table6 | 77559278 |
| table7 | 48453337 |
+--------+------------+
mysql> call database_size_profile('somedb')\G
*************************** 1. row ***************************
myisam_index_size: 15603748864
myisam_data_size: 14895580924
innodb_index_size: NULL
innodb_data_size: NULL
total_index_size: 15603748864
total_data_size: 14895580924
database_size: 30499329788
1 row in set (0.00 sec)
The dump using maatkit
$ mk-parallel-dump --tab --databases somedb --no-gzip
default: 7 tables, 7 chunks, 7 successes, 0 failures, 1239.59 wall-clock time, 3558.11 dump time
$ ls -l default/somedb/
-rw-rw-rw- 1286 Mar 18 18:25 table1.000000.sql
-rw-rw-rw- 847081157 Mar 18 18:28 table1.000000.txt
-rw-rw-rw- 1274 Mar 18 18:25 table2.000000.sql
-rw-rw-rw- 1131602870 Mar 18 18:29 table2.000000.txt
-rw-rw-rw- 1304 Mar 18 18:25 table3.000000.sql
-rw-rw-rw- 8508494974 Mar 18 18:45 table3.000000.txt
-rw-rw-rw- 1318 Mar 18 18:25 table4.000000.sql
-rw-rw-rw- 2327666926 Mar 18 18:32 table4.000000.txt
-rw-rw-rw- 1213 Mar 18 18:28 table5.000000.sql
-rw-rw-rw- 441408946 Mar 18 18:29 table5.000000.txt
-rw-rw-rw- 1431 Mar 18 18:29 table6.000000.sql
-rw-rw-rw- 4001690723 Mar 18 18:45 table6.000000.txt
-rw-rw-rw- 1339 Mar 18 18:29 table7.000000.sql
-rw-rw-rw- 1625718296 Mar 18 18:35 table7.000000.txt
$ du -sh default/
18G default/
The dump using mysqldump.
$ mysqldump --user=joe -p --routines --trigger --quick --complete-insert --master-data=2 somedb > somedb.18mar08.sql
Elapsed time to dump: 2676
$ ls -l
-rw-r--r-- 21905573763 Mar 18 19:49 somedb.18mar08.sql
The server has two dual-core athlons and 8 gigs of ram, reading from a fiber-attached SAN and writing to local SCSI drives (RAID-1.) I haven’t performed a restore yet. That’s next but not tonight. I have a code push to deal with first.
Permalink
03.07.08
Posted in Technology, MySQL at 8:52 pm by Stoner
I learned something new about myisamchk. When doing a recovery of the data file, it creates not one but two temporary files. I knew about the TMD it creates in $datadir when rebuilding the MYD file but it creates a second temporary file using an old and rather annoying technique (which involves creating a file using the standard C ‘open()’ call, then deleting the file without closing the file descriptor.) I discovered this when myisamchk told me “Disk is full writing ‘/tmp/STQ3p9UF’ (Errcode: 28).” I looked in /tmp and didn’t see that file. So I did a
# lsof | grep delete
and low-and-behold, there it was. You know, as a sys admin, that really annoys me. I don’t care that it creates temporary files to do its thing. In fact, I expect it, but I at least want to know about it so I can plan accordingly. My /tmp partition only had 2.5 gigs of space and the table’s MYD was over 4 gigs in size. If I had known about this extra temp file, I would have used –tmpdir=/some/place/else before I started and not had myisamchk crap out on me in the first place.
Permalink
02.26.08
Posted in Technology, Linux, MySQL at 5:49 pm by Stoner
At $WORK we typically use Redhat Cluster to make MySQL highly available. We have a number of 2-node clusters (active/passive) and it works nicely, for the most part. One thing I find very annoying is RHCS has no notification framework to notify you when a service is relocated to another node (because of failure or otherwise.)
In a perfect world, the only reason MySQL would pass between nodes is for server maintenance, under the control of a human. In reality, crap happens. Redhat Cluster manages all this, but I still want to know that a failover happened (is hardware going bad? did a software bug cause a failure? etc.)
Most monitoring tools weren’t designed to report on a clustered service. I can add a check that connects to MySQL using the cluster-managed IP - but that only tells me if MySQL has failed completely. The failover between nodes generally takes under a minute so it’s easy for that type of monitoring check to miss a failover. Parsing syslog output is less than optimal because syslog is unstructured text. Blech. Process and disk checks can’t be used on each node because they’ll only exist on one node, the active node.
My current solution is a bit of a kludge but it works. We configure MySQL to use standard names for binary logs and error logs (they never change, regardless of which node the database is running on.) The only file name that isn’t standardized is the pid file, which defaults to the hostname of the server running MySQL. So, with that bit of info, I wrote a script that connects to MySQL using the cluster-managed IP and issues the query “show variables like ‘pid_file’” and parses the result. It then compares the hostname against a stored value in a file, which contains the hostname of the current active node. If the hostname changed, then MySQL has changed nodes (whether due to failure or manual relocation.) This changing of hostnames is considered an error condition and the monitoring system sends alerts appropriately. The NOC staff consult procedures to determine the level of failure and call people if necessary. Clearing the error requires editing the file and setting the current hostname as the active node.
What would be nice would be some informational functions in MySQL to give a bit more data about the server itself, things like hostname, IP address MySQL is bound to, port, etc. Then I could use “select hostname()” or similar instead of relying on a file name. Before someone says “you connected to the database so you know all that already,” not all of this information is known to a database connection, particularly if your application uses a database connection pool or the connection goes through a proxy, masking the real database.
Permalink
02.25.08
Posted in Technology, Linux at 6:10 pm by Stoner
I always get a chuckle when I see a new product released and within hours/days/weeks, there is a Slashdot article about someone porting the Linux kernel to it. What is this fascination with running Linux on everything?
“Hey John, I just picked up this safety pin at Worst Buy!”
“Cool, Bob. Let’s put Linux on it!”
Personally, I don’t care that you can install Linux on a Roomba or some other device, complete with Asterisk, and control a Roomba to vacuum your house with your cell phone (yes, someone actually did this.) I don’t foresee myself in a hotel room halfway around the world thinking to myself “I wish there were a way I could vacuum my house right now so it’d be clean when I get home.”
The way I feel about this is:
- Just because you can does not mean you should.
- If you do, then don’t complain if hardware doesn’t work right and the vendor doesn’t give you any help.
Permalink
02.21.08
Posted in Technology, MySQL at 11:49 pm by Stoner
Jay Pipes blogged about this earlier this month. I read his post, nodded a knowing nod and finished my cup of coffee. It wasn’t until yesterday when it really struck me. I was looking through the database schema of a popular forum software package (won’t say which) when I came across this column definition in a central table (by ‘central’ I mean one that typically holds a lot of rows):
ip varchar(100),
One hundred characters to store a at-most 15-character IPv4 string? I know what you’re about to say, “But, it’s a varchar so you’re not really using 100 bytes.” Well, you *might* not be. Jay’s discussion on how MySQL uses temporary tables illustrates how you can use all 100 bytes in memory - not a good thing.
So, the minimum number of bytes you will be using is 7 if each octet is a single digit (don’t forget the ‘.’ separators.) At most, you use 15 bytes. If you store the IP as an unsigned integer, you use, at most, 4 bytes per row. Those bytes really start to add up, especially when pulling the data into memory to satisfy queries.
But wait…what if it stores the fully qualified domain name? I didn’t dive into the PHP to see if it stores IP addresses or hostname or both in this column. Is 100 characters enough? Probably. Some of the hostnames from the major ISPs are decently long but I don’t think I’ve seen one that’s 100 characters. I did set up a site for a client that was 61 characters long (thank goodness for bookmarks.)
This begs the question: is it better to store the IP address or hostname…or both? Resolving an IP to hostname or vice versa is an “expensive” process, whether you do it in your code at the time of storage or as a batch process later on. This is a question for another day.
Where does all this lead? You need to clearly define the data you want to store, then be frugal with the data type used to store it.
Permalink
01.29.08
Posted in Technology, MySQL at 9:35 pm by Stoner
Finally got around to creating a project on Sourceforge: MyHelper. Right now, it’s just some stored procedures and functions for determining data and index sizes for databases using the MyISAM and InnoDB engines. Each one is individually licensed (using the BSD license.) I have 2 more stored procedures to add (return a list of tables with no primary key and a list of tables with the primary key) but I haven’t committed my changes to Subversion yet because I’m still testing them. I’ll be adding more code as I refine the stuff I’ve written over the years (their current state is highly environment-specific so I need to “genericise” them - if that’s a word.)
A word about licensing. I’m opting to license each script/procedure/function individually. This is because the project, as a whole, is made up of individual pieces that may or may not have a dependency on each other and the project is not a “software package” that you install and use. Each script, stored procedure or function may stand on its own and you have the option to use what you want. It’s an ala carte repository of stuff. Use what you want, leave behind what you don’t need. Putting a license on the whole package didn’t make sense given this distribution model so each has its own license.
So, if you develop with MySQL or manage MySQL databases, check ‘em out and let me know what you think.
Permalink
01.28.08
Posted in Technology, Linux at 12:45 pm by Stoner
Open Source means freedom. Freedom to change the product to do what you want it to. Freedom to make your version available to others. Freedom to integrate your changes back into the original. This freedom doesn’t come without a price.
That price is proliferation and decay. I’ll give you a prime example, the GNU/Linux operating system. DistroWatch watches over 350 Operating systems (some are BSD and not GNU/Linux - getting exact numbers isn’t important at this time.) Some are geared toward specific uses, such as use as a firewall, use as a forensic analysis toolset, use as a data center server system or use as a desktop system, to name a few. The problems become evident:
- Which do you use? How do you evaluate and compare distributions?
- How is the distribution governed? Is it one or a few people “throwing darts at a dart board” or is there a formal governing body overseeing the project? How stable is the governing body? What are their motivations?
- What happens if support for a distribution falls off? Do you pick up maintenance? Do you switch? Do you do nothing and hope/pray that someone else will swoop in and pick up the maintenance?
- What happens if someone forks the code to implement a feature or fix you need? Do you switch completely to the new branch? Do you back-port the feature to the branch you’re on?
- What happens if a developer (either for the OS or in the company using the OS) dies or leaves for greener pastures? Who is going to take over their position? Will the replacement follow the road-map or take the project in a new direction?
These are not easy questions to answer. For small and mid-sized businesses, they may be forced to stick with the 800 lb gorillas. Going with a smaller or lesser known distribution may be too much of a business risk. If the 800 lb gorillas don’t offer what you need and a less-known distribution does, do you take that risk? These questions aren’t limited to the operating system, they can be asked of any OSS project, like OpenOffice.org, Audacity, Apache, GNU Cash, etc.
Open Source is great at leveling playing fields, offering choice and spurring innovation but all that comes with a price that some may not be able or willing to pay. Choose wisely.
Permalink
« Previous entries