04.17.08

ints and blobs and varchars, oh my!

Posted in Technology, MySQL at 4:31 pm by Stoner

I was curious about some of the databases I get to work with. “How do my clients store their data? What data types are most prevalent?” Well, a few keystrokes later, I had my answers:

SELECT c.data_type, count(c.data_type) AS frequency
FROM information_schema.columns AS c
INNER JOIN information_schema.tables AS t
ON c.table_schema = t.table_schema AND
c.table_name = t.table_name
WHERE c.table_schema NOT IN ('information_schema','mysql') AND
t.table_type = 'base table'
GROUP BY data_type;

Which gave me a nice “data type distribution” table:

data_type frequency
blob 7
char 611
date 85
datetime 125
decimal 133
double unsigned 1
enum 677
float 5
int 2334
mediumblob 21
mediumint 3
set 9
smallint 7
text 57
time 551
timestamp 20
tinyint 6
varchar 946

Not content with just that, I whipped up another statement to show me the top 10 tables that used a particular data type (replace the WHATEVER with the data type you’re interested in) :

SELECT c.table_schema, c.table_name, count(c.data_type) AS count
FROM information_schema.columns c
INNER JOIN information_schema.tables AS t
ON c.table_schema = t.table_schema AND
c.table_name = t.table_name
WHERE c.table_schema NOT IN ('information_schema','mysql') AND
t.table_type = 'base table' AND
c.data_type = 'WHATEVER'
GROUP BY c.table_schema, c.table_name
ORDER BY count DESC
LIMIT 10;

Which showed me:

table_schema table_name count
database_1 object_payment 14
database_2 object_payment 12
database_3 object_payment 11
database_2 object_space 9
database_1 object_space 9
database_1 aggregation_table_1 8
database_2 aggregation_table_1 8
database_2 aggregation_table_3 7
database_1 aggregation_table_4 7
database_1 object_user 7

Finally, I wrote them up into stored procedure form (and included a 3rd that gives the data type distribution for specified databases) and committed them to my project on Sourceforge.

03.18.08

Will Maatkit make me sane?

Posted in Technology, MySQL at 8:48 pm by Stoner

I downloaded Maatkit 1877 and installed it on a data archive server, then did a parallel dump of the data. I also did a regular dump of the same data using mysqldump. This isn’t a benchmark or even a real comparison, just me playing with Maatkit to get a feel for it and see how it can make my life easier. First, some stats about the database:

mysql> select table_name, table_rows from information_schema.tables where table_schema = 'somedb';
+------------+------------+
| table_name | table_rows |
+------------+------------+
| table1 | 24758808 |
| table2 | 33713412 |
| table3 | 221633443 |
| table4 | 56869734 |
| table5 | 14017081 |
| table6 | 77559278 |
| table7 | 48453337 |
+--------+------------+

mysql> call database_size_profile('somedb')\G
*************************** 1. row ***************************
myisam_index_size: 15603748864
 myisam_data_size: 14895580924
innodb_index_size: NULL
 innodb_data_size: NULL
 total_index_size: 15603748864
  total_data_size: 14895580924
  database_size: 30499329788
1 row in set (0.00 sec)

The dump using maatkit

$ mk-parallel-dump --tab --databases somedb --no-gzip
 default: 7 tables, 7 chunks, 7 successes, 0 failures, 1239.59 wall-clock time, 3558.11 dump time

$ ls -l default/somedb/
-rw-rw-rw- 1286 Mar 18 18:25 table1.000000.sql
-rw-rw-rw- 847081157 Mar 18 18:28 table1.000000.txt
-rw-rw-rw- 1274 Mar 18 18:25 table2.000000.sql
-rw-rw-rw- 1131602870 Mar 18 18:29 table2.000000.txt
-rw-rw-rw- 1304 Mar 18 18:25 table3.000000.sql
-rw-rw-rw- 8508494974 Mar 18 18:45 table3.000000.txt
-rw-rw-rw- 1318 Mar 18 18:25 table4.000000.sql
-rw-rw-rw- 2327666926 Mar 18 18:32 table4.000000.txt
-rw-rw-rw- 1213 Mar 18 18:28 table5.000000.sql
-rw-rw-rw- 441408946 Mar 18 18:29 table5.000000.txt
-rw-rw-rw- 1431 Mar 18 18:29 table6.000000.sql
-rw-rw-rw- 4001690723 Mar 18 18:45 table6.000000.txt
-rw-rw-rw- 1339 Mar 18 18:29 table7.000000.sql
-rw-rw-rw- 1625718296 Mar 18 18:35 table7.000000.txt

$ du -sh default/
18G default/

The dump using mysqldump.

$ mysqldump --user=joe -p --routines --trigger --quick --complete-insert --master-data=2 somedb > somedb.18mar08.sql

Elapsed time to dump: 2676

$ ls -l
-rw-r--r-- 21905573763 Mar 18 19:49 somedb.18mar08.sql

The server has two dual-core athlons and 8 gigs of ram, reading from a fiber-attached SAN and writing to local SCSI drives (RAID-1.) I haven’t performed a restore yet. That’s next but not tonight. I have a code push to deal with first.

03.07.08

An odd thing happened on my way to recovery

Posted in Technology, MySQL at 8:52 pm by Stoner

I learned something new about myisamchk. When doing a recovery of the data file, it creates not one but two temporary files. I knew about the TMD it creates in $datadir when rebuilding the MYD file but it creates a second temporary file using an old and rather annoying technique (which involves creating a file using the standard C ‘open()’ call, then deleting the file without closing the file descriptor.) I discovered this when myisamchk told me “Disk is full writing ‘/tmp/STQ3p9UF’ (Errcode: 28).” I looked in /tmp and didn’t see that file. So I did a

# lsof | grep delete

and low-and-behold, there it was. You know, as a sys admin, that really annoys me. I don’t care that it creates temporary files to do its thing. In fact, I expect it, but I at least want to know about it so I can plan accordingly. My /tmp partition only had 2.5 gigs of space and the table’s MYD was over 4 gigs in size. If I had known about this extra temp file, I would have used –tmpdir=/some/place/else before I started and not had myisamchk crap out on me in the first place.

02.26.08

Monitoring MySQL and Redhat Cluster

Posted in Technology, Linux, MySQL at 5:49 pm by Stoner

At $WORK we typically use Redhat Cluster to make MySQL highly available. We have a number of 2-node clusters (active/passive) and it works nicely, for the most part. One thing I find very annoying is RHCS has no notification framework to notify you when a service is relocated to another node (because of failure or otherwise.)

In a perfect world, the only reason MySQL would pass between nodes is for server maintenance, under the control of a human. In reality, crap happens. Redhat Cluster manages all this, but I still want to know that a failover happened (is hardware going bad? did a software bug cause a failure? etc.)

Most monitoring tools weren’t designed to report on a clustered service. I can add a check that connects to MySQL using the cluster-managed IP - but that only tells me if MySQL has failed completely. The failover between nodes generally takes under a minute so it’s easy for that type of monitoring check to miss a failover. Parsing syslog output is less than optimal because syslog is unstructured text. Blech. Process and disk checks can’t be used on each node because they’ll only exist on one node, the active node.

My current solution is a bit of a kludge but it works. We configure MySQL to use standard names for binary logs and error logs (they never change, regardless of which node the database is running on.) The only file name that isn’t standardized is the pid file, which defaults to the hostname of the server running MySQL. So, with that bit of info, I wrote a script that connects to MySQL using the cluster-managed IP and issues the query “show variables like ‘pid_file’” and parses the result. It then compares the hostname against a stored value in a file, which contains the hostname of the current active node. If the hostname changed, then MySQL has changed nodes (whether due to failure or manual relocation.) This changing of hostnames is considered an error condition and the monitoring system sends alerts appropriately. The NOC staff consult procedures to determine the level of failure and call people if necessary. Clearing the error requires editing the file and setting the current hostname as the active node.

What would be nice would be some informational functions in MySQL to give a bit more data about the server itself, things like hostname, IP address MySQL is bound to, port, etc. Then I could use “select hostname()” or similar instead of relying on a file name. Before someone says “you connected to the database so you know all that already,” not all of this information is known to a database connection, particularly if your application uses a database connection pool or the connection goes through a proxy, masking the real database.

02.25.08

Should Linux be everywhere?

Posted in Technology, Linux at 6:10 pm by Stoner

I always get a chuckle when I see a new product released and within hours/days/weeks, there is a Slashdot article about someone porting the Linux kernel to it. What is this fascination with running Linux on everything?

“Hey John, I just picked up this safety pin at Worst Buy!”

“Cool, Bob. Let’s put Linux on it!”

Personally, I don’t care that you can install Linux on a Roomba or some other device, complete with Asterisk, and control a Roomba to vacuum your house with your cell phone (yes, someone actually did this.) I don’t foresee myself in a hotel room halfway around the world thinking to myself “I wish there were a way I could vacuum my house right now so it’d be clean when I get home.”

The way I feel about this is:

  1. Just because you can does not mean you should.
  2. If you do, then don’t complain if hardware doesn’t work right and the vendor doesn’t give you any help.

02.21.08

Use your database, don’t abuse it

Posted in Technology, MySQL at 11:49 pm by Stoner

Jay Pipes blogged about this earlier this month. I read his post, nodded a knowing nod and finished my cup of coffee. It wasn’t until yesterday when it really struck me. I was looking through the database schema of a popular forum software package (won’t say which) when I came across this column definition in a central table (by ‘central’ I mean one that typically holds a lot of rows):

ip varchar(100),

One hundred characters to store a at-most 15-character IPv4 string? I know what you’re about to say, “But, it’s a varchar so you’re not really using 100 bytes.” Well, you *might* not be. Jay’s discussion on how MySQL uses temporary tables illustrates how you can use all 100 bytes in memory - not a good thing.

So, the minimum number of bytes you will be using is 7 if each octet is a single digit (don’t forget the ‘.’ separators.) At most, you use 15 bytes. If you store the IP as an unsigned integer, you use, at most, 4 bytes per row. Those bytes really start to add up, especially when pulling the data into memory to satisfy queries.

But wait…what if it stores the fully qualified domain name? I didn’t dive into the PHP to see if it stores IP addresses or hostname or both in this column. Is 100 characters enough? Probably. Some of the hostnames from the major ISPs are decently long but I don’t think I’ve seen one that’s 100 characters. I did set up a site for a client that was 61 characters long (thank goodness for bookmarks.)

This begs the question: is it better to store the IP address or hostname…or both? Resolving an IP to hostname or vice versa is an “expensive” process, whether you do it in your code at the time of storage or as a batch process later on. This is a question for another day.

Where does all this lead? You need to clearly define the data you want to store, then be frugal with the data type used to store it.

02.08.08

My thoughts on the Superb*wl

Posted in Life at 1:40 pm by Stoner

Tom Petty and the Heartbreakers still rock.

01.29.08

My MySQL project

Posted in Technology, MySQL at 9:35 pm by Stoner

Finally got around to creating a project on Sourceforge: MyHelper. Right now, it’s just some stored procedures and functions for determining data and index sizes for databases using the MyISAM and InnoDB engines. Each one is individually licensed (using the BSD license.) I have 2 more stored procedures to add (return a list of tables with no primary key and a list of tables with the primary key) but I haven’t committed my changes to Subversion yet because I’m still testing them. I’ll be adding more code as I refine the stuff I’ve written over the years (their current state is highly environment-specific so I need to “genericise” them - if that’s a word.)
A word about licensing. I’m opting to license each script/procedure/function individually. This is because the project, as a whole, is made up of individual pieces that may or may not have a dependency on each other and the project is not a “software package” that you install and use. Each script, stored procedure or function may stand on its own and you have the option to use what you want. It’s an ala carte repository of stuff. Use what you want, leave behind what you don’t need. Putting a license on the whole package didn’t make sense given this distribution model so each has its own license.

So, if you develop with MySQL or manage MySQL databases, check ‘em out and let me know what you think.

01.28.08

It’s good and bad

Posted in Technology, Linux at 12:45 pm by Stoner

Open Source means freedom. Freedom to change the product to do what you want it to. Freedom to make your version available to others. Freedom to integrate your changes back into the original. This freedom doesn’t come without a price.

That price is proliferation and decay. I’ll give you a prime example, the GNU/Linux operating system. DistroWatch watches over 350 Operating systems (some are BSD and not GNU/Linux - getting exact numbers isn’t important at this time.) Some are geared toward specific uses, such as use as a firewall, use as a forensic analysis toolset, use as a data center server system or use as a desktop system, to name a few. The problems become evident:

  • Which do you use? How do you evaluate and compare distributions?
  • How is the distribution governed? Is it one or a few people “throwing darts at a dart board” or is there a formal governing body overseeing the project? How stable is the governing body? What are their motivations?
  • What happens if support for a distribution falls off? Do you pick up maintenance? Do you switch? Do you do nothing and hope/pray that someone else will swoop in and pick up the maintenance?
  • What happens if someone forks the code to implement a feature or fix you need? Do you switch completely to the new branch? Do you back-port the feature to the branch you’re on?
  • What happens if a developer (either for the OS or in the company using the OS) dies or leaves for greener pastures? Who is going to take over their position? Will the replacement follow the road-map or take the project in a new direction?

These are not easy questions to answer. For small and mid-sized businesses, they may be forced to stick with the 800 lb gorillas. Going with a smaller or lesser known distribution may be too much of a business risk. If the 800 lb gorillas don’t offer what you need and a less-known distribution does, do you take that risk? These questions aren’t limited to the operating system, they can be asked of any OSS project, like OpenOffice.org, Audacity, Apache, GNU Cash, etc.

Open Source is great at leveling playing fields, offering choice and spurring innovation but all that comes with a price that some may not be able or willing to pay. Choose wisely.

01.02.08

New year, no resolutions

Posted in Life at 4:13 pm by Stoner

Why do people make a big deal about New Year’s resolutions? The percentage of people who keep their resolutions is absurdly low. You know people won’t keep them. It’s like this, people: if something is that important to you (loosing weight, quit smoking, quit drinking, quit/stop/whatever) then you’ll do it. If you need a once-a-year holiday to get up enough gumption to make a change in your life, then guess what? It isn’t that important to you and you will fail.

Stop fooling yourself. If you want to loose weigh, you will only loose weigh when loosing it becomes important in your life. You’ll quit smoking when stopping becomes important enough to stop.

« Previous entries