02.21.08
Use your database, don’t abuse it
Jay Pipes blogged about this earlier this month. I read his post, nodded a knowing nod and finished my cup of coffee. It wasn’t until yesterday when it really struck me. I was looking through the database schema of a popular forum software package (won’t say which) when I came across this column definition in a central table (by ‘central’ I mean one that typically holds a lot of rows):
ip varchar(100),
One hundred characters to store a at-most 15-character IPv4 string? I know what you’re about to say, “But, it’s a varchar so you’re not really using 100 bytes.” Well, you *might* not be. Jay’s discussion on how MySQL uses temporary tables illustrates how you can use all 100 bytes in memory - not a good thing.
So, the minimum number of bytes you will be using is 7 if each octet is a single digit (don’t forget the ‘.’ separators.) At most, you use 15 bytes. If you store the IP as an unsigned integer, you use, at most, 4 bytes per row. Those bytes really start to add up, especially when pulling the data into memory to satisfy queries.
But wait…what if it stores the fully qualified domain name? I didn’t dive into the PHP to see if it stores IP addresses or hostname or both in this column. Is 100 characters enough? Probably. Some of the hostnames from the major ISPs are decently long but I don’t think I’ve seen one that’s 100 characters. I did set up a site for a client that was 61 characters long (thank goodness for bookmarks.)
This begs the question: is it better to store the IP address or hostname…or both? Resolving an IP to hostname or vice versa is an “expensive” process, whether you do it in your code at the time of storage or as a batch process later on. This is a question for another day.
Where does all this lead? You need to clearly define the data you want to store, then be frugal with the data type used to store it.

http://hubert.roksor.myopenid.com/ said,
February 22, 2008 at 12:27 am
I guess they want to be able to store IPv6 addresses? In which case they should have used varchar(39) but at least it would make sense as to why use a varchar instead of an int.
http://sure.would.myopenid.com/ said,
February 22, 2008 at 12:56 pm
IPv4 can be converted to 4 bytes and IPv6 into 16.
So 16 bytes expressed as hexadecimal is 32.
A good idea would be to use triggers or a stored procedure to perform the conversion.
One can certainly leave out the colons at the point that the data has a handler.
My conclusion would be to use char(32) and store v4 as converted v6.
This provides data consistency. The conversion should not be a big deal as the concepts are clearly covered by the IPv6 standards documents.
themattreid.pip.verisignlabs.com said,
February 22, 2008 at 1:13 pm
I think that’s a very good point for well tuned production software - or rather production software that should be tuned as you found here. However, some things to take into account:
1. there is someone out there somewhere that is going to try to put in a massive URL > 100 chars perhaps to mess around or they just have a long string of sub-domains.
2. the 100 part of the varchar(100) is dependent upon the sql mode - if the mode is not strict then data larger than 100 will be allowed to be inserted, although showing a warning. To prevent that, using strict mode, it will error on insert
3. If you’re defining an INT(x) column the x isn’t a matter of storage but rather a value for display width of the column value. So INT(2) will use the same memory as INT(32) in regard to storage.
awfief@livejournal said,
February 23, 2008 at 7:33 am
Did you do a PROCEDURE_ANALYSE() to figure out the data that’s actually being put in there?