Archive for the 'databases' Category

Scratching an Itch: The Open Data Bank

Tuesday, March 11th, 2008

The Open Data BankEngineers, especially those of the software variety, have various types of projects to work on. Some pay the bills, some are for learning, some are to help others’ goals, and then there are the ones that we say “scratch an itch.”

It’s hard not to operate in a world of ideas without having a few of your own, and some ideas just keep popping up. If you’re lucky, someone else does it right and you can reap the benefits, but often you just have to go out and do it. These projects are often done at personal expense “to see if it works” or “because I can”, and not for fame or fortune. I have a few of these kicking around, and it was a new year’s resolution of mine to actual get some of them into the wild. So, as the first of these, I’d like to officially announce a new project that I’ve been working on (and one of the reasons for the lack of blog posts). The Open Data Bank.

The ODB is a simple idea. While tinkering with other projects, I’m often in need of data. Sometimes this is to test things out, sometimes it’s to get things started, but everytime it seems like I have to go and find it anew and coax it into some useful format. I assume that others like me have the same problem, and hopefully ODB will be a useful contribution to the tinkering ecosystem to complement other tools like open source libraries.

For the layman, the ODB is a place where we can put “open data”, that is to say, data that can be shared without restriction. Not only is the data open, but the formats it is shared in are open as well. Formats like XML and JSON don’t have to be licensed from anyone, and therefore people are free to write tools to read it.

If you’re interested in participating or just keeping track of the ODB, there’s a Google Group you can join and share info, ask questions, or offer ideas to improve it.

Database Naming Conventions

Friday, September 21st, 2007

Time for another naming convention. This time it’s something people care more about than Java packages, we’re talking about databases. Here are the rules, and the reasons behind them.

Use lowercase for everything

We’ve got 4 choices here:

  • Mixed case

    • Some servers are case sensitive, some are not. MySQL for example, is case-insensitive for column names, case-insensitive on Windows for table names, but case-sensitive on Linux for table names.
    • Error prone
  • No convention

    • Same reasons as mixed case
  • Upper case

    • SQL is easier to scan when the reserved words are uppercase. This is valuable when scanning log files looking for things like WHERE statements and JOINs.
    • MySQL will always dump table names on Windows in lowercase.
  • Lowercase

    • Works everywhere. Some servers, like Oracle, will appear to convert everything to uppercase, but it’s just case-insensitive and you can use lowercase.

Only use letters and underscores and numbers (sparingly)

  • Most servers support other characters, but there are no other characters which all the major servers support.
  • Numbers should be used as little as possible. Frequent use is typically a symptom of poor normalization.

    • address1, address2 is OK
  • Whitespace isn’t allowed on most servers, and when it is you have to quote or bracket everything, which gets messy.

Table and column should be short, but not abbreviated.

  • You’ve seen the same thing abbreviated every way possible, like cust_add, cus_addr, cs_ad, cust_addrs, why not just customer_address? We’re not writing code on 80 character terminals any more, and most people aren’t even writing SQL, so let’s keep it clear, OK?
  • 30 characters is considered the safe limit for portability, but give some serious thought before you go past 20.

Table names should be singular.

Yes, singular! Oh yes, I went there. I used to use plural names, because it’s more semantically accurate. After all if each record is a person, then a group of them would be people, right? Right, but who cares. SELECT * FROM person isn’t any less clear than people, especially if you’ve got a solid convention. You don’t use plurals when you’re declaring class names for a vector of generics do you? Also:

  • English plurals are crazy, and avoiding them is good.

    • user -> users
    • reply -> replies
    • address -> addresses
    • data -> data (unless its geographic, then it’s datum -> data)
  • Singular names means that your primary key can always be tablename_id, which reduces errors and time.

Double Underscores for Associative Tables.

You’ve got your person table, and your address table, and there’s a many-to-many between them. This table should be called address__person. Why? Well what if you have a legacy_customer table that also ties to address. Now you’ve got address__legacy_customer. A new developer can easily pick up this convention and will be able to break down the names accordingly. Remember, no matter what the Perl/Lisp/Ruby/etc guys say, clarity of code is judged by how someone reads it, not how they write it.

Component Names of Associative Tables in Alphabetical Order.

This rule is somewhat arbitrary, but still beneficial. There’s no good way to determine which goes first. Table size, “importance”, age, who knows what else, and those assessments may change over time. Or, you might find that your manager assigned the same task to two people, and now you’ve got an address__person and a person__address table co-existing peacefully, when you only need one. Everyone putting them in the same order makes reading and writing queries easier.

That’s all I’ve got for now, but I encourage you to offer your own, or even refute some of the ones above (with some reasoning, of course).