Some guidelines for Drupal data modeling

The freedom that Drupal gives you to easily create content types where you can choose from a wide variety of field types, and combine them in any imaginable number of ways is, to my mind, Drupal's "killer feature". The fact that it's so easy also makes it dangerous. If I had to name one trait that really excellent Drupal sites share, it's a solid data model that aligns with the nature and purpose of the data, and is naturally extendable as the site expands, without resorting to ad-hoc hackish workarounds.

To be perfectly honest, I'd say that only one of my sites comes close to this Platonic ideal. I'm fairly pleased with the data modeling for my work from the last year or so. But I only reached that point through a string of sites that are a real mess under the hood-- and this site is no exception. Every time I create a new node, I cringe slightly and wish I had time to tear it down and rebuild it in Drupal 7, with better data modeling.

Getting the data model right for Drupal is an art as much as a science. There are things you can do really wrong, things you can do sub-optimally, and things you can do where you and I might reasonably disagree. Some of it comes down to experience, and learning to predict in what direction a site might evolve. You can create a data model that works beautifully for a site as you've been instructed to build it, only to discover that the person you're working with loves it so much that they suddenly want it to do eight more things you never planned for. Once people who haven't used a robust, extendable content management system get their hands on a Drupal site you've built them, I've found that the scope of what it needs to do doesn't just creep, it often snowballs. What initially sounded like a peripheral taxonomy can evolve into an elaborate content type of its own, leaving you with a combination of manual data-shuffling, Views Bulk Operations voodoo, and/or looking for someone who can write you some SQL.

On top of all this, there's some data that just doesn't fit elegantly into any of the ways that Drupal likes to store data, and module requirements and quirks can introduce their own set of constraints. I have a Drupal 6 site that relies heavily on Editview (which provides a View display where a user can edit all the node fields you include), but Editview doesn't work well for taxonomy fields, even though using a taxonomy to store a certain piece of data was the "right answer" from a data modeling perspective. Chances are, not every site you build is going to have a beautiful data model, no matter how much thought you put into it. You'll make compromises, and that's okay. The way I see it, the goal of Drupal data modeling is minimizing the amount of work you (the developer) have to do to extend the site, and making it as easy-- and minimally confusing-- as possible for other people to use the site. If your inner librarian (other people have one of those too, right?) is pleased with the aesthetics of the result, that's a bonus.

For me, seeing real-life examples of how good sites are built, and recognizing the consequences of my own data modeling follies, have been the best ways to learn how to do a better job. I hope to write a series of data modeling case studies, and link to them here, to help others who learn the same way. In the meantime, though, here are some thoughts on things to consider when contemplating what your data model should look like.

How many content types do you need?

My aesthetic preference is to consolidate content types where I can. When I decide (or hear it's been decided) to add a new kind of content to a site, if it would have the same fields as an existing content type, I usually consider extending what I've already created first. It doesn't really matter how many content types your site has, so long as it's clear to your users what each one is for, but there's a little extra configuration work that falls on you as a developer for every new content type you add. Here are some thoughts on how to decide whether you've got a new content type on your hands, or whether you should extend what you've already got:

Think about content type settings

Out of the box, Drupal 7 comes with two content types: Article ("Use articles for time-sensitive content like news, press releases or blog posts.") and Basic page ("Use basic pages for your static content, such as an 'About us' page."). The differences between them are subtle-- the Article displays author and date information and has open comments, with 50 per page, whereas the Basic page does not display author and date information and has hidden comments. The Article has an image field by default, and a field for tags; the Basic page just has a title and body. Those are pretty reasonable default settings for each one (though I'd probably turn comments off altogether for basic pages), and serve as a good example of one factor in deciding whether you need a new content type: content type settings matter.

If you want one group of pages to show the author and date, and another group of pages to omit that information, they should be two different content types. Same thing with comment settings-- sure, there are ways to configure them on a node-by-node basis, but if there's a set of pages that you consistently want to have comments, and another set that you consistently want to not have comments, it makes sense to have two content types.

Think about Pathauto

While it's not part of Drupal core, Pathauto is a module I won't launch a site without. With Pathauto, you configure the default URL for your pages based on content type. Sure, you can override Pathauto on a node-by-node basis, but if you consistently want your blog posts to look like mysite.university.edu/2012/01/01/my-blog-post while your pages are supposed to look like mysite.university.edu/my-page, you should create two content types.

That said, Pathauto concerns can lead you down the path of possibly-needless content type proliferation. Let's say you have a set of pages whose needs (in terms of content type settings and fields) align perfectly with Basic page, except you want them to have a different path, and you want them to appear in a View of their own, without all the other Basic pages. You could create a separate content type, configure Pathauto to generate the path you want, and use a Content type filter in Views to get the listing of nodes that you want. Alternatively, you could add a field-- like field type List (Text), or even a term reference field pointing to a taxonomy-- where the user can specify a value if the node they're creating is of a special type. You can then leverage that field in your Pathauto pattern by including its token, with the knowledge that it'll only appear in the path if you've selected a value. This is easiest to do (and makes the most sense) if your Pathauto settings for the two kinds of content differ only by a single value-- if you want the nodes living at mysite.university.edu/page-title, mysite.university.edu/stuff/page-title, and mysite.university.edu/things/page-title all to be of a single content type. The more your paths differ, the trickier it starts to get, and you may reasonably conclude that the additional work isn't worth it for you, and that having to select between multiple options in the node edit form will be confusing for your user.

Here's an example of how this could almost work out easily, until complications start setting in. On this site, I have normal pages (e.g. my CV at /cv), pages with digital humanities data (e.g. quotes from Project Bamboo workshops about metadata, at /dh/data/metadata), and pages about Drupal (e.g. "Drupal jargon explained!" at /drupal/tutorials/drupal-jargon-explained). If my paths were just /page-title, /dh/page-title, and /drupal/page-title, I could create an optional List (Text) field for my Basic page content type (let's name the field "topic") with the values "Drupal" and "DH", and for Drupal or DH posts I could select the appropriate value. I'd then configure my Pathauto pattern to be [node:field_topic]/[node:title]. When neither "Drupal" nor "DH" is selected for a node, it'd just appear at /page-title, but when "DH" is selected, it'd appear at /dh/page-title.

But what about the second part of the path? The paths I'm using in reality are more complicated: /dh/data/page-title, /drupal/tutorials/page-title, etc. Where would that data come from? You can make it work by creating more List (Text) fields where you can select a sub-topic. Using Conditional Fields, so that the Drupal sub-topics only show up after you've selected "Drupal" as the topic, can make it less confusing for your user. If you go down that route, your Pathauto pattern would look something like [node:field_topic]/[node:field_drupal_subtopic]/[node:field_dh_subtopic]/[node:title]. It starts getting messy, and at a certain point it's just better to create different content types.
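To make the "empty tokens drop out of the path" idea concrete, here's a rough Python sketch of the two patterns discussed above. To be clear, this is not Pathauto's actual code (Pathauto is a PHP module with far more cleanup options); expand_pattern and slugify are hypothetical helpers that only mimic the behavior described here, and the "Tutorials" sub-topic value is assumed from the example paths.

```python
import re


def slugify(text):
    """Lowercase the text and turn runs of non-alphanumerics into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def expand_pattern(pattern, values):
    """Fill [token] placeholders, then drop any path segment left empty.

    `values` maps token names (without brackets) to field values; a token
    that is missing or empty contributes nothing to the path.
    """
    segments = []
    for raw in pattern.split("/"):
        filled = re.sub(
            r"\[([^\]]+)\]",
            lambda m: values.get(m.group(1)) or "",
            raw,
        )
        slug = slugify(filled)
        if slug:  # skip segments whose token had no value
            segments.append(slug)
    return "/" + "/".join(segments)


SIMPLE = "[node:field_topic]/[node:title]"
NESTED = (
    "[node:field_topic]/[node:field_drupal_subtopic]"
    "/[node:field_dh_subtopic]/[node:title]"
)

# No topic selected: the topic segment disappears.
print(expand_pattern(SIMPLE, {"node:title": "Page title"}))
# /page-title

# Topic selected: it becomes the first path segment.
print(expand_pattern(SIMPLE, {"node:field_topic": "DH", "node:title": "Page title"}))
# /dh/page-title

# Nested version: only the sub-topic field that has a value shows up.
print(expand_pattern(NESTED, {
    "node:field_topic": "Drupal",
    "node:field_drupal_subtopic": "Tutorials",
    "node:title": "Drupal jargon explained!",
}))
# /drupal/tutorials/drupal-jargon-explained
```

Notice that the nested pattern needs one token per possible sub-topic field, so every new topic means another field and another token. That growth is the messiness that eventually tips the balance toward separate content types.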

Think about fields

How many differences in fields does it take before you should make the call that you're dealing with different content types? If you have an option for selecting what kind of content a certain node is (like the discussion of Basic page and its "DH" and "Drupal" variants, above), you can use Conditional Fields to display fields specific to that kind of content-- for instance, if I wanted every "Drupal" node to have an image, but not the "DH" nodes, selecting "Drupal" could cause an image field to show up. A different field or two doesn't necessarily mean you should create a whole new content type, but the more differences in fields there are, the more you should consider whether you're dealing with content that is, by its nature, fundamentally different.

Think about the display

Display Suite provides incredibly powerful tools for displaying individual nodes in a variety of ways, but for most projects, if you find yourself reaching for Display Suite to make different node displays for two variants of something you're treating as a single content type (e.g. maybe one needs some data in a tabbed fieldgroup, and another just needs a standard display of the content with inline labels for each field), you might want to think about the nature of the content. Is it really the same thing, and should it be the same content type?

Think about your users

Sometimes technology doesn't even factor into the decision of how many content types to create. Let's say you're going to be posting some 19th century literary texts, and essays about those texts written by your users. In both cases, all you want is a title and a body field. Maybe you want the Pathauto pattern for both of them to just show the page title, or you've got another workaround for the path (like the one described above). You can differentiate them, for purposes of Views listings and such, using a List (Text) field with options "Text" and "Essay". Should you use a single content type?

Probably not.

In Drupal's eyes, your essays and your 19th century literature are just nodes with a title and a body field. Most users, though, think of them as very different things. One is a primary source, one is their scholarship on that primary source. It "feels wrong" to use a single content type for both, pasting in either essay text or literary prose into the same "body" field. Then there's the question of how you'd even phrase it as a menu item. "Add essay" and "Add literature" are meaningful items in a menu, but "Add essay or literature" is clunky and strange, and "Add content" is vague. You've probably got two content types on your hands.

There are always gray areas-- for Bamboo DiRT, I created one content type that can accommodate both tools and collections (e.g. text repositories). The focus of the site is helping people find stuff that can help them do digital research, and while there are ways to limit your browsing/searching to only tools or only collections, I didn't see a huge amount of value in differentiating them through different content types, at least for now. As a result, users have the option to "Add resource", regardless of the nature of the thing they want to contribute.

Content type or taxonomy?

In Drupal 7, you can add fields to users (sidestepping the various clumsy ways of creating "user profiles" in Drupal 6) and to taxonomy terms. It's extremely handy, but it does muddy what was previously a clear distinction between nodes and taxonomy terms. When do you create a taxonomy, and use a term reference field to connect different information, and when do you create a content type and use a node reference field (enabled by the References module)?

Is it data or metadata?

One of the first things I think about is whether the content in question is, within the scope and focus of the site, data or metadata. The tricky thing is, sites evolve-- what's metadata today may become data tomorrow, if it takes on a life of its own. When you're doing data modeling, think about what other features might grow out of the ones you already know about. If you're creating a site to display and annotate comments from a survey about classrooms, you might be tempted to treat the classroom as a taxonomy term-- the focus is on the content of the comments, not the rooms themselves. But what if the survey comments turn out to be just the beginning? What if your user now wants those comments to display like reviews on a "profile page" for each room, a profile page which also includes a map showing where the classroom is located-- or two versions, where one traces the handicap-accessible path to the classroom from the building's entrance! How about some photos? And can we include a data feed with scheduling information? If this happens, and you've chosen to store the classrooms as taxonomy terms, you should seriously consider regenerating them as nodes, and changing your term reference fields to node reference fields, because the classrooms really aren't metadata anymore, they're the star of the show. You can't plan ahead for every possible development, but as you build more Drupal sites, you'll get a better sense of how they tend to evolve.
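For a sense of what "regenerating them as nodes" involves, here's a minimal Python sketch of the bookkeeping: create one node per term, remember which term became which node, and rewrite every reference. It's a toy model under stated assumptions, not Drupal code. The Term and Node classes, the term_refs and node_refs attributes, and promote_terms_to_nodes are all hypothetical names, and a real migration would go through Drupal's own APIs (or some careful SQL).

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Term:
    tid: int
    name: str


@dataclass
class Node:
    nid: int
    type: str
    title: str
    term_refs: list = field(default_factory=list)  # referenced term ids
    node_refs: list = field(default_factory=list)  # referenced node ids


def promote_terms_to_nodes(terms, nodes, new_type):
    """Create a node for each term and repoint references at the new nodes.

    Returns the full node list (existing + new) and the tid -> nid mapping.
    """
    next_nid = count(max((n.nid for n in nodes), default=0) + 1)
    tid_to_nid = {}
    promoted = []
    for term in terms:
        node = Node(nid=next(next_nid), type=new_type, title=term.name)
        tid_to_nid[term.tid] = node.nid
        promoted.append(node)

    for node in nodes:
        # Swap each promoted term reference for the matching node reference.
        node.node_refs.extend(
            tid_to_nid[tid] for tid in node.term_refs if tid in tid_to_nid
        )
        node.term_refs = [tid for tid in node.term_refs if tid not in tid_to_nid]

    return nodes + promoted, tid_to_nid


# Example: one survey comment that referenced a classroom term.
classrooms = [Term(tid=7, name="Room 101")]
comments = [Node(nid=1, type="survey_comment", title="Too cold", term_refs=[7])]

all_nodes, mapping = promote_terms_to_nodes(classrooms, comments, "classroom")
print(mapping)                 # {7: 2}
print(all_nodes[0].node_refs)  # [2]
```

The mapping is the part worth keeping around: anything else that pointed at the old terms (Views filters, path aliases, menu links) needs the same translation.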

Technical considerations

When building DHCommons, I'd already imported a bunch of events as taxonomy terms (with the thought that people would edit their own profiles and type in the conferences they'd be attending with the help of autocomplete) when it occurred to me that the Flag module would provide the much more user-friendly option of allowing people to click on an "I am attending" button when looking at the events list. After installing Flag, I realized that it allows users to "flag" other users, comments, or nodes... but not taxonomy terms. Admittedly, there is a separate, add-on Flag Terms module, but it didn't have many users and the 7.x-1.x-dev version hadn't been updated in a while. I was able to rationalize to myself the decision to switch events to a content type by considering the important role that events will play in the ongoing DHCommons outreach program, but at the end of the day, I made the decision mostly because of a module constraint.

To dispel any concerns on one particular technical front: the autocomplete (tagging) widget option for term reference fields-- which adds new taxonomy terms to the database when a user types in something that doesn't already exist-- has an equivalent if you're using node references, namely Node Reference Create. It'll create an empty node of the content type you choose in the widget configuration, whose title will match whatever the user types. I use it a lot.

List (text) field or term reference field?

Depending on how you've configured things, the user experience for selecting terms from a taxonomy (via a term reference field) and from a list (text) field might be identical. So what are some criteria for choosing one or the other?

Taxonomies

Taxonomies have the benefit of allowing a hierarchy of terms. By default, they display within a node as a link that will take you to a list of all the content on the site that has that taxonomy term-- which can be a useful way to bring together related content from different content types. Taxonomies have always allowed a description, and in Drupal 7 they can also include fields, so if the options that you're choosing between when creating a node might need some explanation or additional data (e.g. you might want to include a link to a university's webpage when you're using universities as a set of taxonomy terms), go with a taxonomy.

List (text)

I tend to think of metadata that's stored using a list field as something of a one-off. It may be important information-- I've got a dialectology site where all the crucial grammatical information about words in a text is stored this way-- but it's not a major way of organizing the data. (That said, you can still use it as a filter in Views.) You have to pre-populate lists, and people have to choose between existing options-- unlike term reference fields where you can use the autocomplete widget, which allows users to easily add new values when they're adding or editing nodes.
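The difference in how the two options handle a value nobody planned for can be sketched in a few lines of Python. This is only a conceptual model: ListTextField and Vocabulary are hypothetical classes rather than Drupal APIs, and the case-insensitive matching of term names is an assumption made to keep the example short.

```python
class ListTextField:
    """A list field: the allowed values are fixed ahead of time."""

    def __init__(self, allowed_values):
        self.allowed_values = set(allowed_values)

    def validate(self, value):
        if value not in self.allowed_values:
            raise ValueError(f"{value!r} is not one of the allowed values")
        return value


class Vocabulary:
    """A taxonomy used with an autocomplete (tagging) widget."""

    def __init__(self):
        self.terms = {}  # normalized term name -> term id

    def get_or_create(self, name):
        key = name.strip().lower()  # assumed: matching ignores case
        if key not in self.terms:
            # Unknown value: a new term is added on the fly.
            self.terms[key] = len(self.terms) + 1
        return self.terms[key]


part_of_speech = ListTextField(["noun", "verb", "adjective"])
part_of_speech.validate("noun")        # fine
# part_of_speech.validate("particle")  # would raise ValueError

tags = Vocabulary()
first = tags.get_or_create("Metadata")
again = tags.get_or_create("metadata")   # same term, no duplicate created
new = tags.get_or_create("Text mining")  # new term added by the user
```

If you expect your users to keep coming up with values you didn't anticipate, that second behavior is a point in favor of a taxonomy.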

Your site will evolve messily

Things fall apart, the data model cannot hold,
Mere anarchy is loosed upon the Drupal site (— with apologies to Yeats)

So you've gotten your site up and running with a data model that you've lovingly and painstakingly crafted, and everything's working great. Then you get a request that sounds simple enough: "Why don't we add a front page slideshow?" But as soon as you install Views Slideshow, you realize that you don't have any content types with an image field. And they want a slideshow that includes content types where users would be confused about what to upload into an image field, even if presented with the option (imagine, say, a glossary of technical terms). You'd probably be unsuccessful in talking them out of the slideshow idea-- and what's more, it seems reasonable enough to highlight some of the most interesting technical terms-- but how are you going to pull off creating a slideshow with images without screwing up your data model?

There are multiple ways to tackle the myriad situations like this that will inevitably arise, and sometimes the one that makes the most sense for you, the site, and your users is one that introduces some chaos into your data model. That said, if you get a request and your first reaction is to add a new field or content type (assuming the request is something other than "we need a new field on user profiles"), step back for a moment and think about the best way to do it from the perspective of your overall data model. A few general tips:

  1. I've worked with some wonderful student assistants, but at this point I'm inclined to recommend not giving student (or any other kind of) assistants permission to add content types or fields, unless you're confident in their ability to think through the data modeling aspect of site development and choose wisely. I've had to untangle sites where things like slideshows have been implemented in a maddeningly convoluted way, because Drupal makes it easy to add a field here and a field there as a way of executing whatever hackish vision first leaps into your mind.
  2. In the name of all that is holy, document quirks. I'm as bad as anyone when it comes to documentation, and if you're going to skimp on something (and you will), there are worse things than falling short on documenting the parts of your data model that intuitively make sense. But when you have to make a judgment call and choose the least-bad way to implement something like a slideshow with images, at least document that part.
  3. If things get out of control, don't be afraid to go back and rationalize. This, of course, assumes you have time and funding, and many projects have run out of one or both by the time you want to do this. Site refreshes and/or major version upgrades provide good excuses for doing some of this work; consider adding a few more hours to your time/funding estimate to address data model creep.

Questions? Comments?

I'm sure there are things I've missed, and I'm happy to expand these musings further in response to feedback. Leave a comment below if you've got any data modeling questions or suggested topics. In addition to writing full-on how-to's for the various sites I've built, I'm hoping to turn some of them into data modeling case studies that I'll post here, as soon as I can scare up some time (read: it might be a while, even with my New Year's Resolution to write more Drupal documentation).
