by Scott W. Ambler
From: http://www.tdan.com/view-articles/5010/A database refactoring is a small change to a database schema which improves its design without changing, at a practical level, the semantics of the database. In other words, it is a simple database transformation which neither adds nor breaks anything. The process of database refactoring defines how to safely evolve a database schema in small steps. Database refactoring enables data professionals to work in an evolutionary manner, just as modern application developers do. It also provides a coherent strategy for organizations to dig their way out of the legacy database hole.
In the seminal text Refactoring, Martin Fowler [1] describes the programming technique called refactoring, which is a disciplined way to restructure code in small steps. Refactoring enables you to evolve your code slowly over time, to take an evolutionary (iterative and incremental) approach to programming. A critical aspect of a refactoring is that it retains the behavioral semantics of your code. You do not add functionality when you are refactoring, nor do you take it away. A refactoring merely improves the design of your code - nothing more and nothing less. A database refactoring [2, 3] is a simple change to a database schema that improves its design while retaining both its behavioral and informational semantics - in other words, you cannot add new functionality or break existing functionality, you cannot add new data, and you cannot change the meaning of existing data. A database schema includes both structural aspects, such as table and view definitions, and functional aspects, such as stored procedures and triggers. I use the terms code refactoring to refer to traditional refactoring as described by Martin Fowler and database refactoring to refer to the refactoring of database schemas. The process of database refactoring is the act of making these simple changes to your database schema.
There are two fundamental reasons why you want to adopt database refactoring:
Sometimes a project team finds itself in a relatively simple, "single-application database" situation, and if so they should consider themselves lucky. With this simple architecture database refactoring is fairly simple - you merely change your database schema and update your application to use the new version of the schema. What is more typical is to have many external programs interacting with your database, some of which are beyond the scope of your control. In this situation you cannot assume that all the external programs will be deployed at once, and must therefore support a transition period (also referred to as a deprecation period) during which both the old schema and the new schema are supported in parallel. For the rest of this article I will assume that you're in this situation. To put database refactoring into context, let's step through a quick example. You have been working on a banking application for a few weeks and have noticed something strange about the Customer table depicted in Figure 1[1] - one of the column names isn't easy to understand. You decide to apply the Rename Column refactoring to the FName column to rename it to FirstName.
Figure 1 - The initial database schema for Customer.
Agilists typically work together in pairs; one person should have application programming skills and the other data skills, and ideally both people have both sets of skills. The pair begins by determining whether the database schema needs to be refactored. Perhaps the programmer is mistaken about the need to evolve the schema, and how best to go about the refactoring. The refactoring is first developed and tested within the developer's sandbox. When it is finished, the changes are promoted into the project-integration environment, and the system is rebuilt, tested, and fixed as needed.
To apply the Rename Column refactoring in the development sandbox, the pair first runs all the tests to see that they pass. Next, they write a test because they are taking a Test-Driven Design (TDD) approach [5, 6, 7]. A likely test is to access a value in the FirstName column. After running the tests and seeing them fail, they implement the actual refactoring. To do this they introduce the FirstName column and the SynchronizeFirstName trigger.
The trigger is required to keep the values in the columns synchronized - each external program accessing the Customer table will at most work with one but not both columns. At first, all production applications will work with FName, but over time they will be reworked to access FirstName instead. There are other options to do this, such as views or synchronization after the fact, but I find that triggers work best.
The FirstName column must be populated with values from the FName column. You then need to run both columns in parallel during a "transition period" of sufficient length to give the development teams time to update and redeploy all of their applications. This transition period could be several years in length, depending on the ability of your project teams to get new releases into production. In this case we've decided that the transition period will run to November 14, 2007.
The pair reruns the tests and sees that they now pass. They then refactor the existing tests to work with the FirstName column rather than the FName column. Once the database refactoring is completed in their development work environment, the pair promotes their work into the team's integration sandbox where they rebuild and rerun the tests, fixing any problems which they find. To update the database schema, the pair runs the appropriate change and migration scripts in the appropriate order.
This promotion strategy continues into your pre-production integration testing environment and then eventually into production. Depending on your need, you could implement and then deploy the refactoring within a single day, although more realistically it would be several months until the next major release of your application that you would deploy the refactoring along with any other updates that you've made.
After the transition period, you remove the original column plus the trigger(s), resulting in the final database schema of Figure 3. You remove these things only after sufficient testing to ensure that it is safe to do so. At this point, your refactoring is complete.
Figure 2 - The final database schema for Customer.
There is a little more to successfully implementing a database refactoring than what I've described. You need a way to coordinate the refactoring efforts of all the development teams within your organization, clearly something that may prove quite difficult. You also need to get good at deploying refactorings in production, once again coordinating the efforts of several teams. In Refactoring Databases [3], my co-author Pramod Sadalage and I discuss several strategies for doing each of these things.
I am often told by existing data professionals that the real solution is to model everything up front, and then you would not need to refactor your database schema. Although that is an interesting vision, and I have seen it work in a few situations, experience from the past three decades has shown that this approach does not seem to be working well in practice for the overall IT community. The traditional approach to data modeling does not reflect the evolutionary approach of modern methods such as the RUP and XP, nor does it reflect the fact that business customers are demanding new features and changes to existing functionality at an accelerating rate. The old ways simply aren't sufficient any more, if they ever were [8].
I suggest that you take an Agile Model-Driven Development (AMDD) approach [9, 10], in which you do some high-level modeling to identify the overall "landscape" of your system, and then model storm the details on a just-in-time (JIT) basis. Take advantage of the benefits of modeling without suffering from the costs of over-modeling, over-documentation, and the resulting bureaucracy of trying to keep too many artifacts up-to-date and synchronized with one another. Your application code and your database schema evolve as your understanding of the problem domain evolves, and you maintain quality through refactoring both.
Database refactoring is a database implementation technique, just like code refactoring is an application implementation technique. You refactor your database schema to ease additions to it. You often find that you have to add a new feature to a database, such as a new column or stored procedure, but the existing design is not the best one possible to easily support that new feature. You start by refactoring your database schema to make it easier to add the feature, and after the refactoring has been successfully applied, you then add the feature. The advantage of this approach is that you are slowly, but constantly, improving the quality of your database design. This process not only makes your database easier to understand and use, it also makes it easier to evolve over time; in other words, you improve your overall development productivity.
Мой опыт показывает, что данные специалисты могут воспMy experience is that data professionals can benefit from adopting modern evolutionary techniques similar to those of developers, and that database refactoring is one of several important skills that data professionals require. Evolutionary development has arguably become the norm within the IT community, and agile software development approaches extend evolutionary methods to become more effective. My advice to data professionals is to take evolutionary and agile concepts and techniques seriously: they're real, they work, and they're here to stay.