Saturday, October 18, 2008

Testing Methodologies

Objectives

  • Look at the various methodologies for testing software.
  • Learn about test-driven development.

Overview

There are numerous methodologies available for developing and testing software. The methodology you choose depends on factors such as the nature of the project, the project schedule, and resource availability. Although most software development projects involve periodic testing, some methodologies focus on getting input from testing early in the cycle rather than waiting for it until a working model of the system is ready. Methodologies that require early test involvement have several advantages, but also involve tradeoffs in terms of project management, schedule, customer interaction, budget, and communication among team members. This chapter discusses how testing fits into the various software development life cycles and then discusses the test-driven development practice in detail.

The Configuration Management Application Block (CMAB) is used to illustrate concepts in this chapter. The requirements for the CMAB are as follows:

  • It provides the functionality to read and store configuration information transparently in persistent storage media. The storage media are the Microsoft® SQL Server™ database system, the registry, and XML files.
  • It provides a configurable option to store the information in encrypted form and in plain text using XML notation.
  • It can be used with desktop applications and with Web applications that are deployed in a Web farm.
  • It caches configuration information in memory to reduce cross-process communication such as reading from any persistent medium. This caching reduces the response time of the request for configuration information. The expiration and scavenging mechanism for the data cached in memory is similar to the cron algorithm in UNIX.
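To make these requirements concrete, here is a minimal sketch of how an application might consume the CMAB. The ConfigurationManager.Read call matches the test examples later in this chapter; the section name, the Write method, and the CustomConfigurationData shape are illustrative assumptions rather than the block's documented API.

public class CmabUsageSketch {
    public static void Main() {
        // Read strongly typed configuration data; the CMAB decides whether
        // it comes from SQL Server, the registry, or an XML file, and serves
        // repeated reads from the in-memory cache until the entry expires.
        CustomConfigurationData custom =
            ConfigurationManager.Read("XmlFile") as CustomConfigurationData;
        System.Console.WriteLine(custom.Color);

        // Persist a change back through the block. A Write method is assumed
        // here for symmetry with Read; it does not appear in this chapter.
        custom.Color = "Blue";
        ConfigurationManager.Write("XmlFile", custom);
    }
}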

Software Development Methodologies and Testing

Various generic software development life cycle methodologies are available for executing software development projects. Although each methodology is designed for a specific purpose and has its own advantages and disadvantages, most methodologies divide the life cycle into phases and share tasks across these phases. This section briefly summarizes common methodologies used for software development and describes their relationship to testing.

Waterfall Model

The waterfall model is one of the earliest structured models for software development. It consists of the following sequential phases through which the development life cycle progresses:

  • System feasibility. In this phase, you consider the various aspects of the targeted business process, find out which aspects are worth incorporating into a system, and evaluate various approaches to building the required software.
  • Requirement analysis. In this phase, you capture software requirements in such a way that they can be translated into actual use cases for the system. The requirements can derive from use cases, performance goals, target deployment, and so on.
  • System design. In this phase, you identify the interacting components that make up the system. You define the exposed interfaces, the communication between the interfaces, key algorithms used, and the sequence of interaction. An architecture and design review is conducted at the end of this phase to ensure that the design conforms to the previously defined requirements.
  • Coding and unit testing. In this phase, you write code for the modules that make up the system. You also review the code and individually test the functionality of each module.
  • Integration and system testing. In this phase, you integrate all of the modules in the system and test them as a single system for all of the use cases, making sure that the modules meet the requirements.
  • Deployment and maintenance. In this phase, you deploy the software system in the production environment. You then correct any errors that are identified in this phase, and add or modify functionality based on the updated requirements.

The waterfall model has the following advantages:

  • It allows you to compartmentalize the life cycle into various phases, which allows you to plan the resources and effort required through the development process.
  • It enforces testing in every stage in the form of reviews and unit testing. You conduct design reviews, code reviews, unit testing, and integration testing during the stages of the life cycle.
  • It allows you to set expectations for deliverables after each phase.

The waterfall model has the following disadvantages:

  • You do not see a working version of the software until late in the life cycle. For this reason, you can fail to detect problems until the system testing phase. Problems may be more costly to fix in this phase than they would have been earlier in the life cycle.
  • When an application is in the system testing phase, it is difficult to change something that was not carefully considered in the system design phase. The emphasis on early planning tends to delay or restrict the amount of change that the testing effort can instigate, which is not the case when a working model is tested for immediate feedback.
  • For a phase to begin, the preceding phase must be complete; for example, the system design phase cannot begin until the requirement analysis phase is complete and the requirements are frozen. As a result, the waterfall model is not able to accommodate uncertainties that may persist after a phase is completed. These uncertainties may lead to delays and extended project schedules.

Incremental or Iterative Development

The incremental, or iterative, development model breaks the project into small parts. Each part is subjected to multiple iterations of the waterfall model. At the end of each iteration, a new module is completed or an existing one is improved on, the module is integrated into the structure, and the structure is then tested as a whole.

For example, using the iterative development model, a project can be divided into 12 one- to four-week iterations. The system is tested at the end of each iteration, and the test feedback is immediately incorporated at the end of each test cycle. The time required for successive iterations can be reduced based on the experience gained from past iterations. The system grows by adding new functions during the development portion of each iteration. Each cycle tackles a relatively small set of requirements; therefore, testing evolves as the system evolves. In contrast, in a classic waterfall life cycle, each phase (requirement analysis, system design, and so on) occurs once in the development cycle for the entire set of system requirements.

The main advantage of the iterative development model is that corrective actions can be taken at the end of each iteration. The corrective actions can be changes to the specification because of incorrect interpretation of the requirements, changes to the requirements themselves, and other design or code-related changes based on the system testing conducted at the end of each cycle.

The main disadvantages of the iterative development model are as follows:

  • The communication overhead for the project team is significant, because each iteration involves giving feedback about deliverables, effort, timelines, and so on.
  • It is difficult to freeze requirements, and they may continue to change in later iterations because of increasing customer demands. As a result, more iterations may be added to the project, leading to project delays and cost overruns.
  • The project requires a very efficient change control mechanism to manage changes made to the system during each iteration.

Prototyping Model

The prototyping model assumes that you do not have clear requirements at the beginning of the project. Often, customers have a vague idea of the requirements in the form of objectives that they want the system to address. With the prototyping model, you build a simplified version of the system and seek feedback from the parties who have a stake in the project. The next iteration incorporates the feedback and improves on the requirements specification. The prototypes that are built during the iterations can be any of the following:

  • A simple user interface without any actual data processing logic
  • A few subsystems with functionality that is partially or completely implemented
  • Existing components that demonstrate the functionality that will be incorporated into the system

The prototyping model consists of the following steps.

  1. Capture requirements. This step involves collecting the requirements over a period of time as they become available.
  2. Design the system. After capturing the requirements, a new design is made or an existing one is modified to address the new requirements.
  3. Create or modify the prototype. A prototype is created or an existing prototype is modified based on the design from the previous step.
  4. Assess based on feedback. The prototype is sent to the stakeholders for review. Based on their feedback, an impact analysis is conducted for the requirements, the design, and the prototype. The role of testing at this step is to ensure that customer feedback is incorporated in the next version of the prototype.
  5. Refine the prototype. The prototype is refined based on the impact analysis conducted in the previous step.
  6. Implement the system. After the requirements are understood, the system is rewritten either from scratch or by reusing the prototypes. The testing effort consists of the following:
    • Ensuring that the system meets the refined requirements
    • Code review
    • Unit testing
    • System testing

The main advantage of the prototyping model is that it allows you to start with requirements that are not clearly defined.

The main disadvantage of the prototyping model is that it can lead to poorly designed systems. The prototypes are usually built without regard to how they might be used later, so attempts to reuse them may result in inefficient systems. This model emphasizes refining the requirements based on customer feedback, rather than ensuring a better product through quick change based on test feedback.

Agile Methodology

Most software development life cycle methodologies are either iterative or follow a sequential model (as the waterfall model does). As software development becomes more complex, these models cannot efficiently adapt to the continuous and numerous changes that occur. Agile methodology was developed to respond to changes quickly and smoothly. Although the iterative methodologies tend to remove the disadvantages of sequential models, they are still based on the traditional waterfall approach. Agile methodology is a collection of values, principles, and practices that incorporates iterative development, testing, and feedback into a new style of development. For an overview of agile methodology, see the Agile Modeling site at http://www.agilemodeling.com/.

The key differences between agile and traditional methodologies are as follows:

  • Development is incremental rather than sequential. Software is developed in incremental, rapid cycles. This results in small, incremental releases, with each release building on previous functionality. Each release is thoroughly tested, which ensures that all issues are addressed in the next iteration.
  • People and interactions are emphasized, rather than processes and tools. Customers, developers, and testers constantly interact with each other. This interaction ensures that the tester is aware of the requirements for the features being developed during a particular iteration and can easily identify any discrepancy between the system and the requirements.
  • Working software is the priority rather than detailed documentation. Agile methodologies rely on face-to-face communication and collaboration, with people working in pairs. Because of the extensive communication with customers and among team members, the project does not need a comprehensive requirements document.
  • Customer collaboration is used, rather than contract negotiation. All agile projects include customers as a part of the team. When developers have questions about a requirement, they immediately get clarification from customers.
  • Responding to change is emphasized, rather than extensive planning. Extreme Programming does not preclude planning your project. However, it suggests changing the plan to accommodate any changes in assumptions for the plan, rather than stubbornly trying to follow the original plan.

Agile methodology has various derivative approaches, such as Extreme Programming, Dynamic Systems Development Method (DSDM), and Scrum. Extreme Programming is one of the most widely used approaches.

Extreme Programming

In Extreme Programming, rather than designing the whole system at the start of the project, the preliminary design work is reduced to solving the simple tasks that have already been identified.

The developers communicate directly with customers and other developers to understand the initial requirements. They start with a very simple task and then get feedback by testing their software as soon as it is developed. The system is delivered to the customers as soon as possible, and the requirements are refined or added based on customer feedback. In this way, requirements evolve over a period of time, and developers are able to respond quickly to changes.

The real design effort occurs when the developers write the code to fulfill the specific engineering task. The engineering task is a part of a greater user story (which is similar to a use case). The user story concerns itself with how the overall system solves a particular problem. It represents a part of the functionality of the overall system. A group of user stories is capable of describing the system as a whole. The developers refactor the previous code iteration to establish the design needed to implement the functionality.

During the Extreme Programming development life cycle, developers usually work in pairs. One developer writes the code for a particular feature, and the second developer reviews the code to ensure that it uses simple solutions and adheres to best design principles and coding practices.

Discussion of the core practices of Extreme Programming is beyond the scope of this chapter. For more information, see the links referred to in "More Information" later in this section.

Test-driven development, which is one of the core practices in Extreme Programming, is discussed in greater detail later in this chapter.

When to Use Extreme Programming

Extreme Programming is useful in the following situations:

  • When the customer does not have a clear understanding of the details of the new system. The developers interact continuously with the customer, delivering small pieces of the application to the customer for feedback, and taking corrective action as necessary.
  • When the technology used to develop the system is new compared to other technologies. Frequent test cycles in Extreme Programming mitigate the risk of incompatibility with other existing systems.
  • When you can afford to create automated unit and functional tests. In some situations, you may need to change the system design so that each module can be tested in isolation using automated unit tests.
  • When the team size is not very large (usually 2 to 12 people). Extreme Programming is successful in part because it requires close team interaction and working in pairs. A large team would have difficulty in communicating efficiently at a fast pace. However, large teams have used Extreme Programming successfully.

More Information

For more information about the core practices in Extreme Programming, see the Extreme Programming site at http://www.extremeprogramming.org/.

Test-Driven Development

Test-driven development is one of the core practices of Extreme Programming. The practice extends the feedback approach and requires that you develop test cases before you develop code. Developers write functionality to pass the existing test cases. The test team then adds new test cases to test the existing functionality and runs the entire test suite to find the cases where the code fails (either because the existing functionality needs to be modified or because required functionality is not yet included). The developers then modify the functionality or create new functionality so that the code passes the failed test cases. This cycle continues until the code passes all of the test cases that the team can create. The developers then refactor the functional code to remove any duplicate or dead code and make it more maintainable and extensible.

Test-driven development reverses the traditional development process. Instead of writing functional code first and then testing it, the team writes the test code before the functional code. The team does this in very small steps—one test and a small amount of corresponding functional code at a time. The developers do not write code for new functionality until a test fails because some functionality is not present. Only when a test is in place do developers do the work required to ensure that the test cases in the test suite pass. In subsequent iterations, when the team has the updated code and another set of test cases, the code may break several existing tests as well as the new tests. The developers continue to develop or modify the functionality to pass all of the test cases.

Test-driven development allows you to start with an unclear set of requirements and relies on the feedback loop between the developers and the customers for input on the requirements. The customer or a customer representative is part of the core team and immediately provides feedback about the functionality. This practice ensures that the requirements evolve over the course of the project cycle. Testing before writing functional code ensures that the functional code addresses all of the requirements, without including unnecessary functionality.

With test-driven development, you do not need to have a well-defined architectural design before beginning the development phase, as you do with traditional development life cycle methodologies. Test-driven development allows you to tackle smaller problems first and then evolve the system as the requirements become clearer later in the project cycle.

Other advantages of test-driven development are as follows:

  • Test-driven development promotes loosely coupled and highly cohesive code, because the functionality is evolved in small steps. Each piece of the functionality needs to be self-sufficient in terms of the helper classes and the data that it acts on so that it can be successfully tested in isolation.
  • The test suite acts as documentation for the functional specification of the final system.
  • The system uses automated tests, which significantly reduce the time taken to retest the existing functionality for each new build of the system.
  • When a test fails, you have a clear idea of the tasks that you must perform to resolve the problem. You also have a clear measure of success when the test no longer fails. This increases your confidence that the system actually meets the customer requirements.

Test-driven development helps ensure that your source code is thoroughly unit tested. However, you still need to consider traditional testing techniques, such as functional testing, user acceptance testing, and system integration testing. Much of this testing can also be done early in your project. In fact, in Extreme Programming, the acceptance tests for a user story are specified by the project stakeholder(s) either before or in parallel with the code being written, giving stakeholders confidence that the system meets their requirements.

Steps in Test-Driven Development

The test-driven development process consists of the steps shown in Figure 2.1.

Figure 2.1. Test-driven development process

The steps can be summarized as follows:

  1. Create the test code. Use an automated test framework to create the test code. The test code drives the development of functionality.
  2. Write/Modify the functional code. Write the functional code for the application block so that it can pass all test cases from the test code. The first iteration involves developing new functionality, and subsequent iterations involve modifying the functionality based on the failed test cases.
  3. Create additional tests. Develop additional tests for testing of the functional code.
  4. Test the functional code. Test the functional code based on the test cases developed in Step 3 and Step 1. Repeat steps 2 through 4 until the code is able to pass all of the test cases.
  5. Refactor the code. Modify the code so that there is no dead code or duplication. The code should adhere to best practices for maintainability, performance, and security.

The following subsections describe each step in detail.

Step 1: Create the Test Code

You create the test code before any code is written for the functionality of the application block. You can either write a custom test suite or use a testing framework (such as NUnit) to automate testing of the API. The general guidelines for writing the test code are as follows:

  • In Step 1, your goal is to write test code that tests basic functionality.
  • Write a test case for each code requirement. The developers will write the functional code based on the tests so that the test cases have an execution result of "pass."
  • Avoid writing code for multiple test cases at any one time. Write code for a single test case, and proceed through the remaining cycle of coding and refactoring.
  • Write code that tests only small pieces of functionality. If the test code is complex, divide it into smaller tests so that each one tests only a small piece of functionality.

After you complete Step 1, you should have various test suites written to test the functionality. For example, if you need to test the functionality of the CMAB using NUnit, create test code first, as follows:

[Test]
public void GetXmlFileSection() {
    string sourceSection = "XmlFile";
    CustomConfigurationData custom =
        ConfigurationManager.Read(sourceSection) as CustomConfigurationData;

    Assert.AreEqual("Red", custom.Color, "Red color was expected");
    Assert.AreEqual(45, custom.Size, "45 size was expected");
    Assert.AreEqual("Some text", custom.SomeText, "Some text was expected");
}

public class CustomConfigurationData {
    public CustomConfigurationData() {}

    string _color;
    public string Color {
        get { return _color; }
        set { _color = value; }
    }

    string _someText;
    public string SomeText {
        get { return _someText; }
        set { _someText = value; }
    }

    int _size;
    public int Size {
        get { return _size; }
        set { _size = value; }
    }

    public override string ToString() {
        return "Color = " + _color + "; FontSize = " +
            _size.ToString(System.Globalization.CultureInfo.CurrentUICulture);
    }
}

You will not be able to compile the test code until the developer writes the functional code. The developer then writes the code so that the test compiles and has an execution result of "pass."

More Information

For more information about NUnit, see http://www.nunit.org.

Step 2: Write or Modify the Functional Code

In Step 2, your goal is to develop functionality that passes the test cases that were created in Step 1. If you are in the first iteration, in all probability the test code will not compile and you can assume that it failed. You must modify the functionality being tested so that the test code compiles and can be executed.

After the functionality passes all of the test cases created in Step 1, the developer stops developing the module.

The functionality in the following example passes the test case that was created in Step 1.

class ConfigurationManager {
    public static object Read(string sourceSection) {
        // Hard-coded values are just enough to pass the Step 1 test case.
        CustomConfigurationData customConfiguration = new CustomConfigurationData();
        customConfiguration.Color = "Red";
        customConfiguration.SomeText = "Some text";
        customConfiguration.Size = 45;
        return customConfiguration;
    }
}

The outcome of Step 2 is code that implements basic functionality. However, in the later iterations, when additional test cases are created to test the functionality, the code in the previous example will fail. The developers must make further changes in subsequent iterations so that all of the functionality can pass the additional test cases. The developers continue to improve the functionality until all of the existing test cases have an execution result of "pass."

Step 3: Create Additional Tests

After Step 2, the code has the basic functionality to pass all of the test cases that were created in Step 1. The tester must now test this functionality for various types of input.

You should ensure that the amount of time spent developing additional test cases is proportionate to the criticality of the feature. For example, if various code paths in an API need to be tested by providing different input, you can write multiple test stubs, with each stub catering to one possible code path.
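For example, a pair of small NUnit test stubs might look like the following, one per code path through ConfigurationManager.Read. The section names and the expectation that an unknown section yields null are assumptions made for illustration, not requirements stated elsewhere in this chapter.

[Test]
public void ReadKnownSectionReturnsData() {
    // Code path 1: a section that exists in the configured storage.
    object result = ConfigurationManager.Read("XmlFile");
    Assert.IsNotNull(result, "A known section should return configuration data");
}

[Test]
public void ReadUnknownSectionReturnsNull() {
    // Code path 2: a section that does not exist; returning null rather
    // than throwing is an assumed contract for this sketch.
    object result = ConfigurationManager.Read("NoSuchSection");
    Assert.IsNull(result, "An unknown section should return null");
}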

The general guidelines for creating these additional tests are as follows:

  • Create additional tests that could possibly break the code. Continue to create these additional tests until you have no more test cases.
  • Write test cases that focus on the goal of the code (what the user needs the code to do), rather than the implementation. For example, if the goal of a function is to return values from a database based on some identifier, test the function by passing both valid and invalid identifier values, rather than testing only those values that the current implementation supports (which can be only a valid set of values).
  • If all of the test cases that were created in Step 1 pass but the functionality still does not work as intended, you have probably missed an important test scenario and must develop additional test cases that reproduce the failure. These test cases are in addition to the existing test cases that pass invalid input to the API to force an exception.
  • Avoid writing separate test cases for each unique combination of input. The number of test cases you write should be based on the risk assessment of the feature and on ensuring that all critical scenarios and the majority of test input have been accounted for.

The following additional test case will fail the functionality that was written for the example in Step 2.

[Test]
public void GetSqlSection() {
    string sourceSection = "SQLServer";
    SqlData custom = ConfigurationManager.Read(sourceSection) as SqlData;

    Assert.AreEqual("John", custom.FirstName, "John was expected");
    Assert.AreEqual("Trovalds", custom.LastName, "Trovalds was expected");
}

public class SqlData {
    public SqlData() {}

    string _firstName;
    public string FirstName {
        get { return _firstName; }
        set { _firstName = value; }
    }

    string _lastName;
    public string LastName {
        get { return _lastName; }
        set { _lastName = value; }
    }
}

Step 4: Test the Functional Code

The next step is to test the functionality by using the additional test cases. To do so, execute all of the test cases from Step 1 and Step 3, and note any test cases that fail. The developers must then make changes to the functional code so that it passes all of the existing test cases.

Repeat steps 2, 3, and 4 until the testers can no longer create additional test cases that break the functionality of the code. Each time a test fails, the developers fix the functionality based on the test cases and submit the improved code to the testers for a new round of testing.
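For illustration, here is a sketch of how the Read method from Step 2 might evolve so that both the Step 1 and Step 3 test suites pass. The hard-coded values remain placeholders for reads from the real storage media:

class ConfigurationManager {
    public static object Read(string sourceSection) {
        // Dispatch on the requested section; each branch satisfies one
        // of the test suites written so far.
        switch (sourceSection) {
            case "XmlFile":
                CustomConfigurationData custom = new CustomConfigurationData();
                custom.Color = "Red";
                custom.SomeText = "Some text";
                custom.Size = 45;
                return custom;
            case "SQLServer":
                SqlData sql = new SqlData();
                sql.FirstName = "John";
                sql.LastName = "Trovalds";
                return sql;
            default:
                return null;
        }
    }
}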

Step 5: Refactor the Code

After the completion of the previous steps, the code can be run and is capable of handling all of the test cases. In Step 5, the developers refactor the code to make it maintainable and extensible so that changes in one part of the module can be worked on in isolation without changing the actual interfaces exposed to the end user.

In addition to making code easier to maintain, refactoring removes duplicate code and can reduce code complexity. This helps to ensure that the basic principles of loosely coupled and highly cohesive design are followed without breaking the functionality in any way. For example, if you have the functionality shown in the first example, you can refactor it as shown in the second example. (In these examples, the helper class initially does not act on the data it owns; instead, the caller reaches in and manipulates the helper's data. This is typical of the BOB class anti-pattern.)

[Original Code]
class BOBClass {
    void DoWork() {
        Helper hlpObject = new Helper();
        if (hlpObject.CanDoWork) {
            double x = hlpObject.SomeValue + hlpObject.SomeOtherValue;
            hlpObject.NewValue = x;
        }
    }
}

class Helper {
    public bool CanDoWork = true;
    public double SomeValue;
    public double SomeOtherValue;
    public double NewValue;
}

[Refactored Code]
class BOBClass {
    void DoWork() {
        Helper hlpObject = new Helper();
        hlpObject.DoWork();
    }
}

class Helper {
    public bool CanDoWork = true;
    public double SomeValue;
    public double SomeOtherValue;
    public double NewValue;

    public void DoWork() {
        // The helper now operates on its own data instead of exposing
        // it for the caller to manipulate.
        if (CanDoWork) {
            NewValue = SomeValue + SomeOtherValue;
        }
    }
}

Depending on the size of the application you are developing, you should consider refactoring the code at regular intervals during the development cycle. Doing so helps to reduce code redundancy and to optimize the design, and it prevents the refactoring exercise from being a monumental task at the end of the project.

After completing Step 5, you repeat the test-driven development process for each subsequent feature requirement, until the system being developed is complete.

Although the test-driven model requires you to test each line of code as it is developed, you still need to perform other types of testing, such as performance testing, security testing, and globalization testing.

Agile Testing: Example

Agile methodology with Extreme Programming and test-driven development was used to develop the Smart Client Offline Application Block. The following are highlights of the approach taken on the project:

  • The test team and the development team were not formally separated. The developers worked in pairs, with one person developing the test cases and the other writing the functionality for the module.
  • There was much more interaction among team members than there is when following a traditional development model. In addition to using the informal chat-and-develop mode, the team held a 30-minute daily standup meeting, which gave team members a forum for asking questions and resolving problems, and weekly iterative review meetings to track the progress of each iterative cycle.
  • Project development began without any formal design document. The specifications were in the form of user stories that were agreed upon by the team members. In the weekly iterative review meetings, team members planned how to complete these stories and how many iterations to assign for each story.
  • Each story was broken down into several tasks. All of the stories and corresponding tasks were written down on small cards that served as the only source of design documentation for the application block.
  • While developing each task or story, NUnit test suites were written to drive the development of features.
  • No formal test plans were developed. The testing was primarily based on the tasks or stories for feature development. The development team got immediate feedback from the test team. Having the test team create the quick start samples gave the development team a perspective on the real-life usage of the application block.
  • After the task or story passed all of the NUnit test cases and was complete, quick start samples were developed to showcase the functionality. The quick start samples demonstrated the usage of the application block and were useful for further testing the code in the traditional way (functional and integration tests). Any discrepancies found in this stage were reported immediately and were fixed on a case-by-case basis. The modified code was tested again with the automated test suites and then was handed over to be tested again with the quick start samples.

Napster is back............. :-)

Readers:
I am back to blogging. I promise you will now find something new here.

Tuesday, February 12, 2008

Theoretical Foundation for Relational Databases

Introduction

Record keeping is an important aspect of almost every business. In our information society, much of the world's computing power is dedicated to maintaining and using databases.

Databases of all kinds pervade almost every business. All kinds of data, from e-mails and contact information to financial data and records of sales, are stored in some form of database. The quest is on for meaningful storage of less-structured information, such as subject knowledge.

This is the first of a two-part article that will provide an introduction to relational databases and the SQL language. This first part describes some of the key elements of the technology with an emphasis on database normalization. The second part will describe a less theoretical approach to database design, as well as provide an introduction to the SQL language.

History

The concept of relational databases was first described by Edgar Frank Codd (almost exclusively referenced as E. F. Codd in technical literature) in the IBM research report RJ599, dated August 19, 1969 [1]. However, the article that is usually considered the cornerstone of this technology is "A Relational Model of Data for Large Shared Data Banks," published in Communications of the ACM (Vol. 13, No. 6, June 1970, pp. 377-387). Only the first part of the article is available online.

Additional articles by E. F. Codd throughout the 1970s and 80s are still considered gospel for relational database implementations. His famous "Twelve Rules for Relational Databases" [2] were published in two Computerworld articles, "Is Your DBMS Really Relational?" and "Does Your DBMS Run By the Rules?", on October 14, 1985, and October 21, 1985, respectively. He has since expanded on the twelve rules, and they now number 333, as published in his book The Relational Model for Database Management, Version 2 (Addison-Wesley, 1990).

Codd's twelve rules call for a language that can be used to define, manipulate, and query the data in the database, expressed as a string of characters. That language, SQL, was originally developed in the research division of IBM (initially at Yorktown Heights, N.Y., and later at San Jose, Calif.); Raymond Boyce and Donald Chamberlin were the original designers [3]. SQL has since been adopted by all major relational database vendors. The name SQL originally stood for Structured Query Language. The first commercially available implementation of the language was named SEQUEL (for Structured English QUEry Language) and was part of IBM's SEQUEL/DS product. The name was later changed for legal reasons. Thus, many long-time database developers use the pronunciation "see-quell."

SQL has been adopted as an ANSI/ISO standard. Although the standard was revised in 1999 (usually referenced as SQL99 or SQL3), most vendors are still not fully compliant with even the 1992 version. The 1992 standard is smaller and simpler to reference, and since only some of the 1999-specific requirements are typically implemented at this time, it may be a better starting point for learning the language.

Need-to-Know for the Database Developer

The Rules of the Game

Codd's Twelve Rules

Many references to the twelve rules include a thirteenth rule, or rule zero:

A relational database management system (DBMS) must manage its stored data using only its relational capabilities.

This is basically a corollary or companion requirement to rule #4.

  1. Information Rule
    All information in the database should be represented in one and only one way -- as values in a table.
  2. Guaranteed Access Rule
    Each and every datum (atomic value) is guaranteed to be logically accessible by resorting to a combination of table name, primary key value, and column name.
  3. Systematic Treatment of Null Values
    Null values (distinct from empty character string or a string of blank characters and distinct from zero or any other number) are supported in the fully relational DBMS for representing missing information in a systematic way, independent of data type.
  4. Dynamic Online Catalog Based on the Relational Model
    The database description is represented at the logical level in the same way as ordinary data, so authorized users can apply the same relational language to its interrogation as they apply to regular data.
  5. Comprehensive Data Sublanguage Rule
    A relational system may support several languages and various modes of terminal use. However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings, and that is comprehensive in supporting all of the following:
    a. data definition
    b. view definition
    c. data manipulation (interactive and by program)
    d. integrity constraints
    e. authorization
    f. transaction boundaries (begin, commit, and rollback).
  6. View Updating Rule
    All views that are theoretically updateable are also updateable by the system.
  7. High-Level Insert, Update, and Delete
    The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data, but also to the insertion, update, and deletion of data.
  8. Physical Data Independence
    Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representation or access methods.
  9. Logical Data Independence
    Application programs and terminal activities remain logically unimpaired when information preserving changes of any kind that theoretically permit unimpairment are made to the base tables.
  10. Integrity Independence
    Integrity constraints specific to a particular relational database must be definable in the relational data sublanguage and storable in the catalog, not in the application programs.
  11. Distribution Independence
    The data manipulation sublanguage of a relational DBMS must enable application programs and terminal activities to remain logically unimpaired whether and whenever data are physically centralized or distributed.
  12. Nonsubversion Rule
    If a relational system has or supports a low-level (single-record-at-a-time) language, that low-level language cannot be used to subvert or bypass the integrity rules or constraints expressed in the higher-level (multiple-records-at-a-time) relational language.

(Author's Note: I do not have access to E.F. Codd's original text, but the above wording appears to be commonly used in other documents. See source 2.)

The rules primarily address implementation requirements for relational database management system (RDBMS) vendors. However, some of them also have an impact on application design.

In the early days of relational database products, these twelve rules were often used to evaluate RDBMSs. In the academic community, the discussion of full RDBMS compliance with the Relational Model continues, as does the discussion about whether the SQL language satisfies all the requirements. But we will stick to more practical matters.

How to Design the Database

Theoretical Foundation

Purpose
The problem with data is that it changes. Not only do individual values change, but so do the structure and use of the data, especially when it is kept over extended periods of time. Even for public records that may have been kept for hundreds of years, there are occasionally changes in what data elements are captured and recorded, and how.

Therefore, a method to avoid problems due to duplication of data values and modification of structure and content has been developed. This method is called normalization.

You normalize a database to ensure data consistency and stability, to minimize data redundancy, to ensure consistent updatability and maintainability of the data, and to avoid update and delete anomalies that result in ambiguous data or inconsistent results.

Some Key Concepts
Before we continue, I would like to show the correlation between the formal names of Tables, Rows, and Columns in Relational Theory and their more common counterparts:

Formal Name   Common Name   Also Known As
Relation      Table         Entity
Tuple         Row           Record
Attribute     Column        Field

A Primary Key is one or more columns whose values uniquely identify a row in a table (See rule #2 above).

A Candidate Key is one or more columns whose values could be used to uniquely identify a row in a table. The Primary Key is chosen from among a table's Candidate Keys.

Normalization
Normalization is the formalization of the design process of making a database compliant with the concept of a Normal Form. It addresses various ways in which we may look for repeating data values in a table. There are several levels of the Normal Form, and each level requires that the previous level be satisfied. I have used the wording for each normalization rule from the Handbook of Relational Database Design by Candace C. Fleming and Barbara von Halle [4].

The normalization process is based on collecting an exhaustive list of all data items to be maintained in the database and starting the design with a few "superset" tables. Theoretically, it may be possible, although not very practical, to start by placing all the attributes in a single table. For best results, start with a reasonable breakdown.

First Normal Form
Reduce entities to first normal form (1NF) by removing repeating or multivalued attributes to another, child entity.

Basically, make sure that the data is represented as a (proper) table. While key to the relational principles, this is somewhat of a motherhood statement. However, there are six properties of a relational table (the formal name for "table" is "relation"):

Property 1: Entries in columns are single-valued.
Property 2: Entries in columns are of the same kind.
Property 3: Each row is unique.
Property 4: Sequence of columns is insignificant.
Property 5: Sequence of rows is insignificant.
Property 6: Each column has a unique name.

The most common sins against the first normal form (1NF) are the lack of a Primary Key and the use of "repeating columns." This is where multiple values of the same type are stored in multiple columns. Take, for example, a database used by a company's order system. If the order items were implemented as multiple columns in the Orders table, the database would not be 1NF:

OrderNo   Line1Item   Line1Qty   Line1Price   Line2Item   Line2Qty   Line2Price
245       PN768       1          $35          PN656       3          $15

To make this first normal form, we would have to create a child entity of Orders (Order Items) where we would store the information about the line items on the order. Each order could then have multiple Order Items related to it.

Orders:

OrderNo
245

Order Items:

OrderNo   Item    Qty   Price
245       PN768   1     $35
245       PN656   3     $15

Second Normal Form
Reduce first normal form entities to second normal form (2NF) by removing attributes that are not dependent on the whole primary key.

The purpose here is to make sure that each column is defined in the correct table. Using the more formal names may make this a little clearer. Make sure each attribute is kept with the entity that it describes.

Consider the Order Items table that we established above. If we place Customer reference in the Order Items table (Order Number, Line Item Number, Item, Qty, Price, Customer) and assume that we use Order Number and Line Item Number as the Primary Key, it quickly becomes obvious that the Customer reference becomes repeated in the table because it is only dependent on a portion of the Primary Key - namely the Order Number. Therefore, it is defined as an attribute of the wrong entity. In such an obvious case, it should be immediately clear that the Customer reference should be in the Orders table, not the Order Items table.

So instead of:

OrderNo   ItemNo   Customer    Item    Qty   Price
245       1        SteelCo     PN768   1     $35
245       2        SteelCo     PN656   3     $15
246       1        Acme Corp   PN371   1     $2.99
246       2        Acme Corp   PN015   7     $5

We get:

Orders:

OrderNo   Customer
245       SteelCo
246       Acme Corp

Order Items:

OrderNo   ItemNo   Item    Qty   Price
245       1        PN768   1     $35
245       2        PN656   3     $15
246       1        PN371   1     $2.99
246       2        PN015   7     $5

Third Normal Form
Reduce second normal form entities to third normal form (3NF) by removing attributes that depend on other, nonkey attributes (other than alternative keys).

This basically means that we shouldn't store any data that can either be derived from other columns or belongs in another table. As an example of derived data, if our Order Items table includes Unit Price, Quantity, and Extended Price, the table is not 3NF. So we would remove the Extended Price (= Qty * Unit Price), unless, of course, the value saved is a manually modified (rebate) price while the Unit Price reflects the quoted list price for the items at the time of order.

Also, when we established that the Customer reference did not belong in the Order Items table, we said to move it to the Orders table. Now if we included customer information, such as company name, address, etc., in the Orders table, we would see that this information is dependent not so much on the Order per se, but on the Customer reference, which is a nonkey (not Primary Key) column in the Orders table. Therefore, we need to create another table (Customers) to hold information about the customer. Each Customer could then have multiple Orders related to it.

Orders (not 3NF):

OrderNo   Customer    Address       City
245       SteelCo     Works Blvd    Vinings
246       Acme Corp   North Drive   South Bend
247       SteelCo     Works Blvd    Vinings

becomes:

Orders:

OrderNo   Customer
245       SteelCo
246       Acme Corp
247       SteelCo

Customers:

Customer    Address       City
SteelCo     Works Blvd    Vinings
Acme Corp   North Drive   South Bend

Why Stop Here?

Many database designers stop at 3NF, and those first three levels of normalization do provide the most bang for the buck. Indeed, these were the original normal forms described in E. F. Codd's first papers. However, there are currently four additional levels of normalization, so read on. Be aware of what you are not doing, even if you stop with 3NF. In some cases, you may even need to denormalize for performance reasons.

Boyce/Codd Normal Form
Reduce third normal form entities to Boyce/Codd normal form (BCNF) by ensuring that they are in third normal form for any feasible choice of candidate key as primary key.

In short, Boyce/Codd normal form (BCNF) addresses dependencies between columns that are part of a Candidate Key.

Some of the normalizations performed above may depend on our choice of the Primary Key. BCNF addresses those cases where applying the normalization rules to a Candidate Key other than the one chosen as the Primary Key would give a different result. In actuality, if we substitute any Candidate Key for the Primary Key in 2NF and 3NF, 3NF would be equivalent to BCNF.

In a way, the BCNF is only necessary because the formal definitions center around the Primary Key rather than an entity item abstraction. If we define an entity item as an object or information instance that correlates to a row, and consider the normalization rules to refer to entity items, this normal form would not be required.

In our example for 2NF above, we assumed that we used a composite Primary Key consisting of Order Number and Line Item Number, and we showed that the customer reference was only dependent on a portion of the Primary Key - the Order Number. If we had assigned a unique identifier to every Order Item independent of the Order Number, and used that as a single column Primary Key, the normalization rule itself would not have made it clear that it was necessary to move the Customer reference.

There are some less obvious situations for this normalization rule where a set of data actually contains more than one relation, which the following example should illustrate.

Consider a scenario of a large development organization, where the projects are organized in project groups, each with a team leader acting as a liaison between the overall project and a group of developers in a matrix organization. Assume we have the following situation:

  • Each Project can have many Developers.
  • Each Developer can have many Projects.
  • For a given Project, each Developer only works for one Lead Developer.
  • Each Lead Developer only works on one Project.
  • A given Project can have many Lead Developers.

In this case, we could theoretically design a table in two different ways:

ProjectNo   Developer   Lead Developer
20020123    John Doe    Elmer Fudd
20020123    Jane Doe    Sylvester
20020123    Jimbo       Elmer Fudd
20020124    John Doe    Ms. Depesto

Case 1: Project Number and Developer form a Candidate Key that can be used to determine the Lead Developer. In this case, the Lead Developer depends on both attributes of the key, and the table is 3NF if we consider that our Primary Key.

Lead Developer   Developer   ProjectNo
Elmer Fudd       John Doe    20020123
Sylvester        Jane Doe    20020123
Elmer Fudd       Jimbo       20020123
Ms. Depesto      John Doe    20020124

Case 2: Lead Developer and Developer form another Candidate Key, but in this case, the Project Number is determined by the Lead Developer alone. Thus the table would not be 3NF if we considered that our Primary Key.

In reality, these three data items contain more than one relation (Project - Lead Developer and Lead Developer - Developer). To normalize to BCNF, we would remove the second relation and represent it in a second table. (This also illustrates why a table is formally named a relation.)

ProjectNo   Lead Developer
20020123    Elmer Fudd
20020123    Sylvester
20020124    Ms. Depesto

Lead Developer   Developer
Elmer Fudd       John Doe
Elmer Fudd       Jimbo
Sylvester        Jane Doe
Ms. Depesto      John Doe

Fourth Normal Form
Reduce Boyce/Codd normal form entities to fourth normal form (4NF) by removing any independently multivalued components of the primary key to two new parent entities. Retain the original (now child) entity only if it contains other, nonkey attributes.

Where BCNF deals with dependents of dependents, 4NF deals with multiple, independent dependents of the Primary Key. This is a bit easier to illustrate.

Let us say we wanted to represent the following data: Manager, Manager Awards, and Direct Reports. Here, a Manager could have multiple Awards, as well as multiple Direct Reports. 4NF requires that these be split into two separate tables, one for Manager - Awards, and one for Manager - Direct Reports. We may need to maintain a Managers table for other Manager attributes.

This table:

Manager          Awards               Direct Reports
Scrooge McDuck   Stingy John          Donald Duck
Minnie Mouse     Mouse of the Month   Mickey Mouse
Minnie Mouse     Mouse of the Year    Pluto
Clara                                 Goofy

becomes two tables:

Manager Awards Table

Manager          Awards
Scrooge McDuck   Stingy John
Minnie Mouse     Mouse of the Month
Minnie Mouse     Mouse of the Year
Clara

Direct Reports Table

Manager          Direct Reports
Scrooge McDuck   Donald Duck
Minnie Mouse     Mickey Mouse
Minnie Mouse     Pluto
Clara            Goofy

Fifth Normal Form
Reduce fourth normal form entities to fifth normal form (5NF) by removing pairwise cyclic dependencies (appearing within composite primary keys with three or more component attributes) to three or more parent entities.

This addresses problems that arise from representing associations between multiple entities with interdependencies. Making it 5NF consists of adding parent tables, one for each meaningful combination that has children in the original table.

A table with such information is 5NF if the information cannot be represented in multiple smaller entities alone.

An example of such a situation may be the representation of Actors, Plays, and Theaters. In order to know who plays what and where, we need the combination of these three attributes. However, they each relate to each other cyclically. So to resolve this, we would need to establish parent tables with Actor - Play, Play - Theater, and Theater - Actor. These would each contain a portion of the Primary Key in the Actor, Play, and Theater table.

Actor       Play                 Theater
Billy Bob   Catcher in the Rye   West 42nd
Ann         Catcher in the Rye   West 42nd
John        Catch-22             Broadway
Lily        Hamlet               Broadway
Lisa        Cats                 West 42nd
Andy        Cats                 Darlington
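Derived from the rows above, the three parent tables would look like this:

Actor - Play:

Actor       Play
Billy Bob   Catcher in the Rye
Ann         Catcher in the Rye
John        Catch-22
Lily        Hamlet
Lisa        Cats
Andy        Cats

Play - Theater:

Play                 Theater
Catcher in the Rye   West 42nd
Catch-22             Broadway
Hamlet               Broadway
Cats                 West 42nd
Cats                 Darlington

Theater - Actor:

Theater      Actor
West 42nd    Billy Bob
West 42nd    Ann
Broadway     John
Broadway     Lily
West 42nd    Lisa
Darlington   Andy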

Domain Key Normal Form
Domain/key normal form (DKNF) is a normal form used in database normalization which requires that the database contain no constraints other than domain constraints and key constraints.

A domain constraint specifies the permissible values for a given attribute, while a key constraint specifies the attributes that uniquely identify a row in a given table.

The domain/key normal form is often described as the Holy Grail of relational database design. It is achieved when every constraint on the relation is a logical consequence of the definition of keys and domains, so that enforcing the key and domain constraints causes all other constraints to be met. Thus, it avoids all non-temporal anomalies.

It is much easier to build a database in domain/key normal form than it is to convert lesser databases, which may contain numerous anomalies. However, successfully building a domain/key normal form database remains a difficult task, even for experienced database programmers, so while DKNF eliminates the problems found in most databases, it tends to be the most costly normal form to achieve. On the other hand, failing to achieve it may carry long-term, hidden costs due to anomalies that appear over time in databases adhering only to lower normal forms.

A violation of DKNF occurs in the following table:

Wealthy Person   Wealthy Person Type     Net Worth in Dollars
Steve            Eccentric Millionaire   124,543,621
Roderick         Evil Billionaire        6,553,228,893
Katrina          Eccentric Billionaire   8,829,462,998
Gary             Evil Millionaire        495,565,211

(Assume that the domain for Wealthy Person consists of the names of all wealthy people in a pre-defined sample of wealthy people; the domain for Wealthy Person Type consists of the values 'Eccentric Millionaire', 'Eccentric Billionaire', 'Evil Millionaire', and 'Evil Billionaire'; and the domain for Net Worth in Dollars consists of all integers greater than or equal to 1,000,000.)

There is a constraint linking Wealthy Person Type to Net Worth in Dollars, even though we cannot deduce one from the other. The constraint dictates that an Eccentric Millionaire or Evil Millionaire will have a net worth of 1,000,000 to 999,999,999 inclusive, while an Eccentric Billionaire or Evil Billionaire will have a net worth of 1,000,000,000 or higher. This constraint is neither a domain constraint nor a key constraint; therefore, we cannot rely on domain constraints and key constraints to guarantee that an inconsistent Wealthy Person Type / Net Worth in Dollars combination does not make its way into the database.

The DKNF violation could be eliminated by altering the Wealthy Person Type domain to make it consist of just two values, 'Evil' and 'Eccentric' (the wealthy person's status as a millionaire or billionaire is implicit in their Net Worth in Dollars, so no useful information is lost).
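With that change, the same sample rows would look like this, and the millionaire/billionaire distinction follows directly from the Net Worth in Dollars domain:

Wealthy Person   Wealthy Person Type   Net Worth in Dollars
Steve            Eccentric             124,543,621
Roderick         Evil                  6,553,228,893
Katrina          Eccentric             8,829,462,998
Gary             Evil                  495,565,211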

DKNF is frequently difficult to achieve in practice.

Conclusion

While we may not always observe all the rules or normalize our databases to the fifth and domain/key normal forms, it is important to have a basic understanding of the theoretical principles of database design. It will help us not only to design normalized databases, but also to build more powerful and flexible applications. And it will help us ensure that our data remains usable. Now that we have laid the theoretical foundation and defined the formal database design methods for normalization, it may be time to take a break. I need one anyway. :->

When we resume with Part 2 of this article, I will show how we can design a fairly well normalized database using nothing but some common sense, a few simple rules, and a piece of string, so get ready! Part 2 of this article will address a different approach to designing the database, normalization through synthesis, and will describe the SQL language.

Oh, and never mind the piece of string. :->
