Tuesday, February 12, 2008

Theoretical Foundation for Relational Databases

Introduction

An important aspect of most every business is record keeping. In our information society, this has become an important aspect of business, and much of the world's computing power is dedicated to maintaining and using databases.

Databases of all kinds pervade almost every business. All kinds of data, from emails and contact information to financial data and records of sales, are stored in some form of a database. The quest is on for meaningful storage of less-structured information, such as subject knowledge.

This is the first of a two-part article that will provide an introduction to relational databases and the SQL language. This first part describes some of the key elements of the technology with an emphasis on database normalization. The second part will describe a less theoretical approach to database design, as well as provide an introduction to the SQL language.

History

The concept of relational databases was first described by Edgar Frank Codd (almost exclusively referenced as E. F. Codd in technical literature) in the IBM research report RJ599, dated August 19th, 1969.1 However, the article that is usually considered the cornerstone of this technology is "A Relational Model of Data for Large Shared Data Banks," published in Communications of the ACM(Vol. 13, No. 6, June 1970, pp. 377-87). Only the first part of the article is available online.

Additional articles by E. F. Codd throughout the 1970s and 80s are still considered gospel for relational database implementations. His famous "Twelve Rules for Relational Databases"2 were published in two Computerworld articles "Is Your DBMS Really Relational?" and "Does Your DBMS Run By the Rules?" on October 14, 1985, and October 21, 1985, respectively. He has since expanded on the 12 rules, and they now number 333, as published in his book "The Relational Model for Database Management, Version 2" (Addison -Wesley, 1990).

Codd's twelve rules call for a language that can be used to define, manipulate, and query the data in the database, expressed as a string of characters. The language, SQL, was originally developed in the research division of IBM (initially at Yorktown Heights, N.Y., and later at San Jose, Calif., and Raymond Boyce and Donald Chamberlin were the original designers.)3 and has been adopted by all major relational database vendors. The name SQL originally stood for Structured Query Language. The first commercially available implementation of the language was named SEQUEL (for Sequential English QUEry Language) and was part of IBM's SEQUEL/DS product. The name was later changed for legal reasons. Thus, many long-time database developers use the pronunciation "see-quell."

SQL has been adopted as an ANSI/ISO standard. Although revised in 1999 (usually referenced as SQL99 or SQL3), most vendors are still not fully compliant with the 1992 version of the standard. The 1992 standard is smaller and simpler to reference for a user, and since only some of the 1999-specific requirements are typically implemented at this time, it may be a better starting point for learning the language.

Need-to-Know for the Database Developer

The Rules of the Game

Codd's Twelve Rules

Many references to the twelve rules include a thirteenth rule - or rule zero:
A relational database management system (DBMS) must manage its stored data using only its relational capabilities.
This is basically a corollary or companion requirement to rule #4

  1. Information Rule
    All information in the database should be represented in one and only one way -- as values in a table.
  2. Guaranteed Access Rule
    Each and every datum (atomic value) is guaranteed to be logically accessible by resorting to a combination of table name, primary key value, and column name.
  3. Systematic Treatment of Null Values
    Null values (distinct from empty character string or a string of blank characters and distinct from zero or any other number) are supported in the fully relational DBMS for representing missing information in a systematic way, independent of data type.
  4. Dynamic Online Catalog Based on the Relational Model
    The database description is represented at the logical level in the same way as ordinary data, so authorized users can apply the same relational language to its interrogation as they apply to regular data.
  5. Comprehensive Data Sublanguage Rule
    A relational system may support several languages and various modes of terminal use. However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings and whose ability to support all of the following is comprehensible:
    a. data definition
    b. view definition
    c. data manipulation (interactive and by program)
    d. integrity constraints
    e. authorization
    f. transaction boundaries (begin, commit, and rollback).
  6. View Updating Rule
    All views that are theoretically updateable are also updateable by the system.
  7. High-Level Insert, Update, and Delete
    The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data, but also to the insertion, update, and deletion of data.
  8. Physical Data Independence
    Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representation or access methods.
  9. Logical Data Independence
    Application programs and terminal activities remain logically unimpaired when information preserving changes of any kind that theoretically permit unimpairment are made to the base tables.
  10. Integrity Independence
    Integrity constraints specific to a particular relational database must be definable in the relational data sublanguage and storable in the catalog, not in the application programs.
  11. Distribution Independence
    The data manipulation sublanguage of a relational DBMS must enable application programs and terminal activities to remain logically unimpaired whether and whenever data are physically centralized or distributed.
  12. Nonsubversion Rule
    If a relational system has or supports a low-level (single-record-at-a-time) language, that low-level language cannot be used to subvert or bypass the integrity rules or constraints expressed in the higher-level (multiple-records-at-a-time) relational language.

(Author's Note: I do not have access to E.F. Codd's original text, but the above wording appears to be commonly used in other documents. See source 2.)

The rules primarily address implementation requirements for relational database management system (RDBMS) vendors. However, some of them also have an impact on application design.

In the early days of relational database products, these twelve rules were often used to evaluate RDBMSs. In the academic community, the discussions of full compliance of RDBMS versus the Relational Model continues, as does the discussion about whether the SQL language satisfies all the requirements. But we will stick to more practical matters.

How to Design the Database

Theoretical Foundation

Purpose
The problem with data is that it changes. Not just its individual items' values change, but their structure and use, especially when kept over extended periods of time. Even for public records that may have been kept for hundreds of years, there are occasionally changes in what data elements are captured and recorded and how.

Therefore, a method to avoid problems due to duplication of data values and modification of structure and content has been developed. This method is called normalization.

You normalize a database in order to ensure data consistency and stability, to minimize data redundancy, and to ensure consistent updatability and maintainability of the data, and avoid update and delete anomalies that result in ambiguous data or inconsistent results.

Some Key Concepts
Before we continue, I would like to show the correlation between the formal names of Tables, Rows, and Columns in Relational Theory and their more common counterparts:

Formal Name

Common Name

Also Known As

Relation

Table

Entity

Tuple

Row

Record

Attribute

Column

Field

A Primary Key is one or more columns whose values uniquely identify a row in a table (See rule #2 above).

A Candidate Key is one or more columns whose values could be used to uniquely identify a row in a table. The Primary Key is chosen among a table's Candidate Keys.

Normalization
Normalization is the formalization of the design process of making a database compliant with the concept of a Normal Form. It addresses various ways in which we may look for repeating data values in a table. There are several levels of the Normal Form, and each level requires that the previous level be satisfied. I have used the wording (indicated in italicized text) for each normalization rule from the Handbook of Relational Database Design by Candace C. Fleming and Barbara von Halle.4

The normalization process is based on collecting an exhaustive list of all data items to be maintained in the database and starting the design with a few "superset" tables. Theoretically, it may be possible, although not very practical, to start by placing all the attributes in a single table. For best results, start with a reasonable breakdown.

First Normal Form
Reduce entities to first normal form (1NF) by removing repeating or multivalued attributes to another, child entity.

Basically, make sure that the data is represented as a (proper) table. While key to the relational principles, this is somewhat a motherhood statement. However, there are six properties of a relational table (the formal name for "table" is "relation"):

Property 1: Entries in columns are single-valued.
Property 2: Entries in columns are of the same kind.
Property 3: Each row is unique.
Property 4: Sequence of columns is insignificant.
Property 5: Sequence of rows is insignificant.
Property 6: Each column has a unique name.

The most common sins against the first normal form (1NF) are the lack of a Primary Key and the use of "repeating columns." This is where multiple values of the same type are stored in multiple columns. Take, for example, a database used by a company's order system. If the order items were implemented as multiple columns in the Orders table, the database would not be 1NF:

OrderNo

Line1Item

Line1Qty

Line1Price

Line2Item

Line2Qty

Line2Price

245

PN768

1

$35

PN656

3

$15

To make this first normal form, we would have to create a child entity of Orders (Order Items) where we would store the information about the line items on the order. Each order could then have multiple Order Items related to it.

OrderNo

245

OrderNo

Item

Qty

Price

245

PN768

1

$35

245

PN656

3

$15

Second Normal Form
Reduce first normal form entities to second normal form (2NF) by removing attributes that are not dependent on the whole primary key.

The purpose here is to make sure that each column is defined in the correct table. Using the more formal names may make this a little clearer. Make sure each attribute is kept with the entity that it describes.

Consider the Order Items table that we established above. If we place Customer reference in the Order Items table (Order Number, Line Item Number, Item, Qty, Price, Customer) and assume that we use Order Number and Line Item Number as the Primary Key, it quickly becomes obvious that the Customer reference becomes repeated in the table because it is only dependent on a portion of the Primary Key - namely the Order Number. Therefore, it is defined as an attribute of the wrong entity. In such an obvious case, it should be immediately clear that the Customer reference should be in the Orders table, not the Order Items table.

So instead of:

OrderNo

ItemNo

Customer

Item

Qty

Price

245

1

SteelCo

PN768

1

$35

245

2

SteelCo

PN656

3

$15

246

1

Acme Corp

PN371

1

$2.99

246

2

Acme Corp

PN015

7

$5

We get:

OrderNo

Customer

245

SteelCo

246

Acme Corp

OrderNo

ItemNo

Item

Qty

Price

245

1

PN768

1

$35

245

2

PN656

3

$15

246

1

PN371

1

$2.99

246

2

PN015

7

$5

Third Normal Form
Reduce second normal form entities to third normal form (3NF) by removing attributes that depend on other, nonkey attributes (other than alternative keys).

This basically means that we shouldn't store any data that can either be derived from other columns or belong in another table. Again, as an example of derived data, if our Order Items table includes both Unit Price, Quantity, and Extended Price, the table would not be 3NF. So we would remove the Extended Price (= Qty * Unit Price), unless, of course, the value saved is a manually modified (rebate) price, but the Unit Price reflects the quoted list price for the items at the time of order.

Also, when we established that the Customer reference did not belong in the Order Items table, we said to move it to the Orders table. Now if we included customer information, such as company name, address, etc., in the Orders table, we would see that this information is dependent not so much on the Order per se, but on the Customer reference, which is a nonkey (not Primary Key) column in the Orders table. Therefore, we need to create another table (Customers) to hold information about the customer. Each Customer could then have multiple Orders related to it.

OrderNo

Customer

Address

City

245

SteelCo

Works Blvd

Vinings

246

Acme Corp

North Drive

South Bend

247

SteelCo

Works Blvd

Vinings

OrderNo

Customer

245

SteelCo

246

Acme Corp

247

SteelCo

Customer

Address

City

SteelCo

Works Blvd

Vinings

Acme Corp

North Drive

South Bend

Why Stop Here?

Many database designers stop at 3NF, and those first three levels of normalization do provide the most bang for the buck. Indeed, these were the original normal forms described in E. F. Codd's first papers. However, there are currently four additional levels of normalization, so read on. Be aware of what you don't do, even if you stop with 3NF. In some cases, you may even need to de-normalize some for performance reasons.

Boyce/Codd Normal Form
Reduce third normal form entities to Boyce/Codd normal form (BCNF) by ensuring that they are in third normal form for any feasible choice of candidate key as primary key.

In short, Boyce/Codd normal form (BCNF) addresses dependencies between columns that are part of a Candidate Key.

Some of the normalizations performed above may depend on our choice of the Primary Key. BCNF addresses those cases where applying the normalization rules to a Candidate Key other than the one chosen as the Primary Key would give a different result. In actuality, if we substitute any Candidate Key for Primary Key in 2NF and 3NF, 3NF would be equivalent with BCNF.

In a way, the BCNF is only necessary because the formal definitions center around the Primary Key rather than an entity item abstraction. If we define an entity item as an object or information instance that correlates to a row, and consider the normalization rules to refer to entity items, this normal form would not be required.

In our example for 2NF above, we assumed that we used a composite Primary Key consisting of Order Number and Line Item Number, and we showed that the customer reference was only dependent on a portion of the Primary Key - the Order Number. If we had assigned a unique identifier to every Order Item independent of the Order Number, and used that as a single column Primary Key, the normalization rule itself would not have made it clear that it was necessary to move the Customer reference.

There are some less obvious situations for this normalization rule where a set of data actually contains more than one relation, which the following example should illustrate.

Consider a scenario of a large development organization, where the projects are organized in project groups, each with a team leader acting as a liaison between the overall project and a group of developers in a matrix organization. Assume we have the following situation:

  • Each Project can have many Developers.
  • Each Developer can have many Projects.
  • For a given Project, each Developer only works for one Lead Developer.
  • Each Lead Developer only works on one Project.
  • A given Project can have many Lead Developers.

In this case, we could theoretically design a table in two different ways:

ProjectNo

Developer

Lead Developer

20020123

John Doe

Elmer Fudd

20020123

Jane Doe

Sylvester

20020123

Jimbo

Elmer Fudd

20020124

John Doe

Ms. Depesto

Case 1: Project Number and Developer as a Candidate Key can be used to determine the Lead Developer. In this case, the Lead Developer depends on both attributes of the key, and the table is 3NF if we consider that our Primary Key.

Lead Developer

Developer

ProjectNo

Elmer Fudd

John Doe

20020123

Sylvester

Jane Doe

20020123

Elmer Fudd

Jimbo

20020123

Ms. Depesto

John Doe

20020124

Case 2: Lead Developer and Developer is another Candidate Key, but in this case, the Project Number is determined by the Lead Developer alone. Thus it would not be 3NF if we consider that our Primary Key.

In reality, these three data items contain more than one relation (Project - Lead Developer and Lead Developer - Developer). To normalize to BCNF, we would remove the second relation and represent it in a second table. (This also illustrates why a table is formally named a relation.)

ProjectNo

Lead Developer

20020123

Elmer Fudd

20020123

Sylvester

20020123

Elmer Fudd

20020124

Ms. Depesto

Lead Developer

Developer

Elmer Fudd

John Doe

Elmer Fudd

Jimbo

Sylvester

Jane Doe

Ms. Depesto

John Doe

Fourth Normal Form
Reduce Boyce/Codd normal form entities to fourth normal form (4NF) by removing any independently multivalued components of the primary key to two new parent entities. Retain the original (now child) entity only if it contains other, nonkey attributes.

Where BCNF deals with dependents of dependents, 4NF deals with multiple, independent dependents of the Primary Key. This is a bit easier to illustrate.

Let us say we wanted to represent the following data: Manager, Manager Awards, and Direct Reports. Here, a Manager could have multiple Awards, as well as multiple Direct Reports. 4NF requires that these be split into two separate tables, one for Manager - Awards, and one for Manager - Direct Reports. We may need to maintain a Managers table for other Manager attributes.

This table:

Manager

Awards

Direct Reports

Scrooge McDuck

Stingy John

Donald Duck

Minnie Mouse

Mouse of the Month

Mickey Mouse

Minnie Mouse

Mouse of the Year

Pluto

Clara

Goofy

becomes two tables:

Manager Awards Table

Manager

Awards

Scrooge McDuck

Stingy John

Minnie Mouse

Mouse of the Month

Minnie Mouse

Mouse of the Year

Clara

Direct Reports Table

Manager

Direct Reports

Scrooge McDuck

Donald Duck

Minnie Mouse

Mickey Mouse

Minnie Mouse

Pluto

Clara

Goofy

Fifth Normal Form
Reduce fourth normal form entities to fifth normal form (5NF) by removing pairwise cyclic dependencies (appearing within composite primary keys with three or more component attributes) to three or more parent entities.

This addresses problems that arise from representing associations between multiple entities with interdependencies. Making it 5NF consists of adding parent tables, one for each meaningful combination that has children in the original table.

A table with such information is 5NF if the information cannot be represented in multiple smaller entities alone.

An example of such a situation may be the representation of Actors, Plays, and Theaters. In order to know who plays what and where, we need the combination of these three attributes. However, they each relate to each other cyclically. So to resolve this, we would need to establish parent tables with Actor - Play, Play - Theater, and Theater - Actor. These would each contain a portion of the Primary Key in the Actor, Play, and Theater table.

Actor

Play

Theater

Billy Bob

Catcher in the Rye

West 42nd

Ann

Catcher in the Rye

West 42nd

John

Catch-22

Broadway

Lily

Hamlet

Broadway

Lisa

Cats

West 42nd

Andy

Cats

Darlington

Domain Key Normal Form
(Domain/key normal form (DKNF) is a normal form used in database normalization which requires that the database contains no constraints other than domain constraints and key constraints.

A domain constraint specifies the permissible values for a given attribute, while a key constraint specifies the attributes that uniquely identify a row in a given table.

The domain/key normal form is the Holy Grail of relational database design,[citation needed] achieved when every constraint on the relation is a logical consequence of the definition of keys and domains, and enforcing key and domain restraints and conditions causes all constraints to be met. Thus, it avoids all non-temporal anomalies.

It's much easier to build a database in domain/key normal form than it is to convert lesser databases which may contain numerous anomalies. However, successfully building a domain/key normal form database remains a difficult task, even for experienced database programmers. Thus, while the domain/key normal form eliminates the problems found in most databases, it tends to be the most costly normal form to achieve. However, failing to achieve the domain/key normal form may carry long-term, hidden costs due to anomalies which appear in databases adhering only to lower normal forms over time.

A violation of DKNF occurs in the following table:

Wealthy Person

Wealthy Person

Wealthy Person Type

Net Worth in Dollars

Steve

Eccentric Millionaire

124,543,621

Roderick

Evil Billionaire

6,553,228,893

Katrina

Eccentric Billionaire

8,829,462,998

Gary

Evil Millionaire

495,565,211

(Assume that the domain for Wealthy Person consists of the names of all wealthy people in a pre-defined sample of wealthy people; the domain for Wealthy Person Type consists of the values 'Eccentric Millionaire', 'Eccentric Billionaire', 'Evil Millionaire', and 'Evil Billionaire'; and the domain for Net Worth in Dollars consists of all integers greater than or equal to 1,000,000.)

There is a constraint linking Wealthy Person Type to Net Worth in Dollars, even though we cannot deduce one from the other. The constraint dictates that an Eccentric Millionaire or Evil Millionaire will have a net worth of 1,000,000 to 999,999,999 inclusive, while an Eccentric Billionaire or Evil Billionaire will have a net worth of 1,000,000,000 or higher. This constraint is neither a domain constraint nor a key constraint; therefore we cannot rely on domain contraints and key constraints to guarantee that an inconsistent Wealthy Person Type / Net Worth in Dollars combination does not make its way into the database.

The DKNF violation could be eliminated by altering the Wealthy Person Type domain to make it consist of just two values, 'Evil' and 'Eccentric' (the wealthy person's status as a millionaire or billionaire is implicit in their Net Worth in Dollars, so no useful information is lost).

DKNF is frequently difficult to achieve in practice.

Conclusion

While we may not always observe all the rules or normalize our databases to the fifth and domain key normal form, it is important to have a basic understanding of the theoretical principles of database design. It will help us not only design normalized databases, but to build more powerful and flexible applications. Also, it will help us ensure that our data remains usable. Now that we have laid the theoretical foundation and defined the formal database design methods for normalization, it may be time to take a break. I need one anyway. :->

When we resume with Part 2 of this article, I will show how we can design a fairly well normalized database using nothing but some common sense, a few simple rules, and a piece of string, so get ready! Part 2 of this article will address a different approach to designing the database, normalization through synthesis, and will describe the SQL language.

Oh, and never mind the piece of string. :->

Monday, February 11, 2008

SQL Concepts.

Introduction

Structured Query Language, commonly abbreviated to SQL and pronounced as sequel, is not a conventional computer programming language in the normal sense of the phrase. With SQL you don't write applications, utilities, batch processes, GUI interfaces or any of the other types of program for which you'd use languages such as Visual Basic, C++, Java etc. Instead SQL is a language used exclusively to create, manipulate and interrogate databases. SQL is about data and results, each SQL statement returns a result, whether that result be a query, an update to a record or the creation of a database table.

The purpose of this tutorial is to give an introduction to the key concepts of SQL, and to provide sample code illustrating the use of the most common and useful of the SQL commands. The tutorial is structured around the building of a simple database of software and network users.

Relational Databases

Before looking at SQL itself it's worth looking more closely at what is meant by a database. Although some people talk about SQL databases, there is in fact no such thing. SQL itself makes no references to the underlying databases which it can access, which means that it is possible to have a SQL engine which can address relational databases, non-relational (flat) databases, even spreadsheets. However, SQL is most often used to address a relational database, which is what some people refer to as a SQL database.

The relational model of a database was first defined by Dr E F Codd in 1970, working at the IBM Research Labs at San Jose, and it describes a way of structuring data which is mathematically consistent and abstracted away from the physical implementation of the database.

The key concept of the relational model is that data sits in tables, and that data elements within different tables can be related in some way to provide meaningful information rather than just lists of data. Each table in the database has a number of fields which define the content and a number of instances of that field - think of the fields as columns in a table and the instances as rows. For example if we had a table of LAN users, then it might have fields of first name, surname and userID. The rows in such a table would contain the values which define each instance of user, for example the entry to define user John Smith might be: John, Smith, JSmith.

A single table of LAN users does not provide very much, even though it is more useful than no table at all. However the user table becomes immediately more useful when there are other tables to which it can be linked dynamically. Imagine that there is also a table of software products, with fields of Product, Version, User Licence and Product Code. Again this is useful information, especially for inventory purposes, but its value is magnified when there is a relationship between the user table and the software table. For example a link between these tables could be used to record which users are licensed for which software products. From this it would be possible to find out how many users are licensed to use product X, or which products user Y is using. Add fields of department to the user table, for example, and it's possible to derive the relationship between departmental software requirements and the software that is licensed and so on.

Obviously there is much more to the relational model then this, but as far as SQL is concerned it is the concept of tables, fields and relationships which is important.

Designing A Database

Finally, we're almost ready to start looking at some SQL code. The first place to start is in the creation of our sample database. In every kind of database project, the first step is to design the database on paper rather than jumping straight in with the code. One of the aims of this example database is to record the software that is installed on a network and to link this to users and to their departments. Additionally, building on the inventory side of things, it will also keep track of the PC hardware we have installed.

Based on these requirements four tables are required:

  • User - to record users
  • Software - table for the installed software products
  • UserSoftware - to record the Software used by each user (more on why we need this table later)
  • PC - table to record the PC details of each user

Designing each table is a fairly straightforward process of deciding what information is required for each element in the table. For the User we obviously need First Name, Last Name and Department, but we have to beware of having two users with the same name, so we need to be able to differentiate between them. One way to do this would be to use some kind of unique UserID, an email address or network login, for example. An alternative would be to either create a unique code for each user, or else to use some pre-existing unique identifier, such as an ID from a payroll application. In this example a field called EmpNo will be used to contain this unique identifier, which we shall also assume is an integer rather than a mix of alphanumerics. Finally, we want to record the type of PC for each user, so we'll add a field called PCType. Our User table therefore looks like this:

  • FirstName - First Name - a text string
  • LastName - Last Name - a text string
  • UserID - Network User ID - a text string
  • Dept - Department - a text string
  • EmpNo - Employee Number from Payroll system - an integer
  • PCType - Unique identifier for type of PC - a text string

The Software table has the following fields:

  • ProductName - Name of Product - a text string
  • Vers - Product Version - a text string
  • ProductID - Unique identifier for product - a text string
  • Licenses - Number of users licensed - an integer
  • DateOfPurchase - Purchase data from invoicing system - a date
  • Cost - PPersecutors of product -a currency code

Our table of software used by each user we'll call UserSoftware. In the case of the User table a code for this was included for the PCType, why not do the same with the Software? Because each user is likely to have many pieces of software, and each piece of software is likely to have many different users. It therefore makes sense to create a separate table to record this many-to-many relationship. For this simple table we need fields only for EmpNo - to uniquely identify the user - and ProductID to uniquely identify the software.

Our final table stores the data for the different PCs, and provides the value to enter into the PCType field of the User table.

  • PCTypeName - a text string
  • Processor - CPU Processor - a text string
  • OpSys - Operating system name - a text string
  • RAM - Mb of RAM - an integer
  • Disk - Gb of hard disk - an floating point number
  • ScreenSize - Screen size in inches - a text string
  • DateOfPurchase - Purchase date from invoicing system - a date code
  • Cost - Cost of purchase -a currency code

A Word About Datatypes

Many kinds of data can be stored in a database, from simple text to various types of number to Boolean flags to binary objects and graphics. This is one area where there are differences between different products. Not just in the formats of data that can be stored, but also in how these formats are named. Boolean fields, for example, are called Yes/No fields in Microsoft Access, and BOOLEAN in the later releases of MySQL. This is one of the areas where you have to refer to the documentation of whatever database platform you are using. In this tutorial datatypes (or column types as they are also known), will conform as closely as possible to those which are most commonly supported on the majority of platforms.

Creating a Database

Many database systems have graphical interfaces which allow developers (and users) to create, modify and otherwise interact with the underlying database management system (DBMS). However, for the purposes of this tutorial all interactions with the DBMS will be via SQL commands rather than via menus, wizards or any of the other tools which sit between the developer and the database. This is, after all, a SQL tutorial…It is assumed, therefore, that you have access to a DBMS system that allows the use of SQL commands directly.

SQL commands follow a number of basic rules. SQL keywords are not normally case sensitive, though this in this tutorial all commands (SELECT, UPDATE etc) are upper-cased. Variable and parameter names are displayed here as lower-case. New-line characters are ignored in SQL, so a command may be all on one line or broken up across a number of lines for the sake of clarity. Many DBMS systems expect to have SQL commands terminated with a semi-colon character, a practice that is followed in this tutorial. Finally, although there is a SQL standard, actual implementaions vary by vendor/system, so if in doubt always refer to the documentation that comes with your DBMS.

Creating a database is remarkably straightforward. The SQL command is just:

CREATE DATABASE dbname;

In this example we'll call the database AssetReg, so the command is:

CREATE DATABASE AssetReg;

Once the database is created it, is possible to start implementing the design sketched out previously.

Creating Tables With SQL

Having created the database it's time to use some SQL to create the tables required by the design. Note that all SQL keywords are shown in upper case, variable names in a mixture of upper and lower case.

The SQL statement to create a table has the basic form:

CREATE TABLE name( col1 datatype, col2 datatype, …);

So, to create our User table we enter the following command:

CREATE TABLE User (FirstName TEXT, LastName TEXT, UserID TEXT, Dept TEXT, 
EmpNo INTEGER, PCType TEXT );

The TEXT datatype, supported by many of the most common DBMS, specifies a string of characters of any length. In practice there is often a default string length which varies by product. In some DBMS TEXT is not supported, and instead a specific string length has to be declared. Fixed length strings are often called CHAR(x), VCHAR(x) or VARCHAR(x), where x is the string length. In the case of INTEGER there are often multiple flavours of integer available. Remembering that larger integers require more bytes for data storage, the choice of int size is usually a design decision that ought to be made up front.

The commands to create the other tables are:

CREATE TABLE Software (ProductName VARCHAR(15), Vers VARCHAR(10), ProductID VARCHAR(10), 
Licenses INTEGER, DateOfPurchase DATE, Cost CURRENCY);

CURRENCY is another datatype that varies by vendor/product. Consult your documentation. Where an explicit CURRENCY type is not available, there is often a DECIMAL type which can be used as a replacement.

CREATE TABLE UserSoftware (EmpNo INTEGER, ProductID VARCHAR(10));
CREATE TABLE PC (PCTypeName VARCHAR(20), Processor VARCHAR(20), OpSys VARCHAR(10), 
RAM INTEGER, Disk FLOAT, ScreenSize VARCHAR(10), DateOfPurchase DATE, Cost CURRENCY);

FLOAT contains floating point numbers, a DOUBLE type often also exists for higher precision floating point numbers. DATE also comes in various flavours and formats.

There is more to the CREATE TABLE command, just as there is more to database design then we have allowed for in this example. Principally we have ignored any considerations of indexing our tables or of any kinds of constraints or data validation rules. For example you would normally want to ensure that the EmpNo field in the User is unique, that is, no two users should have the same EmpNo. Additionally, you would normally decide in advance the size of various fields, especially for text strings. SQL can code for all of these design considerations, though the syntax is likely to vary between implementations to a greater extent than the basic CREATE TABLE command.

As an example let's refine our User design so that the first name and surname fields can be no longer than 20 characters each; the userID entry can be no longer than 12 characters and that the UserID and EmpNo fields should be unique (which is what the UNIQUE keyword does for us). If you have already executed the original CREATE TABLE command your database will already contain a table called User, so let's get rid of that using the DROP command:

DROP TABLE User;

And now we'll recreate the User table we'll use throughout the rest of this tutorial:

CREATE TABLE User (FirstName VARCHAR (20), LastName VARCHAR (20), 
UserID VARCHAR(12) UNIQUE, Dept VARCHAR(20), EmpNo INTEGER UNIQUE, PCType VARCHAR(20);

Once a table is created it's structure is not necessarily fixed in stone. In time requirements change and the structure of the database is likely to evolve to match these. SQL can be used to change the structure of a table, so, for example, if we need to add a new field to our User table to tell us if the user has Internet access, then we can execute an SQL ALTER TABLE command as shown below:

ALTER TABLE User ADD COLUMN Internet BOOLEAN;

To delete a column the ADD keyword is replaced with DROP, so to delete the field we have just added the SQL is:

ALTER TABLE User DROP COLUMN Internet;

Additionally you can use the ADD and DROP commands to add or delete tables, constraints and table indexes.

Inserting Data

Having now built the structure of the database it is time to populate the tables with some data. In the vast majority of desktop database applications data entry is performed via a user interface built around some kind of GUI form. The form gives a representation of the information required for the application, rather than providing a simple mapping onto the tables. So, in this sample application you would imagine a form with text boxes for the user details, drop-down lists to select from the PC table, drop-down selection of the software packages etc. In such a situation the database user is shielded both from the underlying structure of the database and from the SQL which may be used to enter data into it. However we are going to use the SQL directly to populate the tables so that we can move on to the next stage of learning SQL.

The command to add new records to a table (usually referred to as an append query), is:

INSERT INTO target [(field1[, field2[, ...]])]
VALUES (value1[, value2[, ...]);

So, to add a User record for user Jim Jones, we would issue the following INSERT query:

INSERT INTO User (FirstName, LastName, UserID, Dept, EmpNo, PCType)
VALUES ("Jim", "Jones", "Jjones","Finance", 9, "DellDimR450");

Obviously populating a database by issuing such a series of SQL commands is both tedious and prone to error, which is another reason why database applications have front-ends. Even without a specifically designed front-end, many database systems - including MS Access - allow data entry direct into tables via a spreadsheet-like interface.

The INSERT command can also be used to copy data from one table into another. For example, if an organisation initially employs people on a 3-month trial period before making an offer for a permanent position, then it might want to store their details in a separate table. Let's call this table NewUsers and give it the same structure as the User, but with the addition of a new column called StartDate. If we also assume that at the end of the three-month trial period those people not offered a permanent position are removed from the table, leaving only those users who will become permanent employees, then we need to transfer the latter from the NewUsers table to the User. One way would be to re-key all the details into the User table, but this can lead to keying errors (this is obviously a bit of a contrived example, but it illustrates the point). A more elegant solution is to use an INSERT query to automate the task for us.

The SQL query to perform this is:

INSERT INTO User ( FirstName, LastName, UserID, Dept, EmpNo, PCType, Internet )
SELECT FirstName, LastName, UserID, Dept, EmpNo, PCType, Internet
FROM NewUsers;

Updating Data

The INSERT command is used to add records to a table, but what if you need to make an amendment to a particular record? In this case the SQL command to perform updates is the UPDATE command, with syntax:

UPDATE table
SET newvalue
WHERE criteria;

For example, let's assume that we want to move user Jim Jones from the Finance department to Marketing. Our SQL statement would then be:

UPDATE User
SET Dept="Marketing"
WHERE EmpNo=9;

Notice that we used the EmpNo field to set the criteria because we know it is unique. If we'd used another field, for example LastName, we might have accidentally updated the records for any other user with the same surname.

The UPDATE command can be used for more than just changing a single field or record at a time. The SET keyword can be used to set new values for a number of different fields, so we could have moved Jim Jones from Finance to marketing and changed the PCType as well in the same statement (SET Dept="Marketing", PCType="PrettyPC"). Or if all of the Finance department were suddenly granted Internet access then we could have issued the following SQL query:

UPDATE User
SET Internet=TRUE
WHERE Dept="Finance";

You can also use the SET keyword to perform arithmetical or logical operations on the values. For example if you have a table of salaries and you want to give everybody a 10% increase you can issue the following command:

UPDATE PayRoll
SET Salary=Salary * 1.1;

Deleting Data

Now that we know how to add new records and to update existing records it only remains to learn how to delete records before we move on to look at how we search through and collate data. As you would expect SQL provides a simple command to delete complete records. The syntax of the command is:

DELETE [table.*]
FROM table
WHERE criteria;

Let's assume we have a user record for John Doe, (with an employee number of 99), which we want to remove from our User we could issue the following query:

DELETE *
FROM User
WHERE EmpNo=99;

In practice delete operations are not handled by manually keying in SQL queries, but are likely to be generated from a front end system which will handle warnings and add safe-guards against accidental deletion of records. If you bear in mind that a single delete query can delete thousands of records in a single statement then you'll appreciate both the power and simplicity of SQL but also the dangers inherent in allowing users raw access to it.

Very often you would not want to issue DELETE commands on a single table in isolation. For example, if we delete one of the software packages in the Software table we may be left with a record in the UserSoftware table which refers to it (this record would be termed an orphan record). Where such relationships between tables occur then the delete operation needs to delete records in both tables at once. Some database systems, including MS Access, allow relationships between tables to be established including setting a cascading delete option, which means that deleting a record in one table will cause the delete of matching records in the linked tables. In such a case you need only issue the DELETE (or UPDATE) command to the primary table (in our case Software), and the deletion or change will ripple through to the linked tables (the UserSoftware table in this example).

Note that the DELETE query will delete an entire record or group of records. If you want to delete a single field or group of fields without destroying that record then use an UPDATE query and set the fields to Null to over-write the data that needs deleting. It is also worth noting that the DELETE query does not do anything to the structure of the table itself, it deletes data only. To delete a table, or part of a table, then you have to use the DROP clause of an ALTER TABLE query.

Populating The Sample Database

Rather than list a long sequence of SQL commands to populate the tables used for this article, the data is listed below and it will be referred to later on in this article. Please note that only those fields which will be of interest have been listed, for example the ScreenSize and DateOfPurchase fields in the PC table have not been included. Oh, and note the ancient technology this company is using, if you work for this type of organisation no wonder you're trawling the web looking for a free tutorial…

PC table

PCTypeName

Processor

OpSys

RAM

Disk

ASTAscP

Pentium 2

Win95

32

4.3

ASTAscC

Pentium 2

Win95

64

1.3

PilotDeskTop

Pentium Pro

Win98

64

4.3

DellDimR450

Pentium 2

Win98

128

12.9



User

FirstName

LastName

UserID

Dept

EmpNo

PCType

John

Smith

Jsmith

Marketing

23

ASTAscP

Mary

Jones

Mjones

Marketing

69

ASTAscP

Chloe

Feltham

Cfeltham

Finance

45

ASTAscC

Bharat

Patel

Bpatel

Development

12

PilotDeskTop

Terry

Jones

Tjones

Development

15

ASTAscC

Jim

Jones

Jjones

Development

9

DellDimR450

John

Smith

Jsmith2

Finance

78

DellDimR450



Software table

ProductName

Vers

ProductID

Licenses

MySQL

6

MySQL6

25

MySQL

5

MySQL5

25

OpenOffice

1

OpenOffice

10

J2SE1.3

5

J2SE1.3

10

VB6

6

VB6

10

PaintShop Pro

3

PSP3

1

MS Word

7

MSWord7

25



UserSoftware table

EmpNo

ProductID

69

MSWord7

69

OpenOffice

69

MySQL5

12

VB6

15

VB6

15

PSP3

9

VB6

9

MySQL6

Querying The Database

The Basic SELECT command

Having designed, built and populated our database with some sample data, we can now move on to what is considered the heart of SQL - the SELECT statement. This is the command that queries the database and which provides real value to any database.

Although the SQL SELECT statement can become extremely complex, the basic syntax is relatively straightforward:

 
SELECT columns
FROM table(s)
WHERE criteria;

So, to query the User to produce a list of surnames and employee numbers we could issue the command:

SELECT LastName, EmpNo
FROM User;

If we wanted to look at the entire table we could have used an asterisk as a wildcard for the entire table:

SELECT *
FROM User;

Or else you might want to catenate the LastName and FirstName fields to produce a single column with entries such as 'Jones, Terry', 'Patel, Bharat' and so on. This too is possible using the simplest of SELECT commands, though we have to create an alias column to contain the result - which in this instance we'll call FullName. SQL can then use the AS keyword to assign the result of an expression to the alias. In MSAccess the ampersand character is used to catenate the strings:

SELECT LastName & "," & FirstName AS FullName 
FROM User;

In contrast MySQL uses the CONCAT keyword to piece the strings together:

SELECT CONCAT(FirstName,",",LastName) AS FullName 
FROM User;

You can even add fixed text or expressions to SELECT queries to give results which are immediately readable, as shown in the query and results below:

SELECT FirstName & " " & Lastname & " works in the " & Dept & " department" AS Job
FROM User;

Results of query above

John Smith works in the Marketing department

Mary Jones works in the Marketing department

Chloe Feltham works in the Finance department

Bharat Patel works in the Development department

Terry Jones works in the Development department

Jim Jones works in the Marketing department

John Smith works in the Finance department

Now, what happens if we issue a simple query such as the following?

SELECT LastName
FROM User;

SQL will return a simple listing of the last names in the User, and, as you can see by looking at the data in our table, it will contain duplicates as there are three lots of Jones's listed in our database. If we want a listing of unique names then how can we generate it? SQL includes a number of 'predicates', which specifies which records the query is to return. In all of our examples so far we have assumed that we want all of the records to be returned, which is equivalent to including the ALL predicate with our statements.

If we want a listing of unique names then we can use the DISTINCT predicate to exclude those records which are not unique based on the SELECTed fields in the query. In other words the query becomes:

SELECT DISTINCT LastName
FROM User;

In other cases we may only want to exclude records where the entire record is duplicated, not just a given field or set of fields. To do this we use the DISTINCTROW predicate. In our database there are no records in the User which are duplicates, so using DISTINCTROW will return all the records in the table.

Where we are dealing with numeric data we may only be interested in a subset based on the position of the results, for example the top 10 records, or the bottom 20% and so on. The predicate for this is TOP, and differs from the previous ones in that it depends on your query to order the results so that it can pick out the relevant records.

The WHERE Clause

So far all the queries we have looked at operate on the complete contents of a table. What if you are only interested in a particular sub-set of a table? The WHERE keyword is used to provide a search mechanism which only produces records which meet a given search criterion. For example if we are only interested in our users in the Development department, we would modify one of our earlier commands as follows:

SELECT *
FROM User
WHERE Dept="Development";

We might equally have wanted to see a list of employees who do not work in development. In this case our WHERE statement can be amended to:

WHERE Dept<>"Development";

Multiple WHERE statements can be created to fine tune our queries and select very precise sub-sets of our data. For example, if we wanted to look at those employees not in development but who have Internet access, we can code the following query:

SELECT *
FROM User
WHERE Dept<>"Development" And PCType="DellDimR450";

The WHERE command can also be used with arithmetic and logical operators. For example in those systems that support a LIKE keyword, if we wanted all those employees whose surname begins with the letter J (which you might want when producing a company contact list):

SELECT *
FROM User
WHERE LastName Like "J*";

Or if your system includes a BETWEEN keyword you can select names in the group A-M:

WHERE LastName Between "A*" And "M*";

For a full list of legal WHERE criteria and keywords refer to the documentation for your particular implementation of SQL.

Sorting Query Results

Where a query might produce many records it is often useful to sort the result by a given field. SQL uses the ORDER BY keyword to do this. If we wanted to order our User by LastName, we would create the following query:

SELECT *
FROM User
ORDER BY LastName;

Or, if we wanted the table sorted by LastName within Department, the ORDER BY clause would become: ORDER BY Dept, LastName

You can also chose to have items sorted in ascending or descending order, as shown by our last ORDER BY example:

SELECT *
FROM User
Order By Dept DESC, LastName ASC;

Joining Data

What if we want to find out how much disk space the users in the development department have? Here we have data from two different tables to look and it is here that the power of the relational model becomes apparent. At this point we need to create a relationship between the User and the PC table, and this relationship is called a JOIN in SQL/relational database terminology. There are a number of different types of JOIN available, depending on the what it is you need to achieve.

The most common form of join is called an INNER JOIN, and is used to match records between tables where there are matching values in a field common to both tables. For example we have the PCType field in the User and the PCTypeName field in the PC table, and so we would want to perform an INNER JOIN on these fields. The syntax to carry this out is fairly simple:

FROM table1 INNER JOIN table2 ON table1.field1 operator table2.field2;

The operator is usually a simple '=', but other operators can be used, such as '<>', '>', '<' and so on. For the purposes of our example we want to join the tables where the User.PCType field is equal to the PC.PCTypeName field in order to extract the PC.Disk value. Our query would look as follows:

SELECT User.LastName, User.FirstName, User.PCType, PC.Disk
FROM User INNER JOIN PC ON User.PCType = PC.PCTypeName
WHERE User.Dept="Development";

What if we want to link users to software to produce a list of software packages for each user? The UserSoftware table lists the EmpNo and the ProductID code, the first of these links to the User and the second to the Software table, thus requiring two joins:

SELECT DISTINCT User.FirstName, User.LastName, Software.ProductName
FROM (UserSoftware INNER JOIN User ON UserSoftware.EmpNo = User.EmpNo) 
INNER JOIN Software ON UserSoftware.ProductID = Software.ProductID;

This generates the following result:

FirstName

LastName

ProductName

Jim

Jones

MySQL

Mary

Jones

MySQL

Mary

Jones

OpenOffice

Jim

Jones

VB6

Bharat

Patel

VB6

Terry

Jones

VB6

Terry

Jones

PaintShop Pro

Mary

Jones

MS Word

To show the PCType in the result we would also need to link in the PC table, which we can join on the field of User.PCType:

SELECT DISTINCT User.FirstName, User.LastName, Software.ProductName, PC.PCTypeName
FROM ((UserSoftware INNER JOIN User ON UserSoftware.EmpNo = User.EmpNo) 
INNER JOIN Software ON UserSoftware.ProductID = Software.ProductID) 
INNER JOIN PC ON User.PCType = PC.PCTypeName;

As you can see, the useful information available from the linked tables is far higher than it is from a simple listing of the tables themselves. By linking the different tables together we can gather together the information which is inherent in the tables but is not otherwise easily accessible.

Although the INNER JOIN is the type of join most often used, it isn't the only one. There are in fact 'outer' joins as well as an inner join. A good illustration of a useful outer join follows on from our previous examples. The table of users and software only lists those users for whom software products are recorded, but what if we want to list all of the users in our database? Clearly we would need to join the User to UserSoftware, but we would want all of the records from User listed, not just those for which there is a matching employee number. We do this using a type of outer join called a LEFT JOIN.

We can thus amend our SQL as follows:

SELECT DISTINCT User.FirstName, User.LastName, Software.ProductName
FROM (User LEFT JOIN UserSoftware ON User.EmpNo = UserSoftware.EmpNo) 
LEFT JOIN Software ON UserSoftware.ProductID = Software.ProductID;

This generates the following table of results:

FirstName

LastName

ProductName

Jim

Jones

VB6

Jim

Jones

MySQL

John

Smith

NULL

Mary

Jones

MS Word

Mary

Jones

OpenOffice

Mary

Jones

MySQL

Chloe

Feltham

NULL

Bharat

Patel

VB6

Terry

Jones

VB6

Terry

Jones

PaintShop Pro

Note that some systems report a missing value as NULL, others (including MS Access) would simply return a blank in the above result.

And what about those software products for whom there are no users? Again we need an outer join, only this time we want to include those records from the second table in the query, which is called a RIGHT JOIN.

SELECT DISTINCT User.FirstName, User.LastName, Software.ProductID
FROM (UserSoftware LEFT JOIN User ON UserSoftware.EmpNo = User.EmpNo) 
RIGHT JOIN Software ON UserSoftware.ProductID = Software.ProductID;

Aggregates

The SQL we have looked at so far has operated on single records, selecting them based on various criteria and linking them between tables. The next step is to look at how we can aggregate records so that we can calculate sums, averages, counts and so on.

One of the simplest operations to perform is a simple count of records in a table. To do this we need only add to the simplest of SELECT queries. First we create an alias to store the result, and then we use the COUNT command to count a given field - in our case EmpNo:

SELECT DISTINCT Count(User.EmpNo) AS Total
FROM User;

What if we want something a bit more complex, say a count of employees in each department. Here we want to group like records together based on a given field, in our case Dept. SQL provides the GROUP BY clause precisely for this task. The query then looks as follows:

SELECT Count(User.EmpNo) AS Total, User.Dept
FROM User
GROUP BY User.Dept;

Which generates the following table:

Total

Dept

2

Development

2

Finance

3

Marketing

Taking an average is not much more difficult. If we want to find the average Mb of RAM for the different processor types listed in the PC table we could issue the following SQL statement:

SELECT Avg(PC.RAM) AS Memory, PC.Processor
FROM PC
GROUP BY PC.Processor;

Yielding the result:

Memory

Processor

74.66

Pentium 2

64

Pentium Pro

Other aggregate functions include Min, Max, Sum etc, depending on your particular implementation of SQL. Any WHERE cause will have been applied before the GROUP BY clause takes effect, so it is easily possible to exclude certain records from the aggregate functions. This means, however, that it is not possible to use the WHERE clause to exclude some of the aggregated results. For example if you were interested in looking at the average disk size of machines for those machines not running Windows 95, we would use the WHERE clause to exclude OpSys='Win95', and then the GROUP BY clause to group by OpSys:

SELECT Avg(PC.Disk), PC.OpSys AS Disk
FROM PC
WHERE PC.OpSys<>'Win95'
GROUP BY PC.OpSys;

If we want to apply a selection process on the outcome of the GROUP BY process we have to resort to another type of selection clause. SQL uses the HAVING clause to make selections of data after the GROUP BY process has completed. For example, if we wanted a listing of the average memory per processor type but were only interested if the average were less than or equal to 64Mb, we could code a query as follows:

SELECT DISTINCTROW PC.Processor, Avg(PC.RAM) AS RAM
FROM PC
GROUP BY PC.Processor
HAVING Avg(PC.RAM) <=64;

The HAVING clause can have multiple terms, just as the WHERE clause can, and these can be linked with the normal logical operators AND, OR etc.


Thanks to : http://www.techbookreport.com/

Powered By Mushu

Powered By Mushu