Rational database design:
Getting it right the first time

ELC Syllabus - ASNR 42nd Annual Meeting

By Steven Willing, MD, MBA

Steven Willing, MD, MBA has reported no financial interest, arrangement or affiliation with a commercial organization that may have a direct or indirect influence in the subject matter of this presentation.

Contents

Databases 101

If it hasn’t happened to you, it’s happened to someone you know. Sometime, somewhere, you’ve been handed reams of information in a totally useless format. It may be on paper, a word processor file, or a hopelessly Gordian spreadsheet. Your task, or that of your colleague or secretary, is to repackage that data into a format that is useful. Or perhaps you’ve been asked to develop a database, or wanted another to develop one for you.

In this article, we will walk through the fundamentals of database development using as an example the creation of a “mini-RIS” for CT scans. The illustrations were created using Microsoft Access, but the principles and terminology apply to any standard relational database application.

What is a database? Technically, “database” refers to the stored data alone, its organization, and nothing more. In common usage, however, it is often used in reference to the entire software “wrapper” in which the data resides. That package allows creation, expansion, or modification of the database; addition, deletion, or modification of the data; storage of information about the database structure (metadata); security; and other useful operations. These operations are provided by a Database Management System (DBMS), the technically correct if cumbersome name for applications such as Microsoft Access or Filemaker Pro. The more powerful DBMS applications such as Oracle or SQL Server provide advanced functionality for programming, backup, concurrency control between multiple users, user authentication and privileges, and sharing across a network. The consumer-oriented DBMS’s are constrained by database size and number of concurrent users while the commercial applications are virtually unlimited in such respects.

Different forms of databases have been developed which differ according to their organization of the data and nature of relationships between database objects. There have been (and continue to be) network databases, object-oriented databases, and others, but the relational database has become the 800-pound gorilla of database architectures. All major database applications referred to herein use the relational model, whose structure is described below.

Anatomy of a relational database

Tables and columns

The table is the primary building block of a relational database. The concept of the table is really rather straightforward. A table is constructed by creating columns ("fields") to store the various properties for a set of related elements in the database. Unrelated, or distinctly different, elements should go into a separate table (see Normalization). Each column must have a name, and typically should be restricted to a specific data type. After the table is constructed, but before any data is entered, the table consists of columns, but no rows. After the columns are populated with data, one row (or "record") at a time, the table can be displayed as a grid, as shown in Figure 1.

As we proceed through this discussion, we will construct a simple database of CT scans in an imaginary department with two scanners.

[Figure 1]
Figure 1: A table of CT scan records is composed of columns defined at time of development and rows created with each new entry of data. Each element of data is contained within a cell.

Rows and Cells

The terms "record" and "row" are synonomous in this context. A record is the set of properties for a single elemental object in the database. In a relational database, the record is stored in a table; hence it is displayed as a row if the table is viewed in the form of a grid. More typically, some user interface will be developed to display data in a more customized and useful format.

A row is composed of cells. The cell is the most elemental unit of the relational database. Each cell contains a unique attribute for the record to which it belongs. It may contain a property of the record in the form of data, a unique identifier for the row (“key”), or a reference to a row in another table (“foreign key”). The contents of a column or a cell are also referred to as a “field” of the database. In a table comprised of many columns, not all cells may contain data for each row or entry. In that case, the column must be so defined as to allow “null” values within it. Obviously, null values will not be permitted in a field that is, or is part of, a primary key. On the opposite end, it is possible to define a default value for the cell that will be automatically assigned to every new record unless overridden at the time of record insertion.

In our imaginary CT department, Figure 1 represents an excerpt of scan activity from three days. The first column, “ID”, contains a unique identifier for each CT series representing the “key” (see below). The rightward columns represent, respectively, the date of the scan (ScanDate), time of the scan (ScanTime), which of the two scanners (Scanner), which series of the study (Series), the number of slices per series (SliceCount), patient view or position (Orientation), whether or not the scan was completed (Completed), an ID number for the referring physician (ReferringMD) and a patient ID number (Patient). Each individual box in Figure 1 is a cell. Each row represents a unique data entity (called a "tuple" in formal database parlance).


Figure 2: One creates a table by naming the columns ("Field Name"), assigning a data type, and specifying the primary key (identified here by the key symbol to the left of "ID"). The convention is to avoid spaces with table names and column names, although most current applications will support them.

In our imaginary database, Figure 2 shows the data type for each column of the table in Figure 1. The first column, ID, is a automatically generated consecutive integer unique for each row in the database. The ScanDate and ScanTime fields use a date/time datatype. Scanner, Series, SliceCount, ReferringMD, and Patient all use "Number". The "Number" datatype expressed here use any real number or just integers - that would be decided in advance by the database designer. The orientation column accepts text entries and treats any entry in this column as text, even numbers. Hence, one could enter numeric digits into this field, but since they would be understood and treated as text an attempt to add them would be seen as an attempt to "add" text and would generate an error message. The Completed column has only two choices, yes or no. This is displayed in the figure as "Yes/No", the MS Access representation of a Boolean datatype.

Data types

The data type mandates what kind of data must be stored in a particular column or field. Examples include whole numbers, integers, currency, plain text, time and date, Boolean (yes or no), or binary objects such as images. [Figure 2] The data type specification acts as a constraint upon the database, disallowing any entry that does not conform to the data type required.

When creating tables, one should always attempt to use the data type most specific and appropriate to the data being stored. While it is possible, for example, to store a date as text (the author has actually seen this done in a commercial radiology application), specifying the data type as “date” is much more practical. If records are retrieved where the date is stored as text, the results can be sorted by date in ascending or descending alphabetical order – most impractical. If the user wishes output to be sorted in a chronological order, the database must know that the field contains a date or time data type.

Common data types are integer, real number, text string (a short sequence of characters), memo (used for storing free text), date-time, Boolean (yes or no), or binary. Some of these categories may be further divided, such as small or large integers, varying sizes of text files, hyperlinks, etc. In some DBMS’s the binary data type can include even large binary files (such as images), known as Binary Large Objects or BLOBs. A special case of the integer data type is the “autonumber”. The autonumber data type allows the database application to assign a sequentially generated unique integer value to each row of the database, valuable in creating a unique identifier for each row (the “key”).

Keys

Each row of a table in a relational database should have its own unique identifier. This could be almost any data type - time, text string, or number - but is typically a unique integer. While most applications will allow one to create tables without specifying a key, keys are essential to create relationships with other tables and to create indexes, as well as other useful tasks. A "compound" key is a key formed out of more than one cell, whereas the combination is unique even if the individual cell values are not. For example, in a table of CT examinations on a single machine a compound key could be created out of individual entries for date and time. While the same date could appear more than once, or the same time, the combination of date and time would be unique for a particular scan. In our running example, the ID column in Figure 1 is the key to the CT scan table, and can be identified as such by the small key icon next to the definition of ID in Figure 2.

Relationships

A "relationship" is a link created by the developer between two or more tables to connect related elements of data between those tables. In our CT scan database, we would create additional tables for patient demographics, physician information, examination details, and reports. In Figure 3, we have added tables to contain demographic information about referring physicians (RefMD) and Patients. Once created, there must be some way to cross-reference these tables to match appropriate physician and patient information with a study and a report. The tables are cross-referenced through the use of keys and foreign keys. In Figure 3, the keys for tables "RefMD" and "Patients" are highlighted in bold text. The same value for the key is then carried over into the CT_Scan table as a foreign key. Current DBMS applications make it quite simple to define relationships between tables using a graphical interface with "drag and click" capabilities.


Figure 3: A relationship has been created between the CT_Scan table and the referring MD (RefMD) table, and the Patients and CT_Scan tables. The relationships are indicated by the presence of lines connecting the tables. Both are many-to-one relationships, where each referring MD or Patient appears only once its respective table but can be linked to many rows in the CT_Scan table. The ReferringMD column in "CT_Scan" is a foreign key linking to DoctorID, the primary key of the RefMD table.

Normalization and integrity

Normalization of databases is a complex and confusing subject that fortunately can be reduced to a simple principle: try to keep the tables small and separate the dataset into its smallest logical units. The idea is to avoid redundancy within the table. For example, a database of CT examinations could contain information on scanning times, scanning parameters, patient demographics, and referring physician information. It is possible to create one massive table with a single row for every CT series, and within that row store all of the scan parameters along with the date and time, patient demographics, and referring physician’s name and telephone number. The problem with this approach becomes evident if the referring physician changes his phone number. The number must be corrected in many rows throughout the table.

Normalization of the database is simply the process of splitting off the logical element of referring physician information into a separate table and creating a link to the original table. A single record (row) of the new table would contain his name, address, telephone number, and other relevant information along with a unique identifier (the key). An identifier for the referring MD table would be stored within a corresponding column in the CT Scan table as a “foreign key” and the two tables linked, key to foreign key. Now, if the physician changes his telephone number the correction need be made in only a single record in the physician table and the correction is valid for any query against the database. Normalization is done primarily to ensure data integrity, in this case the absence of contradictory data (two conflicting phone numbers without certainty which is correct). It has the secondary benefit of reducing database size through the elimination of duplicate data, but database performance as measured by query execution time is actually impaired somewhat.

Referential integrity applies to the preservation of relationships between tables. In our example above, we might wish to ensure that 1) no referring physician can be deleted from the physician table if he has referred patients for CT, and 2) every new record in the CT Scan table links to a valid referring physician in the physician table. By default the database will permit either event without a hiccup. To prevent such occurrences, the developer should specify that referential integrity be enforced at the time the relationship between the two tables is created. [Figure 4] It may be extremely difficult to impose this constraint after the tables have been populated with data, due to the possibility that violations of referential integrity already exist.


Figure 4: Specifying referential integrity in the creation of a relationship. This dialogue appears by right clicking on the bar connecting the Patients and CT_Scan tables and selecting "Edit".

Getting the data out

Obviously there is no data to be retrieved until the database has been populated, but we will begin with data retrieval because the concepts are more fundamental to understanding database operations.

Viewing tables

The most direct way to view the data is to open a table within the DBMS, as seen in figure 1. If foreign keys are used to reference data in other tables, this approach may not be useful. Newer versions of consumer DBMSs, such as Access, provide some friendlier workarounds to this problem [Figure 5], but except for the simplest of databases, a query is the recommended way to access the data. Queries are used to filter records according to particular criteria and to choose which columns should be returned.


Figure 5: In the Referring MD table, clicking the "+" sign to the left opens corresponding linked entries in the CT scan table. This would be impractical if the physician has referred a large number of cases.

Queries

The word "query" means to ask or inquire (verb) or a question or inquiry (noun). "Query" as used in database terminology is a misnomer, because it is used to "tell" as well as to ask. Database queries are used to retrieve data ("ask") but also to modify or add data, modify existing tables, or create new tables - in short, about anything one would want to do with the data.

SQL

The Structured Query Language (SQL) has achieved almost universal acceptance as the language of choice for creating queries upon relational databases. Its popularity, however, has not been accompanied by consistency or standardization. SQL has come through several versions, and various applications require specific dialects of SQL that are incompatible with one another. While the basic syntax is constant, various versions differ in their handling of variables, text strings, wildcards, or formulae, for example.

SQL uses Standard English text without the peculiar syntax seen in advanced programming languages. The elemental SQL query is the "SELECT...FROM...WHERE" statement used to retrieve data from a database:

SELECT [columns.tables]
FROM [tables]
WHERE [criteria];

Say, for example, we wanted to retrieve all entries for scan date and time, scanner, patient ID, and physician ID from the CT_Scan table for January 1, 2002. Within our CT scan database, the SQL query would be written as:

SELECT ScanDate, ScanTime, Scanner, ReferringMD, Patient
FROM CT_Scan
WHERE ScanDate=#1/1/2002#;

In the example above, ScanDate, ScanTime, Scanner, ReferringMD, and Patient are all columns within the CT_Scan table. The ## delimiters are specific to this version of SQL and are used to contain date values. Text strings are most often delimited by double quotation marks. Running the above query in Access gives the following result:


Figure 6: Output of simple query

As you can see, the output is of limited value. What we really would like is the name of the patient and physician, but because of normalization these appear in different tables linked by their key to a foreign key in the CT_Scan table. In SQL parlance, the “INNER JOIN” is used to merge tables by matching key to foreign key. The syntax is:

[Table1] INNER JOIN [Table2] ON [Table1.foreignKey]=[Table2.Key]

Where [Table1] and [Table2] are the names of the corresponding tables, and Key and foreignKey are the names of the columns containing the related keys.

The SQL query to include proper names for our CT Scan query becomes:

SELECT CT_Scan.ScanDate, CT_Scan.ScanTime, CT_Scan.Scanner, Patients.Name AS Patients_Name, RefMD.Name AS RefMD_Name
FROM RefMD INNER JOIN (Patients INNER JOIN CT_Scan ON Patients.PatientID = CT_Scan.Patient) ON RefMD.DoctorID = CT_Scan.ReferringMD
WHERE CT_Scan.ScanDate=#1/1/2002#;

Running this query yields the following output:


Figure 7: Output of query using joined tables

This is obviously more useful, but typing out an SQL query is a tedious and error-prone process requiring much more advanced facility with the language. Fortunately, consumer oriented applications take over the work of writing SQL for us through the use of "wizards" or similarly named tools.

SELECT *
FROM CT_Scan
WHERE ScanDate=#1/1/2002#;

To return all records, simply delete the "WHERE" clause eliminating all constraints. The following would return all records from the CT_Scan table:

SELECT *
FROM CT_Scan;

This query generates the output in Figure 1, showing all columns and all rows for the CT_Scan table.

Once a database becomes larger or more complex, the user probably would not want to see query results displayed on a grid. This is especially true if the user is someone other than the developer of the database. Most consumer software using a back-end database will display query results in some type of formatted report.

Adding and modifying data

There are numerous means to populating a database with data. Each is useful in various situations.

Editing within tables

From the computer user's viewpoint, the easiest way to add new records or change existing records is to open the table as a grid and fill in the cells manually. The simplicity lies in the fact that only one or two mouse clicks may required to open a table and begin typing in data. With a more complicated database containing relationships, this can be quite problematic. If the table contains foreign keys linking to other tables, it may be tedious to keep track of the various keys from other tables that must be manually entered. [With Access there is a solution to this. Fields in linked tables can be added as a "lookup" field in the table being edited. Clicking in a column for a new record would bring up a list of choices limited to valid entries in the related table.] Manually entering data into the table may be extremely inefficient if a large amount of data must be entered from another source.

This option is mostly for the developer alone because there is potential for great mischief, even if unintentional. It takes only seconds to delete many rows of data or entire columns, and clicking "Undo" won't bring them back. If the database is to be used by others, the developer probably would not want anyone else to have such control over the database. For example, one might want others to be able to add data but not delete it. To prevent this, users must not have direct access to the tables, and specific privileges should be specified and enforced.

Editing with SQL

As we mentioned, the database "query" is a misnomer because the query is also used to modify or append data and even modify the database architecture itself. The Structured Query Language offers all the necessary tools for creating queries to modify or add as well as retrieve data from a database. An example of an SQL statement to add a record to the referring MD table follows:

INSERT INTO RefMD (Name, Telephone, Address, City, State, Zip, Pager) VALUES ("Roberts", 5555678, "1000 East Ave", "Indianapolis", "IN", 46202, 5558945);

Manually creating such an SQL query is tedious and the punctuation and syntax is very particular. It gets worse the more columns there are, because every single column must be referenced in the VALUES line, separated by a comma, even if only null values are to be inserted.

SQL statements can be quite useful, however, for performing batch operations to insert multiple records in a single operation. With most DBMS's, this too can be automated in some cases. The "Update" query is used to modify values in existing records, and the "Append" query is used to insert records from one table into another. While these operations can be done with SQL, it is much simpler to use the DBMS to generate these queries for you. [Figure 8]


Figure 8: Using Access to create other types of queries.

Using Forms

The most common method for entering data into a database is through a graphical form, as in Figure 9. Such forms can be built manually, which is very tedious, or through an automated process (“wizard”) common in consumer DBMS applications like Access. Many commercial applications in Radiology and other fields are essentially just a database hiding behind a (more or less) well-designed form serving as the user interface. (This is not to minimize the complexity of these applications, as the databases may be very complex with elaborate rules and relationships governing how the data must be managed). Typically, the same or similar forms are also used for displaying records from the database.


Figure 9: Access-generated form for displaying and adding records to the CT_Scan table. The first record is displayed. The arrow buttons enable navigation between records or addition of a new record.

Starting out on your own

Creating efficient, logically structured databases is within the reach of most computer users and is well worth the effort in terms of time saved when large amounts of information are to be acquired and processed. Consumer applications like Access and Filemaker make the process easier than it has ever been.

The best way to begin a database design is with the emphatically nontechnical pencil and paper. Start by conceptualizing how the data should be divided into groups of related elements. These groups become the basis of tables. Then decide what fields or data elements belong in each table and what data types should be used to store them. The last step would be to sketch out the relationships between tables. At this point one should note where columns should be added to hold foreign keys to related tables. Only after the design makes sense on paper should one open the application and begin actual implementation. Beware that it is sometimes impossible to make certain changes in the database after relationships have been created and data has been entered. [There is rarely a situation, however, where an experienced developer cannot find some workaround]. Finally, never forget the importance of backing up the database frequently on some storage medium (tape, Zip disk, second hard drive, network server) separate from the workstation where it is primarily stored.

February 2002

For further reading:

Hernandez, Michael J. Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design. Addison-Wesley Publishing Co, 1997.

Hernandez, Michael J; Viescas, John L. SQL Queries for Mere Mortals: A Hands-On Guide to Data Manipulation in SQL. Addison-Wesley Publishing Co, 2000.

Jennings, Roger. Special Edition: Using Microsoft Access 2002. Que Publishing, 2002.


Click here to go back to the Main Menu