In this discussion, we look at a particular and very important type of choice in data modeling. In fact, it is so important that we introduce a special convention, subtyping, to allow our E-R diagrams to show several different options at the same time. We will also find subtyping useful for concisely representing rules and constraints, and for managing complexity. Our emphasis in this discussion is on the conceptual modeling phase, and we touch only lightly on logical modeling issues.
Different Levels of Generalization
It is important to recognize that our choice of level of generalization will have a profound effect not only on the database but on the design of the total system. The most obvious effect of generalization is to reduce the number of entity classes and, on the face of it, simplify the model. Sometimes this will translate into a significant reduction in system complexity, through consolidating common program logic. In other cases, the increase in program complexity from combining the logic needed to handle quite different subtypes outweighs the gains. You should be particularly conscious of this second possibility if you are using an algorithm to estimate system size and cost (e.g., in terms of function points). A lower cost estimate, achieved by deliberately reducing the number of entity classes through generalization, may not adequately take into account the associated programming complexity.
Rules versus Stability
To select the most appropriate level of generalization, we start by looking at an important difference between the models: the number and type of business rules (constraints) that each supports.
The models developed by inexperienced modelers often incorporate too many rules in the data structures, primarily because familiar concepts and common business terms may themselves not be sufficiently general. Conversely, once the power of generalization is discovered, there is a tendency to overdo it. Very general models can seem virtually immune to criticism, on the basis that they can accommodate almost anything. This is not brilliant modeling, but an abdication of design in favor of the process modeler, or the user, who will now have to pick up all the business rules missed by the data modeler.
Using Subtypes and Supertypes
It is not surprising that many of the arguments that arise in data modeling are about the appropriate level of generalization, although they are not always recognized as such. We cannot easily resolve such disputes by turning to the rulebook, nor do we want to throw away interesting options too early in the modeling process.
The ability to represent different levels of generalization requires a new diagramming convention, the box-in-box. You should be very wary about overcomplicating diagrams with too many different symbols, but this one literally adds another dimension (generalization/specialization) to our models.
Subtypes and Supertypes as Entity Classes
Much of the confusion that surrounds the proper use of subtypes and supertypes can be cleared with a simple rule: subtypes and supertypes are entity classes.
- We use the same diagramming convention (the box with rounded corners) to represent all entity classes, whether or not they are subtypes or supertypes of some other entity class(es).
- Subtypes and supertypes must be supported by definitions.
- Subtypes and supertypes can have attributes. Attributes particular to individual subtypes are allocated to those subtypes; common attributes are allocated to the supertype.
- Subtypes and supertypes can participate in relationships. Notice in our family tree model how neatly we have been able to capture our “mother of” and “father of” relationships by tying them to entity classes at the most appropriate level. In fact, this diagram shows most of the sorts of relationships that seem to worry modelers, in particular the relationship between an entity class and its own supertype.
- Subtypes can themselves have subtypes. We need not restrict ourselves to two levels of subtyping. In practice, we tend to represent most concepts at one, two, or three levels of generality, although four or five levels are useful from time to time.
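The points in the list above can be sketched in code. The following is only an illustration, assuming a family tree model in which Person is the supertype and Man and Woman are its subtypes; the attribute names (`birth_date`, `maiden_name`) and the sample people are hypothetical and are not taken from the model itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    """Supertype: attributes and relationships common to all persons."""
    name: str
    birth_date: str                   # hypothetical common attribute
    mother: Optional["Woman"] = None  # "mother of" is tied to the Woman subtype
    father: Optional["Man"] = None    # "father of" is tied to the Man subtype


@dataclass
class Man(Person):
    """Subtype: inherits everything defined on Person."""


@dataclass
class Woman(Person):
    """Subtype: adds an attribute that applies only to this subtype."""
    maiden_name: Optional[str] = None  # hypothetical subtype-specific attribute


# Hypothetical sample instances.
ann = Woman("Ann", "1950-03-01", maiden_name="Lee")
bob = Man("Bob", "1948-07-12")
cat = Woman("Cat", "1975-11-30", mother=ann, father=bob)
```

Note that the relationships are declared once, on the supertype, but each points to a subtype: this is the "relationship between an entity class and its own supertype" pattern mentioned in the list.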
Boxes in Boxes
We can use the “box-in-box” convention for representing subtypes. It is not the only option, but it is compact, widely used, and supported by several popular documentation tools. Virtually all of the alternative conventions, including UML, are based around lines between supertypes and subtypes.
In UML notation, the subtypes are represented by boxes outside rather than inside the supertype box.
Using Tools That Do Not Support Subtyping
Some documentation tools do not provide a separate convention for subtypes at all, and the usual suggestion is that they be shown as one-to-one relationships. This is a pretty poor option, but better than ignoring subtypes altogether. If forced to use it, we suggest you adopt a relationship name, such as “be” or “is,” which is reserved exclusively for subtypes.
An entity class inherits the definition of its supertype. In writing the definition for the subtype, then, our task is to specify what differentiates it from its sibling subtypes (i.e., subtypes at the same level and, if relevant, within the same partition).
Attributes of Supertypes and Subtypes
Sometimes we can add meaning to the model by representing attributes at two or more levels of generalization.
Nonoverlapping and Exhaustive
The subtypes in our family tree model obeyed two important rules:
- They were nonoverlapping: a given person cannot be both a man and a woman.
- They were exhaustive: a given person must be either a man or a woman, nothing else.
In fact, these two rules are necessary in order for each level of generalization to be a valid implementation option in itself. Consider a model in which Trading Partner is subtyped into Buyer and Seller.
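One way to see why both rules matter is to test a proposed set of subtypes against sample data. The sketch below is a hypothetical helper (`check_partition` and the sample identifiers are assumptions made for illustration) that reports instances claimed by more than one subtype and instances covered by none.

```python
def check_partition(supertype_ids, subtype_members):
    """Return (overlapping, uncovered) instance ids for one level of subtyping.

    supertype_ids: set of all instance identifiers of the supertype.
    subtype_members: dict mapping subtype name -> set of instance identifiers.
    """
    seen = set()
    overlapping = set()
    for members in subtype_members.values():
        overlapping |= seen & members  # already claimed by an earlier subtype
        seen |= members
    uncovered = supertype_ids - seen   # belongs to no subtype at all
    return overlapping, uncovered


# Hypothetical data: partner 2 both buys and sells; partner 4 does neither.
partners = {1, 2, 3, 4}
roles = {"Buyer": {1, 2}, "Seller": {2, 3}}

overlapping, uncovered = check_partition(partners, roles)
print(overlapping)  # {2}
print(uncovered)    # {4}
```

With this data, implementing Buyer and Seller as separate tables would record partner 2 twice and partner 4 nowhere, so that level of generalization would not be a valid implementation option on its own.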
Overlapping Subtypes and Roles
Having established a rule that subtypes must not overlap, we are left with the problem of handling certain real-world concepts and constraints that seem to require overlapping subtypes to model. The most common examples are the various roles played by persons and organizations.
Many of the most important terms used in business (Client, Employee, Stockholder, Manager, etc.) describe such roles, and we are likely to encounter at least some of them in almost every data modeling project. The way that we model (and hence implement) these roles can have important implications for an organization’s ability to service its customers, manage risk, and comply with antitrust and privacy legislation. There are several tactics we can use without breaking the “no overlaps” rule.
Ignoring Real-World Overlaps
Sometimes it is possible to model as if certain overlaps did not exist. We have previously distinguished real-world rules (“Every person must have a mother.”) from rules about the data that we need to hold or are able to hold about the real world (“We only know some people’s mothers.”).
Modeling Only the Supertype
One of the most common approaches to modeling the roles of persons and organizations is to use only a single supertype entity class to represent all possible roles. If subtyping is done at all, it is on the basis of some other criterion, such as “legal entity class type” (partnership, company, individual, etc.).
Modeling the Roles as Participation in Relationships
In the supertype-only model described above, roles can often be described in terms of participation in relationships.
If you are not using the Chen notation, then, rather than further complicate relationship notation for the sake of one section of a model, we suggest you document such rules within the definition of the main entity class.
Using Role Entity Classes and One-to-One Relationships
Despite this inelegance in distinguishing relationships from subtypes, the role entity class approach is usually the neatest solution to the problem when there are significant differences in the attributes and relationships applicable to different roles.
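As a rough illustration of the role entity class approach, the sketch below uses Python's built-in `sqlite3` module. The table and column names (`party`, `customer_role`, `supplier_role`, `credit_limit`, `payment_terms`) are hypothetical. Each role table uses the party's key as its own primary key, which is one common way to implement an optional one-to-one relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE party (
    party_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
-- Each role table's primary key is also a foreign key to party, so a party
-- can hold a given role at most once (an optional one-to-one relationship).
CREATE TABLE customer_role (
    party_id     INTEGER PRIMARY KEY REFERENCES party(party_id),
    credit_limit INTEGER NOT NULL
);
CREATE TABLE supplier_role (
    party_id      INTEGER PRIMARY KEY REFERENCES party(party_id),
    payment_terms TEXT NOT NULL
);
""")

# The same party plays both roles without any overlap between subtypes.
conn.execute("INSERT INTO party VALUES (1, 'Acme Pty Ltd')")
conn.execute("INSERT INTO customer_role VALUES (1, 5000)")
conn.execute("INSERT INTO supplier_role VALUES (1, '30 days')")

roles = conn.execute("""
    SELECT p.name, c.credit_limit, s.payment_terms
    FROM party p
    LEFT JOIN customer_role c ON c.party_id = p.party_id
    LEFT JOIN supplier_role s ON s.party_id = p.party_id
""").fetchall()
print(roles)  # [('Acme Pty Ltd', 5000, '30 days')]
```

Role-specific attributes live in the role tables, while everything common to the party is held once.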
Several CASE tools support a partial solution to overlapping subtypes by allowing multiple breakdowns (partitions) into complete, nonoverlapping subtypes.
The multiple partition facility is less helpful in handling the roles problem, as we can end up with a less-than-elegant partitioning.
Hierarchy of Subtypes
Each subtype can have only one immediate supertype (in a hierarchy, everybody has one immediate boss only, except the person at the top who has none). This follows from the “no overlap” requirement, as two supertypes that contained a common subtype would overlap.
Few conventions or tools support multiple supertypes for an entity class, possibly because they introduce the sophistication of “multiple inheritance,” whereby a subtype inherits attributes and relationships directly from two or more supertypes.
Benefits of Using Subtypes and Supertypes
Each level in each subtype hierarchy represents a particular option for implementing the business concepts embraced by the highest-level supertype. But subtypes and supertypes offer benefits not only in presenting options, but in supporting creativity and handling complexity as well.
Our use of subtypes in the creative process has been a bit passive so far. We have assumed that two or more alternative models have already been designed, and we have used subtypes to compare them on the same diagram. This is a very useful technique when different modelers have been working on the same problem and (as almost always happens) produced different models.
Presentation: Level of Detail
Subtypes and supertypes provide a mechanism for presenting data models at different levels of detail. This ability can make a huge difference to our ability to communicate and verify a complex model. If you are familiar with process modeling techniques, you will know the value of leveled data flow diagrams in communicating first the “big picture,” then the detail as required.
Documentation tools that can display and/or print multiple views of the same model by selective removal of entity classes and/or relationships are useful in this sort of activity.
Communication is not only a matter of dealing with complexity. Terminology is also frequently a problem. A vehicles manager may be interested in trucks, but the accountant’s interest is in assets. Our subtyping convention allows Truck to be represented as a subtype of Asset, so both terms appear on the model, and their relationship is clear.
When using subtypes and supertypes to help communicate a model, we need have no intention of implementing them as tables; communication is a sound enough reason in itself for including them.
Input to the Design of Views
Looking at it from the other direction, using subtypes and supertypes to capture different perspectives on data gives us valuable input to the specification of useful views and encourages rigor in their definition.
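A minimal sketch of this idea, assuming the supertype is implemented as a single table and a subtype is exposed as a view over it. The column names (`asset_type`, `book_value`, `load_limit`) and the sample rows are hypothetical, and the example uses SQLite through Python's standard `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE asset (
    asset_id   INTEGER PRIMARY KEY,
    asset_type TEXT NOT NULL,     -- discriminator, e.g. 'TRUCK' or 'BUILDING'
    book_value INTEGER NOT NULL,  -- supertype attribute (accountant's interest)
    load_limit INTEGER            -- applies to trucks only
);
-- The subtype becomes a view: the vehicles manager sees trucks,
-- while the accountant continues to work with all assets.
CREATE VIEW truck AS
    SELECT asset_id, book_value, load_limit
    FROM asset
    WHERE asset_type = 'TRUCK';
INSERT INTO asset VALUES (1, 'TRUCK', 80000, 12);
INSERT INTO asset VALUES (2, 'BUILDING', 900000, NULL);
""")

trucks = conn.execute("SELECT asset_id, load_limit FROM truck").fetchall()
total = conn.execute("SELECT SUM(book_value) FROM asset").fetchone()[0]
print(trucks)  # [(1, 12)]
print(total)   # 980000
```

The subtype definition supplies the view's selection criterion, which is the rigor the subtyping exercise contributes to view design.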
Classifying Common Patterns
We can also use supertypes to help us classify and recognize common patterns.
Divide and Conquer
The structured approach to modeling gives us the ability to attack a model from the top down, the middle out, or the bottom up.
From a creative modeling perspective, a top-down approach based on specialization allows us to put in place a set of key concepts at the supertype level and to fit the rest of our results into this framework. There is a good analogy with architecture here: the basic shape of the building determines how other needs will be accommodated.
When Do We Stop Supertyping and Subtyping?
No single rule tells us when to stop subtyping because we use subtypes for several different purposes. We may, for example, show subtypes that we have no intention of implementing as tables, in order to better explain the model. Instead, there are several guidelines. In practice, you will find that they seldom conflict. When in doubt, include the extra level(s).
Differences in Identifiers
If an entity class can be subtyped into entity classes whose instances are identified by different attributes, show the subtypes.
Different Attribute Groups
If an entity class can be subtyped into entity classes that have different attributes, consider showing the subtypes.
If an entity class can be divided into subtypes such that one subtype may participate in a relationship while the other never participates, show the subtype.
If some instances of an entity class participate in important processes, while others do not, consider subtyping. Conversely, entity classes that participate in the same process are candidates for supertyping.
Migration from One Subtype to Another
If we were to implement a database based on such unstable subtypes, we would need to transfer data from table to table each time the status changed. This would complicate processing and make it difficult to keep track of entity instances over time.
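The difference can be sketched with two hypothetical designs for the same data: one table per unstable subtype, and one supertype table with the status held as an attribute. The table names (`applicant`, `employee`, `person`) and the status values are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Option A: one table per (unstable) subtype.
CREATE TABLE applicant (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE employee  (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Option B: one supertype table with the status held as an attribute.
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    status    TEXT NOT NULL
);
INSERT INTO applicant VALUES (1, 'Kim');
INSERT INTO person    VALUES (1, 'Kim', 'APPLICANT');
""")

# Option A: a status change means copying the row and deleting the original.
with conn:
    conn.execute(
        "INSERT INTO employee SELECT person_id, name FROM applicant "
        "WHERE person_id = 1"
    )
    conn.execute("DELETE FROM applicant WHERE person_id = 1")

# Option B: the same change is a single update, and the row stays in place.
with conn:
    conn.execute("UPDATE person SET status = 'EMPLOYEE' WHERE person_id = 1")
```

In option A, any foreign keys pointing at the applicant row would also have to be redirected, which is part of the difficulty in tracking instances over time.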
Sometimes it is useful to show only two or three illustrative subtypes. To avoid breaking the completeness rule, we then need to add a “miscellaneous” entity class.
Capturing Meaning and Rules
We are often given information that can conveniently be represented in the conceptual data model, even though we would not plan to include it in the final (single-level) logical model. For example, the business specialist might tell us, “Only management staff may take out staff loans.”
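To show how a subtype can carry such a rule, here is a small sketch in which the loan relationship is tied to a Manager subtype rather than to the staff supertype. The class names and the run-time check are hypothetical; in a single-level logical model the same rule would have to be enforced elsewhere, for example in program logic.

```python
from dataclasses import dataclass


@dataclass
class StaffMember:
    """Supertype for all staff."""
    name: str


@dataclass
class Manager(StaffMember):
    """Subtype: management staff."""


@dataclass
class NonManager(StaffMember):
    """Subtype: all other staff."""


@dataclass
class StaffLoan:
    borrower: Manager  # the relationship is attached to the subtype
    amount: int

    def __post_init__(self):
        # Type hints are not enforced at run time, so check explicitly.
        if not isinstance(self.borrower, Manager):
            raise TypeError("only management staff may take out staff loans")


loan = StaffLoan(Manager("Mia"), 1000)  # accepted
try:
    StaffLoan(NonManager("Ned"), 1000)  # violates the business rule
    rule_enforced = False
except TypeError:
    rule_enforced = True
```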
Subtypes and supertypes are tools we use in the data modeling process, rather than structures that appear in the logical and physical models, at least as long as our DBMSs are unable to implement them directly.
Generalization of Relationships
As with entity classes, our decision needs to be based on commonality of use, stability, and enforcement of constraints. Are the individual relationships used in a similar way? Can we anticipate further relationships? Are the rules that are enforced by the relationships stable? Let’s look briefly at the main types of relationship generalization.
Generalizing Several One-to-Many Relationships to a Single Many-to-Many Relationship
Bear in mind the option of generalizing only some of the one-to-many relationships and leaving the remainder in place. This may be appropriate if one or two relationships are fundamental to the business, while the others are “extras”.
Generalizing Several One-to-Many Relationships to a Single One-to-Many Relationship
Generalization of several one-to-many relationships to form a single one-to-many relationship is appropriate if the individual one-to-many relationships are mutually exclusive, a more common situation than you might suspect. We can indicate this with an exclusivity arc.
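If the mutually exclusive relationships are kept separate rather than generalized, the exclusivity rule still has to be enforced somewhere. The sketch below shows one possible way to do so with a table-level `CHECK` constraint in SQLite, run through Python's `sqlite3` module. The tables (`person`, `company`, `vehicle`) and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person  (person_id  INTEGER PRIMARY KEY);
CREATE TABLE company (company_id INTEGER PRIMARY KEY);
CREATE TABLE vehicle (
    vehicle_id       INTEGER PRIMARY KEY,
    owner_person_id  INTEGER REFERENCES person(person_id),
    owner_company_id INTEGER REFERENCES company(company_id),
    -- Exclusivity: exactly one of the two relationships must be present.
    CHECK ((owner_person_id IS NULL) <> (owner_company_id IS NULL))
);
INSERT INTO person  VALUES (1);
INSERT INTO company VALUES (10);
INSERT INTO vehicle VALUES (100, 1, NULL);
""")

# A vehicle with both kinds of owner violates the exclusivity rule.
try:
    conn.execute("INSERT INTO vehicle VALUES (101, 1, 10)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

Generalizing the two relationships into a single relationship to a common supertype of Person and Company would remove the need for this constraint, at the cost of introducing the supertype.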
Generalizing One-to-Many and Many-to-Many Relationships
The generalization should be fairly obvious, but you need to recognize that if you include the one-to-many relationships in the generalization, you will lose the rules that only one employee can fill a position or act in a position.
Many texts and papers on data modeling focus on disaggregation, particularly through normalization. Decisions about the level of generalization are often hidden or dismissed as “common sense.” We should be very suspicious of this: before the rules of normalization were formalized, that process too was regarded as just a matter of common sense.
Subtypes and supertypes are used to represent different levels of entity class generalization. They facilitate a top-down approach to the development and presentation of data models and a concise documentation of business rules about data. They support creativity by allowing alternative data models to be explored and compared.
Subtypes and supertypes are not directly implemented by standard relational DBMSs. The logical and physical data models therefore need to be subtype-free. By adopting the convention that subtypes are nonoverlapping and exhaustive, we can ensure that each level of generalization is a valid implementation option.
The convention results in the loss of some representational power, but it is widely used in practice.
Data Modeling Theory and Practice Graeme Simsion