Don’t get wrong! Explained guide to choosing a database design notation for ERD in a while

Eric Gcc
20 min readMar 26, 2019

--

From: Data Modelling with UML

All who have chosen the programming path, and some who choose other fewer fun paths, have gone through the difficulties of designing a database, at least once in a lifetime.

The first time someone showed me an entity-relationship diagram (ERD), I thought it was relatively simple. Rectangles, diamonds, some lines, little text and some rules for using those symbols. Nothing I hadn’t experienced before with flow diagrams.

Then, searching the internet for examples, I realized that there were many ways to draw entity-relationship diagrams, different notations to which I had seen initially.

All this was the first clue, of what I would discover years later: in databases world, nothing is standard; even if they say it is, it’s not — yes SQL, I’m talking about you.

Everything became worse when different people used the same symbols with different purposes. On the one hand, they said a dashed line meant a strong relationship; on the other hand, the same dashed line meant an optional relationship. What was correct?

Both are, but it depends on the notation used.

Notations

So, a modeling notation is a set of graphical elements (and instructions on how to use them) that help us represent the structure or functioning of something.

The first stage in database design is the conceptual design where we use these super fancy applications to draw figures that make up an entity-relationship diagram.

ERD is created using some notation to represent graphically the list of requirements and business rules of whatever we are building.

Currently, there is no standard notation for an ER model. The original notation developed by Peter Chen in 1976 is still very popular. Numerous other alternatives exist, and many extensions were added to Chen’s original one, which arose out of the need to capture more semantic elements (mainly constraints). However, they all serve the same purpose.

An ERD has essential components, which are common to all notations:

  1. Entities are the main item from which information is collected. It can be an object, person, idea or concept of the real world and must have attributes.
  2. Attributes are entities’ properties that provide descriptive information about them. There are two classes of attributes: identifiers and descriptors. Additionally, and depending on the notation, there are different attribute constraints.
  3. Relationships are the link that joins two or more entities. They represent associations in the real world and do not have a physical existence. Their role is essential since they serve several purposes:
  • Describe participation, namely, the possible combinations of the entities part of a relationship. Participation is composed by the optionality (minimum cardinality) and maximum frequency (maximum cardinality), in other words, the minimum and the maximum number of entity occurrences that participates with another entity in a relationship.
  • Contain attributes, which only make sense when linking two or more entities.
  • Indicate dependency and exclusivity constraints.
  • Create hierarchies, in plain English, entities that generalize or specialize.

If you’re an IT looking to broaden your knowledge of database design, this guide takes you through each relevant notation of conceptual database modeling you can find in the industry. This is the primary aim of this entry. The second is that you can look more closely at how these notations compare to each other.

For simplicity, I will use Oracle’s well-known Human Resources database (hrdb) to show an overview of ERD in each format, and the Training Courses database (tcdb) to exemplify the most specialized symbols and constraints.

Chen’s notation

Created by Peter Chen in the late ’70s, it originally only contemplated the concepts of entity, relationship, and attributes. More concepts, such as compound attributes, generalizations, and specializations were added later to improve the notation.

Image 1: Human Resources Database in Chen’s notation.

Entities & Attributes

Rectangles represent entities with a noun inside, which denotes the item being modeled. There are weak entities because of dependency relationships — by existence or identification — with strong entities. Double rectangles represent weak entities.

In the tcdb diagram (Image 2), Edition is a weak entity because if a Course does not exist, Editions cannot exist.

Ovals represent the attributes joined to entities by lines. Several constraints can be indicated on the attributes; they can be: mandatory, optional, unique, multivalued, and derived.

Image 3 illustrates the symbols used to represent entities and the different types of attributes.

Image 2: Training Courses Database in Chen’s notation.

Relationships

Diamonds represent relationships between entities. Their name is usually a verb or a noun phrase. Relationships are n-ary (joining two or more entities) and, unlike other notations, can have attributes (job_history relationship, in hrdb of the Image 1).

Image 3: Chen’s notation (first part), entities and attributes symbols.

Participation

An entity’s participation is specified at the opposite end of the relationship of which it is a part. If an entity participates with a minimum cardinality of 0, it means that its relationship with the other entity is optional (partial), on the contrary, if it is 1, the participation is mandatory (total), meaning that each occurrence of the entity must participate in the relationship.

Taking the tcdb diagram (Image 2) as an example, the relationship between Skilled [Employee] and [Course] Edition can be read as:

Each Skilled [Employee] may give one (0, optional) or more (M) [Course] Edition.

Each [Course] Edition must be given by one (1, mandatory) or more (M) Skilled [Employee].

Dependency

An identifying relationship implies that an entity (weak) depends on a strong one to identify itself. With Chen, we depict this relationship with a double diamond (Image 4).

There is also existence dependency when their attributes completely identify the weak entity occurrences, but their existence makes little sense without the occurrences of the strong entity. Identification dependency implies existence dependency, but not vice versa.

In tcdb (Image 2), there is an existence dependency relationship between Course and Edition, and possibly an identification dependency (in accordance with the weak entity identifier).

Exclusive relationships

Sometimes, there are mutually exclusive relationships (exclusive OR relationships) between one entity and two (or more), where only one relationship can exist at a time. Chen marks exclusion with a dashed line that links all the mutually exclusive relationships.

The relationships take and give in the tcdb diagram (Image 2) are mutually exclusive, since an Employee cannot take and teach a course at the same time.

Specialization & generalization

Specialization and generalization intend to reuse the attributes between the associated entities. Specialization is used to define an entity (child) representing a specific segment of a larger entity. Generalization is exactly the opposite. The generalized entity (parent) represents the common structure of all sub-entities and contains all the common attributes from child entities.

Specialization and generalization relationships obey two constraints: completeness and disjunction. The first specifies whether an occurrence of the parent must belong (or not) to some child entity, and the second indicates whether a parent occurrence can be a member of two or more child entities. Therefore, there are four types: overlapping & partial, overlapping & complete, disjoint and partial, and disjoint & complete (see symbols in Image 4).

Chen’s notation can depict each of these constraints. The tcdb diagram (Image 2) shows a disjoint & complete specialization where an Employee must be in one and only one child entity.

Image 4: Chen’s notation (second part), relationship and participation symbols.

Pros and cons

In practice, Chen’s ERDs are unsuitable for complex data models. All graphic elements consume much space and provide no extra value to the model. If a diagram has entities with many attributes and relationships, it becomes chaotic.

Pros

  • Lots of information: Most books that cover relational database design tend to use this notation.
  • Use basic figures: You can draw an ERD with practically any program that has rectangles, diamonds, ovals, and lines (ehem … word).

Cons

  • Notation takes up much space: All the figures the notation uses cause very large and difficult to read diagrams.
  • No database modeler includes this notation: Therefore, you cannot autogenerate a logical or physical diagram and much less SQL or reverse engineering.
  • Not used in the industry: Because of the previous point, you will hardly find the use of these diagrams in a company.

Diagram makers recommendations

Presented in my order of preference. I’ve tried the first four programs of the list and draw.io is my favorite. It is a free tool that can store your diagrams directly in your cloud. Dia is a tool I used a lot a few years ago, with great results. Ludichart is very similar to number 1, and although it is licensed, the free version has what it takes to make any Chen diagram.

  1. draw.io (free | online)
  2. Day (free | desktop)
  3. Ludichart (freemium, commercial | online)
  4. ERDPlus (free | online)
  5. SmartDraw (commercial | online)
  6. Gliffy (commercial | desktop)
  7. Creately (freemium | desktop)

When to use Chen

Although it is true that we owe a lot to Peter Chen for his work that laid the foundations for the techniques that emerged afterward, I would only recommend using this notation if you are learning or teaching to model databases or by simple general culture.

IE notation

Clive Finkelstein and James Martin developed it in the early 1980s. There are several versions, without a single standard. This notation is undoubtedly the most used today. Supported by practically all database modeling tools, it is personally my favorite.

Image 5: Human Resources Database in IE notation.

Entities & Attributes

In this notation, named rectangles represent entities. In IE the concept of a weak entity does not exist; therefore, there is no symbol to represent it.

We list attributes in a rectangular compartment below the entity’s name. It is possible to specify that an attribute is: mandatory (<M>), an identifier (<pi>, primary identifier), unique (<ai>, alternate identifier) ​​or optional (with no modifier). Image 6 displays a list of symbols for entities and attributes in IE.

Image 6: IE notation (first part), entities and attributes symbols.

Relationships

Unlike Chen’s notation, relationships can only link two entities, i.e., they are only binary. Straight lines represent relationships. At the ends, we indicate the participation, and optionally, their name and the roles that each entity plays (Image 7).

Unlike Chen, IE notation does not allow relationships with attributes or n-ary relationships. To model something similar, we use an intersection entity, such as the JOB_HISTORY entity in the hrdb diagram (Image 5). Mutually exclusive relationships exist neither.

Image 7: Training Courses Database in IE notation.

Participation

The participations adopt the crow’s feet convention created by Gordon Everest. As in Chen’s notation, an entity (A) depicts its cardinality and optionality at the opposite end of the relationship, which is the line end that joins the other entity (B).

There are different ways to indicate participation. PowerDesigner (Image 8) uses a single circle ○ at the opposite end of the relationship to indicate a minimum cardinality of zero and a maximum of 1, meaning an optional or partial relationship.

One single bar | at the opposite end is used to indicate “exactly one” (minimum and maximum cardinality of 1), so the relationship is mandatory or total. With the minimum cardinalities of 0 (circle) or 1 (bar), crow’s feet are used to indicate a maximum frequency of many.

Image 8: IE notation (second part), relationship and participation symbols.

Dependency

As I mentioned before, the IE notation does not include any special symbols for weak entities. Even, identifying and non-identifying relationships are not a feature of the notation; however, some versions depict them.

In PowerDesigner, a triangle pointing to the strong entity (next to the cardinality symbols) indicates an identifying relationship, as shown in the tcdb diagram in the Edition entity (Image 7).

Specialization & generalization

Similar to Chen, IE defines the four subtype constraints of completeness and disjunction. The tcdb diagram in IE notation (Image 7) shows a disjoint & complete specialization where an Employee must be in one and only one child entity. In Image 8 you can see the four existing restrictions in IE.

Pros and cons

IE is a notation that captures the basic constraints used when modeling a database. However, in the absence of a standard, the blocks may vary from tool to tool. So, when using a modeler, check which of the conventions apply.

Pros

  • Widely used: Most of the diagrams you review will use this notation.
  • Easy to read: Crow’s feet notation is well known. The graphic elements are optimal, and if you stick to the convention of designing from left to right and top to bottom, the diagrams are understandable.

Cons

  • There is no standard: Several versions exist and can be confusing because of the different representations.

Modeling tools recommendations

In my opinion, these are the best tools for designing a database. They are packed with many features that make life easy. Unfortunately, the first three are licensed (and very expensive), and the fourth is very focused on Oracle products and has partial support of the notation.

1. PowerDesigner (commercial | desktop)

2. ERWin (commercial | desktop)

3. ER / Studio (commercial | desktop)

4. Oracle SQL Developer (free | desktop)

When to use IE

Use always, as IE is by far the most popular notation. Practically all modelers support the notation (with slight differences). There are many examples and documentation. It is likely that if you find an ERD made by someone else, he or she had used IE.

Barker’s notation

This notation is named after Richard Barker, an Oracle board director, who was commissioned in 1990 for the first efforts to develop the company’s design tool (Oracle Designer). Currently, this tool also supports UML, as an alternative to traditional notation.

Image 9: Human Resources Database in Barker’s notation.

Entities & Attributes

Entities are represented by soft rectangles (rounded corners) with their name in capital letters. A line separates the entity’s name from its attributes. Unlike Chen, it does not represent the attributes in ovals; instead, it lists them under the name of the entity, just as with IE. Some restrictions may precede the name of an attribute.

A # indicates that the attribute is the primary identifier or part of it. An asterisk means a mandatory attribute. On the other hand, a circle ○ before the name indicates an optional attribute (Image 10).

All attributes in this notation are atomic. Therefore, there is no concept of composite attributes, and there are no multivalued attributes either. Like IE, there is no graphical element to represent a weak entity.

Image 10: Barker’s notation (first part), entities and attributes symbols.

Relationships

A relationship is represented by a line linking two entities (binary). At each end of the relationship, we can indicate the role each entity plays. Like IE, we use an intersection entity to model an n-ary relationship or attributes in a relationship (JOB_HISTORY entity in Image 9).

Image 11: Training Courses Database in Barker’s notation.

Participation

The relationship line is divided into two halves, one for each participating entity. Unlike IE, Barker indicates the optionality of an entity in the line’s half on the same side. But just like IE, the maximum frequency of an entity is shown in the opposite segment of the relationship, I mean, in the line’s half that joins the other entity (Image 12).

If an entity’s participation in the relationship is mandatory (total), we must use a solid line half. Otherwise, a dashed line half is used to show an optional (partial) participation.

As for the maximum frequencies, a crow’s foot at the opposite line end indicates many, and the absence of any symbol at the line end indicates one. Therefore, a simple solid line means a 1:1 relationship between two entities.

Image 12: Barker’s notation (second part), relationship and participation symbols.

Dependency

The Barker notation does not have a weak entity concept; therefore, there is no symbol used in Oracle literature to represent it. However, it provides a way to indicate an identifying relationship.

An identifying relationship is shown using a bar | at one end of the relationship, meaning that the entity at that end has an existence dependency with the entity with which it is related. PowerDesigner does not support this notation.

Some people misinterpret the bar notation as if it indicated a cardinality of 1 since this is what the symbol means in other notations such as IE.

Exclusive relationships

Mutually exclusive relationships are indicated by using an arc that crosses all the relationships involved in the exclusion, joining by a point. Not all modelers can depict such relationships; Oracle Designer display them, PowerDesigner does not.

Specialization & generalization

Specialization and generalization are shown in Barker using something similar to the Euler diagrams. There are only exclusive and exhaustive inheritances, that is, of the disjoint and complete type. In tcdb (Image 11), an Employee is only Skilled or Unskilled, but not both at the same time.

Since the child entities are shown inside the parent, this representation is only recommended for simple cases. For reasons of space and readability, we often omit the attributes of child entities.

Pros and cons

Barker is one of the most popular notations and is used throughout the Oracle toolset. Although several modeling applications support additional notations, such as UML; Barker is still one of the favorites, it is more expressive than other alternatives.

Pros

  • Expressive notation: It includes restrictions that other notations do not, and it is still widely used.
  • Few symbols: Use relatively few symbols. There are no weak entities, and a simplified crow’s feet notation is used.

Cons

  • Entity hierarchies: The subtype tends to become unhelpful and messy with hierarchies of various levels of depth.

Modeling tools recommendations

In order of my preference, PowerDesigner is top-notch, is the best modeler ever created, but at a price too high.

  1. PowerDesigner (commercial | desktop)
  2. Oracle SQL Developer (free | desktop)

When to use Barker

Use when working with Oracle-related technologies. Since it is the notation, they adopted for their tools and their training material. As we have seen, it does an excellent job of expressing simple constraints of uniqueness, exclusion, and frequency. However, its subtype system is not the best. I would recommend using this notation if you do not have to model complex hierarchies.

UML Notation

Unifying Modeling Language (UML) appeared in the early 1990s, by which time entity-relationship diagrams had been around for 20 years. Then, with the popularity rise of object-oriented programming languages, UML also became popular. While this modeling language was intended for software design, it has expanded into the database design.

UML notation includes many symbols, with which we can build up to 13 different diagrams to model the behavior and structure of an application. However, it is unnecessary to know everything about UML to design an ERD. The diagram of interest for database designers is the class diagram, which can become quite complex. However, I will explain the elements we use to model data.

Image 13: Human Resources Database in UML notation.

Entities & Attributes

An entity corresponds to a UML class, which are displayed by rectangles with 3 compartments. In the first compartment, we place the name of the entity in CamelCase along with the «Entity» symbol specifying the stereotype of the class.

The second compartment contains the attributes, which can provide information about their visibility, data type, default value, and domain. For conceptual design, only the default value is useful; we can omit other features. Attributes that are primary identifiers are expressed with the stereotype «id» or with the modifier {id}, depending on the modeler. The stereotype «ai» (alternate identifier) before the name of the attribute means uniqueness (Image 13).

In UML the attributes are mandatory and single-value by default, this can be explicitly indicated by including [1] to the attribute in question. An optional attribute is marked by adding [0…1] after its name; this is known as a multiplicity constraint.

We also express multivalued attributes with a multiplicity constraint, which shows the minimum and the maximum number of values an attribute can contain. Thus the restriction [0..*] indicates that it can have from 0 (optional) to an indeterminate number of values; and [1…3] says it can contain between 1 (mandatory) and 3 values.

UML points with a / before the name to the derived attributes, i.e., attributes calculated from others. Sometimes the value of an attribute derives from other attributes of the same entity. However, in other cases, the value is calculated from the attributes of other entities.

The third compartment of the class is reserved for its behavior. Since an entity has no behavior, it remains empty.

Image 14: UML notation (first part), entities and attributes symbols.

Relationships

Relationships are illustrated by continuous lines between classes (associations). They can have a name and specify the role each entity plays in the relationship.

In the previous section, I talked about how UML represents multiplicity restrictions in attributes. We use a similar notation in relationships to describe the optionality and maximum frequency of an entity on the opposite line end (like IE): 0.. means that the relationship is optional; 1.. indicates that the relationship is mandatory; ..1 means that an occurrence of the first entity can be related with only one occurrence of the second entity; ..* indicates that an occurrence of one entity can be associated with many of the other (Image 16).

Like Chen, relationships can have attributes, and there are n-ary relationships. The attributes in the relationship are shown using an association class, which is a rectangle containing the attributes and is attached to the relationship by a dotted line, as seen in the JobHistory class of Image 13.

Ternary and major relationships are represented as a diamond (similar to Chen) with the association lines connecting to it.

Image 15: Training Courses Database in UML notation.

Dependency

An identifying relationship is expressed in UML using the stereotype «Identifying» in a composition (Image 15). A composition is an association that implies that a child entity cannot exist without its parent entity. It is represented using a line with a filled diamond next to the parent (strong) entity.

In UML there is no notion of weak or strong entity, so there is no particular way to represent entities participating in a dependent relationship.

Image 16: UML notation (second part), relationship and participation symbols.

Exclusive relationships

UML contemplates an exclusive-or constraint. Considering a relationship between several classes, the restriction serves to indicate that each instance of a class participates exactly with an entity of a set of alternatives.

To illustrate the constraint, a {xor} is placed in the middle of the dotted line connecting the associations involved. A constraint of this type can be seen in Image 15.

Specialization & generalization

This feature of an ERD is modeled using the UML concept of inheritance. Where a superclass (parent) can have several subclasses (child) that inherit its attributes. In UML, we indicate an inheritance relationship using an empty arrowhead on the parent side.

Since generalization is not a relationship between two entities, but a derivation from a general entity to a more specific one, multiplicity (participation) is not indicated. The Skilled and Unskilled at Image 15 are child classes of Employee.

Pros and cons

UML is a language widely accepted by analysts and software developers but not taken into account by data modelers. Designers have focused on using this language primarily for the design of object-oriented applications but not for the design of databases; although because of its popularity, it could become relevant someday, as there are already entire books on how to model databases with this language.

Pros

  • Widely used: UML exists, and thanks to its advantages and despite its disadvantages, it is quite common in the industry.
  • It’s a standard: Unlike most ERD notations, UML is an ISO recognized standard since 2005, so you won’t find all the variants of the other notations.
  • The suite of notations: The UML class diagram has advantages over other notations as it can pick up much more semantics from the requirements.
  • Modelers: There is a wide range of open source and licensing software options designed for this notation. Some database modelers also support UML.

Cons

  • Environment: Although it’s used to design databases, UML has not been as successful as other ERD modeling approaches.
  • The suite of notations: Includes much more than is required for an entity-relationship model, but with patience, you can “clean up” and use an appropriate subset of the notation.
  • Forward & reverse engineering: Since UML is not popular in database design, there are few tools to convert a class diagram to SQL (or vice versa) with no great results.

Modeling tools recommendations

Listed as I prefer them: (1) is the best option, has implemented all the UML standard and there is a Community Version. (2) Is also a great option, from my point of view is the tool with the best UX I have tested, however, does not implement n-ary relationships or xor constraints. (3) Is a legacy project, I don’t like the old-fashioned java interface, and the workflow is clumsy. However, it is a free open source project, so I understand that it is not the most luxurious tool. (4) Is a great tool, but it is commercial, has a 30-day trial, and its price is much more affordable than most professional modelers who implement notations such as IE or Barker.

  1. SoftwareIdeas (freemium | desktop)
  2. StarUML (freemium | desktop)
  3. ArgoUML (free | desktop)
  4. EnterpriseArchitecht (commercial | desktop)

When to use UML

Some recommend using UML when the entire system is intended to be object-oriented (which is practically any project at present). Many companies follow the UML standard when modeling their projects. If this is your case, this is the notation you should choose. I also recommend that if you or your team are already familiar with UML, do not think twice and continue on that path. There’s no point fighting with a new notation when you can use that time to create your dreams database right now!

IDEF1X Notation

This format should not appear in a comparison of database notations for conceptual design, and I’ll explain why. IDEF1X is the standard used in many branches of the U.S. government. It is a hybrid notation because unlike the previous ones. It mixes concepts of the conceptual model with constructions of relational databases (e.g., foreign keys).

The notation of the relationships has many variations to show the participation of the entities. For example, there are four different ways to specify a mandatory one-to-one relationship and four different ways to indicate an optional one-to-one relationship. In reality, all these variants have no meaning in conceptual design, but they become relevant in physical design.

The IDEF website mentions that this notation “is most useful for logical database design after the information requirements are known and the decision to implement a relational database has been made.”

The IDEF1X symbols are closely linked to the final implementation of the tables of a relational database and not so much to the semantic collection. This notation is fundamentally a physical design modeling technique and is not appropriate for the conceptual design.

Conclusion

Today, there are many notations for ERD (many more than I mentioned). The most commonly used are IE and Barker. In one way or another, these notations are supported by many modelers, even when no standards exist, and are preferred over UML.

The choice of a notation depends on several factors. I base the recommendations I give on my own meandering experience. But don’t panic, just remember, “we don’t make mistakes, just happy little accidents.”

In brief:

  • Use Chen just for fun, in teaching-learning situations, or even when hand-drawing an ERD.
  • IE is the de facto standard in the industry, so choosing it will always be a good idea, finally “nobody gets fired for using IE.” The only difficulty will be finding a modeling tool that fits your budget.
  • Barker is the best choice when working with Oracle technologies, and its modeler is free. But run away if you need to model complex hierarchies.
  • Finally, if none of the above options suit your needs, then UML is your way. It’s a standard language known to developers. There are more affordable modeler options, and some are freemium.

One of the most critical decision-making issues may be the software that will be used for modeling. However, although I don’t recommend it, any of the notations can be drawn by hand; all you need is a pencil, a rubber and tons of paper.

Regardless of choice, the most important thing is to know the meaning of each element of the notation, because the goal of modeling information is to create a diagram understandable by everyone and that expresses the business rules.

Happy modeling!

--

--

Eric Gcc

I’m a Computer Engineer focused in data modeling and analysis. I show up to share what I know and what I’ve learned in hopes that it will propel people forward.