CEM: Coarsened Exact Matching Explained

Alan Huffman
Apr 19, 2017

“This program is designed to improve the estimation of causal effects via an extremely powerful method of matching that is widely applicable and exceptionally easy to understand and use (if you understand how to draw a histogram, you will understand this method).” [Authors: Stefano Iacus, Gary King, Giuseppe Porro] http://gking.harvard.edu/cem

Simply put, CEM allows 2 populations (treatment & control) to be matched to determine the existence / extent of a treatment effect.

The 2 cohorts have a different total number of members and also a different distribution of types of members. In order to match members across the cohorts, a representative set of properties must be chosen to identify the different archetypes of patients. Members that are alike with respect to the measured outcome must be matched across the 2 cohorts.

Choosing the right representation of properties to match on is the most difficult and crucial part of matching, in my opinion.

Examples: Gender, age, having cancer, having chronic obstructive pulmonary disease (COPD), having congestive heart failure (CHF), etc. may be the salient properties for a clinical study. Meaning: cancer patients may be lower cost than patients with COPD + CHF, so matching on cancer, COPD & CHF is important when the outcome being measured is cost, because costs differ across those morbidity states.

There will be different types (archetypes) of patients, such as:

90+ year old males without cancer but with COPD and CHF.

90+ year old females with cancer but no COPD, CHF.

Almost all combinations of properties may be present in one or both cohorts if there are sufficient members, but each added matching property multiplies the number of possible archetypes and will often reduce the number of treatment members that find a control match.

Members in the two populations will fall into comparable archetypes (bin signatures), but the difference in the distribution of archetypes between the populations must be normalized. Some archetypes will only be represented in 1 population and thus cannot be matched, as there is no matching member in the other. Therefore, matching must determine which members can be matched across the 2 cohorts and then deal with the different distribution of archetypes in each population.

CEM deals with archetypes and distribution by coarsening properties into BINS based on a binning strategy.

The brilliance of the idea is in its simplicity.

A member is represented by properties coarsened to discrete values using a coarsening or binning strategy. Thus each member is given a “BIN Signature” that will be used to Exactly Match other members with the same “BIN Signature”. This is where the “E” in CEM comes from.

Example: Age can be coarsened into any number of bins using a “binning strategy”. Splitting at the age of 50, for example, a 45 year old is below 50 and is represented by 0, while a 55 year old is over 50 and is represented by 1.

45 (less than 50) is 0.

55 (greater than 50) is 1.

Or if that binning strategy is considered too coarse, age could be converted into a quartile across the treatment+control population, i.e. 1, 2, 3, or 4. Or the age bin could be assigned based on ranges.

age between 0 to 30 = 0

age between 31 to 50 = 1

age between 51 to 60 = 2

age between 61 to 70 = 3

etc…
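The two age strategies above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the top bin (everything above 70) is an assumption because the list of ranges trails off, and the binary split treats exactly 50 as "1" (an assumption that matches the ≥ 50 / < 50 wording used later in the article).

```python
from bisect import bisect_left

# Inclusive upper edges of the example ranges: 0-30, 31-50, 51-60, 61-70.
# Ages above 70 land in bin 4; that final bin is an assumption.
AGE_UPPER_EDGES = [30, 50, 60, 70]


def age_range_bin(age: int) -> int:
    """Return the index of the example range that contains `age`."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return bisect_left(AGE_UPPER_EDGES, age)


def over_50_bin(age: int) -> int:
    """Binary strategy: 0 below 50, 1 at 50 or above (assumed boundary)."""
    return 1 if age >= 50 else 0


if __name__ == "__main__":
    for age in (45, 55, 31, 70, 92):
        print(age, over_50_bin(age), age_range_bin(age))
```

A finer strategy produces more bins and tighter matches, at the price of fewer members sharing each bin.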

Each matched member must have the exact same BIN values as the member it is matched with in the other cohort. A member with an Age Bin value of 1 can only match a member in the other cohort with the exact same value of 1. If Gender is also a bin w/ values 0 (female), 1 (male) or 2 (other), then the matched member must also exactly match on that value.

Thus, a member with a BIN signature of:

Age (2), Gender (1), CHF (0), COPD (1), Cancer (1) would only match a member in the other cohort with the exact same 2–1–0–1–1 bin signature.
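As a rough sketch of how a BIN signature could be built and compared, the Python below joins the already-coarsened values into a single string. The `Member` type and `bin_signature` function are hypothetical names for illustration, and the field order (Age, Gender, CHF, COPD, Cancer) follows the example above.

```python
from typing import NamedTuple


class Member(NamedTuple):
    """Coarsened properties of one member (hypothetical structure)."""
    age_bin: int
    gender: int
    chf: int
    copd: int
    cancer: int


def bin_signature(member: Member) -> str:
    """Join the bin values in a fixed order, e.g. '2-1-0-1-1'."""
    return "-".join(str(value) for value in member)


treated = Member(age_bin=2, gender=1, chf=0, copd=1, cancer=1)
control_same = Member(age_bin=2, gender=1, chf=0, copd=1, cancer=1)
control_diff = Member(age_bin=2, gender=1, chf=0, copd=1, cancer=0)

if __name__ == "__main__":
    print(bin_signature(treated))
    # Exact match requires every bin to agree.
    print(bin_signature(treated) == bin_signature(control_same))
    print(bin_signature(treated) == bin_signature(control_diff))
```

Because the match is on the whole signature, a single differing bin (cancer in the last member) is enough to prevent a match.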

Binning strategy is the 2nd most difficult and crucial part of matching in CEM.

With properties and binning strategies chosen, each member is given a BIN Signature that is matched with members in the other cohort. Some % of treatment will be matched, but not necessarily all, as a treatment member may have a BIN signature that no control member has. The set of possible BIN signatures is the Cartesian product of the bin values, or in SQL (example):

-- One CTE per matched property, listing every bin value it can take.
with age as ( select * from ( values (1), (2), (3), (4) ) x(age) ),
gender as ( select * from ( values (1), (2), (3) ) x(gender) ),
-- Presence flags: 0 = absent, 1 = present.
copd as ( select * from ( values (0), (1) ) x(bit) ),
chf as ( select * from ( values (0), (1) ) x(bit) ),
cancer as ( select * from ( values (0), (1) ) x(bit) )
select
    age,
    gender,
    copd.bit as copd,
    chf.bit as chf,
    cancer.bit as cancer
from
    age
    cross join gender
    cross join copd
    cross join chf
    cross join cancer

This overly simple example has the following coarsening / binning strategy:

Age = quartiles

Gender = 3 values: 1-female, 2-male, 3-other

COPD, CHF, Cancer = 1-present, 0-negative

Resulting in 96 potential BIN signatures (4 x 3 x 2 x 2 x 2).
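The same count can be checked outside the database. The following Python sketch enumerates the Cartesian product with `itertools.product`, using the bin values described for this example (0/1 for the presence flags is taken from the strategy list above).

```python
from itertools import product

AGE_QUARTILES = [1, 2, 3, 4]
GENDERS = [1, 2, 3]        # 1-female, 2-male, 3-other
PRESENCE_FLAG = [0, 1]     # used for COPD, CHF and Cancer

# Every possible (age, gender, copd, chf, cancer) combination.
signatures = list(
    product(AGE_QUARTILES, GENDERS, PRESENCE_FLAG, PRESENCE_FLAG, PRESENCE_FLAG)
)

if __name__ == "__main__":
    print(len(signatures))  # 96
```

Each extra property multiplies this number, which is why adding matching properties tends to leave more members without a match.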

CEM Weights

With every member having a BIN Signature, each is matched to other members with that same BIN Signature. There will likely be an imbalance between the number of Treatment and Control members with a BIN Signature. This variance in distribution must be normalized using CEM Weights.

Example: Let’s assume only 3 BIN signatures matched

(2–1–0–1–1) matched 2 Treatment and 10 Control.

(4–2–1–0–0) matched 4 Treatment and 2 Control

(3–1–1–1–1) matched 6 Treatment and 6 Control

There are also Treatment members that do not match any Control members, and vice versa Control Members that do not match any treatment members.

Weights:

All unmatched members get a weight of 0 (zero) and thus are effectively thrown out.

Matched treatment members get a weight of 1 (one).

Matched control members get weights above 0, which may be fractional or ≥ 1, that normalize each BIN signature (archetype) to its distribution within the treatment group. The formula is:

Weight = (BIN_Matched_Treatment_N_Members / BIN_Matched_Control_N_Members) * (Total_Matched_Control_# / Total_Matched_Treatment_#)

aka..

Weight = (Treatment_N / Control_N) * (Total_Control_N / Total_Treatment_N)

Example: (from above)

Total_Matched_Treatment_# = 2 + 4 + 6 = 12

Total_Matched_Control_# = 10 + 2 + 6 = 18

Unmatched Treatment Weight = 0

Unmatched Control Weight = 0

Matched Treatment Weight = 1

Matched Control Weight (must be calculated for each BIN Signature or Archetype).

(2–1–0–1–1) [2 Treatment and 10 Control] Weight = (2 / 10) * (18 / 12) = 0.3

(4–2–1–0–0) [4 Treatment and 2 Control] Weight = (4 / 2) * (18 / 12) = 3

(3–1–1–1–1) [6 Treatment and 6 Control] Weight = (6 / 6) * (18 / 12) = 1.5
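The arithmetic above can be reproduced with a short Python sketch. The function name `cem_control_weights` and the two unmatched signatures in the sample data are made up for illustration; the three matched signatures and their counts are the ones from the example.

```python
from collections import Counter
from typing import Dict, Iterable


def cem_control_weights(
    treatment_sigs: Iterable[str], control_sigs: Iterable[str]
) -> Dict[str, float]:
    """Return the control weight for every signature found in both cohorts.

    Signatures missing from the result are unmatched (weight 0).
    Matched treatment members always get weight 1.
    """
    t_counts = Counter(treatment_sigs)
    c_counts = Counter(control_sigs)
    matched = t_counts.keys() & c_counts.keys()
    total_t = sum(t_counts[s] for s in matched)
    total_c = sum(c_counts[s] for s in matched)
    return {
        s: (t_counts[s] / c_counts[s]) * (total_c / total_t) for s in matched
    }


# Counts from the example, plus one unmatched signature per cohort
# (the unmatched signatures are hypothetical).
treatment = ["2-1-0-1-1"] * 2 + ["4-2-1-0-0"] * 4 + ["3-1-1-1-1"] * 6 + ["1-1-0-0-0"]
control = ["2-1-0-1-1"] * 10 + ["4-2-1-0-0"] * 2 + ["3-1-1-1-1"] * 6 + ["1-2-0-0-0"] * 3

weights = cem_control_weights(treatment, control)

if __name__ == "__main__":
    for sig in sorted(weights):
        print(sig, round(weights[sig], 2))
```

Rounded to two decimals, the weights come out as 0.3, 3.0 and 1.5, and the unmatched signatures do not appear in the result at all.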

Now the matched & weighted Treatment and Control members can be compared by using the weights to evaluate the presence & extent of the Treatment Effect.

This is simply done by taking the measured outcome (costs, for example) for each member and multiplying it by that member's weight.

Matched Treatment members will each have a cost that is multiplied by 1.

Matched Control members will each have a cost that is multiplied by their CEM weight.

Example:

Notice that all matched Treatment members have a weight of 1, so the average weight across Treatment is also 1.

For Control, some members have lower or higher weights to correct for the difference in archetype distribution between Treatment and Control. Take the 1st BIN signature as an example. Treatment has 2 members with that signature, while Control has 10. For Treatment, 2 of 12 members are of this type, or 16.67% of the entire cohort. For Control, 10 of 18 are of this type, or 55.56%. Multiplying each Control member of this type by 0.3 turns the 10 members into a weighted count of 3, and 3 / 18 = 16.67% (equivalently 55.56% * 0.3), so the weighted Control distribution now matches Treatment for this type.

Same is true of the other BIN Signature Archetypes. That’s the magic of the weights.
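To check that the weights really do line up the two distributions for every signature, here is a small Python sketch using the counts from the example. The function name `archetype_shares` is hypothetical.

```python
from typing import Dict, Tuple

# signature -> (matched treatment count, matched control count)
MATCHED_COUNTS: Dict[str, Tuple[int, int]] = {
    "2-1-0-1-1": (2, 10),
    "4-2-1-0-0": (4, 2),
    "3-1-1-1-1": (6, 6),
}


def archetype_shares(
    counts: Dict[str, Tuple[int, int]]
) -> Dict[str, Tuple[float, float, float]]:
    """Per signature: (treatment share, raw control share, weighted control share)."""
    total_t = sum(t for t, _ in counts.values())
    total_c = sum(c for _, c in counts.values())
    shares = {}
    for sig, (t, c) in counts.items():
        weight = (t / c) * (total_c / total_t)
        shares[sig] = (t / total_t, c / total_c, c * weight / total_c)
    return shares


if __name__ == "__main__":
    for sig, (t_share, raw_c, weighted_c) in archetype_shares(MATCHED_COUNTS).items():
        print(sig, f"{t_share:.2%}", f"{raw_c:.2%}", f"{weighted_c:.2%}")
```

For each signature the weighted control share equals the treatment share, even though the raw control share can be far off.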

Note: when calculating the MEAN for Control, sum the weighted values and divide by the number of matched Control members (18) or, equivalently, by the sum of the weights, which is also 18.
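A weighted comparison could look like the Python sketch below. The cost figures are invented purely to make the example runnable and are not from any real study; only the group sizes and weights come from the example above. The `weighted_mean` helper is a hypothetical name.

```python
from typing import Sequence


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical costs, one per matched member, grouped by BIN signature.
treatment_costs = [90.0] * 2 + [380.0] * 4 + [240.0] * 6
treatment_weights = [1.0] * 12

control_costs = [100.0] * 10 + [400.0] * 2 + [250.0] * 6
control_weights = [0.3] * 10 + [3.0] * 2 + [1.5] * 6

treatment_mean = weighted_mean(treatment_costs, treatment_weights)
control_mean = weighted_mean(control_costs, control_weights)
estimated_effect = treatment_mean - control_mean

if __name__ == "__main__":
    print(round(treatment_mean, 2), round(control_mean, 2), round(estimated_effect, 2))
```

With these made-up numbers the unweighted control mean would be pulled toward the over-represented low-cost archetype; the weights remove that distortion before the two means are compared.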

CEM is simple, but that doesn’t mean it’s easy.

Remember: The 2 most important things (beyond having quality, sufficient data):

  1. Choosing the right variables to match on.
  2. Using an appropriate coarsening / binning strategy for the variables: quartile, n-tile, ranges, binary, etc.

Problems that can be encountered

  1. Matching can be completely off if the wrong variables are chosen. You may end up choosing variables that too tightly match or too loosely match. Remember this is all about trying to find appropriately matched groups given what sort of program / treatment is being tested. You may need to match on gender — or that might be inappropriate, same for age, race etc…

Example: There may be dramatic differences between male / female members; if matching does not consider gender, then the matching may never work. Treatment could be 90% female / 10% male while control is 10% female / 90% male. Without creating a BIN for gender, CEM won’t be able to address this imbalance with weights. This is why choosing the variables to match on is so important.

  2. To a lesser degree, matching can also be off if the right variables are chosen but the coarsening is too loose.

Example: Age could be binned into 1 or 0 depending on whether a member is ≥ 50 years old or < 50 years old. For some studies that might be appropriate, but in a geriatric study almost everyone will be ≥ 50 years old, so this coarsening strategy is inappropriate and too loose.
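One quick way to spot coarsening that is too loose is to count how many distinct bins the population actually occupies. The Python sketch below uses a made-up list of geriatric ages and compares the binary split with a by-decade split (the decade strategy is an illustrative alternative, not one prescribed here).

```python
from collections import Counter

# Hypothetical ages from a geriatric population.
geriatric_ages = [66, 71, 78, 84, 90, 95]

# Binary strategy: 1 if age >= 50, else 0.
binary_bins = Counter(1 if age >= 50 else 0 for age in geriatric_ages)

# Alternative strategy: one bin per decade of age.
decade_bins = Counter(age // 10 for age in geriatric_ages)

if __name__ == "__main__":
    print(dict(binary_bins))
    print(dict(decade_bins))
```

When every member lands in the same bin, as with the binary split here, the variable no longer distinguishes anyone and adds nothing to the match.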

This code works, but it can be slow as a user-defined table function; a stored procedure with temp tables is recommended.

ALTER FUNCTION [dbo].[fnCEM]
    ( @TC_List CEM_TreatmentControl_Bins_List READONLY )
RETURNS TABLE
AS
RETURN
(
    -- Per-signature counts of treatment (TN) and control (CN) members.
    -- The inner join keeps only signatures present in both groups.
    WITH weights
    AS ( SELECT tbs.BinSignature ,
                tbs.BinSignatureHash ,
                tbs.Segment ,
                tbs.Scenario ,
                tbs.N AS TN ,
                cbs.N AS CN ,
                tbs.N * 1.0 / cbs.N AS [weight]
         FROM   ( SELECT tbs.Segment ,
                         tbs.Scenario ,
                         tbs.BinSignature ,
                         tbs.BinSignatureHash ,
                         COUNT(tbs.GlobalMemberID) N
                  FROM   @TC_List tbs
                  WHERE  tbs.IsTreatment = 1
                  GROUP BY tbs.Segment ,
                         tbs.Scenario ,
                         tbs.BinSignature ,
                         tbs.BinSignatureHash ) tbs
                JOIN ( SELECT tbs.Segment ,
                              tbs.Scenario ,
                              tbs.BinSignature ,
                              tbs.BinSignatureHash ,
                              COUNT(tbs.GlobalMemberID) N
                       FROM   @TC_List tbs
                       WHERE  tbs.IsTreatment = 0
                       GROUP BY tbs.Segment ,
                              tbs.Scenario ,
                              tbs.BinSignature ,
                              tbs.BinSignatureHash ) cbs
                  ON cbs.BinSignatureHash = tbs.BinSignatureHash
                     AND cbs.Scenario = tbs.Scenario
                     AND cbs.Segment = tbs.Segment
       ),
    -- Total matched control / total matched treatment, per segment and scenario.
    theControlWeightAdjusterThingKingDid
    AS ( SELECT w.Segment ,
                w.Scenario ,
                SUM(w.CN) * 1.0 / SUM(w.TN) AS theGreatAdjusterWeightThingy
         FROM   weights w
         GROUP BY w.Segment, w.Scenario )
    -- Matched treatment members get weight 1; matched control members get
    -- (TN / CN) * (total CN / total TN). Unmatched members are dropped by the
    -- inner join, which is equivalent to a weight of 0.
    SELECT ctl.IsTreatment ,
           ctl.Scenario ,
           ctl.GlobalMemberID ,
           ctl.BinSignature ,
           ctl.BinSignatureHash ,
           CASE WHEN ctl.IsTreatment = 1 THEN 1
                ELSE w.[weight] * thingy.theGreatAdjusterWeightThingy
           END [weight]
    FROM   @TC_List ctl
           JOIN weights w
             ON w.BinSignatureHash = ctl.BinSignatureHash
                AND w.Scenario = ctl.Scenario
                AND w.Segment = ctl.Segment
           JOIN theControlWeightAdjusterThingKingDid thingy
             ON thingy.Segment = ctl.Segment
                AND thingy.Scenario = ctl.Scenario
);

Alan Huffman

Chief Technology, Data, & Information Security Officer