How Apple can test if the Turkish Crime Family really has 200 million accounts — with only 100 samples

By now, you know the back story: a hacking group calling itself the Turkish Crime Family claims it holds the credentials for 300 million (or 500 million, or 800 million) Apple iCloud accounts, and is threatening to do bad things with them come April 7, should Apple not meet its ransom demands. But this article is not about iTunes gift cards, or whether the passwords came from breaches of other services.
This article seeks to answer the question: if TCF claims to have 300 million valid credentials, recycled or not, can an inquiring party challenge TCF to disclose, in a specific way, a very small set of credentials (say, no more than 100) such that, if TCF is able to do so, it demonstrates with high probability that TCF is in possession of at least a sizable proportion of the claimed number, such as 200 million?
It may seem that such a small disclosure could not possibly prove anything of the sort, but the answer is yes.


Below is a protocol by which an inquiring party could challenge TCF to disclose a small set of valid account credentials to validate its claim that it is in possession of a large number of credentials.
The “inquiring party” need not be Apple; it could be anyone able to verify that the credentials are real (e.g. by testing that the disclosed passwords actually work). When the inquiring party announces that TCF passed or failed the test, we want to be able to trust that claim, so the natural choice of inquiring party is a reputable news organization. In this case, though, it is especially important that the news organization or independent journalist handle the cryptographic details correctly.

This test inherently has many tunable parameters, which depend on the inquiring party’s choice of error bounds, on how many samples TCF should be required to disclose, and on the computational resources available. In this article, because April 7 is approaching, I have explicitly chosen values for these parameters that I believe are reasonable, establishing a test with the following properties:

  • To pass the test, TCF must submit a set of at least 100 valid account credentials with certain characteristics (details in the test description).
  • If TCF passes the test, we can be more than 99.98% confident that TCF holds valid credentials for more than 200 million accounts.
  • If TCF holds valid credentials for at least 300 million accounts, we can be more than 99.98% confident that TCF will pass the test.

Test Description

For this test, I will write the name “Brian Krebs” to represent the inquiring party. When I say that either party “discloses” information, it is best that this information be made immediately available to the public, so that anyone can verify the steps performed by both parties. Obviously, sensitive account details should not be disclosed publicly, but they should be audited by at least one other party, because some of their contents (the CanonicalIDs) are important to the cryptographic integrity of the test.

  • To begin, Brian Krebs and TCF agree on (and disclose publicly) these parameters:
    * Minimum number of valid records to pass the test, S: 100
     The number of accounts, together with valid credentials, that TCF must submit to Brian Krebs. But it’s not as simple as that.
    * Maximum Number of Tries, MaxTries: 28
     This must be a positive integer. More on what this parameter is later.
    * SHA256 Cutoff Value: 0000053a5ca4252274a7d0c081c689a13f54d31c4121a03e618fa0d45cbe340a
     This particular value corresponds to a per-entry passing probability p ~= 0.00000031161165788182999. To avoid ambiguity between Brian and TCF, the exact SHA256 Cutoff Value should be specified, rather than a decimal value p with insufficient precision. More on how this parameter is used later.
    * CanonicalID format: service-dependent; for Apple iCloud, normalized e-mail address
     This is an agreed method to select a single bytestring that will correspond to each individual account on the service. For example, for John Doe’s account on Apple iCloud, this may be his normalized email address, johndoe@gmail.com. Note that Gmail treats john.doe@gmail.com, JohnDoe@gmail.com, and johndoe+icloud@gmail.com as the same address, but for the purposes of this test, what matters is what Apple iCloud treats as the same or different in its login system.
     As another example, if the service in question happens to be Twitter, then the obvious choice of CanonicalID would be the Twitter account name. But because Twitter does not distinguish between upper and lower case in its account names, the two parties would have to agree on some normalization method (e.g. all lower-case).
     It is critical that there be only one bytestring that can correspond to each account. If the two parties cannot agree on a method that selects, for each account on the service, a single bytestring to represent that account, then they cannot proceed with this test. (The reason could be left as an exercise for the reader, but I will provide the answer in Final Remarks.)
  • Brian Krebs chooses a random 32-byte bytestring — call it b. He then discloses the SHA256 hash of b, SHA256(b), but keeps the value of b secret (for now).
    Example:
    b (in hex) = 54686973206973206578616d706c6520646174612e20546865206e756d626572
    SHA256(b) (in hex) = e2d4ddeaa7a08d30016643260b1d9996a33d42b33912dcfda3d754c48b0c716d
  • TCF now picks its own random 32-byte bytestring c and discloses it directly.
    Example:
    c (in hex) = 2031313135363020697320746865206f6e6c792072616e646f6d20706172742e
  • Now, once TCF has publicly committed itself to a value for c, Brian Krebs discloses the raw value of b. TCF then verifies that the SHA256 hash of this value of b matches the value of SHA256(b) that Brian Krebs committed to earlier.
  • Both parties now know the raw values of b and c, and so can compute a new 32-byte bytestring called Salt, computed as the exclusive or of the two raw bytestrings b and c: Salt = b^c
    Example:
    b, c as in previous examples
    Salt (in hex) = 74595842155f43000c0b41191809454f0a0d0d415c413a0c0a4d4e050c10115c
  • Now, TCF’s challenge is to find among its large set of user accounts a subset of at least S = 100 accounts, together with a single nonnegative integer which we will call try, which satisfy the following properties:
    * try is strictly less than the agreed MaxTries (in this example, it would have to be an integer in the range 0..27).
    * For each account in the subset, the value of SHA256( Salt || try || CanonicalID ) is strictly less than the agreed SHA256 Cutoff Value. The || symbol indicates concatenation, and try is a 32-bit (4-byte) big-endian unsigned integer. The SHA256 hash values are to be treated as 256-bit big-endian unsigned integers for the purpose of this comparison.
     Once TCF has found such a subset, it can submit to Brian Krebs the chosen value of try, together with the subset of CanonicalIDs and the corresponding other account credentials (passwords, phone numbers) for each account in the subset. (Both the commit-reveal steps above and this subset search are sketched in code immediately after this list.)
    Example:
    TCF is currently trying to determine which accounts can be included in the subset with try = 19 (which is 00000013 in hex as a big-endian 32-bit integer). TCF has credentials for John Doe’s account, with CanonicalID = johndoe@gmail.com, and Jane Doe’s account, with CanonicalID = janedoe@gmail.com. TCF then computes the hashes corresponding to these accounts:
    John Doe’s account, try = 19:
    hash = SHA256( Salt || try || "johndoe@gmail.com" )
         = SHA256( 74595842155f43000c0b41191809454f0a0d0d415c413a0c0a4d4e050c10115c 00000013 6a6f686e646f6540676d61696c2e636f6d )
         = 000001fafe117688dc9cdfc359385bc17dbc3b96a236a5b0cb75cdacdd369263
    This hash is strictly less than the SHA256 Cutoff Value (0000053a5ca4252274a7d0c081c689a13f54d31c4121a03e618fa0d45cbe340a), so John Doe’s account passes the check.
    Jane Doe’s account, try = 19:
    hash = SHA256( Salt || try || "janedoe@gmail.com" )
         = SHA256( 74595842155f43000c0b41191809454f0a0d0d415c413a0c0a4d4e050c10115c 00000013 6a616e65646f6540676d61696c2e636f6d )
         = b6273e0dec25f690d4047941301601d41429a53bce50b8c4b8597f5903e14a89
    This hash is not less than the SHA256 Cutoff Value (0000053a5ca4252274a7d0c081c689a13f54d31c4121a03e618fa0d45cbe340a), so Jane Doe’s account fails the check.
    So, for try = 19, TCF can include John Doe’s account in the subset, but cannot include Jane Doe’s account.
  • Brian Krebs now must verify the following properties about the submitted subset of accounts together with the try value:
    * The value of try that TCF chose is indeed strictly less than the agreed value for MaxTries.
    * For each account, SHA256( Salt || try || CanonicalID ) is indeed strictly less than the agreed SHA256 Cutoff Value.
    * (obvious) For each account, the associated credentials for the account are actually valid. Providing an account’s CanonicalID that passes the hash check doesn’t mean anything without a matching password that allows one to log in to the account. For the purposes of this test, Brian Krebs must invalidate every entry in the subset that does not have working credentials.
  • If the total number of remaining valid entries in the submitted subset is at least 100, then TCF passes the test, else it fails.
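To make the mechanics above concrete, here is a minimal Python sketch of the commit-reveal exchange and the Salt derivation. The variable names and helper function are my own illustration, not part of the agreed protocol:

```python
import hashlib
import secrets

def commitment(value: bytes) -> str:
    """Return the SHA256 hash (hex) of a secret value, for public disclosure."""
    return hashlib.sha256(value).hexdigest()

# Brian Krebs picks a random 32-byte b, publishes SHA256(b), keeps b secret.
b = secrets.token_bytes(32)
b_commitment = commitment(b)

# TCF, seeing only SHA256(b), picks its own 32-byte c and discloses it directly.
c = secrets.token_bytes(32)

# Brian Krebs now reveals b; anyone can check it against the earlier commitment.
assert commitment(b) == b_commitment

# Both parties derive the shared Salt as the XOR of the two bytestrings.
salt = bytes(x ^ y for x, y in zip(b, c))
```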
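And here is a sketch of the hash check and subset search itself, under the parameters chosen above; canonical_ids stands for an assumed iterable of the CanonicalIDs that TCF holds credentials for, and none of these names are prescribed by the protocol:

```python
import hashlib

CUTOFF = int("0000053a5ca4252274a7d0c081c689a13f54d31c4121a03e618fa0d45cbe340a", 16)
MAX_TRIES = 28
S = 100

def passes_check(salt: bytes, try_: int, canonical_id: str) -> bool:
    """True iff SHA256(Salt || try || CanonicalID) < cutoff, with try encoded
    as a 4-byte big-endian unsigned integer."""
    data = salt + try_.to_bytes(4, "big") + canonical_id.encode("utf-8")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") < CUTOFF

def find_passing_subset(salt: bytes, canonical_ids):
    """TCF's side: scan try = 0..MaxTries-1 for a subset of size at least S."""
    for try_ in range(MAX_TRIES):
        subset = [cid for cid in canonical_ids if passes_check(salt, try_, cid)]
        if len(subset) >= S:
            return try_, subset
    return None  # no try value yields a large enough subset; TCF fails

# Brian Krebs's side is the same passes_check applied to the submitted subset,
# followed by invalidating every entry whose credentials do not actually work.
```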

So, that’s one example of a test. But how does it work? And how do we calculate the probability of passing or failing for a TCF that has N = 300 million valid accounts, versus a TCF with a smaller number of accounts, n = 200 million?

Short Cryptography Primer

SHA256 is a cryptographic hash function, producing a 256-bit pseudorandom output from any input of arbitrary length. This test requires the following properties to hold for SHA256:

  • Given only the value of SHA256(b), it is computationally infeasible to infer anything about the raw value of b.
  • SHA256 is collision-resistant: it is computationally infeasible to produce two different inputs to SHA256 which result in the same SHA256 hash value.
  • SHA256’s output values are uniformly pseudorandom over the whole 256-bit output space; all outputs are equally likely.

For each different CanonicalID and try combination, the probability that SHA256( Salt || try || CanonicalID ) is less than the SHA256 Cutoff Value is simply p = SHA256 Cutoff Value / (2²⁵⁶), reading the cutoff value as a 256-bit big-endian unsigned integer. In this case, the probability p ~= 0.00000031161165788182999.
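As a quick sanity check of that arithmetic (a few lines of Python, assumed here only for illustration), reading the cutoff as an integer and dividing by 2²⁵⁶ reproduces the quoted p:

```python
cutoff = int("0000053a5ca4252274a7d0c081c689a13f54d31c4121a03e618fa0d45cbe340a", 16)
p = cutoff / 2**256  # big-int true division; float precision suffices here
print(p)             # ~= 3.1161165788183e-07
```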

Probability Analysis

Let us first consider the case where TCF has N = 300 million accounts.
First, consider the number of accounts TCF is able to find that pass the hash check for a fixed value of try = i. This is a random variable — call it X_i. Because X_i is the total number of account CanonicalIDs which passed the hash check, out of N total, each passing with independent probability p, we see that we can model each X_i as:
X_i ~ B(N,p)
, where B(N,p) indicates a binomial random variable with number of trials N = 300000000 and probability p ~= 0.00000031161165788182999.
Basic probability tells us that the expected value of each X_i is Np ~= 93.483497364548997.

Similarly, in the case where TCF only has n = 200 million accounts, we can model the total number of accounts eligible for inclusion in the subset with try = i, which we call Y_i, as:
 Y_i ~ B(n,p)
, with n = 200000000, and the same value of p. This time, the expected value of each Y_i is np ~= 62.322331576365998.

If your statistics software cannot compute probabilities for binomial random variables with very large values of N and n, observe that because N and n are large and the (common) value of p is small, you may instead approximate using a Poisson random variable. This models the random variables X_i and Y_i as Poisson random variables Pois(λ), with rate parameters λ equal to the expected values of the original distributions, that is:
 X_i ~~ Pois(λ), λ ~= 93.483497364548997
 Y_i ~~ Pois(λ), λ ~= 62.322331576365998
For the calculations of all values in this article, the exact binomial distribution was used rather than the Poisson approximation.
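For readers who want to reproduce the tail probabilities, here is a small sketch assuming SciPy is available; it computes the exact binomial value used in this article alongside the Poisson approximation for comparison:

```python
from scipy.stats import binom, poisson

p = 3.1161165788182999e-07  # per-entry passing probability from the cutoff

for accounts in (300_000_000, 200_000_000):
    exact = binom.cdf(99, accounts, p)      # P(X_i <= 99), exact binomial
    approx = poisson.cdf(99, accounts * p)  # Pois(lambda = accounts * p)
    print(accounts, exact, approx)
```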

In the case N = 300 million, TCF is permitted to submit any of the 28 subsets, the size of each being an independent, identically distributed random variable with the distribution of X_i.

What is the probability that any specific subset with try = i fails to contain S = 100 entries?

P(Subset try = i smaller than 100) = P(X_i < 100)
= P(X_i <= 99)
= P(X_i = 0) + P(X_i = 1) + … + P(X_i = 99)
= (1 * p^0 * (1-p)^N) + (N * p^1 * (1-p)^(N-1)) + … + ((N!/((N-99)! * 99!)) * p^99 * (1-p)^(N-99))
, (binomial distribution, ! indicates the factorial operator)
~= 0.73658194795502904

But recall that TCF is allowed to try all 28 values of try from 0 to 27, and submit any subset of at least size 100. So what is the probability that a TCF with 300 million accounts can find at least one subset with enough entries?

P(At least one subset try = 0..27 has size 100 or more)
= 1 - P(No subset try = 0..27 has size 100 or more)
= 1 - P(All subsets try = 0..27 have size 99 or less)
= 1 - P(X_0 <= 99, X_1 <= 99, …, X_27 <= 99)
= 1 - (P(X_0 <= 99) * P(X_1 <= 99) * … * P(X_27 <= 99))
, by independence of X_0 … X_27
= 1 - (P(X_0 <= 99) ^ 28)
, by identical distribution of X_0 … X_27
~= 1 - (0.73658194795502904 ^ 28)
~= 1 - 0.00019150935281386878
~= 0.99980849064718613

So, if TCF has N = 300 million or more valid account credentials, TCF is able to pass the test with probability greater than 0.9998.

And now we look at the other half of the test: if TCF has at most n = 200 million valid accounts, what is the probability that it is unable to find any try subset with at least 100 entries? We do similar calculations on the random variables Y_i:

Again, the probability that any single subset with try = i has less than 100 entries is:

P(Subset try = i smaller than 100)
= P(Y_i < 100)
= P(Y_i <= 99)
= P(Y_i = 0) + P(Y_i = 1) + … + P(Y_i = 99)
= (1 * p^0 * (1-p)^n) + (n * p^1 * (1-p)^(n-1)) + … + ((n!/((n-99)! * 99!)) * p^99 * (1-p)^(n-99))
~= 0.99999315974864207

Therefore, what is the probability that in this case, TCF fails the test, because none of its 28 subsets have 100 entries or more?

P(No subset try = 0..27 has size 100 or more)
= P(All subsets try = 0..27 have size 99 or less)
= P(Y_0 <= 99, Y_1 <= 99, …, Y_27 <= 99)
= P(Y_0 <= 99) ^ 28
, by independence and identical distribution
~= 0.99999315974864207 ^ 28
~= 0.99980849064718613

So, if TCF has only n = 200 million or fewer valid account credentials, it fails the test with probability greater than 0.9998, as required. ■
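The two headline numbers can be reproduced in a few lines; this sketch uses SciPy’s double-precision routines, which is coarser than the MPFR arithmetic used for the figures in this article but agrees to many digits:

```python
from scipy.stats import binom

p = 3.1161165788182999e-07
MAX_TRIES = 28

q_big = binom.cdf(99, 300_000_000, p)    # P(one subset too small | N = 300M)
q_small = binom.cdf(99, 200_000_000, p)  # P(one subset too small | n = 200M)

print(1 - q_big ** MAX_TRIES)  # P(pass | 300M accounts) ~= 0.9998
print(q_small ** MAX_TRIES)    # P(fail | 200M accounts) ~= 0.9998
```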


Final Remarks

  • The fact that the test passing probability in the case N = 300000000 is approximately equal to the test failure probability in the case n = 200000000 is no coincidence: I chose the SHA256 Cutoff Value (which defines p) to minimize the maximum of the two classification error probabilities. (A sketch of this tuning step appears after these remarks.)
  • The choice to distinguish 300 million accounts or more from 200 million accounts or fewer is arbitrary but necessary, as is the choice to make the two classification error probabilities equal. The short explanation is that there is clearly no perfect test: if the required set size is S, someone with as few as n = S accounts could get astronomically lucky and pass, while someone with the full N accounts could get unlucky and fail to produce a set of that size, unless S = 0. Therefore, any test must balance the probability that an N-holder fails against the probability that each n < N holder succeeds, based on how “wrong” you consider each misclassification to be (e.g. allowing someone with n = 299999999 accounts to pass the test is not as “wrong” as allowing someone with only n = 64000 accounts to pass it).
  • The value of 28 for MaxTries is not the optimum choice, in the sense of minimizing the maximum of the two classification error probabilities. I don’t think any MaxTries value attains a lowest possible maximum error probability: there exists at least one test with MaxTries = 9000000000000000000 (but with a different SHA256 Cutoff Value) that has a very good maximum error probability of less than 10^-11 (one in 100 billion). However, it would be computationally infeasible for TCF to perform that many SHA256 hashes per account. In practice, the two parties may wish to agree on a different value for this parameter (and all others) depending on the amount of computational power available to TCF.
  • The choice of limiting try to a 32-bit integer was arbitrary, but also based on the assumption that computing more than 2³² SHA256 hashes for each of 300 million accounts is not computationally feasible for those without massive distributed computing resources. If necessary, a variable-length integer encoding could be used for try instead, but this would have to be agreed upon by both parties before starting the test.
  • The requirement of at least S = 100 samples in the submitted subset strongly influences the strength of the test: with a requirement of S = 200 samples, there exists a different SHA256 cutoff (with p ~= 0.00000065501421257190503), still with the same maximum number of retries of 28, for which the maximum of the two classification error probabilities is ~= 0.000000366193358092999923 (less than 1 in 2.73 million). The factors influencing the choice of S are TCF’s reluctance to disclose a large subset, and the inquiring party’s reluctance to validate the account credentials of a large subset, given error bounds that both parties already deem sufficient.
  • In the design of this protocol, there was an asymmetric choice of which party would be second to disclose its raw value (out of b and c). This stemmed from an analysis of which party might have a bigger disincentive to complete its disclosure of its raw value after already having received the raw value of the other party:
    * If the protocol were reversed so that TCF discloses SHA256(c) first, then Brian Krebs discloses b, then before TCF discloses c, it could immediately determine if it will end up being able to pass the test — it already knows the raw values of b and c, and can therefore compute Salt = b^c and proceed with the rest of the calculations. If it determines it would fail the test, TCF might say that it lost the raw value of c, and ask Brian Krebs if they could start the test again. If Brian Krebs allows this to pass, then Brian Krebs is effectively granting TCF a free reroll of another MaxTries number of subsets.
    * The reverse of this holds for the test in its current form: if Apple and Brian Krebs are co-conspirators, and Apple has perfect knowledge of the exact set of credentials that TCF holds, then once TCF has disclosed c, Apple, knowing both b and c, can calculate whether TCF will be able to pass the test if Brian Krebs discloses b, and then may choose to abort the test before the disclosure if for whatever reason Apple does not wish that TCF publicly pass the test.
    Therefore, it is vital that the public make note if the party that discloses the SHA256 hash first (Brian Krebs) fails to disclose the raw value.
  • Collision resistance of SHA256 is required:
    * In the reverse protocol, where TCF sends SHA256(c) first and Brian Krebs then discloses b, suppose TCF has found a SHA256 collision between two 32-byte messages: SHA256(c) = SHA256(d), c != d. After TCF discloses SHA256(c) and Brian Krebs discloses the raw value of b, but before TCF discloses what is supposedly its value of c, TCF can run the entire test twice, once with Salt = c^b and once with Salt = d^b; if either value produces a subset of size at least 100, TCF can then choose which of c or d to disclose. This effectively would allow TCF to double its allowed number of retries.
    * Similarly, for the protocol as it is written, an inquiring party with such a SHA256 collision could conspire with Apple to do the same thing, but this time, they would be able to choose which of two different Salt values to subject TCF to, which would square (reduce) TCF’s probability of success.
  • In this example, only two parties take part in the choice of the final value of Salt. If the public thinks a single inquiring party may be co-conspiring with TCF, or the public believes Brian Krebs is not able to generate a value of b with high entropy, then the computation of Salt can be trivially extended to incorporate many inquiring parties:
    * Brian Krebs chooses a random b, and discloses SHA256(b).
    * ZDNet chooses a random z, and discloses SHA256(z).
    * Apple chooses a random a, and discloses SHA256(a).
    * TCF discloses c.
    * Brian Krebs, ZDNet, and Apple disclose b, z, and a. (Again, be aware that the last party, or last chain of co-conspiring parties, to disclose here may have an incentive to abort the disclosure.)
    * All parties verify all other parties’ raw values against the committed hashes.
    * Salt = b^z^a^c.
    This allows every party to have full confidence in the entropy of the final value of Salt, which is at least the entropy of its own chosen component (b, z, a, or c). This change to the protocol does not require changing any other parameters.
  • If TCF has a large number of account credentials, but believes that each credential is only valid with probability v < 1, then Brian Krebs and TCF should agree beforehand on different test parameters. For any choice of SHA256 Cutoff Value (which defines the value of p), TCF knows each account will pass the test with probability pv, by independence. In an example case where it has 500 million accounts and v = 0.5, TCF should therefore negotiate for test parameters suitable for testing N = 250000000, or simply declare that it only has 250 million accounts. Then, when it needs to produce a subset of S = 100 entries, it can simply send the full subset of all accounts that passed the hash check (which is likely over 200), and have the same confidence that, after Brian Krebs invalidates the accounts with invalid credentials, the remaining set will still have size at least 100.
  • In the unlikely case that TCF is unable to find a subset with size 100 or more, it may simply submit its largest subset to Brian Krebs. The number of valid entries in the set still allows Brian Krebs to compute the likelihood that TCF has at least n valid accounts, for each n < N.
  • If by this time you still haven’t worked out why CanonicalID must be of a form that only permits a single bytestring to map to any single account, the answer is that it is simply a problem of set inflation: if, for example, Apple iCloud allows John Doe to log in using johndoe@gmail.com, JohnDoe@gmail.com, JOHNDOE@GMAIL.COM, or any uppercase/lowercase derivative, then for each legitimate single account which TCF has credentials for, it can feed this single account through the hash check once for every capitalization variant, without changing the value of try, giving one fresh roll against p for each variant. If there are 64 variants for each account, then TCF will likely pass a test designed for N accounts, even if it only has N/64 real accounts. If the inflation ratio becomes so extreme that the principles of the birthday paradox might apply (a resulting subset contains both JohnDoe@gmail.com and johndoe@gmail.com), TCF can simply remove all duplicates before submitting the subset to Brian Krebs, which is unlikely to reduce the set size significantly.
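As promised above, here is a rough sketch of the tuning step that picks p to balance the two classification error probabilities. It bisects on p using SciPy rather than reproducing my exact MPFR computation, so treat it as an illustration of the idea, not the original tuning code:

```python
from scipy.stats import binom

S, MAX_TRIES = 100, 28
N, n = 300_000_000, 200_000_000

def errors(p):
    miss = binom.cdf(S - 1, N, p) ** MAX_TRIES             # N-holder fails the test
    false_pass = 1 - binom.cdf(S - 1, n, p) ** MAX_TRIES   # n-holder passes the test
    return miss, false_pass

# As p grows, the N-holder fails less often while the n-holder passes more
# often, so the maximum of the two errors is minimized where the curves cross.
lo, hi = 1e-9, 1e-6
for _ in range(100):
    mid = (lo + hi) / 2
    miss, false_pass = errors(mid)
    if miss > false_pass:
        lo = mid  # p too small: raise it so the N-holder passes more often
    else:
        hi = mid

print(lo, max(errors(lo)))  # p ~= 3.1e-07, max error ~= 1.9e-04
```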

Closing Comments

I came up with this test (the Fish-Baskin test) after hearing the news about how media outlets were unable to verify or falsify TCF’s claimed number of account credentials based on the disclosure of only a limited sample. Because I devised it in a limited time, and because it involves cryptography, it is likely that I have missed some small but fatal detail that would invalidate the entire test, so it absolutely requires much public scrutiny from actual cryptographers before it is used in any serious setting. I remain very open to the possibility that nobody has published a test like this before simply because someone considered it, determined that it doesn’t work, and consequently never published the negative result.
The probabilistic aspects of this test remain unintuitive to me. I may explore and try to explain them in the future, but it is likely you will get more immediate answers about and insight into how the test works (or doesn’t) from anyone versed in basic probability.
If you have any inquiries about how to verify the probabilities in this article or compute the probabilities or best parameters for other testing scenarios, consult your local statistician.

It’s called the Fish-Baskin test because it combines two things: comparing two (approximately) Poisson random variables with different rates, and strengthening of the test by allowing the challenged party to retry its sampling, but only among a limited number of “flavors”… which may be 31, but in this case was 28.

Special thanks to the GNU MPFR project, the free and open-source arbitrary-precision floating-point arithmetic library which enabled me to compute the probabilities used in this article.
