# Information Theory

## A Mini-Companion to a Mini-Introduction

# 1 Introduction

The physicist and mathematician Edward Witten (of M theory renown and I’m sure among other contributions) recently published to arXiv.org a paper serving as a mini-introduction to information theory, including both classical and quantum considerations: **A Mini-Introduction To Information Theory [****arXiv:1805.11965****]**, which I am currently reading through (although the paper is only 39 pages, the information density is such that intentional review takes a bit more time and effort than a newspaper article might for instance). Although I have had some extracurricular interest in the field for a small while (including auditing a series of MOOC’s on eDx.org), I find that there are still several mathematical notations and conventions (especially in the quantum cases) that are common obstacles to comprehension, many of which these type of papers take for granted in a reader, thus I find myself often turning to supplemental resources such as wikipedia.org, mathworld.wolfram.com, or google.com as I work my way through these type of papers. Since am going through the effort anyway, I figured it might be of potential benefit to others new to the field to document a few of these points I come across along the way, which will be the purpose of this essay. Note that this won’t be a comprehensive review of all of the notations and principles included in the paper, this is merely meant to capture a few of the points that I suspect may be less intuitive to someone with the starting point of an engineering degree or the like. As a means of organizing this presentation, I’ll follow the lead of Witten’s paper and present points in the order that they appear in the Mini-Introduction including use of that paper’s section and equation numbering convention, here again is the arXiv: **A Mini-Introduction To Information Theory [****arXiv:1805.11965****]**.** **I’ll more closely follow the narrative of the paper in the classical section and then as we get into the quantum aspects I expect will transition more to supplemental takes. This essay is intended as a supplemental resource, and a reader looking to get the full value here should consider reading it either in parallel or adjacent to a review of Witten’s work.

**2 Classical Information Theory**

**2.1 Shannon Entropy**

It is certainly appropriate that this paper begins with an overview of Shannon entropy. Claude Shannon’s paper A Mathematical Theory of Computation served as the origination of the field of information theory and everything that follows here is built off of that foundation. The core idea of Shannon entropy is that given a signal (such as for example a binary code’s series of 1’s and 0’s or alternatively e.g. a textual communication derived from a series of letters grouped into words and sentences), it is possible to measure the information content of that passage based on other features we can derive from the signal. For example, just from inspecting a signal we can determine the range of potential characters (such as a binary code’s two states 0 and 1 or the English alphabet’s 26 letters), and we can also measure the frequency of occurrence for each state and thus derive the probability of occurrence for each signal (for example in an English passage there will be a higher frequency of vowels than consonants). Using just these two features, it is possible to derive an entropy value S for each additional point in a signal.

It is further demonstrated that this entropy is constrained to S >= 0, where 0 would correspond to Shannon entropy of a message with complete certainty (such as a unitary code of all 1’s). The S value tells us a few things, a key one being that for a signal of length N points, we can derive the number of equivalent bits of information we can extract from a message — or another way of looking at this is number of bits for maximum compression of a signal.

Note that this derivation of limits assume no correlation between points in a signal, this is a simplification and thus there is actually potential for further improvement given further knowledge (such as knowing the rules of grammar for an English passage), although one that would be washed out as N gets sufficiently large.

**2.2 Conditional Entropy**

The usefulness of this Shannon entropy measure becomes more apparent as we extend applications to two-party communication. As is a common convention in the field of information science, we’ll consider two illustrative parties Alice and Bob, where Alice is trying to communicate some message based on a signal distribution X, and Bob receiving a resulting signal with distribution Y, where the difference between the two distributions X and Y originates from such interferences as environmental noise or other errors of signal transmission. In a classical message using binary code transmissions these errors would manifest as bit flips on random points (meaning a point intended as 0 would be read as a 1 or vice versa), in a quantum communication there would be other potential channels of error such as bit flips, phase flips, depolarizing, amplitude dampening, or phase dampening — more on quantum channels to come.

In analysis of this communication, we can evaluate the entropy across a few different bases. The entropy S of distribution X (we’ll call SX) we already know how to derive, just as we know how to derive SY. Another basis that we can consider for this system is SXY, which represents the entropy for the joint system of X and Y, of which the probability p(xi,yj) used in the derivation represents the joint probability of experiencing points x(i) from Alice’s distribution X and y(j) from Bob’s distribution Y.

The conditional entropy comes into play when we want to try to see through Bob’s eyes. Bob doesn’t know with certainty Alice’s original message, but we will assume that he has an idea of Alice’s signal distribution X. Thus when he receives a signal point y(j) from his distribution Y, a reasonable question for him to ask himself is “What is the probability that Alice sent the point x(i) given that I received the point y(j)?” This conditional probability can then be used to derive the conditional entropy for this state of affairs (note the constraint that this conditional entropy be >= 0, an interesting bit of foreshadowing is laid from Witten’s comment that this is not necessarily true for a quantum signal).

I’m not adding much to the discussion beyond what Witten has laid out by including the following, but it is an important point of the narrative so will go ahead and include anyway, as part of the goal here is also to highlight key passages in the paper. The information about X that Bob gains from what he sees in signal Y is called the *mutual information*.

**2.3 Relative entropy**

This section of the paper is devoted to a proof that this mutual information figure is nonnegative ( I(X;Y) >= 0 ), derived by first introducing the concept of relative entropy. Consider the case where we believe a distribution X is described as Q, but this Q is just a hypothesis, one we need to test to confirm. Let’s further say that the actual distribution X is described as P, and so there is a difference between our hypothesis for distribution Q and the correct distribution P of X. The more measurements we take of X (of quantity N), the more information we can infer about the difference between the correct distribution P and our hypothesis distribution Q. Based on our N measurements of X, we can infer that the probability that our hypothesis Q = P (the correct distribution) converges to the following formula for sufficiently large N:

The entropy included in the exponent of this function is the relative entropy, also known as the Kullback-Liebler divergence, which from the derivation we know is nonnegative.

We’ll find that if our hypothesis distribution is correct (P = Q) then the Kullback-Liebler will be ~ 0 (with more certainty for increasing values of N), and if our hypothesis distribution is incorrect our inferred probability from Eq 2.19 will shrink with sufficient N.

The proof that the mutual information (Eq 2.16) is nonnegative is developed by extending this relative entropy from the distribution X to an equivalent joint distribution X and Y, and using some logarithmic properties to show that the relative entropy of this joint distribution is equivalent to the mutual information between these two distributions, which must therefore also be nonnegative.

We made an interesting leap by building on the equivalence between the relative entropy for a single distribution X or an associatively derived joint distribution X,Y. It’s probably worth a clarification that the X in the joint distribution wasn’t meant to be equivalent to the X from the single distribution — it may be easier to think of the equivalent single distribution as X’, where for each case the probability p(x’) = p(xi,yj).

**2.4 Monotonicity of Relative Entropy**

The considerations of joint probability distributions are more in play in this section, as we start to consider what happens when we can make observations for only one of the distributions (X) but not of the other (Y). One thing we can infer off the bat for this case is that adding observations to Y can only help our ability to infer the differences between our hypothesis distribution Q(X,Y) and the actual distribution P(X,Y), thus we can make this statement about the relative entropy for the joint distributions X and Y.

We can extend these questions of joint distributions from the pair X,Y to the joint of three distributions X, Y, and Z. It’s easy to replace the Y in equation 2.28 above to an associatively derived joint distribution Y’,Z to give the same result for three distributions. Witten derives a further property of interest for this case using relations between relative entropy and mutual information called strong subadditivity, one that I did not find extremely intuitive. Some further foreshadowing was provided in the comment that this turns out to be true in quantum mechanics — Witten literally describes this finding for the quantum realm as a miracle.

Similarly to the monotonicity of relative entropy as described above, in the mutual information space, what one learns about a random variable X by observing both Y and Z is at least as much as one could learn by observing Y only.

Some closing thoughts on this, the final portion of the classical information theory section of the paper. I think I followed Witten’s narrative in this writeup much closer than I had originally intended. The goal is not to duplicate Witten’s findings, ideally this essay should serve to elaborate on points I found helped with comprehension that Witten might take for granted, while also highlighting some of the more key findings. I’ll try to stick to this philosophy a little closer in the next sections addressing quantum information theory.

**3 Quantum Information Theory: Basic Ingredients**

**3.1 Density Matrices**

So rather than trying to follow along completely with the narrative, I’ll begin now trying to offer explanations of key concepts and notations that the paper might take for granted. Quantum mechanic notations I think can be thought of mathematically as linear algebra applied to matrices subject to some constraints describing an analogous property to probability of some outcomes upon measurement — but with the probability expanded to allow for +/- values as well as real and imaginary components (a complex scalar). As the number of potential outcomes incorporated into the analysis grows so does the dimensionality which can be described by a Hilbert space vector of dimensionality. We can incorporate or combine different frames of reference using a tensor product operation. The states are typically represented using the Bra-Ket notation <Bra| and |Ket>, where <Bra| represents a row vector and |Ket> a column vector. We have the ability to translate from a Bra to a Ket vector and vice versa using a conjugate-transpose operation (basically a transpose operation coupled with a complex conjugation that multiplies imaginary portion of scalars by (-1)). I am not positive but assume BraKet is an attempt at a pun on the term bracket which if so I fully support. We can combine the vectors using either an inner product (dot product) abbreviated as <Bra|Ket> or an outer product (tensor product) abbreviated as |Ket><Bra|.

This BraKet convention can be extended to incorporate an intermediate square matrix in the inner product using standard matrix multiplication, such as:

The tensor product operation (⊗) can also be extended from vectors to matrices input such as:

The paper makes use of this operation to demonstrate that if we have two Hilbert spaces of vector dimensions A(i,j) and B(k,l) we can combine the two spaces via tensor product to get a Hilbert space matrix of dimensions N(i*k, j*l).

I’m trying to cover a lot of the fundamentals of the linear algebra quantum system notation conventions here and in the process getting a touch removed from the paper’s narrative, let’s try to jump back in by addressing the theme of this section 3.1, density matrices. I confess I ended up pulling up some supplemental wikipedia reading to follow along, and I think in the process found a pretty intuitive textual definition of the difference between a pure state and a density matrix. In each case it is possible to combine multiple quantum substates, however in the pure state the substates are combined using the tensor product in a quantum superposition, and in a mixed state’s density matrix the states are combined as a classical probability statistical ensemble of states (remember a classical probability is a real scalar subject to 0<=p<=1 and the quantum equivalent is an imaginary scalar q with p = |q|²<=1, and in both cases ∑p=1 (actually come to think of it I don’t think have come across a term for the quantum equivalent of a classic probability, on the off chance none exists let’s call it here the quability, got a nice ring to it)).

The paper makes use of the trace operation (Tr) as well so I’ll expand on it quickly. When we have a combination of quantum substates, say state A which represents a qubit and state B which could be a second qubit or alternatively could represent the rest of the universe, by taking a trace in the B basis we’re effectively saying that we can choose to ignore the B basis and drop it altogether from our evaluation, which is a fortunate ability because I imagine nailing down the state space of the rest of the universe might be a challenge for most scientists. Looking at it from a linear algebra standpoint, a trace operation for a matrix can be derived from the sum of its diagonal elements.

Note that these density matrices are described in the paper as subject to the constraints of hermitian and positive semi-definite, let’s quickly define these terms. Hermitian matrices are square dimensions with nondiagnal entries such that the value of the complex scalar in row i and column j is equal to the complex conjugate of the value in row j and column i, where a complex conjugate just means that the imaginary part of the number is multiplied by -1. To be honest I’m not sure why this matters or how this constraint comes about so let’s just go with it. As for positive semi-definite, I think the easiest way to define this is the constraint that the eigenvalues of the matrix are positive. Of course this opens up a whole can of worms, now we have to address eigenvalues. Here’s my layman’s sketch (which I am basing heavily on the clear writeup in Wolfram mathworld): given a square matrix A, there exists a scalar eigen value λ and a corresponding (column) eigen vector X such that AX = λX. Thus λ is a kind of characteristic feature of a matrix, and it turns out there will be several such eigen values λi and corresponding eigen vectors Xi with quantity based on the size of the matrix. Further, it is possible to transform (e.g. rotate thought it’s dimensions) the basis of hermitian matrix A by applications derived from the set of eigenvectors to result in a matrix of same dimensions but whose values are all 0 save for the diagonals whose values are the set of eigen values λi, a process known as diagonalization.

When Witten describes the purification of a density matrix, this diagonalization property comes into play in the proof that we can create a process for incorporation of an orthogonal state space derived for a specific mixed state ρA such that when such unitary transformation is applied it transforms what started as a mixed state ρA to a pure state ψAB.

Before closing this section one further note on this Bra Ket notation, which I’ll include as kind of an aside and because it’s relevant and fundamental to many of my earlier writings on quantum computation even though it’s not directly addressed by Witten. For the special case of quantum computing with a qubit measured in the binary 0/1 basis (as opposed to say the +/- basis for the instance of another common basis) and thus a 2 dimensional Hilbert space, the inner product of the state descriptor Bra vector and the qubit measurement identifier vector Ket can be combined between multiple qubits using the tensor product operation for a joint state of multiple qubits such as we might get as we scale up our quantum computers. Here the Ket |0> refers to a single qubit measurement generating a 0, a |1> refers to a measurement generating a 1, with after combining two qubits for instance the Ket |00> indicates measurement on both qubits produces a 0, |01> indicating the first qubit measures a 0 and the second qubit a 1, and etc for |10> and |11>, with this combined Hilbert space dimensions of 2^n where n is the number of qubits. We can describe the state of a single qubit’s superposition as |ψ> = a|0> + b|1>, where a and b are complex scalars subject to the constraint |a|²+|b|² = 1. Combining multiple qubits thus entails:

**3.2 Quantum Entropy**

Now that we have the vocabulary to describe quantum states and their associated density matrices, we can progress to evaluating the informational entropy measure of these states. The quantum equivalent of the classical information theory’s Shannon entropy measure for a quantum state is called the von Neumann entropy, evaluated against a density matrix ρ. We incorporate the Trace operation into the definition to clarify that we are ignoring properties external to our basis of analysis. The paper’s comment that this S is manifestly invariant under a unitary transformation can be interpreted to mean that S won’t change with the application of a unitary transformation (such as for instance the application of quantum gates to a qubit (or collection thereof)), and we’ll find the same to be true when we get around to quantum conditional and relative entropy in section 3.4.

It’s an important point to consider that while the Shannon entropy was null (S=0) when the state was known with certainty, here the quantum equivalent von Neumann entropy is null (S=0) for a pure state ψ with rank 1 density matrix (i.e. pi = 1 for pure state ψi). That gives the lower bound of S, the upper bound of S corresponds to a perfectly mixed state between k constituent pure states such that the density matrix sees a constant classic probability for each as pi = 1/k for all pi.

Remember when we considered the classical Shannon entropy for the joint distribution of two distributions X and Y? Now we can look at a quantum equivalent for a pure state derived from tensor product of two pure states ψAB = ψA ⊗ ψB. The paper implies that we should be fascinated when we then compare the von Neumann entropies of each state ψA and ψB and find they are equal as SA = SB, although really this is only happening because for any pure state, including a composite pure state as well as those pure states from which it was derived, pi=1, so it’s not really that shocking of a result albeit it is unique to the quantum case.

This section closes by considering the Trace operation applied to a composite mixed state. I’ll discuss the trace operation further in my comments to section 3.5 so for now will just defer to Witten’s treatment.

**3.3 Concavity**

The concavity feature is described for the combination of two density matrices ρ1 and ρ2, where the von Neumann entropy S is concave with respect to the mixing ratio t.

Recall that density matrices are statistical ensembles of pure states using classic probability ratios, derived from the sum of self tensor product matrix of each pure state multiplied by that state’s probability ρ = ∑pi*|ψi><ψi|. It follows then that a collective ensemble of two pure states can be developed simply by multiplying the first pure state by a mixing ratio t (with 0<=t<=1) and the second by (1-t) and then summing the two together as ρ(t) = tρ1 + (1-t)ρ2, which is what Witten demonstrates. The insight of this section comes into play when we consider the first and second order derivatives of the entropy corresponding to that ensemble density matrix with respect to t S(ρ(t)) , e.g. S’(ρ(t)), and S’’(ρ(t)), and it turns out after derivation that the second order derivative is negative for increasing t, S’’(ρ(t))<0, which in mathematic terms can be described as “concavity”.

Concavity and it’s counter convexity are fairly straight forward concepts that describe the sign of a second order derivative of some function. I find a handy way to picture these features graphically is that the graph for a convex function against some variable will curve up in one or both tails such as a happy smiling face, while a concave function will carry the frowning shape of someone who might be feeling a tad blue. The paper calls out that entropy can only increase under mixing, and thus there will be some mixing ratio 0<t<1 which maximizes the entropy S(ρ(t)). I expect that when ρ1=ρ2 and are pure states (of rank 1) then t can be any value 0<=t<=1 for maximum S, which I believe corresponds to the case where the second order derivative d2S(ρ(t))/dt2 = 0, the only exception to the concavity.

This section concludes by asking an interesting question. If we have a density matrix ρ which includes nondiagonal entries, and we drop from ρ those (non)diagonal matrix elements to be replaced by 0’s to derive what we’ll call ρD, what does the mixing function look like between these two versions ρ and ρD? As an aside this type of operation I believe corresponds to a non-unitary transformation such as might change a matrix’s set of eigen values, similar to a qubit experiencing some channel of error. As for the result, Witten uses this case to illustrate that dropping the off-diagonal part of a density matrix (in any basis) can only increase the entropy. Witten later offers an example for the physical realization of this diagnalization procedure in section 3.7.

**3.4 Conditional and Relative Quantum Entropy**

In sections 2.2 and 2.3 we introduced conditional and relative entropy for classical information, here we explore the quantum equivalent, but first recall that in section 2.2 we demonstrated an entropy formula for joint distributions SXY. In the quantum world we have the two different types of joint distributions — those pure states joined via tensor product into a superposition, and alternatively a dense state consisting of a statistical ensemble of pure states — in both cases described by a density matrix. This is mostly speculation but I expect one of the indications of whether to expect a pure state ensemble or a dense matrix ensemble when joining a series of quantum states could involve the scale of address — for example two electrons interacting with each other might interact into a joint superposition pure state, while interacting features at the scale where environmental influence comes into play would interact as a statistical ensemble, the difference between nano scale and macro scale — this is just speculation though and I don’t consider myself an expert. In either case the entropy of the joint system will be as derived in 3.2, as a function of the constituent density matrices for the ensemble.

In 2.2 we derived for the classical case the conditional entropy for a bipartite system where Bob was trying to reconstruct a signal from Alice based only on his knowledge of her signal’s distribution, S(X|Y) which we found we could simplify to S(X|Y) = SXY — SY >= 0. In the quantum case we’ll use that same definition even though as Witten points out there’s not really a good quantum notion of conditional probabilities, and the fascinating point is that unlike for the classic case, in the quantum case S(A|B) negative values are allowed.

Recall that we discussed a purification process in section 3.1 whereby we could transform a dense ensemble to a pure state by conjoining with some specific orthogonal dense state, such that even though SB>0, as a pure state the purified SAB = 0. This proves that it is possible for SAB < SB, and it also corresponds to the boundary condition where SA|B = -SA.

Recall the classical property of mutual information which quantified the information about X that Bob gains from what he sees in signal Y. Just as in the case of conditional entropy, even though there isn’t a direct corollary for inferring information about one quantum state from another, we can still borrow the definition from the classical realm, and this time we even retain the nonnegative constraint (where I(A;B)=0 when ρAB factorizes, meaning ρAB=ρA⊗ρB).

Witten next addresses the question of why this quantum version of mutual information maintains the nonnegative constraint, and to do so he introduces another quantum analog to a previously discussed classical property, in this case relative entropy, S(A||B). Recall that we discussed relative entropy in section 2.3 and that it served as a way to measure the difference between a hypothesis distribution Q vs an actual distribution P, aka the Kullback-Liebler divergence. Witten first proves nonnegativity of this relative entropy, and then demonstrates that it can be derived using the same formula as mutual information, just as we did in the classical case.

I’ll close this section with a somewhat off-topic observation. As can be inferred from David Deutsch’s proof of universal quantum computation, once we have purified an arbitrary density matrix, I expect it would be possible to translate a pure state ψ1 to some arbitrary posterior pure state ψ2 of equivalent Hilbert space simply by the application of unitary transforms such as may be realized for a qubit using a universal set of quantum gates, with our only bound the constraints of these equations for conditional and relative quantum entropy.

**3.5 Monotonicity of Relative Entropy**

Recall that the monotonicity of classical entropy was demonstrated by the comparison of the relative entropy (Kullback-Liebler divergence) between two distributions, such as may be evaluated in comparison of a hypothesis and actual distribution, and that the property can be described by the observance that the relative entropy between two multivariate distributions will be >= a comparable relative entropy of the same distributions after dropping one of the variables.

In the quantum world, the trace operation (Tr) serves as the equivalent to this dropping a variable by reducing the Hilbert space under evaluation. For an example of a quantum computer, a trace operation could be realized by the measurement of a qubit causing the collapse of that portion of the collective superposition, or alternatively just by discarding the qubit such that no further interaction is realized in a fashion that can inform the evaluation.

Jumping back to the monotonicity narrative, Witten then uses the result of monotonicity of (quantum) relative entropy coupled with the corresponding monotonicity of mutual information to derive a demonstration of what he has called the miraculous result of strong subadditivity in quantum information, along with a few equivalent statements, including some relating to entropy of combined systems and some related to conditional entropy of systems.

So what is it about strong subadditivity of quantum information that causes Witten to use a word like miraculous? This section opens with a comparison to classical probability, where although classical probability allows for joint multivariate distributions, there is no equivalent in quantum observables. Despite this disparity, many of the conventions of classical information theory extend into quantum information equivalents, albeit with a few adjustments such as the potential for negativity in conditional entropy.

**3.6 Generalized Measurements**

As this section addresses measurements, it’s probably good to figure out what is meant by the orthogonal hermitian projection operators πs for s from 1 to k, where k is the number of Hilbert dimensions. First from what I’ve been able to gather about projection operators, πs is a square matrix which wasn’t explicitly stated in the paper (although I think that might be implied by the use of the word “operator”). I found an improved description of the derivation of the measurement operator in Nielsen and Chuang’s text Quantum Computation and Quantum Information, section 2.2.3. Basically we have a set {πs } which is in the form of a horizontal row vector like a Bra vector <πs|. We can then apply an outer product between a corresponding Ket vector |πs> to derive the square matrix operator (recall that we can translate from a Bra to Ket vector using the conjugate-transpose operation).

As an aside, note that the specific projection operator will be a function of our basis of measurement. A quantum computer is generally built to distinguish between the superposition of two opposite states such as 0/1 or +/-. It is possible to build a kind of qubit that upon measurement distinguishes between >2 states (e.g. a 3 state qutrit), but I believe it has been demonstrated that the binary basis will generally have close to comparable computationally efficiency.

Turning back to the paper, Witten goes on to introduce linear operators Es for s:1-k which although carry some of the same formulas for application to a state, are not orthogonal projection operators but are derived from the application of some unitary transformation. This is a more general kind of quantum mechanical measurement which we’ll learn more about in the next section. These linear operators Es are called Kraus operators.

Since we are talking about the measurement operation here, I’ll close by going on a brief tangent related to measurements and associated time-based evolution of our quantum states, which will be inspired from an illustration included in Roger Penrose’s recent work “Fashion, Faith, and Fantasy in the New Physics of the Universe”. In this case the below image depicts the two types of evolution, deterministic evolution over time per the Schrödinger equation for time based evolution of a quantum state, and probabilistic state collapse. One of the things that can be inferred here is that the time evolution has a positive slope in all cases for increasing t, where I believe the y axis is a somewhat abstract representation for Schrödinger evolution. I think this is comparable to what I’ve already touched on in that for a qubit superposition the time evolution will always trend to the direction of decoherence. I think an important lesson I am inferring here is that this trend to decoherence is a different operation than the measurement projection, which goes in a different direction. I am thus inferring that a qubit’s trend to decoherence is driven by the evolution from a pure state to an increasing mixed dense state, such that a fully decohered qubit will be maximally mixed, with an invertible density matrix ρ, a different result that a qubit’s post-measurement state. Although to be honest I’m kind of speculating here.

**3.7 Quantum Channels**

This section addresses quantum channels for the evolution of a density matrix. Very simply a state ρ will evolve per the application of a unitary matrix U.

This is the special case for evolution under a single unitary transformation (a set of Kraus operators with k=1) . For the more general case where we have a collection of unitary transforms, ρ will evolve via application of the set of k Kraus operators as we derived in the previous section.

In the derivation of this more general quantum channel via a set of Kraus operators, Witten provides a series of intermediate transformations in equations 3.80–3.82, which I won’t duplicate here in interest of leaving the good stuff where it belongs. But an important point is that after the preparations of a density matrix ρ and applications of Kraus operators to this intermediate state, a trace operation is performed to get to ρ’, which, when we then revisit the monotonicity of (quantum) relative entropy, is called out as the reason for this surprising result.

This section is then concluded by a series of exercises to illustrate properties of quantum channels:

- A reminder of the purification operation, allowing us to map any arbitrary density matrix ρ to a pure state |ψ><ψ| via application of an appropriate set of Krauss operators.
- A somewhat analogous case, where it is alternatively possible to map any arbitrary density matrix ρ to a maximally mixed density matrix via application of an appropriate set of Krauss operators.
- A diagonalization operation to map any arbitrary density matrix ρ to ρD by dropping the non-diagonal elements of the matrix, with a demonstration of a physical realization of this operation offered.
- Inferring the composite quantum channel comprised two joint quantum channels. (It would be really impressive if I worked this out here don’t ya think?)
- Deriving Kraus operators to map to arbitrary states of reduced Hilbert space dimensions.
- Deriving Kraus operators to map to arbitrary states of increased Hilbert space dimensions.

**3.8 Thermodynamics and Quantum Channels**

The discussion turns down some new territory in this section, and for the first time relates the quantum informational entropy value, aka a von Neumann entropy Sρ associated with a dense quantum state’s density matrix ρ, with a thermodynamic equivalent of entropy Sσ associated with that system’s thermal density matrix σ — such entropy as would be described by the second law of thermodynamics for instance. Although they are describing different phenomena, they are in the end both statistical measures derived from some distribution of probabilities. Some might even argue that the two measures share a greater equivalency, consider the “it from bit” hypothesis.

Witten here considers first the thermal density matrix σ at some temperature, and then draws of the relative entropy S(ρ||σ) for these two systems, along with the corresponding free energy F(ρ). The key finding is that assuming some quantum evolution of ρ->ρ’ under constant temperature such that σ’=σ, by making use of a statement similar to the monotonicity of relative entropy, the free energy can only be reduced by the application of a quantum channel.

**4 More On Quantum Information Theory**

These closing passages are three topics that Witten found helpful in gaining insight about the meaning of quantum conditional entropy and quantum relative entropy (see section 3.4), with attention paid to the relevant implications of monotonicity of quantum relative entropy (see section 3.5) — and its close cousin, strong subadditivity.

**4.1 Quantum Teleportation and Conditional Entropy**

Alice has in her possession a quantum state A0 which she wants to share with Bob. Due to the no-cloning theorem it is impossible to create an identical copy of an arbitrary unknown quantum state. However, quantum information allows for a teleportation of a state such that Bob can achieve a copy of the state provided that in the process Alice performs a measurement on her copy, causing her to lose her quantum information. (The teleportation term I suspect derives from the old Star Trek thought experiment — even though the Enterprise generates an exact copy of a crew member on the planet surface in the act of teleporting, the original version of that person is disintegrated in the process.) The teleportation is achieved by connecting Alice and Bob with two channels of communication: a quantum channel of entanglement between qubits A1 and B1 and a classical channel. Alice and Bob first generate an entanglement between a pair of qubits (one held by Alice and one by Bob) by preparing a Bell state ΨAB. (More on generating a Bell state here.)

So keeping count, we now have three qubits, two held by Alice (A0 and A1), and one held by Bob (B1). A1 and B1 we’ve prepared in an entangled state (Eq 4.1), and the state of A0 that we want to teleport to Bob is in the unknown state A0 = α|0> + β|1>. Now as is possible for multi-qubit systems, in addition for allowing A1 to interact with B1, we can also at the same time allow A1 to interact with the target state A0 (Eq 4.2), thus facilitating a joint system of three qubits A0, A1, and B1 (Eq 4.3). Notice that the α and β from our A0 distribution are carried through into our joint distribution for A0A1B1 of Eq 4.3. And while we can’t measure the state A0 directly without causing a collapse of our target state, Alice can manipulate the pair A0A1 such that a measurement will collapse the joint distribution A0A1 into one of four possible two-qubit arrangements (Eq 4.2) while preserving the α and β distribution parameters of A0 amongst the joint interactions of A0A1B1 (Eq 4.3). So when we perform the measurement on A0A1, it tells us which of four possible two-qubit arrangements, which can be communicated to Bob in the classical channel allowing him to reconstruct a correction gate set (Eq 4.5 for the example of Eq 4.4) to his remaining distribution of B1 (post measurement on A0A1) to translate from ΨB1 to ΨA0, and thus after performing this manipulation Bob receives a copy of the state A0 “teleported” from Alice.

After first restating this approach in an alternate configuration (incorporating an addition state R maximally entangled with A0), Witten then proceeds to explore the relative entropy of the system components. Combining A0 and A1 into the joint system A and with B0 as B, we look at the conditional entropy S(A|B). Using the auxiliary system R in the derivation, there is an interesting statement made which I will point out because is novel idea to me: “A0 is maximally mixed, since it is maximally entangled with R”. After accounting for S=1 for maximimally mixed states and S=0 for pure states, it is shown that for this case with incorporation of R S(A|B) = 0, which turns out to be the limit allowed for successful teleportation, in and the more general case:

The discussion around justification for this constraint was a little over my head, but the fact of this limit better allows us to understand the parallel between classical and quantum conditional entropy. Using Witten’s words here “Remember that classically S(A|B) measures how many additional bits of information Alice has to send to Bob after he has already received B, so that he will have full knowledge of A”, and the analog for conditional entropy can be demonstrated by the sharing of (half of a pair of) entangled qubits to Bob, where for each of these entangled qubits Bob receives SB is increased by one. Thus if someone tells you they have a quantum conditional entropy S(A|B) = 1 and they want to teleport a state, tell them they’ll need first share at least one more pair of entangled qubits between Alice and Bob, and the more complex of Alice’s state A0 (and the higher of SAB) the more of these pairs they will need to share to offset.

**4.2 Quantum Relative Entropy and Hypothesis Testing**

This section opens by a restatement of some classical relative entropy principles and their quantum equivalents that were originally discussed in sections 2.3 and 3.4, where for sufficient number of observations N, we could estimate the probability that a hypothesis distribution was equivalent to an actual distribution.

Recall the monotonicity of relative entropy (as discussed in section 3.5 and elsewhere) , Witten sites an external paper that presumably gives an upper bound on our ability to infer these probabilities from relative entropy, which result he then goes on to prove. In order to do so, he demonstrates a further equivalence between the quantum and classical cases of relative entropy. It turns out that in the act of making our observations on the quantum case, which are in effect mapping our quantum density matrices via partial trace onto classical probability distributions such that our ρ and σ distributions are replaced with corresponding classical distributions R and S. Using the monotonicity of relative entropy definition (Eq 4.25), and noting that for the resulting probabilistic derivations as we are rising the 2 to the negative power it will switch the direction of the > sign, we end up with this limit that the process of converting our density matrices to corresponding classical probability collections, the act of performing an additional quantum transformation in the translation means that we will have a higher degree of certainty post conversion, albeit one that will wash out as N gets sufficiently large.

Note that once the information is translated into classical space, it’s now possible to make copies, and so N observations can turn into N observations of N’ copies, which comes up later in the discussion although to be honest most of what follows here is over my head.

**4.3 Encoding Classical Information In A Quantum State**

Of the recurring themes in this research, one of the most consistent has been the explorations of boundary conditions. In the fields of information theory — first in classical and then in quantum, what are the limits of our ability to transmit and to infer given the basis of transmission. More so, we have asked how things change now that we have entered the quantum world — where do the assumptions of classical information theory change in this less intuitive environment. In this the closing passage, the author asks these questions one more time.

When Claude Shannon laid the foundations for information theory, he did so by asking questions about information content of a signal, and what can be inferred about transmission density by inspecting of a signal’s contents, through which he derived the fundamental measure of entropy, a unit that was based on similar probabilistic elements to what had previously been used by physicists to analyze thermodynamic systems. Here as we have extended the basis of Shannon’s classical information theory to consider systems incorporating quantum information, we have found many analogues to those limits of classical information, albeit in a few cases some more surprising aspects of quantum information. It is fitting that Witten closes his paper by asking a question for quantum information very similar to what Shannon must have first asked himself for digital transmissions: how many bits of information can Alice send to Bob by sending him a quantum system X with a k-dimensional Hilbert space H? In the interest of leaving the good stuff where it belongs, I won’t walk through Witten’s derivation here. Suffice to say, it is demonstrated that the best Alice can do in the sharing of k orthogonal basis vectors is to share log k bits of information. This is the boundary of quantum information.

- Just to restate the obvious, this essay has been a partial summary and partial response to a recently published paper by the physicist Edward Witten, which pre-print is available on arXiv here:
**A Mini-Introduction To Information Theory [****arXiv:1805.11965****]**. - All of the equations were either sourced from or derived from that work (with a few exceptions for the quantum computing demonstrations).
- The photos from this collection were taken during a recent road trip to Austin, TX — including photos of the collections of Bullock Texas State History Museum, Harry Ransom Center, and the Blanton Museum of Art.

## *For further readings please check out my Table of Contents, Book Recommendations, and Music Recommendations.

Books that were referenced here or otherwise inspired this post:

## A Mind at Play — by Jimmy Soni and Rob Goodman

## The Elegant Universe — Brian Greene

## Fashion, Faith, and Fantasy in the New Physics of the Universe — Roger Penrose

*(As an Amazon Associate I earn from qualifying purchases.)*

Albums that were referenced here or otherwise inspired this post:

## Dylan & The Dead — Bob Dylan and The Grateful Dead

## Graceland — Paul Simon

## The Animal Years — Josh Ritter

## Rhapsody in Blue — George Gershwin

## Blood on the Tracks — Bob Dylan

## Stevie Ray Vaughn — Texas Flood

*(As an Amazon Associate I earn from qualifying purchases.)*

*Hi, I’m an amateur blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share. I can also be reached on **linkedin** for professional inquiries or **twitter** for personal.*

For further readings please check out my Table of Contents, Book Recommendations, and Music Recommendations.