Advantages of versatile neural-network decoding for topological codes
Abstract
Finding the optimal correction of errors in generic stabilizer codes is a computationally hard problem, even for simple noise models. While this task can be simplified for codes with some structure, such as topological stabilizer codes, developing good and efficient decoders still remains a challenge. In our work, we systematically study a very versatile class of decoders based on feedforward neural networks. To demonstrate adaptability, we apply neural decoders to the triangular color and toric codes under various noise models with realistic features, such as spatially-correlated errors. We report that neural decoders provide a significant improvement over leading efficient decoders in terms of the error-correction threshold. Using neural networks simplifies the process of designing well-performing decoders, and does not require prior knowledge of the underlying noise model.
I Introduction
Recent small-scale experiments Barends et al. (2014); Córcoles et al. (2015); Kelly et al. (2015); Nigg et al. (2014) have shown an increasing level of control over quantum systems, constituting an important step towards the demonstration of quantum error correction Kitaev et al. (2002); Nielsen and Chuang (2010). In order to scale up quantum devices and maintain their computational power, one needs to protect logical information from unavoidable errors by encoding it into quantum error-correcting codes Shor (1995). One of the most successful classes of quantum codes, stabilizer codes Gottesman (1996), allows one to detect errors by measuring stabilizer operators without altering the encoded information. Subsequently, errors can be corrected by implementing a recovery operation. A classical algorithm which allows one to find an appropriate correction from the available classical data, i.e., the measurement outcomes of stabilizers for the given code, is called a decoder.
Optimal decoding of generic stabilizer codes is a computationally hard problem, even for simple noise models Iyer and Poulin (2015). If codes have some structure, then the task of decoding becomes more tractable and efficient decoders with good performance may be available. For example, in the case of topological stabilizer codes Kitaev (2003); Bravyi and Kitaev (1998); Bombin and Martin-Delgado (2006); Bombin (2013); Haah (2011), whose stabilizer generators are geometrically local, any stabilizer returning a −1 measurement outcome indicates the presence of errors on some qubits in its neighborhood. By exploiting this pattern, many decoding schemes have been developed, some of which are based on cellular automata Harrington (2004); Hastings (2013); Herold et al. (2015, 2017); Duivenvoorden et al. (2017); Dauphinais and Poulin (2017); Kubica (2017), the Minimum-Weight Perfect Matching algorithm Dennis et al. (2002); Delfosse (2014); Nickerson and Brown (2017), tensor networks Bravyi et al. (2014); Darmawan and Poulin (2018), renormalization group Duclos-Cianci and Poulin (2013a, b); Breuckmann et al. (2017); Bravyi and Haah (2011); Brown et al. (2015) or other approaches Delfosse and Zémor (2017); Delfosse and Nickerson (2017).
Efficient decoders with good performance are often tailor-made for specific codes and are not easily adaptable to other settings. For instance, despite a local unitary equivalence between two families of topological codes Kubica et al. (2015), the color and toric codes, one cannot straightforwardly use toric code decoders in the color code setting; rather, some careful modifications are needed Delfosse (2014); Kubica (2017). Moreover, decoding strategies are typically designed and analyzed for simplistic noise models, which may not describe well the errors present in an experimental setup. Importantly, the best approach to scalable quantum devices is still under debate and the dominant sources of noise are yet to be thoroughly explored. Thus, it would be very desirable to develop decoding methods which do not require full characterization of the quantum hardware and are adaptable to various quantum codes and realistic noise models.
Table 1.

Threshold of the triangular color code
noise \ decoder  | neural | projection | optimal
bit/phase-flip   |        |            | Katzgraber et al. (2009)
depolarizing     |        |            | Bombin et al. (2012)
NN-depolarizing  |        |            | ?

Threshold of the triangular toric code with a twist
noise \ decoder  | neural | MWPM | optimal
bit/phase-flip   |        |      | Dennis et al. (2002)
depolarizing     |        |      | Bombin et al. (2012)
NN-depolarizing  |        |      | ?
The main goal of our work is to systematically explore recently proposed decoding strategies based on artificial neural networks Torlai and Melko (2017); Baireuther et al. (2017); Krastanov and Jiang (2017); Varsamopoulos et al. (2017); Breuckmann and Ni (2017). We consider two-step decoding. In step 1, for any given configuration of unsatisfied stabilizers we deterministically find a Pauli operator which returns the corrupted encoded information to the code space. After this step, all stabilizers are satisfied, but a nontrivial logical operator may have been implemented by the attempted Pauli correction combined with the initial error. In step 2, we use a feedforward neural network to determine what (if any) nontrivial logical operator is likely to be introduced in step 1, so that we can account for it in the recovery. We emphasize that step 2 is a classification problem, particularly well-suited for machine learning.
In our work, we convincingly demonstrate the versatility of neural decoders by applying them to two families of codes, the two-dimensional (2D) triangular color and toric codes, under different noise models with realistic features, such as spatially-correlated errors. We observe that, irrespective of the noise models, neural-network decoding outperforms standard strategies, including the Minimum-Weight Perfect Matching algorithm Dennis et al. (2002) and the projection decoder Delfosse (2014); see Table 1. It is worth emphasizing that only the training datasets, but not the explicit knowledge of the noise models or the geometric structure of the codes, were needed to train neural decoders. We also analyze how the computational costs of training and the number of neural network parameters scale with growing code distance. Our work indicates that due to its adaptability neural-network decoding is a promising error-correction method, which can be used in a wide range of future small-scale quantum devices, especially if the dominant sources of errors are not well characterized.
The organization of the article is as follows. We start by discussing quantum error correction from the perspective of topological codes, the triangular color code and the toric code with a twist. In particular, in Section II.3 we explain how to construct the excitation graph, which leads to an efficient algorithm for step 1 of the neural decoder. In Section II.4 we introduce a new notion of the effective error rate, which allows us to easily compare threshold error rates for different noise models. Then, we describe neural decoding and its performance under different noise models, including the spatially-correlated depolarizing noise. In Section III.2 we explain how training of deep neural networks is accomplished by successively increasing the error rate used to generate the training dataset. This training method may have a significant impact, since it can lead to faster convergence and better final performance of neural networks for quantum error-correction applications. We conclude the article with a discussion of our results and their implications for future neural decoders used in practice.
II Error correction with topological codes
II.1 Topological stabilizer codes
Stabilizer codes Gottesman (1996) are an important class of quantum error-correcting codes Shor (1995) specified by a stabilizer group S. The stabilizer group S is an Abelian subgroup of the Pauli group generated by n-qubit Pauli operators X_i and Z_i, where i = 1, …, n, such that −I ∉ S. The logical information is encoded into the codespace, which is the +1 eigenspace of all the elements of S. Logical Pauli operators are identified with elements of the normalizer of the stabilizer group in the Pauli group. An operator which implements a nontrivial logical Pauli operator can be chosen to be a product of Pauli operators which commute with all the elements in the stabilizer group but do not belong to S. The weight of the minimal-support nontrivial logical Pauli operator determines the distance d of the code.
Physical qubits of the stabilizer code can be affected by noise, which can take encoded logical information outside of the codespace. By measuring stabilizer generators no information about the original encoded state is revealed. Rather, one effectively projects errors present in the system onto some Pauli operators and subsequently gains some knowledge about them. The set of unsatisfied stabilizers returning a −1 measurement outcome is called a syndrome. The syndrome serves as a classical input to a decoding algorithm, which allows one to find a recovery Pauli operator bringing the corrupted encoded state back to the codespace. For a special class of stabilizer codes, the CSS codes Calderbank et al. (1997), whose stabilizer generators are products of either X- or Z-type Pauli operators, one can independently correct X- and Z-type errors using the appropriate Z- and X-type syndrome.
Topological stabilizer codes Kitaev (2003); Bravyi and Kitaev (1998); Bombin and Martin-Delgado (2006); Bombin (2013); Haah (2011) are a family of stabilizer codes exhibiting particularly good resilience to noise. The distinctive feature of topological stabilizer codes is the geometric locality of their generators. Namely, physical qubits can be arranged to form a lattice in such a way that every stabilizer generator is supported on a constant number of qubits within some geometrically local region. At the same time, no logical Pauli operator can be implemented via a unitary acting on physical qubits in any local region. By enlarging the system size, one increases the distance and error-correction capabilities of the topological code without changing the required complexity of local stabilizer measurements. This is in stark contrast with other quantum codes, such as concatenated codes Knill and Laflamme (1996), whose stabilizer weight necessarily increases with the distance and thus makes those constructions experimentally more challenging.
Two well-known examples of topological stabilizer codes are the toric and color codes. The triangular color code is defined on a two-dimensional lattice with a boundary, whose vertices are 3-valent (all the vertices are 3-valent except for three corner vertices on the boundary) and whose faces are 3-colorable; see Fig. 1(a). Qubits are identified with vertices. The color code is a CSS code and its stabilizer group is defined as follows
S = ⟨ X_f, Z_f : for all faces f ⟩,   (1)
where X_f and Z_f are Pauli X and Z operators supported on all qubits belonging to the face f. Accordingly, X- and Z-type errors can be independently corrected using the Z- and X-type syndrome.
The triangular toric code with a twist Yoder and Kim (2017) can be defined for the same arrangement of physical qubits as the triangular color code. Its lattice can be obtained from the color code lattice by keeping all the vertices, adding extra edges and modifying some faces; see Fig. 1(b). The resulting lattice is 4-valent (all the vertices are 4-valent except for three corner vertices on the boundary and one vertex in the bulk, which corresponds to a twist, i.e., the end of the defect line) and the faces are 2-colorable, except for the “mixed” faces along a 1D defect line. The color of the face indicates the type of the stabilizer generator identified with that face. Namely, dark and white faces support X-type and Z-type stabilizers, respectively. Depending on the coloring of a mixed face, its stabilizer is defined to be a mixed product of Pauli X and Z operators. We emphasize that the choice of mixed stabilizer generators along the defect line is needed for all the stabilizers to mutually commute. The full stabilizer group is thus given by
S = ⟨ S_f : for all faces f ⟩,   (2)

where S_f denotes the X-type, Z-type or mixed stabilizer associated with the face f.
We remark that due to the mixed stabilizer generators it is not possible to decode X and Z errors independently.
Logical Pauli operators of the 2D topological stabilizer codes can be thought of as deformable non-contractible 1D string-like operators. In the case of the triangular color and toric codes, logical operators connect certain boundaries, as depicted in Fig. 1.
II.2 Quasiparticle excitations
It is illustrative to establish a connection between quantum error-correcting codes and quantum many-body systems described by commuting Hamiltonians. For a topological stabilizer code with the stabilizer group S we can define a commuting stabilizer Hamiltonian H to be a sum of the stabilizer generators of S with a negative sign. In particular, for the color code and the toric code with a twist we choose their stabilizer Hamiltonians to be
H_CC = − Σ_f (X_f + Z_f),   (3)
H_TC = − Σ_f S_f,   (4)

where the sums run over all the faces f of the respective lattices.
Note that all the terms in the stabilizer Hamiltonian are mutually commuting, thus any eigenstate of H has to be an eigenstate of every single term. Since eigenstates of stabilizer generators can only have ±1 eigenvalues, we conclude that the code space, defined as the +1 eigenspace of all the elements of S, coincides with the ground space of H.
We can think of errors affecting information encoded in the topological stabilizer code as operators creating localized quasiparticle excitations in the related quantum many-body system. Namely, consider any Pauli error which anticommutes with some stabilizer generators. The error moves the encoded logical state outside the code space or, equivalently, the ground state outside the ground space. The resulting state is excited in the sense that its energy is larger than the ground space energy by the amount proportional to the number of violated stabilizer Hamiltonian terms. The unsatisfied stabilizer terms can be identified with quasiparticle excitations Wilczek (1982); Kitaev (2003); Preskill (1999); Bombin and Martin-Delgado (2007). Depending on whether the unsatisfied stabilizer is of X- or Z-type, we call the excitation electric (e) or magnetic (m). (For the mixed stabilizers along the defect line there is an ambiguity in associating the type of the excitation, since the electric and magnetic excitations are exchanged upon crossing the defect line; we thus refer to those excitations without specifying their type.) The subscript indicates the color of the face supporting the excitation. In particular, for the toric code we can only have e and m, whereas the color code excitations can be supported on faces of any color, i.e., e_C and m_C for any color C ∈ {R, G, B}.
In order to understand excitation configurations arising from arbitrary Pauli errors, it suffices to know what excitations geometrically local Pauli operators can create and how to combine them. We now discuss these constraints, also known as fusion rules, for topological stabilizer codes. In the case of the toric code, a single-qubit Pauli X or Z error on a qubit in the bulk of the system violates two Z- or X-type stabilizers on neighboring faces and thus necessarily creates two excitations of the same type, either magnetic or electric; see Fig. 2(b). If two errors with non-overlapping support independently create the same excitation on a face f, then the product of both errors will not create any excitation at that location. For an illustration, let us consider two single-qubit errors X_u and X_v on qubits u and v belonging to the edge (u, v). Each error independently creates a magnetic excitation on the face f containing the edge (u, v); however, the combined error X_u X_v results in no excitation on f. The above discussion can be summarized by the toric code fusion rules
e × e = 1,   m × m = 1,   (5)
which express the fact that in the bulk, excitations of the same type can only be created (by geometrically local operators) or annihilated in pairs. Note that 1 denotes no excitation.
The fusion rules for the color code are slightly more complicated than for the toric code. Namely, we have
e_C × e_C = 1,   m_C × m_C = 1,   (6)
e_R × e_G × e_B = 1,   m_R × m_G × m_B = 1,   (7)
where C ∈ {R, G, B}. Similarly as for the toric code, combining two excitations of the same type and color results in no excitation. However, in the bulk of the color code it is also possible to create (by a local operator) or annihilate a triple of excitations. We can see that by considering a single-qubit Pauli X or Z error. It violates three Z- or X-type stabilizers on neighboring red, green and blue faces and thus creates a triple of magnetic or electric excitations; see Fig. 2(a).
The topological stabilizer codes we consider are defined on lattices with boundaries. By acting with a local Pauli operator on the qubits near the boundary of the system, it is possible to create or annihilate a single magnetic or electric excitation. We emphasize that the type of the boundary determines the type of the allowed excitation Levin (2013). For the triangular toric code, there are two types of boundaries, rough or smooth Bravyi and Kitaev (1998), and a single electric (respectively magnetic) excitation can only be created on the rough (smooth) boundary; see Fig. 2(b). In the case of the triangular color code, there are three types of boundaries, red, green or blue Bombin and Martin-Delgado (2006), and single electric and magnetic excitations of a given color can be created on the boundary of the matching color; see Fig. 2(a).
Once a quasiparticle excitation is created, it can always be moved in the bulk of the 2D topological stabilizer code by applying an appropriate 1D string-like Pauli operator Bombín (2014). Given the fusion rules, the excitation movement can be understood as a process of creating pairs of excitations along some path and fusing them together with the initial one, which results in the excitation changing its position. When the quasiparticle excitation moves, its type does not change, unless it passes through a defect line. A defect line, also known as a transparent domain wall (which can be thought of as an automorphism of the excitation labels preserving the braiding and fusion rules of the quasiparticle excitations) Bombin (2010, 2011); Kitaev and Kong (2012), is a 1D object, along which the stabilizer generators are appropriately modified. In the case of the triangular toric code with a twist, one chooses stabilizers on faces intersected by the defect line to be mixed products of Pauli X and Z operators; see Fig. 1(b). When an electric excitation e crosses the defect line, it becomes a magnetic excitation m, and vice versa. We emphasize that logical Pauli operators for the triangular color and toric codes can be implemented by creating a single excitation on one of the boundaries and transporting it to the other boundary, where it can annihilate; see Fig. 1 for examples of logical operators.
We remark that there are only two possible types of defect lines in the toric code, one of which is trivial. However, in case of the color code, there are 72 different defect lines Yoshida (2015). We encourage readers to explore Kesselring et al. (2018) for an illuminating discussion of all the possible boundaries and defect lines in the 2D color code.
II.3 Decoding of topological codes as a classification problem
As we already discussed, generic errors affect the encoded information by moving it outside the code space, which results in some stabilizers being unsatisfied. A classical algorithm which takes the syndrome as an input and finds an appropriate recovery restoring all stabilizers to the +1 measurement outcome is called a decoder. For stabilizer codes the recovery operator is a Pauli operator. We say that decoding is successful if no nontrivial logical operator has been implemented by the recovery combined with the error.
We can view decoding as a process of removing quasiparticle excitations from the system and returning the state to the ground space of the stabilizer Hamiltonian. To facilitate the discussion, we introduce an excitation graph G, which captures how the excitations can be moved (and eventually removed) within the lattice of the topological stabilizer code. The vertices of the excitation graph correspond to the possible locations of quasiparticle excitations. Note that there is one vertex for every single electric, as well as every magnetic, excitation. We also include in G one special vertex v_∂, called the boundary vertex. Two different vertices u and v are connected by an edge if there is a Pauli operator P_(u,v) with geometrically local support which can move an excitation from u to v without creating any other excitations. We say that u and the boundary vertex v_∂ are connected by an edge if one can locally create a single excitation at u. In the case of the toric and color codes, we restrict our attention to local operators which are supported on, respectively, one or at most two neighboring qubits. We identify the edges (u, v) in G with the local operators P_(u,v). We illustrate how to construct the excitation graph in Fig. 3.
We consider a very simple deterministic procedure, the excitation removal algorithm, which efficiently eliminates quasiparticle excitations from the toric and color codes. Let E be some Pauli error operator which results in the excitation configuration ε = {u_1, …, u_k} in the system. The input of the algorithm is ε, but not E. For every excitation u_i we find the shortest path δ_i in the excitation graph between u_i and the boundary vertex v_∂. We define an operator R_i to be a product of the local Pauli operators identified with the edges along the path δ_i. The operator R_i moves the excitation from u_i to the boundary, where it is annihilated. As the output of the algorithm we choose the operator R = R_1 ⋯ R_k. We remark that the operator R returns the state to the ground space since it removes all the excitations. At the same time, the output R combined with the initial error E likely implements some nontrivial logical operator. Thus, the excitation removal algorithm viewed as a decoder would perform rather poorly.
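The shortest-path step of the excitation removal algorithm can be sketched as a breadth-first search on a toy excitation graph; the five excitation sites and their boundary connections below are hypothetical, chosen only to illustrate the idea.

```python
from collections import deque

BOUNDARY = "boundary"  # the special boundary vertex v_∂

def shortest_path(adj, start, goal):
    """Breadth-first search returning a shortest path as a list of edges."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    edges = []
    v = goal
    while prev[v] is not None:  # walk back from the boundary to the excitation
        edges.append((prev[v], v))
        v = prev[v]
    return list(reversed(edges))

def removal_operator(adj, excitations):
    """Collect the edges (local Pauli operators) moving every excitation to the boundary."""
    operators = []
    for exc in excitations:
        operators.extend(shortest_path(adj, exc, BOUNDARY))
    return operators
```

On a small line of excitation sites with the boundary attached at both ends, an excitation at site 1 is routed to the nearer boundary in two moves.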
Now we explain how to reduce the decoding problem to a classification problem by using the excitation removal algorithm. The task of classification is to assign labels, typically from some small set, to the elements of some high-dimensional dataset. In the decoding problem, we know the positions of the excitations ε and want to find a recovery operator removing all the excitations and implementing the trivial logical operator. We do not know, however, the Pauli operator E resulting in the excitation configuration ε. Using the excitation removal algorithm we easily find the operator R. Clearly, we would be able to successfully decode if we chose R·L as a recovery operator, where L is any operator implementing the same logical operator as E·R. Unfortunately, there are many different error operators creating the same configuration of excitations ε. We can split all those error operators into equivalence classes identified with the different logical operators implemented by E·R. Then, for any given excitation configuration ε we can find the most probable equivalence class of errors creating ε. What we would like to achieve is to label ε by the logical operator L, which is implemented by the output of the excitation removal algorithm and any operator from the most probable class of errors. Such a problem is well-suited for machine learning techniques, in particular for artificial neural networks. We defer further discussion of the classification problem to Section III.1, where we explain it in the context of neural-network decoding.
II.4 Noise models and thresholds
In order to test the versatility of neural decoders, we numerically simulate their performance for various noise models. In particular, we consider the following three Pauli error models specified by just one parameter, the error rate p.

- Bit/phase-flip noise: every qubit is independently affected by an X error with probability p, and independently by a Z error with the same probability p.

- Depolarizing noise: every qubit is independently affected with probability p by an error, which is uniformly chosen from the three errors X, Y and Z.

- NN-depolarizing noise: spatially-correlated depolarizing noise on nearest-neighbor qubits, i.e., every pair of qubits u and v sharing an edge in the lattice is independently affected with probability p by a nontrivial error, which is uniformly chosen from the 15 errors of the form P_u ⊗ Q_v, where P, Q ∈ {I, X, Y, Z} and (P, Q) ≠ (I, I).
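The three noise models can be sampled as follows. This is a minimal sketch in which the qubit indexing, the edge list, and the single-qubit Pauli multiplication (with phases ignored, since only the error support matters for the syndrome) are our own illustrative choices.

```python
import random

PAULIS = ["I", "X", "Y", "Z"]

def multiply_pauli(a, b):
    """Product of two single-qubit Paulis, ignoring the overall phase."""
    if a == "I":
        return b
    if b == "I":
        return a
    if a == b:
        return "I"
    return ({"X", "Y", "Z"} - {a, b}).pop()  # e.g. X * Y = Z up to phase

def bit_phase_flip(n, p, rng):
    """X with probability p and, independently, Z with probability p on every qubit."""
    error = []
    for _ in range(n):
        x, z = rng.random() < p, rng.random() < p
        error.append("Y" if x and z else "X" if x else "Z" if z else "I")
    return error

def depolarizing(n, p, rng):
    """Uniformly chosen X, Y or Z with probability p on every qubit."""
    return [rng.choice("XYZ") if rng.random() < p else "I" for _ in range(n)]

def nn_depolarizing(n, edges, p, rng):
    """Nontrivial two-qubit Pauli with probability p on every lattice edge."""
    pairs = [(a, b) for a in PAULIS for b in PAULIS if (a, b) != ("I", "I")]
    error = ["I"] * n
    for u, v in edges:
        if rng.random() < p:
            a, b = rng.choice(pairs)
            error[u] = multiply_pauli(error[u], a)
            error[v] = multiply_pauli(error[v], b)
    return error
```

Note that under NN-depolarizing noise the errors from different edges compose on shared qubits, which is what produces the spatial correlations.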
We emphasize that one should not necessarily think of the aforementioned noise models as accurately describing errors in an experimental setup. Rather, we choose those models since they are easy to specify and simulate but, at the same time, they also capture realistic noise features, such as spatial correlations of errors, which any good decoder should be able to handle Nickerson and Brown (2017). In addition, in the currently proposed circuit-based models for syndrome measurement Fowler et al. (2012), correlated errors across neighboring qubits would naturally arise.
We would like to easily compare the bit/phase-flip, depolarizing and NN-depolarizing noise models. However, the error rate p has a different meaning depending on the considered model. This motivates us to introduce a new figure of merit for Pauli error models, the effective error rate p_eff. For any physical qubit we define the effective error rate to be the probability of any nontrivial error affecting that qubit. Note that in the scenarios we consider the effective error rate is the same for all the qubits (except for the ones identified with the corner vertices and the twist for the NN-depolarizing noise). Thus, we can unambiguously talk about the effective error rate without specifying which qubit we are referring to. For the depolarizing noise we simply have p_eff = p, whereas for the bit/phase-flip noise we find p_eff = 1 − (1 − p)² = 2p − p². In the case of the NN-depolarizing noise, the effective error rate depends on the local structure of the lattice. Namely, if n denotes the number of nearest neighbors of some qubit, then the effective error rate p_eff(n) for that qubit can be recursively calculated as
p_eff(n) = p_eff(n − 1) + (4p/5)(1 − p_eff(n − 1)),   (8)
p_eff(n) = 4np/5 + O(p²),   (9)

where we use p_eff(0) = 0 and denote by O(p²) the second-order corrections in p. The factor 4/5 arises because 12 of the 15 nontrivial two-qubit errors on an edge act nontrivially on the given qubit. In particular, for the analyzed color and toric code lattices we respectively have n = 3 and n = 4, so that p_eff ≈ 12p/5 and 16p/5 to leading order.
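The recursion (8) can be checked numerically against its closed form; the per-edge factor 4/5 used below follows from counting the 12 of 15 nontrivial two-qubit Paulis that act nontrivially on a given qubit.

```python
def p_eff_bit_phase_flip(p):
    """Probability that a qubit suffers any nontrivial error: 1 - (1-p)^2."""
    return 1 - (1 - p) ** 2

def p_eff_nn_depolarizing(p, n):
    """Effective rate for a qubit with n nearest-neighbor edges, via recursion (8).
    Each edge error acts nontrivially on the qubit with probability 4p/5."""
    p_eff = 0.0
    for _ in range(n):
        p_eff += (4 * p / 5) * (1 - p_eff)
    return p_eff
```

Unrolling the recursion gives the closed form p_eff(n) = 1 − (1 − 4p/5)^n, whose first-order expansion is 4np/5, matching Eq. (9).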
In order to assess the performance of a decoder for a given family of codes with growing code distance d and a specified noise model, we use the quantity called the error-correction threshold. The error-correction threshold is defined as the largest value p_th such that for all effective error rates p_eff < p_th the probability of unsuccessful decoding for the code with distance d goes to zero in the limit of infinite code distance, d → ∞. Note that in the definition of the threshold we assume perfect stabilizer measurements. We remark that one typically estimates the threshold by plotting the decoder failure probability as a function of the effective error rate for different code distances and identifying their crossing point; see Figs. 5 and 6.
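Estimating a threshold from the crossing of failure-probability curves, as described above, can be sketched by linearly interpolating two sampled curves; the data points in the usage check are synthetic, not simulation results.

```python
def crossing_point(p_values, fail_small, fail_large):
    """Locate the crossing of the failure-probability curves sampled at p_values
    for a smaller and a larger code distance, by linear interpolation."""
    for i in range(len(p_values) - 1):
        d0 = fail_small[i] - fail_large[i]
        d1 = fail_small[i + 1] - fail_large[i + 1]
        if d0 == 0.0:
            return p_values[i]
        if d0 * d1 < 0.0:  # sign change: the curves cross inside this interval
            t = d0 / (d0 - d1)
            return p_values[i] + t * (p_values[i + 1] - p_values[i])
    return None  # no crossing in the sampled range
```

Below the crossing the larger code fails less often, above it more often, so the sign of the difference between the two curves flips exactly at the estimated threshold.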
III Performance of neural-network decoding
III.1 Neural decoders
We have already seen in Section II.3 that the task of successful decoding can be deterministically reduced to the following problem: for any configuration of excitations ε created by some unknown Pauli operator E, assign a label from the set of logical operators, such that the label is the logical operator implemented by E·R, where R is the output of the excitation removal algorithm with ε as the input. We approach this classification problem by using one of the leading machine learning techniques, feedforward neural networks. For each code of distance d, we train a neural network consisting of several layers; see Fig. 4. The input layer encodes the configuration of excitations ε. Then, there are hidden layers, each containing a certain number of nodes. Nodes from layer i are fully connected with nodes from the preceding layer i − 1. Every node ν in layer i evaluates an activation function g on the output x of the nodes from layer i − 1, namely g(w_ν · x + b_ν), where w_ν and b_ν are the weights and biases associated with the node ν. We choose the rectified linear unit activation function g(x) = max(0, x). The output layer uses the softmax classifier, which converts an output vector into a discrete probability distribution describing the likelihood of different logical operators being implemented by E·R.
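A minimal forward pass through such a network (fully connected layers, ReLU activations on the hidden layers, softmax on the output) might look as follows in plain Python; the tiny identity-weight network in the check is purely illustrative.

```python
import math

def relu(x):
    return [max(0.0, v) for v in x]

def dense(x, weights, biases):
    """Fully connected layer: out_j = sum_i weights[j][i] * x[i] + biases[j]."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def forward(excitations, layers):
    """layers: list of (weights, biases) pairs; ReLU on hidden layers,
    softmax classifier on the output layer."""
    x = excitations
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))
```

The output is a probability vector over the candidate logical operators, so decoding amounts to picking its largest entry.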
We are now ready to describe neural-network decoding for topological stabilizer codes. The neural decoder is an algorithm which returns a recovery operator for any configuration of excitations ε created by some unknown operator E. We emphasize that error operators are chosen according to some a priori unknown noise model. The neural decoders we consider consist of the following two steps. In step 1, we use a simple deterministic procedure, the excitation removal algorithm, to find a Pauli operator R which removes the quasiparticle excitations by moving them to the boundaries of the system, where they disappear. In step 2, we use a neural network to guess what are the most likely errors resulting in ε and which logical operator L is subsequently implemented by E·R. As the output, the operator R·L is returned. We emphasize that the neural decoder always returns a valid recovery operator, but decoding succeeds if and only if the neural network correctly identifies the logical operator implemented by E·R. Moreover, determining the output of the trained neural network is efficient since it reduces to matrix multiplication. We see that in step 1 we implicitly make use of the excitation graph, which contains information about the topological code lattice and the fusion rules. However, no information about the topological code is required to train the neural network, which is used in step 2.
We emphasize that the details of step 1 in the neural decoder do not matter as long as the returned operator R is found in an efficient deterministic way. We choose the excitation removal algorithm because it is simple and has an intuitive explanation: it removes all the excitations by moving them to the boundaries of the system. We point out that we could use a similar version of the neural decoder for other topological codes (or even codes without geometric structure), as long as we knew how to efficiently find the operator R. For instance, if we considered the toric or color codes on a torus, with or without boundaries, then we could always find a simple removal procedure which deterministically moves all excitations of the same color to the same location in the bulk or on the boundary, where they are guaranteed to disappear. Such a procedure can then be used to create the training dataset for the neural network. We remark that step 1 becomes more challenging for codes without string-like operators, such as the cubic code Haah (2011).
III.2 Training deep neural networks
Before a neural network can be used for decoding, it needs to be trained. We do this via supervised learning, where the network is trained on a dataset of preclassified samples. Sample Pauli errors E are generated using Monte Carlo sampling according to the appropriate probability distribution determined by the noise model. For each generated error configuration E, we determine the corresponding syndrome, i.e., the excitation configuration ε, which is the input to the neural network. Then, using the excitation removal algorithm, we find the Pauli operator R, and check what logical operator is implemented by E·R. This allows us to label each input excitation configuration ε with the corresponding classification label we want the neural network to output. We remark that the testing samples used to numerically estimate thresholds are created in the same way as the training samples.
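The sample-generation pipeline above (error, then syndrome, then deterministic correction, then logical-class label) can be illustrated on the smallest nontrivial example, a 3-qubit repetition code against bit flips; this toy code and its lookup-table "removal" step are our own stand-ins for step 1 of the decoder.

```python
# Canonical correction for each syndrome of the 3-qubit repetition code;
# syndrome bits are (e0 xor e1, e1 xor e2) for a flip pattern (e0, e1, e2).
CORRECTION = {(0, 0): (), (1, 0): (0,), (0, 1): (2,), (1, 1): (1,)}

def syndrome(error):
    """Parity checks Z1 Z2 and Z2 Z3 on the flip pattern."""
    return (error[0] ^ error[1], error[1] ^ error[2])

def label(error):
    """1 if the canonical correction combined with the error flips all qubits,
    i.e. implements the logical operator; 0 if it acts trivially."""
    combined = list(error)
    for q in CORRECTION[syndrome(error)]:
        combined[q] ^= 1
    assert combined in ([0, 0, 0], [1, 1, 1])  # back in the code space
    return combined[0]

def make_sample(error):
    """One labeled training sample: (network input, classification label)."""
    return syndrome(error), label(error)
```

The network never sees the error itself, only the syndrome, exactly as in the decoding setup described above; two different errors with the same syndrome can carry different labels, which is what the classifier must learn to resolve statistically.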
Training the neural network can now be framed as a minimization problem. The network parameters, i.e., the weights and biases, are optimized to minimize classification error on the training dataset. We use the categorical cross entropy cost function to quantify the error, namely
C = − Σ_i y_i log p_i,   (10)

where y is the classification bitstring (a one-hot vector) for the input ε, p is the likelihood vector returned by the neural network, and Σ_i p_i = 1. Importantly, this cost function is differentiable, which allows us to use backpropagation to efficiently compute the gradient of the cost function with respect to the network parameters in a single backwards pass of the network. The minimization is performed using Adam optimization Kingma and Ba (2014), a highly effective variant of gradient descent, whose learning parameters do not need to be fine-tuned for good performance. In practice, we find that Adam optimization converges significantly faster than standard gradient descent, with the effects becoming more pronounced for larger networks.
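The categorical cross entropy of Eq. (10) is a one-liner; the eps guard against log(0) is a standard implementation detail we add, not part of the definition.

```python
import math

def cross_entropy(y, p, eps=1e-12):
    """C = -sum_i y_i * log(p_i) for a one-hot label y and a likelihood vector p."""
    return -sum(yi * math.log(pi + eps) for yi, pi in zip(y, p))
```

The cost is smaller the more probability mass the network places on the correct class, and vanishes as that probability approaches one.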
Instead of computing the cost function on the entire training set, which becomes computationally expensive for very large datasets, we use minibatch optimization. This is a standard technique which estimates the cost function on individual batches, i.e., small subsets of the training dataset; see, e.g., Hinton et al. (2012). We define a training step as one round of backpropagation and a subsequent network-parameter update, using the cost function in Eq. (10) estimated on a single batch. The batch size controls the accuracy of this estimate and needs to be manually adjusted.
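A minimal minibatch training loop looks like the following; here plain gradient descent on a toy logistic-regression "network" stands in for the deep decoder, and the data, batch size, and learning rate are illustrative:

```python
import numpy as np

# Toy minibatch optimization: each training step estimates the gradient
# of the cross entropy on a single randomly drawn batch.
rng = np.random.default_rng(1)
X = rng.normal(size=(2048, 8))
w_true = rng.normal(size=8)
y = (X @ w_true > 0).astype(float)   # linearly separable labels

w = np.zeros(8)
batch_size, learning_rate = 64, 0.5
for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # draw one batch
    xb, yb = X[idx], y[idx]
    p = 1.0 / (1.0 + np.exp(-xb @ w))               # model output
    grad = xb.T @ (p - yb) / batch_size             # cross-entropy gradient on the batch
    w -= learning_rate * grad                       # one training step

acc = np.mean((X @ w > 0) == (y > 0.5))
```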
Until recently, training deep neural networks was next to impossible. However, innovations by the machine-learning community have made it feasible to train extremely deep networks. We, too, were unable to successfully train networks with more than three hidden layers until we implemented two of these improvements: He initialization and batch normalization. He initialization He et al. (2015) ensures that learning is efficient for the rectified linear unit activation function, whereas batch normalization Ioffe and Szegedy (2015) stabilizes the input distribution for each layer. Batch normalization makes it possible to train deeper networks, and it also improves performance on shallower three-layer networks.
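Both ingredients are short to state. Below is a sketch with illustrative layer sizes, using training-mode batch statistics and omitting the learned scale and shift parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    """He initialization: weights drawn from N(0, 2/fan_in), matched to ReLU."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch, then rescale and shift."""
    return gamma * (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps) + beta

W = he_init(512, 256)  # illustrative layer shape
# ReLU activations of a random layer, then normalized per feature.
h = batch_norm(np.maximum(0.0, rng.normal(size=(128, 64)) @ he_init(64, 64)))
```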
The training set is generated according to the noise model at some chosen error rates. Once the neural network is trained, it should be able to successfully label syndromes for error configurations generated at various error rates below the threshold. In particular, any fine-tuning of the network for specific error rates is not desired. Since the error syndromes for higher error rates are in general more challenging to classify, it would be desirable to train the neural network mainly on configurations corresponding to error rates close to the threshold. However, when training the networks for higher-distance codes and correlated noise models, the optimization algorithm is very likely to get stuck in local minima if we start training on the high error-rate dataset directly. This problem manifests itself in the network not effectively learning the noise features, with the resulting performance showing only small improvements over random guessing. The solution we propose is to first pretrain the network on a lower error-rate dataset, and only then use the training data corresponding to the near-threshold error rate; the change of error rate does not have to be gradual. We believe that this is an important observation for any future implementations of neural networks for decoding quantum error-correcting codes. We also speculate that a similar strategy might help to speed up training of neural networks for experimental systems. Namely, we imagine pretraining the neural network for some simple theoretical error models at low error rates, and then using the experimental data for further training.
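The proposed schedule can be sketched as a simple curriculum over error rates; the rates and stage length below are illustrative, not the values used in our training:

```python
def curriculum(error_rates, steps_per_stage):
    """Yield the error rate used to generate each training batch:
    start at a low rate, then move to the near-threshold rate."""
    for p in error_rates:
        for _ in range(steps_per_stage):
            yield p

# Pretrain at low error rates before switching to the near-threshold rate.
schedule = list(curriculum([0.04, 0.08, 0.14], steps_per_stage=1000))
```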
training cost for the triangular color code

noise \ parameters    d     hidden layers    training cost
bit/phase-flip        5     3                100
                      7     5                200
                      9     7                400
                      11    9                800
depolarizing          5     3                200
                      7     5                600
                      9     7                1400
NN-depolarizing       5     3                200
                      7     5                400
                      9     7                800
                      11    9                1600

training cost for the triangular toric code with a twist

noise \ parameters    d     hidden layers    training cost
bit/phase-flip        5     3                100
                      7     5                200
                      9     7                400
                      11    9                800
depolarizing          5     3                200
                      7     5                600
                      9     7                1200
NN-depolarizing       5     3                200
                      7     5                400
                      9     7                800
                      11    9                1600
III.3 Selecting neural-network hyperparameters
In addition to the network parameters, there are also hyperparameters which cannot be trained via backpropagation. These include the number of hidden layers, the number of nodes per hidden layer, the batch size, and the total number of training steps. We optimize these hyperparameters using a grid-search-based approach; see Table 2 for the optimal values we find. A heuristic rule for determining the size of a well-performing neural network for a code of distance d is to use d - 2 hidden layers and a number of nodes per layer that grows exponentially with d. Whether or not this exponential trend continues for larger code sizes is an open question.
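A grid search of this kind can be sketched as follows; the grid values are illustrative, and `validation_accuracy` is a placeholder for training one network per configuration and scoring it on held-out syndromes:

```python
import itertools

# Illustrative hyperparameter grid (not the full grid used in our study).
grid = {
    "hidden_layers": [3, 5, 7],
    "nodes_per_layer": [128, 256, 512],
    "batch_size": [64, 128],
}

def validation_accuracy(cfg):
    # Placeholder score peaking at (5 layers, 256 nodes); a real run would
    # train and evaluate the decoder network for configuration `cfg` here.
    return -abs(cfg["hidden_layers"] - 5) - abs(cfg["nodes_per_layer"] - 256) / 256

# Enumerate every configuration and keep the best-scoring one.
configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
best = max(configs, key=validation_accuracy)
```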
We notice that very large training sets are needed for optimal performance. In order to save on computational memory, we choose to generate training samples in parallel with training, since this can be done efficiently. Note that with this strategy the number of different samples seen during training equals the batch size multiplied by the total number of training steps. We observe that the training time appears to scale exponentially with code distance, approximately doubling as the distance increases by two.
We find evidence that there is some minimal batch size below which the gradient estimates are too noisy for the network to converge to a solution that outperforms random guessing. However, increasing the batch size beyond that minimal value does not improve the final network performance. Rather, it reduces the number of training steps needed for convergence, but with diminishing returns. The batch size we choose is primarily optimized to minimize the training time.
III.4 Thresholds of neural decoders
In order to assess the versatility of neural-network decoding, we quantitatively study its performance for the toric and color codes under three different noise models: bit/phase-flip, depolarizing, and NN-depolarizing. First, we train a neural network for every code and code distance listed in Table 2, which also presents the optimized hyperparameters of the considered networks. Then, we numerically find the failure probability of the neural decoder as a function of the effective error rate. By plotting the decoder failure probability for different code distances and finding their intersection, we numerically establish the existence of a nonzero threshold for the neural decoder and estimate its value; see Figs. 5 and 6.
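The intersection-finding step can be sketched numerically; the sigmoidal curves below are synthetic stand-ins for Monte Carlo data, and the threshold value is illustrative:

```python
import numpy as np

def crossing(p, f_small, f_large):
    """Interpolate the error rate where two failure-probability curves cross."""
    d = f_small - f_large
    i = np.flatnonzero(np.sign(d[:-1]) != np.sign(d[1:]))[0]  # first sign change
    t = d[i] / (d[i] - d[i + 1])                              # linear interpolation
    return p[i] + t * (p[i + 1] - p[i])

p = np.linspace(0.05, 0.25, 101)
p_th = 0.151  # illustrative threshold, not a measured value
# Larger-distance codes fail less below threshold and more above it,
# so their (steeper) curve crosses the small-distance curve at p_th.
f_d5 = 0.5 / (1 + np.exp(-30 * (p - p_th)))  # shallow curve, small distance
f_d9 = 0.5 / (1 + np.exp(-80 * (p - p_th)))  # steep curve, large distance
estimate = crossing(p, f_d5, f_d9)
```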
We benchmark the performance of the neural decoder against the leading efficient decoders for the toric and color codes, namely the decoder based on the Minimum-Weight Perfect Matching algorithm and the projection decoder, respectively. In our implementation, we use the Blossom V algorithm provided by Kolmogorov Kolmogorov (2009).
We report that the neural decoder for the color code significantly outperforms the projection decoder for all considered noise models, even for the simplest bit/phase-flip noise model. The neural-decoder threshold values we find approach the upper bounds set by the maximum-likelihood decoder. The neural decoder for the toric code shows performance comparable to that of the Minimum-Weight Perfect Matching decoder for the bit/phase-flip noise, but offers noticeable improvements for correlated noise models. We remark that optimal decoding thresholds for topological codes can be found via statistical-mechanical mappings; see Dennis et al. (2002); Katzgraber et al. (2009); Bombin et al. (2012); Kubica et al. (2017). The threshold values we find are expressed in terms of the effective error rate and are listed in Table 1.
As with all learning models, it is important to address the possibility of overfitting. We know that the test samples are (with high probability) different from the training samples, since they are randomly chosen from a set that scales exponentially with the number of physical qubits. We remark that the required training-set size seems to scale exponentially with the code distance; however, it constitutes a vanishing fraction of all possible syndrome configurations. Moreover, the classification accuracy on the test samples is the same as the final training accuracy. Thus, we can conclude that the neural network learns to correctly label syndromes typical of the studied noise models, resulting in well-performing neural decoders.
IV Discussion
We have conclusively demonstrated that neural-network decoding for topological stabilizer codes is very versatile and clearly outperforms leading efficient decoders. We focused on the triangular color code and the toric code with a twist, whose physical qubits are arranged in the same way but whose stabilizer groups are different. We studied the performance of neural-network decoding for different noise models, including the spatially-correlated depolarizing noise. In particular, we numerically established the existence of a nonzero threshold and found significant improvements of the color-code threshold over the previously reported values; see Table 1 and Figs. 5 and 6. This result indicates that the relatively low threshold of the color code, which was considered to be one of its main drawbacks, can be easily increased, making quantum computation with the color code more appealing than initially perceived Wang et al. (2010); Fowler (2011); Landahl and Ryan-Anderson (2014).
We emphasize that the neural network does not explicitly use any information about the topological code or the noise model. The neural network is trained on very simple data usually available from experiment, which includes the measured syndrome and whether the simple deterministic decoding, i.e., the excitation removal algorithm, succeeds. Importantly, this raw data can be used not only to train the neural network, but also to characterize the quantum device Combes et al. (2014). Without assuming any simplistic noise model, the neural network efficiently detects the actual error patterns in the system and subsequently “learns” the correlations between observed errors. This provides a heuristic explanation of why neural decoding is currently the best strategy for decoding the color code, since the correlations between errors in the color code are difficult to account for in standard approaches Delfosse and Tillich (2014). Using neural networks simplifies and speeds up the process of designing good decoders, which is otherwise rather challenging due to its heavy dependence on the choice of the quantum error-correcting code as well as the noise model.
Our results show that neural-network decoding can be successfully used for quantum error-correction protocols, especially in systems affected by a priori unknown noise with correlated errors. We stress that neural-network decoding already provides an enormous data-compression advantage over methods based on (partial) lookup tables, even for small-distance quantum codes. However, the important question of scalability has to be addressed if neural decoders are ever going to be used for practical purposes on future fault-tolerant universal quantum devices. One possible approach to scalable neural networks is to reduce the connectivity between the layers by exploiting the information about the topological code lattice and the geometric locality of stabilizer generators. We imagine incorporating convolutional neural networks as well as some renormalization ideas into future scalable neural decoders. Also, a fully-fledged neural decoder should account for the possibility of faulty stabilizer measurements Chamberland and Ronagh (2018). We do not perceive any fundamental reasons why neural-network decoding, possibly based on recurrent neural networks, would not work for the circuit-level noise model. However, in that setting the training dataset as well as the size of the required neural network grow substantially, making the training process computationally very challenging.
Acknowledgements.
We would like to thank Ben Brown, Jenia Mozgunov and John Preskill for valuable discussions, as well as Evert van Nieuwenburg for his feedback on this manuscript. During the preparation of the manuscript two related preprints were made available Davaasuren et al. (2018); Jia et al. (2018); however, their scope and emphasis are different from our work. NM acknowledges funding provided by the Caltech SURF program. AK acknowledges funding provided by the Simons Foundation through the “It from Qubit” Collaboration. Research at Perimeter Institute is supported by the Government of Canada through Industry Canada and by the Province of Ontario through the Ministry of Research and Innovation. TJ acknowledges the support from the Walter Burke Institute for Theoretical Physics in the form of the Sherman Fairchild Fellowship. The authors acknowledge the support from the Institute for Quantum Information and Matter (IQIM).

References
 Barends et al. (2014) R. Barends, J. Kelly, A. Megrant, A. Veitia, D. Sank, E. Jeffrey, T. C. White, J. Mutus, A. G. Fowler, B. Campbell, Y. Chen, Z. Chen, B. Chiaro, A. Dunsworth, C. Neill, P. O’Malley, P. Roushan, A. Vainsencher, J. Wenner, A. N. Korotkov, A. N. Cleland, and J. M. Martinis, Nature 508, 500 (2014).
 Córcoles et al. (2015) A. D. Córcoles, E. Magesan, S. J. Srinivasan, A. W. Cross, M. Steffen, J. M. Gambetta, and J. M. Chow, Nature communications 6, 6979 (2015).
 Kelly et al. (2015) J. Kelly, R. Barends, A. G. Fowler, A. Megrant, E. Jeffrey, T. C. White, D. Sank, J. Y. Mutus, B. Campbell, Y. Chen, Z. Chen, B. Chiaro, A. Dunsworth, I. C. Hoi, C. Neill, P. J. J. O’Malley, C. Quintana, P. Roushan, A. Vainsencher, J. Wenner, A. N. Cleland, and J. M. Martinis, Nature 519, 66 (2015).
 Nigg et al. (2014) D. Nigg, M. Mueller, E. A. Martinez, P. Schindler, M. Hennrich, T. Monz, M. A. MartinDelgado, and R. Blatt, Science 345, 302 (2014).
 Kitaev et al. (2002) A. Kitaev, A. Shen, and M. Vyalyi, Classical and Quantum Computation (American Mathematical Society, 2002) p. 257.
 Nielsen and Chuang (2010) M. Nielsen and I. Chuang, Quantum Computation and Quantum Information, 10th ed. (Cambridge University Press, 2010) p. 702.
 Shor (1995) P. W. Shor, Physical Review A 52, R2493 (1995), arXiv:0506097 [arXiv:quantph] .
 Gottesman (1996) D. Gottesman, Physical Review A 54, 1862 (1996), arXiv:9604038 [quantph] .
 Iyer and Poulin (2015) P. Iyer and D. Poulin, IEEE Transactions on Information Theory 61, 5209 (2015), arXiv:1310.3235 .
 Kitaev (2003) A. Y. Kitaev, Annals Phys. 303, 2 (2003), arXiv:9707021 [quantph] .
 Bravyi and Kitaev (1998) S. B. Bravyi and A. Y. Kitaev, , 6 (1998), arXiv:9811052 [quantph] .
 Bombin and MartinDelgado (2006) H. Bombin and M. A. MartinDelgado, Physical Review Letters 97, 180501 (2006), arXiv:0605138 [quantph] .
 Bombin (2013) H. Bombin, in Topological Codes, edited by D. A. Lidar and T. A. Brun (Cambridge University Press, 2013) arXiv:1311.0277 .
 Haah (2011) J. Haah, Physical Review A 83, 042330 (2011).
 Harrington (2004) J. Harrington, Analysis of quantum errorcorrecting codes: symplectic lattice codes and toric codes, Ph.D. thesis, California Institute of Technology (2004).
 Hastings (2013) M. B. Hastings, , 1 (2013), arXiv:1312.2546 .
 Herold et al. (2015) M. Herold, E. T. Campbell, J. Eisert, and M. J. Kastoryano, npj Quantum Information 1 (2015), 10.1038/npjqi.2015.10, arXiv:1406.2338 .
 Herold et al. (2017) M. Herold, M. J. Kastoryano, E. T. Campbell, and J. Eisert, New Journal of Physics 19 (2017), 10.1088/13672630/aa7099, arXiv:1511.05579 .
 Duivenvoorden et al. (2017) K. Duivenvoorden, N. P. Breuckmann, and B. M. Terhal, (2017), arXiv:1708.09286 .
 Dauphinais and Poulin (2017) G. Dauphinais and D. Poulin, Commun. Math. Phys. 355, 519 (2017), arXiv:1607.02159 .
 Kubica (2017) A. Kubica, The ABCs of the color code: A study of topological quantum codes as toy models for faulttolerant quantum computation and quantum phases of matter, Ph.D. thesis, California Institute of Technology (2017).
 Dennis et al. (2002) E. Dennis, A. Kitaev, A. Landahl, and J. Preskill, Journal of Mathematical Physics 43, 4452 (2002), arXiv:0110143 [quantph] .
 Delfosse (2014) N. Delfosse, Physical Review A 89, 012317 (2014), arXiv:1308.6207 .
 Nickerson and Brown (2017) N. H. Nickerson and B. J. Brown, (2017), arXiv:1712.00502 .
 Bravyi et al. (2014) S. Bravyi, M. Suchara, and A. Vargo, Physical Review A  Atomic, Molecular, and Optical Physics 90 (2014), 10.1103/PhysRevA.90.032326, arXiv:1405.4883 .
 Darmawan and Poulin (2018) A. S. Darmawan and D. Poulin, (2018), arXiv:1801.01879 .
 DuclosCianci and Poulin (2013a) G. DuclosCianci and D. Poulin, Physical Review A 87, 062338 (2013a), arXiv:1302.3638 .
 DuclosCianci and Poulin (2013b) G. DuclosCianci and D. Poulin, , 11 (2013b), arXiv:1304.6100 .
 Breuckmann et al. (2017) N. P. Breuckmann, K. Duivenvoorden, D. Michels, and B. M. Terhal, Quantum Information and Computation 17, 0181 (2017), arXiv:1609.00510 .
 Bravyi and Haah (2011) S. Bravyi and J. Haah, Physical Review Letters 107, 150504 (2011), arXiv:1105.4159 .
 Brown et al. (2015) B. J. Brown, N. H. Nickerson, and D. E. Browne, Nature Communications 7, 4 (2015), arXiv:1503.08217 .
 Delfosse and Zémor (2017) N. Delfosse and G. Zémor, (2017), arXiv:1703.01517 .
 Delfosse and Nickerson (2017) N. Delfosse and N. H. Nickerson, (2017), arXiv:1709.06218 .
 Kubica et al. (2015) A. Kubica, B. Yoshida, and F. Pastawski, New Journal of Physics 17, 083026 (2015).
 Katzgraber et al. (2009) H. G. Katzgraber, H. Bombin, and M. A. MartinDelgado, Physical Review Letters 103, 090501 (2009).
 Bombin et al. (2012) H. Bombin, R. S. Andrist, M. Ohzeki, H. G. Katzgraber, and M. A. MartinDelgado, Physical Review X 2, 021004 (2012).
 Torlai and Melko (2017) G. Torlai and R. G. Melko, Physical Review Letters 119 (2017), 10.1103/PhysRevLett.119.030501, arXiv:1610.04238 .
 Baireuther et al. (2017) P. Baireuther, T. E. O’Brien, B. Tarasinski, and C. W. J. Beenakker, (2017), 10.22331/q2018012948, arXiv:1705.07855 .
 Krastanov and Jiang (2017) S. Krastanov and L. Jiang, Scientific Reports 7 (2017), 10.1038/s41598017112661, arXiv:1705.09334 .
 Varsamopoulos et al. (2017) S. Varsamopoulos, B. Criger, and K. Bertels, (2017), arXiv:1705.00857 .
 Breuckmann and Ni (2017) N. P. Breuckmann and X. Ni, (2017), arXiv:1710.09489 .
 Calderbank et al. (1997) A. Calderbank, E. Rains, P. Shor, and N. Sloane, Physical Review Letters 78, 405 (1997), arXiv:9605005 [quantph] .
 Knill and Laflamme (1996) E. Knill and R. Laflamme, arXiv: quantph/9608012 (1996).
 Yoder and Kim (2017) T. J. Yoder and I. H. Kim, Quantum 1, 2 (2017), arXiv:1612.04795 .
 Wilczek (1982) F. Wilczek, Physical Review Letters 49, 957 (1982).
 Preskill (1999) J. Preskill, Lecture notes for Physics 219: Quantum computation (1999).
 Bombin and MartinDelgado (2007) H. Bombin and M. MartinDelgado, Physical Review B 75, 075103 (2007), arXiv:0607736 [condmat] .
 Levin (2013) M. Levin, Physical Review X 3 (2013), 10.1103/PhysRevX.3.021009, arXiv:1301.7355 .
 Bombín (2014) H. Bombín, Communications in Mathematical Physics 327, 387 (2014), arXiv:1107.2707 .
 Bombin (2010) H. Bombin, Physical Review Letters 105 (2010), 10.1103/PhysRevLett.105.030403, arXiv:1004.1838 .
 Bombin (2011) H. Bombin, New Journal of Physics 13 (2011), 10.1088/13672630/13/4/043005, arXiv:1006.5260 .
 Kitaev and Kong (2012) A. Kitaev and L. Kong, Communications in Mathematical Physics 313, 351 (2012), arXiv:1104.5047 .
 Yoshida (2015) B. Yoshida, Physical Review B  Condensed Matter and Materials Physics 91 (2015), 10.1103/PhysRevB.91.245131, arXiv:1503.07208 .
 Kesselring et al. (2018) M. Kesselring, B. Brown, F. Pastawski, and E. J., in preparation (2018).
 Fowler et al. (2012) A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland, Physical Review A 86, 032324 (2012).
 Kingma and Ba (2014) D. P. Kingma and J. Ba, arXiv preprint arXiv:1412.6980 (2014).
 Hinton et al. (2012) G. Hinton, N. Srivastava, and K. Swersky, “Neural networks for machine learninglecture 6aoverview of minibatch gradient descent,” (2012).
 He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun, in Proceedings of the IEEE international conference on computer vision (2015) pp. 1026–1034.
 Ioffe and Szegedy (2015) S. Ioffe and C. Szegedy, in International conference on machine learning (2015) pp. 448–456.
 Kolmogorov (2009) V. Kolmogorov, Mathematical Programming Computation 1, 43 (2009).
 Kubica et al. (2017) A. Kubica, M. E. Beverland, F. Brandao, J. Preskill, and K. M. Svore, (2017), arXiv:1708.07131 .
 Wang et al. (2010) D. S. Wang, A. G. Fowler, C. D. Hill, and L. C. L. Hollenberg, Quantum Information and Computation 10, 780 (2010), arXiv:0907.1708 .
 Fowler (2011) A. G. Fowler, Physical Review A  Atomic, Molecular, and Optical Physics 83 (2011), 10.1103/PhysRevA.83.042310, arXiv:0806.4827 .
 Landahl and RyanAnderson (2014) A. J. Landahl and C. RyanAnderson, , 13 (2014), arXiv:1407.5103 .
 Combes et al. (2014) J. Combes, C. Ferrie, C. Cesare, M. Tiersch, G. J. Milburn, H. J. Briegel, and C. M. Caves, , 16 (2014), arXiv:1405.5656 .
 Delfosse and Tillich (2014) N. Delfosse and J. P. Tillich, in IEEE International Symposium on Information Theory  Proceedings (2014) pp. 1071–1075, arXiv:1401.6975 .
 Chamberland and Ronagh (2018) C. Chamberland and P. Ronagh, arXiv preprint arXiv:1802.06441 (2018).
 Davaasuren et al. (2018) A. Davaasuren, Y. Suzuki, K. Fujii, and M. Koashi, (2018), arXiv:1801.04377 .
 Jia et al. (2018) Z.A. Jia, Y.H. Zhang, Y.C. Wu, L. Kong, G.C. Guo, and G.P. Guo, (2018), arXiv:1802.03738 .