# Adversarial Representation Learning for Domain Adaptation

###### Abstract

Domain adaptation aims to generalize a high-performance learner to a target domain by utilizing knowledge distilled from a source domain with a different but related data distribution. One family of domain adaptation solutions learns feature representations that are invariant to the change of domains yet discriminative for predicting the target. Recently, generative adversarial nets (GANs) have been widely studied to learn a generator that approximates the true data distribution by trying to fool an adversarial discriminator in a minimax game. Inspired by GANs, we propose a novel Adversarial Representation learning approach for Domain Adaptation (ARDA) to learn high-level feature representations that are both domain-invariant and target-discriminative for the cross-domain classification problem. Specifically, the approach exploits the differentiability of the Wasserstein distance as a measure of distribution divergence by incorporating Wasserstein GAN. Our architecture consists of three parts: a feature generator that produces the desired features from inputs of the two domains, a critic that estimates the Wasserstein distance based on the generated features, and an adaptive classifier that accomplishes the final classification task. Empirical studies on 4 common domain adaptation datasets demonstrate that our proposed ARDA outperforms state-of-the-art domain-invariant feature learning approaches.


## 1 Introduction

Domain adaptation deals with machine learning problems in which no labeled data are available for a target domain, and one hopes to transfer knowledge from a model trained on sufficient labeled data from a source domain that shares the same feature space but has a different marginal distribution ben2007analysis ; pan2010survey . Domain adaptation has become increasingly important, since the widely used deep learning methods usually require large-scale labeled data to fit deep neural networks lecun2015deep , while labeled training data are expensive or difficult to obtain in many real-world scenarios.

To effectively transfer a supervised classifier across different domains, various approaches have been proposed, including instance reweighting mansour2009domain , subsampling chen2011automatic and feature representation learning pan2011domain , among which the feature-based methods have recently shown great promise. The main goal of feature-based methods is to learn a feature mapping that projects the data from different domains into a common latent space where the feature representations are domain-invariant yet still discriminative for predicting the target. Recently, deep neural networks, as powerful tools for automatically learning effective data representations, have been leveraged to learn knowledge-transferable latent representations for domain adaptation glorot2011domain ; chen2012marginalized ; zhuang2015supervised ; long2015learning ; ganin2016domain .

On the other hand, generative adversarial nets (GANs) goodfellow2014generative have been widely studied in recent years for generative modeling and unsupervised learning. The basic idea is to play a minimax game between two adversarial networks: the discriminator is trained to distinguish real data instances from fake ones generated by the generator, while the generator learns to generate high-quality instances to fool the discriminator. It has been proved that the GAN minimax game has an equilibrium at which the distribution of generated data equals that of real data, which inspires us to employ this minimax game for domain adaptation, since GANs naturally provide a framework for making the source and target data indistinguishable. We note that the domain adversarial neural network (DANN) ganin2016domain is a related method for domain adaptation in which the first several layers of the network learn domain-shared features to confuse a domain classifier in an adversarial fashion. However, when the learned features are too poor to mix the distributions of source and target data, it is very easy for the domain classifier network to distinguish them perfectly. In such a case, there is a gradient vanishing problem for the domain classifier with traditional probability-based losses, such as the cross entropy or the Jensen-Shannon divergence arjovsky2017wasserstein . A more reasonable solution is to replace the domain classifier with an estimator of the Wasserstein distance ruschendorf1985wasserstein , which is shown to provide more stable gradients even when the two distributions are distant.

In this paper, we propose a domain adaptation approach, named Adversarial Representation learning for Domain Adaptation (ARDA), to learn transferable representations and adaptive classifiers by incorporating the recently proposed Wasserstein GAN framework arjovsky2017wasserstein . ARDA consists of three parts: a feature generator that produces domain-invariant yet target-discriminative feature representations from inputs of both domains; a critic that estimates the Wasserstein distance, an effective and differentiable divergence metric between the representation distributions of the two domains; and an adaptive classifier that accomplishes the classification task. By jointly training the domain adaptation system with data from both domains, the feature generator and classifier can be effectively adapted to the target domain. Empirical results on 4 common domain adaptation datasets demonstrate that ARDA outperforms state-of-the-art feature learning approaches for domain adaptation. Furthermore, visualization of the representations from the two domains clearly shows that ARDA successfully unifies the two distributions while maintaining clear label discrimination.

## 2 Related Work

Domain adaptation is a popular subject in transfer learning pan2010survey . It concerns covariate shift between two data distributions, usually of labeled source data and unlabeled target data. Solutions to domain adaptation problems can mainly be categorized into two types. One is instance-based, which reweights or subsamples the source samples to match the distribution of the target domain, so that training on the reweighted source samples yields classifiers with transferability huang2007correcting ; chen2011co ; chu2013selective . The other popular and effective type is feature-based, which maps different domains into a common latent space where the feature distributions are close. Among feature-based methods, minimizing the maximum mean discrepancy (MMD) gretton2012kernel is an effective way to reduce the divergence of two distributions. MMD is a nonparametric metric that measures the distribution discrepancy between the kernel mean embeddings of two distributions in a reproducing kernel Hilbert space (RKHS). The MMD metric is widely adopted in representation learning and helps learn features invariant to the domain shift in recent works tzeng2014deep ; long2015learning ; long2013transfer ; pan2011domain .

Recently, deep learning has been regarded as a powerful approach to learning feature representations for domain adaptation. glorot2011domain proposed the stacked denoising autoencoder (SDA) to learn robust feature representations and applied it to cross-domain sentiment analysis. chen2012marginalized extended SDA and proposed the marginalized SDA (mSDA), in which the random corruptions are marginalized out, yielding a closed-form solution. However, a recent study yosinski2014transferable revealed that hidden-layer outputs as feature representations vary from general to specific along the network, resulting in significant drops in feature transferability in deeper layers as domain discrepancy increases. Inspired by this finding, long2015learning proposed the deep adaptation network (DAN), which enhances feature transferability in the task-specific layers by minimizing a multi-kernel MMD to reduce the domain discrepancy.

Motivated by the theory of domain adaptation ben2007analysis ; ben2010theory , which suggests that a good representation for cross-domain transfer contains no discriminative information about the origin (i.e., domain) of the input, the domain adversarial neural network (DANN) ajakan2014domain ; ganin2016domain was proposed to learn domain-invariant features by adding a domain classifier. Although DANN appears close to ARDA, there are distinct differences. To back-propagate the gradient computed from the domain classifier, DANN employs a gradient reverse layer (GRL). However, DANN can fail to provide the desired gradient when the domain classifier is poor, which degrades the transferability of the learned representations. In contrast, ARDA handles the minimax objective more reasonably by training the critic to optimality, which guarantees the transferability of the learned feature representations. Besides, owing to the superior differentiability of the Wasserstein distance over the Jensen-Shannon divergence for measuring distribution divergence, ARDA minimizes the divergence between the source and target data more effectively and thus achieves better performance.

Besides learning shared representations between the two domains, bousmalis2016domain proposed the domain separation network (DSN) to explicitly separate representations private to each domain from those shared between source and target domains. The private representations are learned by defining a difference loss via a soft orthogonality constraint between the shared and private representations, while the shared representations are learned by DANN or MMD as mentioned above. With the help of reconstruction through the private and shared representations together, a classifier trained on the shared representations generalizes better across domains. Since our work focuses on learning the shared representations, it can also be easily integrated into DSN.

## 3 Model

### 3.1 Notations

In unsupervised domain adaptation, we have a labeled source dataset $X^s = \{(x_i^s, y_i^s)\}_{i=1}^{n^s}$ of samples from the source domain, which is assumed to be sufficient to train an accurate classifier, and an unlabeled target dataset $X^t = \{x_j^t\}_{j=1}^{n^t}$ of samples from the target domain. It is assumed that the two domains share the same feature space but follow different marginal data distributions, $\mathbb{P}_{x^s}$ and $\mathbb{P}_{x^t}$ respectively. The goal is to learn a transferable classifier to minimize the target risk using all the given data.

### 3.2 Wasserstein Metric

The Wasserstein metric is a distance measure between probability distributions on a given metric space $(M, \rho)$, where $\rho(x, y)$ is a distance function for two instances $x$ and $y$ in the set $M$. The Wasserstein distance between two Borel probability measures $\mathbb{P}$ and $\mathbb{Q}$ is defined as

$$W_1(\mathbb{P}, \mathbb{Q}) = \inf_{\gamma \in \Pi(\mathbb{P}, \mathbb{Q})} \mathbb{E}_{(x, y) \sim \gamma}\left[\rho(x, y)\right], \tag{1}$$

where $\mathbb{P}$ and $\mathbb{Q}$ are two probability measures on $M$ with finite first moment, and $\Pi(\mathbb{P}, \mathbb{Q})$ is the set of all joint measures on $M \times M$ with marginals $\mathbb{P}$ and $\mathbb{Q}$. The Kantorovich-Rubinstein theorem shows that when $M$ is separable, the dual representation of the first Wasserstein distance (Earth-Mover distance) can be written as a form of integral probability metric:

$$W_1(\mathbb{P}, \mathbb{Q}) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim \mathbb{P}}[f(x)] - \mathbb{E}_{x \sim \mathbb{Q}}[f(x)], \tag{2}$$

where the Lipschitz semi-norm is defined as $\|f\|_L = \sup_{x \ne y} |f(x) - f(y)| / \rho(x, y)$. In this paper, for simplicity, Wasserstein distance refers to the first Wasserstein distance.

As presented in arjovsky2017wasserstein , the Wasserstein distance induces a weaker topology, which means it is easier for distributions to converge under it and thus easier to define a continuous mapping. Besides, the Wasserstein distance between distributions supported on low-dimensional manifolds is continuous everywhere and differentiable almost everywhere under mild assumptions. Owing to these properties, it is promising to adopt the Wasserstein distance to measure the divergence between the data distributions of the source and target domains.
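To make Eq. (2) concrete, the first Wasserstein distance between two equal-sized one-dimensional empirical distributions has a simple closed form: sorting both samples and matching them in order realizes the optimal coupling. The following is a minimal illustrative sketch, not part of the ARDA method itself:

```python
import numpy as np

def w1_empirical_1d(a, b):
    # For equal-sized 1-D samples, the optimal transport plan matches
    # the sorted samples in order, so W1 is the mean absolute difference.
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

a = np.array([0.0, 1.0, 2.0])
print(w1_empirical_1d(a, a + 3.0))  # a pure shift by 3 gives W1 = 3.0
print(w1_empirical_1d(a, a))        # identical samples give W1 = 0.0
```

Note that a pure translation by an offset $c$ yields $W_1 = c$: the distance keeps growing with the offset even when the supports are disjoint, unlike probability-based divergences that saturate.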

In domain adaptation, to tackle the difference between the source and target data distributions, feature-based methods learn a transformation function that maps the data from the original space to a latent space where the distribution divergence can be reduced under some distance measure. Two situations may arise after the mapping. i) The two mapped feature distributions may have supports that lie on low-dimensional manifolds narayanan2010sample in the latent space. In this situation, non-continuous distance measures suffer from a gradient vanishing problem, while the Wasserstein distance still provides a reliable gradient. ii) It is also possible that the features fill the whole space, since the feature mapping usually reduces dimensionality. However, if a data point lies in a region where the probability under one distribution is negligible compared with the other, it contributes nothing to the gradient under the traditional cross-entropy loss (see supplementary material), whereas the Wasserstein distance still provides a stable gradient. Our proposed ARDA approach directly draws on the power of the Wasserstein distance and the training strategy of Wasserstein GAN arjovsky2017wasserstein .

### 3.3 Adversarial Representation Learning

The challenge of unsupervised domain adaptation mainly lies in the fact that two domains have different distributions. To solve this problem, we propose a new approach to learn feature representations invariant to the change of domains by minimizing Wasserstein distance between the source and target distributions through adversarial training.

In our adversarial representation learning approach, a feature generator is supposed to learn domain-invariant feature representations from inputs across domains. Given an instance $x$ from either domain, the feature generator, implemented by a neural network, learns a function $f_g$ that maps the instance to a $d$-dimensional representation $h = f_g(x)$ with parameter $\theta_g$. In order to minimize the distance between the two domain distributions, we introduce a critic whose goal is to estimate the Wasserstein distance between the source and target representation distributions. Given a feature representation $h$, the critic learns a function $f_w$ that maps the representation to a real number with parameter $\theta_w$. The Wasserstein distance between the two representation distributions $\mathbb{P}_{h^s}$ and $\mathbb{P}_{h^t}$, where $h^s = f_g(x^s)$ and $h^t = f_g(x^t)$, can then be computed according to Eq. (2). In practice, it is estimated from the samples via the empirical critic loss

$$\mathcal{L}_{wd} = \frac{1}{n^s} \sum_{x^s \in X^s} f_w\big(f_g(x^s)\big) - \frac{1}{n^t} \sum_{x^t \in X^t} f_w\big(f_g(x^t)\big), \tag{3}$$

where $X^s$ and $X^t$ denote the source and target samples, of sizes $n^s$ and $n^t$.

If the parameterized family of functions $\{f_w\}$ are all 1-Lipschitz, then we can approximate the Wasserstein distance by solving the problem

$$W_1\big(\mathbb{P}_{h^s}, \mathbb{P}_{h^t}\big) \approx \max_{\theta_w} \mathcal{L}_{wd}. \tag{4}$$
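The empirical objective in Eq. (3) is simply the critic's mean output on source representations minus its mean output on target representations. A minimal sketch with hypothetical one-dimensional representations and the identity as a (trivially 1-Lipschitz) critic:

```python
import numpy as np

def critic_objective(f_w, h_s, h_t):
    # Empirical L_wd: mean critic output on source representations
    # minus mean critic output on target representations.
    return np.mean([f_w(h) for h in h_s]) - np.mean([f_w(h) for h in h_t])

# Hypothetical 1-D representations; f_w(h) = h is 1-Lipschitz.
h_s = np.array([[1.0], [2.0], [3.0]])
h_t = np.array([[0.0], [1.0], [2.0]])
print(critic_objective(lambda h: float(h[0]), h_s, h_t))  # 1.0
```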

This raises the question of how to enforce the Lipschitz constraint. arjovsky2017wasserstein proposed to clip the weights of the critic within a compact space, while gulrajani2017improved pointed out that weight clipping causes capacity underuse and gradient vanishing or exploding problems. As suggested in gulrajani2017improved , a more reasonable way is to add a gradient penalty to the critic loss, where the points at which the gradient is penalized are not only the source and target features but also random points along the straight lines between source and target feature pairs. Since the Wasserstein distance is continuous and differentiable almost everywhere, we can first train the critic to optimality. Then, by minimizing the estimated Wasserstein distance while fixing the optimal critic parameter, the feature generator can learn domain-invariant feature representations. The goal of learning domain-invariant representations is thus achieved by solving the minimax problem

$$\min_{\theta_g} \max_{\theta_w} \mathcal{L}_{wd}. \tag{5}$$

However, this learning process is unsupervised, which may result in domain-invariant representations that are not discriminative enough to accomplish the final classification task. Hence it is necessary to incorporate the supervision signal of the source domain data into the representation learning process. We introduce a classifier $f_c$ that computes the softmax prediction $f_c(h) \in \mathbb{R}^{l}$ with parameter $\theta_c$, where $l$ is the number of classes. The classifier loss is defined as the cross entropy between the predicted probability distribution and the one-hot encoding of the class label given the labeled source data:

$$\mathcal{L}_c = -\frac{1}{n^s} \sum_{i=1}^{n^s} \sum_{k=1}^{l} \mathbf{1}\big(y_i^s = k\big) \cdot \log f_c\big(f_g(x_i^s)\big)_k, \tag{6}$$

where $\mathbf{1}(\cdot)$ is the indicator function and $f_c(\cdot)_k$ corresponds to the $k$-th dimension value of the predicted distribution. By combining the classifier loss, we obtain our final objective function

$$\min_{\theta_g, \theta_c} \left\{ \mathcal{L}_c + \lambda \max_{\theta_w} \mathcal{L}_{wd} \right\}, \tag{7}$$

where $\lambda$ is the coefficient that controls the trade-off between discriminative and transferable feature learning. Note that ARDA can be trained by standard back-propagation in two alternating steps. As suggested previously, we first train the critic network to optimality by gradient ascent on the inner max operator. After that, the representation is learned to be domain-invariant and target-discriminative, since the parameter $\theta_g$ receives gradients from both the critic loss and the classifier loss.
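The alternating scheme can be illustrated on toy data. The sketch below is not the paper's implementation: it fixes hypothetical Gaussian "features" (no generator or classifier update) and uses a linear critic kept 1-Lipschitz by projecting its weight vector onto the unit ball, in place of weight clipping or a gradient penalty. For such a linear critic, the maximized objective converges to the distance between the feature means, a lower bound on the true Wasserstein distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source/target representations in R^2, separated by a mean shift.
h_s = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(512, 2))
h_t = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(512, 2))

# Linear critic f_w(h) = v . h, constrained to ||v|| <= 1 so f_w is 1-Lipschitz.
v = np.zeros(2)

def critic_loss(v, h_s, h_t):
    # Empirical L_wd: mean critic output on source minus mean on target.
    return v @ (h_s.mean(axis=0) - h_t.mean(axis=0))

for _ in range(100):
    # Gradient ascent on L_wd; for a linear critic the gradient w.r.t. v
    # is simply the difference of the empirical feature means.
    grad = h_s.mean(axis=0) - h_t.mean(axis=0)
    v = v + 0.1 * grad
    norm = np.linalg.norm(v)
    if norm > 1.0:  # project back onto the unit ball
        v = v / norm

w1_estimate = critic_loss(v, h_s, h_t)
# For a linear critic the maximum equals the Euclidean distance between
# the feature means, a lower bound on the true Wasserstein distance.
```

In the full method, a generator step would follow each such critic phase, descending this same objective with the critic fixed while also descending the classifier loss.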

## 4 Experiments

### 4.1 Datasets

Amazon review benchmark dataset. The Amazon review dataset (https://www.cs.jhu.edu/~mdredze/datasets/sentiment/) blitzer2007biographies has been one of the most widely used benchmark datasets for domain adaptation and sentiment analysis. It contains product reviews taken from Amazon.com of four types (domains), namely books (B), DVDs (D), electronics (E) and kitchen appliances (K). For each domain, there are 2,000 labeled reviews and approximately 4,000 unlabeled reviews (varying slightly across domains), and the classes are balanced. In our experiments, for ease of computation, we follow chen2012marginalized ; jiang2016 to use the 5,000 most frequent unigram and bigram terms as features, and in total 12 adaptation tasks are constructed.

Email spam filtering dataset. The email spam filtering dataset (http://www.ecmlpkdd2006.org/challenge.html) released by the ECML/PKDD 2006 discovery challenge contains 4 separate user inboxes. From the public inbox (source domain), 4,000 labeled training samples were collected, half of which are spam emails and the other half non-spam. The test samples were collected from 3 private inboxes (target domains), each of which consists of 2,500 samples. In our experiments, 3 cross-domain tasks are constructed, from the public inbox to each of the private inboxes. We choose the 5,067 most frequent terms as features; 4 test samples were deleted because they did not contain any of these terms.

Newsgroup classification dataset. The 20 newsgroups dataset (http://qwone.com/~jason/20Newsgroups/) is a collection of 18,774 newsgroup documents across 6 top categories and 20 subcategories in a hierarchical structure. In our experiments, we adopt a setting similar to duan2012domain . The task is to classify top categories, and the four largest top categories (comp, rec, sci, talk) are chosen for evaluation. Specifically, for each top category, the largest subcategory is selected as the source domain while the second largest subcategory is chosen as the target domain. Moreover, the largest category comp is considered as the positive class and one of the three other categories as the negative class. Table 1 provides the details of all the settings.

**Table 1:** Settings of the 20 newsgroups domain adaptation tasks.

Setting | Source Domain | Target Domain
---|---|---
Comp vs. Rec | comp.windows.x & rec.sport.hockey | comp.sys.ibm.pc.hardware & rec.motorcycles
Comp vs. Sci | comp.windows.x & sci.crypt | comp.sys.ibm.pc.hardware & sci.med
Comp vs. Talk | comp.windows.x & talk.politics.mideast | comp.sys.ibm.pc.hardware & talk.politics.guns

Office-Caltech object recognition dataset. The Office-Caltech dataset (https://cs.stanford.edu/~jhoffman/domainadapt/) released by gong2012geodesic comprises the 10 common categories shared by the Office-31 and Caltech-256 datasets. In our experiments, we construct 12 tasks across 4 domains: Amazon (A), Webcam (W), DSLR (D) and Caltech (C), with 958, 295, 157 and 1,123 image samples respectively. For all images, SURF features are extracted and quantized into an 800-bin histogram with codebooks trained from a subset of Amazon images.

### 4.2 Implementation Details

We implement all our experiments using TensorFlow abadi2016tensorflow . As a baseline, we train a standard multi-layer perceptron (MLP) using the labeled source data and test it on the target data directly ("source only"), serving as an empirical lower bound. The depth of the baseline network depends on the specific task; in our experiments the network architectures are sufficient to handle all the problems.

We mainly compare our proposed approach with the domain adversarial neural network (DANN) ajakan2014domain ; ganin2016domain and the maximum mean discrepancy metric (MMD) gretton2012kernel , since these techniques and our proposed ARDA all aim to learn domain-invariant feature representations, which is crucial to feature-based domain adaptation methods. To our knowledge, recent feature-based methods bousmalis2016domain ; tzeng2014deep ; long2015learning ; long2016deep generally adopt such techniques as a component to minimize distribution divergence. We therefore do not compare directly with methods such as the deep adaptation network (DAN) long2015learning and the domain separation network (DSN) bousmalis2016domain , since they focus on effective domain adaptation architectures while ARDA aims to learn transferable features. ARDA can be incorporated into these architectures in place of MMD or DANN and would likely improve their performance if its advantage holds.

**Table 2:** Performance (accuracy, %) on the Amazon review dataset.

Task | Source only | MMD | DANN | ARDA- | ARDA | mSDA | ℓ2,1-SRA | ARDAm+
---|---|---|---|---|---|---|---|---
B → D | 81.09 | 82.57 | 82.07 | 82.18 | 83.05 | 83.57 | 84.09 | 84.97
B → E | 75.23 | 80.95 | 78.98 | 79.14 | 83.28 | 77.09 | 76.79 | 85.12
B → K | 77.78 | 83.55 | 82.76 | 83.25 | 85.45 | 84.92 | 85.27 | 87.92
D → B | 76.46 | 79.93 | 79.35 | 79.84 | 80.72 | 83.76 | 83.88 | 84.92
D → E | 76.24 | 82.59 | 81.64 | 81.15 | 83.58 | 85.06 | 83.59 | 85.72
D → K | 79.68 | 84.15 | 83.41 | 84.54 | 86.24 | 87.49 | 87.68 | 88.60
E → B | 73.37 | 75.72 | 75.95 | 77.04 | 77.22 | 80.10 | 80.55 | 82.62
E → D | 73.79 | 77.69 | 77.58 | 78.42 | 78.28 | 78.81 | 80.36 | 82.71
E → K | 86.64 | 87.37 | 86.63 | 86.83 | 88.16 | 88.19 | 88.85 | 89.72
K → B | 72.12 | 75.83 | 75.81 | 76.82 | 77.16 | 79.14 | 79.02 | 79.84
K → D | 75.79 | 78.05 | 78.53 | 79.98 | 79.89 | 78.52 | 79.88 | 83.77
K → E | 85.92 | 86.27 | 86.11 | 86.23 | 86.29 | 87.45 | 87.31 | 88.22
AVG | 77.84 | 81.22 | 80.74 | 81.28 | 82.43 | 82.84 | 83.11 | 85.34

The MMD metric measures the difference between two probability distributions from their samples by computing the distance between their mean embeddings after mapping the distributions into an RKHS. Therefore, the kernel function used in MMD is crucial to the final performance. To maximize the effectiveness of MMD, we use a combination of 19 RBF kernels with standard deviation parameters covering a wide range. As for the DANN implementation, we add a gradient reverse layer (GRL) followed by a domain classifier as described in ganin2016domain . For both approaches, the corresponding loss is added to the main classifier loss with a trade-off coefficient between the two losses.
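For reference, the quantity the MMD baseline minimizes can be estimated as below. This is a generic (biased) squared-MMD estimator with a sum of RBF kernels; the bandwidth set here is a small hypothetical one for illustration, not the 19 kernels used in our experiments:

```python
import numpy as np

def mmd2_rbf(xs, xt, sigmas=(0.5, 1.0, 2.0)):
    # Biased estimate of squared MMD under a sum of RBF kernels.
    def k(a, b):
        # Pairwise squared Euclidean distances between rows of a and b.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return sum(np.exp(-d2 / (2 * s ** 2)) for s in sigmas)
    return k(xs, xs).mean() + k(xt, xt).mean() - 2 * k(xs, xt).mean()

xs = np.array([[0.0], [1.0]])
print(mmd2_rbf(xs, xs))  # 0.0 for identical samples
```

Summing kernels with several bandwidths makes the estimate sensitive to discrepancies at multiple scales, which is why a range of standard deviations is used in practice.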

Before introducing our ARDA implementation, we first add a comparison denoted ARDA-, which optimizes the same objective as DANN but with a different training strategy that does not include the GRL. ARDA- can be viewed as incorporating the standard GAN into ARDA rather than Wasserstein GAN. For every mini-batch containing source labeled data and target unlabeled data, ARDA- first trains the domain classifier for some rounds (typically 2) and then trains the feature generator with the gradients from the domain classifier and the label predictor. By separating the training of the domain classifier, it may be easier to find an appropriate domain classifier through flexible hyper-parameter tuning.

Our approach is easy to implement by adding a critic, which is also a standard MLP, to the baseline network. For every mini-batch, we first train the critic for around 5 rounds, chosen for low time consumption while still training the critic close to optimality. After each critic round, we either clip the critic weights or add a gradient penalty gulrajani2017improved to satisfy the Lipschitz constraint. Both approaches are tested and we find that the latter achieves better performance on most tasks; for brevity, we therefore omit the results of the weight clipping approach in the following comparisons. Specifically, we penalize the gradient not only at source and target points but also at random points along the straight lines between source and target pairs, and the coefficient controlling the trade-off between the penalty term and the critic loss is set to 10. The hyperparameter $\lambda$ in Eq. (7) is tuned to its optimal value per dataset and task. We use the Adam optimizer in almost all our experiments.

**Table 3:** Performance (accuracy, %) on the email spam filtering dataset.

Task | Source only | MMD | DANN | ARDA- | ARDA
---|---|---|---|---|---
Public | 69.63 | 80.95 | 83.27 | 84.55 | 85.67
Public | 76.01 | 85.98 | 85.74 | 88.75 | 88.26
Public | 81.24 | 94.08 | 91.92 | 93.20 | 95.76
AVG | 75.63 | 87.00 | 86.98 | 88.83 | 89.90

### 4.3 Results and Discussion

Amazon review benchmark dataset. The challenge of cross-domain sentiment analysis lies in the distribution shift, since different words are used in different domains. Table 2 shows the detailed comparison of the basic methods on the 12 transfer tasks. Our proposed ARDA(-) approaches outperform all other compared approaches on most domain adaptation tasks, verifying that ARDA can learn high-quality domain-invariant representations to enable domain adaptation. ARDA further outperforms ARDA- on 10 out of 12 tasks, indicating the superiority of the Wasserstein distance over the traditional log-loss as the training objective. To show that ARDA can work with other domain adaptation algorithms, we further apply ARDA to the representations generated by marginalized stacked denoising autoencoders (mSDA) chen2012marginalized . The result of the combination of mSDA and ARDA (denoted ARDAm+) is shown in the last columns of Table 2. ℓ2,1-SRA jiang2016 is the state-of-the-art model on the Amazon review dataset in the same setting. ARDA is clearly effective when combined with mSDA and outperforms ℓ2,1-SRA on all tasks.

Email spam filtering dataset. Experimenting on the 3 tasks that transfer from the public inbox to the private inboxes, we find that our method achieves better performance than MMD and DANN, as demonstrated in Table 3. All three methods reach the goal of learning transferable features, since they all clearly outperform the source-only baseline. Among them, MMD and DANN achieve almost the same performance, while ARDA further boosts the performance, demonstrating the superiority of our method.

Newsgroup classification dataset. The distribution shift across newsgroups is caused by category-specific words. Note the construction of our domain adaptation tasks: the goal is to classify top categories, while the adaptation happens between subcategories. Since top categories differ more than subcategories, classification is not very sensitive to the subcategory shift, which eases domain adaptation. Table 4 reports the performance on the 20 newsgroups dataset, where the compared methods are almost neck and neck, consistent with the observation above. Even so, ARDA still achieves slightly better performance in this setting.

**Table 4:** Performance (accuracy, %) on the 20 newsgroups dataset.

Task | Source only | MMD | DANN | ARDA- | ARDA
---|---|---|---|---|---
Comp vs. Rec | 81.62 | 97.85 | 98.10 | 98.22 | 98.35
Comp vs. Sci | 74.01 | 87.52 | 90.57 | 90.57 | 91.33
Comp vs. Talk | 94.44 | 96.96 | 97.75 | 97.22 | 97.62
AVG | 83.36 | 94.11 | 95.47 | 95.34 | 95.77

**Table 5:** Performance (accuracy, %) on the Office-Caltech dataset.

Method | A→C | A→D | A→W | W→A | W→D | W→C | D→W | D→A | D→C | C→W | C→A | C→D | AVG
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Source only | 43.19 | 35.03 | 35.25 | 30.06 | 80.25 | 30.19 | 69.50 | 31.21 | 30.37 | 36.95 | 52.92 | 45.86 | 43.40
MMD | 44.08 | 41.40 | 37.29 | 34.13 | 84.71 | 30.72 | 73.56 | 32.46 | 30.72 | 40.34 | 54.80 | 47.13 | 45.95
DANN | 44.97 | 41.40 | 38.64 | 34.13 | 82.80 | 32.68 | 74.24 | 31.63 | 32.24 | 43.39 | 54.91 | 47.77 | 46.57
ARDA- | 45.33 | 41.40 | 39.32 | 34.66 | 84.08 | 32.68 | 75.59 | 32.05 | 32.06 | 41.69 | 53.97 | 47.13 | 46.66
ARDA | 45.86 | 44.59 | 40.68 | 32.15 | 81.53 | 31.08 | 76.95 | 35.60 | 32.59 | 42.37 | 55.22 | 48.41 | 47.25

Office-Caltech object recognition dataset. Table 5 shows the results of our experiments on the Office-Caltech dataset. Our approach achieves better performance than the compared approaches on most tasks. The Office-Caltech dataset is small, with only hundreds of images per domain, and poses a 10-class classification problem. We thus conclude that our ARDA approach can also deal with small-scale datasets effectively.

### 4.4 Empirical Analysis

Proxy A-distance (PAD). The A-distance kifer2004detecting is a measure of divergence between two probability distributions that can be viewed as a relaxation of the total variation distance. ben2007analysis showed that the A-distance between the source and target distributions is a crucial part of an upper generalization bound for domain adaptation. In practice, computing the exact A-distance is intractable and one computes a proxy instead, defined as $\hat{d}_{\mathcal{A}} = 2(1 - 2\epsilon)$, where $\epsilon$ is the generalization error of a linear SVM classifier trained to discriminate between the two domains. Figure 1 shows the PAD computed on ARDA representations for two benchmark datasets: the ARDA representations have a lower PAD than the raw data, so according to the theory of ben2007analysis , ARDA is guaranteed a lower generalization bound for domain adaptation.
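The proxy itself is a one-line computation once the domain classifier's error is available (the linear SVM that produces the error is assumed to be trained separately):

```python
def proxy_a_distance(domain_clf_error):
    # PAD = 2 * (1 - 2 * err), where err is the test error of a
    # classifier trained to discriminate source from target samples.
    return 2.0 * (1.0 - 2.0 * domain_clf_error)

# A chance-level domain classifier (err = 0.5) means the two
# representation distributions are indistinguishable:
print(proxy_a_distance(0.5))  # 0.0
print(proxy_a_distance(0.0))  # 2.0 for perfectly separable domains
```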

Feature visualization. To demonstrate the transferability of the ARDA-learned features, we follow donahue2014decaf ; long2016deep and plot in Figure 2 the t-SNE embeddings van2013barnes of the feature distributions on the D → E domain adaptation task of the Amazon review dataset. The comparison shows that the ARDA-learned features are more discriminative and transferable, since the classes between the source and target domains are aligned much better.

## 5 Conclusion

This paper presents a novel approach to learning domain-invariant features. The approach is motivated by generative adversarial nets (GANs), which match the generated and real distributions via adversarial learning. By incorporating Wasserstein GAN, we can reduce the distance between source and target distributions efficiently, and thus high-quality transferable features can be learned. Our proposed approach can be further combined with existing domain adaptation algorithms bousmalis2016domain ; tzeng2014deep ; long2015learning ; long2016deep ; zhuang2015supervised to attain better transferability. Empirical results on 4 common domain adaptation datasets demonstrate that ARDA outperforms state-of-the-art domain-invariant feature learning approaches. From the feature visualization, one can easily observe that ARDA yields domain-invariant yet target-discriminative feature representations. For future work, we will delve deeper into the ARDA architecture design for better representation learning. We also plan to investigate semi-supervised domain adaptation problems.

## References

- [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
- [2] Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, and Mario Marchand. Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446, 2014.
- [3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
- [4] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010.
- [5] Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. Analysis of representations for domain adaptation. Advances in neural information processing systems, 19:137, 2007.
- [6] John Blitzer, Mark Dredze, Fernando Pereira, et al. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, volume 7, pages 440–447, 2007.
- [7] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.
- [8] Minmin Chen, Yixin Chen, and Kilian Q Weinberger. Automatic feature decomposition for single view co-training. In ICML-11, pages 953–960, 2011.
- [9] Minmin Chen, Kilian Q Weinberger, and John Blitzer. Co-training for domain adaptation. In NIPS, pages 2456–2464, 2011.
- [10] Minmin Chen, Zhixiang Xu, Kilian Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. arXiv preprint arXiv:1206.4683, 2012.
- [11] Wen-Sheng Chu, Fernando De la Torre, and Jeffery F Cohn. Selective transfer machine for personalized facial action unit detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3515–3522, 2013.
- [12] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, volume 32, pages 647–655, 2014.
- [13] Lixin Duan, Ivor W Tsang, and Dong Xu. Domain transfer multiple kernel learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):465–479, 2012.
- [14] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
- [15] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 513–520, 2011.
- [16] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073. IEEE, 2012.
- [17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- [18] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
- [19] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.
- [20] Jiayuan Huang, Alexander J Smola, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, et al. Correcting sample selection bias by unlabeled data. NIPS, 19:601, 2007.
- [21] Wenhao Jiang, Hongchang Gao, Fu-lai Chung, and Heng Huang. The l2, 1-norm stacked robust autoencoders for domain adaptation. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 1723–1729. AAAI Press, 2016.
- [22] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, pages 180–191. VLDB Endowment, 2004.
- [23] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- [24] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.
- [25] Mingsheng Long, Jianmin Wang, Yue Cao, Jiaguang Sun, and S Yu Philip. Deep learning of transferable representation for scalable domain adaptation. IEEE Transactions on Knowledge and Data Engineering, 28(8):2027–2040, 2016.
- [26] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2200–2207, 2013.
- [27] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In Advances in neural information processing systems, pages 1041–1048, 2009.
- [28] Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems, pages 1786–1794, 2010.
- [29] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
- [30] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
- [31] Ludger Rüschendorf. The wasserstein distance and approximation theorems. Probability Theory and Related Fields, 70(1):117–129, 1985.
- [32] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
- [33] Laurens Van Der Maaten. Barnes-Hut-SNE. arXiv preprint arXiv:1301.3342, 2013.
- [34] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In NIPS, pages 3320–3328, 2014.
- [35] Fuzhen Zhuang, Xiaohu Cheng, Ping Luo, Sinno Jialin Pan, and Qing He. Supervised representation learning: Transfer learning with deep autoencoders. In IJCAI, pages 4119–4125, 2015.

## Appendix: Advantage of the Wasserstein Distance over Cross-Entropy

Here we prove the superiority of the Wasserstein distance over cross-entropy as a loss function in the situation where the mapped feature distributions fill the whole latent feature space. For simplicity, we take two normal distributions as an example; the conclusion still holds in high-dimensional spaces. Figure 3 shows the two normal distributions. The space is divided into 3 regions: the probability of source data lying in region A is high while that of target data is extremely low, the situation is exactly the opposite in region C, and in region B the two distributions differ only slightly.

We use the same notation as in Section 3.3. If source data are labeled 1 while target data are labeled 0 and a domain classifier is used to help learn the features, the feature generator minimizes the following objective, which can be viewed as the negative of the cross-entropy between the domain label and the prediction:

$$L_g = \mathbb{E}_{x \sim \mathbb{P}_s}\big[\log \sigma(f(g(x)))\big] + \mathbb{E}_{x \sim \mathbb{P}_t}\big[\log\big(1 - \sigma(f(g(x)))\big)\big] \qquad (8)$$

where $\sigma$ is the sigmoid function, $g$ is the feature generator and $f$ is the pre-sigmoid output of the domain classifier, so that $D(z) = \sigma(f(z))$ is the predicted probability of a feature coming from the source domain. To compute the gradient of $L_g$ with respect to the generator parameters $\theta_g$, by the chain rule we have

$$\nabla_{\theta_g} L_g = \mathbb{E}_{x \sim \mathbb{P}_s}\big[\big(1 - D(g(x))\big)\,\nabla_{\theta_g} f(g(x))\big] - \mathbb{E}_{x \sim \mathbb{P}_t}\big[D(g(x))\,\nabla_{\theta_g} f(g(x))\big] \qquad (9)$$

As we know, the optimal domain classifier is $D^*(z) = \frac{P_s(z)}{P_s(z) + P_t(z)}$, where $P_s$ denotes the source feature distribution and $P_t$ denotes the target feature distribution. So if a source sample lies in region A, then $P_t(g(x)) \approx 0$ and hence $D^*(g(x)) \approx 1$, so the factor $1 - D^*(g(x))$ makes its gradient almost 0. The same holds for target samples lying in region C. These points therefore make no contribution to the gradient, and the divergence between the feature distributions cannot be reduced effectively.
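A small numerical check makes this saturation concrete. Assuming, for illustration only, that the source and target features follow 1-d normal densities N(-3, 1) and N(3, 1) (standing in for the two distributions of Figure 3), the factor 1 − D*(z) that scales a source sample's gradient all but vanishes in region A:

```python
import numpy as np

def normal_pdf(z, mu):
    # Unit-variance normal density, for illustration only.
    return np.exp(-0.5 * (z - mu) ** 2) / np.sqrt(2.0 * np.pi)

# Illustrative source/target feature densities: N(-3, 1) and N(3, 1).
for z, region in [(-3.0, "A"), (0.0, "B"), (3.0, "C")]:
    p_s, p_t = normal_pdf(z, -3.0), normal_pdf(z, 3.0)
    d_star = p_s / (p_s + p_t)    # optimal domain classifier D*(z)
    grad_factor = 1.0 - d_star    # scales a source sample's gradient
    print(f"region {region}: D*={d_star:.8f}, 1-D*={grad_factor:.2e}")
```

A source point deep in region A receives a gradient scaled by a factor on the order of 10⁻⁸, i.e. essentially none, while only points in the small overlap region B receive a usable signal.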

Now we consider the Wasserstein distance as the loss function:

$$W = \mathbb{E}_{x \sim \mathbb{P}_s}\big[f_w(g(x))\big] - \mathbb{E}_{x \sim \mathbb{P}_t}\big[f_w(g(x))\big] \qquad (10)$$

where $f_w$ is the 1-Lipschitz critic. For data from the source domain, the contribution to the gradient is $\nabla_{\theta_g} f_w(g(x))$, while for data from the target domain it is $-\nabla_{\theta_g} f_w(g(x))$. Since the gradient of the 1-Lipschitz critic does not saturate anywhere in the feature space, the Wasserstein distance can always provide a stable gradient.
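By contrast with the cross-entropy case, the Wasserstein gradient does not shrink as the two feature distributions move apart. The sketch below uses a fixed linear critic f_w(z) = −z, which is 1-Lipschitz and, for two well-separated 1-d clouds with the source on the left, close to the optimal critic; the estimated distance grows with the separation while the per-sample gradient magnitude stays constant:

```python
import numpy as np

def critic(z):
    # Fixed 1-Lipschitz critic; near-optimal for two well-separated
    # 1-d clouds with the source cloud on the left.
    return -z

def critic_grad(z):
    return -np.ones_like(z)

rng = np.random.default_rng(0)
for gap in [1.0, 10.0, 100.0]:   # separation of the two feature clouds
    src = rng.normal(-gap / 2, 1.0, size=1000)
    tgt = rng.normal(+gap / 2, 1.0, size=1000)
    w_est = critic(src).mean() - critic(tgt).mean()  # estimate of Eq. (10)
    g_mag = np.abs(critic_grad(src)).mean()          # per-sample gradient
    print(f"gap={gap:6.1f}  W~{w_est:8.2f}  |grad|={g_mag:.1f}")
```

The per-sample gradient magnitude stays at 1 for any separation, whereas the cross-entropy gradient factor collapses towards 0 once the two domains become easy to distinguish.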