Deformation equivariant cross-modality image synthesis with paired non-aligned training data

Joel Honkamaa, Department of Computer Science, Aalto University
Umair Khan, Institute of Biomedicine, University of Turku
Sonja Koivukoski, Institute of Biomedicine, University of Eastern Finland
Leena Latonen, Institute of Biomedicine, University of Eastern Finland
Pekka Ruusuvuori, Institute of Biomedicine, University of Turku
Pekka Marttinen, Department of Computer Science, Aalto University

Abstract

Cross-modality image synthesis is an active research topic with multiple clinically relevant medical applications. Recently, methods allowing training with paired but misaligned data have started to emerge. However, no robust and well-performing methods applicable to a wide range of real-world data sets exist. In this work, we propose a generic solution to the problem of cross-modality image synthesis with paired but non-aligned data by introducing new loss functions that encourage deformation equivariance. The method consists of joint training of an image synthesis network together with separate registration networks and allows adversarial training conditioned on the input even with misaligned data. The work lowers the bar for new clinical applications by allowing effortless training of cross-modality image synthesis networks for more difficult data sets and opens up opportunities for the development of new generic learning-based cross-modality registration algorithms.

1 Introduction

Image-to-image translation is one of the most active areas of research in computer vision because of its various applications such as image synthesis, segmentation, restoration, style transformation and pose estimation. After the advent of deep learning, medical imaging, as a cardinal application area, has seen an increasing interest in the use of image-to-image translation. In histopathology, image-to-image translation has been used, e.g., for cross-stain translation (Liu et al., 2021; Xu et al., 2019), for replacing chemical staining by a digitally generated mask (Valkonen et al., 2019), for tissue color normalization (de Bel et al., 2019, 2021), and for virtual staining of label-free or unstained tissue images (Bayramoglu et al., 2017; Rana et al., 2020; Rivenson et al., 2019). In radiology, it has been used for organ segmentation, synthetic CT and cross-modality MRI synthesis (Boulanger et al., 2021; Spadea et al., 2021; Xie et al., 2022). Image-to-image translation methods are primarily divided into two categories: supervised methods that rely on aligned image pairs, and unsupervised methods that do not require aligned image pairs but whose translation quality is in general not on par with the supervised methods.

Image-to-image translation goes by different names depending on the application. In this work we call it cross-modality image synthesis, a term often used in the medical imaging context. We use the term modality broadly to refer to any distinct image type capturing different characteristics of the underlying anatomy.

In the medical domain, images of the same subject from different modalities are usually not anatomically aligned. To solve this, images are typically registered, or in other words aligned anatomically, before training a network. Deep learning registration methods have gained popularity (Fu et al., 2020), with the best methods performing close to classical registration algorithms, e.g., in the Learn2Reg multi-task medical image registration challenge (Hering et al., 2021) or in the histopathology ANHIR competition (Borovec et al., 2020).

Methods combining the two, cross-modality image synthesis and cross-modality registration, have also started to surface. In registration a synthesized image can be used as a bridge to generate a cross-modality similarity metric (Lu et al., 2021). However, some methods combine these two into a unified architecture, solving both of the problems at the same time. Such methods have been published from both the registration (Arar et al., 2020; Chen et al., 2022) and image synthesis viewpoint (Joyce et al., 2017; Kong et al., 2021; Wang et al., 2018, 2019, 2021a).

Figure 1: Basic setting. Only non-aligned pairs of $x^{(i)}$ and $\tilde{y}^{(i)}$ are available but the task is to learn $F$ for transforming $x^{(i)}$ into $y^{(i)}$. The images are from the synthetic "multimodal" data sets built using the COCO (Lin et al., 2014) data set.

In this paper we propose a new architecture for combined cross-modality image synthesis and cross-modality registration. Firstly, we suggest training the image synthesis network directly for deformation equivariance, which refers to the property that applying a deformation before or after the image synthesis should result in the same image. Secondly, we develop a strategy allowing adversarial training conditioned on the input images despite using misaligned training data, which is not possible with earlier methods. Conditioning adversarial training on the input is especially important in the medical domain as it results in more reliable predictions. In addition to producing better quality predictions, the method is applicable to a wider range of data sets than earlier similar methods.

2 Basic Setting

Assume we have a training set of input images $(x^{(1)}, \dots, x^{(N)})$ and non-aligned output images $(\tilde{y}^{(1)}, \dots, \tilde{y}^{(N)})$. Following the notation by Kong et al. (2021) we denote the (unavailable) aligned ground truth labels by $(y^{(1)}, \dots, y^{(N)})$. Additionally, unavailable deformations $(d^{(1)}, \dots, d^{(N)})$ connect the coordinate systems of the inputs and the outputs.

Assuming that the images are continuous, they can be seen as mappings $x^{(i)} : R^{n} \to R^{m_{1}}$ and $y^{(i)}, \tilde{y}^{(i)} : R^{n} \to R^{m_{2}}$ where $n$ is the dimensionality of the image (e.g. $n = 2$ for two-dimensional images) and $m_{1}$ and $m_{2}$ are the numbers of channels in the input and output images respectively (e.g. $3$ for RGB images). The deformations are then mappings $d^{(i)} : R^{n} \to R^{n}$ connecting the image coordinates. A coordinate transformation of an image $x^{(i)}$ based on a deformation $d^{(i)}$ equals the function composition $x^{(i)} \circ d^{(i)}$, which can be written using the pullback notation as $d^{(i) *} x^{(i)} := x^{(i)} \circ d^{(i)}$ where $d^{(i) *}$ can be seen as a mapping acting on images.

Following the notation, we have the relationship $d^{(i) *} y^{(i)} = \tilde{y}^{(i)}$ between the aligned and non-aligned labels. In practice, the images are not continuous; only samples of the images are available, i.e. the pixels or voxels. Hence in reality, applying the mapping $d^{(i) *}$ amounts to interpolating the image at the locations defined by the deformation, and we use linear interpolation.
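To make the discrete pullback concrete, the following is a minimal sketch of resampling a single-channel 2D image with bilinear interpolation. The function name `pullback`, the `(2, H, W)` layout of the deformation and the returned validity mask are assumptions made for illustration only, not the implementation used in this work.

```python
import numpy as np

def pullback(image, deformation):
    """Resample `image` at the locations given by `deformation` (bilinear).

    image:       array of shape (H, W) with H, W >= 2
    deformation: array of shape (2, H, W); deformation[:, i, j] is the
                 (row, col) location of `image` read for output pixel (i, j)
    Returns the resampled image and a mask of locations inside the image.
    """
    h, w = image.shape
    rows, cols = deformation
    inside = (rows >= 0) & (rows <= h - 1) & (cols >= 0) & (cols <= w - 1)
    # Clamp so that indexing is safe; clamped pixels are flagged by the mask.
    r = np.clip(rows, 0, h - 1)
    c = np.clip(cols, 0, w - 1)
    r0 = np.minimum(np.floor(r).astype(int), h - 2)
    c0 = np.minimum(np.floor(c).astype(int), w - 2)
    fr, fc = r - r0, c - c0
    top = (1 - fc) * image[r0, c0] + fc * image[r0, c0 + 1]
    bottom = (1 - fc) * image[r0 + 1, c0] + fc * image[r0 + 1, c0 + 1]
    return (1 - fr) * top + fr * bottom, inside

image = np.arange(12.0).reshape(3, 4)            # value = 4 * row + col
identity = np.stack(np.meshgrid(np.arange(3.0), np.arange(4.0), indexing="ij"))
half_pixel = identity + np.array([0.0, 0.5]).reshape(2, 1, 1)
shifted, inside = pullback(image, half_pixel)    # reads half a pixel to the right
```

Because the toy image is linear in the column index, the half-pixel shift is reproduced exactly wherever the sampled location lies inside the image; the last column is sampled from outside and is only flagged by the mask.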

In this work, we study a setting where we are trying to learn a function $F$ which is a neural network such that $F (x^{(i)}) = y^{(i)}$ . To do this, we simultaneously try to learn a second neural network, or as it turns out, multiple networks for predicting $d^{(i)}$ .

3 Previous Work

3.1 Cross-Modality Image Synthesis

Cross-modality medical image synthesis has gained a lot of attention in recent years with multiple proposed clinical applications (Wang et al., 2021b). The conditional GAN-based architecture pix2pix (Isola et al., 2017) is widely used when paired and aligned data are available, as it is based on an assumption of pixel-to-pixel correspondence between training images of different modalities. CycleGAN (Zhu et al., 2017), in contrast, can be used if one does not have paired or aligned data.

Paired training images in the medical context are typically not aligned, and hence for pixel-to-pixel training they are registered into the same coordinate system. However, registration is never perfect and pixel-to-pixel losses are very sensitive to registration errors, which reduces the synthesis quality especially in difficult-to-register areas with large internal anatomic motion such as the pelvic area (Wang et al., 2021b).

In the pixel-to-pixel setting, different approaches have been proposed to mitigate the remaining registration errors (Chen et al., 2020; Joyce et al., 2017; Kazemifar et al., 2019; Leynes et al., 2018; Yu et al., 2019). Most similarly to our work, Kong et al. (2021) combine a cross-modality image synthesis network with a registration network to enable training with non-aligned data. However, we argue that their method does not robustly mitigate registration errors, especially with real-world data sets. Additionally, they use unconditional adversarial training, which has been shown to be inferior to conditioning the discriminator on input images. In this work we aim to solve both of these problems.

While performing worse than pix2pix when paired and aligned data are available, unsupervised CycleGAN is more robust to misalignments due to its cycle consistency loss (Kaji and Kida, 2019; Wang et al., 2021b). However, if the misalignments between the modalities are systematic and severe, CycleGAN can also fail to produce geometrically aligned predictions. Approaches, similar to ones used with pix2pix, have also been employed with CycleGAN (Zhang et al., 2018; Hiasa et al., 2018; Kida et al., 2019). Wang et al. (2018, 2019, 2021a) propose network architectures combining cross-modality image synthesis and registration, together with mutual information loss between the input and the prediction to enforce similar geometry.

3.2 Cross-Modality Registration

Deformable medical image registration using deep learning has also gained popularity recently (Fu et al., 2020). With stationary velocity field parametrization, one can generate diffeomorphic deformations (Arsigny et al., 2006; Ashburner, 2007). This kind of methodology was applied to deep learning by Dalca et al. (2018). Some architectures such as the one by De Vos et al. (2019) combine affine or rigid registration together with a separate deformable registration resulting in multi-stage registration approach.

Of the cross-modality registration methods, the method by Arar et al. (2020) is closest to our work. They train a cross-modality registration network by simultaneously training a cross-modality image synthesis network which they encourage to be equivariant to deformations predicted by the registration network. This is done by applying the predicted deformation both before and after the image synthesis network and comparing both of them to the label. We instead use simulated deformations for encouraging deformation equivariance, which we argue to be a more robust approach. Our method of encouraging deformation equivariance is similar to the method by Pielawski et al. (2020), who train their network for rotational equivariance.

Very recently, Chen et al. (2022) use contrastive learning based loss for enforcing geometric (or shape) similarity of the image synthesis in an otherwise similar setting to Arar et al.

4 Methods

Figure 2: Proposed core architecture. When using the equivariance similarity loss instead of the default similarity loss, the commutation loss is optional. The architecture presented here is further refined by adding an adversarial loss and a separate cross-modality registration network for first registering $\tilde{y}^{(i)}$ to $x^{(i)}$. Deformation $t$ is a random deformation sampled on the fly individually for each training image pair. The images are from the synthetic "multimodal" data sets built using the COCO (Lin et al., 2014) data set.

To teach the network $F$ to predict $y^{(i)}$ from $x^{(i)}$, Kong et al. (2021) train an additional network $G$ aimed at learning the $d^{(i)}$ s.t. $G (F (x^{(i)}), \tilde{y}^{(i)}) = d^{(i)}$. They train both networks $F$ and $G$ simultaneously with the similarity loss (which we label default similarity loss)

$$L_{def-sim} := E_{x, \tilde{y}} \| \tilde{y} - G (F (x), \tilde{y})^{*} F (x) \|_{L^{1}} \tag{1}$$

together with a regularization loss

$$L_{reg} := E_{x, \tilde{y}} Reg (G (F (x), \tilde{y}))$$

where $Reg$ is some operator penalizing non-smooth deformations, and an unconditional adversarial loss with the intent of training the distribution of $F (x^{(i)})$ to match the distribution of $\tilde{y}^{(i)}$.

Kong et al. view the deformations between inputs and labels as noise and assume the same underlying physical distribution for both the inputs and the labels. In that setting, the adversarial training objective they use is justified. However, images in the label domain are often systematically geometrically different from the images in the input domain, e.g., when patients are lying differently within different medical imaging equipment. In that case matching the distribution of $F (x^{(i)})$ with the distribution of $\tilde{y}^{(i)}$ is not desirable.

To fix this we first omit the adversarial loss altogether, although we will develop a revised adversarial training strategy later. However, without the adversarial loss the optimization problem is very unstable since the network $F$ is not in any way constrained to preserve the geometry of the input images, that is, the predictions are not guaranteed to be anatomically aligned with the inputs. If $F$ is a convolutional neural network it has an inductive bias towards this kind of behaviour, but there is no guarantee that the convolutional network does not, e.g., shift the predictions while the registration network compensates for the shift. An example of a possible failure mode is shown in Figure 5. It is also noteworthy that while having the adversarial training in a setting similar to the work by Kong et al. will definitely stabilize the training, there is no fundamental theoretical reason why it should result in $F$ preserving the geometry of its inputs.

The property of $F$ preserving the geometry of an input can be formulated as deformation equivariance. Any movement in the underlying anatomy of the input image should be reflected similarly in the output image. Assuming a set of anatomically possible geometric deformations $T^{(i)}$ for each input image $x^{(i)}$ , the function $F$ should be such that for any $t \in T^{(i)}$ , it holds that

$$t^{*} F (x^{(i)}) = F (t^{*} x^{(i)}) . \tag{2}$$

In other words, $F$ should commute for all anatomically possible deformations of $x^{(i)}$ . To achieve this, we enforce the property implicitly by modifying the default similarity loss given in Equation (1). The modification is similar to the one by Pielawski et al. (2020), although they use it in a different contrastive learning setting. We label the resulting loss equivariance similarity loss:

$$L_{eq-sim} = E_{x, \tilde{y}, t} \| \tilde{y} - (G (F (x), \tilde{y})^{*} t^{-1})^{*} F (t^{*} x) \|_{L^{1}} \tag{3}$$

Here $t$ is seen as a random variable sampled from some distribution. The loss can be zero only if $F$ is equivariant to all the $t$ for all the inputs. Note that we first compose the deformations $G (F (x), \tilde{y})$ and $t^{-1}$ and after that deform the prediction $F (t^{*} x)$. This way we avoid multiple interpolations of the same image. The same strategy of composing the deformations first is always used in this paper when applying multiple deformations to an image.
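The benefit of composing the deformations first can be seen in a toy 1D example, sketched here under simplifying assumptions: the two deformations are half-pixel shifts in opposite directions, `np.interp` provides the linear interpolation, and values outside the signal are held at the edge value instead of being masked. All names are hypothetical.

```python
import numpy as np

def pullback_1d(signal, deformation):
    """d^* x := x o d with linear interpolation (edge values are held)."""
    grid = np.arange(signal.shape[0], dtype=float)
    return np.interp(deformation, grid, signal)

grid = np.arange(8, dtype=float)
signal = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # one sharp peak
shift = grid + 0.5         # plays the role of the predicted deformation d
shift_back = grid - 0.5    # plays the role of t^{-1}

# Option A: interpolate the image twice, once per deformation.
twice = pullback_1d(pullback_1d(signal, shift_back), shift)

# Option B: compose the deformations first (d^* t^{-1} = t^{-1} o d),
# then interpolate the image only once.
composed = pullback_1d(shift_back, shift)
once = pullback_1d(signal, composed)
```

In this example the two shifts cancel, so the single interpolation of option B returns the signal unchanged, whereas option A smears the peak over three samples and halves its height. Deformation fields are smooth, so interpolating them loses far less information than interpolating the image.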

Optionally one can explicitly enforce the equivariance by training with the following objective which we label commutation loss:

$$L_{com} := E_{x, t} \| t^{*} F (x) - F (t^{*} x) \|_{L^{1}} \tag{4}$$

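When the commutation loss is used, the equivariance similarity loss is optional and the default similarity loss can be used instead. Without the commutation loss, the equivariance similarity loss is needed. As a result we have three possible configurations: the equivariance similarity loss alone, the default similarity loss with the commutation loss, and the equivariance similarity loss with the commutation loss.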
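As an illustration of what the commutation loss measures, the sketch below evaluates it for a single image and a single deformation, using a 90 degree rotation because it needs no interpolation. The two stand-in "networks" (`pointwise` and `shifting`) and the circular shift used as a translation are hypothetical toy functions, not the synthesis network $F$ itself.

```python
import numpy as np

def commutation_loss(F, x, t):
    """Mean absolute difference between t^* F(x) and F(t^* x)."""
    return float(np.mean(np.abs(t(F(x)) - F(t(x)))))

def rot90(img):        # deformation t: rotation by 90 degrees
    return np.rot90(img)

def translate(img):    # deformation t: circular shift by three rows
    return np.roll(img, 3, axis=0)

def pointwise(img):    # toy F that acts on every pixel independently
    return 2.0 * img + 1.0

def shifting(img):     # toy F that moves the content one pixel sideways
    return np.roll(img, 1, axis=1)

rng = np.random.default_rng(0)
x = rng.random((16, 16))
loss_pointwise = commutation_loss(pointwise, x, rot90)              # exactly 0
loss_shifting = commutation_loss(shifting, x, rot90)                # clearly > 0
loss_shifting_translated = commutation_loss(shifting, x, translate)  # exactly 0
```

The last value shows why translations alone would not be an informative deformation set: a shift introduced by $F$ commutes with any translation and is only exposed by deformations such as rotations or flips.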

Arar et al. (2020) also encourage deformation equivariance but only for deformations predicted by their registration network. That is not always enough, e.g., with perfectly aligned training data the network $F$ could still introduce any translation which the registration network could compensate since translations commute. Also with subtle systematic deformations the network $F$ might easily overlearn the deformations from the data resulting in the registration network predicting zero deformation.

The core architecture presented so far is visualized in Figure 2, and is in itself trainable. However, in addition to the core architecture, we will be looking at adding a conditional adversarial loss for training the model to improve the prediction quality further. In order for the adversarial training to converge even in the presence of systematically different geometry between the input and the label domains, it turns out we will require two registration networks: one for registering labels to inputs and another one for registering predictions to possibly imperfectly registered labels.

Before advancing further, we introduce additional notation. In case a variable should be treated as a constant from the optimization point of view even if it is an output of a neural network, we overline the variable, e.g. $x^{(i)}$ vs. $\overline{x}^{(i)}$. In the neural network context, this means halting the backward pass during back-propagation.

4.1 Deformation Set

The equivariance similarity loss and the commutation loss require some way to simulate anatomically realistic deformations for calculating them. An ideal set would be the set of all anatomically realistic deformations for each sample, but anatomically realistic non-affine deformations are very difficult to simulate. A natural question is then whether a significantly smaller set of deformations would be enough especially given the inductive biases of the used architectures.

Given that $F$ is a convolutional network it is (roughly) translation equivariant. Hence intuitively simulating only globally affine deformations might be enough since any diffeomorphic deformation is "locally affine". In our experiments globally affine deformations were indeed enough, and we ended up using only rotations and flips. As a result, deformations can be sampled from the same distribution for each sample. Applying affine deformations to images is also computationally efficient and numerically accurate. With more challenging data, scaling, shearing or even elastic deformations might have to be used.

Flips deviate from the idea of the set of deformations being anatomically realistic. However, physically flipping is a relatively well defined action, although in practice usually impossible to execute. Most imaging methods image the physical world in such a way that flipping the imaged subject should also flip the resulting image.
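A deformation set consisting of rotations and flips could be sampled as in the following minimal sketch. The uniform angle distribution, the rotation about the image centre and all function names are assumptions for illustration; the text above only states that rotations and flips were used.

```python
import numpy as np

def sample_rotation_flip(rng):
    """Sample a 2x2 matrix: rotation by a random angle, optionally flipped."""
    angle = rng.uniform(0.0, 2.0 * np.pi)
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle), np.cos(angle)]])
    flip = np.diag([1.0, rng.choice([-1.0, 1.0])])
    return rotation @ flip

def to_coordinates(matrix, shape):
    """Locations read by t^* for every output pixel, about the image centre."""
    h, w = shape
    grid = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"))
    grid = grid.astype(float)
    centre = np.array([(h - 1) / 2.0, (w - 1) / 2.0]).reshape(2, 1, 1)
    return np.einsum("ab,bhw->ahw", matrix, grid - centre) + centre

rng = np.random.default_rng(0)
t = sample_rotation_flip(rng)
t_inverse = t.T                       # orthogonal matrix: inverse = transpose
coords = to_coordinates(t, (32, 32))  # shape (2, 32, 32), usable for resampling
```

Because the sampled matrices are orthogonal, the inverse deformation $t^{-1}$ needed by the equivariance losses is available in closed form and no numerical inversion is required.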

4.2 Adversarial Training

In addition to the losses presented so far, we want to incorporate adversarial loss to the training in order to improve the appearance and also clinical quality of the predictions. In adversarial training, an additional discriminator network is trained to classify whether an image fed to the network is real or fake and can be used for guiding the generator network responsible for generating synthetic images. We want to employ a conditional adversarial training setting, similar to pix2pix (Isola et al., 2017), wherein the input image is also fed to the discriminator. This is different to the approach taken by Kong et al. (2021) where they feed only the prediction or the label. Conditioning the discriminator on input images results in generally better predictions (Isola et al., 2017).

Let now $D$ be the discriminator network receiving an input image as the first argument and either a label or a prediction as the second argument. The conditional adversarial learning objective is defined as follows:

$$E_{x, \tilde{y}} [\log D (\text{input}, \text{label}) + \log (1 - D (\text{input}, \text{prediction}))] \tag{5}$$

The discriminator is trained to maximize the loss and the generator is trained to minimize it with the training executed in turns while holding the weights of the other network constant. Placeholder texts are used as we are yet to derive the optimal way to feed the data to the discriminator. Misaligned training data will require care in how that is done. To be more precise, the following three points need to be taken into account:

1) Predictions and labels have to be fed in the same coordinate system since the input domain and the label domain might have systematic geometric differences which would encourage the predictions to be misaligned with the inputs.
2) Inputs and their corresponding predictions cannot be fed in exactly the same coordinate system since even if the predictions are registered to the labels or vice versa, the labels will not be exactly aligned with the inputs, especially in the beginning of the training. That would encourage misaligned predictions.
3) Interpolation acts as a low-pass filter especially in areas where the image is stretched. As a result, predictions registered to labels cannot be directly compared with the labels as the discriminator will learn to notice the missing high frequencies.

Points 1) and 2) suggest feeding labels and predictions in the label coordinates, but 3) suggests that we cannot deform the predictions to the labels either. As a solution we propose to train two separate registration networks: one for cross-modality registration of labels to inputs and one for intra-modality registration of predictions to possibly imperfectly registered labels. The adversarial comparison can then be done between the registered labels and the predictions registered to the registered labels. The proposed approach solves all three problems mentioned above: 1) The comparison is done in the same coordinate system. 2) If the label and the input are imperfectly aligned, the prediction can still be separately registered to the registered label, hence removing the incentive for misaligned predictions. 3) As the training progresses, most of the movement should be included in the registration of the labels to the inputs, allowing the predictions registered to the registered labels to contain at least as much high frequency information as the registered labels.

Let us now denote the predicted cross-modality deformation (approximately) mapping coordinates of $\tilde{y}^{(i)}$ to $x^{(i)}$ as $d_{cross}^{(i)}$ and the predicted intra-modality deformation (approximately) mapping coordinates of $F (x^{(i)})$ to ${d_{cross}^{(i)}}^{*} \tilde{y}^{(i)}$ as $d_{intra}^{(i)}$. We will look at how these are obtained in Section 4.3.

From the adversarial loss perspective, we want to treat $d_{cross}^{(i)}$ as constant since the cross-modality registration would not benefit from the adversarial loss and might even result in unexpected optima. This approach corresponds to the normal GAN training where only the second term is used in updating the generator. The proposed adversarial loss is then

$$E_{x, \tilde{y}} [\log D (x, \overline{d}_{cross}^{*} \tilde{y}) + \log (1 - D (x, d_{intra}^{*} F (x)))] . \tag{6}$$

The loss function can be improved even further by employing an idea similar to the equivariance similarity loss. To simultaneously prevent the discriminator from over-fitting and implicitly enforce equivariance to a set of deformations, we propose to further modify the loss to the following form (which we label equivariance adversarial loss):

$$L_{eq-adv} := E_{x, \tilde{y}, t} [\log D (t^{*} x, (t^{*} \overline{d}_{cross})^{*} \tilde{y}) + \log (1 - D (t^{*} x, (t^{*} d_{intra}^{*} t^{-1})^{*} F (t^{*} x)))] \tag{7}$$

Here the same deformations $t$ can be used as for the equivariance similarity loss and the commutation loss. The core idea is to augment the inputs to the discriminator with the deformations $t$, and the formulation follows from it naturally.
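The form of the discriminator arguments in Equation (7) can be checked by expanding the pullbacks. The following short derivation uses only the definition $d^{*} x := x \circ d$, read from right to left when several deformations are composed, and is meant as a sanity check rather than as part of the method:

$$
\begin{aligned}
(t^{*} \overline{d}_{cross})^{*} \tilde{y} &= \tilde{y} \circ \overline{d}_{cross} \circ t = t^{*} (\overline{d}_{cross}^{*} \tilde{y}), \\
(t^{*} d_{intra}^{*} t^{-1})^{*} F (t^{*} x) &= F (t^{*} x) \circ t^{-1} \circ d_{intra} \circ t = t^{*} d_{intra}^{*} (t^{-1})^{*} F (t^{*} x).
\end{aligned}
$$

If $F$ satisfies the equivariance property of Equation (2), then $(t^{-1})^{*} F (t^{*} x) = F (x)$ and the second expression reduces to $t^{*} (d_{intra}^{*} F (x))$. In that case both discriminator inputs are simply the inputs of the adversarial loss in Equation (6) deformed by the same $t$, each obtained with a single interpolation of the underlying image.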

4.3 Registration Architecture

As discussed, the registration will be divided into cross-modality registration for registering labels to inputs and intra-modality registration for registering predictions to the registered labels. While the cross-modality registration receives pairs of different modalities as inputs, it is trained with an intra-modality loss based on the synthesised image $F (x^{(i)})$, similarly to the intra-modality registration.

In principle, the registration networks can predict the deformation in any suitable form. We split the cross-modality registration into rigid registration and elastic registration. The two-stage architecture makes it significantly easier for the model to handle large deformations. For intra-modality registration, we do not use two-stage architecture as cross-modality registration should take care of most of the movement.

We generate elastic deformations from stationary velocity fields to enforce diffeomorphic deformations and to allow inverting the deformations. From a stationary velocity field the final diffeomorphic deformation is obtained by integrating the field over itself over a unit time. In group theory, this can be seen as exponentiation of a member of a Lie algebra (Arsigny et al., 2006), and hence we denote the integration by $\exp$. Exponentiation can be estimated efficiently by the scaling and squaring method (Arsigny et al., 2006; Dalca et al., 2018). The velocity fields are predicted in the same resolution as the images.
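The scaling and squaring idea can be sketched in 1D as follows. This is a minimal illustration, assuming linear interpolation via `np.interp`, six squaring steps and a smooth toy velocity field; the function name and these choices are not taken from this work.

```python
import numpy as np

def exp_svf_1d(velocity, steps=6):
    """Integrate a stationary velocity field over unit time (1D sketch).

    Returns the displacement u of the deformation x -> x + u(x).
    """
    grid = np.arange(velocity.shape[0], dtype=float)
    u = velocity / 2.0 ** steps          # scaling: displacement of a tiny step
    for _ in range(steps):               # squaring: phi <- phi o phi
        u = u + np.interp(grid + u, grid, u)
    return u

grid = np.arange(64, dtype=float)
velocity = 2.0 * np.exp(-((grid - 32.0) / 10.0) ** 2)  # smooth bump, ~0 at edges
forward = exp_svf_1d(velocity)
inverse = exp_svf_1d(-velocity)          # exp(-v) approximates the inverse

# Displacement of the composition (forward o inverse); close to zero
# if the two deformations really are inverses of each other.
round_trip = inverse + np.interp(grid + inverse, grid, forward)
```

Negating the velocity field before exponentiation gives an approximate inverse deformation at no extra modelling cost, which is what makes the $\exp (v)$ and $\exp (- v)$ pairs used in the regularization terms convenient.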

4.3.1 Cross-Modality Registration

Let the neural network predicting the rigid deformation for cross-modality registration be $H_{rig}$ and the neural network predicting the stationary velocity field for elastic cross-modality registration be $H_{svf}$ . Both networks $H_{rig}$ and $H_{svf}$ could be trained in principle with a single loss, but for the rigid registration network not to first shift the label unnecessarily and then the elastic registration network to shift it back, we add a separate rigid registration loss.

The overall predicted cross-modality deformation is then

$$d_{cross}^{(i)} := \exp (v_{cross}^{(i)})^{*} \overline{r}_{cross}^{(i)}$$

where $r_{cross}^{(i)} := H_{rig} (x^{(i)}, \tilde{y}^{(i)})$ and $v_{cross}^{(i)} := H_{svf} (x^{(i)}, \tilde{y}^{(i)})$. Halting the gradients for $r_{cross}^{(i)}$ is not necessary but makes loss function balancing more straightforward by separating the rigid registration altogether.

We train the rigid registration network $H_{rig}$ with the loss

$$L_{rig-sim}^{cross} := E_{x, \tilde{y}} [\| \overline{F (x)} - r_{cross}^{*} \tilde{y} \|_{L^{1}}] \tag{8}$$

and the elastic registration network $H_{svf}$ with the loss

$$L_{sim}^{cross} := E_{x, \tilde{y}} [\| \overline{F (x)} - d_{cross}^{*} \tilde{y} \|_{L^{1}}] . \tag{9}$$

We halt the gradients for the backward pass for $F (x)$ as we do not want the imperfect cross-modality and especially rigid cross-modality registered label to affect the image synthesis network.

Additionally we need to regularize the deformation. The regularization term can be applied only to the elastic component as we do not penalize rigid deformations. We use non-rigidity penalty by Staring et al. (2007) and apply it to both inverse and forward deformations. We have

$$L_{reg}^{cross} := E_{x, \tilde{y}} [Rig (\exp (v_{cross})) + Rig (\exp (- v_{cross}))] \tag{10}$$

where $Rig$ is the non-rigidity penalty by Staring et al. Details of the regularization used can be found in the supplementary materials.

Figure 3: Cross-modality registration architecture. The outputs $y_{reg}^{(i)}$ and $v_{cross}^{(i)}$ are forwarded for intra-modality registration. Regularization is applied to both inverse and forward elastic deformation which is not explicitly shown here. The images are from the synthetic "multimodal" data sets built using COCO (Lin et al., 2014) data set.

4.3.2 Intra-Modality Registration

The intra-modality registration network receives the triplets

$$(F (x^{(i)}), \overline{y}_{reg}^{(i)}, \exp (- \overline{v}_{cross}^{(i)}) - I)$$

as inputs where $I$ is the identity mapping. The third argument represents displacement field of the inverse elastic deformation with which $y_{reg}^{(i)}$ has been deformed and allows the network to optimize regularity of the concatenated overall deformation, as we regularize based on that.

Outputs of the cross-modality registration stage are treated as constants by the intra-modality registration losses. By that we prevent the networks from finding any non-desired optima where, e.g., difficult to synthesize regions were made smaller.

Let now the function predicting the stationary velocity field for intra-modality registration be $G_{svf}$ . Then, the predicted intra-modality deformation is

$$d_{intra}^{(i)} := \exp (- v_{intra}^{(i)})$$

where $v_{intra}^{(i)} := G_{svf} (F (x^{(i)}), \overline{y}_{reg}^{(i)}, \exp (- \overline{v}_{cross}^{(i)}) - I)$. Here, we use the negative sign for the velocity field to emphasize that the direction is different to the cross-modality registration. As the training progresses and cross-modality registration and cross-modality image synthesis improve, $d_{intra}^{(i)}$ should approach the identity mapping.

The loss function for the intra-modality registration also guides the cross-modality image synthesis. Hence, we use the deformation equivariance encouraging loss function following Equation (3):

$$L_{eq-sim}^{intra} := E_{x, \tilde{y}, t} \| (d_{intra}^{*} t^{-1})^{*} F (t^{*} x) - \overline{y}_{reg} \|_{L^{1}} \tag{11}$$

We also experiment with a setting where $L_{eq-sim}^{intra}$ is replaced with the default similarity loss following Equation (1). In that case, the loss simply takes the following form:

$$L_{def-sim}^{intra} := E_{x, \tilde{y}} \| d_{intra}^{*} F (x) - \overline{y}_{reg} \|_{L^{1}} . \tag{12}$$

For regularization, we use the concatenated overall elastic deformation again in both directions:

$$L_{reg}^{intra} := E_{x, \tilde{y}} [Rig (\exp (v_{intra})^{*} \exp (\overline{v}_{cross})) + Rig (\exp (- \overline{v}_{cross})^{*} \exp (- v_{intra}))] \tag{13}$$

Using the concatenated overall deformation is logical, as that is the deformation we are actually trying to learn and hence regularize.

Having the separate intra-modality registration network in addition to the cross-modality registration allows the prediction from the image synthesis network to directly affect the predicted deformation. As a result the cross-modality image synthesis network is effectively optimized for generating predictions with the lowest deformation regularization loss, which can be seen as a meaningful selection criterion among all the possible geometry preserving versions of the synthesized image.

4.4 Masking

Throughout the architecture, images are resampled by deformations, but the sampled locations might be outside the image. We connect each image with a mask that can initially represent invalid regions in the image. Each time an image is resampled, the mask is updated with the regions resampled from outside the image. In similarity losses, we then compare images only within intersections of the masks. Masks contain discrete values and no gradients flow through the masks during the backward-pass, preventing the optimization of the masks themselves. The same procedure is done also for the deformations as they are also resampled by other deformations.
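Comparing images only within the intersection of their masks could look like the following minimal sketch. Averaging over the valid pixels and returning zero for an empty intersection are assumptions made here for illustration; the function name is hypothetical.

```python
import numpy as np

def masked_l1(a, mask_a, b, mask_b):
    """Mean absolute difference over pixels that are valid in both images."""
    valid = mask_a & mask_b              # intersection of the two masks
    if not valid.any():
        return 0.0                       # nothing to compare
    return float(np.abs(a - b)[valid].mean())

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[1.5, 9.0], [3.0, 0.0]])
mask_a = np.array([[True, True], [True, False]])  # bottom-right is invalid
mask_b = np.array([[True, False], [True, True]])  # top-right is invalid
loss = masked_l1(a, mask_a, b, mask_b)   # only the left column is compared
```

The masks are boolean arrays, so they only select which pixels contribute to the loss and carry no gradient themselves.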

Invalid region masks are also fed to the registration networks since for invalid regions only the regularization should affect the generated deformation.

Additionally, the masks of the registered labels and the predictions registered to the registered labels might be systematically different, which the discriminator could use for separating the images. We mitigate this by multiplying each image fed to the discriminator by the intersection of the masks of the images compared.

4.5 Overall Loss Function

The overall loss function can be written as

L

(14)

where $L_{sim}^{intra}$ is either $L_{eq-sim}^{intra}$ or $L_{def-sim}^{intra}$ , and $λ, γ, δ \in R$ are loss function weights. Each loss component affects only the weights of the sub-networks written in the curly braces. Note that $D$ is trained to maximize the loss whereas the other networks are trained to minimize it.

Actual values used for the loss function weights are given in the supplementary materials.

5 Experiments

In this section we will be using the following naming conventions when referring to different experiment configurations:

EqSim: The equivariance similarity loss from Equation (11) was used.
DefSim: The default similarity loss from Equation (12) was used.
Com: The commutation loss from Equation (4) was used.
EqAdv: The equivariance adversarial loss from Equation (7) was used.
DefCondAdv: The default input conditioned adversarial loss from pix2pix (Isola et al., 2017) defined directly between unmodified inputs, predictions and labels was used.
DefUncondAdv: The default unconditional adversarial loss from pix2pix (Isola et al., 2017) defined directly between unmodified predictions and labels was used.
NoReg: Only the cross-modality image synthesis component $F$ with $L^{1}$ similarity loss directly between predictions and unmodified training labels was used.
Aug: Traditional data augmentation was used for each training input using the same distribution of deformations as for the equivariance similarity loss, the commutation loss, and the equivariance adversarial loss.

The cross-modality registration related similarity loss and both of the regularization losses were used in all the trainings except with the NoReg setup.

In Section 4 three variants of our developed method were proposed: EqSim, DefSim + Com, and EqSim + Com. Optionally, EqAdv can be combined with any of them. Training with any of the variants should result in a stable convergence and the experiments aimed at measuring their relative performance. In addition, we naturally wanted to measure their performance against other models which are listed separately for each experiment.

Note that the configurations DefSim + DefUncondAdv + Aug and DefSim + DefUncondAdv correspond to training with losses similar to the ones used by Kong et al. (2021) (with and without augmentation). Our registration methodology, including the deformation regularization, differs from that of Kong et al., but we argue that using the exact same registration setup provides the fairest comparison between the methods.

Architectures used were identical for overlapping parts within each experiment and are detailed in the supplementary materials.

In all the experiments, a concatenation of the following transformations was used as the simulated deformation for the loss functions and data augmentation:

Rotations in the range ( $- 15^{\circ}$ , $15^{\circ}$ )
Orthogonal rotations of either $0^{\circ}$ , $90^{\circ}$ , $180^{\circ}$ , or $270^{\circ}$
Random flips over any axis
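As an illustration, sampling one such simulated deformation in 2D could look like the sketch below. The function names, the order in which the three transformations are composed, and the flip probability of 0.5 per axis are assumptions; the text above only fixes the set of transformations and the rotation range.

```python
import numpy as np


def rotation_matrix(angle_deg):
    """2D rotation matrix for an angle given in degrees."""
    a = np.deg2rad(angle_deg)
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])


def sample_simulated_deformation(rng):
    """Sample a 2x2 linear map: small rotation, orthogonal rotation, flips.

    The composition order and the per-axis flip probability (0.5) are
    assumptions made for this sketch.
    """
    small_rotation = rotation_matrix(rng.uniform(-15.0, 15.0))
    orthogonal_rotation = rotation_matrix(90.0 * rng.integers(0, 4))
    flips = np.diag(rng.choice([-1.0, 1.0], size=2))
    return small_rotation @ orthogonal_rotation @ flips
```

Every sampled matrix is orthogonal, so the simulated deformations preserve distances and only change orientation.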

In two experiments we require simulated deformations during synthetic data set generation. For that, we used a concatenation of a rotation and a translation together with an elastic deformation component generated by exponentiation of a stationary velocity field defined by parameters $μ, σ, m \in \mathbb{R}^{n}$ using the formula

$$m_{i} \, e^{- \frac{1}{2} \frac{\| x - μ \|^{2}}{σ_{i}^{2}}}, \tag{15}$$

where $x$ is the spatial coordinate and $i \in \{1, \dots, n\}$ is the dimension ( $n = 2$ or $3$ ).
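A minimal 2D sketch of Equation (15) and of the exponentiation step is given below. Scaling and squaring is used here as one common way to exponentiate a stationary velocity field; whether the data sets were generated with exactly this scheme, the number of squaring steps, and the bilinear interpolation with border clamping are assumptions, as are all function names.

```python
import numpy as np


def gaussian_velocity_field(shape, mu, sigma, m):
    """Velocity field with component i equal to m_i * exp(-0.5 * |x - mu|^2 / sigma_i^2)."""
    grid = np.stack(np.meshgrid(*[np.arange(s, dtype=float) for s in shape], indexing="ij"))
    sq_dist = sum((grid[i] - mu[i]) ** 2 for i in range(len(shape)))
    return np.stack([m[i] * np.exp(-0.5 * sq_dist / sigma[i] ** 2) for i in range(len(shape))])


def bilinear_sample(field, coords):
    """Sample a (C, H, W) field at (2, H, W) coordinates, clamping to the border."""
    h, w = field.shape[1:]
    r = np.clip(coords[0], 0, h - 1)
    c = np.clip(coords[1], 0, w - 1)
    r0 = np.floor(r).astype(int)
    c0 = np.floor(c).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c1 = np.minimum(c0 + 1, w - 1)
    fr, fc = r - r0, c - c0
    top = field[:, r0, c0] * (1 - fc) + field[:, r0, c1] * fc
    bottom = field[:, r1, c0] * (1 - fc) + field[:, r1, c1] * fc
    return top * (1 - fr) + bottom * fr


def exponentiate(velocity, steps=6):
    """Scaling and squaring: returns the displacement field of exp(velocity)."""
    h, w = velocity.shape[1:]
    identity = np.stack(
        np.meshgrid(np.arange(h, dtype=float), np.arange(w, dtype=float), indexing="ij")
    )
    disp = velocity / 2**steps
    for _ in range(steps):
        # Compose the deformation with itself: u(x) <- u(x) + u(x + u(x)).
        disp = disp + bilinear_sample(disp, identity + disp)
    return disp
```

With the parameter ranges of Table 1, `mu`, `sigma`, and `m` would each be drawn per dimension and the resulting displacement concatenated with the rigid component.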

5.1 Synthetic

	Rigid		Elastic
	Translation	Rotation	$μ$	$σ$	$m$
LR	U( $- 15$ , $15$ )	U( $- 15^{\circ}$ , $15^{\circ}$ )	U( $0$ , $400$ )	U( $40$ , $120$ )	U( $- 20$ , $20$ )
SR	U( $- 1.5$ , $1.5$ )	U( $- {1.5}^{\circ}$ , ${1.5}^{\circ}$ )	U( $0$ , $400$ )	U( $40$ , $120$ )	U( $- 2.0$ , $2.0$ )
LC	( $10$ , $- 10$ )	$10^{\circ}$	( $120$ , $280$ )	( $60$ , $80$ )	( $20$ , $- 20$ )
SC	( $1$ , $- 1$ )	$- 1^{\circ}$	( $120$ , $280$ )	( $60$ , $80$ )	( $2$ , $- 2$ )
$U$ refers to the uniform distribution independent for each dimension.
All the values except the rotations are in pixel coordinates.

Table 1: Deformation parameters for synthetic data sets

Synthetic "multimodal" data sets were created using images from the COCO data set (Lin et al., 2014) with unmodified images as input images. Label images were generated by circularly swapping the RGB color channels of the input images and by deforming them with simulated deformations. All images were centrally cropped to resolution $(400, 400)$ . Four data sets were generated: LR (Large Random), SR (Small Random), LC (Large Constant), and SC (Small Constant). The deformation parameters used are shown in Table 1. Training, validation, and test sets all contained $4113$ images.
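The channel swap and the central crop can be sketched in a few lines. The direction of the circular swap is not specified in the text, so the shift used here is an assumption, the function names are hypothetical, and the deformation step is omitted.

```python
import numpy as np


def center_crop(image, size=(400, 400)):
    """Centrally crop an (H, W, C) image to the given spatial size."""
    h, w = image.shape[:2]
    top = (h - size[0]) // 2
    left = (w - size[1]) // 2
    if top < 0 or left < 0:
        raise ValueError("image is smaller than the requested crop")
    return image[top:top + size[0], left:left + size[1]]


def swap_channels(image):
    """Circularly shift the color channels by one position.

    The shift direction (old R -> new G, old G -> new B, old B -> new R)
    is an assumption made for this sketch.
    """
    return np.roll(image, shift=1, axis=-1)
```

A label image would then be obtained by deforming `swap_channels(image)` with a deformation drawn according to Table 1 before cropping.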

With this data set, all the experiments were conducted without the adversarial loss to study separately the effects of deformation equivariance encouraging losses. The six models trained using each of the data sets are listed in Table 2.

No model was trained with aligned data, since with a data set this simple the mapping would easily be learned perfectly.

5.2 Semi-Synthetic Cross-Modality Brain MRI Synthesis

In recent years, a significant amount of research has emerged on applying deep learning to cross-modality brain MRI synthesis. Synthetically generated modalities have many possible downstream use cases such as segmentation, classification, detection and diagnosis (Xie et al., 2022).

We used brain images from the Information eXtraction from Images (IXI) data set (http://brain-development.org/ixi-dataset/) to generate a semi-synthetic 3D data set for T2 to PD (proton density) synthesis, as in Wang et al. (2021a). The main reason for choosing this task was that brain T2 and PD images are initially well aligned and hence provide good ground truths. Most of the images were already in resolution $(1.25 mm, 0.9375 mm, 0.9375 mm)$ , and the rest were resampled to the same resolution. All the images were normalized based on brain white matter using the implementation by Reinhold et al. (2019). We used $192$ images for training, $19$ for validation, and $365$ for testing.

Simulated deformations were added to the PD images. Translations were sampled from range ( $3.0 mm$ , $10.0 mm$ ), rotations from range ( ${1.0}^{\circ}$ , ${4.0}^{\circ}$ ), $σ$ from range ( $24.0 mm$ , $72.0 mm$ ), and $m$ from range ( $- 15.0 mm$ , $15.0 mm$ ) with values sampled independently for each dimension or rotation axis. The elastic component mean parameter $μ$ was sampled uniformly from image coordinates. The distribution is intentionally skewed to make the non-desired outcome of over-learning the deformation already in $F$ more attractive.

The six models trained with this data set are listed in Table 3. Note that the model NoReg + DefCondAdv + Aug is an oracle as it is trained with the aligned data set. Its performance should provide a good upper bound on the performance of the other models.

All of the models with this data set were trained by sampling random image patches of size $(64, 64, 64)$ from the whole training data set. Additionally the inputs were augmented with low-amplitude noise.

Figure 4: Example images from the semi-synthetic cross-modality MRI synthesis data set. Only one sagittal slice of the 3D volumes is visualized.

5.3 Virtual Histopathology Staining

Virtual histopathology staining using deep learning has emerged as an active research topic in recent years, and has been primarily driven by GAN-based methods (Bayramoglu et al., 2017; Rana et al., 2020; Rivenson et al., 2019). However, a majority of the methods require elastic registration of inputs and labels. Our method simplifies training of the network by eliminating the need to elastically register image pairs.

We used a non-public data set containing unstained and stained tissue whole slide image (WSI) pairs. These are essentially ultra-high-resolution gigapixel images, and virtually staining the unstained tissue WSIs is a highly non-trivial task. Pre-clinical murine prostate tissue samples were prepared at the University of Eastern Finland, Kuopio. The material used was surplus tissue from previous studies (Latonen et al., 2017; Valkonen et al., 2017) where all animal experimentation and care procedures were carried out in accordance with guidelines and regulations of the national Animal Experiment Board of Finland, and were approved by the board of laboratory animal work of the State Provincial Offices of South Finland (licence number ESAVI/6271/04.10.03/2011). The tissue samples were first scanned without staining. This was followed by hematoxylin and eosin (H&E) staining of the unstained tissue samples, and then the stained samples were scanned again. The samples were scanned using a Thunder Imager 3D Tissue slide scanner (Leica Microsystems, Wetzlar, Germany) equipped with a DMC2900 camera at 40X magnification level with a pixel size of 0.353 µm. A total of 17 WSI pairs were included in the data set, each with a resolution of approximately $40 k \times 40 k$ pixels, of which $9$ were used for training, $1$ for validation, and $7$ for testing.

Inputs and labels were coarsely registered and the alignment seems superficially good. However, upon closer inspection, clear misalignments are present.

The six models trained with this data set are listed in Table 4. No oracle model was trained as we did not have ground truth registrations for this data set. For evaluation, however, we did additionally register the images using an open source cross-modality whole slide image registration tool called wsireg (https://github.com/NHPatterson/wsireg). The WSI pairs were registered in two steps, first rigidly for global alignment and then elastically for more granular correspondence between the modalities.

The models were trained by sampling random image patches of size $512 \times 512$ from the whole training data set. Additionally inputs were augmented with low-amplitude noise.

5.4 Evaluation Metrics

For image synthesis evaluation, we used three metrics: peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) (Wang et al., 2004), and normalized mutual information (NMI) (Studholme et al., 1999). NMI was applied between inputs and predictions, as opposed to inputs and aligned labels, as it is used here for measuring the geometric similarity of the predictions to the corresponding inputs.
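For orientation, PSNR and NMI can be computed along the following lines. This is a sketch using the textbook definitions, with NMI taken as $(H(A) + H(B)) / H(A, B)$ following Studholme et al. (1999); the data range, the number of histogram bins, and the function names are assumptions, and the exact evaluation setup may differ from this.

```python
import numpy as np


def psnr(prediction, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB; data_range is an assumed intensity range."""
    mse = np.mean((prediction - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range**2 / mse))


def entropy(probabilities):
    """Shannon entropy of a discrete distribution, ignoring empty bins."""
    p = probabilities[probabilities > 0]
    return float(-np.sum(p * np.log(p)))


def nmi(a, b, bins=32):
    """Normalized mutual information (H(A) + H(B)) / H(A, B) from a joint histogram.

    The bin count is an assumption; the value is undefined for two constant images.
    """
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    joint = joint / joint.sum()
    marginal_a = joint.sum(axis=1)
    marginal_b = joint.sum(axis=0)
    return (entropy(marginal_a) + entropy(marginal_b)) / entropy(joint.ravel())
```

With this definition NMI ranges from 1 for independent images to 2 for identical ones, which matches the scale of the values reported in the result tables.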

Additionally, we evaluated the accuracy of the overall predicted deformation with a metric which we label mean deformation error (MDE). Overall deformation refers to the concatenation of deformations from both the cross-modality and the intra-modality registration components. MDE is then the average length of the difference between a predicted deformation and the corresponding label deformation. Note that in areas with roughly constant pixel values it is not even possible to predict the correct deformation, and the deformation in those areas is determined by the regularization used.
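Read literally, the MDE definition above amounts to the following computation on dense displacement fields. The array layout, the optional mask argument, and the function name are assumptions of this sketch.

```python
import numpy as np


def mean_deformation_error(predicted, reference, mask=None):
    """Average Euclidean length of the difference between two displacement fields.

    predicted, reference: arrays of shape (n_dims, *spatial_shape).
    mask: optional boolean array of shape spatial_shape restricting the average
    to valid locations (an assumption, not stated in the text).
    """
    lengths = np.linalg.norm(predicted - reference, axis=0)
    if mask is not None:
        lengths = lengths[mask]
    return float(lengths.mean())
```

The result is expressed in the same units as the displacement fields, for example pixels or voxels.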

A detailed description of the metrics is given in the supplementary materials.

6 Results and Discussion

6.1 Synthetic

Data set	Model	PSNR	SSIM	NMI	MDE
LR	EqSim + Com	36.45	0.9842	1.159	0.2348
	DefSim + Com	33.95	0.9757	1.155	0.2803
	EqSim	34.24	0.9781	1.154	0.3424
	DefSim	-2.705	1.152e-03	1.042	22.75
	DefSim + Aug	-3.138	6.303e-04	1.049	22.88
	NoReg + Aug	17.39	0.4550	1.079	–
SR	EqSim + Com	41.68	0.9954	1.165	0.07272
	DefSim + Com	33.72	0.9737	1.153	0.1198
	EqSim	35.39	0.9817	1.155	0.0631
	DefSim	4.369	3.000e-03	1.022	31.29
	DefSim + Aug	31.51	0.9559	1.147	0.1256
	NoReg + Aug	24.44	0.7607	1.123	–
LC	EqSim + Com	38.78	0.9905	1.162	0.1385
	DefSim + Com	34.00	0.9745	1.153	0.1422
	EqSim	35.18	0.9819	1.155	0.121
	DefSim	32.74	0.9640	1.149	0.1605
	DefSim + Aug	-1.537	1.526e-03	1.039	24.77
	NoReg + Aug	16.92	0.4617	1.073	–
SC	EqSim + Com	40.89	0.9919	1.166	0.07649
	DefSim + Com	34.05	0.9754	1.154	0.09809
	EqSim	35.89	0.9840	1.157	0.06197
	DefSim	15.93	0.3800	1.070	7.133
	DefSim + Aug	30.93	0.9501	1.145	0.1815
	NoReg + Aug	22.54	0.6696	1.114	–

Table 2: Results for synthetic data set experiments

Figure 5: Example failure mode when training without deformation equivariance encouraging losses. The prediction is shifted towards the top-left and also has an undesired pattern, both of which are compensated for by the registration networks. Images are from the synthetic experiment with data set SR and model DefSim.

Figure 6: Example output from the synthetic experiment with data set LR and model EqSim + Com. The image is from the test set. In addition to the synthesized image, the deformation is accurately reproduced by the cross-modality registration network.

Results for the synthetic data set experiment can be seen in Table 2.

The models using the deformation equivariance encouraging losses systematically outperformed the models not using them. Four out of eight trainings with the registration component but without the deformation equivariance encouraging losses did not converge at all to a meaningful optimum. An example prediction from such a training is shown in Figure 5. The performance of the models without the deformation equivariance losses varies a lot, and in a few cases the performance is even quite good. However, when using either of the deformation equivariance losses, the trainings always converged robustly to a meaningful optimum.

Models using both the equivariance similarity loss and the commutation loss performed the best in terms of the similarity metrics. However, the models having only either the equivariance similarity loss or the commutation loss also performed very well and it is questionable whether the differences in performance when using this kind of synthetic data set will be relevant in real world applications.

Registration accuracy is well below one pixel for all the models with deformation equivariance losses, which is excellent performance.

6.2 Cross-Modality Brain MRI synthesis

Model	PSNR	SSIM	NMI	MDE
EqSim + Com + EqAdv	36.73	0.9557	1.183	0.6975
DefSim + Com + EqAdv	36.99	0.9573	1.181	0.6671
EqSim + EqAdv	36.5	0.953	1.172	0.6971
DefSim + DefUncondAdv	22.82	0.3609	1.038	8.818
DefSim + DefUncondAdv + Aug	30.19	0.8221	1.111	1.556
NoReg + DefCondAdv + Aug $^{a}$	36.86	0.9606	1.193	–
$^{a}$ The model was trained using aligned data and hence is not comparable with the other models.

Table 3: Results for cross-modality brain MRI synthesis experiment

Figure 7: Predictions from the cross-modality brain MRI synthesis experiment. Only one sagittal slice of the 3D volumes is visualized. Note that the model NoReg + DefCondAdv + Aug was trained using aligned data unlike the other models.

Results for cross-modality brain MRI synthesis experiment can be seen in Table 3.

All three proposed variants perform very well compared to the oracle model trained with aligned data. Perhaps surprisingly, on the PSNR metric the model DefSim + Com + EqAdv even outperforms the oracle. The result is similar to the one obtained by Kong et al. (2021), where the model involving registration and trained with non-aligned data also outperformed the oracle pix2pix model trained with aligned data. They theorize that this might be because even their aligned data set is not really aligned. However, those misalignments should then also be present in the evaluation data, so it is questionable whether that explains the improvement. In our case, the most likely explanation is that the predictions by the oracle model seem to be slightly sharper based on visual inspection, and the loss of sharpness in the other models might actually benefit the PSNR metric. Note that all the differences are small, both metric-wise and visually. Example predictions can be seen in Figure 7.

The two configurations DefSim + DefUncondAdv and DefSim + DefUncondAdv + Aug, which use losses similar to Kong et al. (2021), perform significantly worse than our proposed models. Without augmentation the desired distributions of $F (x^{(i)})$ and $\tilde{y}^{(i)}$ are very different and, as expected, the training did not converge to anything meaningful. With augmentation the distributions overlap significantly and the results are clearly better, although still nowhere close to acceptable. With more difficult data, e.g., if there is a systematic posture difference between inputs and labels, designing augmentations which would work even this well could be very difficult.

Deformation accuracy is very good for all three proposed models. The average displacement in the training data is over $10$ voxels, with the maximum displacement being over $32$ voxels, which is large considering that the patches have a side length of $64$ voxels. On the other hand, the small patch size has the benefit that the elastic component within each patch is quite small.

6.3 Virtual Histopathology Staining

Figure 8: Virtually stained histopathological images with different models. An area of epithelial cells (pink) with nuclei (blue) is shown.

Model	PSNR	SSIM	NMI
EqSim + Com + EqAdv	21.73	0.6827	1.0201
DefSim + Com + EqAdv	21.31	0.6750	1.0180
EqSim + EqAdv	21.33	0.6630	1.0175
DefSim + DefUncondAdv	20.45	0.6118	1.0169
DefSim + DefUncondAdv + Aug	21.49	0.6662	1.0169
NoReg + DefCondAdv + Aug	19.81	0.6264	1.0170
All the models were trained with the same coarsely registered data set and were evaluated against the more finely registered data.

Table 4: Results for virtual histopathology staining experiment

Results for virtual histopathology staining experiment can be seen in Table 4.

The virtual histopathology staining data set was a particularly difficult one from the viewpoint of image synthesis. The data is not very uniform, and the shapes and locations of cell nuclei are in some cases impossible to predict perfectly as the input images simply do not contain information about them. However, the deformations between the inputs and the labels are quite simple.

The model with both of the deformation equivariance encouraging losses performed the best on all the metrics. However, the model DefSim + DefUncondAdv + Aug, which uses losses similar to Kong et al. (2021), did not perform much worse, and even outperformed the two other models proposed by us on the PSNR metric. With this data set the distributions of the desired predictions $F (x^{(i)})$ and the training labels $\tilde{y}^{(i)}$ are very close to each other, i.e., there are no systematic geometrical differences. In such settings the method by Kong et al. can be expected to perform relatively well. Note also that the metrics might miss subtle improvements in geometric accuracy since no perfectly registered ground truth labels are available, and the fact that our proposed methods perform systematically better in terms of the input image based NMI metric suggests that this is the case. The non-augmented training DefSim + DefUncondAdv looks superficially as good as the augmented one, but the predictions are clearly shifted, resulting in significantly worse metrics. Without augmentation, the distribution of translations between the inputs and the labels is not symmetric around the origin, resulting in a worse outcome.

Visual inspection reveals an interesting trend in the effects of the different equivariance encouraging losses. Of our three methods, the metric-wise worst performing model EqSim + EqAdv seems to be more inclined to guess nuclei, whereas the two other models with the commutation loss seem to place them only in more certain locations. As a result, the model with only the equivariance similarity loss looks most realistic to the eye, as its nuclei density is closer to the ground truth. The effect is visible in Figure 8. We suspect this is due to the commutation loss more directly enforcing deformation equivariance, which requires the shape of the nuclei to be known. The equivariance similarity loss does not have the same effect, allowing the adversarial training to dominate more. This is also hinted at by the fact that during training the model EqSim + EqAdv had a significantly lower adversarial loss for the generator than the other two models.

7 Conclusions

In this work, we have developed a generic method for training a network for cross-modality image synthesis with paired but misaligned training data by enforcing equivariance to simulated deformations. The method is applicable to a wider set of data sets than earlier methods. We have demonstrated that the method outperforms baseline methods in terms of robustness and prediction quality.

At the same time, the work can be seen as an unsupervised method for cross-modality registration. Accuracy of the generated registrations is good on the data sets with synthetically generated deformations and could probably be improved. Accuracy with real world data sets in different application areas remains a subject for further study.

Acknowledgments

This work was supported by the Academy of Finland (Flagship programme: Finnish Center for Artificial Intelligence [grant no. 345552] and grants no. 315896, 336033, 341967, 335976), ERA PerMed ABCAP (grants no. 334774, 334782), and EU Horizon 2020 INTERVENE (grant no. 101016775). We also acknowledge the computational resources provided by the Aalto Science-IT Project from Computer Science IT.

References

M. Arar, Y. Ginger, D. Danon, A. H. Bermano, and D. Cohen-Or (2020) Unsupervised multi-modal image registration via geometry preserving image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13410–13419. Cited by: §1, §3.2, §4.
V. Arsigny, O. Commowick, X. Pennec, and N. Ayache (2006) A log-euclidean framework for statistics on diffeomorphisms. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 924–931. Cited by: §3.2, §4.3.
J. Ashburner (2007) A fast diffeomorphic image registration algorithm. Neuroimage 38 (1), pp. 95–113. Cited by: §3.2.
N. Bayramoglu, M. Kaakinen, L. Eklund, and J. Heikkila (2017) Towards virtual h&e staining of hyperspectral lung histology images using conditional generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 64–71. Cited by: §1, §5.3.
J. Borovec, J. Kybic, I. Arganda-Carreras, D. V. Sorokin, G. Bueno, A. V. Khvostikov, S. Bakas, I. Eric, C. Chang, S. Heldmann, et al. (2020) ANHIR: automatic non-rigid histological image registration challenge. IEEE transactions on medical imaging 39 (10), pp. 3042–3052. Cited by: §1.
M. Boulanger, J. Nunes, H. Chourak, A. Largent, S. Tahri, O. Acosta, R. De Crevoisier, C. Lafond, and A. Barateau (2021) Deep learning methods to generate synthetic ct from mri in radiotherapy: a literature review. Physica Medica 89, pp. 265–281. Cited by: §1.
L. Chen, X. Liang, C. Shen, S. Jiang, and J. Wang (2020) Synthetic ct generation from cbct images via deep learning. Medical physics 47 (3), pp. 1115–1125. Cited by: §3.1.
Z. Chen, J. Wei, and R. Li (2022) Unsupervised multi-modal medical image registration via discriminator-free image-to-image translation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, L. D. Raedt (Ed.), pp. 834–840. Cited by: §1, §3.2.
A. V. Dalca, G. Balakrishnan, J. Guttag, and M. R. Sabuncu (2018) Unsupervised learning for fast probabilistic diffeomorphic registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 729–738. Cited by: §3.2, §4.3.
T. de Bel, J. Bokhorst, J. van der Laak, and G. Litjens (2021) Residual cyclegan for robust domain transformation of histopathological tissue slides. Medical Image Analysis 70, pp. 102004. Cited by: §1.
T. de Bel, M. Hermsen, J. Kers, J. van der Laak, and G. Litjens (2019) Stain-transforming cycle-consistent generative adversarial networks for improved segmentation of renal histopathology. In Proceedings of The 2nd International Conference on Medical Imaging with Deep Learning, M. J. Cardoso, A. Feragen, B. Glocker, E. Konukoglu, I. Oguz, G. Unal, and T. Vercauteren (Eds.), Proceedings of Machine Learning Research, Vol. 102, pp. 151–163. Cited by: §1.
B. D. De Vos, F. F. Berendsen, M. A. Viergever, H. Sokooti, M. Staring, and I. Išgum (2019) A deep learning framework for unsupervised affine and deformable image registration. Medical image analysis 52, pp. 128–143. Cited by: §3.2.
Y. Fu, Y. Lei, T. Wang, W. J. Curran, T. Liu, and X. Yang (2020) Deep learning in medical image registration: a review. Physics in Medicine & Biology 65 (20), pp. 20TR01. Cited by: §1, §3.2.
A. Hering, L. Hansen, T. C. Mok, A. Chung, H. Siebert, S. Häger, A. Lange, S. Kuckertz, S. Heldmann, W. Shao, et al. (2021) Learn2Reg: comprehensive multi-task medical image registration challenge, dataset and evaluation in the era of deep learning. arXiv preprint arXiv:2112.04489. Cited by: §1.
Y. Hiasa, Y. Otake, M. Takao, T. Matsuoka, K. Takashima, A. Carass, J. L. Prince, N. Sugano, and Y. Sato (2018) Cross-modality image synthesis from unpaired data using cyclegan. In International workshop on simulation and synthesis in medical imaging, pp. 31–41. Cited by: §3.1.
P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §3.1, §4.2, 5th item, 6th item.
T. Joyce, A. Chartsias, and S. A. Tsaftaris (2017) Robust multi-modal mr image synthesis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 347–355. Cited by: §1, §3.1.
S. Kaji and S. Kida (2019) Overview of image-to-image translation by use of deep neural networks: denoising, super-resolution, modality conversion, and reconstruction in medical imaging. Radiological physics and technology 12 (3), pp. 235–248. Cited by: §3.1.
S. Kazemifar, S. McGuire, R. Timmerman, Z. Wardak, D. Nguyen, Y. Park, S. Jiang, and A. Owrangi (2019) MRI-only brain radiotherapy: assessing the dosimetric accuracy of synthetic ct images generated using a deep learning approach. Radiotherapy and Oncology 136, pp. 56–63. Cited by: §3.1.
S. Kida, S. Kaji, K. Nawa, T. Imae, T. Nakamoto, S. Ozaki, T. Ohta, Y. Nozawa, and K. Nakagawa (2019) Cone-beam ct to planning ct synthesis using generative adversarial networks. arXiv preprint arXiv:1901.05773. Cited by: §3.1.
L. Kong, C. Lian, D. Huang, Z. Li, Y. Hu, and Q. Zhou (2021) Breaking the dilemma of medical image-to-image translation. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.). Cited by: §1, §2, §3.1, §4.2, §4, §5, §6.2, §6.2, §6.3.
L. Latonen, M. Scaravilli, A. Gillen, S. Hartikainen, F. Zhang, P. Ruusuvuori, P. Kujala, M. Poutanen, and T. Visakorpi (2017) In vivo expression of mir-32 induces proliferation in prostate epithelium. The American journal of pathology 187 (11), pp. 2546–2557. Cited by: §5.3.
A. P. Leynes, J. Yang, F. Wiesinger, S. S. Kaushik, D. D. Shanbhag, Y. Seo, T. A. Hope, and P. E. Larson (2018) Zero-echo-time and dixon deep pseudo-ct (zedd ct): direct generation of pseudo-ct images for pelvic pet/mri attenuation correction using deep convolutional neural networks with multiparametric mri. Journal of Nuclear Medicine 59 (5), pp. 852–858. Cited by: §3.1.
T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: Figure 1, Figure 2, Figure 3, §5.1.
S. Liu, B. Zhang, Y. Liu, A. Han, H. Shi, T. Guan, and Y. He (2021) Unpaired stain transfer using pathology-consistent constrained generative adversarial networks. IEEE Transactions on Medical Imaging 40 (8), pp. 1977–1989. Cited by: §1.
J. Lu, J. Öfverstedt, J. Lindblad, and N. Sladoje (2021) Is image-to-image translation the panacea for multimodal image registration? a comparative study. arXiv preprint arXiv:2103.16262. Cited by: §1.
N. Pielawski, E. Wetzer, J. Öfverstedt, J. Lu, C. Wählby, J. Lindblad, and N. Sladoje (2020) CoMIR: contrastive multimodal image representation for registration. Advances in neural information processing systems 33, pp. 18433–18444. Cited by: §3.2, §4.
A. Rana, A. Lowe, M. Lithgow, K. Horback, T. Janovitz, A. Da Silva, H. Tsai, V. Shanmugam, A. Bayat, and P. Shah (2020) Use of deep learning to develop and analyze computational hematoxylin and eosin staining of prostate core biopsy images for tumor diagnosis. JAMA network open 3 (5), pp. e205111–e205111. Cited by: §1, §5.3.
J. C. Reinhold, B. E. Dewey, A. Carass, and J. L. Prince (2019) Evaluating the impact of intensity normalization on MR image synthesis. In Medical Imaging 2019: Image Processing, Vol. 10949, pp. 109493H. Cited by: §5.2.
Y. Rivenson, T. Liu, Z. Wei, Y. Zhang, K. de Haan, and A. Ozcan (2019) PhaseStain: the digital staining of label-free quantitative phase microscopy images using deep learning. Light: Science & Applications 8 (1), pp. 1–11. Cited by: §1, §5.3.
M. F. Spadea, M. Maspero, P. Zaffino, and J. Seco (2021) Deep learning based synthetic-ct generation in radiotherapy and pet: a review. Medical physics 48 (11), pp. 6537–6566. Cited by: §1.
M. Staring, S. Klein, and J. P. Pluim (2007) A rigidity penalty term for nonrigid registration. Medical physics 34 (11), pp. 4098–4108. Cited by: §4.3.1.
C. Studholme, D. L. Hill, and D. J. Hawkes (1999) An overlap invariant entropy measure of 3d medical image alignment. Pattern recognition 32 (1), pp. 71–86. Cited by: §5.4.
M. Valkonen, J. Isola, O. Ylinen, V. Muhonen, A. Saxlin, T. Tolonen, M. Nykter, and P. Ruusuvuori (2019) Cytokeratin-supervised deep learning for automatic recognition of epithelial cells in breast cancers stained for er, pr, and ki-67. IEEE transactions on medical imaging 39 (2), pp. 534–542. Cited by: §1.
M. Valkonen, P. Ruusuvuori, K. Kartasalo, M. Nykter, T. Visakorpi, and L. Latonen (2017) Analysis of spatial heterogeneity in normal epithelium and preneoplastic alterations in mouse prostate tumor models. Scientific reports 7 (1), pp. 1–10. Cited by: §5.3.
C. Wang, G. Macnaught, G. Papanastasiou, T. MacGillivray, and D. Newby (2018) Unsupervised learning for cross-domain medical image synthesis using deformation invariant cycle consistency networks. In International Workshop on Simulation and Synthesis in Medical Imaging, pp. 52–60. Cited by: §1, §3.1.
C. Wang, G. Papanastasiou, S. Tsaftaris, G. Yang, C. Gray, D. Newby, G. Macnaught, and T. MacGillivray (2019) TPSDicyc: improved deformation invariant cross-domain medical image synthesis. In International Workshop on Machine Learning for Medical Image Reconstruction, pp. 245–254. Cited by: §1, §3.1.
C. Wang, G. Yang, G. Papanastasiou, S. A. Tsaftaris, D. E. Newby, C. Gray, G. Macnaught, and T. J. MacGillivray (2021a) DiCyc: gan-based deformation invariant cross-domain information fusion for medical image synthesis. Information Fusion 67, pp. 147–160. Cited by: §1, §3.1, §5.2.
T. Wang, Y. Lei, Y. Fu, J. F. Wynne, W. J. Curran, T. Liu, and X. Yang (2021b) A review on medical imaging synthesis using deep learning and its clinical applications. Journal of applied clinical medical physics 22 (1), pp. 11–36. Cited by: §3.1, §3.1, §3.1.
Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §5.4.
G. Xie, J. Wang, Y. Huang, Y. Zheng, F. Zheng, and Y. Jin (2022) A survey of cross-modality brain image synthesis. arXiv preprint arXiv:2202.06997. Cited by: §1, §5.2.
Z. Xu, C. F. Moro, B. Bozóky, and Q. Zhang (2019) GAN-based virtual re-staining: a promising solution for whole slide image analysis. arXiv preprint arXiv:1901.04059. Cited by: §1.
B. Yu, L. Zhou, L. Wang, Y. Shi, J. Fripp, and P. Bourgeat (2019) Ea-gans: edge-aware generative adversarial networks for cross-modality mr image synthesis. IEEE transactions on medical imaging 38 (7), pp. 1750–1762. Cited by: §3.1.
Z. Zhang, L. Yang, and Y. Zheng (2018) Translating and segmenting multimodal medical volumes with cycle-and shape-consistency generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 9242–9251. Cited by: §3.1.
J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §3.1.