Thoughts and Theory
Maxim Ziatdinov¹ ² & Sergei V. Kalinin¹
¹ Center for Nanophase Materials Sciences and ² Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, United States
Scientific research often yields enormous volumes of data. Examples range from astronomy, with its exquisite optical and radio telescopes, to satellite geospatial imagery, medical and biological imaging, neutron and X-ray scattering, and scanning probe and electron microscopies. Very often the imaging information is available in multiple bands or spectra at each spatial location, giving rise to complex multidimensional data sets. A typical example is hyperspectral satellite imaging, where a full optical spectrum is measured at each spatial location. Even more imaging modalities emerge in domain science applications such as scanning probe or electron microscopy. In all these cases, the first step for the researcher is to visualize and subsequently interpret the imaging data in terms of objects of interest, be it new types of stars or exoplanets in astronomy, plant growth patterns or hidden installations in geospatial imaging, or the electronic properties of a single atom in scanning tunneling microscopy.

The human eye is extremely well suited to the analysis of imaging data, a capability that emerged over millennia of evolution on the savannah, where noticing a predator or prey in time was a condition for survival. At the same time, human perception (with few exceptions) is generally not well suited for comprehension and object recognition in 3-, 4-, or higher-dimensional data. Hence, the natural question is whether machine learning can help in this matter.
The scientific field abounds with examples of visionaries whose foresight significantly preempted the field. A well-known example is Ada Lovelace, who explored the concept of computer programming long before computers became available. Another is J. C. R. Licklider, a founding director of ARPA's computing research program, whose book "Libraries of the Future", written in the early 1960s, predicted much of the modern internet-based knowledge infrastructure, including electronic books and journals, databases, etc. In microscopy, one of the first visionary perspectives was given by Noel Bonnet in his publications delineating applications of machine learning and multivariate statistics to hyperspectral data [1, 2]. The first practical applications of multivariate statistical methods in microscopy appeared several years later, when personal computers became powerful enough to handle the "huge" 50 x 50 x 1000 hyperspectral data sets in electron [3] and scanning probe [4] microscopy and produce acceptable images.
The ML method of choice (or availability) at the time was principal component analysis (PCA), a technique decomposing the hyperspectral data set R(x, y, E) into a linear combination of loading maps aᵢ(x, y) and components Aᵢ(E). In the most general sense, the components define the specific behaviors of the system, and the loading maps contain the information on where each behavior manifests. It was often argued that PCA does not assign physical meaning to the components (which is correct: they are defined only in the information-theoretic sense), and consequently, multiple versions of linear and non-linear dimensionality reduction methods allowing for specific physical constraints have emerged over the last decade [5]. Currently, many of these methods are available as part of basic Python libraries such as scikit-learn, and even more are available as stand-alone libraries. However, the key aspect of these methods as applied to hyperspectral data is that they explore common features in the spectral domain only: exchanging pixels in the spatial domain does not affect the components and only leads to the exchange of pixels in the loading maps.
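As a minimal sketch of this decomposition, here is scikit-learn's PCA applied to a synthetic 50 x 50 x 1000 hyperspectral cube (the peak shapes, mixture weights, and noise level below are made up purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)

# Synthetic hyperspectral cube: 50 x 50 pixels, 1000-point spectrum per pixel
nx, ny, ne = 50, 50, 1000
energy = np.linspace(0, 1, ne)

# Two "behaviors": Gaussian peaks at different energies, mixed across the image
comp_a = np.exp(-((energy - 0.3) ** 2) / 0.005)
comp_b = np.exp(-((energy - 0.7) ** 2) / 0.005)
weights = np.random.rand(nx, ny)  # spatially varying mixture fraction
cube = (weights[..., None] * comp_a + (1 - weights[..., None]) * comp_b
        + 0.01 * np.random.randn(nx, ny, ne))

# Unfold R(x, y, E) -> (pixels, E), fit PCA, fold loadings back into maps
X = cube.reshape(-1, ne)
pca = PCA(n_components=4)
loadings = pca.fit_transform(X).reshape(nx, ny, -1)  # a_i(x, y)
components = pca.components_                          # A_i(E)
print(pca.explained_variance_ratio_)
```

Each row of `pca.components_` is a component Aᵢ(E), and each slice of `loadings` is the corresponding loading map aᵢ(x, y); for this two-phase toy mixture, nearly all the variance falls into the first component.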
However, in many cases the scientists analyzing imaging data are interested in specific shapes. In the ancient human world, we sought to detect the predator before the opposite happened. In astronomy, we aim to identify the shapes of galaxies or weak variations in the intensity of stars indicative of the presence of exoplanets. In scanning probe or electron microscopy, we aim to discover ordering patterns signifying symmetry breaking and the emergence of novel electronic or structural orders. Sometimes, PCA and related methods allow such discoveries via examination of the loading maps. However, in many cases the relevant spatially resolved information is spread over tens of loading maps. Hence, the question is whether there are machine learning algorithms that can explore shapes.
One of the main difficulties here is that shapes in images can a priori have any orientation. While modern deep convolutional neural networks (DCNNs) are translationally equivariant, i.e., they allow for the detection of a specific object anywhere within the image, they are typically limited when it comes to the discovery of multiple rotational classes of a single object. For example, in classical cases such as ImageNet or supervised learning of cats in images, the cats are generally oriented vertically. While data sets with cats in arbitrary orientations could be made, that would likely raise serious questions about the ethical treatment of animals. Consequently, even supervised networks are not particularly good at recognizing a rotated version of the same object (see, for example, the famous duck vs. rabbit illusion). This is an even bigger problem for unsupervised learning, when we do not know what specific objects we are trying to find.
Here come the rotationally invariant variational autoencoders (which we have learned to love over the last year, since they have allowed us to resolve several physical problems that stymied us for a decade). Let's explore this concept step by step. An autoencoder generally refers to a special class of neural networks in which the original data set is compressed to a small number of continuous latent variables and then expanded back to the original data set. The network is trained to improve (in the sense of a chosen loss function) the data reconstruction, which sounds like a somewhat limited objective (after all, we could have just left the data alone). The key trick is that in the process the network learns how to optimally describe the data in terms of the latent variables. This allows the autoencoder to discover optimal representations while also rejecting the noise present in the data; hence the applications in denoising, image reconstruction, etc.
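The bottleneck idea can be sketched in a few lines of NumPy (the weights here are random, i.e. untrained; a real autoencoder would learn them by minimizing a reconstruction loss, and would use deeper nonlinear networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy autoencoder structure: 784-pixel image -> 2 latent variables -> 784 pixels.
# The weights are random stand-ins; only the bottleneck shape matters here.
W_enc = rng.normal(scale=0.01, size=(784, 2))
W_dec = rng.normal(scale=0.01, size=(2, 784))

def encode(x):
    return np.tanh(x @ W_enc)   # compress to a 2D latent code z

def decode(z):
    return z @ W_dec            # expand back to image space

x = rng.random(784)             # a stand-in "image"
z = encode(x)
x_rec = decode(z)
print(z.shape, x_rec.shape)     # (2,) (784,)
```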
The variational autoencoder (VAE) builds upon this concept by making the reconstruction process probabilistic. In this case, the latent variables are drawn from a certain (typically Gaussian) distribution, and the training seeks to optimize both the reconstruction loss and the Kullback-Leibler divergence between the distribution of the encoded latent variables and the prior distribution over the latent space. One of the most attractive features of VAEs is their capability to disentangle representations of the data, i.e., to discover specific traits such as handwriting styles in the MNIST data, emotions in human faces, or the complex manifolds defining the degrees of freedom in robotic systems. There are multiple excellent sources explaining VAEs, both on Medium and on arXiv [6].
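As a minimal numerical sketch (a NumPy illustration of the two ingredients, not a full training loop): for a Gaussian encoder q(z|x) = N(μ, σ²) and a standard normal prior, the KL term has a closed form, and latent samples are drawn via the reparameterization trick:

```python
import numpy as np

# Closed-form KL divergence between q(z|x) = N(mu, sigma^2) and the
# standard normal prior p(z) = N(0, 1), summed over latent dimensions
def kl_gaussian(mu, log_var):
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

# Reparameterization trick: z = mu + sigma * eps, so gradients can flow
# through the sampling step during training
def sample_latent(mu, log_var, rng):
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros(2), np.zeros(2)   # an encoder that matches the prior
print(kl_gaussian(mu, log_var))          # evaluates to zero: no divergence
z = sample_latent(mu, log_var, rng)
print(z.shape)
```

The full training objective (the negative ELBO) is the sum of this KL term and a reconstruction loss on the decoded images.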
Here, we extend the VAE to the analysis of rotationally invariant features. In the rotationally invariant VAE (rVAE), one "neuron" in the latent layer is designated to absorb arbitrary rotations of structures in images, whereas all other neurons are used for disentangling the remaining factors of variation. The trick is to write the VAE's decoder as a function of the image (spatial) coordinates, which can be realized either via the spatial generator net [7] (for a fully-connected decoder) or via a slightly modified version of the spatial broadcast generator [8] (for a convolutional decoder). Here, we realized the rVAE via the Pyro probabilistic programming language. For completeness, we introduce and discuss both the simple rVAE and the class-conditioned rVAE. More complex versions of the (r)VAE, including the joint (r)VAE, the semi-supervised (r)VAE, and the (r)VAE augmented with normalizing flows, which allow for unknown and partially known classes, will be discussed later.
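The coordinate trick at the heart of the rVAE can be sketched as follows (a NumPy illustration of how a latent angle θ rotates the coordinate grid fed to the decoder; the actual implementation in the notebook operates on Pyro/PyTorch tensors):

```python
import numpy as np

# The decoder sees image coordinates, and one latent "neuron" theta rotates
# that coordinate grid before decoding, absorbing arbitrary rotations.
def make_grid(size):
    xx, yy = np.meshgrid(np.linspace(-1, 1, size), np.linspace(-1, 1, size))
    return np.stack([xx.ravel(), yy.ravel()], axis=1)  # (size*size, 2)

def rotate_grid(grid, theta):
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return grid @ rot.T

grid = make_grid(28)
rotated = rotate_grid(grid, np.pi / 2)
# A 90-degree rotation maps (x, y) -> (-y, x)
print(np.allclose(rotated[:, 0], -grid[:, 1]))
```

The decoder then evaluates its output intensity at each rotated coordinate, so the same latent content code can reproduce a structure at any orientation.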
Let’s play with the rVAE and its variants using the classical MNIST data set (note that instead of digits these could be different atomic or molecular structures). First, we create a rotated MNIST data set that looks like this:

We then use the rotated digits as features and keep the labels and rotation angles as ground-truth data to compare with the results of the rVAE and class-conditioned rVAE analyses.
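The data-set construction can be sketched as follows (using scipy's `rotate` on random stand-in arrays; the accompanying notebook applies the same operation to the actual MNIST images):

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)

# Stand-ins for MNIST digits (28 x 28 images)
images = rng.random((8, 28, 28))
angles = rng.uniform(-90, 90, size=len(images))  # ground-truth rotation angles

# Rotate each image by its random angle, keeping the original 28 x 28 frame
rotated = np.stack([rotate(im, ang, reshape=False, order=1)
                    for im, ang in zip(images, angles)])
print(rotated.shape)  # (8, 28, 28)
```

The `angles` array is saved alongside the labels so that the latent angle recovered by the rVAE can later be compared against the ground truth.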
First, we explore the simple VAE. Here, we derive the latent space and the distribution of points in the latent space, colored by angle and by digit. As a reminder, a convenient way to illustrate VAE operation is via the distribution of the (encoded) points in the latent space and via the latent space representation projected into image space. In the former, all features in the data set are encoded and visualized in the latent plane. If some attributes of the data set are known (e.g., classes or rotation angles), they can be used to set the color scale to examine trends in the latent space. Alternatively, the latent space can be sampled on a rectangular grid of points, and the corresponding decoded images can be plotted as a latent space representation. This analysis is particularly convenient when the latent space is 2D and the features are 2D as well.

Shown above are the latent space representations of the rotated MNIST data for the simple VAE. Here, we can clearly see that the latent space contains digits in all orientations. It may be fun to experiment with the number of reconstructed sub-images via the function vae.manifold2d(d=some_positive_integer) in the accompanying notebook. In particular, for a very large d = 100 the individual digits can no longer be recognized, but the latent space representation starts to exhibit patterns corresponding to the areas occupied by the individual digits.
Examination of the latent space shows that the angle changes along the first latent dimension, whereas the digit changes along the second latent dimension, forming well-separated clusters. The notable exception is the cluster corresponding to '9' and '6', which cannot be distinguished after a 180° rotation (this is true for the VAE and for a human alike). Overall, this clearly illustrates the concept of disentangled data representations, where the rotational angle and the class emerge as the two most prominent factors of variation within the data.
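The grid-sampling visualization described above can be sketched as follows (the decoder weights here are random stand-ins; in practice one would use the trained VAE decoder, which is what vae.manifold2d does in the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
W_dec = rng.normal(scale=0.1, size=(2, 28 * 28))  # stand-in "trained" decoder

def decode(z):
    return (z @ W_dec).reshape(28, 28)

# Sample the 2D latent plane on a d x d rectangular grid, decode each point,
# and tile the decoded images into one large "manifold" image
d = 5
z1 = np.linspace(-3, 3, d)
z2 = np.linspace(-3, 3, d)
manifold = np.block([[decode(np.array([a, b])) for b in z2] for a in z1])
print(manifold.shape)  # (140, 140)
```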
Now, we repeat the same analysis with the conditional VAE (cVAE). In this case, the decoded object during training and prediction is the image concatenated with its class label, and hence we have a separate latent space for each of the classes.

Shown above are the latent spaces for digits '0', '1', and '2'. Note a very intriguing pattern: in this case, the "unphysical shapes" form a cluster (e.g., around the central region in the image with "1"s) surrounded by "physically realizable" digits. This behavior is linked to a rather fundamental aspect of physics, namely the topological structure of the latent space and the data space. Some excellent sources on this are Refs. [9, 10]. The encoded data is also shown in the figure above. Here, the latent angle forms a well-defined circle (going from negative to positive rotation angles in the counter-clockwise direction), whereas the digit classes are now distributed as a single blob (as expected, given that each class forms its own latent space). The take-home message is that both disentangled factors capture the angle variation, and that neither the VAE nor the cVAE deals with rotations well.
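The class-conditioning step can be sketched as follows (a NumPy illustration of concatenating a one-hot label to the encoder input; the decoder input is conditioned in the same way):

```python
import numpy as np

# In a cVAE, the class label is one-hot encoded and concatenated to the
# inputs of both the encoder and the decoder.
def one_hot(labels, num_classes=10):
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

images = np.random.rand(4, 28 * 28)   # flattened 28 x 28 images
labels = np.array([0, 1, 2, 9])
encoder_in = np.concatenate([images, one_hot(labels)], axis=1)
print(encoder_in.shape)  # (4, 794)
```

Because the label is supplied externally, the latent variables no longer need to encode digit identity, which is why each class effectively gets its own latent space.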
Now let’s add rotation invariance, as incorporated in the rVAE. The latent space representations are shown below.

In this case, the digits in the latent space are oriented in one direction. The latent space shows that the angles are now random, whereas the digits form well-defined clusters (rather remarkable, given that we encode the data set using only two latent variables). We also get the angle as one of the latent variables, and the latent angle and the ground-truth angle are compared in the figure above. Note that they are linearly related, but at the same time the latent angles have a broad distribution. This is unsurprising, since angle is one of the characteristics of handwriting style and varies from person to person! The discovered latent representation hence compensates for this fact by treating the rotation as an additional latent variable and separating it from the other factors of variation.
Finally, we illustrate the class-conditioned rVAE (crVAE) applied to this data set:

In this case, our latent reconstructions clearly show that within each latent space, the digits are oriented in the same direction, and latent variables now encode very subtle details of the handwriting styles. Have a look at the ‘0’, ‘1’, and ‘2’ above, and play with the other digits in the accompanying notebook.
This summarizes the introduction of the rVAE and crVAE. Feel free to play with the notebook and apply it to your own data sets. The authors use VAEs and their extensions in their research on atomically resolved and mesoscopic imaging in scanning probe and electron microscopies, but these methods can be applied to a much broader variety of optical, chemical, and other imaging, as well as across other computer science domains. Please also check out our AtomAI software package for applying these and other deep/machine learning tools to scientific imaging.
Finally, in the scientific world, we acknowledge the sponsor that funded this research. This effort was performed and supported at Oak Ridge National Laboratory’s Center for Nanophase Materials Sciences (CNMS), a U.S. Department of Energy, Office of Science User Facility. You can take a virtual walk through it using this link and tell us if you want to know more.
The executable Google Colab notebook is available via GitHub.
1. Bonnet, N., Multivariate statistical methods for the analysis of microscope image series: applications in materials science. J. Microsc.-Oxf. 1998, 190, 2–18.
2. Bonnet, N., Artificial intelligence and pattern recognition techniques in microscope image processing and analysis. In Advances in Imaging and Electron Physics, Hawkes, P. W., Ed. Elsevier Academic Press Inc: San Diego, 2000; Vol. 114, pp 1–77.
3. Bosman, M.; Watanabe, M.; Alexander, D. T. L.; Keast, V. J., Mapping chemical and bonding information using multivariate analysis of electron energy-loss spectrum images. Ultramicroscopy 2006, 106 (11–12), 1024–1032.
4. Jesse, S.; Kalinin, S. V., Principal component and spatial correlation analysis of spectroscopic-imaging data in scanning probe microscopy. Nanotechnology 2009, 20 (8), 085714.
5. Kannan, R.; Ievlev, A. V.; Laanait, N.; Ziatdinov, M. A.; Vasudevan, R. K.; Jesse, S.; Kalinin, S. V., Deep data analysis via physically constrained linear unmixing: universal framework, domain examples, and a community-wide platform. Adv. Struct. Chem. Imag. 2018, 4, 20.
6. Kingma, D. P.; Welling, M., An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691 2019.
7. Bepler, T.; Zhong, E.; Kelley, K.; Brignole, E.; Berger, B., Explicitly disentangling image content from translation and rotation with spatial-VAE. Advances in Neural Information Processing Systems 2019, 15409–15419.
8. Watters, N.; Matthey, L.; Burgess, C. P.; Lerchner, A., Spatial broadcast decoder: A simple architecture for learning disentangled representations in VAEs. arXiv preprint arXiv:1901.07017 2019.
9. Batson, J.; Haaf, C. G.; Kahn, Y.; Roberts, D. A., Topological obstructions to autoencoding. arXiv preprint arXiv:2102.08380 2021.
10. Falorsi, L.; de Haan, P.; Davidson, T. R.; De Cao, N.; Weiler, M.; Forré, P.; Cohen, T. S., Explorations in homeomorphic variational auto-encoding. arXiv preprint arXiv:1807.04689 2018.





