Classification of Clusters in MD Simulations of Collision Cascades

Utkarsh Bhardwaj^a, Andrea E. Sand^b and Manoj Warrier^a

^a Bhabha Atomic Research Centre, Vizag, India; ^b University of Helsinki, Finland


Hello and good evening, everyone. I'm going to talk about the classification of clusters in collision cascades, using unsupervised machine learning on new feature descriptors developed specifically to characterize cluster morphologies. We got interested in this problem when the IAEA Materials for Fusion 2018 competition was announced, with a database of 76 Fe and W cascades at PKA energies ranging from 10 keV to 200 keV. We could not finish the whole problem within the given time, but we continued working on it after the competition was over, and I'm quite excited to share our results on the same database, which has over a thousand clusters. So, let's quickly go through some basic background...

Who am I?

I am a researcher at the Bhabha Atomic Research Centre, Vizag, India

My masters was in Computer Science and Engg. with specialisation in Nuclear Engg.

I currently work on applying computational and ML techniques to computational physics problems

You can check my other github repositories to know more about my work which ranges from distributed computing to computer science educational tools:

EasyLambda: Distributed data processing with functional list operations & MPI
Eka: A new old take on learning and teaching programming
Pawn: Command line scripting for distributed data processing

But first, a bit about me: I'm a researcher at Bhabha Atomic Research Centre. I have a Masters in Computer Science and Engg. with specialisation in Nuclear Engg. I work with Manoj Warrier, who guides me in applying my computational ideas to computational physics problems, which I find quite intriguing. He probably knows a lot more physics than I do. Other than computational physics, my other projects range from distributed computing to computer vision, graphics and interfaces, to using machine learning to stylistically grade computer code. You can check my GitHub repositories to find some of these.

A Collision Cascade

Clusters

IN CASCADES

Here we have the primary damage from four cascades at different energies in Fe and W. You can interact with each one to get a closer view of the myriad of shapes and sizes present. The first, at the top left, is an iron cascade at 100 keV. It is a big cascade with many interesting morphologies: there are ring-like shapes, multiple dumbbells arranged in many ways, and single dumbbells. A dumbbell is counted as one interstitial, but when we consider its shape, the other interstitial-vacancy pair needs to be accounted for as well. It is the same with many other shapes. Next, we have iron cascades at relatively lower energies, 10 keV and 5 keV. Still, we see some very peculiar shapes. Tungsten at 50 keV also shows some distinct clusters, with many big crowdion-like chains oriented in parallel. Feel free to explore the interactive plots and get curious about the different possible morphologies, their properties and relationships: do certain shapes appear more at certain energies or in certain elements, how many kinds of shapes can there be, and many more questions that we might have answers for by the end of the presentation, or at least an approach to explore the answers.

Cluster Shapes Matter

Decide diffusion (sessile / glissile)
Capture and Recombination properties
Thermal stability

The higher scale models can use the distribution of different cluster classes along with their properties as inputs.

In terms of the development of models to describe the evolution of radiation damage and its role in irradiation-induced changes in material properties, the important parameters are not only the total number of Frenkel defects per cascade but also the distribution of their population in clusters and the form and mobility of these clusters.
D.J. Bacon, F. Gao, Y.N. Osetsky, J. Nucl. Mater., 276 (1–3) (2000), pp. 1-12

Apart from the interest in knowing more about these shapes, there are practical reasons to carry out a systematic study of cluster shapes, which begins with classification.
- Cluster shapes define the diffusion profile: whether a cluster will move (glissile) or will not move (sessile) and, if it moves, whether it will be a 1D movement or a 3D random walk. These affect how a cluster interacts with other defects and grain boundaries. Shapes also bear on thermal stability, capture and recombination properties, and dislocation properties.
- Higher scale models: the glissile clusters can move and interact with other defects and grain boundaries, whereas the sessile clusters can be nucleation centres for defect growth.

Motivation

Now that we can have big databases of collision cascades, can we use the data itself to systematically find different classes of cluster defects and get insights into each of the classes? Can we do the following:

If we look at an interesting cluster in a cascade, can we ask which other cascades have similar clusters and what they look like?
Can this query be fast for a big database of collision cascades & clusters?
Can we derive from the data which shapes are possible for different elements and energies?
How do the different classes of shapes relate to each other?

Methods Overview

We will quickly go through the ideas behind the methods used at different steps before discussing the results. Here is an overview of the steps.
We start by identifying defects given only the xyz coordinates of all the atoms from MD simulations. Unlike sphere based methods, our algorithm requires no ambiguous threshold. We find not only the defects that count but also the defects that form the shape of a cluster, and mark them as extra defects, which would be missing in a classical Wigner-Seitz approach. By extra defects we mean the extra interstitial-vacancy pairs required to define the shape; for example, a dumbbell needs one such pair in addition to the one interstitial that we count. The algorithm is fast, with O(N) time complexity in the number of atoms N, and does not use any complicated data structure such as a kd-tree.

Next, we group the defects into clusters and then find the feature descriptors to
characterize their shape.

A cluster is characterized by the normalized histograms of the angles between
neighbouring defect triads and of the pair-wise distances. To find the similarity
between two clusters it is sufficient to find the distance between the two
feature vectors that represent them. The feature is a simple, noise and
deformation invariant, computationally efficient yet powerful way to
characterize a shape.

We further use topological network graph based dimensionality reduction
techniques on the feature vectors. The network graph based dimensionality
reduction techniques are well established ways to find a representation in the
reduced dimensional space such that the distance between the distinct points is
maximized [19, 20] and similar points are placed close together. Now, we can
lay all the clusters on a 2D plane, capturing the relationships amongst the various
sorts of morphologies.

We then use an unsupervised clustering algorithm to classify the clusters
without any inputs or assumptions. We find around 25 classes grouped into nine
families. We next explore the properties of these classes, such as dimensionality,
sizes and the distribution of cluster shapes among elements and energy ranges.
Once we have this classification, we can talk systematically about the primary
damage and possibly use it in higher scale models in future. It seems it is just the beginning.

- Since we don't know a lot a priori about the kind of classification we want, we go with an exploratory data analysis approach and get the information out of the data without labels or other assumptions and inputs about the data.
- We will keep our focus on discussing the latter two steps.

Defect Identification

Motivation / Goals

Find and mark pseudo defects
Only final coordinates as inputs: no assumptions or ambiguous inputs
Space efficient: does not need the whole initial lattice in memory
Fast: can be implemented as O(N), N being the number of atoms
Simple implementation: no specialized data structures like kd-trees used

http://arxiv.org/abs/1811.10923

Related algorithms: sphere threshold based methods and Wigner-Seitz.

Algorithm

Calculation of closest lattice site
- Take the modulus of the coordinates by the lattice constant to find the closest lattice site in the first unit cell.
- Find the cell in which an atom is present by taking the ceiling of the quotient when the coordinates are divided by the lattice constant.
- Assign a number to each atom based on the ordering of lattice sites.
Enumeration
- If an atom is associated with a lattice site that is already marked by another atom, label all the associated atoms as interstitials and the lattice site as a vacancy. Also, label the vacancy and the interstitial closest to it among all the associated ones as pseudo.
- Label the lattice sites not associated with any atom as vacancies.
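The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Csaransh implementation: it assumes a bcc crystal with a lattice point at the origin, a cubic box of `n_cells` unit cells per side without periodic wrapping, rounding to the nearest site of each cubic sublattice in place of the modulus and ceiling arithmetic, and a hash map in place of the site numbering. The names `nearest_site` and `find_defects` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

# Assumption: bcc crystal (as for Fe and W) with a lattice point at the origin.
# Basis positions are given in units of the lattice constant.
BCC_BASIS = ((0.0, 0.0, 0.0), (0.5, 0.5, 0.5))


def nearest_site(pos, a):
    """Return ((cx, cy, cz, basis_index), distance) for the bcc site closest to pos."""
    best = None
    for b, shift in enumerate(BCC_BASIS):
        # Rounding the quotient gives the nearest site of this cubic sublattice.
        cell = tuple(round(p / a - s) for p, s in zip(pos, shift))
        site_pos = tuple((c + s) * a for c, s in zip(cell, shift))
        d = math.dist(pos, site_pos)
        if best is None or d < best[1]:
            best = (cell + (b,), d)
    return best


def find_defects(atoms, a, n_cells):
    """Label interstitials, vacancies and pseudo pairs from final coordinates only."""
    occupants = defaultdict(list)  # site -> [(distance, atom index), ...]
    for i, pos in enumerate(atoms):
        site, d = nearest_site(pos, a)
        occupants[site].append((d, i))

    interstitials, vacancies, pseudo = [], [], []
    for site, occ in occupants.items():
        if len(occ) > 1:
            # Site claimed by several atoms: all of them are interstitials,
            # the site is a vacancy, and the closest atom pairs with it as pseudo.
            occ.sort()
            interstitials.extend(i for _, i in occ)
            vacancies.append(site)
            pseudo.append((site, occ[0][1]))

    # Sites that no atom claimed are true vacancies. The lattice is only
    # enumerated here, never stored.
    for cx, cy, cz in product(range(n_cells), repeat=3):
        for b in range(len(BCC_BASIS)):
            if (cx, cy, cz, b) not in occupants:
                vacancies.append((cx, cy, cz, b))

    return {"interstitials": interstitials, "vacancies": vacancies, "pseudo": pseudo}
```

For a single dumbbell this yields two interstitials, one pseudo vacancy at the shared site and one true vacancy elsewhere, so the net count is still one interstitial and one vacancy while the shape information is kept.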

Grouping defects into Clusters

Feature Vector to Characterize Cluster Shapes

Motivation

Characterize cluster shapes in some qualitative sense
Local saliency but should include some sense of global shape
Gloss over small details, strong robustness to noise
Invariant to transformations (rotations, translations, scaling etc.)
Fast similarity search in a large database

The motivation for our feature descriptor is that it should characterize shape in some qualitative sense. The feature descriptor should retain the arrangement of the local neighbourhood of a point defect that makes the shape of its cluster unique, while glossing over thermal noise, various transformations and extra defects appearing here and there. These kinds of noise are inherent to our data. We are going to get clusters that are rotated, translated and possibly scaled to different levels. There are going to be extra defects attached to a peculiar structure, say a ring with a tail of a few extra defects, or an incomplete ring, but we want the feature vector to show that there is some similarity between all these very similar clusters. We want these local salient characteristics to be represented, but we also want the overall global shape to matter to some extent. We want a feature vector that we can use to efficiently pattern match to the closest structures in a big database. Let us now see how we can achieve all these goals.

Motivation - Angles

Here, we see two typical structures with the angles between neighbouring triads. In the ring shape most of the angles would be 120 or 60 degrees, while in a crowdion they would be 0 or 180 degrees. If we create a histogram of all the neighbouring triad angles, we get very different distributions for the two. This more or less forms the first part of our feature descriptor. There are a few more details to it that you can read about or check in the open source code, but the idea is just this. We only take triads with a vacancy as the pivot and the other two defects being either both vacancies or both interstitials in the neighbourhood. This is equivalent to taking the distribution of interstitials and vacancies around fixed vacancy sites. Adding more angles adds no extra information. Thus we get the local picture.
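As a rough illustration of the angular part of the descriptor, here is a minimal sketch. The neighbourhood radius `cutoff`, the bin count `n_bins` and the function names are assumptions made for this example; the actual descriptor has further details that are in the open source code.

```python
import math
from itertools import combinations


def angle_deg(pivot, p, q):
    """Angle in degrees at pivot between the directions to p and q."""
    u = [a - b for a, b in zip(p, pivot)]
    v = [a - b for a, b in zip(q, pivot)]
    norm = math.hypot(*u) * math.hypot(*v)
    cos_angle = sum(a * b for a, b in zip(u, v)) / norm
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def angle_histogram(vacancies, interstitials, cutoff, n_bins=36):
    """Normalized histogram of triad angles with a vacancy as the pivot.

    The other two defects of a triad are either both vacancies or both
    interstitials lying within `cutoff` of the pivot.
    """
    hist = [0.0] * n_bins
    width = 180.0 / n_bins
    for pivot in vacancies:
        for group in (vacancies, interstitials):
            near = [p for p in group if 0.0 < math.dist(p, pivot) <= cutoff]
            for p, q in combinations(near, 2):
                b = min(int(angle_deg(pivot, p, q) / width), n_bins - 1)
                hist[b] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

A collinear arrangement puts all its weight in the bins near 0 and 180 degrees, whereas a ring spreads it around 60 and 120 degrees, which is what makes the two histograms easy to tell apart.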

Motivation - Distances

Distance Measures

Euclidean
KL Divergence
Cosine
Quadratic Form Distance Functions
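The measures listed above can be sketched for two normalized histograms as follows. This is an illustrative sketch only: the smoothing constant `eps`, the bin-similarity matrix used in the quadratic form (1 - |i - j| / (n - 1)) and the helper `closest` are assumptions for this example, not choices confirmed by the talk.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm if norm else 1.0


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) after additive smoothing, so that empty bins do not give log(0)."""
    ps = [a + eps for a in p]
    qs = [b + eps for b in q]
    sp, sq = sum(ps), sum(qs)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(ps, qs))


def quadratic_form(p, q):
    """sqrt(d^T A d) with A[i][j] = 1 - |i - j| / (n - 1); needs at least two bins.

    Unlike the bin-by-bin measures, weight moved to an adjacent bin costs
    less than weight moved to a distant one.
    """
    n = len(p)
    d = [a - b for a, b in zip(p, q)]
    total = sum(d[i] * d[j] * (1.0 - abs(i - j) / (n - 1))
                for i in range(n) for j in range(n))
    return math.sqrt(max(total, 0.0))


def closest(query, database, k=2, dist=euclidean):
    """Indices of the k feature vectors in database nearest to query."""
    return sorted(range(len(database)), key=lambda i: dist(query, database[i]))[:k]
```

Because every cluster is reduced to one fixed-length vector, a similarity query is just a nearest-neighbour search over these vectors, and any of the functions above can be passed as `dist`.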

Some Typical Feature Vectors

Let's look at some of the typical cluster morphologies and their angular feature vectors. In each case the first cluster shape is picked by us, and the next two are found by searching the database for the closest matching clusters using the distances between the feature descriptors. The first shape we look at is a crowdion. We can see that the feature vectors, as expected, show peaks at zero and 180 degrees or in adjacent bins. Next, we have the ring shapes. As you can see, the shapes are qualitatively similar but not exactly the same. However, their feature vectors are characteristic. The algorithm successfully finds the top two closest matching clusters having rings in spite of the differences in the details. Here we have two dumbbells arranged in a sort of perpendicular, sloping fashion. The feature vectors are quite different from the others. Next, we have a bunch of crowdions and dumbbells arranged in parallel orientation. Even though the plots are drawn on the principal axes found using PCA, the shapes still differ in their details. The features again are quite distinctive and the algorithm successfully finds the similar structures.

Other Features

Shape distance measures: sensitive to noise; examples include Hausdorff distance, closest points search etc.
Shape Context method: a global feature that compares complete shapes; sensitive to the noise inherent in our data
Saliency features from point-cloud applications: targeted at large numbers of points, especially surface points
Graph CNN and other deep learning methods: require labelling; few points per cluster and little data can affect accuracy

Visualizing Similarities Between All The Clusters At Once

Using neighbour graph dimensionality reduction techniques like t-SNE and UMAP

Similarity based dimensionality reduction - t-SNE

Each point on the left hand side represents a cluster feature vector. Graph based dimensionality reduction techniques like t-SNE (t-Distributed Stochastic Neighbor Embedding), used here, or UMAP (Uniform Manifold Approximation and Projection) are well established techniques for analysing relationships between data points (here, clusters). Every point on the left was originally the feature descriptor of more than 50 dimensions that we saw.

A rather interesting way to think of t-SNE dimensionality reduction is as an n-body problem, where we place all the points in 2D. Each point experiences some force from
all the others, repulsive or attractive, based on their similarity in the original higher dimensional basis. The points are moved slowly according to these forces. The other,
most common way to look at the problem is to find a probability distribution in the higher dimensional space such that close points get sampled closely and vice versa,
and then find a probability distribution in the lower dimensional space that can mimic it.
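For reference, the two views can be written down using the standard t-SNE formulation, which is not specific to this work. Here x_i are the feature vectors, y_i their 2D positions, N the number of clusters and sigma_i a per-point bandwidth set by the perplexity parameter.

```latex
% Similarities in the original feature space
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Similarities in the 2D embedding (Student-t kernel)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}

% Cost: mismatch between the two distributions
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

% Gradient: the "force" on point i
\frac{\partial C}{\partial y_i}
  = 4 \sum_{j} \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)
      \left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}
```

Each term of the gradient acts along the line joining y_i and y_j. Under gradient descent it pulls the pair together when p_ij > q_ij (the points are more similar than their 2D distance suggests) and pushes them apart otherwise, which is the n-body picture.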

The plot is interactive: you can click on any point on the left to see the
cluster it represents. We have used vacancy and interstitial cluster labels to
colour the points. Green points are interstitials and red ones are vacancies. We see that
the dimensionality reduction differentiates well between these two classes.
Now let's look a little closer. If we click on points in the island formed of many points at the top, we find that all of them are crowdions. However, within the same island, on the left we have crowdions that are a bit distorted, while
on the right are the perfectly collinear ones. The local relationship is well maintained
within the crowdion class. The dimensionality reduction on the feature descriptors
does seem promising. You can check some more points by interacting with the plot.

Classification

UMAP for dimensionality reduction
HDBSCAN for clustering

Here is an interactive plot, where you can click on a point on the left to see the cluster it represents. Different colours represent different classes. The image below schematically shows how the dimensionality reduction above has placed the different classes of shapes based on their similarity. We name the classes such that related ones are in the same family, represented by the initial numeral.

In the lower left corner is class 1a, containing all the crowdions. The second family has three classes, all having a pair of parallel dumbbells, although arranged differently: in 2a both dumbbells are perfectly aligned, in 2b they are slightly shifted and in 2c even more. As we go through the classes of the 4th and 5th families, the number of dumbbells or crowdions increases and the sizes of the crowdion chains also increase, but they all remain oriented in parallel. Right beside the classes with parallel orientations appear the similar sized classes with random orientations. In this way the dimensionality reduction retains the global relationship among the different classes.

Let's look at the classes with parallel orientation first. We have 4a, which has one dumbbell and one crowdion, then 4b, which has two dumbbells and one crowdion; 4c has a dumbbell triplet arranged in a slightly non-planar, shifted fashion; 5a has three crowdions and dumbbells in a highly non-planar fashion; 5b has four; 5c clusters are medium sized, around five to tens; 5d has even more, culminating in 5f with hundreds of big crowdion chains.

Now, let's look at the classes with random orientations of dumbbells and crowdions. The third family, with classes 3a and 3b, appears close to the second family; these are dumbbells arranged in a slightly non-planar, sloping T-like shape and 7-like shape, respectively. The classes of the sixth family appear next to the fourth and fifth, with similar sizes. Class 6a has mostly triplets, or a couple more dumbbells; 6b has a peculiar tripod-like arrangement, sometimes with one leg shorter than the others; 6c has 6±2 randomly oriented dumbbells and sometimes a few crowdions arranged in an almost planar fashion. 7a has many such randomly oriented dumbbells and a few crowdions in a totally non-planar arrangement, while 7b has many longer crowdions, sometimes with semi-formed ring-like structures on one end. The number of defects in 7a is comparable to that of 5c, while 7b is close to 5d. A little below these classes appears the eighth class, with clusters having 3D ring-like shapes; something similar is sometimes referred to as C15 in the literature.

There are vacancy clusters in the ninth family, whose size grows from 9a to 9d. Although it is difficult to talk qualitatively about the shapes of these vacancy classes, we can see the differences in their properties, as we explore next.

Properties of Classes

Sizes

The x-axis has the different cluster labels and the y-axis represents the number of point defects. The values are point estimates. We see that the initial small classes have very little variation in their sizes. The classes keep growing in size from the first family to the fifth; these classes have parallel orientations of dumbbells and crowdions, except the third family with randomly oriented dumbbells. The interstitial cluster classes with random orientations don't grow as big as the parallel oriented ones. The biggest one among these, 7b, has parallel oriented crowdions with some semi-formed ring-like random orientations on the ends, or sets of parallel oriented crowdion chains which are not parallel to each other. The purely randomly oriented classes like the eighth or 7a are much smaller. From this, we can conclude that clusters with parallel orientations can grow much bigger than those with random orientations. Among the vacancy cluster classes, the ninth family, the last one has rather big clusters, while the other three are comparable.

Dimensionality

On the left hand side we have the variance along the first principal axis. A high value here implies linearity. On the right hand side we have the variance along the first two principal axes. A high value here implies planarity. These are determined using Principal Component Analysis applied to each cluster. On the x-axes are the class labels. On one hand, we have the crowdion class, which is linear; another one that comes close is 2c, which has two dumbbells that are shifted a lot, making the spread along the first principal axis much larger than along the other two axes. Looking at planarity, small classes like 2a, 2c, 4a and 4b are perfectly planar, while 2b, 3a and 4c too are close to planar. The most non-planar small sized class is 3b, the 7-shaped arrangement of two dumbbells. For its size, 5a as well is quite a 3D arrangement of a parallel triplet. The parallel orientation classes gradually increase in planarity as they grow from the medium sized 5c to the big 5f. The eighth class, with rings, is 3D as expected. 6a is a 3D structure, while 6b and 6c can easily qualify as almost planar. Among the vacancy classes, 9a can be described as almost planar, while 9d is quite 3D.
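The two measures can be sketched as fractions of the total variance, as below. Normalizing by the total variance and the function names are assumptions for this illustration, and the eigenvalues are obtained with the standard closed-form (trigonometric) solution for a symmetric 3x3 matrix so that the example needs no external libraries.

```python
import math


def covariance(points):
    """3x3 covariance matrix of a list of (x, y, z) defect coordinates."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
             for j in range(3)] for i in range(3)]


def sym3_eigenvalues(m):
    """Eigenvalues of a symmetric 3x3 matrix, in descending order."""
    p1 = m[0][1] ** 2 + m[0][2] ** 2 + m[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return sorted((m[0][0], m[1][1], m[2][2]), reverse=True)
    q = (m[0][0] + m[1][1] + m[2][2]) / 3.0
    p2 = sum((m[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(m[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
           - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
           + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det / 2.0))
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [e1, 3.0 * q - e1 - e3, e3]


def variance_fractions(points):
    """Return (linearity, planarity): the variance fraction on the first
    principal axis and on the first two principal axes."""
    eig = [max(v, 0.0) for v in sym3_eigenvalues(covariance(points))]
    total = sum(eig)
    if total == 0.0:
        return 0.0, 0.0
    return eig[0] / total, (eig[0] + eig[1]) / total
```

A perfectly collinear cluster gives a first value of 1, and a perfectly planar one gives a second value of 1 whatever its orientation, since the principal axes follow the cluster rather than the simulation box.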

Distribution Across Elements & Energies

On the x-axis we have the class labels and on the y-axis the element-energy pairs. On the left hand side is the box plot representation. The heat map on the right hand side can be referred to for exact values. We see that the classes with parallel orientations, such as the 4th and 5th families, are dominant in W, while those with random orientations, which include the 3rd and 8th families, are more prevalent in Fe than in W. The crowdions are predominantly present in W. This can be attributed to the difference in the stable ground configurations of Fe and W, which are the 110 dumbbell and the 111 crowdion, respectively. We can say that in Fe dumbbells and crowdions prefer to arrange in random orientations, and in W it is the opposite. In addition to this generalization, there are a few more interesting things to notice here. The classes of dumbbell pairs in the 2nd family prefer different elements. In Fe, the pair of dumbbells arranges in a perfectly aligned manner, represented by class 2a, while in W they prefer the shifted arrangements represented by 2b and 2c. While 6a, 6c and 7a, which represent small to medium sized random orientations, preferably occur in Fe cascades, the 6b class with its tripod-like shape appears mostly in W. We can certainly attribute these preferences to differences in the stable ground configurations, but I still find them very interesting to look at.

More Properties To Look At

Distribution across different angles of PKA launch
Dislocation loops, diffusion properties, stability etc.
Distribution across cascades with and without subcascades

Concluding Remarks

The classification gives a way to systematically study the zoo of defect clusters formed in primary damage due to irradiation.
Since the classification is fully automated, it can be applied to large databases of simulations of new materials.
The same approach, with possibly different feature vector, can be extended to find structures and classes in subcascades and cascades themselves.

THE END

Download Csaransh and give it a try on your data (https://github.com/haptork/csaransh/)

Discuss the results and get back with suggestions, feedback and code at the GitHub repository. We are open to collaboration.

Have a look at the paper, which describes each step in more detail and gives references to the related topics discussed. (http://arxiv.org/abs/1811.10923)

You can check the links to some cool ML algorithms the project uses: