**We all live in a connected world and generate a vast amount of connected data.**

Social networks, financial transaction systems, biological networks, transportation systems, and telecommunication nexus are all perfect real-world examples of this.

At times, it’s necessary to be able to visualize and represent data in a way that most people can analyze and interpret the significance. While neural networks such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) exist to serve that purpose, it’s possible to effectively recognize the true worth of datasets using a graph data structure.

In this article, we’ll examine the need for Graph Convolutional Networks (GCNs), understand the structure, and explore the possibilities on how machine learning methods can be used to build predictive models of connected data.

**Neural Networks — The genesis**

The development of Machine Learning algorithms has followed a clear trend from initial simplicity to need-driven complexity. The journey of Neural Networks started with a Perceptron which consists of a simple weight matrix W — a summing function and an activation function for output, where output is given by:

*Output = f(Wx + b)*

*A Perceptron*

*Source: **Semantic Scholar*

However, the limitation with the single-layer perceptron is that it can only capture linear relationships, which has led to the development of Multi-layer Perceptron (MLP) — a multi-layer feedforward artificial neural network. MLP is more expressive and at the same time more complex to handle the non-linear relationship in the data.

But, a multi-layer perceptron also exhibits a certain limitation. This model does not take spatial structure into account and thus suffers from the curse of dimensionality.

*A multi-layer perceptron model*

*Source: **FAVPNG*

To overcome these limitations, various deep learning architectures like Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) came into the picture.

For example, in a CNN architecture, there are three integral operations: Pooling, Convolution, and operations in the Fully Connected Layer. The convolution layer helps in extracting features from the data while the pooling layer is used to reduce the spatial dimensions of the input and the computational complexity of the model that also helps in controlling overfitting.

*An architecture of CNN for the image classification task*

*Source: **mc.ai*

**The Inception of Graph Convolutional Network**

Most of the deep neural network architectures like CNN, RNN are used for applications like:

**Natural Language Processing (NLP):**use cases include Speech Recognition, Text Classification, and several others.**Computer Vision Applications:**frequent occurrence in Face Recognition, Image Classification, Action Recognition, etc.

It’s been also observed that a lot of real-world data does not ** ‘live’** on grids. Some relatively common examples include:

- Social Networks (Eg: Facebook, Twitter)
- Communication Networks (Eg: road, wireless, internet)
- Biological networks (Eg: gene, brain connectivity)
- Citation Networks
- Multi-agent systems

*Road maps* in social* networks*

*Source:* *Vidiani*

In the case of such networks, the inherent structure of the data was more like that of a graph and one had to learn from it. This meant that the standard deep learning architectures like CNNs and RNNs weren’t particularly useful here.

There had to be an enhanced architecture that could encompass real-time data and that led to the birth of the Graph Convolutional Network (GCN).

The architecture of GCN is similar to that of a traditional CNN but it takes graphs as input resulting in convolution and pooling operations to be different in principle. Also, GCN can be used for classification/segmentation/clustering tasks.

Now let’s take a look at some key differences between the two.

**CNN vs GCN**

The complete set of operations performed for GCN are similar except the Convolution and Pooling function as we dwell deeper into the details.

*The three integral operations.*

**Convolution in CNN vs Convolution in GCN**

Just like CNN aims to extract the most important information from the image to classify the image, a GCN passes a filter over the graph, looking for essential vertices and edges that can help classify nodes within the graph.

In the standard convolution, analogous to a graph, each pixel in an image is taken as a node where neighbors are determined by the filter size. The 2D convolution takes a weighted average of each pixel value of the red node along with its neighbors. The neighbors of a node are ordered and have a fixed size.

Whereas in graph convolution, to get the hidden representation of the red node, graph convolution operation takes the average value of the red node along with its neighbors. Different from image data, the neighbors of a node are unordered and variable in size.

* Source: **MissingLink*

**Graph Convolutional Network Architecture**

*An example architecture of GCN for the classification task*

*Source: **Zhihu*

A simple GCN in action goes through the following stages:

**Normalizing the graph structure**and creating input to the neural network with node properties and weights.**The graph convolution layer**passes a filter over the nodes and extracts essential features.**Leaky ReLU and dropout layers**perform a sort of pooling/downsampling over the first convolution.**Second graph convolution**is performed on the downsampled graph information.**Softmax layer**performs the final classification decision.

**How Does GCN Work?**

Here’s the basic definition of a GCN – a neural network that operates on graphs.

So, given a graph G = (V, E), a GCN takes as input:

- an input feature matrix
**N × F⁰**feature matrix,**X**, where**N**is the number of nodes and**F⁰**is the number of input features for each node, and - an
**N × N**matrix representation of the graph structure such as the adjacency matrix**A**of Graph**G**.

A hidden layer in the GCN can thus be written as:

*Hⁱ** = f(**H**ⁱ⁻¹, **A**)),* where *H**⁰ = *** X **and

*f*is a propagation rule.

Each layer ** Hⁱ **corresponds to an

*N*×

*F*

**feature matrix where each row is a feature representation of a node. At each layer, these features are aggregated to form the next layer’s features using the propagation rule**

*ⁱ**f*. In this way, features become increasingly more abstract at each consecutive layer.

The variants of GCN differ only in the choice of the propagation rule *f*.

*the propagation rules(f)*

*the propagation rules(f)*

One of the simplest propagation rules is:

*f*(Hⁱ, A) = σ(AHⁱWⁱ) …………… **(rule 1)**

where Wⁱ is the weight matrix for layer i and σ is a non-linear activation function such as the ReLU function. The weight matrix has dimensions Fⁱ × Fⁱ⁺¹; in other words, the size of the second dimension of the weight matrix determines the number of features at the next layer.

(*Note:** This operation is similar to a filtering operation in CNN since these weights are shared across nodes in the graph.)*

Shortcomings of this rule (rule 1):

- The aggregated representation of a node does not include its own features. It’s an aggregate of features of neighbor nodes.
- Nodes with large degrees will have large values in their feature representation, while nodes with small degrees will have small values. This can cause vanishing or exploding gradients problems and is also problematic for stochastic gradient descent algorithms which are sensitive to feature scaling.

To fix the shortcomings of rule 1, few modifications are proposed for propagation rule like identity matrix (I) is added to the adjacency matrix (A) to include the features of a node in its representation along with its neighbors.

Also, the feature representations are normalized by multiplication with the inverse degree matrix D⁻¹, turning the aggregate into a mean such that the scale of the aggregated feature representation is invariant to node degree.

So, the new propagation rule:

*f*(Hⁱ, A) = σ(D⁻¹ÂHⁱWⁱ),

where Â = A + I, I is the identity matrix, and D⁻¹ is the inverse degree matrix of Â.

*Spectral Rule of Propagation*

*Spectral Rule of Propagation*

In the spectral rule of propagation, while normalizing with the degree matrix D, it takes into account the degree of the i*th* node as well as that of the j*th* node.

This rule weighs neighbors in the weighted sum higher if they have a low degree and lower if they have a high degree.

*(**Note:** This may be helpful when low-degree neighbors provide more useful information then high-degree ones.)*

**Convolution in GCN**

There are two approaches for performing convolution on graphs:

- Spectral approach
- Spatial approach

In the spectral approach, the limitation is that graph structure is the same for all samples i.e. **homogeneous structure**.

But, since most of the real-world graph data have different sizes and structures for different samples i.e. **heterogeneous structure, **spectral approach can’t be used for such cases.

With the spatial approach, this problem is solved.

* Spatial Approach*

- It does not require a homogeneous graph structure.
- It requires preprocessing on the graph to enable learning on it.
- As in CNN, all images fed as input have to be of the same size.
- Here also heterogeneous graph structure needs to be mapped to a fixed-size which is then used for learning.
- This approach is called Graph-embed pooling.

*The flow of Spatial Convolution Operation*

In the spatial approach, graph convolution transforms only the vertex values whereas graph pooling transforms both the vertex values as well as an adjacency matrix.

*Graph Convolution Graph pooling*

**Implementation**

For the task of classification of nodes in a graph (semi-supervised classification), GCNs are used. The code is implemented in TensorFlow. A repository of the code can be found here.

The dataset used here is that of a **Citation Network – Cora dataset.**

So, what exactly is the citation network?

*Citation Network*

*Citation Network*

Citation Network is basically a social network that contains paper sources and is linked by co-citation relationships.

In simple terms, when a document *d _{i}* cites a document

*d*, this can be represented by an arrow going from the node representing document

_{j}*d*to the node representing document representing

_{i}*d*. In this way, the documents from a collection D form a directed graph, which is called a ‘citation graph’ or ‘citation network’.

_{j}*A representation of Citation Network.*

*Cora dataset*

The Cora dataset used here has **2708** scientific publications that need to be classified into corresponding classes and the citation network consists of **5429** links.

This dataset consists of Machine Learning papers. These papers are classified into one of the following **seven classes: ***Case_Based, Genetic_Algorithms, Neural_Networks, Probabilistic_Methods, Reinforcement_Learning, Rule_Learning, Theory.*

After stemming and removing stopwords, there are **1433 **unique words in the vocabulary, which is used for creating the feature vectors of the nodes (publications).

Then each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary.

*Model inputs and architecture*

*Model inputs and architecture*

Input to the network:

- an N by N adjacency matrix (N is the number of nodes),
- an N by D feature matrix (D is the number of features per node), and
- an N by E binary label matrix (E is the number of classes).

The architecture of the network is:

- Input layer
- 2 hidden GCN layers
- And a dense layer.

Optimizer: AdamOptimizer.

Loss: Softmax Cross entropy.

**Use cases of GCN**

**Recommender Systems:**Graph-based recommender systems take items and users as nodes. By leveraging the relations between items and items, users and users, users and items, as well as content information, graph-based recommender systems are able to produce high-quality recommendations.**Traffic:**Accurately forecasting traffic speed, volume or the density of roads in traffic networks, another industrial-level application is taxi-demand prediction. Given historical taxi demands, location information, weather data, and event features.**Natural Language Processing:**A common application of GCNs in natural language processing is text classification. It utilizes the inter-relations of documents or words to infer document labels.**Computer Vision:**Applications of GCNs in computer vision include scene graph generation, action recognition, etc.**Chemistry:**In the field of chemistry, GCNs are used to study the graph structure of molecules/compounds. In a molecule/compound graph, atoms are considered as nodes and chemical bonds are treated as edges.

**Extensions of GCN**

**Attention Mechanisms —**attention-based GCNs maintain state information to capture the neighborhood properties of the nodes. This makes it possible to “remember” important nodes and give them higher weights throughout the learning process.**Graph Spatial-Temporal Networks —**these are GCNs that can support graphs with a fixed structure but inputs that change over time. For example, a traffic system with a fixed set of roads but variable traffic arriving over time.**Graph Generative Networks —**can generate new, realistic structures from data, similar to a Generative Adversarial Network. One promising application domain of GGNs is chemical compound synthesis.**Graph Auto-Encoders –**Graph auto-encoders aim to learn low dimensional vectors via an encoder, and then reconstruct the graph via a decoder. They are popular for learning graph embedding as they provide intermediate vector representation.

On concluding remarks, I would like to highlight that one of the drawbacks of GCN at this point in time is that with every layer of Graph convolution, it aggregates the feature of the node with its neighbor nodes. For example, the first layer will aggregate the features of its direct connections, then the second layer over it will take up the 2-hop neighbors into account.

Therefore, all the nodes in the network must be known beforehand. In other words, it can not provide labels (or classify) for any new entrant in the network at a later point in time as the network has not seen its features before and thus not aggregated it.

In short, it works well for static graphs and datasets which can be aptly mapped into the graphical structure, and hence makes it a suitable choice for everything that involves references and can be closely interconnected.

**References**

[1] Graph Convolutional Networks blog by Thomas Kipf.

[2] Semi-Supervised Classification with Graph Convolutional Networks paper by Thomas Kipf and Max Welling.

[3] An explanation of Graph Convolutional Neural Networks.