
Lessons learned from AI Masterclass

On July 7th, 2018, Artificial Intelligence for Development (AID), in association with Facebook Developer Circles: Kathmandu, organized a four-week AI Masterclass series, the content of which was provided exclusively by the Facebook AI Research (FAIR) team.

Our Community Leader Aayush Poudel participated in this masterclass. Below are his thoughts and what he learned from attending all the lectures in the series.


Believing that everyone who reads this piece has already heard of or used artificial intelligence in their lives, I’ll dive straight into the meaty bits of what I learnt at the AI Masterclass, which ran for four Saturdays with exclusive content from Facebook.

Day 1: What makes me think?
Neural Networks 101

For every action of a computer, there has to be a binary distinction for completion of a task. A simple yes or no would suffice; a simple 0 or 1, on or off is enough. However, that becomes restrictive when answers aren’t always so black and white. And in our lives, it’s rare for instances or problems to have such a distinct division: we live in a grey world. So, how we go about making a computer work in grey situations is by trying to model it as close to our learning process as possible.

The First Grey Bit

Since the early era of machine learning, computer scientists have tried modelling the computer brain by replicating the functions of our own brains. Even in our brains, neurons (cells in the brain) work in a binary manner. The electrical impulses passed across synapses (connections between the neurons) make our brains work. For a neuron to ‘fire’ (send information as an electrical impulse), the incoming electrical impulse has to be above a certain limit (a certain voltage). We attempt to model our artificial intelligence’s brain in a similar manner. Scientists first pictured the perceptron as working similarly to a neuron in our brain: a perceptron fires when certain mathematical conditions are satisfied. The model of the perceptron was as follows:

Here, this model of the perceptron has several components, and we will only be skimming over the equation. The sigma (σ) is the activation function for this perceptron (not elaborated on in this piece). The ‘y’ is the output of the perceptron and the ‘x’ is its input. The interpretation of ‘y’ varies depending upon the task that the model is supposed to accomplish (classification, regression, or clustering). The segment within the angular brackets plays the same role as the segment ‘ax’ in the linear equation used in the regression example. What we input are, more often than not, vectors with several dimensions (features or characteristics, in simple words), and the output ‘y’ that we receive can be interpreted in several ways.
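To make the description concrete, here is a minimal sketch of a perceptron, assuming the common form y = σ(⟨w, x⟩ + b). The weight vector w, the bias b, the step activation, and the AND example are assumptions of mine for illustration, not material from the lecture.

```python
import numpy as np

def step(z):
    """Threshold activation: fire (1) only when the weighted input exceeds 0."""
    return 1 if z > 0 else 0

def perceptron(x, w, b, activation=step):
    """Compute y = activation(<w, x> + b) for a single input vector x."""
    return activation(np.dot(w, x) + b)

# Hypothetical weights and bias, chosen so the perceptron behaves like a logical AND.
w = np.array([1.0, 1.0])
b = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron(np.array(x), w, b) for x in inputs]
print(outputs)  # fires only for the input (1, 1)
```

Swapping `step` for a smooth function such as a sigmoid gives a graded output between 0 and 1, which is one way of handling the ‘grey’ cases.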

Separate, Extrapolate, Segregate
Types of machine learning

Broadly, there are three kinds of machine learning tasks: classification, regression, and clustering. The first of the three is classification. It’s when the task at hand deals with separating the input into different predefined classes. A traditional and simple example is separating pictures of dogs and cats from a mixed batch of both. Regression is when one has to predict a real-number value of a certain function on the basis of past data. A common example is predicting the cost of a house on the basis of all the data available about current houses and their prices. Clustering involves grouping together data that are similar to each other. It differs from classification in the method used: while in classification the programmer is aware of the classes that the data belong to, in clustering the program is supposed to find relationships between the data and work out the clusters or classes they may belong to.


Here, we elaborate on the process of machine learning with the regression task. To understand how weights and biases fit into the equation, we first take an example of a linear regression model of the form:

y = ax + b

In this equation, the input is ‘x’ while the output is ‘y’. The coefficient of ‘x’ is ‘a’ and the y-intercept is b. Making predictions (y) based on certain input (x) is what we do as well. The predictions work when our coefficient (a) and intercept (b) can accurately model the situation. To understand why this works, let’s take a look at a scatter plot:

For the system above, the equation that models it is:

y = 2.0622x + 0.1922

i.e., the above equation ‘predicts’ the outputs (y) based on the inputs (x).

Here, the coefficients (2.0622, 0.1922) are what model the prediction. They are the weight (a) and the bias (b) of the model in the form (a, b). The correct weights and biases are what we extract from the data.
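As a rough illustration of extracting (a, b) from data, the sketch below fits a straight line with NumPy’s least-squares `polyfit`. The data points are synthetic: I generate them around the line quoted above purely as an assumption, since the original scatter plot’s data is not available.

```python
import numpy as np

# Synthetic data (an assumption for illustration): points scattered around
# y = 2.0622x + 0.1922 with a little random noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0622 * x + 0.1922 + rng.normal(0, 0.5, size=x.size)

# Fit a degree-1 polynomial; polyfit returns the coefficients as (a, b).
a, b = np.polyfit(x, y, deg=1)
print(f"weight a = {a:.4f}, bias b = {b:.4f}")

# Use the fitted line to predict the output for a new input.
x_new = 12.0
print(f"prediction for x = {x_new}: {a * x_new + b:.2f}")
```

The recovered values land close to, but not exactly on, the numbers used to generate the data, because the noise hides the true line.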


Models of the world, you have nothing to lose but your errors
Loss Function


The process of learning in Machine Learning always occurs through error correction. Much like the way we learn, our model starts out with a lot of errors and bad predictions. When we look at more examples and compare them with what we have learnt, we are able to correct our understanding of the matter. The same principle applies to our machine learning models. There are several methods for reaching the correct values of weights and biases, the most prominent of which, for such a regression problem, is gradient descent. We start off with a random guess for the weights and biases. Then we calculate the mean squared error between the predicted and real values. This error is what the loss function measures. Our goal in machine learning, most of the time, is to minimize this loss function’s value, i.e., to make our models have as low an error as possible. Again, I’ll not go into the underlying math, as others have covered it in much greater depth and clarity. Nonetheless, what occurs is the repeated adjustment of our weights and biases in whichever direction lowers the error in our model.

Fig: The Gradient Descent
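For readers who prefer code to math, here is a minimal sketch of gradient descent on the mean squared error of the line y = ax + b. The learning rate, the step count, the toy data, and the fixed starting guess (used instead of a random one so the run is repeatable) are all assumptions of mine.

```python
import numpy as np

def mse(a, b, x, y):
    """Loss function: mean squared error between predictions and real values."""
    return np.mean((a * x + b - y) ** 2)

def gradient_descent(x, y, lr=0.01, steps=5000):
    a, b = 0.0, 0.0  # starting guess (fixed here; a random guess also works)
    losses = []
    for _ in range(steps):
        error = a * x + b - y
        grad_a = 2 * np.mean(error * x)  # derivative of the loss w.r.t. a
        grad_b = 2 * np.mean(error)      # derivative of the loss w.r.t. b
        a -= lr * grad_a                 # step against the gradient
        b -= lr * grad_b
        losses.append(mse(a, b, x, y))
    return a, b, losses

# Toy data lying exactly on y = 2x + 1.
x = np.linspace(0, 5, 20)
y = 2 * x + 1
a, b, losses = gradient_descent(x, y)
print(f"a = {a:.3f}, b = {b:.3f}, final loss = {losses[-1]:.2e}")
```

Each step nudges the weight and bias in the direction that lowers the loss, which is the ‘remodeling’ described above.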

We need to go deeper
Hidden Layers

Even in the largest of neural networks, the major part of the work is minimizing the error that occurs at each step of the way. The most rudimentary way of thinking about neural network layers is as follows. In the first layer, the most basic characteristics of the input are studied; only the fundamental features are used to form a prediction. With each subsequent layer, those characteristics are combined into different features that work together to better predict the output.

To explain this concept, we can use our previous example with the house pricing. Consider the following chart.


In this chart, it is evident that, although the model is somewhat accurate in predicting house prices solely on the basis of house area, it is not perfect. We can assume that the price of a house may also depend on how far it is from the town center. The number of bedrooms in the house may be another feature that is important for the price. All of these could be the features or variables to be calculated or modeled in the first layer of the network. The second layer could be any number of combinations that are relevant to the price. For example, it may be the case that houses with more rooms and a large area are cheaper than houses with fewer rooms and a similarly large area. Such multivariate relationships could be worked out in the deeper layers of the network. For the most part, as the first layer of the network only deals with the primary features of the house (area, number of rooms, distance from the town center, etc.), it is a form of ‘shallow learning’, or traditional machine learning. When the additional multivariate calculations are involved in the deeper layers, the network is what is called a deep learning neural network.


The Network
Neural nets

Superficially, the perceptron is one kind of neuron that forms our neural networks. Several of these interpret the data and model it in different ways. Finding all the correct weights and biases gives us a complete picture of the network.

Here, the first layer could be assumed to only be analyzing the basic or primary features of the input data while the hidden layers deduce the multivariate relationships of the features of an object. The output layer is what we receive as either a real valued output for our regression problem or a binary result for our classification problems.
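The flow from input layer to hidden layer to output layer can be sketched as a single forward pass. Everything below is an assumption for illustration: the three house features, the layer sizes, the ReLU activation, and the random (untrained) weights, so the printed ‘price’ is meaningless until the weights are learned.

```python
import numpy as np

def relu(z):
    """A common activation function: keep positive values, zero out the rest."""
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    """One pass through a network with a single hidden layer."""
    hidden = relu(W1 @ x + b1)  # hidden layer combines the primary features
    output = W2 @ hidden + b2   # output layer: one real value (regression)
    return hidden, output

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # 4 hidden units -> 1 output

# Hypothetical input: area, number of bedrooms, distance from the town center.
x = np.array([120.0, 3.0, 2.5])
hidden, price = forward(x, W1, b1, W2, b2)
print(hidden.shape, price.shape)
```

Training would adjust `W1`, `b1`, `W2`, and `b2` with the same error-lowering procedure used for the single line, applied to every layer.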

Practice and Execution
Training and Testing

In machine learning, as with neural networks in general, the model requires validation for the task it is set up to do, i.e., we need to know if the model is predicting accurately. To do that, we first train the model on a dataset so that it is able to figure out the relevant weights and biases. When the model is trained, we test it against a separate dataset similar to what we expect real-world scenarios to be and see how well the model does. One problem that the model might face is overfitting, where the model has high accuracy on the training set but low accuracy on the testing set. This may be because the model tried too hard to match the specifics of the training set and lost all sense of generality. On the other end of the spectrum is underfitting. This is when the model is too general to be useful for predictions and has low accuracy even on the training set.

Source: Overfitting and Underfitting: A Complete Example by William Koehrsen
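A small experiment can make the two failure modes visible. This is a sketch under my own assumptions: synthetic noisy sine data, a 30/10 train/test split, and polynomials of degree 1, 4, and 9 standing in for ‘too simple’, ‘about right’, and ‘too flexible’ models.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)

# Train on the first 30 points, test on the 10 the model has never seen.
x_train, y_train = x[:30], y[:30]
x_test, y_test = x[30:], y[30:]

def fit_and_score(degree):
    """Fit a polynomial on the training set; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

results = {degree: fit_and_score(degree) for degree in (1, 4, 9)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The straight line has a high error even on the training set (underfitting). Raising the degree always lowers the training error, and the thing to watch is whether the test error follows it down or starts to part ways with it (overfitting).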


This leads us to another important aspect of machine learning: supervision. There are mainly three ways in which data is fed into the system. In the first, the training data is sent in with all the correct labels. That is to say, all the data is curated and labelled according to what the model should try to classify similar objects into. This is called supervised learning, because the model is trained under the supervision of the programmer and their classification of the data. This is very similar to the way kids are taught manners by their parents: they get reprimanded for ill manners and rewarded for proper ones. We use this in classification problems. It is expensive in terms of labor, time, and money for the preliminary data management.

The next method is a bit wilder, if you choose to see it that way. It involves letting the model play with the data as it likes. The data is not labelled in any way and the model itself looks for relationships between the features of the data units. This is what is called unsupervised learning, much like how we learn not to touch thorns and hot pots, or learn to touch soft fabric or fur. This is what clustering employs. It is cheaper in terms of labor, time, and money for data collection and cleaning.
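As one example of a model finding structure in unlabelled data, here is a minimal k-means clustering sketch. K-means is my choice of algorithm for illustration (the lecture did not name one here), and the two synthetic blobs of points are likewise assumptions.

```python
import numpy as np

def kmeans(points, k, steps=20, seed=0):
    """Group points into k clusters without using any labels."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(steps):
        # Assign every point to its nearest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# Two unlabelled blobs of points (synthetic data).
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.3, size=(20, 2))
blob_b = rng.normal(5.0, 0.3, size=(20, 2))
points = np.vstack([blob_a, blob_b])

labels, centers = kmeans(points, k=2)
print(labels)
```

The algorithm is never told which blob a point came from, yet the points from each blob end up sharing a cluster label.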

The next is semi-supervised or weakly supervised learning. Some data are labelled and most are not. Here, the model learns from the few labels provided and is supposed to segregate the rest of the data accordingly. This is cheaper than supervised learning but usually not as accurate.

Ending Remarks
Concluding the First Day

The first day, undoubtedly, was very heavy mathematically, and I’ve left out a lot of the details from that portion. However, it showed how integral a knowledge of math is to understanding the algorithms relevant to the problem at hand. The day ended with the participants talking to each other about what we had totally misunderstood and laughing about it. It was an absolutely stunning start to the Masterclass.


Day 2: Can you hear me?
Speech Recognition

As per the notice, we’d be learning speech recognition, the technology with which our phones are able to schedule our appointments when we shout at them from the shower.

Spectrograms over Waveforms
Initial data collection

My high-school knowledge would dictate that, if you want to understand sound, you have to look at the waveform of the recording and its frequency-related characteristics. The lecture started with the same point, and went on to show why that would be painfully expensive. While a lot of data can be gathered from the waveform, a spectrogram does the work much better.

A spectrogram is a simple graph that records auditory input in a very beautiful manner. On a two-dimensional plane, the x-axis is for time. The y-axis records frequency along with its amplitude by depicting warmer colours on frequencies of higher amplitudes. The figure below shows how it’s used.

Fig: Spectrogram 3d representation of the recording. Time is on the x-axis and frequency is on the y-axis. The elevations or warmer colours represent higher amplitudes of those frequencies. (Left) “This is what it looks like on a computer screen.” (Right) Sounds “/a/, /e/, /i/, /o/, /u/”
Notice that, for human sounds, the frequencies at the lower ends have higher amplitudes.
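To show where such a picture comes from, here is a bare-bones spectrogram computed with NumPy alone. The frame size, hop length, Hann window, and the 440 Hz test tone are my own assumptions; real speech pipelines typically use a dedicated audio library.

```python
import numpy as np

def spectrogram(signal, frame_size=256, hop=128):
    """Slice the waveform into overlapping frames and take each frame's spectrum."""
    window = np.hanning(frame_size)
    starts = range(0, len(signal) - frame_size + 1, hop)
    frames = np.array([signal[s:s + frame_size] * window for s in starts])
    # Rows become frequency bins (y-axis), columns become time frames (x-axis).
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8000                    # samples per second
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)    # one second of a pure 440 Hz tone

spec = spectrogram(tone)
peak_bin = spec.mean(axis=1).argmax()
peak_hz = peak_bin * sample_rate / 256
print(spec.shape, f"strongest frequency near {peak_hz:.1f} Hz")
```

Each value in `spec` is the amplitude of one frequency at one moment, which is exactly what the colours in the figure encode.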

Using spectrograms as opposed to waveforms has a two-fold benefit. First, the dimensionality of the data to be recorded is reduced to one-sixth of what would be recorded if we used waveforms. This makes the data much more efficient to work with, as the underlying calculations are much faster with fewer dimensions. Moreover, since the spectrogram is essentially a picture representation of sound, we can use highly optimized image processing models like the Convolutional Neural Networks we were going to learn about in later sessions.

Diff-rent Syl-a-bles
Filtering and Phoneme Separation

The next step in the process is to refine the data we collect. After we record the sound in a spectrogram, we focus on the frequency region that human sound usually lies in. Roughly speaking, we increase the sensitivity of the sound recognition in the lower frequency range that humans speak in, using what are called ‘Mel filter banks’. After that, we separate the recording into phonemes. Phonemes are the segments of speech that produce a unique sound. Loosely speaking, each syllable of a word breaks down into one or more phonemes. The separation is manual and very expensive with current methods, as it requires a tremendous amount of labour and time. However, it’s the only method we have, as separation by computers is not yet as reliable as separation by a human.
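The idea behind Mel filter banks can be sketched with the widely used mel-scale conversion formula, m = 2595 · log10(1 + f / 700). The formula variant, the number of filters, and the 0 to 4000 Hz range below are assumptions on my part, not details from the lecture.

```python
import numpy as np

def hz_to_mel(f):
    """Convert a frequency in Hz to the mel scale (one common formula)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_min=0.0, f_max=4000.0):
    """Center frequencies (in Hz) of filters spaced evenly on the mel scale."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    return mel_to_hz(mels)[1:-1]  # drop the two edge points

centers = mel_center_frequencies(10)
gaps = np.diff(centers)
print(np.round(centers))
```

The filters sit close together at low frequencies and spread out at high ones, which is how the sensitivity is concentrated in the range where human speech lives.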

After we separate the phonemes, we train the model to predict them. The phonemes act as the vectors for the model, i.e., the input values. We use a Bayesian prediction model to predict which phonemes were uttered. It basically returns a prediction matrix that records how closely the tested sound matches each phoneme in the model. Then, we use it to improve the model.

Additionally, we could also use affricates, consonant clusters, or diphthongs as vectors with their own probability distribution according to the recorded phoneme cluster. These connected sounds, or ‘dependencies’, could be exploited to better predict the phonemes as their associations produce unique sounds. An algorithm called Connectionist Temporal Classification can be used to match variable length audio inputs to their correct phonemes.
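One small piece of Connectionist Temporal Classification is easy to show: the rule that turns a frame-by-frame prediction into a shorter output by merging repeats and dropping a special blank symbol. The sketch below covers only that decoding rule, not the training loss, and it uses letters in place of phonemes and `_` as the blank, both as assumptions for readability.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol meaning "no new sound in this frame"

def ctc_collapse(frame_labels):
    """Merge consecutive repeated labels, then remove the blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != BLANK]

# One predicted label per audio frame; slow sounds span several frames.
frames = list("hh_e_ll_ll_oo")
decoded = "".join(ctc_collapse(frames))
print(decoded)
```

The blank between the two runs of `l` is what lets a genuinely doubled sound survive the merging step, so audio of any length can map to an output of the right length.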

How do I write this sound?

After the phonemes are properly separated and predicted against their phonetic or IPA characters, they are then transcribed into possible words. This step has its own nuances regarding how one could improve the accuracy of transcribed words by checking to see if the generated sentences are probable or meaningful. This can be done by matching the predicted outputs against giant databanks of books and transcriptions and using the most probable ones. For example, if you said, “I’ll land soon”, there is the possibility that the algorithm also produces an alternate output along with the correct output. In this case, it could be, “Island soon.” It’s obvious that the first output about arriving somewhere can be found to have a greater probability of being a complete sentence as opposed to a declaration of an island coming soon.
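Here is a toy version of that probability check, using a bigram model with add-one smoothing. The four-sentence corpus is made up for illustration, and real systems use vastly larger text collections and stronger language models.

```python
from collections import Counter

# A tiny, hypothetical stand-in for the "giant databanks" of text.
corpus = [
    "i'll land soon",
    "we will land soon",
    "i'll call soon",
    "the island is small",
]

def bigrams(words):
    """Pair each word with the word before it, with start/end markers."""
    return list(zip(["<s>"] + words, words + ["</s>"]))

context_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    for prev, word in bigrams(sentence.split()):
        context_counts[prev] += 1
        bigram_counts[(prev, word)] += 1
vocab = {w for s in corpus for w in s.split()} | {"</s>"}

def score(sentence):
    """Probability of a sentence under an add-one smoothed bigram model."""
    p = 1.0
    for prev, word in bigrams(sentence.split()):
        p *= (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))
    return p

candidates = ["i'll land soon", "island soon"]
best = max(candidates, key=score)
print(best)
```

Because ‘land soon’ and ‘i'll land’ have been seen before while ‘island soon’ has not, the first transcription scores higher.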

So, what’s the hassle?

There are several areas where improvements are definitely possible, and the collection of relevant data is one of the most important. Even with proper data collection, the labelling costs for the data, as mentioned before, are extremely high. Consequently, unsupervised or semi-supervised techniques for phonemic separation are vital for better results. Additionally, efficient noise cancellation, accuracy across accents, and a lot of other edge cases require special attention to make the speech recognition model as accurate as it can be.

Even bigger problems exist for languages that don’t have sufficient recorded audio or languages with no standard orthography or writing practices (e.g., Wu Chinese is different enough from Mandarin to be considered a separate language but doesn’t have its own standard writing system, and the same is true of Swiss German).

Now what?
Current Practices

For further improvement and research, current practices focus on extracting data in different ways, such as through visual inputs of a person speaking, as in the Blue-Lips Database for French audio data. Semi-supervised learning is also being researched as a way to reduce the cost of phonemic separation. From all that we saw that day, the conclusion was that knowledge of linguistics and psychoacoustics is vital for the improvement of current speech recognition models.

Day 3: The all-seeing AI
Computer Vision

While vision seems to be something that we, as humans, are able to get used to very easily, it’s a behemoth of a task when we try to make our computers understand vision the way we understand it. From understanding what separates a dog from a cat to distinguishing between a plane crash and a peaceful parking, computer vision is as important as it is difficult to perfect. In our class on the third day, we toured through the history of attempts at improving computer vision.

Old eyes, New eyes:
Feature Extraction

Initially, we need to understand what makes an image. On the most basic level, an image is an assemblage of pixels that form edges, corners, gradients, and a variety of ‘low-level’ characteristics, or features, as we call them. Gradually, once we are able to efficiently detect a line or an edge, we move on to understanding shapes, or rather how they’re made. Then come the ‘high-level’ features such as eyes, ears, snouts, and so on when we’re analysing animal images. After feature extraction, images are semantically segmented into different objects. In older approaches, the low-level features were extracted manually and then the networks were trained to detect them accordingly. This method was, naturally, tedious, time-consuming, and expensive. This was the case when only ‘shallow’ machine learning models, i.e., models analogous to a single layer from deep neural networks, were used.

With advancements in the efficiency of deep neural networks, however, vision improved significantly compared to previous shallow learning models. Feature extraction could be made automatic with the use of layered convolutional neural networks (CNNs), which will be elaborated on later.

ImageNet is a vast dataset consisting of 14.2 million images that have been hierarchically organized, much like Princeton’s WordNet, where related words or search items have properly delegated links to other words. They are all labelled. The ILSVRC (ImageNet Large Scale Visual Recognition Challenge) uses ImageNet for training and testing image recognition. The metrics for the accuracy of these models are the top-1 and top-5 error rates. They record how often the true label of the image (what the image is) is missing from the model’s first prediction or from its first five predictions, respectively. The accuracy has greatly increased with the implementation of deep neural networks in recent years of the ILSVRC, as depicted in the statistics shown below.
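The top-k error rate is simple enough to compute by hand. In this sketch the scores are made-up numbers for four images and three classes, so I use top-1 and top-2 instead of top-5; none of these values come from ILSVRC.

```python
import numpy as np

def top_k_error(scores, true_labels, k):
    """Fraction of images whose true class is NOT among the k highest scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]  # indices of the k best classes
    hits = np.any(top_k == true_labels[:, None], axis=1)
    return 1.0 - hits.mean()

# Hypothetical model scores: one row per image, one column per class.
scores = np.array([
    [0.1, 0.6, 0.3],  # true class 1: best guess is right
    [0.5, 0.1, 0.4],  # true class 2: wrong first guess, right second guess
    [0.2, 0.3, 0.5],  # true class 0: not in the top two
    [0.7, 0.2, 0.1],  # true class 0: best guess is right
])
true_labels = np.array([1, 2, 0, 0])

top1 = top_k_error(scores, true_labels, k=1)
top2 = top_k_error(scores, true_labels, k=2)
print(top1, top2)
```

The error can only go down as k grows, which is why top-5 error is always the more forgiving of the two numbers.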

What’s your number?
MNIST Dataset

For a simple implementation of algorithms for vision (that is monstrously complicated beneath), the MNIST dataset is a neat example. The dataset consists of sixty thousand or so images of handwritten digits from 0 to 9. These digits are laid out on a 28×28 grid of pixels with a single channel for colour (grayscale). What that means is that the image is made up of pixels that can be represented as numbers between 0 and 255 according to the intensity of the pixel colour value, as demonstrated below:

Left: A subset of the dataset. Right: Numeric representation of the pixel colour values as intensity between 0 and 255 of the colour white.

These images are then fed in as inputs to the neural network: in this case, a CNN. While in other deep learning implementations every output of the previous layer is used as an input to the successive layer, in CNNs only small local subsets of the pixel grid are looked at together, through filters called kernels. For the images in question, these kernels have the dimension 1x3x3, where 3×3 is the kernel size and the leading 1 reflects the single colour channel (grayscale). For images with Red, Green, and Blue colour channels, kernels of size 3x3x3 are used. The kernel outputs are then aggregated by a technique called pooling, which summarizes neighbouring outputs before they are analysed in the deeper layers. What is important to note is that the object in question can sit anywhere within the frame of the image, and its position is also handled during the process of pooling.
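A single 3×3 kernel followed by pooling can be written out in a few lines. This is a sketch under my own assumptions: a synthetic 28×28 image that is dark on the left and bright on the right, a hand-picked vertical-edge kernel (a trained CNN learns its kernels instead), and 2×2 max pooling.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding), as a CNN layer does."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Keep only the strongest response in each size x size block."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

# Synthetic grayscale image: left half black (0), right half white (255).
image = np.zeros((28, 28))
image[:, 14:] = 255.0

# A kernel that responds where brightness increases from left to right.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

features = convolve2d(image, kernel)
pooled = max_pool(features)
print(features.shape, pooled.shape, pooled.max())
```

The kernel only ever sees nine pixels at a time, and pooling shrinks the 26×26 response map to 13×13 while keeping the strong response along the edge.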

I already know this!
Transfer learning

As becomes more and more evident when we go into Machine Learning, a large dataset is vital for proper training and model creation. However, it is not always possible to generate an enormous dataset for every project in image recognition. We can thus use the concept of transfer learning. Using models that were previously trained for other image recognition projects, the layers for low-level features can be retained while the layers for higher-level features are trained anew. Doing so enables us to use a smaller amount of highly curated and labelled data efficiently and still get better accuracy across different image recognition projects.

For clarity, consider a model trained to differentiate cats from dogs. Its low-level feature layers can recognize the edges and shapes in the images. If we were to use the model to interpret images of buildings or banknotes, we could do so by extracting those low-level feature layers from the model and building the higher-level feature layers with the limited amount of highly curated data that we have.

To divide or to create?
Discriminative vs Generative:

In computer vision, there are two broad categories of neural networks on the basis of their focus. The first is the Discriminative Neural Network (DNN), which primarily focuses on how an image can be classified or segmented into classes or labels. The second is the Generative Neural Network (GNN), which generates data by inferring from images or generates images entirely. They are super useful when used together in Generative Adversarial Networks (GANs). In GANs, a GNN outputs an image generated from a noisy input and sends it as input to the DNN for classification. The task of the network is simple: the DNN attempts to decrease its error rate by specifying whether an input image is one generated by the GNN or an actual image from the real world, and the task of the GNN is to generate images from scratch that are close enough to the real-world examples to confuse the DNN into thinking that the fake (or generated) images are actually real.

For further study of this, one could refer to Fader Nets.

Who is the man in this picture?
Image Captioning

Using Recurrent Neural Networks (that we would explore in the next sessions), captions could be generated from Images using both Computer Vision and Natural Language Processing methods mixed into one. These could be very useful for the visually impaired.

What can’t I do?

While these models may seem to get the picture (pun intended) pretty accurately, they are far from perfect. The image datasets we have may contain various unintended biases that are difficult to monitor if not closely curated. A simple example of a bias is when some object is much more prevalent in the dataset than it is in the real world, e.g., for a cat-dog distinction model, the dataset may have 10x more images of dogs than cats, which would produce an unintended bias in the model. To curb such problems, datasets like CLEVR are used.

Moving on

Needless to mention, the utility of refined computer vision could be monumental. For starters, it could be used to fill in video frames to increase the frame rate of a recorded video. By integrating physics into the system, videos could also be extrapolated based on the starting conditions of the images. One could even use it to intelligently increase the resolution of images and to make the details of blurry images more refined. As we go deeper into these networks we have built, we can be assured that intelligent computer vision, as good as or better than humans’, could be a reality in the future.

Day 4.1: How does this make you feel?
Natural Language Processing, Sentiment Analysis, Inference, Question Answering.

Everything we speak, hear, read, or write begins with us making use of meaning before we use words. What I mean to say is that we associate a certain meaning with a certain context before we try to understand what the words said in that context mean and which of the many meanings that a word bears is relevant. Similarly, when we try to train our network to understand what we mean when we say “Siri, show me the nearest McDonalds,” we need a way to make Siri understand our natural language.

What are words?
Word-embeddings and Context-embeddings

The first thing we need to stress before we continue is the use of the term “Natural” in Natural Language Processing. It’s a simple fact that computers are most intimate with structured languages like C, Python, etc. They literally function using machine code, which is a strict language with definite meanings. Such languages are easily understood by a machine but are difficult for humans to communicate in, unless they are programmers. In such a situation, how can a human communicate with a computer using human languages? That is where the term ‘natural’ comes into play. How humans interact is not definite. Even setting aside visual cues and body language, our speech does not have a single definite meaning. Consequently, our language is natural: it is very raw in its use for communication. We have to first feed it into our systems by structuring it in a way that computers understand, i.e., using numbers.

In order to convert our words and sentences into numbers, we use what are called ‘dimensions’. A simple example follows:

Fig: Dimensional characteristics of words as vectors in dimensions Programming Languages and Animals

In the above figure, it is apparent that words such as lion, tiger, pandas, python, and Java have some sort of relation with contexts such as Animals or Programming Languages. The lines with arrows at their ends indicate the vectors that they form with respect to the given contexts (context-embeddings), which act as dimensions. Here, the words (word-embeddings) lion and tiger are more closely related to the context Animals than they are to Programming Languages. Java, on the other hand, is more related to Programming Languages than it is to Animals. However, python (both the animal and the programming language) and pandas (both the animal and the Python module) show themselves to be related to both contexts. The contexts here are the ‘dimensions’ and the vectors here are the word vectors. These vectors can thus be used numerically. For example, tiger can have the value (0.2, 0.8) in the dimensions Programming Languages and Animals, while python can have the value (0.8, 0.8) in the same dimensions.
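Once words are vectors, relatedness becomes arithmetic. The sketch below compares word vectors with cosine similarity, which is built on the dot product. The values for tiger and python are the ones quoted above, while the values for lion and java are hypothetical numbers I picked to match the description.

```python
import numpy as np

# Dimensions: (Programming Languages, Animals)
vectors = {
    "tiger": np.array([0.2, 0.8]),   # from the example above
    "python": np.array([0.8, 0.8]),  # from the example above
    "lion": np.array([0.1, 0.9]),    # hypothetical value
    "java": np.array([0.9, 0.1]),    # hypothetical value
}

def cosine_similarity(u, v):
    """1.0 means the vectors point the same way; 0.0 means unrelated."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

for other in ("lion", "python", "java"):
    sim = cosine_similarity(vectors["tiger"], vectors[other])
    print(f"tiger vs {other}: {sim:.3f}")
```

Tiger comes out closest to lion, then python (which is partly an animal), and furthest from java, matching the intuition in the figure.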

Thus, the dataset consists of words that may or may not be labelled. The output that we desire from a system for Natural Language Processing is the context. This helps us make our computer understand what basic concepts we’re referring to. The above example is a ridiculously simplified version of how word vectors are actually formed. Here, I have only included two dimensions (Animals, Programming Languages); in practice, there are hundreds, and maybe even thousands, of such dimensions. Our model should aim to accurately understand what context we are using our words in. The underlying implementation involves a plethora of vector calculations and dot products. But, before focusing on those, we must understand what problems may befall us when we use this technique.

For words that have different forms but have similar meanings, such as ‘love’, ’loving’, ‘loved’, etc., it is essential for us to associate them closely. For this, we could use sub-word n-grams to model our data. In this case, trigrams could relate the words as follows:
love:      lov+ove
loving:   lov+ovi+vin+ing
Doing this can help effectively model our data so that it shows significant information about related words such as adjectivized or verbed nouns.
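The trigram breakdown above can be reproduced with a short helper. The function name is my own, and real sub-word models usually add word-boundary markers and several n-gram sizes, which I leave out to match the example.

```python
def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams (trigrams by default)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

love = char_ngrams("love")
loving = char_ngrams("loving")
loved = char_ngrams("loved")
print("+".join(love))
print("+".join(loving))

# Related word forms share sub-word pieces, which ties their vectors together.
shared = set(love) & set(loving) & set(loved)
print(shared)
```

All three forms share the piece `lov`, so whatever the model learns about one of them carries over to the others.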

What do we do with this information?
Applications of NLP

Natural Language Processing finds application in numerous tasks. Some of them are topic classification, sentiment analysis, natural language inference, and question answering. For starters, using NLP, one could classify a sentence into categories according to its most relevant context, i.e., it could skim through a big text and extract the contexts that are most prevalent in it. While it sounds simple enough, nothing that has been mentioned here is really that simple.

There is no good or bad News. There is only News.
Sentiment Analysis

When working with sentences, it is impractical to use the entire sentence as an input into the system because of the infinite permutations that sentences could have. They aren’t easy to model. Instead, we can use the sentences by breaking them down into their component words. This is the concept of ‘bag of words.’ Then, using vectorization of the words, one can determine the context that it is relevant in. After that, using adjectives and cues from the sentence, the system can classify the sentiment of the sentence, i.e., infer if it means something ‘good’ or ‘bad’. This is most readily applied to movie reviews that can be classified as reviews that the movie was good or reviews that it was bad.

However, there may be words that produce different meanings on their own and an entirely different meaning when used together. For example, gut means the visceral parts of the human while wrenching means twisting but ‘gut wrenching’ as a phrase means something that is a bad experience. Additionally, negations such as ‘not good’ or ‘not bad’ have relevant meaning only when used together. For such instances, the words must be in that same order when we feed them into the network. We can use the same concept we used earlier for sub-words. Earlier, we used that concept to relate similar words. For sentences, we can form structures similar to n-grams as follows:
what a nice day: what_a+a_nice+nice_day+what+a+nice+day.
I do not want this: I_do+do_not+not_want+want_this+I+do+not+want+this.
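The feature lists above can be generated mechanically. This is a minimal sketch with a helper name of my own choosing, and it splits on spaces only, ignoring punctuation and capitalization.

```python
def unigrams_and_bigrams(sentence):
    """Return word pairs (joined with '_') followed by the single words."""
    words = sentence.split()
    bigrams = [f"{first}_{second}" for first, second in zip(words, words[1:])]
    return bigrams + words

features = unigrams_and_bigrams("what a nice day")
print("+".join(features))

# The pair 'not_good' becomes its own feature, distinct from 'good' alone.
negated = unigrams_and_bigrams("not good")
print(negated)
```

Keeping the pairs alongside the single words is what lets a model treat ‘not good’ differently from ‘good’ without giving up the simplicity of a bag of words.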

What do you mean?
Inference, Question Answering

The moment we feel the most interactive with a machine is when it is able to draw inferences from our sentences. When we see JARVIS from Iron Man, the only tangible reason why we consider it to be artificially intelligent is its ability to understand the commands Tony Stark gives it. That is what natural language inference and question answering are about. Inference is when the computer is able to provide a basic synopsis of a large text. To accomplish this is to succeed in understanding the language. To train the network for this, we feed it pairs of sentences along with labels that say ‘yes’ or ‘no’ according to whether one sentence of the pair can be inferred from the other. Similar implementations result in a system that is able to answer questions based on internet searches.

So far, one must be wary of the fact that the ‘bag of words’ technique discards word order, which could be a huge issue in languages with little to no inflection of nouns and verbs and a heavy dependence on word order. To fix this problem, Recurrent Neural Networks are used, which are discussed in the later sections.

Day 4.2: What is the meaning of life?
Recurrent Neural Networks and Image Captioning

It follows from the previous session that a major problem with the word-bag approach to Natural Language Processing is that it does not work well with languages where word order is not very flexible. In English, as little to no inflection occurs with adjectives, verbs, or adverbs, word order is required to distinguish otherwise similar sentences. An example is as follows:

“The dog chases the cat”, “The cat chases the dog”

For both of these sentences, the word bag consists of the same words: {dog, cat, chases, the}, but we know that these are two entirely different sentences. The solution proposed for this is Recurrent Neural Networks. The underlying implementations may get a bit too complicated to fully explain, so I’ll not be going too deep into the details (for those more learned, forgive me for the excuse for an explanation that is about to follow). What basically occurs is that the output of the network from the previous step is carried forward in matrices that record intrinsic information about word order. This works nicely for short sentences. However, when sentences start getting long, the information stored in the matrix fails to fully capture what was said towards the beginning of the sentence. Several techniques are used to address this, including CRF, HMM, and LSTM, which can capture word order very efficiently.
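
As a rough illustration of why the recurrent approach helps, the sketch below runs both example sentences through a tiny recurrent network written in plain Python. Everything in it is an assumption made for the demonstration: the weights are random and untrained, the hidden size of 4 is arbitrary, and the names are hypothetical. The point is only that the hidden state depends on the order in which the words arrive.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
VOCAB = ["the", "dog", "cat", "chases"]
HIDDEN = 4  # arbitrary hidden-state size for this sketch


def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


W_xh = random_matrix(HIDDEN, len(VOCAB))  # input-to-hidden weights (untrained)
W_hh = random_matrix(HIDDEN, HIDDEN)      # hidden-to-hidden weights (untrained)


def rnn_step(h, word):
    """Combine the previous hidden state h with the current word."""
    col = VOCAB.index(word)  # a one-hot input simply selects one column of W_xh
    return [
        math.tanh(W_xh[i][col] + sum(W_hh[i][j] * h[j] for j in range(HIDDEN)))
        for i in range(HIDDEN)
    ]


def encode(sentence):
    """Feed the words in order and return the final hidden state."""
    h = [0.0] * HIDDEN
    for word in sentence.lower().split():
        h = rnn_step(h, word)
    return h


first = "the dog chases the cat"
second = "the cat chases the dog"
same_bag = sorted(first.split()) == sorted(second.split())  # identical word bags
same_state = encode(first) == encode(second)                # hidden states differ
```

Here same_bag is true while same_state is false: the two sentences are indistinguishable as bags of words, yet the recurrent network ends up in different hidden states for them.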

Conditional Random Fields, a class of statistical modelling methods, are also useful for structured recognition and prediction in natural language. Hidden Markov Models are used to tag the parts and phrases of a sentence according to their functions as parts of speech (POS). Long Short-Term Memory (LSTM) works by treating the sequential data as a tensor. While the implementation is important, the overarching reason for using it is to maintain a hidden state h that can contain information from arbitrary points earlier in the sequence of words.
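
For readers who want to see what keeping a hidden state h looks like, below is a minimal sketch of a single LSTM step in plain Python. It follows the commonly used formulation with forget, input, and output gates plus a cell state c, which is an assumption on my part rather than the exact variant shown in the session. The weights are random and untrained, biases are left out for brevity, and all sizes and names are hypothetical.

```python
import math
import random

rng = random.Random(0)
INPUT, HIDDEN = 3, 2  # arbitrary sizes for this sketch


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# One untrained weight matrix per gate, applied to the concatenation [h_prev, x].
GATES = {
    name: random_matrix(HIDDEN, HIDDEN + INPUT)
    for name in ("forget", "input", "output", "candidate")
}


def affine(name, vec):
    """Multiply the named gate's weight matrix by vec (no bias in this sketch)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in GATES[name]]


def lstm_step(x, h_prev, c_prev):
    """Advance the hidden state h and the cell state c by one input x."""
    z = h_prev + x  # list concatenation: [h_prev, x]
    f = [sigmoid(a) for a in affine("forget", z)]       # how much old memory to keep
    i = [sigmoid(a) for a in affine("input", z)]        # how much new memory to write
    o = [sigmoid(a) for a in affine("output", z)]       # how much memory to expose
    g = [math.tanh(a) for a in affine("candidate", z)]  # proposed new memory
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


h, c = [0.0] * HIDDEN, [0.0] * HIDDEN
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(x, h, c)
```

The cell state c is what lets information from early words survive: when the forget gate stays close to 1, the old memory is carried along almost unchanged from step to step.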

That’s not all that RNNs do. Along with retaining word order, they can also estimate how likely it is that a given sequence of words forms a sentence that makes sense, broadly speaking, by the use of conditional probability. For example, for a sequence of words w1, w2, ..., wn, one standard way to model the probability of occurrence is the chain rule:

P(w1, w2, ..., wn) = P(w1) × P(w2 | w1) × ... × P(wn | w1, ..., wn-1)

The output of the network at each step is then sent into a softmax function, which turns it into probabilities over the possible next words.
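
The sketch below shows, under simple assumptions, how softmax and conditional probability fit together. The scores would normally come from the RNN at each step; here they are passed in as plain lists, and the function names are hypothetical. The probability of a whole sentence is taken as the product of each word's probability given the words before it.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_probability(step_scores, word_ids):
    """Multiply, step by step, the probability assigned to the word that occurred.

    step_scores[t] holds one raw score per vocabulary word at step t
    (assumed to come from the network after reading the earlier words),
    and word_ids[t] is the index of the word actually seen at step t.
    """
    probability = 1.0
    for scores, word_id in zip(step_scores, word_ids):
        probability *= softmax(scores)[word_id]
    return probability
```

With a vocabulary of two words and equal scores at every step, each word gets a probability of 0.5, so a two-word sentence gets 0.5 × 0.5 = 0.25.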

#DeepLearning #MakingJARVIS
Image captioning

Finally, RNNs can also be used for image captioning. Since sequence-modelling techniques can take word order into account, generative networks can formulate sentences from a bag of words and probability distributions. So where do we go with this? We know that, by analysing an image with a neural network, we can extract the features relevant to that image. Those features are essentially words that describe the image, and they can be turned into a caption by forming a sentence with Natural Language Processing techniques. That is to say, the last hidden state of the RNN encoder used for word order retention becomes the first hidden state of the network that generates the relevant sentences.
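
Here is a toy sketch of that hand-over in plain Python, where a vector of image features becomes the first hidden state of a small recurrent decoder that picks one word at a time. It is only meant to show the flow of data: the vocabulary, the feature values, and all names are hypothetical, and the weights are random and untrained, so the generated words carry no meaning.

```python
import math
import random

rng = random.Random(1)
VOCAB = ["<end>", "a", "dog", "on", "grass"]  # hypothetical toy vocabulary
HIDDEN = 4


def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


W_embed = random_matrix(HIDDEN, len(VOCAB))  # previous word -> hidden (untrained)
W_hh = random_matrix(HIDDEN, HIDDEN)         # hidden -> hidden (untrained)
W_out = random_matrix(len(VOCAB), HIDDEN)    # hidden -> one score per word (untrained)


def generate_caption(image_features, max_words=6):
    """Greedily decode words, starting from the image features as hidden state."""
    if len(image_features) != HIDDEN:
        raise ValueError("image_features must have HIDDEN entries")
    h = list(image_features)  # the image encoding is the first hidden state
    previous = None           # no word has been generated yet
    caption = []
    for _ in range(max_words):
        h = [
            math.tanh(
                sum(W_hh[i][j] * h[j] for j in range(HIDDEN))
                + (W_embed[i][previous] if previous is not None else 0.0)
            )
            for i in range(HIDDEN)
        ]
        scores = [sum(W_out[k][i] * h[i] for i in range(HIDDEN)) for k in range(len(VOCAB))]
        best = max(range(len(VOCAB)), key=scores.__getitem__)  # most likely next word
        if VOCAB[best] == "<end>":
            break
        caption.append(VOCAB[best])
        previous = best
    return caption


caption = generate_caption([0.2, -0.1, 0.7, 0.4])  # made-up feature values
```

In a trained system, the features would come from an image network and the weights would be learned from images paired with captions, so that the chosen words actually describe the picture.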

Taking Bigger Leaps
Concluding Remarks

The community accepts that the entire field of Artificial Intelligence, with its Machine Learning strides, Deep Network developments, and Deeper Learning research and advancement, is still in its infancy. The field is growing and so is our understanding of how we perceive the world around us. By putting together bits and pieces of information from our several senses, we are able to form a novel image of what we believe is representative of what we experience. Suffice it to say, with constant improvement of the techniques we use, and the active AI communities we support, we will be closer to giving rise to a world that understands how it runs and can replicate itself to perfection. With such a goal at hand, one could definitely assume that machines will, in the near or far future, be able to help us understand what it means to be alive, functioning, conscious, and human.


The AID team would like to thank Aayush Poudel for his active participation in this AI Masterclass series and for sharing his thoughts and the lessons he learned with us. We are delighted to have someone like Aayush as our Community Leader.

If you are interested in attending AI events, make sure to visit our Events section frequently, as we will be posting our activities there. If you would like to interact with us, ask questions, or give suggestions, please join our Facebook group called Developer Sessions and interact with us there directly.
