ViT-FRCNN already leveraged a pre-trained ViT as the backbone for an R-CNN object detector, but it still relies heavily on convolutional neural networks and strong 2D inductive biases. Briefly put, ViT-FRCNN reinterprets the output sequence of ViT as 2D spatial feature maps and uses region-wise pooling operations and the R-CNN architecture to decode the features for object-level perception. Other similar works, like the DEtection TRansformer (DETR), also introduce 2D inductive bias by leveraging pyramidal feature hierarchies and CNNs.

However, these architectures are performance-oriented and don't reflect the properties of the vanilla Transformer. ViT is designed to model long-range dependencies and global contextual information rather than local and region-level relations. Moreover, ViT lacks the hierarchical architecture of CNNs for handling large variations in the scale of visual entities. But Transformers are born to transfer, so we can't dismiss them without testing whether a pure ViT can transfer pre-trained general visual representations from image-level recognition to the much more complicated 2D object detection task.

To test the efficacy of vanilla Transformer models, Yuxin Fang, Bencheng Liao, et al. created You Only Look at One Sequence (YOLOS), a series of object detection models based on the ViT architecture with the fewest possible modifications and inductive biases.

YOLOS closely follows the ViT architecture, with only two simple changes:

- YOLOS drops the [CLS] token used for image classification and adds one hundred randomly initialized detection [DET] tokens to the input patch embedding sequence for object detection.
- The image classification loss used in ViT is replaced with a bipartite matching loss to perform object detection similar to DETR.
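To make the first change concrete, here is a minimal NumPy sketch of how the input sequence might be assembled. The patch count and embedding width are illustrative (they mirror a ViT-Base-style setup), not the exact YOLOS configuration:

```python
import numpy as np

# Illustrative dimensions: a 224x224 image cut into 16x16 patches gives
# 196 patch embeddings; the embedding width of 768 mirrors ViT-Base
num_patches, dim = 196, 768
patch_embeddings = np.random.randn(num_patches, dim)

# 100 randomly initialized [DET] tokens are appended in place of [CLS]
det_tokens = np.random.randn(100, dim)
sequence = np.concatenate([patch_embeddings, det_tokens], axis=0)
print(sequence.shape)  # (296, 768)
```

The Transformer then processes this combined sequence, and the outputs at the [DET] token positions are decoded into class and box predictions.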

YOLOS is pre-trained on the relatively small ImageNet-1k dataset and then fine-tuned on the COCO object detection dataset. It is important to reiterate that the whole model isn't trained from scratch on COCO; YOLOS learns a general representation from ImageNet-1k and is then fine-tuned on COCO. If you've trained a custom object detection model, or simply employed transfer learning, you have taken a pre-trained model, frozen most of its layers, and fine-tuned the final few layers for your specific dataset/use case. Similarly, in YOLOS all the parameters are initialized with the ImageNet-1k pre-trained weights except for the MLP heads for classification & bounding box regression, and the one hundred [DET] tokens.

The randomly initialized detection [DET] tokens are used as substitutes for object representation. This is done to avoid inductive bias of 2D structure and any prior knowledge of the task that can be introduced during label assignment. When YOLOS models are fine-tuned on COCO, an optimal bipartite matching between predictions generated by [DET] tokens and the ground truth is established for each forward pass. This serves the same purpose as label assignment but is completely unaware of the input 2D structure, or even that it is 2D in nature.
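The matching step can be illustrated with SciPy's Hungarian-algorithm solver, `scipy.optimize.linear_sum_assignment`, on a toy cost matrix. The costs below are made up for illustration; the real matching cost combines classification and bounding box terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows are [DET] token predictions, columns are ground-truth
# objects; entries are made-up matching costs (lower is a better match)
cost = np.array([[0.9, 0.2, 0.7],
                 [0.1, 0.8, 0.6],
                 [0.5, 0.4, 0.3]])

# Find the one-to-one assignment that minimizes the total matching cost
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(gt_idx))  # [1, 0, 2]
```

Here prediction 0 is matched to ground truth 1, prediction 1 to ground truth 0, and prediction 2 to ground truth 2, for a minimal total cost of 0.6. Note that nothing in this procedure depends on the spatial layout of the input.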

What this means is that YOLOS does not need to re-interpret ViT's output sequence as a 2D feature map for label assignment. YOLOS is designed with minimal inductive bias injection in mind. The only inductive biases it has are inherited from the patch extraction at the network stem of ViT and the resolution adjustment for position embeddings. Besides these, YOLOS adds no non-degenerate (i.e., non-1 x 1) convolutions on top of ViT. Performance-oriented components of modern CNN architectures such as pyramidal feature hierarchies, region-wise pooling, and 2D spatial attention are not added.

This is all done in order to better demonstrate the versatility and transferability of Transformer from image recognition to object detection in a pure sequence-to-sequence manner with minimal knowledge about the spatial structure of the input. And as YOLOS doesn’t know about the spatial structure and geometry, it is feasible for it to perform any dimensional object detection as long as the input is flattened to a sequence in the same way. In addition to that, YOLOS can easily be adapted to various Transformers available in NLP and computer vision.

To test its capabilities, YOLOS was compared with some modern CNN-based object detectors. The smaller YOLOS variant, YOLOS-Ti, achieves impressive performance compared with existing highly-optimized CNN object detectors like YOLOv4 Tiny. It has strong AP and is competitive in FLOPs and FPS even though it was not intentionally designed to optimize these factors.

Although YOLOS-Ti performs better than its DETR counterpart, the larger YOLOS models with width scaling are less competitive. YOLOS-S, despite more computation, is 0.8 AP lower than a similar-sized DETR model. Worse still, YOLOS-B cannot beat DETR despite having over 2× the parameters and FLOPs. And although YOLOS-S with dwr (fast) scaling outperforms its DETR counterpart, the performance gain cannot be clearly explained by the corresponding CNN scaling methodology.

With these results, we have to keep in mind that YOLOS is not designed to be yet another high-performance object detector. It is merely a touchstone for the transferability of ViT from image recognition to object detection. To compare it with state-of-the-art models like YOLOR or YOLOX would be unfair. There are still many challenges that need to be resolved, but the performance on COCO is promising nonetheless. These initial findings effectively demonstrate the versatility and generality of Transformer to downstream tasks.

A Bernoulli trial is a random experiment that has exactly two possible outcomes, where the probability of each outcome remains the same each time the experiment is conducted.

\[ P(X = x) = \begin{cases} p &\text{if } x = 1 \\ 1 - p &\text{if } x = 0 \end{cases} \]

A coin flip is an example of an experiment with a binary outcome. Coin flips meet the other requirement as well: the outcome of each individual coin flip is independent of all the others. The outcomes don't need to be equally likely as they are with flips of a fair coin; the following also meet the prerequisites of a Bernoulli trial:

- The answer to a True or False question
- Winning or losing a game of Blackjack
- Attempting to convince visitors of a website to buy a product, where the yes or no outcome is whether they make a purchase
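As a quick sanity check, `scipy.stats.bernoulli` evaluates the pmf above directly. The purchase probability used here is purely illustrative:

```python
from scipy.stats import bernoulli

# Illustrative probability that a website visitor makes a purchase
p = 0.3

# P(X = 1) is p, P(X = 0) is 1 - p, exactly as in the formula above
print(bernoulli.pmf(1, p), bernoulli.pmf(0, p))
```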

The binomial distribution is the probability distribution of a series of Bernoulli trials, each producing a binary outcome independent of the rest.

A series of 5 coin flips is a straightforward example; using the formula below, we can get the probabilities of getting 1, 2, 3, 4, or 5 heads.

\[ P(X) = \binom{n}{x}p^x(1-p)^{n-x} \]

Here

- \(n\) is the total number of trials of an event.
- \(x\) corresponds to the number of times an event should occur.
- \(p\) is the probability that the event will occur.
- \(1-p\) is the probability that the event will not occur.
- \( \binom{n}{x}\) is the number of possible combinations, i.e., \( \frac{n!}{x!(n-x)!}\)

Let's use this to calculate the probability of getting 5 heads in 10 tosses, and verify the findings by simulating the experiment 10,000 times.

\[ P(X) = \binom{10}{5}\left(\frac{1}{2}\right)^5\left(1-\frac{1}{2}\right)^{5} \]

\[ \approx 0.246 \]
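We can double-check this closed-form value in Python with `math.comb`:

```python
from math import comb

# P(5 heads in 10 fair-coin tosses) straight from the binomial formula
p_5 = comb(10, 5) * (1 / 2) ** 5 * (1 - 1 / 2) ** 5
print(round(p_5, 3))  # 0.246
```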

```
import numpy as np
trials = 10000
n = 10
p = 0.5
# Count the simulated experiments that produced exactly 5 heads
samples = np.random.binomial(n, p, size=trials)
prob_5 = np.sum(samples == 5) / trials
print('The probability of 5 heads is: ' + str(prob_5))
```

```
The probability of 5 heads is: 0.242
```

But what if you wanted to plot the distribution of probabilities for all possible values of \(x\)?

```
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import binom
# Number of trials
trials = 1000
# Number of coin tosses in each trial
n = 10
# Probability of heads for each toss
p = 0.5
# Function that runs our coin toss trials
def run_trials(trials, n, p):
    heads = []
    for i in range(trials):
        tosses = np.random.random(n)
        heads.append(np.sum(tosses < p))
    return heads
heads = run_trials(trials, n, p)
# Plotting the results as a histogram
fig, ax = plt.subplots(figsize=(14, 7))
ax = sns.distplot(heads, bins=11, label='simulation results')
ax.set_xlabel("Number of Heads", fontsize=16)
ax.set_ylabel("Frequency", fontsize=16)
# Plotting the actual/ideal binomial distribution
x = range(0, 11)
ax.plot(x, binom.pmf(x, n, p), 'ro', label='actual binomial distribution')
ax.vlines(x, 0, binom.pmf(x, n, p), colors='r', lw=5, alpha=0.5)
plt.legend()
plt.show()
```

Does the shape of the plot remind you of normal distribution? Well, it should. This is because of the concept of *Normal approximation*. If n is large enough, a reasonable approximation to \(B(n, p)\) can be given by the normal distribution \(N(np, np(1-p))\). This can be seen visually in the Galton Board demonstration below.

The Student's T distribution is bell-shaped and symmetric, similar to the normal distribution, but has heavier tails. It is used in place of the normal distribution when we have small samples (n < 30). The T distribution starts resembling the normal distribution as the sample size increases.

\[ f(t) = \frac {\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt {\nu \pi}\, \Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu + 1}{2}} \]

Don't worry about this formula; you'll probably never need it, but if you do want to dig deeper, you can learn more about what it means here.
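Rather than working with the formula by hand, we can let `scipy.stats.t` demonstrate the two key properties: heavier tails than the normal at small degrees of freedom, and convergence to the normal as the degrees of freedom grow:

```python
from scipy.stats import t, norm

# Heavier tails: at 3 standard deviations, the t pdf (df = 5)
# puts more probability mass than the normal pdf
tail_t, tail_norm = t.pdf(3, df=5), norm.pdf(3)
print(tail_t > tail_norm)  # True

# Convergence: at large df the t pdf is nearly indistinguishable from the normal
center_t, center_norm = t.pdf(0, df=1000), norm.pdf(0)
print(abs(center_t - center_norm) < 1e-3)  # True
```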

**What's the need for Student's T distribution?**

Statistical normality is overused. It's not as common as assumed and only really occurs in theoretical 'limits'. To claim normality, you need a substantial, well-behaved, independent dataset, but in most cases small samples and dependence are what we actually have. We tend to have sub-optimal data that we manipulate to look normal, when in fact those anomalies in the extremes are telling us something is up.

Lots of things are ‘approximately’ normal. That’s where the danger is.

A lot of unlikely events are actually more likely than we expect, and this is not because of skewness, but because we're modeling the data wrong. Let's take an over-exaggerated example of an ambient temperature forecast over a 100-year period. Is 100 years of data enough to assume normality? Yes? No! The world has been around for millions and millions of years; 100 years is insignificant. However, to reduce computational complexity and make things simpler, we choose a small portion of the total data in most cases. By doing so, we underestimate the tails.

Therefore, it is important to know what you don’t know and the distribution that you’re using for your inferences. The normal distribution is used for inferences mainly because it offers a tidy closed-form solution but in reality, the difficulties in solving harder distributions are why, at times, they make better predictions.

Let's say we have a random variable with a mean μ and a variance σ², drawn from a normal distribution. If we compute a sample estimate of the mean (say μ⁰) from n observations, then the variable z = (μ⁰ - μ) / (σ/√n) follows a normal distribution with a mean of 0 and a variance of 1. We've normalized the variable, or, we've standardized it.

However, imagine we have a relatively small sample size (say n < 30) which is part of a greater population. Our estimate of the mean is computed the same way; however, our estimate of the standard deviation, the sample standard deviation, has a denominator of n-1 (Bessel's correction).

Because of this, our attempt to normalize our random variable has not resulted in a standard normal, but rather has resulted in a variable with a different distribution, namely: a Student-t Distribution of the form:

\[ t = \frac{\bar{x} - \mu} {\Large {\frac {s}{\sqrt{n}}}} \]

\(\bar{x}\) is the sample mean, μ is the population mean, s is the sample standard deviation (so \(s/\sqrt{n}\) is the standard error), and n is the number of samples.

This is significant to note because it tells us that even though something may be normally distributed, in small sample sizes, the dynamics of sampling from this distribution completely change, which is largely being characterized by Bessel’s correction.
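A short sketch of this standardization, using a made-up sample, shows that the hand-computed t statistic (with Bessel's correction) matches SciPy's one-sample t-test statistic:

```python
import numpy as np
from scipy.stats import ttest_1samp

# Made-up small sample (n = 8) from a process with hypothesized mean 5.0
sample = np.array([4.8, 5.2, 5.9, 4.6, 5.4, 5.1, 4.9, 5.3])
mu = 5.0

s = sample.std(ddof=1)  # Bessel's correction: denominator is n - 1
t_stat = (sample.mean() - mu) / (s / np.sqrt(len(sample)))

# SciPy computes the same statistic internally
t_scipy = ttest_1samp(sample, mu).statistic
print(t_stat)
```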

The Poisson distribution gives us the probability of a certain number of events happening when we know how often the events occur on average. Simply put, the Poisson distribution function gives the probability of observing \(k\) events in a given time period, given the average number of events per time period.

\[ P(k \text{ events in interval}) = e^{-\frac{\text{events}}{\text{time}} \cdot \text{time period}} \frac{\left(\frac{\text{events}}{\text{time}} \cdot \text{time period}\right)^k}{k!} \]

The \(\frac{\text{events}}{\text{time}} \cdot \text{time period}\) term is usually simplified into a single parameter, \(\lambda\) (lambda), the rate parameter. Lambda can be thought of as the expected number of events in the interval.

\[ P(k \text{ events in interval}) = \frac{e^{-\lambda} \lambda^k}{k!} \]

Here:

- \(k\) is the number of occurrences
- \(e\) is Euler's number (e = 2.71828...)
- \(λ\) is the rate parameter.
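A quick check that the formula and `scipy.stats.poisson` agree, using λ = 4 and k = 5:

```python
from math import exp, factorial
from scipy.stats import poisson

lam, k = 4, 5

# P(k events) = e^(-lambda) * lambda^k / k!, computed by hand
p_manual = exp(-lam) * lam ** k / factorial(k)

# The same pmf evaluated by SciPy
p_scipy = poisson.pmf(k, lam)
print(round(p_manual, 4))
```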

Let's use an example to make sense of all this. Suppose that in a coffee shop, the average number of customers per hour is 4. We can use the Poisson distribution formula to calculate the probability of getting k number of customers in the shop.

```
from scipy.stats import poisson
import numpy as np
import matplotlib.pyplot as plt
no_of_customers = np.arange(0, 15, 1)
lamda = 4
probs = poisson.pmf(no_of_customers, lamda)
plt.figure(figsize=(15, 10))
plt.bar(no_of_customers, probs)
plt.show()
```

What about the probability of seeing at most 7 customers? Since the Poisson distribution is discrete, we simply add the probabilities of all values of k ≤ 7 (which is why the slice below goes up to index 8).

```
print(f"Probability of seeing at most 7 customers is {np.around(sum(probs[:8]), 3)}")
```

```
Probability of seeing at most 7 customers is 0.949
```

**Conditions for Poisson Distribution**:

- The events can occur independently.
- An event can occur any number of times.
- The rate of occurrence is constant; i.e., the rate does not change based on time.

As the rate parameter, λ, changes so does the probability of seeing different numbers of events in one interval. The below graph is the probability mass function of the Poisson distribution showing the probability of a number of events occurring in an interval with different rate parameters.

```
from scipy.stats import poisson
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 50, 1)
plt.figure(figsize=(15, 10))
# poisson distribution data for y-axis
y1 = poisson.pmf(x, mu=10)
y2 = poisson.pmf(x, mu=15)
y3 = poisson.pmf(x, mu=25)
y4 = poisson.pmf(x, mu=30)
plt.plot(x, y1, marker='o', label = "lambda = 10")
plt.plot(x, y2, marker='o', color = "red", label = "lambda = 15" )
plt.plot(x, y3, marker='o', color = "green", label = "lambda = 25")
plt.plot(x, y4, marker='o', color = "magenta", label = "lambda = 30")
plt.legend()
plt.show()
```

The most likely number of events in the interval for each curve is the rate parameter. This makes sense because the rate parameter is the expected number of events in the interval and therefore when it’s an integer, the rate parameter will be the number of events with the greatest probability.
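We can verify this numerically: for each rate parameter used above, the value of k with the highest pmf is λ (for integer λ, the pmf actually ties at λ - 1 and λ):

```python
import numpy as np
from scipy.stats import poisson

modes = []
for lam in (10, 15, 25, 30):
    k = np.arange(0, 60)
    # k value with the highest probability mass for this rate parameter
    modes.append(int(k[np.argmax(poisson.pmf(k, lam))]))
print(modes)
```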

With that we're done with the probability distributions, in the next article we'll learn about sampling and the central limit theorem. 👋

- Basics Of Statistics & Why Should You Care About Median And Mode
- Probability: What Exactly Does It Tell Us?

In this part of the series, we'll be diving into the world of probability distributions, but before we do that let's first briefly go over the very thing that these functions map: *random variables*.

A random variable is the set of all possible outcomes of a random process. Say what!? Simply put, a random variable can take any one of the many possible outcomes of a random process/experiment. Remember events and experiments from the probability blog? Well, a random variable is used to represent all possible events for an experiment.

To illustrate this let's take a fairly simple example of getting a sum of 7 when rolling two dice:

The two dice can take a total of 36 combinations, we could map each of these outcomes using a random variable, but for the sake of our example we are only considering two outcomes:

- the sum of the upwards facing sides is 7
- the sum of the upwards facing sides is not 7

Let's say we call our random variable X (obviously), then

\[ X = \begin{cases} 0 &\text{if sum} \neq 7 \\ 1 &\text{if sum} = 7 \end{cases} \]

With normal algebraic variables, you can solve some equations and get a definite answer (or two; yes, I am looking at you, \( x^2 \)) for the value that a particular variable can take. However, a random variable can take many values, and there's no definite answer that will always be true. It is more useful, therefore, to talk in terms of probabilities: the probability of the random variable taking value 1, value 2, and so on. Continuing with the dice example:

\[ P(X=0) = \frac 5 6 \] \[ P(X=1) = \frac 1 6 \]
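A quick simulation, sketched below, confirms that roughly 1/6 of two-dice rolls sum to 7:

```python
import random

random.seed(0)
trials = 100_000

# Count rolls of two fair dice whose faces sum to 7
sevens = sum(
    1 for _ in range(trials)
    if random.randint(1, 6) + random.randint(1, 6) == 7
)
print(sevens / trials)
```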

One last thing about random variables, they can either be discrete or continuous.

A discrete random variable is one that can take on only a countable number of distinct values.

Examples of discrete random variables:

- the attendance of an afternoon lecture on Friday
- number of defective light bulbs in the box

The probability distribution of a discrete variable is the list of the probabilities associated with every possible outcome. Like our dice example from earlier.

Continuous random variables can take an infinite number of possible values. They usually correspond to measurements.

Examples of continuous random variables:

- the exact height of a giraffe
- the amount of cheese in a pizza

The probability distribution of a continuous random variable is defined over a range of real numbers, and the probability of the outcomes is represented by the area under a curve. Take, for instance, a random variable that maps the heights of students in a class.

**But what is a probability distribution?**

A probability distribution is a mathematical function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as the relative likelihood that the random variable equals that sample.

A probability distribution function takes all of the possible outcomes of a random variable as input and gives us their corresponding probability values.

When thinking about a series of experiments people new to statistics may think deterministically such as “I flipped a coin 5 times and produced 2 heads”. So the outcome is 2, where is the distribution? The distribution of outcomes occurs because of the variance or uncertainty surrounding them. If both you and your friend flipped 5 coins, it’s pretty likely that you would get different results (you might get 2 heads while your friend gets 1). This uncertainty around the outcome produces a probability distribution, which basically tells us what outcomes are more likely (such as 2 heads) and which outcomes are relatively less likely (such as 5 heads).

There are many different types of probability distributions; these are the ones we will be covering:

- Normal Distribution
- Standard Normal Distribution
- Binomial Distribution
- Student's T Distribution
- Poisson Distribution

Let's start with what is inarguably the most common probability distribution: the normal distribution. When plotted, it is a bell-shaped curve described by two parameters, μ and σ, where μ represents the population mean, or center of the distribution, and σ the population standard deviation.

\[ y = \frac {1} {\sigma \sqrt{2 \pi}} e ^{-\frac {(x- \mu)^2}{2 \sigma ^2}} \]

This distribution is symmetric, and its mean, median, and mode are equal.

In symmetric distributions, one half of the distribution is a mirror image of the other half. Skewness is an asymmetry in the statistical distribution in which the curve appears distorted either to the left or to the right. A lot of real-world data forms a skewed distribution.

When a distribution is skewed to the left, the tail on the curve's left-hand side is longer than the tail on the right-hand side, and the mean is less than the mode. This situation is called **negative skewness**. The average human lifespan forms a negatively skewed distribution. This is because most people tend to die after reaching the mean age, and only a small number of people die too soon. If such data is plotted along a linear line, most of the values would be present on the right side, and only a few values would be present on the left side.

When a distribution is skewed to the right, the tail on the curve's right-hand side is longer than the tail on the left-hand side, and the mean is greater than the mode. This situation is called **positive skewness**. The income distribution of a state is an obvious example of a positively skewed distribution. A huge portion of the total population residing in a particular state tends to fall under the category of a low-income earning group, while only a few people fall under the high-income earning group.

When you are working with a symmetrical distribution for continuous data, the mean and median (and sometimes even the mode) are equal. In this case, the choice doesn't matter because all of them convey the relevant information. However, if you have a skewed distribution, the median is often the better measure of central tendency.

When working with skewed data, the tail region acts as an outlier for statistical models and can be detrimental to their performance, especially for regression-based models. Some statistical models, like tree-based models, are robust to outliers, but skewed data limits our ability to try different models. This creates a need for transforming a skewed distribution into a more "*normal*" distribution, which you can do using a log transformation.

Log transformation simply replaces each value x with log(x). The choice of the logarithm base depends on the purpose of statistical modeling, but the natural log is a good place to start.

The empirical rule, also known as the three-sigma rule or 68-95-99.7 rule, states that for the normal distribution, nearly all of the data will fall within three standard deviations of the mean.

The empirical rule can be understood through the following:

- 68.3% of the data falls within the first standard deviation from the mean.
- 95.5% falls within two standard deviations.
- 99.7% falls within three standard deviations.
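These percentages are easy to confirm by sampling from a standard normal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)

# Fraction of samples within 1, 2, and 3 standard deviations of the mean
within = [np.mean(np.abs(samples) < k) for k in (1, 2, 3)]
print([round(w, 3) for w in within])
```

The printed fractions land very close to 0.683, 0.955, and 0.997.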

The empirical rule helps determine outcomes when not all the data is available. It allows us to gain insight into where the data will fall, once all of it is available.

It is also used to test how *normal* a data set is. If the data does not adhere to the empirical rule, then it is not considered a normal distribution and must be treated accordingly.

A Standard Normal Distribution is a normal distribution that has a mean \(\mu\) equal to zero and a standard deviation \(\sigma\) equal to one. We can standardize any distribution using the Z-score formula. The Z-score (also called a standard score) is a measure of how many standard deviations below or above the population mean a raw score is. Simply put, the z-score gives you an idea of how far from the mean a data point is.

\[ Z = \frac {x - \mu}{\sigma} \]

Different normal distributions have different means and standard deviations. To find the probability of a given characteristic (say, the height of students in a class) lying within a given interval, we have to integrate the density function within those limits using the \(\mu\) and \(\sigma\) of the distribution. For each study, we would have to repeat this tedious process. This is where the standard normal distribution comes to our rescue. Its parameters are known (\(\mu = 0\), \(\sigma = 1\)), so we can use the standard normal chart, which gives the area under the curve for all values between \(-3\sigma\) and \(+3\sigma\). Standardizing a normal variate does not change the characteristics of the distribution, so we can use it to easily compute the required probabilities.
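A small sketch with made-up height data shows that standardizing with the z-score formula yields a distribution with mean 0 and standard deviation 1:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical heights (cm) of students, normally distributed
heights = rng.normal(loc=170, scale=10, size=100_000)

# Apply the z-score formula to every data point
z = (heights - heights.mean()) / heights.std()
print(round(z.mean(), 6), round(z.std(), 6))
```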

TLDR: Using standardized normal distribution makes inferences and predictions much easier.

We're gonna stop here for now, in the second part we'll cover the remaining three distributions, and then go on to learn about sampling and hypothesis testing.

- Detecting traffic objects
- Segmenting the drivable area
- Detecting lanes

There are numerous state-of-the-art algorithms that handle each of these tasks separately. Take, for instance, Mask R-CNN and YOLOR for object detection, or models like UNet and PSPNet for semantic segmentation. Despite their excellent individual performances, processing each of these tasks one by one takes a long time. In addition, the embedded devices these models are ultimately deployed on have very limited computational resources, which makes the sequential approach even more impractical.

These traffic scene understanding tasks have a lot of related information, for example, the lanes often mark the boundary of the drivable area, and most traffic objects are generally located within the drivable area. YOLOP, You Only Look Once for Panoptic Driving Perception, takes a multi-task approach to these tasks and leverages the related information to build a faster, more accurate solution.

YOLOP has one shared encoder and three decoder heads to solve specific tasks. There are no complex shared blocks between different decoders to keep the computation to a minimum and allow for easier end-to-end training.

The encoder consists of a backbone network and a neck network. YOLOP employs the lightweight CSPDarknet as the backbone, which is used to extract features from the input images. It supports feature propagation and reuse, which reduces the number of parameters and calculations. The neck network is responsible for feature engineering; it manipulates the extracted image features to get the most out of them. It consists of a Spatial Pyramid Pooling (SPP) module and a Feature Pyramid Network (FPN) module. The SPP module generates and fuses features of different scales, and the FPN module fuses features at different semantic levels. Thus, the neck network generates rich features containing multiple scales and multiple semantic levels of information. (Here, concatenation is used to fuse the features.)

For the object detection task, YOLOP adopts an anchor-based multi-scale detection technique similar to that of YOLOv4. There are two reasons behind this choice, firstly the single-stage detection networks are faster than the two-stage detection networks. Secondly, the grid-based prediction mechanism is more relevant to the other two semantic segmentation tasks. The YOLOP detect head is composed of a Path Aggregation Network. The FPN in the neck network transfers semantic features top-down, and PAN transfers image features bottom-up. YOLOP combines them to obtain a better feature fusion effect, the multi-scale fusion feature map thus obtained is used for detection. If you want to learn more about how the grid-based detection mechanism works, check out this in-depth explanation.

The drivable area segmentation head and the lane line segmentation head use the same network structure. The features of size (W/8, H/8, 256) from the bottom layer of the FPN are fed to the segmentation branch. It applies three upsampling processes and restores the feature map to (W, H, 2), which represents the pixel-wise probability for the drivable area and lane line in the input image. Where other segmentation networks would have an SPP module, the YOLOP segmentation heads don’t need one because of the shared SPP module in the neck network.

YOLOP employs straightforward loss functions; it has three individual loss functions for the three decoder heads. The detection loss is the weighted sum of classification loss, object loss, and bounding box loss. The loss functions of both the drivable area segmentation head and the lane line segmentation head contain cross-entropy loss with logits. The lane line segmentation head has an additional IoU loss because of its effectiveness in predicting sparse categories. The overall loss function of the model is a weighted sum of all three losses.
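The weighted sum described above can be sketched in a few lines. The weight values here are illustrative placeholders, not the hyperparameters from the paper:

```python
# Hedged sketch of YOLOP's overall loss: a weighted sum of the three head
# losses. The default weights are illustrative, not the paper's values.
def total_loss(l_det, l_da_seg, l_ll_seg, weights=(1.0, 1.0, 1.0)):
    w_det, w_da, w_ll = weights
    return w_det * l_det + w_da * l_da_seg + w_ll * l_ll_seg

# With unit weights the overall loss is just the sum of the three head losses
print(total_loss(0.5, 0.25, 0.25))  # 1.0
```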

The creators of YOLOP experimented with different training methodologies. They tried training end to end, which is quite useful in cases where all tasks are related. They also examined alternating optimization algorithms that train the model step by step, where each step focuses on one or more related tasks. What they observed is that the alternating optimization algorithms offer negligible improvements in performance, if any.

YOLOP was tested on the challenging BDD100K dataset against the state-of-the-art models for the three tasks. It beats Faster RCNN, MultiNet, and DLT-Net in terms of accuracy for the object detection task and can infer in real time. For the drivable area segmentation task, YOLOP outperforms models like MultiNet and DLT-Net by 19.9% and 20.2%, respectively. Moreover, it is 4 to 5 times faster than both of them. Similarly, for the lane detection task, it outperforms the existing state-of-the-art models by a factor of up to 2.

It is one of the first models, if not the first, to perform these three tasks simultaneously in real time on an embedded device like the Jetson TX2 and achieve state-of-the-art performance.

The YOLO series has seen a lot of developments in 2021. In an earlier article, we compared YOLOR and YOLOX, two state-of-the-art object detection models, and concluded that YOLOR is better in terms of performance and general-purpose use, whereas YOLOX is better suited for edge devices. With the introduction of YOLOP, a question now arises: "where does YOLOP fit in all this?" And the short answer is that it doesn't really fit there at all.

You see, both YOLOX and YOLOR, regardless of their different approaches, aim to solve the general-purpose object detection task. On the other hand, YOLOP was created solely for the purpose of traffic scene understanding; this is reflected in its design choices and its performance when it is trained to perform only object detection (as can be observed in the table above).

Do you want to learn one of the most pivotal computer vision tasks—object detection—and convert it into a marketable skill by making awesome computer vision applications like the one shown above? Enroll in AugmentedStartup's YOLOR course HERE today! It is a comprehensive course on YOLOR that covers not only the state-of-the-art YOLOR model and object detection fundamentals, but also the implementation of various use-cases and applications, as well as integrating models with a web UI for deploying your own YOLOR web apps.

This post is a part of my crossposts series which means I wrote this for someone else, the original article has been referred to in the canonical URL and you can read it here.

In addition, it can be easily extended to multi-modal learning like OpenAI's CLIP. This allows YOLOR to develop its implicit knowledge even further and use other mediums of data such as audio and text. Now, you might be thinking what's all this "implicit knowledge" and "subconsciousness" stuff; don't worry, it's nothing related to spirituality. You can get a better understanding of exactly what all this means and how YOLOR works in our awesome breakdown of YOLOR's research paper and architecture. Read it HERE.

In this tutorial we’ll do the following:

- Install YOLOR and its dependencies on Colab
- Run it on an image
- Run it on a video

We are going to do all of this on Colab, so you can either create a new notebook of your own or get the base notebook here. So with that out of the way, let’s get started! The first thing we are going to do is mount our Google Drive to Colab so we have persistent storage just in case we get disconnected in the middle of things.

```
from google.colab import drive
drive.mount('/content/drive')
```

This will prompt a link for signing into your Google account; when you sign in, you'll get the authorization code required for mounting your Drive to Colab. Now we can get started with setting up YOLOR for inference. First, let's clone the YOLOR GitHub repo and navigate into the newly created directory.

```
!git clone https://github.com/augmentedstartups/yolor
%cd yolor
```

From within the directory, we will install the requirements.

```
!pip install -qr requirements.txt
```

Once that’s finished we’ll need to install two more things before we can run the model:

- Mish-Cuda: The PyTorch CUDA implementation of the self-regularized mish activation function
- Pytorch Wavelets: Python module for computing 2D discrete wavelet and the 2D dual-tree complex wavelet transforms using PyTorch.

You need not concern yourself with the details of these requirements unless you want to tinker with the architecture of YOLOR. Just ensure that both of these are installed inside the `yolor` directory.

```
# Installing Mish CUDA
!git clone https://github.com/JunnYu/mish-cuda
%cd mish-cuda
!python setup.py build install
# Moving back to the yolor directory
%cd ..
# Installing PyTorch Wavelets
!git clone https://github.com/fbcotter/pytorch_wavelets
%cd pytorch_wavelets
!pip install .
```

Now that we have set everything up, all we need to do is download the pre-trained models using the bash script provided in the `scripts` sub-directory.

```
!bash scripts/get_pretrain.sh
```

With that, the setup is complete and we can finally use the models to detect objects in images and videos. Let's start with images. The repo already includes one image for testing, but you can always upload your own. Make sure to set the `--source` option to the path of the image you want to use, and the `--output` option to specify where you want the output image to be stored.

```
!python detect.py --source inference/images/horses.jpg --cfg cfg/yolor_p6.cfg --weights yolor_p6.pt --conf 0.25 --img-size 1280 --device 0 --output /content/drive/MyDrive/YOLOR-Output
```

To run inference on videos, all you need to do is pass the path of a video to the `--source` option.

```
!python detect.py --source test.mp4 --cfg cfg/yolor_p6.cfg --weights yolor_p6.pt --conf 0.25 --img-size 1280 --device 0 --output /content/drive/MyDrive/YOLOR-Output
```

Feel free to play around with the confidence value (`--conf`) and different YOLOR variants. Just keep in mind that you'll need to change both the config file (`--cfg`) and the weights (`--weights`) options.

Do you want to learn more about YOLOR and how it can be used to make awesome computer vision applications like the one shown above? Enroll in AugmentedStartup's YOLOR course HERE today! It is a comprehensive course on YOLOR that covers YOLOR and object detection fundamentals, implementation, and building various applications, as well as integrating models with a Streamlit UI for building your own web apps.


Simply put, transformations are geometric distortions applied to an image. These distortions can be simple things like resizing, rotation, and translation. However, there are also more complex distortions, such as the warp transformation, which is used to correct perspective issues arising from the point of view an image was captured from. Transformations are classified into two broad categories in computer vision: affine and non-affine transformations.

**Affine transformations** include things such as scaling, rotation, and translation. The key point to remember is that lines that were parallel in the original image remain parallel in the transformed image. So whether you scale, rotate, or translate an image, the parallelism between lines is maintained.

Non-affine transformations are very common in computer vision, and they originate from different camera angles. Take for instance the illustration above, on the left, you're looking at the square from a top-down perspective. As you slowly start to move the camera downwards your view will become skewed. The points that are further from you will start to appear closer together than the points closest to you.

Now let's actually see how we can apply these transformations in OpenCV.

Translations are very simple: they just move an image in one direction, which can be up, down, or even diagonally. To perform translations we use OpenCV's `cv2.warpAffine` function, which requires a translation matrix \( T \). Without getting into too much geometry, the translation matrix takes the following form:

\[ T = \begin{equation} \begin{bmatrix} 1 & 0 & T_x \\ 0 & 1 & T_y \\ \end{bmatrix} \end{equation} \]

Here

- \( T_x \) : represents shift along x-axis
- \( T_y \) : represents shift along y-axis

Now let's actually apply the transformation to an image. We are going to translate an image down diagonally, we'll do so by shifting it a quarter of its height and width.

```
import cv2
import numpy as np
image = cv2.imread('images/ripple.jpg')
height, width = image.shape[:2]
quarter_height, quarter_width = height/4, width/4
# making the translation matrix
T = np.float32([[1, 0, quarter_width], [0, 1,quarter_height]])
# warpAffine takes as argument the image, translation matrix T
# and the dimensions of the output image
img_translation = cv2.warpAffine(image, T, (width, height))
cv2.imshow('Original', image)
cv2.imshow('Translation', img_translation)
cv2.waitKey()
cv2.destroyAllWindows()
```

Rotations are pretty simple too, but there are some quirks. Like translation, rotation is done using the `cv2.warpAffine` function, but instead of passing in a translation matrix \( T \), we pass a rotation matrix \( M \) that can easily be created using the `cv2.getRotationMatrix2D` function.
\[
M =
\begin{equation}
\begin{bmatrix}
\cos\theta & - \sin\theta \\
\sin\theta & \cos\theta \\
\end{bmatrix}
\end{equation}
\]

Here \( \theta \) is the angle of rotation.

`cv2.getRotationMatrix2D` takes three arguments: a tuple `(x, y)` representing the coordinates of the pivot point, the angle of rotation \( \theta \), and a float `scale` value.
"What's the need for a scale value?" That's what you're thinking, right? Well, let's try rotating our image by 45 degrees about its center and see what happens.

```
image = cv2.imread('images/ripple.jpg')
height, width = image.shape[:2]
# dividing the dimensions by two to rotate the image around its centre
rotation_matrix = cv2.getRotationMatrix2D((width/2, height/2), 45, 1)
rotated_image = cv2.warpAffine(image, rotation_matrix, (width, height))
cv2.imshow('Rotated Image', rotated_image)
cv2.waitKey()
cv2.destroyAllWindows()
```

Notice the black space around the image, and how parts of it are cropped out? This is exaggerated further when working with portrait or landscape images. So if we want to keep the whole image after rotating it, regardless of its initial size and orientation, scaling it down makes sense.

```
rotation_matrix = cv2.getRotationMatrix2D((width/2, height/2), 45, .7)
rotated_image = cv2.warpAffine(image, rotation_matrix, (width, height))
cv2.imshow('Rotated & Scaled Image', rotated_image)
cv2.waitKey()
cv2.destroyAllWindows()
```

If you only want to rotate the image by 90 degrees without worrying about its size and orientation, you can use the `cv2.transpose` function (strictly speaking, transposing mirrors the image across its main diagonal, which is equivalent to a 90-degree rotation plus a flip).

```
rotated_image = cv2.transpose(image)
cv2.imshow('Rotated Image - Transpose', rotated_image)
```

Another handy function worth remembering is `cv2.flip`, which flips the image based on its flip-code argument: `1` for a horizontal flip, `0` for a vertical flip, and `-1` for both.

```
# Horizontal Flip
flipped_h = cv2.flip(image, 1)
# Vertical Flip
flipped_v = cv2.flip(image, 0)
```

If you have read the first article of the series, you already have a decent understanding of how resizing works in OpenCV. To recap, you can use the `cv2.resize` function to resize an image. There are two ways of selecting the target size: you can either pass the desired dimensions as a tuple

```
resized_images = cv2.resize(image, (256, 256))
```

or scale it using the scaling factors `fx` and `fy`
```
resized_image = cv2.resize(image, None, fx=.5, fy=.5)
```

Resizing is a simple transformation for the most part; the only nuanced thing is the choice of interpolation method. Interpolation is an estimation method that creates new data points between a range of known data points. In the context of images and resizing, interpolation is used to create new data points when "zooming in" or expanding the image, and to decide which points to exclude when "zooming out" or shrinking the image. There are a bunch of interpolation methods supported by OpenCV, but comparing them is beyond the scope of this article; you can find a good comparison here.

Here's a list of all the methods and general intuition regarding when to use what:

- cv2.INTER_AREA - Good for shrinking or downsampling
- cv2.INTER_NEAREST - Fastest
- cv2.INTER_LINEAR - Good for zooming or upsampling (default)
- cv2.INTER_CUBIC - Better
- cv2.INTER_LANCZOS4 - Best

You can select the interpolation method by passing the respective flag to the `interpolation` argument.

```
img_scaled = cv2.resize(image, None, fx=2, fy=2, interpolation = cv2.INTER_CUBIC)
```

Image pyramids are a multi-scale representation of an image, they allow us to quickly scale images. Scaling down reduces the height and width of the new image by half, and similarly, scaling up increases the dimensions by a factor of 2. This is extremely useful when working with something like object detectors that scale images each time they look for an object.

You can use the `cv2.pyrUp` and `cv2.pyrDown` functions to quickly scale up and scale down images, respectively.

```
image = cv2.imread('images/ripple.jpg')
smaller = cv2.pyrDown(image)
larger = cv2.pyrUp(image)
cv2.imshow('Original', image )
cv2.imshow('Smaller ', smaller )
cv2.imshow('Larger ', larger )
```

Cropping images is fairly straightforward and you have probably done it before. It is useful for extracting regions of interest from images. Images are stored as arrays so cropping images is just a matter of slicing the arrays appropriately.

```
image = cv2.imread('images/ripple.jpg')
height, width = image.shape[:2]
# let's get the starting pixel coordinates (top left of cropping area)
start_row, start_col = int(height * .25), int(width * .25)
# now the ending pixel coordinates (bottom right)
end_row, end_col = int(height * .75), int(width * .75)
# Simply use array indexing to crop out the area we desire
cropped = image[start_row:end_row , start_col:end_col]
cv2.imshow("Original Image", image)
cv2.imshow("Cropped Image", cropped)
```

**Arithmetic operations** allow us to directly add to or subtract from the color intensities. They are computed element-wise on two arrays, and the overall effect is an increase or decrease in brightness.

```
image = cv2.imread('images/ripple.jpg')
# creating a matrix of ones, then multiplying it by a scalar of 100
# This gives a matrix with the same dimensions
# as our image with all values being 100
X = np.ones(image.shape, dtype = "uint8") * 100
# We use this to add this matrix X, to our image
# Notice the increase in brightness
added = cv2.add(image, X)
cv2.imshow("Added", added)
# Likewise we can also subtract
# Notice the decrease in brightness
subtracted = cv2.subtract(image, X)
cv2.imshow("Subtracted", subtracted)
```

One thing to note when adding to or subtracting from an image is that pixel values can't exceed 255 or go below 0. OpenCV's arithmetic operations saturate: if the value of a pixel would become 256 after addition, it is clipped to 255, and if it would become -2 after subtraction, it is clipped to 0.
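As a quick sanity check, here is a minimal NumPy sketch (the array values are made up for illustration) contrasting plain NumPy's wrap-around `uint8` arithmetic with the saturating behavior that `cv2.add` provides:

```python
import numpy as np

# A toy 1x3 "image"; 200+100 and 250+100 exceed the uint8 range
img = np.array([[100, 200, 250]], dtype=np.uint8)
X = np.full_like(img, 100)

# Plain NumPy uint8 arithmetic wraps around modulo 256
wrapped = img + X  # 200 + 100 -> 44, 250 + 100 -> 94

# cv2.add saturates instead; we can emulate that with a clip
saturated = np.clip(img.astype(np.int16) + X, 0, 255).astype(np.uint8)
```

This is why `cv2.add(image, X)` brightens an image uniformly, while naive `image + X` produces dark speckles wherever bright pixels wrap around.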

**Bitwise operations** perform logical operations like AND, OR, NOT & XOR on two images. They are used with masks as we saw in the first article of the series.

Let's create some shapes to demonstrate how these work.

```
# making a square
square = np.zeros((300, 300), np.uint8)
cv2.rectangle(square, (50, 50), (250, 250), 255, -1)
cv2.imshow("Square", square)
# making an ellipse
ellipse = np.zeros((300, 300), np.uint8)
cv2.ellipse(ellipse, (150, 150), (150, 150), 30, 0, 180, 255, -1)
cv2.imshow("Ellipse", ellipse)
```

Now let's perform some logical operations using these.

```
# shows only where the shapes intersect
bitwiseAnd = cv2.bitwise_and(square, ellipse)
cv2.imshow("AND", bitwiseAnd)
# shows where either square or ellipse is
bitwiseOr = cv2.bitwise_or(square, ellipse)
cv2.imshow("OR", bitwiseOr)
# shows where either shape exist by itself
bitwiseXor = cv2.bitwise_xor(square, ellipse)
cv2.imshow("XOR", bitwiseXor)
# shows everything that isn't part of the square
bitwiseNot_sq = cv2.bitwise_not(square)
cv2.imshow("NOT - square", bitwiseNot_sq)
```

Convolution is a mathematical operation performed on two functions (f and g) producing a third function (f \( \otimes \) g) which is a modified version of one of the original functions. This new function expresses how the shape of one is modified by the other.

In the context of computer vision, convolution is the process of adding pixel values of the image to their local neighbors, weighted by a kernel. The kernel is just a small matrix, the size and values change depending on the type of image processing operation we want to apply.

Despite being denoted by \( \otimes \), the matrix operation being performed is not the normal matrix multiplication.

\[
\left (
\begin{equation}
\begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9 \\
\end{bmatrix}
\otimes
\begin{bmatrix}
a & b & c \\
d & e & f \\
g & h & i \\
\end{bmatrix}
\end{equation}
\right ) [2,2] = a \cdot 9 + b \cdot 8 + c \cdot 7 + d \cdot 6 + e \cdot 5 + f \cdot 4 + g \cdot 3 + h \cdot 2 + i \cdot 1
\]
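To make the flipped-kernel sum concrete, here is a small NumPy check using the matrices above, with the letters a..i stood in by the numbers 1..9:

```python
import numpy as np

# The 3x3 "image" from the equation above, and a numeric stand-in
# for the kernel entries a..i (a=1, b=2, ..., i=9)
img = np.arange(1, 10).reshape(3, 3)
kernel = np.arange(1, 10).reshape(3, 3)

# Convolution flips the kernel along both axes before the
# element-wise multiply-and-sum (unlike matrix multiplication)
value = int((img * np.flip(kernel)).sum())
# equals a*9 + b*8 + ... + i*1 with a..i = 1..9
```

If we skipped the flip, we would be computing cross-correlation instead of convolution; for symmetric kernels like the blur kernels below, the two happen to coincide.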

As you may have guessed, blurring is the operation where we average the pixels within a region. Blurring blends the edges in an image and makes them less noticeable.

The following is a 5 x 5 kernel used for *average* blurring. We multiply it by 1/25 to normalize it, i.e., make it sum to 1; otherwise, the intensity of the image would increase.

\[ Kernel = \frac{1}{25} \begin{equation} \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ \end{bmatrix} \end{equation} \]

We can apply any kernel to an image using the `cv2.filter2D` function. Let's apply a 5 x 5 and a 9 x 9 kernel to an image.

```
image = cv2.imread('images/pavement.jpeg')
# creating our 5 x 5 kernel
kernel_5x5 = np.ones((5, 5), np.float32) / 25
# use cv2.filter2D to convolve the kernel with an image
blurred_5x5 = cv2.filter2D(image, -1, kernel_5x5)
cv2.imshow('5x5 Kernel Blurring', blurred_5x5)
# creating our 9 x 9 kernel
kernel_9x9 = np.ones((9, 9), np.float32) / 81
blurred_9x9 = cv2.filter2D(image, -1, kernel_9x9)
cv2.imshow('9x9 Kernel Blurring', blurred_9x9)
```

Here's an awesome animation illustrating exactly how this convolution blurring process happens (*source*).

Commonly used blurring methods and what they do:

- cv2.blur – Averages values over a specified window like the ones we implemented above
- cv2.GaussianBlur – Similar to average blurring but uses a Gaussian window (more emphasis on points around the center) \[ Gaussian Kernel = \frac{1}{273} \begin{equation} \begin{bmatrix} 1 & 4 & 7 & 4 & 1 \\ 4 & 16 & 26 & 16 & 4 \\ 7 & 26 & 41 & 26 & 7 \\ 4 & 16 & 26 & 16 & 4 \\ 1 & 4 & 7 & 4 & 1 \\ \end{bmatrix} \end{equation} \]
- cv2.medianBlur – Uses median of all elements in the window
- cv2.bilateralFilter – Blurs the image while keeping edges sharp. It also uses a Gaussian filter, but adds one more Gaussian filter that is a function of pixel difference. This pixel-difference function makes sure only pixels with intensity similar to the central pixel are considered for blurring. This helps preserve the edges, since pixels at edges have large intensity variations. It is highly effective for removing noise.

```
blur = cv2.blur(image, (3,3))
cv2.imshow('Averaging', blur)
# Instead of box filter, gaussian kernel
Gaussian = cv2.GaussianBlur(image, (7,7), 0)
cv2.imshow('Gaussian Blurring', Gaussian)
# Takes median of all the pixels under kernel area and central
# element is replaced with this median value
median = cv2.medianBlur(image, 5)
cv2.imshow('Median Blurring', median)
# Bilateral is very effective in noise removal while keeping edges sharp
bilateral = cv2.bilateralFilter(image, 9, 75, 75)
cv2.imshow('bilateral Blurring', bilateral)
```

Sharpening is the opposite of blurring, it strengthens or emphasizes the edges in an image. In the case of sharpening, our kernel matrix will usually sum to one, so there is no need to normalize.

\[ Kernel = \begin{equation} \begin{bmatrix} -1 & -1 & -1 \\ -1 & 9 & -1 \\ -1 & -1 & -1 \\ \end{bmatrix} \end{equation} \]

```
kernel_sharpening = np.array([[-1,-1,-1],
[-1,9,-1],
[-1,-1,-1]])
# applying the sharpening kernel to the image
sharpened = cv2.filter2D(image, -1, kernel_sharpening)
cv2.imshow('Image Sharpening', sharpened)
```

Perspective transformation enables us to extrapolate what an image would look like from a different camera angle. Consider the image above: it is a warped perspective of a perfectly rectangular piece of A4 paper as viewed from an angle. We can get the top-down perspective by using the `cv2.warpPerspective` function, which applies a *Perspective Transformation* matrix, \( M \). OpenCV provides a method, `cv2.getPerspectiveTransform`, that generates this \( M \) matrix for us. It takes as arguments two sets of coordinates: the four corners of the object or region of interest, and the four corners of the area we want the object/region to occupy.

It's important to remember that the second set of coordinates should correspond to sides whose ratio is similar to the ratio of the object's dimensions in real life. For example, an A4 sheet of paper has a ratio of 1:1.41, so we can use any set of coordinates as long as the lengths of the sides have a ratio of 1:1.41. It's not that the transform won't work otherwise, just that it won't come out as nice. The whole point of the perspective transform is to fix skewed images; if the ratio is wrong, we will just transform one skewed image into another.

```
image = cv2.imread('images/scan.jpg')
# coordinates of the 4 corners of the original image
points_A = np.float32([[320,15], [700,215], [85,610], [530,780]])
# coordinates of the 4 corners of the desired output
# We use the ratio of an A4 Paper 1 : 1.41
points_B = np.float32([[0,0], [420,0], [0,594], [420,594]])
# using the two sets of four points to compute
# the Perspective Transformation matrix, M
M = cv2.getPerspectiveTransform(points_A, points_B)
# warpPerspective also takes as argument
# the final size of the image
warped = cv2.warpPerspective(image, M, (420,594))
cv2.imshow('warpPerspective', warped)
```

Pretty neat! Right?

That's it for now folks! In the next article, we will take a deeper dive into thresholding, learn about edge & contour detection. We will then use what we have learned so far to automate this process further to create our own *scanner application*. See ya!

There are two broad paradigms of object detection techniques: single-stage detectors and two-stage detectors. Two-stage models like R-CNN have an initial stage, a Region Proposal Network (RPN), that is solely responsible for finding candidate regions in images along with approximate bounding boxes for them. The second-stage network uses the local features proposed by the RPN to determine the class and produce a more refined bounding box. This separation of responsibility allows the second-stage network to focus on learning features isolated to the regions of interest proposed by the RPN, which leads to improved performance.

On the other hand, single-stage approaches such as YOLO have grid cells that map to different parts of the image, and every grid cell is associated with several anchor boxes, where each anchor box predicts an objectness probability (the probability that an object is present), a conditional class probability, and bounding-box coordinates. Most grid cells and anchors see "background" or no-object regions, and only a few see a ground-truth object. This hampers the learning ability of the CNN.

Historically, this was one of the main reasons for lower accuracy for single-stage detectors compared to two-stage approaches. However, over recent years new single-stage detection models have outperformed, if not matched, their two-stage counterparts. Let’s take a look at two of the latest YOLO variants introduced this year- YOLOR and YOLOX.

Humans gain knowledge through deliberate learning (explicit knowledge), or subconsciously (implicit knowledge). The combination of the two types enables human beings to effectively process data, even unseen data. Furthermore, humans can analyze the same data points from different angles for different objectives. However, convolutional neural networks can only look at data from a single perspective, and the features output by CNNs are not adaptable to other objectives. The main cause of this is that CNNs only make use of the features from neurons, the explicit knowledge, while leaving the abundant implicit knowledge unutilized.

YOLOR is a unified network that integrates implicit knowledge and explicit knowledge. It pre-trains an implicit knowledge network with all of the tasks present in the COCO dataset to learn a general representation, i.e., implicit knowledge. To optimize for specific tasks, YOLOR trains another set of parameters that represent explicit knowledge. Both implicit and explicit knowledge are used for inference.

As mentioned earlier, YOLO models take the image and draw a grid of small squares. From these squares, they regress to predict the offset at which the bounding box should be placed. These grid cells alone give us tens of thousands of possible boxes, but YOLO models also have anchor boxes on top of the grid. Anchor boxes have varying proportions that allow the model to detect objects of different sizes in different orientations.

For example, if you have sharks and giraffes in a dataset, you would need skinny and tall anchor boxes for giraffes and wide and flat ones for sharks. Although the combination of these two enables the model to detect a wide range of objects, they also pose an issue in the form of increased computation cost and inference speed. Another limiting aspect of YOLO models is the coupling of bounding box regression and object detection tasks that causes a bit of a tradeoff.
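To illustrate the idea of regressing offsets relative to a grid cell and an anchor, here is a schematic decoding step in the style popularized by YOLOv2. The function name, anchor sizes, and stride below are hypothetical, chosen for illustration, and are not YOLOR's or YOLOX's actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, stride):
    # Center: sigmoid keeps the predicted offset inside the grid cell
    cx = (cell_x + sigmoid(tx)) * stride
    cy = (cell_y + sigmoid(ty)) * stride
    # Size: exponential correction relative to the anchor's proportions
    w = anchor_w * np.exp(tw)
    h = anchor_h * np.exp(th)
    return cx, cy, w, h

# Zero offsets: the box sits at the cell's center with the anchor's size
cx, cy, w, h = decode_box(0, 0, 0, 0, cell_x=3, cell_y=2,
                          anchor_w=40, anchor_h=80, stride=32)
```

This sketch also makes the cost visible: with a 13x13 grid, 5 anchors per cell, and one decode per anchor, the model produces hundreds of candidate boxes per image, which is exactly the overhead anchor-free designs like YOLOX aim to remove.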

YOLOX addresses both of these limitations, it drops box anchors altogether, this results in improved computation cost and inference speed. YOLOX also decouples the YOLO detection head into separate feature channels for box coordinate regression and object classification. This leads to improved convergence speed and model accuracy.

Despite offering significant performance and speed boosts to existing state-of-the-art YOLO variants, YOLOX falls short of YOLOR in terms of sheer accuracy/mAP. However, YOLOX does show a lot of promise in terms of bringing better models to edge devices. Even with smaller model sizes, YOLOX-Tiny and YOLOX-Nano outperform their counterparts—YOLOv4-Tiny & NanoDet—significantly, offering a boost of 10.1% and 1.8% respectively.

On the object detection leaderboard for the COCO dataset, the only YOLO variant that comes close to YOLOR is Scaled-YOLOv4, and YOLOR is a whopping 88% faster than it! In addition to improved performance on object detection tasks, YOLOR's unified network is very effective for multi-task learning. This means the implicit knowledge learned by the model can be leveraged to perform a wide range of tasks beyond object detection, such as keypoint detection, image captioning, pose estimation, and many more. Furthermore, YOLOR can be extended to multi-modal learning like CLIP, enabling it to expand its implicit knowledge base even further and leverage other forms of data such as text and audio.

If you enjoyed this introduction to state-of-the-art object detection techniques and want to learn how to build real-world applications using YOLOR, you can enroll in AugmentedStartup's YOLOR course here.


Feature descriptor algorithms take an image and compute feature descriptors/vectors. These features act as a sort of numerical "fingerprint" that can be used to differentiate one feature from another. The Histogram of Oriented Gradients (HOG) algorithm counts the occurrences of gradient orientations in localized portions of an image. It divides the image into small connected regions called cells, and for the pixels within each cell, the HOG algorithm calculates the image gradient along the x-axis and y-axis.

These gradient values are mapped to 0-255: pixels with large negative changes appear black, pixels with large positive changes appear white, and pixels with little or no change appear grey. Using the x and y components, the final gradient is calculated by performing vector addition.

Let’s say we are using 8x8-pixel cells. After obtaining the final gradient direction and magnitude for the 64 pixels, each cell is split into angular bins. Each bin corresponds to a range of gradient directions, with 9 bins of 20° covering 0-180°. This enables the Histogram of Oriented Gradients (HOG) algorithm to reduce 64 vectors to just 9 values. HOG is generally used in conjunction with classification algorithms like Support Vector Machines (SVMs) to perform object detection.
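A minimal NumPy sketch of the per-cell histogram step described above. Real HOG implementations additionally interpolate each vote between neighboring bins and normalize over blocks of cells, which this simplified version skips:

```python
import numpy as np

def hog_cell_histogram(cell, n_bins=9):
    """Orientation histogram for one cell: 9 unsigned bins of 20 degrees each."""
    gx = np.gradient(cell, axis=1)                # gradient along x
    gy = np.gradient(cell, axis=0)                # gradient along y
    magnitude = np.hypot(gx, gy)                  # vector addition of the two
    angle = np.degrees(np.arctan2(gy, gx)) % 180  # unsigned orientation
    bins = (angle // (180 / n_bins)).astype(int) % n_bins
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), magnitude.ravel()):
        hist[b] += m  # each pixel votes with its gradient magnitude
    return hist

# An 8x8 cell whose intensity rises left to right: a pure horizontal
# gradient, so all 64 magnitude votes land in the 0-20 degree bin
cell = np.tile(np.arange(8, dtype=float), (8, 1))
hist = hog_cell_histogram(cell)
```

Concatenating these 9-value histograms over all cells yields the fixed-length descriptor that a downstream classifier such as an SVM consumes.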

Plain Convolutional Neural Network (CNN) classifiers are not able to handle multiple instances of an object, or multiple objects, in an image. To work around this limitation, R-CNN first performs selective search to extract many different-sized region proposals from the input image. Each of these region proposals is labeled with a class and a ground-truth bounding box. A pre-trained CNN is used to extract features for the region proposals through forward propagation. Next, these features are used to predict the class and bounding box of each region proposal using SVMs and linear regression.

R-CNN selects thousands of region proposals and independently propagates each of these through a pre-trained CNN. This slows it down considerably and makes it harder to use for real-time applications. To overcome this bottleneck Fast R-CNN performs the CNN forward propagation once on the entire image.

Both R-CNN and Fast R-CNN produce thousands of region proposals, most of which are redundant. Faster R-CNN reduces the total number of region proposals by using a region proposal network (RPN) instead of selective search, further improving speed.

The R-CNN family of object detectors can be divided into two subnetworks by the Region-of-Interest (ROI) pooling layer:

- a shared, “fully convolutional” subnetwork independent of ROIs
- an ROI-wise subnetwork that does not share computation.

ROI pooling is followed by fully connected (FC) layers for classification and bounding box regression. The FC layers after ROI pooling do not share among different ROIs and take time. This makes R-CNN approaches slow, and the fully connected layers have a large number of parameters. In contrast to region-based object detection methods, Region-based Fully Convolutional Network (R-FCN) is fully convolutional with almost all computation shared on the entire image.

R-FCN still uses an RPN to obtain region proposals, but in contrast to the R-CNN family, the fully connected layers after ROI pooling are removed. Rather, all major computation is done before ROI pooling to generate the score maps. After the ROI pooling, all the region proposals use the same set of score maps to perform average voting. So, there is no learnable layer after the nearly cost-free ROI layer. This reduces the number of parameters significantly, and as a result, R-FCN is faster than Faster R-CNN with competitive mAP.

Single Shot Detector (SSD) has two parts:

- a backbone model for extracting features
- an SSD convolutional head for detecting objects.

The backbone model is a pre-trained image classification network (like ResNet) from which the last fully connected classification layer has been removed. This leaves us with a deep neural network that can extract semantic meaning from the input image while preserving the spatial structure of the image. The SSD head is simply one or more convolutional layers added to this backbone model. The outputs are interpreted as the bounding boxes and classes of objects in the spatial location of the activations of the final layers.

CNNs require a fixed-size input image, which limits both the aspect ratio and the scale of the input. When used with arbitrarily sized images, CNNs fit the input image to the fixed size via cropping or warping. However, cropping might result in the loss of some parts of the object, while warping can lead to geometric distortion. Furthermore, a pre-defined scale does not work well with objects of varying scales.

CNNs consist of two parts: the convolutional layers and the fully connected layers. The convolutional layers work in a sliding-window manner, do not require a fixed image size, and can generate feature maps of varying sizes. The fully connected layers, or any other classification algorithm like an SVM for that matter, require fixed-size input. Hence, the fixed-size constraint comes only from the fully connected layers.

Such input vectors can be produced by the Bag-of-Words (BoW) approach that pools the features together. Spatial Pyramid Pooling (SPP) improves upon BoW, it maintains spatial information by pooling in local spatial bins. These bins have sizes proportional to the image size, this means the number of bins remains unchanged regardless of the image size. To use any deep neural network with images of arbitrary sizes, we simply replace the last pooling layer with a spatial pyramid pooling layer. This not only allows arbitrary aspect ratios but also allows arbitrary scales.
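The fixed-length property is easy to see in a toy, single-channel sketch of the pooling step. The bin counts and pyramid levels here are illustrative, not the paper's exact configuration:

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool one channel into bins proportional to its size and
    concatenate: output length is sum(l*l for l in levels) for any input."""
    h, w = feature_map.shape
    pooled = []
    for level in levels:
        # bin edges scale with the input dimensions
        ys = np.linspace(0, h, level + 1).astype(int)
        xs = np.linspace(0, w, level + 1).astype(int)
        for i in range(level):
            for j in range(level):
                pooled.append(feature_map[ys[i]:ys[i+1], xs[j]:xs[j+1]].max())
    return np.array(pooled)

# Two feature maps of different sizes produce the same output length
a = spatial_pyramid_pool(np.random.rand(13, 17))
b = spatial_pyramid_pool(np.random.rand(32, 32))
```

With levels (1, 2, 4), both vectors have length 1 + 4 + 16 = 21 per channel, which is exactly what lets the fully connected layers accept images of arbitrary size.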

SPP-net computes the feature maps from the entire image once and then pools the features in arbitrary regions to generate fixed-length representations for the detector. This avoids repeatedly computing the convolutional features. SPP-net is faster than the R-CNN methods while achieving better accuracy.

You Only Look Once (YOLO) detectors take the image and draw a grid of small squares. From these squares, they regress to predict the offset at which the bounding box should be placed. On top of these grid cells, YOLO models have anchor boxes with varying proportions that enable the model to detect objects of different sizes in different orientations. For example, if you have cars and traffic lights in a dataset, you would need skinny and tall anchor boxes for traffic lights and wide and flat ones for cars. YOLO uses a single deep neural network to predict both the bounding box and the class of each object.

This list is by no means exhaustive, there are a plethora of object detection techniques beyond the ones mentioned here. These are just the ones that have seen widespread recognition and adoption so far.


This post is a part of my crossposts series, which means I wrote it for someone else; the original article has been referred to in the canonical URL and you can read it here.

In this article, we will go through the *Frequentist* interpretation of probability as it is easier to understand from a beginner’s point of view. You can read about the other interpretations of probability here.

The most important questions of life are indeed, for the most part, really only problems of probability. — Pierre-Simon Laplace

Events are one of the most basic concepts in statistics, simply put they are the results of experiments or processes. An event can be certain, impossible, or random.

An experiment is defined as the execution of certain actions within a set of conditions.

An event is said to be certain if it will occur at every performance of the experiment. Take, for instance, the event of getting a number less than 7 when throwing a die. On the other hand, impossible events are events that will not occur as a result of the experiment. For example, the event of getting a 7 when throwing a die.

And lastly, an event is called a random event if it may or may not occur as a result of the experiment. These are outcomes of random factors whose influence cannot be predicted accurately, if it can be predicted at all. Take the case of a die roll: the experiment is influenced by random factors like the shape and physical characteristics of the die, the strength and method of the throw, and environmental factors such as air resistance. Accurately predicting factors like these is practically impossible.

Let’s go more in-depth in analyzing a random event, for instance, a fair coin toss. A fair experiment is one in which all outcomes are equally likely. There are two possible outcomes of a coin toss — heads or tails. The outcome of the flip is considered random because the observer cannot analyze and account for all the factors that influence the result. Now, if I were to ask you what is the probability of the toss resulting in heads, you would probably say ½, but why?

Let’s call the event of the coin coming up heads A, and say the coin is tossed `n` times. The probability of event A can then be defined as its relative frequency in a long series of experiments. When all outcomes are equally likely, this reduces to the classical formula:

```
Probability of event A = Number of ways A can happen / Total number of outcomes
```

Let’s take another example, a deck of playing cards. There are four Aces in a deck, what is the probability of drawing an Ace from the deck?

Total number of outcomes: 52 (total number of cards)

Number of ways the event can happen: 4

So the probability becomes 4/52, or 1/13

Now you might say that the chances of drawing an ace or getting tails don’t always match these expectations, and you have good reason to say so. If we were to conduct a series of tests, the observed frequency of an event A would fluctuate around a constant value `P(A)` for large `n`. This is the value we call the probability of event A. It basically means that if we carried out the experiment a sufficiently large, ideally infinite, number of times `n`, we would get the event A about `P(A) x n` times.
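This long-run behavior is easy to check with a quick simulation. Here is a minimal sketch using Python's `random` module; the `heads_frequency` helper is my own:

```python
import random

def heads_frequency(n, seed=42):
    """Toss a fair coin n times and return the observed frequency of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n))
    return heads / n

# the observed frequency settles around P(A) = 0.5 as n grows
for n in (10, 1_000, 100_000):
    print(n, heads_frequency(n))
```

For small `n` the frequency can wander quite far from 1/2; for large `n` it stays very close to it.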

The probability of every event lies in the range `[0, 1]`: a probability of 0 indicates impossibility, and 1 indicates certainty.

Two events A and B are said to be independent if the occurrence of one of them does not influence the probability of the other. For example, knowing that a coin toss resulted in heads on the first toss doesn’t provide any useful information for predicting the outcome of the second toss. The probability of heads or tails remains 1/2 regardless of the outcome of the previous toss. The probabilities of independent events are multiplied to get the probability of all of them occurring together.

Let’s illustrate this with an example: What is the probability of getting tails three times in a row?

Firstly, let’s calculate this using our original formula of probability.

Possible outcomes of 3 coin tosses: HHH, HHT, HTH, THH, TTT, TTH, THT, HTT

Number of ways the event can happen: 1

Total number of outcomes: 8

The probability is ⅛, but we know that the outcomes of consecutive coin tosses are independent so we can simply multiply them to get the probability of TTT: `P(T) x P(T) x P(T) = ½ x ½ x ½ = ⅛`
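We can verify both routes, counting outcomes and multiplying independent probabilities, by enumerating the sample space with `itertools` (a small illustrative sketch):

```python
from itertools import product

outcomes = list(product("HT", repeat=3))  # all 8 equally likely sequences
favourable = [o for o in outcomes if o == ("T", "T", "T")]
prob = len(favourable) / len(outcomes)

print(prob)              # 0.125
print(prob == 0.5 ** 3)  # True: independence gives the same answer
```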

On the other hand, if the occurrence of one event changes the probability of the other, the events are said to be dependent. Knowing that the first card drawn from a deck is a Jack changes the probability of drawing another Jack from 4/52 to 3/51. This is precisely why counting cards is a thing.

Disjoint events, also known as mutually exclusive events, cannot happen at the same time. For example, the outcome of a coin toss cannot be heads and tails at the same time. On the flip side, events that are not disjoint can happen at the same time and can thus overlap. To avoid double-counting in such cases, the probability of the overlap is subtracted from the total probability.

Take for instance the probability of drawing a Queen or a Spades card from a deck. There are two counts to consider: 4 ways to draw a Queen and 13 ways to draw a Spade. However, one card is counted in both: the Queen of Spades.

So `P(Queen or Spades) = P(Queen) + P(Spades) - P(Queen of Spades) = 4/52 + 13/52 -1/52 = 4/13`

The joint probability of two events A and B is the probability of them occurring at the same time. For independent events, it is obtained by multiplying the probabilities of the two events. For example, the probability of drawing a black Ace is `P(Ace ∩ black) = 2/52 = 1/26`. Since the rank and color of a card are independent, this can also be calculated using the formula: `P(Ace ∩ black) = P(Ace) x P(black) = 4/52 x 26/52 = 1/26`.

The symbol “∩” in joint probability indicates an intersection; the probability of events A and B occurring together is the same as the intersection of the sets A and B. This can be best visualized with the help of Venn diagrams.
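These set operations can be checked directly by enumerating a deck of cards; the `prob` helper below is my own:

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["Spades", "Hearts", "Diamonds", "Clubs"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

def prob(event):
    """Probability of an event as favourable cards / total cards."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

# union: the Queen of Spades overlap is handled automatically,
# because each card is only counted once
print(prob(lambda c: c[0] == "Q" or c[1] == "Spades"))              # 4/13
# intersection: black Ace
print(prob(lambda c: c[0] == "A" and c[1] in ("Spades", "Clubs")))  # 1/26
```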

Marginal probability is the probability of an event occurring irrespective of what else may or may not have happened. It can be thought of as unconditional probability, as it is not affected by other events.

Conditional probability, written `P(A|B)`, is the probability of event A given that event B has occurred, or simply given B. It is given by
`P(A|B) = P(A ∩ B) / P(B)`

Let’s illustrate this using an example: In a class of 100 students, 23 students like to play both football and basketball and 45 of them like to play basketball. What is the probability that a student who likes playing basketball also likes football?

`P(football | basketball) = P(football ∩ basketball) / P(basketball)`

`P(football | basketball) = .23/.45 = .51`
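A minimal sketch of this calculation; the `conditional` helper is my own:

```python
def conditional(p_a_and_b, p_b):
    """P(A|B) = P(A ∩ B) / P(B)."""
    return p_a_and_b / p_b

# 23 of 100 students like both sports, 45 of 100 like basketball
print(round(conditional(0.23, 0.45), 2))  # 0.51
```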

That’s it for this one, if some of these concepts are still a bit blurry, don’t worry. It will start to make more sense when we talk about these in the context of things like the Bayes theorem and probability distributions in later blogs.

Object detection is a computer vision task that aims to identify and locate objects within an image or video. Specifically, it draws bounding boxes around detected objects, enabling us to locate where the objects are and track how they move. It is often confused with image classification, so before we proceed, it's crucial that we clarify the distinctions between them.

Image classification merely assigns a class to an image. A picture of a single dog and a picture of two dogs are treated similarly by classification models and they both receive the label "dog". On the other hand, the object detection model draws a box around each dog and labels the box "dog". Here's an example of what this difference looks like in practice:

Image Classification vs Object Detection

Given object detection's unique capability of locating and identifying objects, it can be applied in all kinds of tasks:

- Face detection
- Self-driving cars
- Video surveillance
- Text Extraction
- Pose estimation

This isn't an exhaustive list; it just introduces some of the primary ways in which object detection is helping shape the future of computer vision.

Now that we know a bit about what object detection is, and what it can be used for, let's understand how it actually works. In this section, we will take a look at the two broad paradigms of object detection models, and then explain the flow of the object detection process by elaborating on YOLO, a popular object detection model.

Two-stage models have two parts or networks: the initial stage consists of a region proposal model that extracts possible object regions from an image. A prime example of a two-stage model is R-CNN, which uses selective search to extract thousands of object regions for the second stage. The second-stage model uses the local features of the proposed object regions to classify the object and to obtain a refined bounding box.

This separation of responsibility enables the second stage model to focus solely on learning the distribution of features for the object(s), making for a more accurate model overall. This added accuracy, however, comes at the cost of computational efficiency and thus two-stage detectors can't perform real-time detection on most systems.

Single-stage detectors, on the other hand, divide the input image into grid cells, and every grid cell is associated with several bounding boxes. Each bounding box predicts an objectness probability (the probability of the presence of an object) and a conditional class probability. Most grid cells and bounding boxes see "background" or no-object regions, and only a few see a ground-truth object. This hampers the CNN's ability to learn the distribution of features for the objects, leading to poorer performance, especially for small objects. On the flip side, the model generalizes well to new domains.

The YOLO - You Only Look Once - network uses features from the entire image to predict the bounding boxes, moreover, it predicts all bounding boxes across all classes for an image simultaneously. This means that YOLO reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

YOLO divides the input image into an S × S grid. If a grid cell contains the center of an object, that cell is responsible for detecting it. Each grid cell predicts B bounding boxes and confidence scores associated with the boxes. The confidence scores indicate how confident the model is that the box contains an object, and additionally how accurate the box is. Formally, the confidence is defined as `P(Object) x IOU(pred, truth)`. Simply put, if no object exists in a cell, the confidence score becomes zero because `P(Object)` is zero. Otherwise, the confidence score is equal to the intersection over union (IOU) between the predicted box and the ground truth.

Each of these B bounding boxes consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width (w) and height (h) are predicted relative to the whole image. The confidence prediction represents the IOU between the predicted box and any ground truth box.
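The IOU term can be sketched as a small function. For simplicity, the boxes here are assumed to be in corner form `(x1, y1, x2, y2)`, even though YOLO itself predicts centers and sizes:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) corner form."""
    # corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # clamp at zero so non-overlapping boxes get zero intersection
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# overlap area 1, union 4 + 4 - 1 = 7
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 0.14285714285714285
```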

Each grid cell also predicts C conditional class probabilities, P(Class-i |Object). These probabilities are conditioned on the grid cell containing an object. Regardless of the number of associated bounding boxes, only one set of class probabilities per grid cell is predicted.

Finally, the conditional class probabilities and the individual box confidence predictions are multiplied. This gives us the class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
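With hypothetical numbers, that multiplication looks like this:

```python
# hypothetical outputs for one grid cell of a 3-class detector
class_probs = [0.7, 0.2, 0.1]  # P(Class_i | Object), one set per grid cell
box_confidence = 0.9           # P(Object) x IOU for one predicted box

# class-specific confidence scores for that box
scores = [p * box_confidence for p in class_probs]
print([round(s, 2) for s in scores])  # [0.63, 0.18, 0.09]
```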

YOLO performs classification and bounding box regression in one step, making it much faster than most CNN-based approaches. For instance, YOLO is more than 1000x faster than R-CNN and 100x faster than Fast R-CNN. Furthermore, its improved variants such as YOLOv3 achieved 57.9% mAP on the MS COCO dataset. This combination of speed and accuracy makes YOLO models ideal for complex object detection scenarios.

Do you want to solidify your knowledge of object detection and convert it into a marketable skill by making awesome computer vision applications like the one shown above? Enroll in AugmentedStartup's YOLOR course here today! It is a comprehensive course on YOLOR that covers not only the state-of-the-art YOLOR model and object detection fundamentals, but also the implementation of various use-cases and applications, as well as integrating models with a web UI for deploying your own YOLOR web apps.

OpenCV offers the `cv2.VideoCapture` class for working with videos. The `VideoCapture(source)` constructor takes the video source as its argument, which can take several forms:

- `0` or `1`: These `int` values correspond to the primary and secondary camera on your system. If there's only one camera, `1` will return a blank window.
- `"path"`: We can also pass the path to a local video file.
- `"IP_address"`: Another possible video source is a live stream from a remote CCTV camera or IoT device.

Whatever video source we work with, the video is read frame by frame by the `VideoCapture.read()` method. `read()` returns a tuple of two values: the first is a boolean that signifies whether the frame was read successfully, and the second is the frame itself.

Let's try it out!

```
import cv2

# create the VideoCapture object with the relevant source argument
cap = cv2.VideoCapture(0)

# read through the frames sequentially
while True:
    ret, frame = cap.read()
    if ret:
        smaller_frame = cv2.resize(frame, (0, 0), fx=0.5, fy=0.5)
        cv2.imshow('IP Cam Stream', smaller_frame)
    else:
        # exit if the video finishes
        break
    # break out of the loop if the ESC key is pressed
    if cv2.waitKey(1) == 27:
        break

# release the camera resource so other processes can use it
cap.release()
cv2.destroyAllWindows()
```

Now, with a local file.

```
cap = cv2.VideoCapture("video/clownfish.mp4")
```

What if you don't have a webcam? Or just a low-res one? No worries, you can just use your phone! There is a wide range of apps—for both Android and iOS—that allow you to access your phone's camera output over the internet. You just need to install the app, open it, give it the camera permissions and you're good to go. It's actually scary to see how easy it is!

I have personally used DroidCam and IP Webcam and I have mentioned the IP address routes for both of them in the comments.

```
# DroidCam: "IP address:PORT/mjpegfeed"
# IP Webcam: "IP address:PORT/video"
cap = cv2.VideoCapture("https://192.168.1.4:8080/video")
```

Facts are stubborn things, but statistics are pliable ~ Mark Twain

Let's say we are working as data analysts at a fintech start-up and we've just been handed the credentials to their database. What do we do with it? Do we just start making a bunch of fancy reports about the data? No. First, we need to understand what the data can tell us about the business and see if this information can be leveraged to solve any existing problems.

Another thing we can do is draw insights from the data, but what are insights?

Insight is valuable knowledge obtained by analyzing data. This definition is rather vague and broad, and that is so for a reason. A lot of things can fall under the category of insights - what ad campaign will be more effective, what portions of a supply chain are slowing down the logistics network, what category of expenses is the most wasteful, what is the likelihood of a defective product in a particular manufacturing batch, etc.

But the problem is that in most cases it is not possible to understand something by simply glancing at the data. Even while working with a thousand data points, it is not feasible to go through each of them individually looking for patterns. To make the process of drawing insights and making assumptions easier, the data needs to be summarized by a set of easily interpretable values or attributes.

That is where descriptive statistics come in: they are used to condense the information into a few easy-to-use data points. There are two sets of descriptive statistics -

- Measures of Central Tendency: They are used to summarize the most common or most "central" value of an attribute. For example, the average height or weight of a group of people.
- Measures of Dispersion: They describe the variations in an attribute. For example, the normal blood pressure or heartbeat range.

Before we get into the different measures let's quickly understand a very fundamental classification of data: Qualitative vs Quantitative

**Quantitative Data**, as the name suggests, consists of mathematical values that indicate a quantity, amount, or measurement of a property. When we deal with quantitative measures, the numbers mean themselves. That is, there is no additional information required: 4.2 is 4.2 and 100 is 100.

A **discrete** scale is quantitative but does not cover the whole number line; it takes only separate, countable values. Let's take the number of siblings as an example — we may have 1 sibling, 3 siblings, 5 siblings, or even 10, but we cannot have 0.5 or 4.25.

A **continuous** scale, on the other hand, covers the whole number line: it can take any value from -∞ to +∞, including fractional values. For example, we can measure time in days, hours, seconds, milliseconds, and so on. A continuous scale is defined over all possible values.

**Qualitative Data** reflects the properties or qualities of objects. Here, numbers don't mean themselves; they signify some qualities or properties of objects. In other words, they serve as markers for some categories. For example, let's say we compare people with different colors of eyes. We can encode people with blue eyes by 1, black eyes by 2, and so on. Here 1, 2, 3, ... don't mean anything except that they denote these categories.

Qualitative variables are divided into nominal and ordinal types. Let's take a closer look at what each of these types means, starting with nominal variables. The only information **nominal variables** contain is information about an object belonging to a certain category or group. It means that these variables can only be measured in terms of belonging to some significantly different classes, and you will not be able to determine the order of these classes.

For example, we earlier considered the example of people with different colors of eyes — blue eyes, green eyes, brown eyes. These categories will all be nominal variables — there is no order in these values.

**Ordinal variables** differ slightly from nominal variables by the fact that there is an order to the category, a preference. So, values not only divide objects into classes or groups but also order them. For example, we have grades at school — A, B, C, D, F. And in this case we can say for sure that the person who got an A is probably more prepared for the test than the person who received an F. In this case, we cannot say to what extent, but we can say for sure that A is better than D.

To understand the various measures of central tendency let's create some data that we can use to illustrate the different measures.

```
data = [5, 6, 8, 9, 1, 2, 8, 4, 6, 2, 4, 8, 6, 1]
```

Mean, or more specifically the arithmetic mean, is the first statistical measure we are taught in school. It is the sum of the values divided by the number of values:

```
def mean(x):
    return sum(x) / len(x)

mean(data)
# 5.0
```

The median for the data arranged in increasing order is defined as:

- If *n* is an odd number: the middle value
- If *n* is an even number: the mean of the two middle values

```
def median(data):
    n = len(data)
    sorted_data = sorted(data)
    midpoint = n // 2
    if n % 2 == 1:
        # if odd, return the middle value
        return sorted_data[midpoint]
    else:
        # if even, return the average of the two middle values
        low = midpoint - 1
        high = midpoint
        return (sorted_data[low] + sorted_data[high]) / 2

median(data)
# 5.5
```

The mode is the most commonly occurring data value; a data set may have more than one mode.

```
from collections import Counter

def mode(x):
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items() if count == max_count]

mode(data)
# [6, 8]
```

Consider the distribution of household income in the U.S. Commonly, when we discuss income or want to compare the income of two places, we use the median income. This is the income level below which half of the households earn. This is indicated on the plot as about $50,000 (in 2010). This is a useful measure of the general "location" or "central tendency" of the data. You can imagine, if we wanted to discuss the general income of, say, New Jersey vs. Kansas, that the median is a useful statistic. The median is useful for skewed data, like this income data, or for data with outlying values.

The mode isn't indicated on the plot, but it is clearly the $15000 - $19999 category. This tells you something different about the data, basically that more than any other category, more households earn $15000 - $20000.

Finally, the mean is useful for symmetric data. In skewed data, it is influenced by outlying values. So, in the case of this income data, it would likely be considerably higher than the 50,000 median, because the few households with extremely high incomes pull up the average. This might misrepresent the general income level.
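A toy example with hypothetical incomes makes the skew effect concrete:

```python
incomes = [30, 35, 40, 45, 50, 60, 1000]  # hypothetical incomes, in $1000s

mean_income = sum(incomes) / len(incomes)
median_income = sorted(incomes)[len(incomes) // 2]

# one extreme earner pulls the mean far above the "typical" household
print(mean_income, median_income)  # 180.0 45
```

The median of 45 describes the typical household far better than the mean of 180, which no household actually earns anything close to.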

A general framework for deciding when to use what:

| Type of Variable | Optimal Measure of Central Tendency |
| --- | --- |
| Nominal | Mode |
| Ordinal | Median |
| Quantitative (not skewed) | Mean |
| Quantitative (skewed) | Median |

The range is the difference between the smallest and the largest data points in the sample.

```
max(data) - min(data)
8
```

Variance measures how far data points are distributed from the mean of the sample. It is the average of the squared differences from the Mean.

```
import numpy as np
np.var(data)
7.0
```
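The same value can be computed from scratch following the definition above (a small sketch of what `np.var` computes by default):

```python
data = [5, 6, 8, 9, 1, 2, 8, 4, 6, 2, 4, 8, 6, 1]

def variance(x):
    """Average of the squared differences from the mean (population variance)."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

print(variance(data))  # 7.0
```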

Standard deviation is the square root of the variance.

```
import numpy as np
np.std(data)
2.6457513110645907
```

A question now arises, if the standard deviation is just the square root of the variance why do we need to calculate it? After all, the variance does a decent enough job of describing the distribution of the data. Here's why:

The variance of a data set measures the mathematical dispersion of the data relative to the mean. However, though this value is theoretically correct, it is difficult to apply in a real-world sense because the values used to calculate it were squared. The standard deviation, as the square root of the variance, gives a value that is in the same units as the original values, which makes it much easier to work with and easier to interpret in conjunction with the concept of the normal curve.

Measuring the dispersion of height (cm) in terms of area (cm^2) doesn't make sense now, does it? There are other reasons (and uses) for calculating the standard deviation; we'll talk about these later when discussing statistical tests and normal distributions.

What's the difference between variance and standard deviation?

**Standard Deviation for Population**

The population standard deviation divides the sum of squared deviations by *n*; for a sample, we divide by *n-1* instead. Similarly, we use "*n-1*" when calculating the variance using sample data.

Steps to calculate the standard deviation for a sample:

- Compute the square of the difference between each value and the sample mean.
- Add those values up.
- Divide the sum by n-1.
- Take the square root to obtain the Standard Deviation.

In step 1, we compute the difference between each value and the mean of those values. We don't know the true mean of the population; all we know is the mean of the sample. Except for the rare cases where the sample mean happens to equal the population mean, the data will be closer to the sample mean than it will be to the true population mean. So, the value you compute in step 2 will probably be a bit smaller (and can't be larger) than what it would be if we used the true population mean in step 1. To make up for this, we divide by n-1 rather than n.

This use of ‘n-1’ is called Bessel’s correction method.
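A from-scratch sketch comparing the two divisors (population `n` vs. sample `n-1`):

```python
data = [5, 6, 8, 9, 1, 2, 8, 4, 6, 2, 4, 8, 6, 1]
n = len(data)
m = sum(data) / n
squared_diffs = sum((v - m) ** 2 for v in data)

pop_std = (squared_diffs / n) ** 0.5           # divide by n
sample_std = (squared_diffs / (n - 1)) ** 0.5  # Bessel's correction: n - 1

print(pop_std)     # 2.6457513110645907, same as np.std(data)
print(sample_std)  # slightly larger, compensating for the underestimate
```

With NumPy, the same distinction is controlled by the `ddof` argument: `np.std(data)` uses `n`, while `np.std(data, ddof=1)` applies Bessel's correction.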

Sample Standard Deviation vs. Population Standard Deviation

Descriptive statistics provide us with information about the data sample we already have. For example, we could calculate the mean and standard deviation of SAT results for 50 students, and this could give us information about this group of 50 students. But very often we are not interested in a specific group of students, but in students in general, for example, all students in the US. The set of all the data we are interested in is called a population.

Very often it happens that we do not have access to the entire population we are interested in, but only to a small sample. For example, we may be interested in exam scores for all students in the US. It is not possible to collect the exam scores of all students, so instead we measure a smaller sample of students (e.g. 1,000 students), which is used to represent the larger population of all US students. But it is important that the data sample accurately reflects the overall population. The process of achieving this is called sampling (we will discuss sampling strategies in a later post).

Inferential statistics allow us to draw conclusions about the population from sample data that might not be immediately obvious. Inferential statistics exists because sampling leads to a *sampling error* (which we accounted for with Bessel’s correction), and therefore the sample does not perfectly reflect the population. The methods of inferential statistics are:

- parameter estimation
- hypothesis-testing

We'll learn about these in later posts. That's it for now! 👋

So what exactly constitutes an image? The individual pixels, obviously 😜. Digitally, each pixel is represented by a tuple of three values that represent its color in terms of the primitive colors: red, green, and blue. These pixel tuples are stored in a two-dimensional array with the same dimensions as the image. This makes the dimensions of the image array MxNx3, where M and N are the image dimensions. If an image array were flattened out, this is what it would look like:

3-dimensional arrays, that's all colored images are. Another type of image that we'll encounter in computer vision tasks is grayscale images. Grayscale images take the three color values for each pixel and condense the information into one value; making the image array flat with the dimensions MxN. There are many variants of grayscale each using different proportions of the R, G & B values. The default grayscale variant in OpenCV uses the following formula:

X = 0.299 R + 0.587 G + 0.114 B
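As a quick sketch, the `to_gray` helper below (my own, for illustration) applies these weights to a single pixel:

```python
def to_gray(r, g, b):
    """Condense an RGB pixel to one value using OpenCV's default weights."""
    return 0.299 * r + 0.587 * g + 0.114 * b

print(round(to_gray(255, 255, 255)))  # 255: pure white stays fully bright
print(round(to_gray(0, 255, 0)))      # 150: green contributes the most
```

The weights sum to 1, so the grayscale value always stays in the same 0-255 range as the inputs.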

OpenCV is an open-source computer vision library that is used to perform image analysis, processing, manipulation and much more. It is rather easy to learn, you don't need to be a machine learning expert, you don't even need to be that good at Python to get started.

OpenCV can be easily installed from PyPI using `pip`:

`pip install opencv-python`

If that doesn't work, try `pip3 install opencv-python` or `python -m pip install opencv-python`.

So now that we have installed OpenCV, let's load an image and display it. But before we can do that, we have to import the module. A weird thing to note about the OpenCV module is that it is imported as `cv2` rather than `opencv-python` or `opencv`.

```
import cv2
```

To read an image from local storage we use the `cv2.imread(path, flag)` method. It takes two arguments: the first argument `path` specifies the location of the image, and the optional argument `flag` specifies the format in which we want to load the image. There are three flags:

- `cv2.IMREAD_COLOR` or `1`: Loads the image in the BGR 8-bit format, ignoring the alpha channel (the transparency values).
- `cv2.IMREAD_GRAYSCALE` or `0`: Loads the image as a grayscale image.
- `cv2.IMREAD_UNCHANGED` or `-1`: Loads the image as it is, including the alpha channel.

You can find the list of all file types supported by the `imread()` method here.

Let's try loading our image without any flag and then as a grayscale image.

```
normal_image = cv2.imread('images/ripple.jpg')
greyscale_image = cv2.imread('images/ripple.jpg', 0)
```

This code will execute without errors but we won't get any output. To see what we've just loaded we'll need to use the `imshow(winname, img)` method, which creates a window with the name `winname` that displays the image `img`.

```
cv2.imshow("Normal Image", normal_image)
cv2.imshow("Greyscale Image", greyscale_image)
```

This code will create two windows that display the normal and grayscale versions of the image, but the windows will disappear before we're able to observe them. To keep them alive we'll use two more lines of code.

```
# waits the specified amount of time(in milliseconds) for a key press.
cv2.waitKey(0) # will wait for infinite amount of time if '0' is specified.
cv2.destroyAllWindows() # manually closing all windows
```

Executing these 7 lines of code will create two windows like this:

Based on the primitive color theory of human perception, the RGB color model predates the electronic age. It is an additive color model in which colors are represented in terms of their red, green, and blue components. RGB describes a color as a tuple of three components. Each component can take a value between 0 and 255, where the tuple (0, 0, 0) represents black and (255, 255, 255) represents white.

OpenCV already has the three separate components albeit in the reverse order, BGR.

The reason the early developers at OpenCV chose BGR color format is that back then BGR color format was popular among camera manufacturers and software providers. E.g. in Windows, when specifying color value using COLORREF they use the BGR format 0x00bbggrr.

We can split these components by either slicing the array:

```
B = image[:,:,0]
G = image[:,:,1]
R = image[:,:,2]
```

or by using the `cv2.split()`

method:

```
B, G, R = cv2.split(image)
```

Let's display these components using `imshow()`

```
cv2.imshow("Blue Channel", B)
cv2.imshow("Green Channel", G)
cv2.imshow("Red Channel", R)
```

Huh, what went wrong? Well, nothing. Our image was successfully split into its three components; it's just that it went from being one MxNx3 array to three MxN arrays, and OpenCV treats all 2D arrays as grayscale. To visually see the three color components we'll need to add back the missing two components with values of zero.

```
import numpy as np
# creating an array of zeros with the dimensions of the input image
zeros = np.zeros(image.shape[:2], dtype = 'uint8')
cv2.imshow("Blue Channel", cv2.merge([B, zeros, zeros]))
cv2.imshow("Green Channel", cv2.merge([zeros, G, zeros]))
cv2.imshow("Red Channel", cv2.merge([zeros, zeros, R]))
```

The RGB color model is mainly used for the sensing, representation, and display of images in electronic systems, though it has also been used in conventional photography. But the human eye perceives color and brightness differently than the typical RGB sensors. When twice the number of photons of a particular wavelength hit the sensor of a digital camera, it creates twice the signal, i.e., a linear relationship. This is not how human eyes work, we perceive double the amount of light as only a fraction brighter, i.e., a non-linear relationship. Similarly our eyes are also much more sensitive to changes in darker tones than brighter tones.

The HSV model was created by computer graphic researchers in the 1970s in an attempt to better model how the human visual system perceives color attributes. Most of the color selector tools in multimedia applications make use of this model.

HSV model separates the color information (Hue) from the intensity (Saturation) and lighting (Value). Separating these values allows us to set better threshold values that work regardless of the lighting changes. Even by singling out only the hue values, we are able to obtain a very meaningful representation of the base color that works much better than RGB. The end result is a more robust color thresholding using simpler parameters.

Hue is a continuous representation of color so 0 and 360 are the same hue. Geometrically you can picture the HSV color space as a cylinder or a cone with H being the degree, saturation being the distance from the center, and value being the height.

To convert OpenCV's default BGR arrays into an HSV image we can use the `cv2.cvtColor(img, code)` method, which takes two arguments: the source image array `img` and a `code` that indicates the conversion type. You can find a list of all conversion codes in the OpenCV documentation.

```
hsv_image = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
cv2.imshow("HSV Image", hsv_image)
cv2.imshow("Hue Channel", hsv_image[:, :, 0])
cv2.imshow("Saturation Channel", hsv_image[:, :, 1])
cv2.imshow("Value Channel", hsv_image[:, :, 2])
```

Now, this doesn't make intuitive sense as the RGB components did, but it isn't supposed to. The kind of image manipulation that we just performed happens behind the scenes in our brain. There's all sorts of math happening in there that allows us to do straightforward things like detecting contours, it's just that we're not doing it consciously.

Armed with our newfound rudimentary knowledge of computer vision we now set out to find Nemo. To accomplish this we will make use of the HSV color model and create a "mask". But before we start making our mask, let's understand what it is.

A mask is a very basic filter that sets some of the pixel values in an image to zero, or some other *background* value. Simply put, it allows you to hide some portions of an image and reveal others. We'll be using it as an image segmentation method to separate the clownfish from the background. Beyond segmentation, masking has widespread applications and is used in many types of image processing, including motion detection, edge detection, and noise reduction.
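The idea fits in a few lines of NumPy: a boolean mask picks which pixels survive, and everything else gets the background value. A toy sketch on a tiny grayscale array (the threshold of 100 is arbitrary):

```
import numpy as np

# a tiny 2x3 "image"
img = np.array([[10, 150, 30],
                [200, 40, 250]], dtype=np.uint8)

# keep only pixels brighter than 100, zero out the rest
mask = img > 100
masked = np.where(mask, img, 0)
print(masked)
# [[  0 150   0]
#  [200   0 250]]
```

OpenCV's `inRange()`, which we'll use below, builds exactly this kind of mask, except the keep/discard decision is a per-pixel range check on all three channels.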

First of all, we'll need to load the image:

```
image = cv2.imread("images/fish.jpg")
```

Displaying 3 windows of HD images on one screen isn't feasible, so we'll scale the image down before converting it to HSV. We can manually define the size using the `cv2.resize(image, size_tuple)` method:

```
resized_image = cv2.resize(image, (512, 228))
```

or scale it down using the scaling factors `fx` and `fy`:

```
resized_image = cv2.resize(image, None, fx=.4, fy=.4)
hsv_image = cv2.cvtColor(resized_image, cv2.COLOR_BGR2HSV)
```

To find the range of hue values corresponding to the orange and white colors of the fish we'll need a lot of trial and error. Rather than repeatedly stopping the code to tweak the values, we'll use sliders to vary the range at runtime. To do this we'll create a resizable `namedWindow` and add a bunch of `Trackbar` objects using the `createTrackbar('trackbar_name', 'window_name', initial_value, max_value, on_change)` method. Each `Trackbar` requires an `on_change` callback that is called when the slider is moved, but since we won't need it we'll simply pass a `placeholder()` function that does nothing.

```
def placeholder(x):
    pass

cv2.namedWindow('Trackbars', cv2.WINDOW_NORMAL)
cv2.createTrackbar('min blue', 'Trackbars', 0, 255, placeholder)
cv2.createTrackbar('min green', 'Trackbars', 0, 255, placeholder)
cv2.createTrackbar('min red', 'Trackbars', 0, 255, placeholder)
cv2.createTrackbar('max blue', 'Trackbars', 0, 255, placeholder)
cv2.createTrackbar('max green', 'Trackbars', 0, 255, placeholder)
cv2.createTrackbar('max red', 'Trackbars', 0, 255, placeholder)
```

Now that we have a way to find the range of hues, let's put it to work. To see the real-time effect of the sliders we'll continuously update the mask output window inside a while loop, and break out of it with the `ESC` key, which has the ASCII value 27. We fetch the positions of the sliders from our Trackbar window using the `getTrackbarPos('trackbar_name', 'window_name')` method and pass them to `inRange(hsv_image, lower_bound_tuple, upper_bound_tuple)`. This creates a mask that filters out the pixels whose values don't lie within the defined thresholds. The result is a black-and-white image in which only the pixels within the range are illuminated, i.e., colored white.

We print the final values of the hue range so we can hardcode them for our final masks.

```
cv2.imshow('Base Image', resized_image)
cv2.imshow('HSV Image', hsv_image)

while True:
    # fetching the threshold values from the sliders
    min_blue = cv2.getTrackbarPos('min blue', 'Trackbars')
    min_green = cv2.getTrackbarPos('min green', 'Trackbars')
    min_red = cv2.getTrackbarPos('min red', 'Trackbars')
    max_blue = cv2.getTrackbarPos('max blue', 'Trackbars')
    max_green = cv2.getTrackbarPos('max green', 'Trackbars')
    max_red = cv2.getTrackbarPos('max red', 'Trackbars')
    mask = cv2.inRange(hsv_image, (min_blue, min_green, min_red), (max_blue, max_green, max_red))
    # showing the mask image
    cv2.imshow('Mask Image', mask)
    # checking if the ESC key is pressed to break out of the loop
    key = cv2.waitKey(10)
    if key == 27:
        break

print(f'min blue {min_blue} min green {min_green} min red {min_red}')
print(f'max blue {max_blue} max green {max_green} max red {max_red}')

# destroying all windows
cv2.destroyAllWindows()
```

Similarly, using the sliders we find the threshold values for the white regions as well. We obtain the final mask by combining the two.

```
# orange_min, orange_max, white_min and white_max are the (hue, sat, val)
# tuples we found with the sliders above
mask_orange = cv2.inRange(hsv_image, orange_min, orange_max)
mask_white = cv2.inRange(hsv_image, white_min, white_max)
final_mask = mask_orange + mask_white
```

Using this combined mask gives us both the orange and white regions of the fish.

And if we want to add the color back in, we can do so by masking the resized BGR image with our final mask. The `bitwise_and()` method performs the logical "and" operation wherever the mask is white. Simply put, white + any color = the same color and black + any color = black. If that doesn't make sense, don't fret; we'll look into bitwise operations in the image manipulation post.

```
output = cv2.bitwise_and(resized_image, resized_image, mask=final_mask)
cv2.imshow('Mask Image', output)
```

And there you have it, finding Nemo using basic color masking.
