Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.
The Machine Learning revolution will stay with us for a long time, and so will the future of Machine Learning.
Algorithm:
A Machine Learning algorithm is a set of rules and statistical techniques used to learn patterns from data and draw significant information from it. It is the logic behind a Machine Learning model. An example of a Machine Learning algorithm is the Linear Regression algorithm.
Artificial Intelligence:
Artificial Intelligence is the ability of machines to function like the human brain.
Artificial Intelligence is all about training machines to mimic human behavior, specifically, the human brain and its thinking abilities. Similar to the human brain, AI systems develop the ability to rationalize and perform actions that have the best chance of achieving a specific goal.
Artificial Intelligence focuses on performing 3 cognitive skills just like a human – learning, reasoning, and self-correction.
Association analysis:
Association analysis is the task of finding interesting relationships in large datasets. These interesting relationships can take two forms: frequent item sets or association rules.
Association Rule:
Association rules are "if-then" statements that help to show the probability of relationships between data items, within large data sets in various types of databases. Association rule mining has a number of applications and is widely used to help discover sales correlations in transactional data or in medical data sets.
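To make the "if-then" structure concrete, here is a minimal pure-Python sketch (the transactions and the rule are made up) that computes the support and confidence of one candidate rule:

```python
# Association rule sketch: support and confidence of {bread} -> {milk}
# over toy transaction data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

antecedent, consequent = {"bread"}, {"milk"}
rule_support = support(antecedent | consequent)
confidence = rule_support / support(antecedent)
print(f"support={rule_support:.2f}, confidence={confidence:.2f}")
# support=0.50, confidence=0.67
```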
Binary Classification:
It is a type of classification with two possible outcomes, e.g., either true or false.
Boosting:
Boosting is a sequential process in which each subsequent model attempts to correct the errors of the previous one, so each model depends on the model before it (a minimal example follows the list below). Some of the boosting algorithms are:
AdaBoost (Adaptive Boosting)
GBM
XGBM
LightGBM
CatBoost
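For illustration, here is a minimal boosting sketch using scikit-learn's AdaBoostClassifier on synthetic data (assuming scikit-learn is installed; the dataset and parameters are arbitrary):

```python
# Boosting sketch: 100 weak learners trained sequentially, each
# reweighting the samples its predecessors got wrong.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = AdaBoostClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```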
Bootstrapping:
Bootstrapping is the process of dividing the dataset into multiple subsets, with replacement. Each subset is the same size as the original dataset. These samples are called bootstrap samples.
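A minimal NumPy sketch of bootstrapping (assuming NumPy is available; the data is made up):

```python
# Bootstrapping sketch: each resample has the same size as the original
# dataset and is drawn WITH replacement.
import numpy as np

rng = np.random.default_rng(0)
data = np.array([2.3, 4.1, 1.7, 5.0, 3.6])

for i in range(3):
    # replace=True: the same observation may appear more than once
    sample = rng.choice(data, size=len(data), replace=True)
    print(f"bootstrap sample {i}: {sample}")
```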
Classification:
Classification is a process of categorizing a given set of data into classes. It can be performed on both structured and unstructured data. The process starts with predicting the class of given data points. The classes are often referred to as targets, labels, or categories.
The classification predictive modeling is the task of approximating the mapping function from input variables to discrete output variables. The main goal is to identify which class/category the new data will fall into. An easy to understand example is classifying emails as “spam” or “not spam.”
In machine learning, classification is a supervised learning concept which basically categorizes a set of data into classes. The most common classification problems are – speech recognition, face detection, handwriting recognition, document classification, etc.
Clustering:
Clustering or grouping is the detection of similarities.
Clustering is an unsupervised learning method used to discover the inherent groupings in the data. For example: grouping customers on the basis of their purchasing behaviour, which is then used to segment the customers so that companies can apply the appropriate marketing tactics and generate more profit.
Examples of clustering algorithms: K-Means, hierarchical clustering, etc.
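A minimal clustering sketch with scikit-learn's KMeans (assuming scikit-learn and NumPy are installed; the "customer" data is synthetic):

```python
# K-Means sketch: discover two customer segments from unlabeled data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical features: column 0 = annual spend, column 1 = visits/month
low = rng.normal([50, 2], [10, 1], size=(100, 2))
high = rng.normal([200, 8], [10, 1], size=(100, 2))
customers = np.vstack([low, high])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(customers)
print(kmeans.labels_[:5])       # cluster assignment per customer
print(kmeans.cluster_centers_)  # one centroid per discovered segment
```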
Data Mining:
Data mining is a process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
Data Science:
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains.
Data Transformation:
Data transformation is the process of converting data from one format or structure into another format or structure. It is a fundamental aspect of most data integration and data management tasks such as data wrangling, data warehousing, data integration and application integration.
Dataframe:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. DataFrame accepts many different kinds of input (a short sketch follows the list below):
Dict of 1D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
Structured or record ndarray
A Series
Another DataFrame
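A short construction sketch covering a few of these inputs (assuming pandas and NumPy are installed; the values are arbitrary):

```python
import numpy as np
import pandas as pd

# From a dict of lists (column name -> column values)
df1 = pd.DataFrame({"name": ["Asha", "Ben"], "age": [25, 31]})

# From a 2-D NumPy array, with explicit column labels
df2 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# From a Series (becomes a single-column DataFrame)
df3 = pd.DataFrame(pd.Series([1.5, 2.5], name="score"))

print(df1.dtypes)  # columns may hold different types
```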
Dataset
A dataset (or data set) is a collection of data. A dataset is organized into some type of data structure. In a database, for example, a dataset might contain a collection of business data (names, salaries, contact information, sales figures, and so forth). Several characteristics define a dataset’s structure and properties. These include the number and types of the attributes or variables, and various statistical measures applicable to them, such as standard deviation and kurtosis.
Decision Trees
Decision tree models can be applied to data that contains both numerical and categorical features. Decision trees are good at capturing non-linear interactions between the features and the target variable. Decision trees somewhat match human-level thinking, so they make the data very intuitive to understand.
Deep Learning
Deep Learning is a subset of Machine Learning in which Deep Neural Networks are trained to achieve better accuracy in those cases where classical Machine Learning algorithms do not perform up to the mark.
Dendrogram
A dendrogram is a diagram representing a tree that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering.
Dependent Variable
A dependent variable is the variable you measure, and it is affected by the independent (input) variable(s). It is called dependent because it “depends” on the independent variable. For example, if we want to predict the smoking habits of people, then whether a person smokes (“yes” or “no”) is the dependent variable.
Dimensionality Reduction
Dimensionality Reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It refers to converting a set of data with vast dimensions into data with fewer dimensions while ensuring that it conveys similar information concisely. Some of the benefits of dimensionality reduction:
It helps in compressing the data and reducing the required storage space
It reduces the time required to perform the same computations
It takes care of multicollinearity, which improves model performance, and removes redundant features
Reducing the dimensions of the data to 2D or 3D may allow us to plot and visualize it precisely
It also helps with noise removal, which in turn improves model performance
Dummy Variable
Dummy Variable is another name for a Boolean indicator variable. A dummy variable takes the value 0 or 1 to record whether a condition holds; for example, 1 if age < 25 and 0 if age >= 25.
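A small sketch with pandas.get_dummies (assuming pandas is installed; the column is made up):

```python
import pandas as pd

df = pd.DataFrame({"age_group": ["under_25", "25_or_over", "under_25"]})

# Each category becomes a 0/1 indicator (dummy) column; 1 marks presence.
print(pd.get_dummies(df["age_group"], dtype=int))
#    25_or_over  under_25
# 0           0         1
# 1           1         0
# 2           0         1
```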
Hierarchical Clustering
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. The algorithm starts with every data point assigned to a cluster of its own. The two nearest clusters are then merged into one cluster. The algorithm terminates when only a single cluster is left.
The results of hierarchical clustering can be shown using dendrograms.
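A minimal sketch using SciPy's hierarchical clustering and dendrogram plotting (assuming SciPy, NumPy, and matplotlib are installed; the points are synthetic):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Two well-separated blobs of five 2-D points each
points = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(8, 1, (5, 2))])

# 'ward' merges the pair of clusters with the smallest increase in variance
Z = linkage(points, method="ward")
dendrogram(Z)
plt.show()  # the tree shows which clusters merged, and at what distance
```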
Histogram
A histogram is a graphical representation that organizes a group of data points into user-specified ranges. It is one of the methods for visualizing data distribution of continuous variables.
Histograms are widely used to determine the skewness of the data.
Hyperparameter
A hyperparameter is a parameter whose value is set before training a machine learning or deep learning model. Different models require different hyperparameters and some require none. Hyperparameters should not be confused with the parameters of the model because the parameters are estimated or learned from the data.
Some key points about hyperparameters are:
They are often used in processes to help estimate model parameters.
They are often manually set.
They are often tuned to tweak a model’s performance.
Number of trees in a Random Forest, eta in XGBoost, and k in k-nearest neighbours are some examples of hyperparameters.
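A minimal tuning sketch: since k in k-nearest neighbours must be set before training, a common approach is to search candidate values with cross-validation (assuming scikit-learn is installed; the data and grid are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},  # hyperparameter grid
    cv=5,                                         # 5-fold cross-validation
)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
```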
Hypothesis
A hypothesis is a possible view or assertion of an analyst about the problem they are working on. It may or may not be true.
Labeled Data
A labeled dataset has a meaningful “label”, “class” or “tag” associated with each of its records or rows. For example, labels for a dataset of a set of images might be whether an image contains a cat or a dog.
Labeled data is usually more expensive to obtain than raw unlabeled data because preparing it involves manually labeling every piece of unlabeled data.
Labeled data is required for supervised learning algorithms.
Lasso Regression
LASSO stands for Least Absolute Shrinkage and Selection Operator. Shrinkage is basically defined as a constraint on attributes or parameters.
The algorithm operates by finding and applying a constraint on the model attributes that causes the regression coefficients for some variables to shrink toward zero.
Variables with a regression coefficient of zero are excluded from the model.
So, lasso regression analysis is basically a shrinkage and variable selection method and it helps to determine which of the predictors are most important.
Linear Regression
Linear Regression is an ML algorithm used for supervised learning. Linear regression performs the task of predicting a dependent variable (target) based on the given independent variable(s). The technique finds a linear relationship, a best-fit line, between the dependent variable and the given independent variables; hence the name Linear Regression.
This best-fit line is known as the regression line and is represented by the linear equation Y = aX + b.
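A minimal fitting sketch (assuming NumPy and scikit-learn are installed): generate noisy data from a known line and recover a and b:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * X[:, 0] + 4.0 + rng.normal(0, 1, size=100)  # true a=2.5, b=4.0

model = LinearRegression().fit(X, y)
print("a ~", model.coef_[0])    # slope, close to 2.5
print("b ~", model.intercept_)  # intercept, close to 4.0
```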
Logistic Regression
It is a classification algorithm in machine learning that uses one or more independent variables to determine an outcome. The outcome is measured with a dichotomous variable, meaning it will have only two possible outcomes.
The goal of logistic regression is to find a best-fitting relationship between the dependent variable and a set of independent variables. It is better than other binary classification algorithms like nearest neighbor since it quantitatively explains the factors leading to classification.
It predicts the probability of occurrence of an event by fitting data to a logit (logistic) function; hence it is also known as logit regression. Since it predicts a probability, the output values lie between 0 and 1 (as expected).
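A minimal sketch of the logistic (sigmoid) function itself (assuming NumPy is installed):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A linear score is converted into a probability of the positive class
z = np.array([-4.0, 0.0, 4.0])
print(sigmoid(z))  # approximately [0.018, 0.5, 0.982]
```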
Machine Learning
Machine learning is a subset of Artificial Intelligence (AI) which provides machines the ability to learn automatically and improve from experience without being explicitly programmed to do so. In essence, it is the practice of getting machines to solve problems by gaining the ability to think.
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce framework is usually composed of three operations:
Map: each worker node applies the map function to the local data, and writes the output to a temporary storage. A master node ensures that only one copy of redundant input data is processed.
Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node.
Reduce: worker nodes now process each group of output data, per key, in parallel.
Market Basket Analysis
Market Basket Analysis (also called MBA) is a technique widely used among marketers to identify the best possible combinations of products or services that are frequently bought together by customers. It is also called product association analysis. Association analysis is mostly done using an algorithm named the “Apriori Algorithm”. The outcome of this analysis is a set of association rules, which marketers use to strategize their recommendations.
When two or more products are purchased, Market Basket Analysis checks whether the purchase of one product increases the likelihood of the purchase of other products. This knowledge is a tool for marketers to bundle products or strategize a product cross-sell to a customer.
Mean
The mean is usually referred to as 'the average'. It is the sum of all the values in the data divided by the total number of values. The mean is calculated for numerical variables and is a type of average value that describes where the center of the data is located.
For example, if the numbers are 1, 2, 3, 4, 5, 6, 7, 8, 8, then the mean would be 44/9 ≈ 4.89.
Median
The median of a set of numbers is the middle value. When the set has an even number of values, the median is the average of the two middle values. The median is used to measure central tendency.
To calculate the median for a set of numbers, follow the below steps:
Arrange the numbers in ascending or descending order
Find the middle value: the ((n + 1)/2)-th value if n is odd, or the average of the (n/2)-th and (n/2 + 1)-th values if n is even (where n is the number of values in the set)
MIS
A management information system (MIS) is a computer system consisting of hardware and software that serves as the backbone of an organization’s operations. An MIS gathers data from multiple online systems, analyzes the information, and reports data to aid in management decision-making.
Objectives of MIS:
To improve decision-making, by providing up-to-date, accurate data on a variety of organizational assets
To correlate multiple data points in order to strategize ways to improve operations
ML-as-a-Service (MLaaS)
Machine learning as a service (MLaaS) is an array of services that provide machine learning tools as part of cloud computing services. This can include tools for data visualization, facial recognition, natural language processing, image recognition, predictive analytics, and deep learning. Some of the top ML-as-a-service providers are:
Microsoft Azure Machine Learning Studio
AWS Machine Learning
IBM Watson Machine Learning
Google Cloud Machine Learning Engine
BigML
Mode
Mode is the most frequent value occurring in the population. It is a metric for measuring central tendency, i.e. a way of expressing, in a (usually) single number, important information about a random variable or a population.
Mode can be calculated using following steps:
Count the number of times each value appears
Take the value which appears the most
A distribution of values with only one mode is called unimodal. A distribution of values with two modes is called bimodal. In general, a distribution with more than one mode is called multimodal. Mode can be found for both categorical and numerical data.
Here is a numerical example: 4, 7, 3, 8, 11, 7, 10, 19, 6, 9, 12, 12
Both 7 and 12 appear twice, while the other values appear only once. The modes of this data are 7 and 12.
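These measures can be checked with Python's built-in statistics module, reusing the numbers above (statistics.multimode requires Python 3.8 or later):

```python
import statistics

values = [4, 7, 3, 8, 11, 7, 10, 19, 6, 9, 12, 12]

print(statistics.mean(values))       # 9.0
print(statistics.median(values))     # 8.5 (average of the two middle values)
print(statistics.multimode(values))  # [7, 12] -- the data is bimodal
```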
Model
A model is the main component of Machine Learning. A model is trained by using a Machine Learning Algorithm. An algorithm maps all the decisions that a model is supposed to take based on the given input, in order to get the correct output.
Multi-Class Classification
Classification with more than two classes; in multi-class classification, each sample is assigned to one and only one label or target.
Multi-label Classification
This is a type of classification where each sample is assigned to a set of labels or targets.
Multivariate Regression
Multivariate, as the word suggests, refers to ‘multiple dependent variables’. A regression model designed to deal with multiple dependent variables is called a multivariate regression model.
Consider an example: given a set of details about a student’s interests, previous subject-wise scores, etc., you want to predict the GPA for all the semesters (GPA1, GPA2, …. ). This problem can be addressed using multivariate regression, since we have more than one dependent variable.
Naive Bayes
It is a classification technique based on Bayes’ theorem with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.
Natural Language Processing
Natural Language Processing is a field which aims to make computer systems understand human speech. NLP comprises techniques to process, structure, and categorize raw text and extract information from it.
A chatbot is a classic example of NLP: sentences are first processed, cleaned, and converted into a machine-understandable format.
Neural Networks or Artificial Neural Networks
Neural networks, also known as artificial neural networks (ANNs) or simulated neural networks (SNNs), are a subset of machine learning and are at the heart of deep learning algorithms.
Nominal Variable
Nominal variables are categorical variables having two or more categories without any kind of order to them.
For example, a column called “name of cities” with values such as Delhi, Mumbai, Chennai, etc.
We can see that there is no order among the values: Delhi is in no particular way higher or lower than Mumbai (unless explicitly mentioned).
Normal Distribution
The normal distribution is the most important and most widely used distribution in statistics. It is sometimes called the bell curve because of its characteristic bell shape. A binomial distribution with many trials looks much like a normal distribution; the key difference is that the normal distribution is continuous.
Normalization
Normalization is the process of rescaling your data so that they have the same scale. Normalization is used when the attributes in our data have varying scales.
For example, if you have one variable ranging from 0 to 1 and another from 0 to 1000, you can normalize them so that both lie in the range 0 to 1.
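A minimal min-max normalization sketch (assuming NumPy is installed; the numbers are made up):

```python
import numpy as np

X = np.array([[0.2, 150.0],
              [0.8, 900.0],
              [0.5, 400.0]])  # two columns on very different scales

# Rescale each column to the range [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_norm)  # every column now spans 0 to 1
```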
NoSQL
NoSQL means Not only SQL. A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. It can accommodate a wide variety of data models, including key-value, document, columnar and graph formats.
Types of NoSQL:
Column
Document
Key-Value
Graph
Multi-model
Numpy
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
a powerful N-dimensional array object
sophisticated (broadcasting) functions
tools for integrating C/C++ and Fortran code
useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
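A few of these capabilities in one small sketch (assuming NumPy is installed):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # N-dimensional array: [[0 1 2], [3 4 5]]
b = np.array([10, 20, 30])

print(a + b)                     # broadcasting: b is added to each row
print(a @ b)                     # linear algebra: matrix-vector product
print(np.fft.fft([1, 0, 0, 0]))  # Fourier transform capabilities
```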
Ordinal Variable
Ordinal variables are variables that have discrete values with some order involved; for example, T-shirt sizes (small, medium, large).
Outlier
An outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.
Overfitting
A model is said to overfit when it performs well on the training dataset but fails on the test set. This happens when the model is too sensitive and captures random patterns which are present only in the training dataset. There are two methods to overcome overfitting:
Reduce the model complexity
Regularization
Pandas
Pandas is an open-source, high-performance, easy-to-use data structures and data analysis library for the Python programming language. Some of the highlights of Pandas are:
A fast and efficient DataFrame object for data manipulation with integrated indexing.
Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format.
Flexible reshaping and pivoting of data sets.
Columns can be inserted and deleted from data structures for size mutability.
High performance merging and joining of data sets.
Parameters
Parameters are a set of measurable factors that define a system. For machine learning models, model parameters are internal variables whose values can be determined from the data.
For instance, the weights in linear and logistic regression fall under the category of parameters.
Pattern Recognition
Pattern recognition is a branch of machine learning that focuses on the recognition of patterns and regularities in data. Classification is an example of pattern recognition wherein each input value is assigned one of a given set of classes.
In computer vision, supervised pattern recognition techniques are used for optical character recognition (OCR), face detection, face recognition, object detection, and object classification.
Polynomial Regression
In statistics, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial in x. For this reason, polynomial regression is considered to be a special case of multiple linear regression.
Pre-trained Model
A pre-trained model is a model created by someone else to solve a similar problem. Instead of building a model from scratch, you use the model trained on another problem as a starting point.
For example, if you want to build a self-driving car, you can spend years building a decent image recognition algorithm from scratch, or you can take the Inception model (a pre-trained model) from Google, which was built on ImageNet data, to identify images.
Predictor Variable
A predictor variable is a feature of the data that is used to predict the output, i.e. the dependent variable(s).
Principal Component Analysis (PCA)
Principal component analysis (PCA) is an approach to factor analysis that considers the total variance in the data, and transforms the original variables into a smaller set of linear combinations. PCA is sensitive to outliers; they should be removed.
It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
PCA is mostly used as a tool in exploratory data analysis and for making predictive models. It’s often used to visualize genetic distance and relatedness between populations.
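A minimal PCA sketch (assuming scikit-learn and NumPy are installed; the correlated data is synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
x = rng.normal(size=200)
# Two strongly correlated variables
X = np.column_stack([x, 2 * x + rng.normal(0, 0.1, size=200)])

pca = PCA(n_components=2).fit(X)
# Almost all the variance lies along the first principal component
print(pca.explained_variance_ratio_)
```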
PyTorch
PyTorch is an open-source machine learning library for Python, based on Torch. It is built to provide flexibility as a deep learning development platform. Here are a few reasons PyTorch is extensively used:
Easy to use API
Python support
Dynamic computation graphs
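A tiny sketch of tensors and the dynamic computation graph (assuming PyTorch is installed):

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()  # the graph for y is built on the fly, as code runs

y.backward()        # autograd walks the recorded graph backwards
print(x.grad)       # dy/dx = 2x -> tensor([4., 6.])
```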
Random Forest Regressor
Random Forests are an ensemble (combination) of decision trees. Random Forest is a supervised learning algorithm used for both classification and regression. The input data is passed through multiple decision trees: the algorithm constructs a number of decision trees at training time and outputs the class that is the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees.
Regression
Regression is the prediction of a numeric value. Most people have probably seen an example of regression with a best-fit line drawn through some data points to generalize the data points.
It is a supervised learning method where the output variable is a real value, such as “amount” or “weight”.
Examples of regression algorithms: Linear Regression, Ridge Regression, Lasso Regression
Regularization
Regularization is a technique used to solve the overfitting problem in statistical models. In machine learning, regularization penalizes the coefficients so that the model generalizes better. Different regression techniques use regularization, such as Ridge regression and Lasso regression.
Reinforcement Learning
It is an example of machine learning where the machine is trained to take specific decisions based on the business requirement, with the sole motive of maximizing efficiency (performance). The idea involved in reinforcement learning is this: the machine or software agent trains itself on a continual basis based on the environment it is exposed to, and applies its enriched knowledge to solve business problems. This continual learning process ensures less involvement of human expertise, which in turn saves a lot of time!
Important Note: There is a subtle difference between Supervised Learning and Reinforcement Learning (RL). RL essentially involves learning by interacting with an environment: an RL agent learns from its past experience through a continual trial-and-error learning process, whereas in supervised learning an external supervisor provides labeled examples.
A good example to understand the difference is self-driving cars. Self-driving cars use reinforcement learning to make decisions continuously: which route to take and what speed to drive at are some of the questions that are decided after interacting with the environment. A simple manifestation of supervised learning, by contrast, would be to predict the total fare of a cab at the end of a journey.
Reinforcement Learning is a part of Machine Learning where an agent is put in an environment and learns to behave in that environment by performing certain actions and observing the rewards it gets from those actions.
This type of Machine Learning works quite differently. Imagine that you were dropped off on an isolated island! What would you do?
Panic? Yes, of course, initially we all would. But as time passes, you will learn how to live on the island. You will explore the environment, understand the climate conditions, the type of food that grows there, the dangers of the island, etc. This is exactly how Reinforcement Learning works: it involves an Agent (you, stuck on the island) that is put in an unknown environment (the island), where it must learn by observing and performing actions that result in rewards.
Reinforcement Learning is mainly used in advanced Machine Learning areas such as self-driving cars, AlphaGo, etc.
Response Variable
Response variable (or dependent variable) is that variable whose variation depends on other variables.
It is the feature or the output variable that needs to be predicted by using the predictor variable(s).
Ridge Regression
Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where independent variables are highly correlated. It has uses in fields including econometrics, chemistry, and engineering.
Root Mean Squared Error (RMSE)
RMSE is a measure of the differences between values predicted by a model or an estimator and the values actually observed. It is the standard deviation of the residuals. Residuals are a measure of how far from the regression line data points are.
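A minimal RMSE computation (assuming NumPy is installed; the values are hypothetical):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 11.0])

residuals = y_true - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))  # root of the mean squared residual
print(rmse)  # ~0.61
```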
Rotational Invariance
In mathematics, a function defined on an inner product space is said to have rotational invariance if its value does not change when arbitrary rotations are applied to its argument.
Semi-Supervised Learning
Semi-supervised learning sits between supervised and unsupervised learning: the model is trained on a small amount of labeled data together with a large amount of unlabeled data.
Standard Deviation
Standard deviation signifies how dispersed the data is. It is the square root of the variance of the underlying data. Standard deviation is calculated for a population.
Standard error
A standard error is the standard deviation of the sampling distribution of a statistic. The standard error is a statistical term that measures the accuracy with which a sample represents a population. In statistics, a sample mean deviates from the actual mean of the population; this deviation is known as the standard error.
Standardization
Standardization (or Z-score normalization) is the process where the features are rescaled so that they’ll have the properties of a standard normal distribution with μ=0 and σ=1, where μ is the mean (average) and σ is the standard deviation from the mean.
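A minimal z-score sketch (assuming NumPy is installed; the values are made up):

```python
import numpy as np

x = np.array([10.0, 12.0, 9.0, 15.0, 14.0])
z = (x - x.mean()) / x.std()  # subtract the mean, divide by the std

print(round(z.mean(), 10), z.std())  # ~0.0 and 1.0
```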
Stochastic Gradient Descent
Stochastic Gradient Descent is a type of gradient descent algorithm where we take a sample of data while computing the gradient. The update to the coefficients is performed for each training instance, rather than at the end of the batch of instances.
Learning can be much faster with stochastic gradient descent for very large training datasets, and often only a small number of passes through the dataset is needed to reach a good, or good enough, set of coefficients.
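A bare-bones SGD sketch for fitting y = a*x + b, updating the coefficients after every training instance (assuming NumPy is installed; the data, learning rate, and epoch count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 5, 200)
y = 3.0 * x + 1.0 + rng.normal(0, 0.5, 200)  # true a=3.0, b=1.0

a, b, lr = 0.0, 0.0, 0.01
for epoch in range(20):       # a few passes are often enough
    for xi, yi in zip(x, y):
        error = (a * xi + b) - yi
        a -= lr * error * xi  # gradient step for the slope
        b -= lr * error       # gradient step for the intercept

print(f"a ~ {a:.2f}, b ~ {b:.2f}")  # close to 3.0 and 1.0
```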
Supervised Learning
Supervised learning is a technique in which we teach or train the machine using data which is well labeled.
To understand Supervised Learning let’s consider an analogy. As kids we all needed guidance to solve math problems. Our teachers helped us understand what addition is and how it is done. Similarly, you can think of supervised learning as a type of Machine Learning that involves a guide. The labeled data set is the teacher that will train you to understand patterns in the data. The labeled data set is nothing but the training data set.
Support Vector Machine (SVM)
The support vector machine (SVM) is a classifier that represents the training data as points in space, separated into categories by a gap as wide as possible. New points are then mapped into the same space, and their category is predicted based on which side of the gap they fall.
Testing Data
After the model is trained, it must be tested to evaluate how accurately it can predict an outcome. This is done by the testing data set.
Training Data
The Machine Learning model is built using the training data. The training data helps the model to identify key trends and patterns essential to predict the output.
Unsupervised Learning
Unsupervised learning involves training by using unlabeled data and allowing the model to act on that information without guidance.
Think of unsupervised learning as a smart kid that learns without any guidance. In this type of Machine Learning, the model is not fed with labeled data; the model has no clue that ‘this image is Tom and this is Jerry’, and it figures out the patterns and the differences between Tom and Jerry on its own by taking in tons of data.
Variance
Variance is used to measure the spread of a given set of numbers and is calculated as the average of the squared distances from the mean.
Let’s take an example, suppose the set of numbers we have is (600, 470, 170, 430, 300)
To Calculate:
1) Find the mean of the set of numbers, which is (600 + 470 + 170 + 430 + 300) / 5 = 394
2) Subtract the mean from each value, which gives (206, 76, -224, 36, -94)
3) Square each deviation from the mean, which gives (42436, 5776, 50176, 1296, 8836)
4) Find the sum of squares, which is 108520
5) Divide by the total number of items (5), which gives the variance: 21704
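The same computation with NumPy (assuming it is installed) confirms the worked example:

```python
import numpy as np

values = np.array([600, 470, 170, 430, 300])
print(values.mean())   # 394.0
print(np.var(values))  # 21704.0 (population variance: divide by n)
```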