MLQuestions
Machine Learning and Computer Vision Engineer - Technical Interview Questions
A collection of technical interview questions for machine learning and computer vision engineering positions.
Recently added: Natural Language Processing (NLP) Interview Questions 2026
Preparation Resources
- ML Engineer Interview Course
- Mock ML Interview: Get ready for your next interview by practicing with ML engineers from top tech companies and startups.
- All of Statistics: A Concise Course in Statistical Inference by Larry Wasserman
- Machine Learning by Tom Mitchell
- Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications by Chip Huyen
Questions
1) What's the trade-off between bias and variance? [src]
If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is going to have high variance and low bias. So we need to find the right balance, neither overfitting nor underfitting the data. [src]
2) What is gradient descent? [src]
Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function f that minimize a cost function.
Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.
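A minimal sketch of the update rule, minimizing the toy function f(w) = (w - 3)^2; the learning rate and step count here are illustrative choices, not recommendations:

```python
# Minimal gradient descent sketch: repeatedly step opposite the gradient.
# Here we minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
def gradient_descent(grad, w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)  # move against the gradient direction
    return w

w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
# w_min converges toward 3, the minimizer of f
```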
3) Explain over- and under-fitting and how to combat them? [src]
ML/DL models essentially learn a relationship between their given inputs (called training features) and objective outputs (called labels). Regardless of the quality of the learned relation (function), its performance on a test set (a collection of data different from the training input) is subject to investigation.
Most ML/DL models have trainable parameters that are learned to build that input-output relationship. Based on the number of parameters each model has, models range from more flexible (more parameters) to less flexible (fewer parameters).
The problem of underfitting arises when the flexibility of a model (its number of parameters) is not adequate to capture the underlying pattern in a training dataset. Overfitting, on the other hand, arises when the model is too flexible for the underlying pattern. In the latter case it is said that the model has "memorized" the training data.
An example of underfitting is estimating a second-order polynomial (quadratic function) with a first-order polynomial (a simple line). Similarly, estimating a line with a 10th-order polynomial would be an example of overfitting.
4) How do you combat the curse of dimensionality? [src]
- Feature Selection(manual or via statistical methods)
- Principal Component Analysis (PCA)
- Multidimensional Scaling
- Locally linear embedding
[src]
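The phenomenon these methods fight can be seen directly: in high dimensions, pairwise distances between random points concentrate, so "nearest" and "farthest" neighbors become hard to distinguish. A small illustrative sketch (the point counts and dimensions are arbitrary choices):

```python
import numpy as np

# Distance concentration: measure the relative contrast between the
# farthest and nearest neighbor of a reference point. It shrinks as
# the dimension grows.
rng = np.random.default_rng(0)

def relative_contrast(dim, n=200):
    pts = rng.random((n, dim))                       # uniform points in [0,1]^dim
    d = np.linalg.norm(pts - pts[0], axis=1)[1:]     # distances from point 0
    return (d.max() - d.min()) / d.min()

low_dim, high_dim = relative_contrast(2), relative_contrast(1000)
# high_dim is far smaller than low_dim: distances have concentrated
```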
5) What is regularization, why do we use it, and give some examples of common methods? [src]
A technique that discourages learning a more complex or flexible model, so as to avoid the risk of overfitting. Examples:
- Ridge (L2 norm)
- Lasso (L1 norm)
The obvious disadvantage of ridge regression is model interpretability: it will shrink the coefficients of the least important predictors very close to zero, but it will never make them exactly zero. In other words, the final model will include all predictors. In the case of the lasso, however, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Therefore, the lasso method also performs variable selection and is said to yield sparse models. [src]
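The ridge shrinkage behavior can be sketched with its closed-form solution w = (XᵀX + λI)⁻¹Xᵀy; the data and λ values below are illustrative (lasso has no closed form and would need an iterative solver such as coordinate descent):

```python
import numpy as np

# Ridge regression closed form on synthetic data: larger lambda shrinks
# all coefficients toward zero, but none become exactly zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([3.0, 0.0, 0.0, 1.5, 0.0])         # sparse true weights
y = X @ w_true + rng.normal(0, 0.1, 50)

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_small_lam = ridge(X, y, 0.1)
w_large_lam = ridge(X, y, 1000.0)
# w_large_lam has much smaller norm, yet every entry stays nonzero
```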
6) Explain Principal Component Analysis (PCA)? [src]
Principal Component Analysis (PCA) is a dimensionality reduction technique used in machine learning to reduce the number of features in a dataset while retaining as much information as possible. It works by identifying the directions (principal components) in which the data varies the most, and projecting the data onto a lower-dimensional subspace along these directions.
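A compact sketch of PCA via the SVD of the centered data matrix; the synthetic nearly one-dimensional dataset is an illustrative assumption:

```python
import numpy as np

# PCA sketch: center the data, take the top-k right singular vectors as
# principal directions, and project onto them.
def pca(X, k):
    Xc = X - X.mean(axis=0)                          # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                              # top-k principal directions
    return Xc @ components.T, components

rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2 * t]) + rng.normal(0, 0.01, (100, 2))  # ~1-D data in 2-D
Z, comps = pca(X, 1)
# comps[0] aligns (up to sign) with the true direction (1, 2) / sqrt(5)
```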
7) Why is ReLU better and more often used than Sigmoid in Neural Networks? [src]
- Computational efficiency: since ReLU is a simple threshold, the forward and backward passes are faster.
- Reduced likelihood of vanishing gradients: the gradient of ReLU is 1 for positive values and 0 for negative values, while the sigmoid activation saturates quickly (gradients close to 0) for inputs slightly above or below zero, leading to vanishing gradients.
- Sparsity: sparsity happens when the input of ReLU is negative. This means fewer neurons are firing (sparse activation) and the network is lighter.
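The saturation point can be checked numerically; the sample inputs are arbitrary:

```python
import numpy as np

# Compare gradients: sigmoid's gradient peaks at 0.25 and collapses
# toward 0 for large |x|, while ReLU's gradient stays 1 for any
# positive input.
def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1 - s)

def relu_grad(x):
    return (x > 0).astype(float)

x = np.array([-10.0, -1.0, 0.5, 10.0])
sg, rg = sigmoid_grad(x), relu_grad(x)
# sg at x = 10 is ~4.5e-5 (saturated); rg at x = 10 is exactly 1
```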
8) Given stride S and kernel sizes for each layer of a (1-dimensional) CNN, create a function to compute the receptive field of a particular node in the network. This is just finding how many input nodes actually connect through to a neuron in a CNN. [src]
The receptive field of a node is the region of the input that can influence its value. For a single convolutional layer with kernel size k, each output node sees exactly k inputs. For a stack of layers the receptive field grows with depth: layer l with kernel size k_l adds (k_l - 1) * j_{l-1} inputs, where j_{l-1} is the cumulative stride (the "jump" between adjacent output positions) of all preceding layers. Starting from r_0 = 1 and j_0 = 1, iterate r_l = r_{l-1} + (k_l - 1) * j_{l-1} and j_l = j_{l-1} * s_l over the layers; r after the last layer is the receptive field of a node in that layer.
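The recursion can be written directly as the requested function; the layer configuration in the example is an arbitrary illustration:

```python
# Receptive field of a node in the last layer of a 1-D CNN, given
# (kernel_size, stride) per layer from input to output, using the
# recursion r += (k - 1) * jump; jump *= s.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, input to output."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump   # new inputs reached through this layer
        jump *= s             # cumulative stride between output positions
    return r

# Two stride-1 convs of size 3, then a stride-2 conv, then another size-3 conv:
rf = receptive_field([(3, 1), (3, 1), (3, 2), (3, 1)])
# rf == 11
```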
9) Implement connected components on an image/matrix. [src]
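One possible implementation, a BFS flood fill over a binary grid with 4-connectivity (the grid representation and labeling scheme are illustrative choices):

```python
from collections import deque

# Label the 4-connected components of nonzero cells in a binary grid.
# Returns the component count and a grid of labels (0 = background).
def connected_components(grid):
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    current = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not labels[r][c]:
                current += 1                     # start a new component
                q = deque([(r, c)])
                labels[r][c] = current
                while q:                         # BFS flood fill
                    i, j = q.popleft()
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if (0 <= ni < rows and 0 <= nj < cols
                                and grid[ni][nj] and not labels[ni][nj]):
                            labels[ni][nj] = current
                            q.append((ni, nj))
    return current, labels

n_comp, lab = connected_components([[1, 1, 0],
                                    [0, 0, 1],
                                    [1, 0, 1]])
# n_comp == 3: {(0,0),(0,1)}, {(1,2),(2,2)}, {(2,0)}
```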
10) Implement a sparse matrix class in C++. [src]
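The question asks for C++ (where a natural choice is `std::unordered_map` keyed on `(row, col)`); the core dictionary-of-keys storage idea can be sketched in Python:

```python
# Dictionary-of-keys (DOK) sparse matrix sketch: only nonzero entries
# are stored, keyed by their (row, col) position.
class SparseMatrix:
    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.data = {}                       # (row, col) -> nonzero value

    def set(self, r, c, value):
        if value == 0:
            self.data.pop((r, c), None)      # keep the map free of zeros
        else:
            self.data[(r, c)] = value

    def get(self, r, c):
        return self.data.get((r, c), 0)      # absent entries read as 0

    def nnz(self):
        return len(self.data)                # number of stored nonzeros

m = SparseMatrix(1000, 1000)
m.set(3, 7, 2.5)
```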
11) Create a function to compute an integral image, and create another function to get area sums from the integral image.[src]
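A sketch of both functions using cumulative sums (the test image is an arbitrary 4x4 ramp):

```python
import numpy as np

# Integral image: ii[r, c] is the sum of all pixels at or above-left of
# (r, c). Any rectangular sum then costs four lookups.
def integral_image(img):
    img = np.asarray(img, dtype=np.int64)
    return np.cumsum(np.cumsum(img, axis=0), axis=1)

def area_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] via the inclusion-exclusion corners."""
    total = ii[r1, c1]
    if r0 > 0:
        total -= ii[r0 - 1, c1]              # strip above the rectangle
    if c0 > 0:
        total -= ii[r1, c0 - 1]              # strip left of the rectangle
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]          # corner subtracted twice
    return total

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
# area_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum() == 30
```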
12) How would you remove outliers when trying to estimate a flat plane from noisy samples? [src]
Random sample consensus (RANSAC) is an iterative method to estimate parameters of a mathematical model from a set of observed data that contains outliers, when outliers are to be accorded no influence on the values of the estimates. [src]
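A minimal RANSAC sketch for plane fitting; the threshold, iteration count, and synthetic data are illustrative assumptions rather than tuned values:

```python
import numpy as np

# RANSAC for a plane: repeatedly fit a plane to 3 random points, count
# points within a distance threshold (inliers), keep the best consensus.
def ransac_plane(points, threshold=0.05, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-12:
            continue                          # degenerate (collinear) sample
        normal /= norm
        dist = np.abs((points - sample[0]) @ normal)  # point-to-plane distance
        inliers = dist < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

rng = np.random.default_rng(1)
plane_pts = np.column_stack([rng.uniform(-1, 1, (90, 2)), np.zeros(90)])
plane_pts += rng.normal(0, 0.01, plane_pts.shape)   # noisy z ~ 0 plane
outliers = rng.uniform(-1, 1, (10, 3))
inliers = ransac_plane(np.vstack([plane_pts, outliers]))
# the consensus set recovers (nearly) all 90 planar points
```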
13) How does CBIR work? [src]
Content-based image retrieval (CBIR) searches for images by analyzing the visual content of the image itself (e.g. colors, textures, shapes, or learned features) rather than relying on metadata such as keywords, tags, or descriptions associated with the image.
