CrackingMachineLearningInterview
A repository to prepare you for machine learning interviews, covering many of the questions asked by tech giants and local companies. Use it to ace your Machine Learning Engineer interviews.
A practical interview preparation repository for Machine Learning Engineer, AI Engineer, Data Scientist, Deep Learning Engineer, Data Engineer, and DevOps or platform-focused roles.
This README now serves three purposes:
- It keeps the original core ML interview questions.
- It adds a more organized 2026 interview-prep layer focused on modern ML engineering topics such as LLMs, RAG, evaluation, agents, safety, and production systems.
- It acts as the main entry point for related tracks including AI/GenAI, data engineering, and DevOps.
Who this repository is for
- Machine Learning Engineer
- Data Scientist
- Deep Learning Engineer
- AI Engineer
- Software Engineer working on AI/ML products
- Data Engineer
- MLOps Engineer
- DevOps / Platform Engineer
How to use this repository
- Start with the 2026 Interview Roadmap if you are preparing for current AI/ML interviews.
- Use 2026 Additional Questions and Answers for modern interview rounds.
- Use the AI / GenAI, Data Engineering, and DevOps sections for specialized interview tracks.
- Use the Classic Question Bank for core ML, statistics, deep learning, and algorithms.
- Use Preparation Resources and References to build a targeted study plan.
Quick Navigation
- 2026 Interview Roadmap
- 2026 Additional Questions and Answers
- AI / GenAI Track
- Data Engineering Track
- DevOps Track
- Preparation Resources and References
- Study Pattern
- Classic Question Bank
- Contributions
About
- GitHub Profile: Shafaypro ©
- Repository: CrackingMachineLearningInterview
Image References
- Image references are included for educational purposes. Please see the repository references for attribution where applicable.
Sharing
Feel free to share the repository link in your blog, study notes, or interview preparation material.
Repository Structure
- docs/2026-interview-roadmap.md: current interview focus areas for ML Engineer and AI Engineer roles.
- docs/2026-additional-questions.md: modern 2026 question bank covering LLMs, RAG, evaluation, agents, and production AI.
- docs/resources-and-references.md: books, references, and additional interview topics.
- docs/study-pattern.md: recommended preparation topics and study structure.
- ai_genai/: GenAI and LLM engineering topics.
- data_engineering/: data engineering interview topics and platform concepts.
- devops/: DevOps, infrastructure, and deployment topics.
- README.md: repository landing page plus the original classic ML interview question bank.
AI / GenAI Track
Use this track for AI Engineer, GenAI Engineer, LLM Engineer, Applied AI, and agent-platform interviews.
Core topics:
Data Engineering Track
Use this track for pipeline, ETL, orchestration, warehouse, lakehouse, and streaming interviews.
Core topics:
- Apache Spark
- Apache Kafka
- Apache Airflow
- dbt Introduction
- dbt Interview Guide
- Apache Iceberg
- Delta Lake
- DuckDB
- OpenClaw
DevOps Track
Use this track for infrastructure, CI/CD, containers, orchestration, and IaC interviews.
Core topics:
Classic Question Bank
Difference between Supervised and Unsupervised Learning?
In supervised learning, every training example comes with a known outcome (a label); in unsupervised learning, no labels are provided and
the algorithm has to find structure in the data on its own. Fully labeled means that each example in the training dataset is tagged with the
answer the algorithm should learn to produce. For example, a labeled dataset of flower images tells the model which photos show roses, daisies,
or daffodils. When shown a new image, the model uses the patterns it learned from the training examples to predict the correct label.
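As a rough illustration of the contrast, the sketch below runs a 1-nearest-neighbour classifier on labeled points (supervised) and a tiny 2-means grouping on the same points with the labels removed (unsupervised). The feature values, the helper names `predict_nearest` and `two_means`, and the naive centre initialisation are all assumptions made up for this example.

```python
# Supervised: every training example carries a label (toy, invented values).
labeled = [((1.0, 1.2), "rose"), ((0.9, 1.0), "rose"),
           ((4.0, 4.2), "daisy"), ((4.1, 3.9), "daisy")]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_nearest(point, examples):
    """1-nearest-neighbour: return the label of the closest training example."""
    _, label = min(examples, key=lambda ex: sq_dist(point, ex[0]))
    return label

def two_means(points, iterations=10):
    """Unsupervised: split unlabeled points into two groups (tiny k-means, k=2)."""
    centers = [points[0], points[-1]]  # naive initialisation (assumption)
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            idx = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[idx].append(p)
        # Move each centre to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return groups

unlabeled = [features for features, _ in labeled]
print(predict_nearest((1.1, 1.1), labeled))  # uses the labels
print(two_means(unlabeled))                  # never sees a label
```

The supervised function can name the class of a new point because it was given names to begin with; the unsupervised one can only report that the points fall into two groups.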

What is Reinforcement Learning and how would you define it?
Reinforcement learning differs from supervised learning in that labelled input/output pairs need not be presented and sub-optimal actions need
not be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current
knowledge). Reinforcement learning is not the same as semi-supervised learning: at each learning step the model (agent) takes an action and
receives a reward or a penalty (positive or negative points), and it adjusts its behaviour to maximize the cumulative reward.
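The exploration/exploitation balance can be sketched with an epsilon-greedy multi-armed bandit, one of the simplest reinforcement learning settings. This is only an illustrative sketch: the function name `epsilon_greedy_bandit`, the reward means, the noise level, and the value of `epsilon` are assumptions chosen for the example.

```python
import random

def epsilon_greedy_bandit(true_means, steps=2000, epsilon=0.1, noise=0.5, seed=0):
    """Estimate the value of each arm while balancing exploration and exploitation."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms    # how often each arm was pulled
    values = [0.0] * n_arms  # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = rng.gauss(true_means[arm], noise)             # noisy feedback
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
    return counts, values

# Hypothetical environment: three actions with different average rewards.
counts, values = epsilon_greedy_bandit([0.0, 1.0, 5.0])
print(counts)
print([round(v, 2) for v in values])
```

Nobody tells the agent which arm is correct; it only sees rewards, and the occasional random pull is what lets it discover the best arm.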
What is Deep Learning?
Deep learning is a family of algorithms built on artificial neural networks (ANNs) with many layers, inspired by the structure and function of
the brain. Stacking layers with non-linear activations lets these models learn non-linear relationships, so deep learning is recommended for
non-linear problems in Artificial Intelligence.
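To make the non-linearity point concrete, here is a minimal sketch of a two-layer network that computes XOR, a function a single linear unit cannot represent. The weights are hand-picked for illustration rather than learned, and `xor_net` is a hypothetical name.

```python
def relu(z):
    """Rectified linear unit, a common non-linear activation."""
    return max(0.0, z)

def xor_net(x1, x2):
    """Two hidden ReLU units followed by a linear output unit (hand-set weights)."""
    h1 = relu(x1 + x2)        # positive when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # positive only when both inputs are on
    return h1 - 2.0 * h2      # linear combination of the hidden units

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```

Without the `relu` calls the whole network collapses into a single linear function of `x1 + x2`, which cannot output 0 for both (0, 0) and (1, 1) while outputting 1 in between.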
Difference between Machine Learning and Deep Learning?
Since DL is a subset of ML and both being subset of AI.While basic machine learning models do become progressively better at whatever their
function is, they still need some guidance. If an AI algorithm returns an inaccurate prediction, then an engineer has to step in and make
adjustments. With a deep learning model, an algorithm can determine on its own if a prediction is accurate or not through its own neural network.

Difference between Semi-Supervised and Reinforcement Learning?
Difference between Bias and Variance?
Bias is the error introduced by the oversimplifying assumptions a model makes; a high-bias model underfits.
Variance is the model's sensitivity to the particular training set: a high-variance model learns the noise as well as the signal, so it overfits.
There is always a tradeoff between the two, so it is recommended to find a balance between them and to use cross-validation to
determine the best fit.
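One way to see the tradeoff is to cross-validate a k-nearest-neighbour regressor for several values of k: a small k follows the noise (high variance), while a very large k averages almost everything (high bias). The sketch below assumes synthetic sine data with Gaussian noise; `knn_predict` and `cross_val_mse` are hypothetical helper names.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def cross_val_mse(data, k_neighbors, folds=5):
    """Mean squared error of k-NN regression, averaged over held-out folds."""
    fold_errors = []
    for f in range(folds):
        test = data[f::folds]  # every folds-th point is held out
        train = [pair for i, pair in enumerate(data) if i % folds != f]
        errors = [(knn_predict(train, x, k_neighbors) - y) ** 2 for x, y in test]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / folds

# Synthetic data (assumption): a sine curve plus Gaussian noise.
rng = random.Random(0)
data = [(x / 10, math.sin(x / 10) + rng.gauss(0, 0.3)) for x in range(60)]

for k in (1, 5, 40):
    print(k, round(cross_val_mse(data, k), 3))
```

The value of k with the lowest cross-validated error is the candidate for the best balance on this dataset; the exact numbers depend on the noise and the seed.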
What is Linear Regression? How does it work?
Linear regression fits a line through the dataset, when plotted on a plane, so that the line describes the relationship between the dependent
variable and the independent variable(s). In the simplest case it uses the slope-intercept formula, famously written as $f(x) = Mx + b$.
Where M represents the slope (weight)
b represents the bias (intercept)
x represents the input variable (the independent one)
f(x) represents Y, which is the dependent variable (outcome).
How it works: given a data set of n statistical units, a linear regression model assumes that the relationship between the
dependent variable y and the p-vector of regressors x is linear. This relationship is modeled through a disturbance term or error variable ε, an
unobserved random variable that adds "noise" to the linear relationship between the dependent variable and regressors. Thus the model takes the
form $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$
This also implies: $y_i = x_i^{T} \beta + \varepsilon_i$
Where T : denotes transpose
$x_i$ : denotes the input of the i-th record as a vector (with a leading 1 for the intercept)
$\beta$ : denotes the coefficient vector, which includes the bias $\beta_0$
$\varepsilon_i$ : denotes the error term of the i-th record.
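For the single-variable case, the slope M and the bias b can be estimated in closed form with ordinary least squares. The following is a minimal sketch; `fit_line` is a hypothetical helper and the data points are invented so that they lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the fitted line passes through the means
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1, no noise
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With noisy data the same formulas return the line that minimizes the sum of squared vertical distances to the points.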
Use cases of other regression models:
- Poisson regression for count data.
- Logistic regression and probit regression for binary data.
- Multinomial logistic regression and multinomial probit regression for categorical data.
- Ordered logit and ordered probit regression for ordinal data.
What is Logistic Regression? How does it work?
Logistic regression is a statistical technique used to predict the probability of a binary response based on one or more independent variables.
That is, given certain factors, logistic regression predicts an outcome that has two values, such as 0 or 1, pass or fail,
yes or no, etc.
Logistic regression is used when the dependent variable (target) is categorical.
For example,
To predict whether an email is spam (1) or not (0)
Whether the tumor is malignant (1) or not (0)
Whether the transaction is fraud (1) or not (0)
The prediction is based on the probabilities of the specified classes.
It builds a linear combination of the inputs like linear regression, but passes it through the sigmoid (logistic) function, the inverse of the logit, to squash the value into the range between 0 and 1 and obtain a probability.
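A minimal sketch of this idea for one input variable: a linear score `w * x + b` is passed through the sigmoid to get a probability, and the parameters are fitted by gradient descent on the log loss. The pass/fail data, the learning rate, and the helper name `fit_logistic` are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: map a probability back to a real-valued score."""
    return math.log(p / (1.0 - p))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data (invented): hours studied and whether the exam was passed.
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(hours, passed)
print(sigmoid(w * 1.0 + b), sigmoid(w * 4.0 + b))
```

A prediction above 0.5 is read as class 1 and below 0.5 as class 0; the threshold can be moved when false positives and false negatives have different costs.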
What is the Logit Function or Sigmoid Function? Where in ML and DL can you use it?
The sigmoid might be useful if you want to transform a real valued variable into something that represents a probability. While the Logit fun
