
Interview Questions & Model Answers

Real questions. Real answers. Built from 20 years of actual hiring and being hired.


Showing 5 questions · Intermediate · Machine Learning

ML-INT-001 How does a Random Forest work and why is it better than a single decision tree?
Machine Learning AI/ML Intermediate
5/10
Answer

A Random Forest builds many decision trees on random subsets of the data and features, then aggregates their predictions. It is better than a single tree because averaging many weakly correlated trees reduces variance with little increase in bias.

Deep Explanation

A single decision tree is prone to overfitting: it can grow arbitrarily complex and memorize the training data. Random Forest addresses this with two sources of randomness: bagging (each tree trains on a bootstrap sample, i.e. a random sample of the training data drawn with replacement) and feature randomness (at each split, only a random subset of features is considered). Together these mechanisms decorrelate the trees. Aggregating many decorrelated, slightly overfit trees through voting (classification) or averaging (regression) substantially reduces variance. Random Forests also provide feature importance scores by measuring how much each feature reduces impurity across all trees.
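
The two randomness sources and the aggregation step can be sketched in a few lines of plain Python. This is a minimal illustration, not a full Random Forest: no trees are trained, and the function names, the 1000-row / 9-feature sizes, and the choice of 3 features per split (roughly the square root of 9, a common heuristic for classification) are assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at one split."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Aggregate per-tree class predictions into a single label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
n_rows, n_features = 1000, 9  # hypothetical dataset size

# Bagging: each tree gets its own bootstrap sample of the rows.
in_bag = set(bootstrap_indices(n_rows, rng))
oob_fraction = 1 - len(in_bag) / n_rows  # share of rows this tree never saw

# Feature randomness: each split only looks at a random subset of features.
split_features = random_feature_subset(n_features, k=3, rng=rng)

# Aggregation: classification forests vote, regression forests average.
forest_label = majority_vote(["default", "repay", "default"])
```

Roughly a third of the rows are left out of each bootstrap sample; those out-of-bag rows are what the OOB error estimate is computed on.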

Real-World Example

At a financial institution, a Random Forest model for loan default prediction consistently outperformed single decision trees by 8-12% AUC across quarterly retraining cycles. The feature importance scores also helped explain decisions to regulators, making the model preferable to black-box alternatives.

⚠ Common Mistakes

Assuming more trees always help: returns diminish beyond a certain point (typically 100-500 trees). Not tuning max_depth and min_samples_split, which lets individual trees overfit. Ignoring class imbalance when using Random Forest for classification. Using Random Forest for very high-dimensional sparse data, where gradient boosting typically performs better.

🏭 Production Scenario

A production fraud detection model using a single deep decision tree had to be retrained daily due to instability: small changes in the training data caused large swings in predictions. Switching to a Random Forest made predictions stable across daily retraining, significantly reducing manual monitoring overhead.

Follow-up Questions
What is the difference between Random Forest and Gradient Boosting? How does feature importance work in Random Forest? What is out-of-bag (OOB) error?
ID: ML-INT-001  ·  Difficulty: 5/10  ·  Level: Intermediate
ML-INT-002 What is the difference between precision, recall, and F1 score, and when does each matter?
Machine Learning AI/ML Intermediate
5/10
Answer

Precision is the fraction of positive predictions that are actually positive. Recall is the fraction of actual positives that were correctly identified. F1 is their harmonic mean. Which matters depends on the cost of each type of error.

Deep Explanation

Precision = TP / (TP + FP). High precision means that when you predict positive you are usually right (few false alarms). Recall = TP / (TP + FN). High recall means you catch most actual positives (few misses). There is a precision-recall tradeoff: raising the classification threshold generally raises precision but lowers recall. F1 score = 2 * (precision * recall) / (precision + recall) balances both. Choose based on business cost. In spam detection, low precision (legitimate emails marked as spam) is worse than low recall (some spam gets through), so optimize precision. In cancer screening, low recall (missing cancers) is catastrophic, so optimize recall. In fraud detection, both matter, and the balance depends on the churn cost of false alarms versus the fraud loss from misses.
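
The formulas and the threshold tradeoff can be checked on a toy example. The sketch below uses made-up labels and scores (not from any real model), and returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Apply the three formulas; 0.0 is returned where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical data: 4 positives, 6 negatives, with model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.2, 0.1]

low = precision_recall_f1(*confusion_counts(y_true, scores, 0.35))
high = precision_recall_f1(*confusion_counts(y_true, scores, 0.75))
# low threshold:  precision 4/6, recall 1.0, F1 0.8
# high threshold: precision 1.0, recall 0.5, F1 2/3
```

Moving the threshold from 0.35 to 0.75 trades recall for precision on the same scores, which is why the threshold should follow from the cost of each error type.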

Real-World Example

A medical imaging AI for tumor detection: recall is paramount, because missing a tumor (false negative) is far worse than a false alarm (false positive) that leads to an additional test. The model was tuned to 98% recall at 60% precision, flagging many non-tumors for human review rather than risking misses.

⚠ Common Mistakes

Using accuracy as the primary metric for imbalanced datasets — 99% accuracy on a dataset where 99% of examples are negative tells you nothing useful. Not understanding that F1 is undefined when both precision and recall are zero. Optimizing the wrong metric because the business cost of each error type was not clearly defined.
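
The accuracy trap is easy to reproduce. A minimal sketch with a hypothetical dataset of 1000 examples, 1% positive, and a "model" that always predicts negative:

```python
# Hypothetical imbalanced dataset: 10 positives, 990 negatives.
labels = [1] * 10 + [0] * 990
# A degenerate classifier that never predicts the positive class.
predictions = [0] * len(labels)

correct = sum(t == p for t, p in zip(labels, predictions))
accuracy = correct / len(labels)

true_positives = sum(t == 1 and p == 1 for t, p in zip(labels, predictions))
recall = true_positives / sum(labels)
# accuracy is 0.99 while recall is 0.0: every positive is missed.
```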

🏭 Production Scenario

A production spam filter optimized for F1 score was generating too many false positives (legitimate business emails marked as spam). The client measured success by user complaints about missed emails, not by spam caught. Reframing the task as a precision optimization problem and raising the threshold resolved the operational issue.

Follow-up Questions
What is the ROC curve and AUC? When do you use macro vs micro vs weighted averaging for multiclass metrics? What is the Matthews Correlation Coefficient?
ID: ML-INT-002  ·  Difficulty: 5/10  ·  Level: Intermediate
ML-INT-005 What is a word embedding, and how do Word2Vec and similar models create semantic representations?
Machine Learning AI/ML Intermediate
5/10
Answer

Word embeddings are dense numerical vectors representing words where semantically similar words have similar vectors. Word2Vec trains a neural network to predict surrounding words (skip-gram) or predict a word from its context (CBOW) — the learned weights become the word vectors.

Deep Explanation

Traditional NLP represented words as one-hot vectors (10,000-dimensional for a 10,000-word vocabulary, with a single 1 and all other entries 0). These are high-dimensional, sparse, and encode no semantic relationships: 'king' and 'queen' are just as different as 'king' and 'banana'. Word2Vec trains a shallow neural network on a large text corpus to either predict context words from a center word (skip-gram) or predict the center word from context words (CBOW). The weights learned for the hidden layer become the word vectors (typically 100-300 dimensions). The resulting vectors capture semantic relationships: king - man + woman ≈ queen. Similar words cluster together in vector space. GloVe (Global Vectors) is an alternative approach built on word co-occurrence statistics. Modern LLMs use contextual embeddings (the same word gets different vectors in different contexts), which are more powerful but require more compute.
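
The contrast between one-hot and dense vectors, and the king - man + woman arithmetic, can be illustrated with toy numbers. The 3-dimensional vectors below are hand-made for the example (not trained by Word2Vec), and the helper names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["king", "queen", "man", "woman", "banana"]

# One-hot: every pair of distinct words is orthogonal (similarity 0).
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hand-made dense vectors; the axes loosely mean royalty, maleness, fruit.
dense = {
    "king":   [0.9, 0.9, 0.0],
    "queen":  [0.9, 0.1, 0.0],
    "man":    [0.1, 0.9, 0.0],
    "woman":  [0.1, 0.1, 0.0],
    "banana": [0.0, 0.0, 1.0],
}

def nearest(target, exclude):
    """Word whose dense vector is most cosine-similar to target."""
    candidates = [w for w in dense if w not in exclude]
    return max(candidates, key=lambda w: cosine(dense[w], target))

# king - man + woman, computed component by component.
analogy = [k - m + w for k, m, w in zip(dense["king"], dense["man"], dense["woman"])]
result = nearest(analogy, exclude={"king", "man", "woman"})
```

With these toy vectors, `result` is 'queen', while the one-hot similarity of 'king' to 'queen' and of 'king' to 'banana' is 0 in both cases.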

Real-World Example

In a product recommendation system at an e-commerce company, Word2Vec was trained on product purchase sequences (treating each purchase as a 'word' and each customer's purchase history as a 'sentence'). Products frequently bought together got similar embeddings. Recommendation became a nearest-neighbor search in embedding space, which was fast and semantically meaningful.

⚠ Common Mistakes

Confusing static word embeddings (Word2Vec, GloVe: one vector per word) with contextual embeddings (BERT, GPT: context-dependent vectors). Not handling out-of-vocabulary words in production (Word2Vec has no representation for words outside the training vocabulary; use subword models like FastText). Not normalizing embeddings before comparing them with a dot product, so vector length distorts the similarity ranking.
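
On normalization: cosine similarity divides by vector length, while a raw dot product does not, so the two can rank neighbors differently unless the vectors are unit length. A small sketch with hypothetical 2-dimensional vectors:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

query = [1.0, 1.0]
same_direction_short = [0.5, 0.5]      # points the same way as the query
different_direction_long = [4.0, 0.0]  # 45 degrees off, but much longer

# Raw dot product rewards length: the long, off-direction vector wins.
raw_short = dot(query, same_direction_short)
raw_long = dot(query, different_direction_long)

# After L2 normalization the dot product equals cosine similarity,
# and the vector with the same direction wins.
unit_short = dot(l2_normalize(query), l2_normalize(same_direction_short))
unit_long = dot(l2_normalize(query), l2_normalize(different_direction_long))
```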

🏭 Production Scenario

A job matching platform trained Word2Vec on job descriptions and resumes, treating skills as vocabulary. The model learned that 'React', 'ReactJS', and 'React.js' map to nearby vectors even though they are different strings. This enabled matching across skill name variations that exact string matching would miss completely.

Follow-up Questions
What is the difference between Word2Vec and BERT embeddings? What is FastText and how does it handle out-of-vocabulary words? What is the curse of dimensionality and how does it affect embedding space?
ID: ML-INT-005  ·  Difficulty: 5/10  ·  Level: Intermediate
ML-INT-004 What is gradient boosting and how does it differ from Random Forest?
Machine Learning AI/ML Intermediate
6/10
Answer

Gradient boosting builds trees sequentially, each correcting the errors of the previous ones. Random Forest builds trees independently, so they can be trained in parallel. Gradient boosting typically achieves higher accuracy but is slower to train and more prone to overfitting if not carefully tuned.

Deep Explanation

Gradient boosting is an ensemble method that builds trees one at a time, with each new tree trained on the residual errors of the combined previous trees (more precisely, the negative gradient of the loss function, which equals the residuals for squared error). The final prediction is a weighted sum of all tree predictions. Because each tree is small (a weak learner) and trained on residuals, the ensemble improves gradually. Key implementations: XGBoost (adds regularization, column subsampling, parallel tree construction), LightGBM (leaf-wise growth instead of depth-wise, extremely fast), CatBoost (native categorical feature handling, symmetric trees). Random Forest: trees are independent and can be built in any order; each sees a bootstrap sample and random feature subsets. Gradient boosting: trees are sequential; each sees all the data and focuses on the hardest examples.
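
The sequential residual-fitting loop can be sketched from scratch. This is a minimal illustration under stated assumptions (squared-error loss, one numeric feature, depth-1 trees, a toy target y = x^2), not how XGBoost, LightGBM, or CatBoost are implemented; all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 tree: the single threshold that minimizes squared error."""
    best = None
    # Skip the largest x so the right side of the split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_gradient_boosting(xs, ys, n_trees, learning_rate):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    trees = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]  # toy regression target

mse_0 = mse(fit_gradient_boosting(xs, ys, 0, 0.3), xs, ys)   # mean only
mse_1 = mse(fit_gradient_boosting(xs, ys, 1, 0.3), xs, ys)
mse_50 = mse(fit_gradient_boosting(xs, ys, 50, 0.3), xs, ys)
```

Training error falls as trees are added (`mse_0 > mse_1 > mse_50`). That is the training set only, which is why validation-based stopping matters in practice.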

Real-World Example

Kaggle competitions on tabular data are dominated by gradient boosting (XGBoost, LightGBM). Industry production uses include credit scoring (LightGBM), click-through rate prediction (XGBoost at scale), and fraud detection. When accuracy is critical and training time is not the primary constraint, gradient boosting often outperforms Random Forest on structured data.

⚠ Common Mistakes

Not tuning learning_rate and n_estimators together (a lower learning rate requires more trees). Ignoring early stopping: without it, gradient boosting keeps adding trees after validation error has stopped improving and eventually overfits. Not tuning max_depth (trees should usually be shallow, around 3-7), since deep trees cause overfitting. Using gradient boosting for non-tabular data (images, text), where neural networks are usually more appropriate.
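
Early stopping itself is a small piece of logic: track the best validation loss and stop once it has not improved for a set number of rounds. The sketch below is library-independent; the `patience` parameter name and the loss values are assumptions for illustration.

```python
def best_iteration(validation_losses, patience):
    """Return the 1-based round with the lowest validation loss seen
    before `patience` consecutive rounds fail to improve on it."""
    best_loss = float("inf")
    best_round = 0
    for round_number, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            best_loss, best_round = loss, round_number
        elif round_number - best_round >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_round

# Hypothetical per-round validation losses from a boosting run.
losses = [0.60, 0.48, 0.41, 0.38, 0.37, 0.39, 0.40, 0.42, 0.36]
stop_at = best_iteration(losses, patience=3)
# stop_at is 5: rounds 6-8 did not beat 0.37, so round 9 is never evaluated.
```

The last value in `losses` shows the cost of a small patience: a later improvement can be missed. Larger patience values trade extra training time for a lower chance of stopping too soon.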

🏭 Production Scenario

A price optimization model for an airline used Random Forest and achieved 0.79 AUC. Switching to LightGBM with tuned hyperparameters (learning_rate=0.05, 2000 trees with early stopping) improved AUC to 0.86, translating to measurable revenue improvement in A/B testing.

Follow-up Questions
What is the difference between XGBoost and LightGBM? What is early stopping in gradient boosting? What is the difference between gradient boosting and AdaBoost?
ID: ML-INT-004  ·  Difficulty: 6/10  ·  Level: Intermediate
ML-INT-003 What is feature leakage and why is it one of the most dangerous mistakes in production ML?
Machine Learning AI/ML Intermediate
7/10
Answer

Feature leakage (data leakage) occurs when information from the future or from the target variable is included in the training features, producing artificially high offline metrics that fail to generalize to production.

Deep Explanation

Leakage occurs when a feature contains information the model would not have access to at prediction time. Types: target leakage (the feature is derived from or correlated with the target in a way that is not available before the outcome), train-test contamination (preprocessing statistics such as mean imputation computed on the full dataset, including the test set), temporal leakage (future data used to predict past events, common in time-series feature engineering), and identifier leakage (customer ID correlated with the target due to historical accident). Leakage is insidious because it makes models look extraordinarily good in development: 99% AUC that collapses to 55% in production.
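
Train-test contamination is the easiest type to show in code. In this minimal sketch (hypothetical values, hand-rolled mean imputation rather than any particular library), the leaky fill value is computed from test rows and the clean one from training rows only.

```python
def mean_of_present(values):
    """Mean of the non-missing entries (None marks a missing value)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

def impute(values, fill):
    """Replace missing entries with a precomputed fill value."""
    return [fill if v is None else v for v in values]

train = [10.0, 12.0, None, 14.0]
test = [100.0, None]

# Leaky: the statistic is computed on train + test, so the test value
# 100.0 influences how training rows are imputed.
leaky_fill = mean_of_present(train + test)

# Clean: fit the statistic on the training set only, then reuse it
# unchanged to transform the test set.
clean_fill = mean_of_present(train)
train_imputed = impute(train, clean_fill)
test_imputed = impute(test, clean_fill)
```

Here `leaky_fill` is 34.0 and `clean_fill` is 12.0. The same fit-on-train, transform-on-test pattern applies to scalers and encoders.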

Real-World Example

A fraud detection model achieved 0.98 AUC during development. In production it performed at chance level. Investigation revealed one feature: 'transaction_reversal_count' — a field that gets updated AFTER a fraud case is confirmed. It was perfectly predictive because it contained the outcome itself. Removing it and rebuilding took three months.

⚠ Common Mistakes

Using data from after the prediction timestamp in feature engineering for time-series models. Fitting preprocessing (scalers, imputers, encoders) on the entire dataset, including the test set; fit on the training set only, then transform the test set. Joining tables using keys that correlate with the target for non-obvious reasons. Not doing a temporal sanity check on feature availability before deployment.
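
A temporal sanity check can be as simple as comparing when each feature value becomes available against the prediction timestamp. The sketch below assumes such availability timestamps can be obtained; the function name, the dates, and the two non-leaky feature names are hypothetical, while `transaction_reversal_count` is the leaky field from the example above.

```python
from datetime import datetime

def features_unavailable_at(prediction_time, feature_available_at):
    """Names of features whose value is only known after prediction time."""
    return sorted(
        name
        for name, available_at in feature_available_at.items()
        if available_at > prediction_time
    )

# Hypothetical moment at which the model must score a transaction.
prediction_time = datetime(2024, 3, 1, 12, 0)

# Hypothetical record of when each feature value is actually populated.
feature_available_at = {
    "transaction_amount": datetime(2024, 3, 1, 12, 0),
    "merchant_category": datetime(2024, 3, 1, 12, 0),
    # Updated only after a fraud case has been confirmed.
    "transaction_reversal_count": datetime(2024, 3, 15, 9, 0),
}

leaky = features_unavailable_at(prediction_time, feature_available_at)
# leaky == ["transaction_reversal_count"]
```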

🏭 Production Scenario

A hospital readmission risk model showed 91% AUC in validation and 58% AUC in production. The post-mortem identified that discharge diagnosis codes — which are finalized after the readmission determination — had been included as features. They were highly predictive because they were effectively recorded after the outcome was known.

Follow-up Questions
How do you systematically detect feature leakage? What is a temporal cross-validation strategy? How do feature stores help prevent training-serving skew?
ID: ML-INT-003  ·  Difficulty: 7/10  ·  Level: Intermediate