Interview Questions& Model Answers

Real questions. Real answers. Built from 20 years of actual hiring and being hired.

1,774

Total Questions

Technologies

Levels

Showing 5 questions · Advanced · Machine Learning

Clear all filters

ML-ADV-004 What is the difference between batch gradient descent stochastic gradient descent and mini-batch gradient descent? ▾

Machine Learning AI/ML Advanced

7/10

Answer

Batch GD computes gradients on the entire dataset — slow but stable. Stochastic GD (SGD) computes gradients on one example — fast but noisy. Mini-batch GD computes on a subset (typically 32-256 examples) — balancing speed and stability. Mini-batch is the standard for deep learning.

Deep Explanation

Batch gradient descent: compute loss and gradients across all training examples then update weights. Advantage: stable convergence guaranteed direction toward minimum. Disadvantage: extremely slow for large datasets (must process all data before updating) cannot fit large datasets in memory. SGD: compute gradient on one random example update weights immediately. Advantage: fast updates can escape local minima due to noise. Disadvantage: noisy updates cause loss to oscillate even near minimum hard to parallelize. Mini-batch: compromise — compute gradient on a random subset (batch size). Advantages: vectorized computation uses GPU parallelism efficiently noise helps escape local minima more stable than pure SGD. Batch size is a key hyperparameter: smaller batches (16-32) more noise better generalization larger batches (512-2048) more stable faster wall-clock time but may generalize worse (sharp vs flat minima research). Modern optimizers (Adam AdaGrad RMSprop) adapt learning rate per parameter addressing many SGD limitations.

Real-World Example

Training GPT-scale models: batch sizes of 2048-8192 tokens are used across hundreds of GPUs. The batch is distributed across GPUs (data parallelism) with gradients averaged across GPUs before weight updates. Learning rate warmup (gradual increase from 0) is used because large batch sizes are sensitive to initial learning rate choice.

⚠ Common Mistakes

Using batch size 1 (pure SGD) on modern GPU hardware — wastes parallelism. Not adjusting learning rate when changing batch size (linear scaling rule: if you double batch size double learning rate). Using a constant learning rate when training benefits from decay (use cosine annealing or linear decay). Not shuffling training data before each epoch causing the model to see data in the same order repeatedly.

🏭 Production Scenario

A production deep learning model was trained with batch size 4 because the researcher was worried about memory. Training took 72 hours. Using gradient accumulation (accumulate gradients over 32 steps before updating) achieved effectively batch size 128 without exceeding memory limits reducing training time to 18 hours with better final performance.

Follow-up Questions

What is the learning rate warmup and why is it used? What is gradient accumulation and when do you use it? What is the difference between Adam and AdamW??

ID: ML-ADV-004 · Difficulty: 7/10 · Level: Advanced

ML-ADV-001 What is the vanishing gradient problem and how do modern architectures solve it? ▾

Machine Learning AI/ML Advanced

8/10

Answer

During backpropagation in deep networks gradients shrink exponentially as they propagate backward through many layers making early layers learn very slowly or not at all. Solutions include ReLU activations batch normalization residual connections and careful weight initialization.

Deep Explanation

In backpropagation gradients are computed by multiplying partial derivatives through each layer using the chain rule. If activation functions have derivatives less than 1 (sigmoid outputs derivatives between 0 and 0.25) multiplying many such small values causes exponential decay — a 20-layer network might have gradients 10^-10 times smaller at layer 1 than layer 20. Solutions evolved over time: ReLU activation (derivative is 1 for positive inputs 0 otherwise — no saturation in positive region). Batch normalization normalizes layer inputs keeping activations in a healthy range. Residual connections (ResNet) add shortcuts that allow gradients to flow directly backward without passing through activation functions. Careful initialization (He initialization for ReLU Xavier for tanh) sets initial weights so activations neither explode nor vanish from the first forward pass.

Real-World Example

ResNet (Residual Network) solved the degradation problem where very deep networks (100+ layers) performed worse than shallower ones despite having more parameters. The residual connections allowed training networks with 1000+ layers that would have been completely untrainable with standard architectures.

⚠ Common Mistakes

Using sigmoid or tanh activations in very deep networks without understanding their gradient saturation behavior. Not using batch normalization in deep CNNs. Thinking the vanishing gradient problem only affects RNNs — it was originally identified in feedforward networks and RNNs face an even more severe version.

🏭 Production Scenario

A production time-series forecasting LSTM model for financial data was not learning beyond the first few timesteps. Diagnosis showed vanishing gradients preventing the model from learning long-range dependencies. Switching to a Transformer architecture with attention mechanisms and positional encoding resolved the long-range dependency problem entirely.

Follow-up Questions

What is the exploding gradient problem and how is gradient clipping used? How do Transformers avoid the vanishing gradient problem? What is the difference between He and Xavier initialization??

ID: ML-ADV-001 · Difficulty: 8/10 · Level: Advanced

ML-ADV-002 What is attention mechanism and why did it replace RNNs for sequence modeling? ▾

Machine Learning AI/ML Advanced

8/10

Answer

Attention allows a model to directly reference any position in the input sequence when processing each output token regardless of distance. RNNs process sequentially and lose information about distant tokens. Attention solved this and enabled parallelization of training.

Deep Explanation

RNNs process sequences step by step maintaining a hidden state that compresses all previous context. This creates two problems: vanishing gradients (difficulty learning long-range dependencies) and sequential computation (cannot be parallelized — step N requires step N-1). Attention solves both. For each output position attention computes a weighted sum of all input positions — the weights (attention scores) are learned and indicate relevance. Self-attention attends to all positions in the same sequence. Multi-head attention runs multiple attention computations in parallel each learning different types of relationships (syntax semantics coreference). The Transformer architecture (2017) used only attention (no recurrence) enabling full parallelization of training which allowed training on massive datasets that were impractical for RNNs.

Real-World Example

Translation quality: an RNN translating a 100-word sentence compresses the entire source into a fixed-size vector losing detail about early tokens. An attention-based model when generating each target word directly attends to the most relevant source words — when translating 'bank' in a financial context it attends to financial terms in the source to disambiguate meaning.

⚠ Common Mistakes

Confusing self-attention with cross-attention (cross-attention attends between two different sequences as in encoder-decoder translation). Thinking attention has O(n) complexity — it is O(n2) in sequence length which is why very long sequences are computationally expensive and why efficient attention variants (Flash Attention sparse attention) were developed.

🏭 Production Scenario

A document classification system for a legal tech company was using an LSTM that performed poorly on contracts longer than 1000 words — important clauses near the beginning were forgotten by the time the model reached the end. Switching to a transformer-based model (BERT fine-tuning) that could attend to any position simultaneously improved accuracy by 18%.

Follow-up Questions

What is Flash Attention and why is it more efficient? What is positional encoding and why does the Transformer need it? How does multi-head attention differ from single-head attention??

ID: ML-ADV-002 · Difficulty: 8/10 · Level: Advanced

ML-MLO-001 What is model drift and how do you detect and handle it in production ML systems? ▾

Machine Learning AI/ML Advanced

8/10

Answer

Model drift is the degradation of model performance over time as the real-world data distribution changes after deployment. Detect with monitoring (input distribution prediction distribution and ground truth metrics). Handle with automated retraining triggers shadow deployments and champion-challenger frameworks.

Deep Explanation

There are two types of drift: data drift (input feature distributions change — customer demographics shift new product categories appear) and concept drift (the relationship between inputs and outputs changes — what predicts churn changes as customer behavior evolves). Detecting data drift: monitor statistical properties of input features using tests like KS test Population Stability Index (PSI) or Jensen-Shannon divergence. Detecting concept drift: monitor prediction distribution shifts and when labels are available track accuracy/AUC over time. PSI > 0.2 typically signals significant drift requiring investigation. Handling drift: trigger model retraining when drift metrics exceed thresholds use sliding window retraining on recent data implement champion-challenger deployment to safely test retrained models and maintain feature stores that can be queried at training and serving time to ensure consistency.

Real-World Example

A credit scoring model deployed in January showed 0.81 AUC. By September AUC had dropped to 0.71. PSI analysis of input features revealed significant drift in employment status and income features — COVID-19 had fundamentally changed the distribution. Emergency retraining on recent data restored AUC to 0.79.

⚠ Common Mistakes

Not monitoring model performance after deployment — treating deployment as the end of the ML lifecycle. Retraining on all historical data including outdated periods instead of using a recent sliding window. Not having rollback capability when a retrained model performs worse than the current champion. Ignoring the feedback loop where model predictions affect future training data.

🏭 Production Scenario

A fraud detection model at a payment processor declined from 89% recall to 74% recall over 6 months as fraudsters adapted their behavior patterns. Monthly retraining on recent fraud cases and implementing a fast-response challenger model that retrained weekly restored recall to 86% while reducing false positives.

Follow-up Questions

What is Population Stability Index and how is it calculated? What is a feature store and why does it matter for training-serving skew? How do you implement a champion-challenger deployment??

ID: ML-MLO-001 · Difficulty: 8/10 · Level: Advanced

ML-ADV-003 How do you design an ML feature store and why does it matter for production ML? ▾

Machine Learning AI/ML Advanced

9/10

Answer

A feature store is a centralized repository for ML features that solves the training-serving skew problem by ensuring features computed at training time are computed identically at serving time. It also enables feature reuse across teams and models.

Deep Explanation

Training-serving skew is one of the most common and damaging production ML problems: features computed during training using the full historical dataset are computed differently at serving time using real-time data leading to performance degradation. A feature store has two components: offline store (historical feature values for training — typically a data warehouse like BigQuery or Redshift) and online store (latest feature values for low-latency serving — typically Redis or DynamoDB). Feature pipelines write to both stores ensuring identical computation logic. Feature engineering logic is defined once and shared — a 'user_30_day_purchase_total' feature computed for a recommendation model can be reused by a fraud model without re-implementation. Modern feature stores (Feast Tecton Hopsworks) also handle: feature versioning (audit trail) feature sharing across teams and point-in-time correct feature lookup (critical for preventing temporal data leakage in training).

Real-World Example

At a major e-commerce company the customer lifetime value model the recommendation model and the fraud model all needed 'user_purchase_frequency_last_30_days'. Before the feature store each team computed it differently with subtle differences (timezone handling business day vs calendar day) producing inconsistent results. The feature store defined one authoritative computation shared across all three models.

⚠ Common Mistakes

Implementing offline-only features (fast to build but creates training-serving skew when serving). Computing features in the model serving code itself (no reuse performance overhead). Not handling point-in-time correctness in the offline store (using features from after the label timestamp in training data — a form of feature leakage). Building a feature store before having more than 2-3 models (premature optimization).

🏭 Production Scenario

A churn prediction model performing at 0.84 AUC in offline evaluation dropped to 0.71 AUC in production. Investigation revealed that customer engagement features were computed using UTC timestamps in training but local time in the serving API — a seemingly minor difference that caused dramatic feature value shifts for users in non-UTC timezones. Centralizing feature computation in a feature store with explicit timezone handling fixed the skew.

Follow-up Questions

What is point-in-time correct feature lookup and why does it prevent data leakage? What is the difference between Feast Tecton and Hopsworks? How do you handle feature freshness requirements for real-time models??

ID: ML-ADV-003 · Difficulty: 9/10 · Level: Advanced