Skip to main content
Home  /  Knowledge Hub  /  Interview Questions

Interview Questions& Model Answers

Real questions. Real answers. Built from 20 years of actual hiring and being hired.

54
Total Questions
3
Technologies
3
Levels

Showing 54 questions

AI-MLO-001 How do you evaluate the quality of an LLM-powered application in production?
AI Integration AI Integration Advanced
8/10
Answer

LLM application quality requires a multi-layered evaluation strategy: offline evals (automated benchmarks using LLM-as-judge) online monitoring (latency cost error rates) and human evaluation for quality calibration. There is no single metric — you need task-specific criteria.

Deep Explanation

Evaluation layers: automated offline evals (run test cases through the system compare outputs against reference answers using another LLM as judge — e.g. GPT-4 scoring responses on accuracy relevance groundedness and format compliance) human evaluation (sample of outputs reviewed by domain experts to calibrate the LLM judge and catch systematic failures) production monitoring (latency per-call cost API error rates user feedback signals like thumbs up/down) and A/B testing (compare system versions on real user traffic). RAGAS framework evaluates RAG systems specifically: faithfulness (is the answer grounded in retrieved context?) answer relevancy (does the answer address the question?) context recall and context precision. For agents: task completion rate steps per completion tool error rate and cost per successful task completion.

Real-World Example

At a legal document AI company: automated evals used a curated set of 500 document-question pairs with reference answers GPT-4 as judge scored faithfulness and accuracy monthly human review by paralegals calibrated the automated judge real-time dashboards showed per-endpoint latency and cost and a thumbs-down button collected user feedback that triggered human review for systematic issues.

⚠ Common Mistakes

Using only automated LLM-as-judge evaluation without human calibration — the judge model has its own biases and blind spots. Not evaluating on adversarial cases (edge cases failure modes). Measuring only technical metrics (latency cost) and not quality metrics. Not separating evaluation of the retrieval step from the generation step in RAG systems.

🏭 Production Scenario

A customer service AI showed consistently positive automated evaluation scores but had a growing volume of user complaints. The disconnect was because the LLM judge was evaluating response quality in isolation while users were frustrated by the system's failure to resolve their issues (task completion rate was not measured). Adding task completion as a primary metric revealed the real problem.

Follow-up Questions
What is LLM-as-judge and what are its limitations? What is RAGAS and how do you set it up? How do you A/B test prompt changes safely in production??
ID: AI-MLO-001  ·  Difficulty: 8/10  ·  Level: Advanced
ML-ADV-002 What is attention mechanism and why did it replace RNNs for sequence modeling?
Machine Learning AI/ML Advanced
8/10
Answer

Attention allows a model to directly reference any position in the input sequence when processing each output token regardless of distance. RNNs process sequentially and lose information about distant tokens. Attention solved this and enabled parallelization of training.

Deep Explanation

RNNs process sequences step by step maintaining a hidden state that compresses all previous context. This creates two problems: vanishing gradients (difficulty learning long-range dependencies) and sequential computation (cannot be parallelized — step N requires step N-1). Attention solves both. For each output position attention computes a weighted sum of all input positions — the weights (attention scores) are learned and indicate relevance. Self-attention attends to all positions in the same sequence. Multi-head attention runs multiple attention computations in parallel each learning different types of relationships (syntax semantics coreference). The Transformer architecture (2017) used only attention (no recurrence) enabling full parallelization of training which allowed training on massive datasets that were impractical for RNNs.

Real-World Example

Translation quality: an RNN translating a 100-word sentence compresses the entire source into a fixed-size vector losing detail about early tokens. An attention-based model when generating each target word directly attends to the most relevant source words — when translating 'bank' in a financial context it attends to financial terms in the source to disambiguate meaning.

⚠ Common Mistakes

Confusing self-attention with cross-attention (cross-attention attends between two different sequences as in encoder-decoder translation). Thinking attention has O(n) complexity — it is O(n2) in sequence length which is why very long sequences are computationally expensive and why efficient attention variants (Flash Attention sparse attention) were developed.

🏭 Production Scenario

A document classification system for a legal tech company was using an LSTM that performed poorly on contracts longer than 1000 words — important clauses near the beginning were forgotten by the time the model reached the end. Switching to a transformer-based model (BERT fine-tuning) that could attend to any position simultaneously improved accuracy by 18%.

Follow-up Questions
What is Flash Attention and why is it more efficient? What is positional encoding and why does the Transformer need it? How does multi-head attention differ from single-head attention??
ID: ML-ADV-002  ·  Difficulty: 8/10  ·  Level: Advanced
PY-ADV-002 How does Python’s memory management and garbage collection work?
Python Performance Advanced
8/10
Answer

Python uses reference counting as the primary memory management mechanism supplemented by a cyclic garbage collector to handle reference cycles. Memory is allocated from private heaps managed by the Python memory manager.

Deep Explanation

Every Python object has a reference count. When you assign a variable or pass an object to a function the count increases. When a reference goes out of scope or is deleted the count decreases. When the count reaches zero memory is freed immediately. The problem is reference cycles: object A references B B references A — neither count reaches zero. Python's gc module handles this with a generational garbage collector that periodically identifies and clears cycles. Objects are sorted into three generations based on survival — most objects die young (generation 0) so the GC focuses there. You can trigger collection manually with gc.collect() and disable it in performance-critical code if you are certain there are no cycles.

Real-World Example

A long-running FastAPI service was growing in memory over days. Profiling with tracemalloc revealed a reference cycle in a caching layer where cached response objects held references back to the cache container. Explicitly breaking the cycle with weakref.ref() eliminated the memory growth.

⚠ Common Mistakes

Assuming memory is freed immediately after del (del only removes the reference the GC frees memory). Creating reference cycles in data structures without using weakref. Disabling the GC for performance without understanding the cycle risk. Not using __slots__ in high-volume object creation wasting memory on per-instance __dict__.

🏭 Production Scenario

A Python-based IoT data collector crashed with OOM after running for several days. Memory profiling showed 50000 DataPoint objects that should have been freed were kept alive by a reference cycle between DataPoint and its parent DataStream. Using weakref.ref for the back-reference fixed the leak.

Follow-up Questions
What is the difference between gc.collect() and del? How does __slots__ affect memory? What is weakref and when should you use it??
ID: PY-ADV-002  ·  Difficulty: 8/10  ·  Level: Advanced
PY-ADV-001 What is a metaclass in Python and when would you actually use one?
Python Core Python Advanced
8/10
Answer

A metaclass is the class of a class — it controls how classes themselves are created. Use them when you need to enforce constraints auto-register classes or modify class definitions at creation time.

Deep Explanation

In Python everything is an object including classes. The default metaclass is 'type'. When Python processes a class definition it calls the metaclass to build the class object. By creating a custom metaclass (inheriting from type and overriding __new__ or __init__) you can intercept class creation and modify or validate it. Practical uses include: enforcing that all subclasses implement certain methods automatically registering plugin classes in a registry adding logging to all methods automatically and implementing singleton patterns. Django's ORM uses metaclasses to convert class-level field declarations into actual database schema mappings.

Real-World Example

Django's Model metaclass (ModelBase) reads the field attributes you declare on a model class and builds the database schema query interface and validation logic automatically. Without metaclasses Django's ORM syntax would require explicit registration calls for every model field.

⚠ Common Mistakes

Overusing metaclasses for problems that class decorators or __init_subclass__ solve more simply. Metaclasses from different libraries conflicting when a class inherits from both (metaclass conflict). Writing metaclasses that are so abstract they become impossible to debug.

🏭 Production Scenario

An internal plugin system at a SaaS company used a metaclass to automatically register all subclasses of a BasePlugin class into a global plugin registry. This eliminated the need for manual plugin registration and prevented the recurring production bug where developers created a plugin class but forgot to register it.

Follow-up Questions
What is __init_subclass__ and how does it compare to metaclasses? How does 'type' work as a metaclass? What causes a metaclass conflict??
ID: PY-ADV-001  ·  Difficulty: 8/10  ·  Level: Advanced
AI-ADV-003 What is fine-tuning an LLM and when should you fine-tune versus use RAG or prompt engineering?
AI Integration AI Integration Advanced
8/10
Answer

Fine-tuning adjusts the model weights on domain-specific data to internalize knowledge or style. Use it when the task requires consistent behavior style or format the base model cannot achieve through prompting alone. RAG is better for factual grounding; prompt engineering first for most tasks.

Deep Explanation

Fine-tuning: continue training a pretrained LLM on a curated dataset of examples in your target format/domain. Changes the model weights permanently for that task. Types: full fine-tuning (expensive updates all parameters) parameter-efficient fine-tuning (PEFT — LoRA QLORA update a small fraction of parameters cheaply). When to fine-tune: consistent output format the base model keeps breaking (code generation with specific conventions) domain-specific style or tone (legal writing medical reports) task-specific behavior patterns (classification schema extraction) or reducing prompt length at inference (baking instructions into the model). When NOT to fine-tune: you need up-to-date information (use RAG) you are still exploring requirements (use prompting first) you have less than 1000 high-quality examples (insufficient for fine-tuning) or the base model already performs the task well with prompting.

Real-World Example

A financial services company needed an LLM to consistently extract structured data from loan applications into a specific JSON schema. Prompt engineering achieved 78% schema compliance. RAG did not help (the schema was fixed not document-dependent). Fine-tuning with 5000 labeled examples achieved 97% schema compliance with shorter prompts reducing inference cost.

⚠ Common Mistakes

Fine-tuning with low-quality or insufficient examples — produces a model worse than the base model. Fine-tuning when prompt engineering would suffice — expensive and inflexible. Forgetting that fine-tuned models still hallucinate and still need RAG for factual grounding. Not evaluating catastrophic forgetting — fine-tuning on a narrow dataset can degrade performance on general tasks.

🏭 Production Scenario

A customer service company fine-tuned an LLM on 2000 examples of customer conversations expecting it to handle all intents. In production the model lost general language capabilities and failed on intents not well-represented in the training data. Rebuilding with a larger curated dataset (15000 examples across all intents) with proper evaluation resolved the regression.

Follow-up Questions
What is LoRA and how does it make fine-tuning parameter-efficient? What is catastrophic forgetting in fine-tuning? How do you create a high-quality fine-tuning dataset??
ID: AI-ADV-003  ·  Difficulty: 8/10  ·  Level: Advanced
AI-ADV-002 What is an AI agent and how is it architecturally different from a simple LLM API call?
AI Integration AI Integration Advanced
8/10
Answer

An AI agent uses an LLM as a reasoning engine to autonomously plan use tools and complete multi-step tasks. Unlike a single LLM call that maps input to output an agent operates in a loop: observe think act observe again — until the task is complete.

Deep Explanation

The ReAct pattern (Reason + Act) describes the core agent loop: the LLM receives a task and available tools generates a thought (reasoning about what to do) selects an action (a tool call) receives the observation (tool output) and repeats until producing a final answer. Tools are functions the LLM can invoke: web search code execution database queries API calls file operations. Agent architectures range from simple (single LLM with tools) to complex (multi-agent systems where specialized agents collaborate with a planner/orchestrator agent routing tasks). Key engineering challenges: tool design (tools must have clear descriptions for the LLM to select them correctly) error handling (agents can get stuck in loops or make wrong tool calls) context management (the agent's action history grows and fills the context window) and cost control (multi-step agents can make many API calls).

Real-World Example

A customer onboarding agent at a SaaS company replaces a 12-step manual process: it receives a new customer email calls the CRM API to create a contact queries the provisioning API to set up an account generates and sends a personalized welcome email creates a Jira ticket for account review and posts a Slack notification to the account manager — all autonomously from a single trigger.

⚠ Common Mistakes

Building agents without observability — impossible to debug why an agent made wrong decisions without logging the full thought-action-observation trace. Not implementing maximum step limits — agents can loop indefinitely on ambiguous tasks. Giving agents too many tools — LLMs struggle to select from large tool sets. Not handling tool failures gracefully in the agent loop.

🏭 Production Scenario

A document processing agent for an insurance company was processing claims autonomously. Without a step limit it entered an infinite loop trying to resolve a document parsing error making 10000 API calls in 8 minutes and generating a $400 API bill before being detected. Implementing a 20-step maximum and exponential backoff on tool errors fixed the runaway behavior.

Follow-up Questions
What is the difference between ReAct Plan-and-Execute and Reflexion agent patterns? How do you implement agent memory (short-term vs long-term)? What is LangGraph and how does it implement agent state machines??
ID: AI-ADV-002  ·  Difficulty: 8/10  ·  Level: Advanced
AI-ADV-001 How does Retrieval-Augmented Generation (RAG) work and what are its main failure modes?
AI Integration AI Integration Advanced
8/10
Answer

RAG retrieves relevant documents from a vector database using semantic similarity search injects them into the LLM context and generates a response grounded in the retrieved content. Main failure modes are retrieval failures context window overflow and hallucinations about retrieved content.

Deep Explanation

RAG has three main components: indexing (documents are chunked embedded using an embedding model and stored in a vector database like Pinecone Weaviate or pgvector) retrieval (the user query is embedded and semantically similar chunks are retrieved using approximate nearest neighbor search) and generation (retrieved chunks are inserted into the LLM prompt as context and the model generates a response). Key design decisions: chunk size (too small loses context too large wastes context window and dilutes relevance) embedding model choice number of retrieved chunks (k) whether to use reranking to improve retrieved chunk ordering and metadata filtering to constrain retrieval. Advanced patterns include hybrid search (semantic + keyword/BM25) HyDE (hypothetical document embeddings) and multi-hop retrieval for complex questions.

Real-World Example

A legal research assistant RAG system at a law firm used chunk sizes of 512 tokens for case documents. Attorneys complained answers lacked context. Investigation showed important legal reasoning spanned across chunk boundaries. Implementing larger overlapping chunks (1024 tokens with 200 token overlap) and a reranker (Cohere Rerank) improved answer quality significantly.

⚠ Common Mistakes

Chunking documents arbitrarily without considering semantic boundaries (splitting mid-paragraph). Using cosine similarity retrieval without reranking causing less relevant chunks to appear in context and confuse the model. Not handling the case where no relevant documents are retrieved — the model hallucinates instead of saying it does not know. Embedding the entire document instead of chunking exceeding context limits.

🏭 Production Scenario

A production customer support RAG system was giving confidently wrong answers about product return policies. Investigation revealed the retrieval was returning chunks from old policy documents because they had higher semantic similarity scores than newer updates. Implementing date-based metadata filtering to prefer recent documents and adding a retrieval confidence threshold solved the problem.

Follow-up Questions
What is the difference between RAG and fine-tuning — when do you use each? What is a vector database and how does HNSW indexing work? What is RAGAS and how do you evaluate a RAG system??
ID: AI-ADV-001  ·  Difficulty: 8/10  ·  Level: Advanced
ML-MLO-001 What is model drift and how do you detect and handle it in production ML systems?
Machine Learning AI/ML Advanced
8/10
Answer

Model drift is the degradation of model performance over time as the real-world data distribution changes after deployment. Detect with monitoring (input distribution prediction distribution and ground truth metrics). Handle with automated retraining triggers shadow deployments and champion-challenger frameworks.

Deep Explanation

There are two types of drift: data drift (input feature distributions change — customer demographics shift new product categories appear) and concept drift (the relationship between inputs and outputs changes — what predicts churn changes as customer behavior evolves). Detecting data drift: monitor statistical properties of input features using tests like KS test Population Stability Index (PSI) or Jensen-Shannon divergence. Detecting concept drift: monitor prediction distribution shifts and when labels are available track accuracy/AUC over time. PSI > 0.2 typically signals significant drift requiring investigation. Handling drift: trigger model retraining when drift metrics exceed thresholds use sliding window retraining on recent data implement champion-challenger deployment to safely test retrained models and maintain feature stores that can be queried at training and serving time to ensure consistency.

Real-World Example

A credit scoring model deployed in January showed 0.81 AUC. By September AUC had dropped to 0.71. PSI analysis of input features revealed significant drift in employment status and income features — COVID-19 had fundamentally changed the distribution. Emergency retraining on recent data restored AUC to 0.79.

⚠ Common Mistakes

Not monitoring model performance after deployment — treating deployment as the end of the ML lifecycle. Retraining on all historical data including outdated periods instead of using a recent sliding window. Not having rollback capability when a retrained model performs worse than the current champion. Ignoring the feedback loop where model predictions affect future training data.

🏭 Production Scenario

A fraud detection model at a payment processor declined from 89% recall to 74% recall over 6 months as fraudsters adapted their behavior patterns. Monthly retraining on recent fraud cases and implementing a fast-response challenger model that retrained weekly restored recall to 86% while reducing false positives.

Follow-up Questions
What is Population Stability Index and how is it calculated? What is a feature store and why does it matter for training-serving skew? How do you implement a champion-challenger deployment??
ID: ML-MLO-001  ·  Difficulty: 8/10  ·  Level: Advanced
ML-ADV-003 How do you design an ML feature store and why does it matter for production ML?
Machine Learning AI/ML Advanced
9/10
Answer

A feature store is a centralized repository for ML features that solves the training-serving skew problem by ensuring features computed at training time are computed identically at serving time. It also enables feature reuse across teams and models.

Deep Explanation

Training-serving skew is one of the most common and damaging production ML problems: features computed during training using the full historical dataset are computed differently at serving time using real-time data leading to performance degradation. A feature store has two components: offline store (historical feature values for training — typically a data warehouse like BigQuery or Redshift) and online store (latest feature values for low-latency serving — typically Redis or DynamoDB). Feature pipelines write to both stores ensuring identical computation logic. Feature engineering logic is defined once and shared — a 'user_30_day_purchase_total' feature computed for a recommendation model can be reused by a fraud model without re-implementation. Modern feature stores (Feast Tecton Hopsworks) also handle: feature versioning (audit trail) feature sharing across teams and point-in-time correct feature lookup (critical for preventing temporal data leakage in training).

Real-World Example

At a major e-commerce company the customer lifetime value model the recommendation model and the fraud model all needed 'user_purchase_frequency_last_30_days'. Before the feature store each team computed it differently with subtle differences (timezone handling business day vs calendar day) producing inconsistent results. The feature store defined one authoritative computation shared across all three models.

⚠ Common Mistakes

Implementing offline-only features (fast to build but creates training-serving skew when serving). Computing features in the model serving code itself (no reuse performance overhead). Not handling point-in-time correctness in the offline store (using features from after the label timestamp in training data — a form of feature leakage). Building a feature store before having more than 2-3 models (premature optimization).

🏭 Production Scenario

A churn prediction model performing at 0.84 AUC in offline evaluation dropped to 0.71 AUC in production. Investigation revealed that customer engagement features were computed using UTC timestamps in training but local time in the serving API — a seemingly minor difference that caused dramatic feature value shifts for users in non-UTC timezones. Centralizing feature computation in a feature store with explicit timezone handling fixed the skew.

Follow-up Questions
What is point-in-time correct feature lookup and why does it prevent data leakage? What is the difference between Feast Tecton and Hopsworks? How do you handle feature freshness requirements for real-time models??
ID: ML-ADV-003  ·  Difficulty: 9/10  ·  Level: Advanced

PAGE 4 OF 4  ·  54 QUESTIONS TOTAL