Prepare an instruction-tuning dataset, fine-tune Phi-2 or Mistral 7B using LoRA/QLoRA on free Colab GPUs, and rigorously evaluate the fine-tuned model against the base model.
Curate and format a high-quality instruction-tuning dataset in JSONL format
Understand LoRA and QLoRA — how parameter-efficient fine-tuning works
Fine-tune a small open-source LLM (Phi-2 or Mistral 7B) on Google Colab for free
Evaluate a fine-tuned model rigorously against its base model using an LLM judge
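The dataset-curation step above can be sketched in a few lines. This is a minimal example of writing instruction/response pairs as chat-style JSONL records; the `instruction`/`response` field names and the message schema are assumptions — match whatever format your trainer expects.

```python
import json

def to_jsonl_records(examples):
    """Format (instruction, response) pairs as chat-style JSONL lines.
    The message schema here is an assumption; adapt it to your trainer."""
    lines = []
    for ex in examples:
        record = {
            "messages": [
                {"role": "user", "content": ex["instruction"]},
                {"role": "assistant", "content": ex["response"]},
            ]
        }
        # One JSON object per line is what makes the file valid JSONL.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines

examples = [{
    "instruction": "Define LoRA in one sentence.",
    "response": "LoRA fine-tunes a model by training small low-rank adapter matrices.",
}]
for line in to_jsonl_records(examples):
    print(line)
```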
Define test cases, score LLM outputs on accuracy, faithfulness, and tone. Build a regression tracker that alerts you when a prompt change breaks a previously passing test.
Design a rigorous evaluation test suite with multiple scoring dimensions
Use an LLM-as-judge to automatically score other LLM outputs
Track performance across prompt versions and detect regressions
Calculate inter-rater agreement between automated and human evaluators
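The inter-rater agreement objective above has a standard closed form: Cohen's kappa, which corrects raw agreement for agreement expected by chance. A self-contained sketch, assuming two parallel sequences of verdicts (e.g. LLM-judge vs. human pass/fail labels):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa between two equal-length label sequences,
    e.g. LLM-judge verdicts vs. human verdicts on the same outputs."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters always give one label
    return (observed - expected) / (1 - expected)
```

A kappa near 1.0 means the automated judge tracks the human closely; near 0 means its agreement is no better than chance.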
Search across multiple document collections, rerank chunks by relevance using a cross-encoder, and generate a cited, structured answer from multiple sources.
Index and query documents from multiple distinct sources in a single vector store
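The reranking step above is just "score every candidate against the query, sort, truncate." A minimal sketch, where `score_fn` is a placeholder for a real cross-encoder (for example, a sentence-transformers `CrossEncoder.predict` call); the toy overlap scorer is purely illustrative:

```python
def rerank(query, chunks, score_fn, top_k=5):
    """Re-order retrieved chunks by relevance score and keep the top_k.
    `score_fn(query, chunk)` stands in for a real cross-encoder."""
    scored = [(score_fn(query, c), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

def overlap_score(query, chunk):
    """Toy scorer: query-term overlap. A real system would use a model."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split()))
```

The key design point: the first-stage vector search is cheap but coarse, so you over-retrieve (say, 20 chunks) and let the slower, more accurate cross-encoder pick the final few.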
Build a RAG system that dynamically assembles context based on query type: metadata filters, recency weighting, source diversity rules, and intent-specific prompt templates.
Build a rigorous evaluation framework with a fixed test set and baseline score
Add document metadata and use it to filter retrieved chunks before answer generation
Classify query intent and apply different prompt templates for each type
Measure the impact of each RAG improvement quantitatively against a baseline
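The metadata-filter and recency-weighting steps above can be sketched together. The chunk schema (`score`, `source`, `published`) and the exponential half-life decay are assumptions to adapt to your own vector store:

```python
from datetime import date

def filter_and_weight(chunks, source=None, half_life_days=180, today=None):
    """Filter retrieved chunks by source metadata, then down-weight stale
    ones so a slightly less relevant but fresher chunk can win.
    Chunk schema and the 180-day half-life are illustrative assumptions."""
    today = today or date.today()
    kept = [c for c in chunks if source is None or c["source"] == source]
    for c in kept:
        age_days = (today - c["published"]).days
        # Halve the retrieval score every `half_life_days` of age.
        c["weighted"] = c["score"] * 0.5 ** (age_days / half_life_days)
    return sorted(kept, key=lambda c: c["weighted"], reverse=True)
```

Filtering before generation (rather than after) matters: it keeps off-topic or stale chunks out of the prompt entirely instead of hoping the model ignores them.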
Summarize documents that exceed the context window using chunking and map-reduce patterns. Handle PDFs, articles, and reports of any length intelligently.
Ingest documents from multiple input types: pasted text, .txt files, and PDFs
Implement map-reduce summarization to handle arbitrarily long documents
Control output structure using prompt instructions (bullets, Q&A, executive summary)
Estimate and display approximate API cost per operation
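The map-reduce pattern above is a short function once the LLM call is abstracted away. `summarize(text)` here is a placeholder for your actual API call, and the character-based chunk sizes are rough assumptions (real pipelines usually count tokens):

```python
def map_reduce_summarize(document, summarize, chunk_chars=4000, overlap=200):
    """Map-reduce summarization: summarize each chunk independently (map),
    then summarize the concatenated partial summaries (reduce).
    `summarize(text)` is a placeholder for your LLM call."""
    step = chunk_chars - overlap
    chunks = [document[start:start + chunk_chars]
              for start in range(0, len(document), step)]
    if len(chunks) == 1:
        return summarize(chunks[0])            # short doc: one call suffices
    partials = [summarize(c) for c in chunks]  # map step
    return summarize("\n".join(partials))      # reduce step
```

For very long documents the reduce input can itself exceed the context window, at which point you recurse: treat the partial summaries as a new document and map-reduce again.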
Run the same query through 3 different prompt templates and score outputs by quality. Build a systematic prompt testing and iteration tool — the skill every AI engineer needs.
Design and compare multiple prompt templates for the same task systematically
Use concurrent API calls to run prompt variants in parallel
Build a scoring system to quantitatively evaluate prompt quality
Identify which prompt patterns (chain-of-thought, few-shot, direct) work best for different query types
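The parallel-variant step above maps cleanly onto a thread pool, since API calls are I/O-bound. A sketch, where `call_llm(prompt)` is a placeholder for your API call and the `{query}` template slot is an assumed convention:

```python
from concurrent.futures import ThreadPoolExecutor

def run_variants(query, templates, call_llm):
    """Run the same query through several prompt templates concurrently.
    `templates` maps a variant name to a template with a `{query}` slot;
    `call_llm(prompt)` is a placeholder for the real API call."""
    prompts = {name: tpl.format(query=query) for name, tpl in templates.items()}
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        futures = {name: pool.submit(call_llm, p) for name, p in prompts.items()}
        # .result() blocks until each call finishes; errors re-raise here.
        return {name: f.result() for name, f in futures.items()}
```

Keeping results keyed by variant name makes the downstream scoring step trivial: score each value, then compare across keys.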
Upload a PDF, ask questions about it, get accurate answers. Build your first RAG pipeline end-to-end — chunking, embedding, vector storage, and retrieval.
Chunk documents into overlapping segments for effective retrieval
Generate and store vector embeddings using OpenAI's Embeddings API
Build and query a FAISS vector index for similarity search
Create a RetrievalQA chain that grounds answers in document content
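The overlapping-chunk step above is worth seeing concretely: the overlap ensures a sentence cut at one chunk boundary reappears whole at the start of the next. Sizes here are character counts for simplicity; production pipelines typically count tokens instead:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping fixed-size chunks. Each chunk starts
    `chunk_size - overlap` characters after the previous one, so adjacent
    chunks share `overlap` characters of context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size]
            for start in range(0, len(text), step)]
```

Each chunk is then embedded and stored; at query time the question is embedded with the same model and the nearest chunks are retrieved.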
Pick a topic (company policy, a book, a subject), load the content, and answer user questions about it using basic prompt stuffing. Your intuition-builder for why RAG exists.
Understand the prompt stuffing technique and how it works
Identify the practical limits of putting large documents into a prompt
Observe and document LLM hallucination behavior on out-of-scope questions
Write a prompt that instructs the model to cite its sources
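Prompt stuffing plus a citation instruction is just careful string assembly. A sketch, where the wording and the `[n]` citation convention are assumptions you should adapt; note that this approach breaks as soon as the documents exceed the context window, which is the limitation this project is designed to make you feel:

```python
def build_stuffed_prompt(question, documents):
    """Stuff full documents into one prompt, numbered so the model can
    cite them. The instruction wording is illustrative, not canonical."""
    sources = "\n\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(documents, start=1)
    )
    return (
        "Answer using ONLY the sources below. Cite each claim like [1].\n"
        "If the answer is not in the sources, say you don't know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )
```

The "say you don't know" line is the hallucination guard you'll be stress-testing with out-of-scope questions.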
Feed in a long article or paste any text and get a clean summary. Learn API calls, basic prompt construction, and how to handle long inputs by chunking.
Build prompt templates that produce structured, useful summaries
Solve the context window problem using text chunking
Implement map-reduce summarization for documents of any length
Give users control over output format and tone through prompt engineering
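The format-and-tone control above amounts to selecting among instruction snippets before the text. The format names and instruction wording here are illustrative assumptions, not a standard:

```python
FORMATS = {
    "bullets": "Summarize as 5-7 concise bullet points.",
    "qa": "Summarize as a list of question-and-answer pairs.",
    "executive": "Write a one-paragraph executive summary.",
}

def summary_prompt(text, fmt="bullets", tone="neutral"):
    """Build a summarization prompt with an explicit structure and tone.
    Keeping formats in a dict makes adding a new one a one-line change."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt}")
    return f"{FORMATS[fmt]} Use a {tone} tone.\n\nText:\n{text}\n\nSummary:"
```

Exposing `fmt` and `tone` as function parameters is what turns a one-off script into a tool users can actually steer.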
Build a chatbot using direct OpenAI API calls — maintain conversation history, handle multi-turn context, and display responses cleanly. Your very first LLM integration.
Make your first OpenAI API call and understand the request/response structure
Maintain multi-turn conversation context using the messages array
Build a Streamlit chat UI with session state and message history
Use system prompts to give an LLM a persona and behavioral constraints
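The multi-turn context objective above boils down to how you assemble the `messages` array on each turn: one system prompt, recent history, then the new user message. Trimming by turn count is a simple assumption here; production apps usually trim by token count instead:

```python
def build_messages(system, history, user_input, max_turns=10):
    """Assemble an OpenAI-style messages array for the next API call.
    `history` alternates user/assistant dicts; only the most recent
    `max_turns` turns are kept so the prompt doesn't grow without bound."""
    trimmed = history[-2 * max_turns:]  # each turn = one user + one assistant
    return (
        [{"role": "system", "content": system}]
        + trimmed
        + [{"role": "user", "content": user_input}]
    )
```

After each response arrives, append both the user message and the assistant reply to `history` (in Streamlit, stored in `st.session_state`) so the next turn sees the full conversation.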
Every challenge includes detailed documentation, technical constraints, and automated evaluation scripts to ensure you have everything you need to succeed.