90 - AI Agentic Coding Tools for Researchers/Engineers

A practical guide for ML researchers and various engineers who write code but aren't software developers by trade.

Target audience: TalTech researchers working in ML, IoT, signal processing, embedded systems. Comfortable with Python, Jupyter, maybe C/C++ for microcontrollers, possibly MATLAB. Not doing enterprise software development (I'm doing that).

Goal: Get you productively using AI coding tools — today, on your own hardware or via API — without drowning in hype.

Part 1: The Honest State of Things

What "AI Coding" Actually Means in 2026

There's a spectrum of how AI can help you write code, from least to most autonomous:

Autocomplete — predicts the next few tokens/lines as you type. Like aggressive IntelliSense. Low risk, high convenience.
Chat — you paste code, ask questions, get explanations or rewrites. Copy-paste workflow. You're in full control.
Inline edit — you select code in your editor, describe a change, the model rewrites that block. You review a diff.
Agentic — you describe a task, the AI reads your files, writes code, runs commands, reads errors, fixes them, iterates. Semi-autonomous.

Most of the hype is about level 4. Most of the practical value for researchers is at levels 2–3. Be honest with yourself about where you are.

Frontier Models vs. Local Models

This is the single most important thing to understand before choosing tools.

Frontier models (Claude Opus, GPT, Gemini) are served from datacenters with hundreds of GPUs. They are trained on enormous datasets, heavily post-trained for instruction following and tool use, and have context windows of 256k–1M+ tokens. Ca 24TB gpu ram for training, ca 1tb ram for inference.

Local models run on your machine. Your hardware is the ceiling. With 128 GB of unified memory (top-end Mac laptop) you can run roughly a 70B parameter model at Q4 quantization. With a single NVIDIA GPU (24 GB VRAM), you're looking at 7B–14B models at reasonable quality, or 32B with aggressive quantization and partial CPU offload.

The gap is not subtle

Capability	Frontier (Opus/GPT/Gemini)	Local 32B (Qwen 3.5 Coder)	Local 7B
Autocomplete	Overkill	Good	Good
Code chat / Q&A	Excellent	Good	Decent
Single-file generation	Excellent	Good	Passable
Multi-file editing	Excellent	Fragile	Unreliable
Agentic (plan → execute → debug)	Works	Breaks after 3–5 steps	Doesn't work
Context window	256k–1M tokens	32k-128k tokens	8–32k tokens
Tool use / function calling	Reliable	Inconsistent	Poor
Error self-correction	Yes, iterates well	Sometimes	Rarely

The practical takeaway: Use local models for autocomplete, chat, and single-shot code generation. Use frontier APIs for anything that requires planning, multi-step execution, or understanding a large codebase. This is not a temporary gap — it's physics. Frontier models are 10–50x larger and trained with 100–1000x more compute.

Why local still matters

Privacy. Your unpublished research, datasets, proprietary hardware designs, grant proposals - nothing leaves your machine.
Latency. For autocomplete, local inference at 30–50 tokens/sec feels instant. API round-trips add 200–500ms minimum.
Cost. No per-token billing. Run it 24/7 if you want.
Offline. Works on a plane, in a Faraday cage, in a server room with no internet.
Learning. You're ML researchers. Running models locally teaches you about inference, quantization, memory bandwidth, KV cache — things you should understand anyway.

Quantization: The 30-Second Version

You already know what floating point precision means. Quantization is storing model weights at reduced precision to fit in less memory.

FP16 / BF16 — full precision. A 70B model needs ~140 GB. You can't run this locally unless you have very exotic hardware.
Q8 (8-bit) — ~70 GB for 70B. Minimal quality loss. Fits on 128 GB unified memory machines with room for KV cache.
Q4_K_M (4-bit, k-quant medium) — ~40 GB for 70B. Noticeable but acceptable quality loss. The sweet spot for most local setups.
Q3 and below — quality drops fast. Not recommended for coding tasks.

The format you'll see most often is GGUF — a single-file format used by llama.cpp and everything built on it (ollama, LM Studio, etc.).
MLX format for Apple Silicon. Most developers are on macs.

Part 2: Local Serving — How to Run Models on Your Machine

Option 1: Ollama (Recommended Starting Point)

Platforms: macOS, Linux, Windows
What it is: A CLI tool that downloads and serves LLMs locally. Think of it as "Docker for language models."
Models to test: qwen-coder-30b-a3bm, qwen3-coder-next, qwen3.5:27b, gpt-oss-20b

Install:

# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows — download installer from https://ollama.ai

Usage:

# https://ollama.com/library/qwen3.5/tags
# Pull a model
ollama pull qwen3.5:27b-q8_0

# Chat directly
ollama run qwen3.5:27b-q8_0

# Start as a server (for other tools to connect to)
ollama serve
# Server runs on http://localhost:11434
# OpenAI-compatible API at http://localhost:11434/v1
# https://docs.ollama.com/integrations/claude-code

Why ollama: Zero configuration. Model management is trivial. Exposes an OpenAI-compatible API, so almost every tool can connect to it. Cross-platform. Has also Anthropic API.

Limitations: Less control over inference parameters. No tensor parallelism across multiple GPUs (uses llama.cpp under the hood, which does support multi-GPU, but ollama's control over it is limited).

Option 2: LM Studio

Platforms: macOS, Linux, Windows
What it is: A desktop GUI application for downloading and running LLMs locally.

Good for researchers who prefer a graphical interface. Browse models from Hugging Face, download with one click, chat in a built-in UI, or enable a local server that's API-compatible with OpenAI's format (and Anthropic also now).

When to use over ollama: When you want a visual model browser, or you're on Windows and don't want to touch the terminal.

Option 3: llama.cpp Server (Direct)

Platforms: macOS, Linux, Windows
What it is: The C/C++ inference engine that powers ollama and LM Studio, used directly.

# Build from source (you'll want GPU support)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # or -DGGML_METAL=ON for Mac
cmake --build build --config Release

# Run server
./build/bin/llama-server -m model.gguf -c 32768 --port 8080

When to use: When you need fine control over context length, batch size, thread count, GPU layer offloading. When you're benchmarking inference performance. When ollama's defaults aren't cutting it.

Option 4: MLX (Apple Silicon Only)

Platforms: macOS (Apple Silicon only)
What it is: Apple's ML framework, optimized for Metal / unified memory.

pip install mlx-lm

# Run a model
mlx_lm.server --model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit

Why: Faster than llama.cpp on Apple Silicon for some models. Native Metal acceleration without translation layers.

Limitations: Mac only. Smaller model ecosystem (need MLX-format weights, though the community converts most popular models).

Option 5: vLLM (Linux + NVIDIA GPU)

Platforms: Linux (NVIDIA GPUs)
What it is: High-throughput inference server. Production-grade.

pip install vllm
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct --dtype auto --max-model-len 32768

When to use: If you have a Linux workstation or server with NVIDIA GPUs and want the best throughput. Supports tensor parallelism across multiple GPUs natively. Overkill for single-user desktop use, but many of you have lab servers.

Which Models to Pull

For coding tasks specifically, as of early 2026:

32GB - 48GB+ RAM/VRAM (Workstation/High-End):

Qwen 3.5 30B/70B variants (30B-300B): Specifically high-parameter models and mixture-of-experts (MoE) for complex agentic workflows.
GLM-5 (Reasoning): A top open-source choice that matches proprietary models in coding tasks.
Kimi-K2 Thinking (64GB+ RAM): Strong reasoning and tool usage for complex projects.

tip

Start with ollama pull qwen3.5:27b-q8_0 if your machine can handle it. Drop to qwen3.5:27b if not. These are the best bang-for-buck local coding models right now.

caution

The local model landscape moves fast. By the time you read this, there may be better options. Check benchmarks, don't trust marketing. Test models for your specific use case. Develop your own benchmarks for your specific workflows.

Part 3: The Tools — From Autocomplete to Agents

Tier 1: Autocomplete and Chat in Your Editor

The lowest-friction entry point. No workflow change required — just install an extension.

Continue.dev (VS Code / JetBrains)

Open source. Supports local models via ollama.

Setup:

Install the Continue extension in VS Code
Open Continue settings (sidebar → gear icon)
Configure to use your local ollama instance:

~/.continue/config.json
{
  "models": [
    {
      "title": "Qwen 2.5 Coder 32B",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen Coder 7B",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}

What you get:

Tab autocomplete (use a smaller model — 7B is fine — for speed)
Chat sidebar (use the 32B model for quality)
Inline editing (select code → Ctrl+I → describe change)
Can reference files with @filename in chat

Why start here: No agentic risk. The model suggests, you accept or reject. You learn what AI can and can't do before giving it more autonomy.

GitHub Copilot

Proprietary, cloud-based (not local). Subscription-based (~$10/month individual, free for verified academics in some programs).

Excellent autocomplete. Chat features improving. Works in VS Code, JetBrains, Neovim, and importantly JupyterLab.

When to use over Continue.dev: If you primarily work in Jupyter notebooks and want inline suggestions there. If you don't care about local and just want it to work.

Tier 2: Agentic Coding in VS Code

Kilo Code

Open source VS Code extension. The most practical way to get agentic coding in your editor — it can read your project files, create and edit code, run terminal commands, read output, and iterate on errors.

Install: Search for "Kilo Code" in the VS Code extensions marketplace and install it.

Setup with local models:

Open Kilo Code settings
Add an OpenAI-compatible API provider pointing at your ollama instance:
- API base: http://localhost:11434/v1
- Model: qwen2.5-coder:32b
No API key needed for ollama (enter any placeholder)

Setup with frontier models:

Add Anthropic as a provider with your API key
Select Claude Sonnet or Opus as the model
Or add OpenAI, Google, etc. — Kilo Code supports all major providers. You can even mix local and frontier models for different tasks. And use EU based providers (Azure Ai Foundry).

Key concepts:

Modes — Kilo Code has multiple operating modes: Code (write/edit files), Architect (plan before coding), Ask (Q&A without file changes), Debug (analyze errors). Start with Ask mode to build trust.
Context — it reads files in your workspace automatically. You can @mention specific files to focus attention.
Approval workflow — every file change, terminal command, or API call requires your explicit approval before execution. You always see what it wants to do before it does it.
Diff view — proposed edits are shown as diffs. Accept, reject, or ask for changes.
MCP support — can connect to MCP servers for database access, file operations, custom tools.

Why Kilo Code for researchers: It lives inside VS Code where you already work (or in JetBrains stuff). The approval workflow means the AI never does anything without your explicit consent. Works with both local models (via ollama, lm studio) and frontier APIs — switch between them freely depending on task complexity. The Ask mode is a great way to start: get explanations and suggestions without the AI touching your files.

Tier 3: Full Agentic Tools

These tools take a description of what you want, then autonomously write code, run it, read errors, fix them, and iterate. They require frontier models to work reliably.

Claude Code

Anthropic's CLI agent. As of early 2026, the most capable agentic coding tool available.

Install:

npm install -g @anthropic-ai/claude-code

Usage:

cd your-project
claude
# Then describe what you want in natural language

What it does:

Reads your project files
Writes and modifies code across multiple files
Runs commands (tests, builds, linters)
Reads output and errors
Self-corrects and iterates
Uses Sonnet by default, can use Opus for harder tasks

Limitations: Requires an Anthropic API key. Not local. Costs money per use (Sonnet is ~$3/$15 per million input/output tokens). But for complex multi-file tasks, it will save you hours.

The honest pitch: If you have a task that would take you 2 hours of Python wrangling — "parse these 50 CSV files from different instruments, normalize timestamps, merge them, generate summary statistics and plots" — Claude Code can often do it in 5 minutes. The API cost for that is maybe €0.50. Your time is worth more.

Other Agentic Tools (Brief Mentions)

Kilo Code — built-in agentic mode. Works with local models or frontier APIs. Approval workflow means it never does anything without your consent. Plugins and CLI. Their own LLM gateway if needed. Provider agnostic.
Pi — open source minimql agentic framework. Research-grade, rough edges. Writes itself.
Cursor — VS Code fork with deep AI integration. Proprietary, subscription-based. Good, but locks you into their ecosystem.
Windsurf (Codeium) — Similar to Cursor. Another proprietary IDE fork.
OpenHands (formerly OpenDevin) — open source agentic framework. Research-grade, rough edges.
SWE-agent — designed for automated bug fixing. Academic project (Princeton). Interesting if you're researching AI agents themselves.

Summary: What to Use When

Task	Tool	Model	Local?
Autocomplete while typing	Continue.dev	Qwen 2.5 Coder 7B via ollama	Yes
"Explain this code" / Q&A	Continue.dev chat or Kilo Code (Ask mode)	Qwen 2.5 Coder 32B via ollama	Yes
"Rewrite this function"	Kilo Code (Code mode) or Continue inline edit	Qwen 2.5 Coder 32B via ollama	Yes
Small scripts from scratch	Kilo Code	Local 32B or Sonnet API	Either
Multi-file project work	Kilo Code or Claude Code + Sonnet/Opus API	Sonnet/Opus API	No
"Build me a data pipeline"	Claude Code	Sonnet/Opus API	No
Jupyter notebook help	GitHub Copilot or chat + paste	Varies	Varies

Tier 1: Autocomplete and Chat in Your Editor

The lowest-friction entry point. No workflow change required — just install an extension.

Continue.dev (VS Code / JetBrains)

Open source. Supports local models via ollama.

Setup:

Install the Continue extension in VS Code
Open Continue settings (sidebar → gear icon)
Configure to use your local ollama instance:

~/.continue/config.json
{
  "models": [
    {
      "title": "Qwen 2.5 Coder 32B",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen Coder 7B",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}

What you get:

Tab autocomplete (use a smaller model — 7B is fine — for speed)
Chat sidebar (use the 32B model for quality)
Inline editing (select code → Ctrl+I → describe change)
Can reference files with @filename in chat

Why start here: No agentic risk. The model suggests, you accept or reject. You learn what AI can and can't do before giving it more autonomy.

GitHub Copilot

Proprietary, cloud-based (not local). Subscription-based (~$10/month individual, free for verified academics in some programs).

Excellent autocomplete. Chat features improving. Works in VS Code, JetBrains, Neovim, and importantly JupyterLab.

When to use over Continue.dev: If you primarily work in Jupyter notebooks and want inline suggestions there. If you don't care about local and just want it to work.

Tier 2: Interactive Coding Assistants (Chat + Edit)

Aider

Open source. Terminal-based. The best tool for using local models in an agentic-ish way.

Install:

pip install aider-chat

Usage with local models:

# With ollama
aider --model ollama/qwen2.5-coder:32b

# With any OpenAI-compatible local server
aider --model openai/qwen2.5-coder:32b --openai-api-base http://localhost:11434/v1

Usage with frontier models (when you need the big guns):

# Claude Sonnet (needs ANTHROPIC_API_KEY env var)
aider --model sonnet

# Claude Opus
aider --model opus

Key concepts:

Aider watches your git repo. Every AI edit is a git commit. You can always git diff or git revert.
/add filename — add files to the AI's context (it can only see files you add)
/ask — ask a question without letting the AI edit anything
/code — let the AI propose edits
/diff — review what changed
/undo — revert the last AI edit

Why aider for researchers: Git integration means you always have a safety net. The /ask mode lets you use it as a pure Q&A tool before trusting it with edits. Works with both local and frontier models — you can switch mid-session.

Cline (VS Code Extension)

Open source VS Code extension. More autonomous than Continue — it can create files, run terminal commands, read output, and iterate.

Can connect to local models via ollama, but the quality of agentic behavior degrades significantly with local models. Best used with frontier APIs.

When to use: When you want agentic behavior inside VS Code rather than in a terminal.

Tier 3: Full Agentic Tools

These tools take a description of what you want, then autonomously write code, run it, read errors, fix them, and iterate. They require frontier models to work reliably.

Claude Code

Anthropic's CLI agent. As of early 2026, the most capable agentic coding tool available.

Install:

npm install -g @anthropic-ai/claude-code

Usage:

cd your-project
claude
# Then describe what you want in natural language

What it does:

Reads your project files
Writes and modifies code across multiple files
Runs commands (tests, builds, linters)
Reads output and errors
Self-corrects and iterates
Uses Sonnet by default, can use Opus for harder tasks

Limitations: Requires an Anthropic API key. Not local. Costs money per use (Sonnet is ~$3/$15 per million input/output tokens). But for complex multi-file tasks, it will save you hours.

Other Agentic Tools (Brief Mentions)

Cursor — VS Code fork with deep AI integration. Proprietary, subscription-based. Good, but locks you into their ecosystem.
Windsurf (Codeium) — Similar to Cursor. Another proprietary IDE fork.
OpenHands (formerly OpenDevin) — open source agentic framework. Research-grade, rough edges.
SWE-agent — designed for automated bug fixing. Academic project (Princeton). Interesting if you're researching AI agents themselves.

Summary: What to Use When

Task	Tool	Model	Local?
Autocomplete while typing	Continue.dev	Qwen 2.5 Coder 7B via ollama	Yes
"Explain this code" / Q&A	Continue.dev chat or Aider `/ask`	Qwen 2.5 Coder 32B via ollama	Yes
"Rewrite this function"	Aider `/code` or Continue inline edit	Qwen 2.5 Coder 32B via ollama	Yes
Small scripts from scratch	Aider	Local 32B or Sonnet API	Either
Multi-file project work	Aider or Claude Code	Sonnet/Opus API	No
"Build me a data pipeline"	Claude Code	Sonnet/Opus API	No
Jupyter notebook help	GitHub Copilot or chat + paste	Varies	Varies

Part 4: Practical Workflow for Researchers

You're not building SaaS products. You're writing data processing scripts, analysis pipelines, experiment automation, firmware, visualization code. Your workflow is different from a software developer's. Here's how to adapt.

Start With Chat, Not Agents

Don't jump to "let the AI write my whole project." Start by using it as a conversation partner:

Paste a function, ask "what does this do?"
Paste an error traceback, ask "why is this happening?"
Describe what you want, ask "what library should I use for this?"
Paste your data format, ask "write a parser for this"

This works well even with local models. You stay in control. You learn what the AI is good at and where it hallucinates.

The Spec-First Approach

When you do want the AI to write something substantial, write a specification first. Plain text, markdown, whatever. Describe:

What the inputs are (file formats, data types, sources)
What the outputs should be (file formats, plots, reports)
What the processing steps are (in your domain language)
What constraints exist (memory, time, hardware)

Example:

## Task: IMU Data Processor

**Input:** CSV files from MPU6050 IMU sensor. Columns: timestamp_ms, accel_x, accel_y,
accel_z, gyro_x, gyro_y, gyro_z. Sample rate ~100 Hz but not perfectly regular.

**Processing:**
1. Resample to exactly 100 Hz using linear interpolation
2. Apply complementary filter (alpha=0.98) to get roll and pitch
3. Detect "impact events" where total acceleration exceeds 3g
4. For each impact event, extract ±500ms window

**Output:**
- Cleaned CSV with added roll/pitch columns
- Matplotlib plot showing acceleration magnitude over time with impact events marked
- JSON summary: number of impacts, timestamps, peak magnitudes

**Constraints:** Must handle files up to 2 GB. Single-threaded is fine.

Feed this to the AI. This works dramatically better than "write me a script that processes IMU data." The spec IS the prompt.

Context Management: The Local Model Bottleneck

Frontier models can ingest 128k+ tokens — your entire project, all your data schemas, your README, everything. Local models with 32k context windows (and often degrading quality past 8–16k) need help.

Strategies:

Only add the files the AI needs to see. In Kilo Code, @mention specific files to focus the model's attention rather than letting it scan your entire workspace.
Keep files small. Split 2000-line monolithic scripts into modules. Good practice anyway.
Be explicit. Don't say "use the config from the other file." Paste the relevant config values directly.
For chat/Q&A: paste only the relevant function, not the whole file.

Jupyter Notebook Workflows

Most of you live in Jupyter. Here's how AI tools fit:

Option A: Chat on the side. Keep a chat window open (Continue.dev sidebar, Kilo Code panel, or claude.ai in a browser). Copy cells in, get answers back. Low-tech, works everywhere.

Option B: GitHub Copilot in JupyterLab. If you use JupyterLab (not classic Notebook), Copilot provides inline autocomplete. Feels natural. Cloud-based.

Option C: AI generates a .py script, you convert to notebook. For larger tasks, let the AI write a proper Python script, then use jupytext or manual copy-paste to turn it into a notebook. AI is better at generating scripts than notebooks because notebooks mix code, output, and markdown in a format that's hard to diff.

Validation: Don't Trust, Verify

This matters more for researchers than for software developers, because your code produces results that go into papers.

The AI will:

Write code that looks correct but has subtle numerical bugs
Use deprecated APIs from older library versions
Implement algorithms with off-by-one errors in edge cases
Produce plausible but wrong statistical calculations
Confidently invent function signatures that don't exist

Your defense:

Test with known inputs where you can verify the output by hand
Compare results against a reference implementation or published values
Read the generated code, especially the math. Don't just run it.
Use assert statements liberally
For statistical code: run on synthetic data with known properties first

danger

AI-generated code is confidently wrong. It doesn't flag uncertainty. A function that silently swaps two matrix dimensions will produce output of the right shape and type — just wrong values. You are responsible for validation.

Part 5: Cost, Privacy, and Practical Decisions

Cost Math

Local inference (one-time hardware cost):

Hardware	Approximate Price	What It Runs
M4 Mac Mini 32 GB	~€900	14B models comfortably, 32B with offloading
M4 Pro Mac Mini 48 GB	~€2,200	32B models comfortably
M4 Max MacBook Pro 128 GB	~€5,000+	70B models at Q4, 32B at Q8
Mac Studio M4 Ultra 192 GB	~€6,000+	70B at Q8, 100B+ at Q4. The king.
Linux + RTX 4090 (24 GB)	~€2,500	14B at FP16, 32B at Q4 with offload
Linux + 2× RTX 4090	~€4,500	32B at Q8, 70B at Q4

API costs (pay per token):

Model	Input	Output	Typical cost per "write a script" task
Claude Opus	~$5/M tokens	~$15/M tokens	€0.50–5.00

For typical research use (maybe 20–50 tasks per day), frontier API costs are roughly €50–200/month. A Sonnet-tier model for light-to-moderate use might be under €30/month.

If your institution provides Azure or cloud credits, use them.

tip

The subscription plans are often better value for moderate users. Claude Pro is $20/month for generous usage. GitHub Copilot is $10/month. These are simpler than managing API keys and budgets.

Privacy Considerations

Concern	Local	Cloud API	Subscription (Claude Pro, Copilot)
Data leaves your machine	No	Yes	Yes
Provider can train on your data	No	Depends on ToS (usually not for API)	Depends on settings
GDPR compliance	Your responsibility	Check provider's DPA	Check provider's terms
Unpublished research safety	Safe	Read the terms	Read the terms
Export-controlled code	Safe	May violate regulations	May violate regulations

For sensitive work: Use local models. Full stop.

For everything else: Read the terms of service. Most API providers (Anthropic, OpenAI) do NOT train on API data by default. But "default" can change, and your institution may have policies.

MCP: The Plumbing That Connects AI to Your Tools

Model Context Protocol (MCP) is a standard for connecting AI assistants to external data sources and tools. Think of it as a plugin system.

Why it matters for researchers: your workflows involve instruments, databases, lab equipment, file systems, custom APIs. MCP lets an AI agent interact with these programmatically.

Example MCP integrations:

File system access (read/write project files)
Database queries (SQLite, PostgreSQL)
HTTP APIs (your lab's REST endpoints)
Custom tools (instrument control, data acquisition)

This is still maturing. Frontier model tools (Claude Code, Cursor) support MCP today. Local model support is emerging. Worth tracking, not yet worth building your workflow around unless you're an early adopter.

Part 6: Decision Flowchart

Do you need AI help with code?
│
├─ Just autocomplete while typing?
│  └─ Continue.dev + ollama (Qwen 7B) — free, local, done.
│
├─ Need to understand / debug existing code?
│  └─ Continue.dev chat or Kilo Code Ask mode + ollama (Qwen 32B) — free, local.
│
├─ Writing a single script or function?
│  ├─ Is it straightforward?
│  │  └─ Kilo Code + local 32B — works fine.
│  └─ Is it complex (multiple libraries, tricky logic)?
│     └─ Kilo Code + Sonnet API — worth the €0.20.
│
├─ Multi-file project work?
│  └─ Claude Code or Kilo Code + Sonnet/Opus API. Local models will frustrate you.
│
├─ Sensitive / restricted data involved?
│  └─ Local models only. No exceptions.
│
└─ "I just want someone to talk to about my code"
   └─ claude.ai or ChatGPT free tier. No setup required.

Appendix: Quick Start (5 Minutes)

If you want to get something running right now:

macOS / Linux

# Install ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a coding model (pick based on your RAM)
ollama pull qwen2.5-coder:32b   # 20+ GB RAM needed
# or
ollama pull qwen2.5-coder:14b   # 10+ GB RAM needed
# or
ollama pull qwen2.5-coder:7b    # 5+ GB RAM needed

# Chat with it immediately
ollama run qwen2.5-coder:32b

Then install Continue.dev in VS Code and point it at ollama.

Windows

Download and install Ollama from ollama.ai
Open PowerShell: ollama pull qwen2.5-coder:32b
Install Continue extension in VS Code
Configure Continue to use ollama provider

Want agentic coding?

Install the Kilo Code extension in VS Code
Open Kilo Code settings → add an OpenAI-compatible provider:
- API base: http://localhost:11434/v1
- Model: qwen2.5-coder:32b
Start in Ask mode — get comfortable before switching to Code mode

For frontier-quality agentic coding, add Anthropic as a provider with your API key and select Claude Sonnet.

Appendix: Recommended Reading

Ollama documentation — model library and setup
Kilo Code — VS Code agentic coding extension
Continue.dev documentation — VS Code/JetBrains setup
Claude Code documentation — Anthropic's CLI agent
llama.cpp — the engine under most local tools
MCP specification — the protocol connecting AI to tools

Last updated: March 2026. The local model landscape changes fast — verify model recommendations against current benchmarks before committing.

Part 1: The Honest State of Things​

What "AI Coding" Actually Means in 2026​

Frontier Models vs. Local Models​

The gap is not subtle​

Why local still matters​

Quantization: The 30-Second Version​

Part 2: Local Serving — How to Run Models on Your Machine​

Option 1: Ollama (Recommended Starting Point)​

Option 2: LM Studio​

Option 3: llama.cpp Server (Direct)​

Option 4: MLX (Apple Silicon Only)​

Option 5: vLLM (Linux + NVIDIA GPU)​

Which Models to Pull​

Part 3: The Tools — From Autocomplete to Agents​

Tier 1: Autocomplete and Chat in Your Editor​

Continue.dev (VS Code / JetBrains)​

GitHub Copilot​

Tier 2: Agentic Coding in VS Code​

Kilo Code​

Tier 3: Full Agentic Tools​

Claude Code​

Other Agentic Tools (Brief Mentions)​

Summary: What to Use When​

Tier 1: Autocomplete and Chat in Your Editor​

Continue.dev (VS Code / JetBrains)​

GitHub Copilot​

Tier 2: Interactive Coding Assistants (Chat + Edit)​

Aider​

Cline (VS Code Extension)​

Tier 3: Full Agentic Tools​

Claude Code​

Other Agentic Tools (Brief Mentions)​

Summary: What to Use When​

Part 4: Practical Workflow for Researchers​

Start With Chat, Not Agents​

The Spec-First Approach​

Context Management: The Local Model Bottleneck​

Jupyter Notebook Workflows​

Validation: Don't Trust, Verify​

Part 5: Cost, Privacy, and Practical Decisions​

Cost Math​

Privacy Considerations​

MCP: The Plumbing That Connects AI to Your Tools​

Part 6: Decision Flowchart​

Appendix: Quick Start (5 Minutes)​

macOS / Linux​

Windows​

Want agentic coding?​

Appendix: Recommended Reading​

Part 1: The Honest State of Things

What "AI Coding" Actually Means in 2026

Frontier Models vs. Local Models

The gap is not subtle

Why local still matters

Quantization: The 30-Second Version

Part 2: Local Serving — How to Run Models on Your Machine

Option 1: Ollama (Recommended Starting Point)

Option 2: LM Studio

Option 3: llama.cpp Server (Direct)

Option 4: MLX (Apple Silicon Only)

Option 5: vLLM (Linux + NVIDIA GPU)

Which Models to Pull

Part 3: The Tools — From Autocomplete to Agents

Tier 1: Autocomplete and Chat in Your Editor

Continue.dev (VS Code / JetBrains)

GitHub Copilot

Tier 2: Agentic Coding in VS Code

Kilo Code

Tier 3: Full Agentic Tools

Claude Code

Other Agentic Tools (Brief Mentions)

Summary: What to Use When

Tier 1: Autocomplete and Chat in Your Editor

Continue.dev (VS Code / JetBrains)

GitHub Copilot

Tier 2: Interactive Coding Assistants (Chat + Edit)

Aider

Cline (VS Code Extension)

Tier 3: Full Agentic Tools

Claude Code

Other Agentic Tools (Brief Mentions)

Summary: What to Use When

Part 4: Practical Workflow for Researchers

Start With Chat, Not Agents

The Spec-First Approach

Context Management: The Local Model Bottleneck

Jupyter Notebook Workflows

Validation: Don't Trust, Verify

Part 5: Cost, Privacy, and Practical Decisions

Cost Math

Privacy Considerations

MCP: The Plumbing That Connects AI to Your Tools

Part 6: Decision Flowchart

Appendix: Quick Start (5 Minutes)

macOS / Linux

Windows

Want agentic coding?

Appendix: Recommended Reading