90 - AI Agentic Coding Tools for Researchers/Engineers
A practical guide for ML researchers and various engineers who write code but aren't software developers by trade.
Target audience: TalTech researchers working in ML, IoT, signal processing, embedded systems. Comfortable with Python, Jupyter, maybe C/C++ for microcontrollers, possibly MATLAB. Not doing enterprise software development (I'm doing that).
Goal: Get you productively using AI coding tools — today, on your own hardware or via API — without drowning in hype.
Part 1: The Honest State of Things
What "AI Coding" Actually Means in 2026
There's a spectrum of how AI can help you write code, from least to most autonomous:
- Autocomplete — predicts the next few tokens/lines as you type. Like aggressive IntelliSense. Low risk, high convenience.
- Chat — you paste code, ask questions, get explanations or rewrites. Copy-paste workflow. You're in full control.
- Inline edit — you select code in your editor, describe a change, the model rewrites that block. You review a diff.
- Agentic — you describe a task, the AI reads your files, writes code, runs commands, reads errors, fixes them, iterates. Semi-autonomous.
Most of the hype is about level 4. Most of the practical value for researchers is at levels 2–3. Be honest with yourself about where you are.
Frontier Models vs. Local Models
This is the single most important thing to understand before choosing tools.
Frontier models (Claude Opus, GPT, Gemini) are served from datacenters with hundreds of GPUs. They are trained on enormous datasets, heavily post-trained for instruction following and tool use, and have context windows of 256k–1M+ tokens. Ca 24TB gpu ram for training, ca 1tb ram for inference.
Local models run on your machine. Your hardware is the ceiling. With 128 GB of unified memory (top-end Mac laptop) you can run roughly a 70B parameter model at Q4 quantization. With a single NVIDIA GPU (24 GB VRAM), you're looking at 7B–14B models at reasonable quality, or 32B with aggressive quantization and partial CPU offload.
The gap is not subtle
| Capability | Frontier (Opus/GPT/Gemini) | Local 32B (Qwen 3.5 Coder) | Local 7B |
|---|---|---|---|
| Autocomplete | Overkill | Good | Good |
| Code chat / Q&A | Excellent | Good | Decent |
| Single-file generation | Excellent | Good | Passable |
| Multi-file editing | Excellent | Fragile | Unreliable |
| Agentic (plan → execute → debug) | Works | Breaks after 3–5 steps | Doesn't work |
| Context window | 256k–1M tokens | 32k-128k tokens | 8–32k tokens |
| Tool use / function calling | Reliable | Inconsistent | Poor |
| Error self-correction | Yes, iterates well | Sometimes | Rarely |
The practical takeaway: Use local models for autocomplete, chat, and single-shot code generation. Use frontier APIs for anything that requires planning, multi-step execution, or understanding a large codebase. This is not a temporary gap — it's physics. Frontier models are 10–50x larger and trained with 100–1000x more compute.
Why local still matters
- Privacy. Your unpublished research, datasets, proprietary hardware designs, grant proposals - nothing leaves your machine.
- Latency. For autocomplete, local inference at 30–50 tokens/sec feels instant. API round-trips add 200–500ms minimum.
- Cost. No per-token billing. Run it 24/7 if you want.
- Offline. Works on a plane, in a Faraday cage, in a server room with no internet.
- Learning. You're ML researchers. Running models locally teaches you about inference, quantization, memory bandwidth, KV cache — things you should understand anyway.
Quantization: The 30-Second Version
You already know what floating point precision means. Quantization is storing model weights at reduced precision to fit in less memory.
- FP16 / BF16 — full precision. A 70B model needs ~140 GB. You can't run this locally unless you have very exotic hardware.
- Q8 (8-bit) — ~70 GB for 70B. Minimal quality loss. Fits on 128 GB unified memory machines with room for KV cache.
- Q4_K_M (4-bit, k-quant medium) — ~40 GB for 70B. Noticeable but acceptable quality loss. The sweet spot for most local setups.
- Q3 and below — quality drops fast. Not recommended for coding tasks.
The format you'll see most often is GGUF — a single-file format used by llama.cpp and everything built on it (ollama, LM Studio, etc.).
MLX format for Apple Silicon. Most developers are on macs.
Part 2: Local Serving — How to Run Models on Your Machine
Option 1: Ollama (Recommended Starting Point)
Platforms: macOS, Linux, Windows
What it is: A CLI tool that downloads and serves LLMs locally. Think of it as "Docker for language models."
Models to test: qwen-coder-30b-a3bm, qwen3-coder-next, qwen3.5:27b, gpt-oss-20b
Install:
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Windows — download installer from https://ollama.ai
Usage:
# https://ollama.com/library/qwen3.5/tags
# Pull a model
ollama pull qwen3.5:27b-q8_0
# Chat directly
ollama run qwen3.5:27b-q8_0
# Start as a server (for other tools to connect to)
ollama serve
# Server runs on http://localhost:11434
# OpenAI-compatible API at http://localhost:11434/v1
# https://docs.ollama.com/integrations/claude-code
Why ollama: Zero configuration. Model management is trivial. Exposes an OpenAI-compatible API, so almost every tool can connect to it. Cross-platform. Has also Anthropic API.
Limitations: Less control over inference parameters. No tensor parallelism across multiple GPUs (uses llama.cpp under the hood, which does support multi-GPU, but ollama's control over it is limited).
Option 2: LM Studio
Platforms: macOS, Linux, Windows
What it is: A desktop GUI application for downloading and running LLMs locally.
Good for researchers who prefer a graphical interface. Browse models from Hugging Face, download with one click, chat in a built-in UI, or enable a local server that's API-compatible with OpenAI's format (and Anthropic also now).
When to use over ollama: When you want a visual model browser, or you're on Windows and don't want to touch the terminal.
Option 3: llama.cpp Server (Direct)
Platforms: macOS, Linux, Windows
What it is: The C/C++ inference engine that powers ollama and LM Studio, used directly.
# Build from source (you'll want GPU support)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # or -DGGML_METAL=ON for Mac
cmake --build build --config Release
# Run server
./build/bin/llama-server -m model.gguf -c 32768 --port 8080
When to use: When you need fine control over context length, batch size, thread count, GPU layer offloading. When you're benchmarking inference performance. When ollama's defaults aren't cutting it.
Option 4: MLX (Apple Silicon Only)
Platforms: macOS (Apple Silicon only)
What it is: Apple's ML framework, optimized for Metal / unified memory.
pip install mlx-lm
# Run a model
mlx_lm.server --model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit
Why: Faster than llama.cpp on Apple Silicon for some models. Native Metal acceleration without translation layers.
Limitations: Mac only. Smaller model ecosystem (need MLX-format weights, though the community converts most popular models).
Option 5: vLLM (Linux + NVIDIA GPU)
Platforms: Linux (NVIDIA GPUs)
What it is: High-throughput inference server. Production-grade.
pip install vllm
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct --dtype auto --max-model-len 32768
When to use: If you have a Linux workstation or server with NVIDIA GPUs and want the best throughput. Supports tensor parallelism across multiple GPUs natively. Overkill for single-user desktop use, but many of you have lab servers.
Which Models to Pull
For coding tasks specifically, as of early 2026:
32GB - 48GB+ RAM/VRAM (Workstation/High-End):
- Qwen 3.5 30B/70B variants (30B-300B): Specifically high-parameter models and mixture-of-experts (MoE) for complex agentic workflows.
- GLM-5 (Reasoning): A top open-source choice that matches proprietary models in coding tasks.
- Kimi-K2 Thinking (64GB+ RAM): Strong reasoning and tool usage for complex projects.
Start with ollama pull qwen3.5:27b-q8_0 if your machine can handle it. Drop to qwen3.5:27b if not. These are the best bang-for-buck local coding models right now.
The local model landscape moves fast. By the time you read this, there may be better options. Check benchmarks, don't trust marketing. Test models for your specific use case. Develop your own benchmarks for your specific workflows.
Part 3: The Tools — From Autocomplete to Agents
Tier 1: Autocomplete and Chat in Your Editor
The lowest-friction entry point. No workflow change required — just install an extension.
Continue.dev (VS Code / JetBrains)
Open source. Supports local models via ollama.
Setup:
- Install the Continue extension in VS Code
- Open Continue settings (sidebar → gear icon)
- Configure to use your local ollama instance:
{
"models": [
{
"title": "Qwen 2.5 Coder 32B",
"provider": "ollama",
"model": "qwen2.5-coder:32b"
}
],
"tabAutocompleteModel": {
"title": "Qwen Coder 7B",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}
What you get:
- Tab autocomplete (use a smaller model — 7B is fine — for speed)
- Chat sidebar (use the 32B model for quality)
- Inline editing (select code → Ctrl+I → describe change)
- Can reference files with
@filenamein chat
Why start here: No agentic risk. The model suggests, you accept or reject. You learn what AI can and can't do before giving it more autonomy.
GitHub Copilot
Proprietary, cloud-based (not local). Subscription-based (~$10/month individual, free for verified academics in some programs).
Excellent autocomplete. Chat features improving. Works in VS Code, JetBrains, Neovim, and importantly JupyterLab.
When to use over Continue.dev: If you primarily work in Jupyter notebooks and want inline suggestions there. If you don't care about local and just want it to work.
Tier 2: Agentic Coding in VS Code
Kilo Code
Open source VS Code extension. The most practical way to get agentic coding in your editor — it can read your project files, create and edit code, run terminal commands, read output, and iterate on errors.
Install: Search for "Kilo Code" in the VS Code extensions marketplace and install it.
Setup with local models:
- Open Kilo Code settings
- Add an OpenAI-compatible API provider pointing at your ollama instance:
- API base:
http://localhost:11434/v1 - Model:
qwen2.5-coder:32b
- API base:
- No API key needed for ollama (enter any placeholder)
Setup with frontier models:
- Add Anthropic as a provider with your API key
- Select Claude Sonnet or Opus as the model
- Or add OpenAI, Google, etc. — Kilo Code supports all major providers. You can even mix local and frontier models for different tasks. And use EU based providers (Azure Ai Foundry).
Key concepts:
- Modes — Kilo Code has multiple operating modes: Code (write/edit files), Architect (plan before coding), Ask (Q&A without file changes), Debug (analyze errors). Start with Ask mode to build trust.
- Context — it reads files in your workspace automatically. You can
@mentionspecific files to focus attention. - Approval workflow — every file change, terminal command, or API call requires your explicit approval before execution. You always see what it wants to do before it does it.
- Diff view — proposed edits are shown as diffs. Accept, reject, or ask for changes.
- MCP support — can connect to MCP servers for database access, file operations, custom tools.
Why Kilo Code for researchers: It lives inside VS Code where you already work (or in JetBrains stuff). The approval workflow means the AI never does anything without your explicit consent. Works with both local models (via ollama, lm studio) and frontier APIs — switch between them freely depending on task complexity. The Ask mode is a great way to start: get explanations and suggestions without the AI touching your files.
Tier 3: Full Agentic Tools
These tools take a description of what you want, then autonomously write code, run it, read errors, fix them, and iterate. They require frontier models to work reliably.
Claude Code
Anthropic's CLI agent. As of early 2026, the most capable agentic coding tool available.
Install:
npm install -g @anthropic-ai/claude-code
Usage:
cd your-project
claude
# Then describe what you want in natural language
What it does:
- Reads your project files
- Writes and modifies code across multiple files
- Runs commands (tests, builds, linters)
- Reads output and errors
- Self-corrects and iterates
- Uses Sonnet by default, can use Opus for harder tasks
Limitations: Requires an Anthropic API key. Not local. Costs money per use (Sonnet is ~$3/$15 per million input/output tokens). But for complex multi-file tasks, it will save you hours.
The honest pitch: If you have a task that would take you 2 hours of Python wrangling — "parse these 50 CSV files from different instruments, normalize timestamps, merge them, generate summary statistics and plots" — Claude Code can often do it in 5 minutes. The API cost for that is maybe €0.50. Your time is worth more.
Other Agentic Tools (Brief Mentions)
- Kilo Code — built-in agentic mode. Works with local models or frontier APIs. Approval workflow means it never does anything without your consent. Plugins and CLI. Their own LLM gateway if needed. Provider agnostic.
- Pi — open source minimql agentic framework. Research-grade, rough edges. Writes itself.
- Cursor — VS Code fork with deep AI integration. Proprietary, subscription-based. Good, but locks you into their ecosystem.
- Windsurf (Codeium) — Similar to Cursor. Another proprietary IDE fork.
- OpenHands (formerly OpenDevin) — open source agentic framework. Research-grade, rough edges.
- SWE-agent — designed for automated bug fixing. Academic project (Princeton). Interesting if you're researching AI agents themselves.
Summary: What to Use When
| Task | Tool | Model | Local? |
|---|---|---|---|
| Autocomplete while typing | Continue.dev | Qwen 2.5 Coder 7B via ollama | Yes |
| "Explain this code" / Q&A | Continue.dev chat or Kilo Code (Ask mode) | Qwen 2.5 Coder 32B via ollama | Yes |
| "Rewrite this function" | Kilo Code (Code mode) or Continue inline edit | Qwen 2.5 Coder 32B via ollama | Yes |
| Small scripts from scratch | Kilo Code | Local 32B or Sonnet API | Either |
| Multi-file project work | Kilo Code or Claude Code + Sonnet/Opus API | Sonnet/Opus API | No |
| "Build me a data pipeline" | Claude Code | Sonnet/Opus API | No |
| Jupyter notebook help | GitHub Copilot or chat + paste | Varies | Varies |
Tier 1: Autocomplete and Chat in Your Editor
The lowest-friction entry point. No workflow change required — just install an extension.
Continue.dev (VS Code / JetBrains)
Open source. Supports local models via ollama.
Setup:
- Install the Continue extension in VS Code
- Open Continue settings (sidebar → gear icon)
- Configure to use your local ollama instance:
{
"models": [
{
"title": "Qwen 2.5 Coder 32B",
"provider": "ollama",
"model": "qwen2.5-coder:32b"
}
],
"tabAutocompleteModel": {
"title": "Qwen Coder 7B",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}
What you get:
- Tab autocomplete (use a smaller model — 7B is fine — for speed)
- Chat sidebar (use the 32B model for quality)
- Inline editing (select code → Ctrl+I → describe change)
- Can reference files with
@filenamein chat
Why start here: No agentic risk. The model suggests, you accept or reject. You learn what AI can and can't do before giving it more autonomy.
GitHub Copilot
Proprietary, cloud-based (not local). Subscription-based (~$10/month individual, free for verified academics in some programs).
Excellent autocomplete. Chat features improving. Works in VS Code, JetBrains, Neovim, and importantly JupyterLab.
When to use over Continue.dev: If you primarily work in Jupyter notebooks and want inline suggestions there. If you don't care about local and just want it to work.
Tier 2: Interactive Coding Assistants (Chat + Edit)
Aider
Open source. Terminal-based. The best tool for using local models in an agentic-ish way.
Install:
pip install aider-chat
Usage with local models:
# With ollama
aider --model ollama/qwen2.5-coder:32b
# With any OpenAI-compatible local server
aider --model openai/qwen2.5-coder:32b --openai-api-base http://localhost:11434/v1
Usage with frontier models (when you need the big guns):
# Claude Sonnet (needs ANTHROPIC_API_KEY env var)
aider --model sonnet
# Claude Opus
aider --model opus
Key concepts:
- Aider watches your git repo. Every AI edit is a git commit. You can always
git difforgit revert. /add filename— add files to the AI's context (it can only see files you add)/ask— ask a question without letting the AI edit anything/code— let the AI propose edits/diff— review what changed/undo— revert the last AI edit
Why aider for researchers: Git integration means you always have a safety net. The /ask mode lets you use it as a pure Q&A tool before trusting it with edits. Works with both local and frontier models — you can switch mid-session.
Cline (VS Code Extension)
Open source VS Code extension. More autonomous than Continue — it can create files, run terminal commands, read output, and iterate.
Can connect to local models via ollama, but the quality of agentic behavior degrades significantly with local models. Best used with frontier APIs.
When to use: When you want agentic behavior inside VS Code rather than in a terminal.
Tier 3: Full Agentic Tools
These tools take a description of what you want, then autonomously write code, run it, read errors, fix them, and iterate. They require frontier models to work reliably.
Claude Code
Anthropic's CLI agent. As of early 2026, the most capable agentic coding tool available.
Install:
npm install -g @anthropic-ai/claude-code
Usage:
cd your-project
claude
# Then describe what you want in natural language
What it does:
- Reads your project files
- Writes and modifies code across multiple files
- Runs commands (tests, builds, linters)
- Reads output and errors
- Self-corrects and iterates
- Uses Sonnet by default, can use Opus for harder tasks
Limitations: Requires an Anthropic API key. Not local. Costs money per use (Sonnet is ~$3/$15 per million input/output tokens). But for complex multi-file tasks, it will save you hours.
The honest pitch: If you have a task that would take you 2 hours of Python wrangling — "parse these 50 CSV files from different instruments, normalize timestamps, merge them, generate summary statistics and plots" — Claude Code can often do it in 5 minutes. The API cost for that is maybe €0.50. Your time is worth more.
Other Agentic Tools (Brief Mentions)
- Cursor — VS Code fork with deep AI integration. Proprietary, subscription-based. Good, but locks you into their ecosystem.
- Windsurf (Codeium) — Similar to Cursor. Another proprietary IDE fork.
- OpenHands (formerly OpenDevin) — open source agentic framework. Research-grade, rough edges.
- SWE-agent — designed for automated bug fixing. Academic project (Princeton). Interesting if you're researching AI agents themselves.
Summary: What to Use When
| Task | Tool | Model | Local? |
|---|---|---|---|
| Autocomplete while typing | Continue.dev | Qwen 2.5 Coder 7B via ollama | Yes |
| "Explain this code" / Q&A | Continue.dev chat or Aider /ask | Qwen 2.5 Coder 32B via ollama | Yes |
| "Rewrite this function" | Aider /code or Continue inline edit | Qwen 2.5 Coder 32B via ollama | Yes |
| Small scripts from scratch | Aider | Local 32B or Sonnet API | Either |
| Multi-file project work | Aider or Claude Code | Sonnet/Opus API | No |
| "Build me a data pipeline" | Claude Code | Sonnet/Opus API | No |
| Jupyter notebook help | GitHub Copilot or chat + paste | Varies | Varies |
Part 4: Practical Workflow for Researchers
You're not building SaaS products. You're writing data processing scripts, analysis pipelines, experiment automation, firmware, visualization code. Your workflow is different from a software developer's. Here's how to adapt.
Start With Chat, Not Agents
Don't jump to "let the AI write my whole project." Start by using it as a conversation partner:
- Paste a function, ask "what does this do?"
- Paste an error traceback, ask "why is this happening?"
- Describe what you want, ask "what library should I use for this?"
- Paste your data format, ask "write a parser for this"
This works well even with local models. You stay in control. You learn what the AI is good at and where it hallucinates.
The Spec-First Approach
When you do want the AI to write something substantial, write a specification first. Plain text, markdown, whatever. Describe:
- What the inputs are (file formats, data types, sources)
- What the outputs should be (file formats, plots, reports)
- What the processing steps are (in your domain language)
- What constraints exist (memory, time, hardware)
Example:
## Task: IMU Data Processor
**Input:** CSV files from MPU6050 IMU sensor. Columns: timestamp_ms, accel_x, accel_y,
accel_z, gyro_x, gyro_y, gyro_z. Sample rate ~100 Hz but not perfectly regular.
**Processing:**
1. Resample to exactly 100 Hz using linear interpolation
2. Apply complementary filter (alpha=0.98) to get roll and pitch
3. Detect "impact events" where total acceleration exceeds 3g
4. For each impact event, extract ±500ms window
**Output:**
- Cleaned CSV with added roll/pitch columns
- Matplotlib plot showing acceleration magnitude over time with impact events marked
- JSON summary: number of impacts, timestamps, peak magnitudes
**Constraints:** Must handle files up to 2 GB. Single-threaded is fine.
Feed this to the AI. This works dramatically better than "write me a script that processes IMU data." The spec IS the prompt.
Context Management: The Local Model Bottleneck
Frontier models can ingest 128k+ tokens — your entire project, all your data schemas, your README, everything. Local models with 32k context windows (and often degrading quality past 8–16k) need help.
Strategies:
- Only add the files the AI needs to see. In Kilo Code,
@mentionspecific files to focus the model's attention rather than letting it scan your entire workspace. - Keep files small. Split 2000-line monolithic scripts into modules. Good practice anyway.
- Be explicit. Don't say "use the config from the other file." Paste the relevant config values directly.
- For chat/Q&A: paste only the relevant function, not the whole file.