What is Applied Scientist

Applied Scientist is an autonomous agent that runs inside Jupyter. You hand it your baseline notebook and a research source (PDF, web URL, Kaggle link, GitHub/GitLab/Bitbucket repo, or a plain-text idea), and it does the work a researcher would do by hand: read the source, run your baseline, implement the new method, and produce a structured comparison. You supply the inputs, launch the run, and read the result.
Figure: Applied Scientist running inside Jupyter, a six-phase pipeline from setup to verdict.

How It Works

A run moves through six fixed phases. Each phase has one job and hands off to the next.

Phase 0 — Setup

Creates an isolated workspace and copies your notebook, data, and research source into it. The original files are never touched.

Phase 1 — Analyze Current

Reads your baseline notebook and documents the model, preprocessing, hyperparameters, and the metrics it reports.

Phase 2 — Research

Digests the research source: what the method does, what it improves, its requirements, and whether it’s compatible with your data.

Phase 3 — Benchmark

Locks in the metrics and baseline values that both sides will be measured on. Missing baseline metrics are flagged so the new run computes them too.

Phase 4 — Implement

Writes a new notebook implementing the method, using the same data, split, and seed as the baseline. Runs it end-to-end.

Phase 5 — Evaluate

Compares both runs and issues a verdict — BETTER, WORSE, INCONCLUSIVE, or FAILED — with concrete reasoning recorded to disk.
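
Phase 4 reuses the baseline's data, split, and seed because a comparison is only fair when both models are scored on the same held-out rows. The sketch below illustrates that idea with the standard library only. `seeded_split` is a hypothetical helper, not part of Upsonic, and how the agent actually reuses the baseline split is not documented here.

```python
import random

def seeded_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed, then split into train/test."""
    shuffled = list(rows)
    # A private Random instance keeps the global RNG state untouched.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for dataset row ids

# Baseline and new method both split with the same seed.
base_train, base_test = seeded_split(rows, seed=42)
new_train, new_test = seeded_split(rows, seed=42)

# Same input order + same seed -> identical held-out rows.
same_split = base_test == new_test
```

If either side changed the seed or the row order, the two test sets would differ and any metric gap could come from the split rather than the method.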

Cursor & Claude Code vs Upsonic Prebuilt Autonomous Agents

A question we hear a lot: why use this instead of just doing the same thing in Cursor or Claude Code? The short answer is that those are general coding copilots, and Applied Scientist is a purpose-built experiment runner. The table below shows where the two approaches diverge.
| Dimension | Cursor & Claude Code | Upsonic Applied Scientist |
| --- | --- | --- |
| Workspace | Runs in your working repo, shared with your editor | Fully isolated workspace folder per experiment |
| Output | Free-form chat and file edits | Structured ExperimentResult (verdict, comparison table, metrics) |
| Workflow | Assembled case by case in the chat | Pre-tested, well-designed pipeline |
| Environment | Outside the notebook | Runs directly inside Jupyter |
| Progress tracking | Scroll through chat transcript to guess where it is | Live progress bar driven by progress.json, plus last_logs(n) timeline |

Install

!pip install upsonic
import os
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."
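
Hardcoding a key in a notebook cell makes it easy to leak when the notebook is shared. One alternative, assuming the key is read from the ANTHROPIC_API_KEY environment variable as in the snippet above, is to prompt for it only when it is missing. `ensure_api_key` is a hypothetical helper, not an Upsonic function.

```python
import os
from getpass import getpass

def ensure_api_key(name="ANTHROPIC_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting once if it is unset."""
    key = os.environ.get(name)
    if not key:
        # getpass hides the typed value, so it never lands in cell output.
        key = prompt(f"{name}: ").strip()
        if not key:
            raise ValueError(f"{name} must not be empty")
        os.environ[name] = key
    return key
```

Calling `ensure_api_key()` in the first cell leaves an already exported key untouched and asks for one otherwise.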

Requirements

You need only two inputs: a baseline notebook on disk and a research source.

Baseline notebook

A working .ipynb that trains your baseline model end-to-end. This is the reference every comparison is made against.

Research source

Anything describing the method to try: PDF, Markdown/HTML, web URL, arXiv link, GitHub/GitLab/Bitbucket repo, Kaggle notebook or dataset page, or a free-form idea as plain text.
current_data is optional. Omit it and the agent reads your notebook to find the data-loading cells itself.
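
Every comparison depends on the baseline notebook, so it can be worth confirming that the file parses before launching a run. The sketch below relies only on the fact that an .ipynb file is JSON with a top-level cells list. `count_code_cells` is a hypothetical helper, and Upsonic does not require this check.

```python
import json
import tempfile
from pathlib import Path

def count_code_cells(notebook_path):
    """Parse a .ipynb file and return how many code cells it contains."""
    path = Path(notebook_path)
    if path.suffix != ".ipynb":
        raise ValueError(f"expected a .ipynb file, got {path.name!r}")
    notebook = json.loads(path.read_text(encoding="utf-8"))
    return sum(
        1 for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    )

# Demo on a throwaway notebook so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "baseline.ipynb"
    demo.write_text(json.dumps({
        "cells": [
            {"cell_type": "code", "source": [], "metadata": {}, "outputs": []},
            {"cell_type": "markdown", "source": [], "metadata": {}},
        ],
        "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
    }), encoding="utf-8")
    n_code_cells = count_code_cells(demo)
```

A result of zero, or a JSON decoding error, is a sign the path points at the wrong file.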

Running an Experiment

The example below is the demo shipped with Upsonic: a Random Forest baseline for telco customer churn, benchmarked against a Kaggle notebook that uses SMOTE + XGBoost to handle class imbalance.

1. Create the agent

from upsonic.prebuilt import AppliedScientist

scientist = AppliedScientist(
    model="anthropic/claude-haiku-4-5",
    workspace="./autonomous_workspace",
)
workspace is the root directory the agent is allowed to work in. Every experiment lives in its own folder inside it.

2. Define the experiment

experiment = scientist.new_experiment(
    "smote_xgboost_churn",
    research_source="https://www.kaggle.com/code/ragilhadip/churn-prediction-handilng-imbalance-using-smote",
    current_notebook="telco_churn/Baseline_RandomForest_Churn.ipynb",
    current_data="telco_churn/WA_Fn-UseC_-Telco-Customer-Churn.csv",
)
research_source is polymorphic — pass any of these and the agent figures out how to materialize it:
  • Local files — PDF, Markdown, HTML, .ipynb, plain text
  • Web URLs — blog posts, arXiv pages, documentation
  • Code hosts — GitHub, GitLab, or Bitbucket repository URLs
  • Kaggle — notebook or dataset pages
  • Free-form idea — a plain string describing what to try
| Parameter | Purpose |
| --- | --- |
| name (positional) | Folder name and registry key |
| research_source | Anything from the list above |
| current_notebook | Path to your baseline notebook |
| current_data | Optional. Data path or a short loader description. Inferred from the notebook when omitted. |
| experiments_directory | Optional. Defaults to ./experiments inside the workspace. |
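
To make the accepted forms of research_source concrete, here is one way such inputs could be told apart. This is an illustrative sketch, not Upsonic's actual dispatch logic: `classify_source`, its category labels, the host list, and the suffix list are all assumptions.

```python
from pathlib import Path
from urllib.parse import urlparse

CODE_HOSTS = {"github.com", "gitlab.com", "bitbucket.org"}
LOCAL_SUFFIXES = {".pdf", ".md", ".html", ".ipynb", ".txt"}

def classify_source(source):
    """Guess which kind of research source a string refers to (illustrative)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host == "kaggle.com":
            return "kaggle"
        if host in CODE_HOSTS:
            return "code_host"
        return "web_url"
    # Not a URL: treat known file extensions as local files.
    if Path(source).suffix.lower() in LOCAL_SUFFIXES:
        return "local_file"
    # Anything else is read as a plain-text idea.
    return "free_form_idea"
```

Under these assumptions, the Kaggle URL from the demo classifies as "kaggle", while a string such as "Try SMOTE oversampling before XGBoost" falls through to "free_form_idea".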

3. Run and watch

run_in_background() starts the run in a daemon thread and returns immediately.
experiment.run_in_background()
scientist.progress_bar_live(experiment, interval=5)
Figure: live progress bar updating phase by phase as the experiment runs.
State is exposed on the experiment object at any time:
| Attribute | What it tells you |
| --- | --- |
| experiment.is_running | True while the thread is alive |
| experiment.is_done | True once finished (success or error) |
| experiment.error | The exception if the run raised, else None |
To see the last few things the agent actually did:
experiment.last_logs(5)
Figure: last_logs(5) rendering the most recent phase entries with their structured payloads.
Interrupt the kernel to stop watching without cancelling the run. Call experiment.stop() to cooperatively cancel.
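
The run controls above (start in a daemon thread, poll state, cancel cooperatively) follow a common threading pattern. The sketch below reproduces that pattern with the standard library so the semantics are easy to see. `BackgroundRun` is a hypothetical stand-in, and Upsonic's internal implementation may differ.

```python
import threading
import time

class BackgroundRun:
    """Hypothetical stand-in that mirrors the run-control attributes."""

    def __init__(self, steps):
        self._steps = steps
        self._stop = threading.Event()
        self._thread = None
        self.completed = []
        self.error = None

    def run_in_background(self):
        # daemon=True: the thread never blocks interpreter shutdown.
        self._thread = threading.Thread(target=self._work, daemon=True)
        self._thread.start()

    def _work(self):
        try:
            for step in self._steps:
                if self._stop.is_set():  # cooperative: checked between steps
                    return
                step()
                self.completed.append(step.__name__)
        except Exception as exc:         # keep the failure for the caller
            self.error = exc

    @property
    def is_running(self):
        return self._thread is not None and self._thread.is_alive()

    @property
    def is_done(self):
        return self._thread is not None and not self._thread.is_alive()

    def stop(self):
        self._stop.set()

    def wait(self):
        self._thread.join()
        if self.error is not None:
            raise self.error             # surface the worker's exception
        return self.completed

def setup():
    time.sleep(0.01)

def analyze():
    time.sleep(0.01)

run = BackgroundRun([setup, analyze])
run.run_in_background()   # returns immediately
finished = run.wait()     # blocks until both steps are done
```

Cooperative cancellation means a stop request takes effect at the next checkpoint, so a step that is already executing finishes first.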

4. Wait for the result

result = experiment.wait()

print(f"VERDICT: {result.verdict}")
print(f"\nSummary: {result.summary}")
print(f"\nExplanation: {result.explanation}")
wait() blocks until the run finishes and re-raises any exception it produced. For this demo run, the code above prints:
VERDICT: BETTER

Summary: XGBoost combined with SMOTE oversampling significantly improves minority class
detection in churn prediction. While overall accuracy decreases slightly (70.4% vs 80.3%),
the model achieves substantially higher recall for churned customers (85.6% vs 52.1%),
successfully catching more customers at risk of leaving. The F1 score improved from 0.5847
to 0.6055, indicating better balanced performance on the minority class. This trade-off is
favorable for churn prediction where identifying at-risk customers for retention campaigns
is more valuable than overall accuracy.

Explanation: The verdict is BETTER because: (1) Recall improved by +32.2 percentage points
(0.5214 → 0.8556), catching 85.6% of churners vs. only 52.1% before, reducing missed
opportunities for retention by ~60%. (2) F1-score improved by +3.5% (0.5847 → 0.6055),
showing better minority class balance. (3) While accuracy dropped 10.1 percentage points
(expected with SMOTE), the business impact is positive: preventing customer churn is more
valuable than reducing false positives. (4) SMOTE successfully balanced the 2.77:1 class
imbalance to 1:1, and XGBoost's gradient boosting effectively learned improved decision
boundaries.
| Attribute | Value |
| --- | --- |
| result.verdict | 'BETTER', 'WORSE', 'INCONCLUSIVE', or 'FAILED' |
| result.summary | What the new method is and how it differs from the baseline |
| result.explanation | Why this verdict was reached, referencing concrete numbers |

5. Inspect the comparison

result.table is a list of metric dicts. Drop it into a DataFrame to see the side-by-side:
import pandas as pd
pd.DataFrame(result.table)
Figure: result.table rendered as a pandas DataFrame.
Each row contains:
| Field | Meaning |
| --- | --- |
| name | Metric name (e.g. accuracy, f1, auroc) |
| current / new | Baseline and new-method values |
| diff / diff_display | Raw difference and a human-friendly version |
| unit | Unit of the metric |
| higher_is_better | Whether larger is better |
| better | Which side won on this metric (current or new) |
Plotting the table makes the trade-off obvious. In this run, the new method gives up about ten points of overall accuracy for a gain of more than thirty points in churn recall:
Figure: bar chart comparing the Random Forest baseline against SMOTE + XGBoost.
Need the raw artifacts? result.record exposes log.json, progress.json, and registry metadata for the run.
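
To show how the row fields relate, the sketch below rebuilds three rows from the numbers reported in the demo output and derives diff and better from current, new, and higher_is_better. The rule used (diff = new - current) is an assumption about how the fields are computed, and `summarize_row` is a hypothetical helper. Treat the values in result.table as authoritative.

```python
def summarize_row(row):
    """Derive diff and winner for one metric row (assumed: diff = new - current)."""
    diff = row["new"] - row["current"]
    if diff == 0:
        better = None  # tie: neither side wins
    else:
        improved = diff > 0 if row["higher_is_better"] else diff < 0
        better = "new" if improved else "current"
    return {**row, "diff": round(diff, 4), "better": better}

# Values taken from the demo output above; accuracy is written as a
# fraction of the reported percentages (80.3% and 70.4%).
rows = [
    {"name": "accuracy", "current": 0.803, "new": 0.704, "higher_is_better": True},
    {"name": "recall", "current": 0.5214, "new": 0.8556, "higher_is_better": True},
    {"name": "f1", "current": 0.5847, "new": 0.6055, "higher_is_better": True},
]

table = [summarize_row(r) for r in rows]
winners = {r["name"]: r["better"] for r in table}
```

Here the baseline wins on accuracy while the new method wins on recall and F1, which is the trade-off the verdict explanation weighs.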

Managing Experiments

Every experiment is recorded in experiments.json. The registry is re-read from disk on every call, so it always reflects current state.
scientist.list_experiments()                      # newest first
scientist.list_experiments(status="completed")    # 'in_progress' | 'completed' | 'failed'

exp = scientist.experiments["smote_xgboost_churn"]
exp.phases   # normalised phase list
exp.log      # parsed log.json
Figure: list_experiments output showing date, name, status, verdict, and new vs baseline.
Each registry entry is a dict with name, date, status, verdict, baseline_model, new_method, paper, and path.
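
Because each registry entry is a plain dict, filtering and sorting work with ordinary Python. The sketch below mimics the status filter and newest-first ordering on hand-written sample entries. The entries, the ISO date format, and `filter_entries` are assumptions for illustration, and the on-disk layout of experiments.json is not specified here.

```python
def filter_entries(entries, status=None):
    """Return entries newest first, optionally restricted to one status."""
    selected = [e for e in entries if status is None or e["status"] == status]
    # ISO-8601 date strings sort chronologically as plain text.
    return sorted(selected, key=lambda e: e["date"], reverse=True)

# Hypothetical entries using a subset of the documented keys.
entries = [
    {"name": "exp_a", "date": "2025-01-10", "status": "completed", "verdict": "BETTER"},
    {"name": "exp_b", "date": "2025-02-03", "status": "failed", "verdict": "FAILED"},
    {"name": "exp_c", "date": "2025-03-21", "status": "completed", "verdict": "WORSE"},
]

completed_names = [e["name"] for e in filter_entries(entries, status="completed")]
all_names = [e["name"] for e in filter_entries(entries)]
```

The same pattern extends to other keys, for example selecting only entries whose verdict is BETTER.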

API Reference

from upsonic.prebuilt import AppliedScientist

scientist = AppliedScientist(model=..., workspace="./ws")

# Create an experiment
exp = scientist.new_experiment(
    "smote_xgboost_churn",
    research_source=...,     # PDF, URL, repo, Kaggle page, or free-form idea
    current_notebook=...,
    # current_data=...,                      # optional
    # experiments_directory="./experiments"  # optional
)

# Run control
exp.run_in_background()
exp.is_running
exp.is_done
exp.error
exp.stop()
exp.wait()                # blocks, returns ExperimentResult

# Progress
exp.progress_bar
scientist.progress_bar_live(exp, interval=5)
exp.last_logs(5)

# Result
res = exp.result
res.verdict       # 'BETTER' | 'WORSE' | 'INCONCLUSIVE' | 'FAILED'
res.summary
res.explanation
res.table         # list[dict]

# Registry
scientist.list_experiments()
scientist.experiments["smote_xgboost_churn"].phases
scientist.experiments["smote_xgboost_churn"].log
The full demo notebook for this agent lives in the Upsonic repo under prebuilt_autonomous_agents.