
Building SemanticStudio Series

An 8-part deep dive into building a production-ready multi-agent chat platform

SemanticStudio · ETL · Data Engineering · Multi-Agent Systems · Self-Learning

ETL & Agent Creation: Growing Your Multi-Agent System

How SemanticStudio's self-learning ETL pipelines ingest data, build knowledge graphs, and automatically create new domain agents.

January 26, 2026 · 7 min read

TL;DR

  • Self-learning ETL with Plan-Act-Reflect loops
  • Automatic knowledge graph population from data sources
  • New domain agents created automatically when you add data

Most RAG systems make data ingestion your problem.

SemanticStudio makes it a feature: self-learning ETL pipelines that don't just ingest data—they build knowledge graphs and create new agents automatically.

The ETL Vision

Traditional ETL:

Data Source → Transform → Load → Done

SemanticStudio ETL:

Data Source → Understand → Transform → Load → 
  → Populate Vector Store
  → Build Knowledge Graph
  → Create/Link Domain Agent
  → Learn for Next Time

The ETL system is intelligent. It adapts. It improves.

Plan-Act-Reflect (PAR) Loops

At the heart of SemanticStudio's ETL is the PAR pattern—the same self-learning loop I've written about in Self-Learning Pipelines.

[Diagram: the Plan-Act-Reflect (PAR) loop, a self-learning ETL cycle. 1. Plan: analyze the source and determine a strategy. 2. Act: execute the transformation and capture results. 3. Reflect: evaluate the results and learn for next time. The loop learns and improves with each run.]

Phase 1: Plan

Before touching data, the system plans:

  • Schema Analysis: What does this source look like?
  • Issue Prediction: What might go wrong?
  • Strategy Selection: How should we transform this?
  • Resource Estimation: How long will this take?

The planner uses learned patterns from previous runs.
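
As a rough sketch, the planning step could look something like the following. The PlanResult shape, field names, and learned-pattern keys are illustrative assumptions, not SemanticStudio's actual API.

from dataclasses import dataclass

@dataclass
class PlanResult:
    schema: dict              # column name -> inferred type
    predicted_issues: list    # e.g., ["NULLs in required field 'email'"]
    strategy: str             # chosen transformation strategy
    estimated_minutes: float

def plan(source_sample: list, learned_patterns: dict) -> PlanResult:
    # Schema analysis: infer column types from a sample of rows.
    schema = {col: type(val).__name__
              for row in source_sample for col, val in row.items()}
    # Issue prediction: reuse problems recorded by earlier Reflect phases.
    predicted = learned_patterns.get("known_issues", [])
    # Strategy selection: prefer what worked before, else a safe default.
    strategy = learned_patterns.get("best_strategy", "row_by_row")
    # Resource estimation: scale a learned throughput figure to this source.
    minutes = learned_patterns.get("minutes_per_1k_rows", 1.0) * len(source_sample) / 1000
    return PlanResult(schema, predicted, strategy, minutes)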

Phase 2: Act

Execute the plan:

  • Data Extraction: Pull from source
  • Transformation: Clean, normalize, enrich
  • Vector Store Population: Create embeddings, store chunks
  • Knowledge Graph Update: Extract entities, build relationships

Phase 3: Reflect

Evaluate and learn:

  • Quality Assessment: Did it work?
  • Error Analysis: What went wrong?
  • Strategy Update: How should we adapt?
  • Learning Storage: Remember for next time

The reflect phase is what makes the system self-learning.
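
Tying the phases together, one run of the loop might be orchestrated like this minimal sketch. It builds on the plan() sketch above; source, execute, and assess_quality are hypothetical stand-ins for the real source object, executor, and evaluator.

def par_run(source, memory: dict) -> dict:
    # `memory` persists across runs; carrying it forward is what makes
    # the loop self-learning. `source`, `execute`, and `assess_quality`
    # are hypothetical placeholders.
    plan_result = plan(source.sample(), memory)       # Plan
    results = execute(source, plan_result)            # Act
    quality = assess_quality(results)                 # Reflect
    if quality.failure_rate > 0:
        memory.setdefault("known_issues", []).extend(quality.error_patterns)
    memory["best_strategy"] = plan_result.strategy
    memory["minutes_per_1k_rows"] = results.minutes / max(results.rows / 1000, 1)
    return memory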

ETL Jobs Dashboard

[Screenshot: SemanticStudio ETL jobs dashboard showing the job list and knowledge graph statistics]

The dashboard shows:

  • Job List: All ETL jobs with status
  • Knowledge Graph Stats: 344 nodes, 320 edges
  • Nodes by Type: Distribution across entity types
  • Edges by Type: Relationship distribution

Job Status

Status      Meaning
Completed   Successfully finished
Running     Currently executing
Failed      Error occurred
Scheduled   Waiting to run
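
In code, these statuses map naturally onto a small enum. The names below are a sketch, not SemanticStudio's actual data model.

from enum import Enum

class JobStatus(Enum):
    COMPLETED = "completed"    # successfully finished
    RUNNING = "running"        # currently executing
    FAILED = "failed"          # error occurred
    SCHEDULED = "scheduled"    # waiting to run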

Job Actions

[Screenshot: SemanticStudio ETL job action buttons]

For each job:

  • View Details: See configuration and logs
  • Re-run: Execute again
  • Edit: Modify settings
  • Delete: Remove job

Data Sources

SemanticStudio supports multiple source types:

Supported Sources

Source Type   What It Handles
PostgreSQL    Database tables
CSV           Flat files
JSON          Structured documents
REST API      External services
Documents     PDF, DOCX, etc.

Data Sources Dashboard

[Screenshot: SemanticStudio data sources and semantic layer configuration]

The semantic layer maps:

  • Source Table → Entity Type
  • Columns → Entity Properties
  • Domain Agent → Data Access

Adding a Data Source

  1. Select source type (Postgres, CSV, etc.)
  2. Provide connection details
  3. Map entities to the semantic layer
  4. Link to domain agent (or create new)
  5. Run initial ETL
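
Registration for those five steps might be expressed as configuration along these lines. Every field name here is an illustrative assumption rather than SemanticStudio's actual schema.

procurement_source = {
    "type": "postgresql",                        # step 1: source type
    "connection": {                              # step 2: connection details
        "host": "db.internal", "port": 5432,
        "database": "erp", "table": "contracts",
    },
    "semantic_mapping": {                        # step 3: semantic layer
        "entity_type": "Contract",
        "properties": {"vendor_name": "vendor", "end_date": "expires_at"},
    },
    "agent": {"link_to": "Procurement",          # step 4: link or create agent
              "create_if_missing": True},
    "run_on_create": True,                       # step 5: run initial ETL
}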

Database Population

When ETL runs, it populates two stores:

Vector Store (RAG)

  • Chunking: Break documents into segments
  • Embedding: Generate vectors (text-embedding-3-large)
  • Indexing: Store in pgvector for retrieval
  • Metadata: Preserve source info, timestamps
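
A minimal sketch of those four steps, assuming the OpenAI Python SDK, psycopg, and the pgvector adapter; the chunks table and its columns are illustrative, not SemanticStudio's actual schema.

import numpy as np
from openai import OpenAI
from pgvector.psycopg import register_vector

client = OpenAI()

def embed_and_store(conn, doc_id: str, text: str, chunk_size: int = 800):
    register_vector(conn)  # teach psycopg how to send vector values
    # Chunking: naive fixed-size split; real chunkers respect structure.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Embedding: one batched call per document.
    resp = client.embeddings.create(model="text-embedding-3-large", input=chunks)
    with conn.cursor() as cur:
        for chunk, item in zip(chunks, resp.data):
            # Indexing + metadata: store chunk text, vector, and source info.
            cur.execute(
                "INSERT INTO chunks (doc_id, content, embedding, created_at) "
                "VALUES (%s, %s, %s, now())",
                (doc_id, chunk, np.array(item.embedding)),
            )
    conn.commit()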

Knowledge Graph

  • Entity Extraction: Identify entities in content
  • Relationship Inference: Detect connections
  • Graph Population: Add nodes and edges
  • Deduplication: Merge duplicate entities
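
A sketch of the graph side, using an LLM call for extraction. The prompt, the model choice, and the nodes/edges tables are assumptions about how this could be wired up.

import json
from openai import OpenAI

client = OpenAI()

PROMPT = ('Extract entities and relationships from the text as JSON: '
          '{"entities": [{"name": ..., "type": ...}], '
          '"relations": [{"source": ..., "target": ..., "type": ...}]}')

def update_graph(conn, text: str):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": text}],
        response_format={"type": "json_object"},
    )
    graph = json.loads(resp.choices[0].message.content)
    with conn.cursor() as cur:
        for e in graph.get("entities", []):
            # Deduplication: upsert on (name, type) rather than insert blindly.
            cur.execute("INSERT INTO nodes (name, type) VALUES (%s, %s) "
                        "ON CONFLICT (name, type) DO NOTHING",
                        (e["name"], e["type"]))
        for r in graph.get("relations", []):
            cur.execute("INSERT INTO edges (source, target, type) "
                        "VALUES (%s, %s, %s)",
                        (r["source"], r["target"], r["type"]))
    conn.commit()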

Creating New Domain Agents

Here's where it gets powerful: ETL can create agents automatically.

When to Create a New Agent

The system recommends a new agent when:

  • Data doesn't fit existing domains
  • New entity types are discovered
  • The user explicitly requests one
  • Coverage gap is detected

Agent Creation Workflow

  1. Data Analysis: What domain does this data represent?
  2. Agent Proposal: "This looks like a Procurement domain"
  3. System Prompt Generation: Create domain-specific instructions
  4. Data Source Linking: Connect new source to agent
  5. Activation: Enable agent and test
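
In code, the proposal step of that workflow might reduce to something like this. infer_domain and generate_prompt are hypothetical helpers standing in for the real analysis and prompt generation, and the proposal shape is an assumption.

def propose_agent(entity_types: list, existing_agents: list):
    domain = infer_domain(entity_types)          # step 1: data analysis
    if domain in existing_agents:
        return None                              # an existing agent covers it
    return {                                     # step 2: agent proposal
        "name": domain,
        "description": f"Covers {', '.join(entity_types)} data",
        "system_prompt": generate_prompt(domain, entity_types),  # step 3
        "enabled": False,    # steps 4-5: link the source, then activate
    }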

Example: Adding Procurement Data

1. Upload: procurement_contracts.csv

2. ETL Analysis:
   - Entity types: Vendor, Contract, PurchaseOrder
   - Domain signal: Procurement/Purchasing
   - No existing agent covers this

3. Proposal:
   "Create 'Procurement' agent?"
   Description: Vendors, contracts, purchases
   Category: Operations

4. Accept:
   - Agent created with generated system prompt
   - Data source linked
   - ETL completes
   - Agent enabled

5. Ready:
   "What contracts expire this quarter?"
   → Procurement agent responds with data

Expanding Existing Agents

Not every data source needs a new agent. Often, you're adding data to an existing domain.

Linking Additional Sources

  1. Select existing agent
  2. Add new data source
  3. Configure mapping
  4. Run ETL

Multiple Sources per Agent

An agent can have multiple sources:

Customer Intelligence Agent:
  - CRM database (customer records)
  - Support system (ticket history)
  - Survey data (satisfaction scores)
  - Marketing (campaign responses)

All sources are available when the agent responds.

Source Priority

When sources overlap, configure priority:

  • Primary: CRM (authoritative customer data)
  • Secondary: Support (supplemental context)
  • Tertiary: Marketing (additional signals)
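
Expressed as configuration, that might look like the sketch below; the keys are illustrative assumptions.

customer_intelligence_sources = [
    {"source": "crm",       "priority": 1},  # primary: authoritative data
    {"source": "support",   "priority": 2},  # secondary: supplemental context
    {"source": "marketing", "priority": 3},  # tertiary: additional signals
]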

Agent Lifecycle Management

As your system evolves, you need to manage agents:

Enable/Disable Agents

  • Disable agents during data refresh
  • Enable when data is ready
  • Disable unused agents to reduce routing complexity

Maintenance Mode

Set agents to maintenance when:

  • Running large ETL jobs
  • Updating system prompts
  • Validating data quality

Queries route to other agents during maintenance.

Retiring Agents

When an agent is no longer needed:

  1. Disable the agent
  2. Archive or delete data sources
  3. Remove agent from system

Historical data can be preserved even if the agent is removed.

Self-Learning in Action

The PAR loop learns from each run:

Learning: Schema Drift

Run 1: Source has columns [A, B, C]
Run 2: Source now has columns [A, B, C, D]

Reflect: New column D detected
Plan (next run): Include D in transformation

Learning: Data Quality

Run 1: 5% of records failed validation
Reflect: Common issue is NULL in required field

Plan (next run): Add NULL handling for that field
Run 2: 0.5% failure rate (10x improvement)

Learning: Performance

Run 1: 10,000 records took 5 minutes
Reflect: Bottleneck in embedding generation

Plan (next run): Batch embeddings differently
Run 2: 10,000 records in 2 minutes (2.5x faster)

Best Practices

Start with Core Domains

Begin with agents that have:

  • Clear data sources
  • High usage potential
  • Well-defined boundaries

Iterate on System Prompts

After initial agent creation:

  • Review generated prompt
  • Customize for your context
  • Test with real queries
  • Refine based on results

Monitor ETL Health

Watch for:

  • Increasing failure rates
  • Slowing processing times
  • Growing error logs

These signal data source issues.

Regular Graph Rebuilds

Periodically rebuild the knowledge graph to:

  • Incorporate new relationships
  • Clean up stale entities
  • Optimize graph structure

What's Next

Data is flowing. Agents are created. Queries are routed. But how do you know if the system is working well?

Next up: Part 8 — Production Quality, where we cover quality evaluation, hallucination detection, and the observability that makes SemanticStudio enterprise-ready.