Choosing the right AI model for your agents

Learn about the different AI models available in Ubby, their strengths and weaknesses, and how to select the optimal model for each use case.

One of Ubby's most powerful features is the ability to choose from multiple AI models for your agents. Each model offers different capabilities, performance characteristics, and costs. Understanding these differences and knowing when to use which model can dramatically improve both the effectiveness of your agents and the efficiency of your credit usage.

This article explores the AI models available in Ubby, explains their key differences, and provides practical guidance for selecting the right model for each task.


The AI model landscape in Ubby

Ubby provides access to a carefully curated selection of AI models that have been tested and optimized for agentic workflows. Not every AI model works well in autonomous agent scenarios—some excel at conversational interactions but struggle with tool use, complex multi-step reasoning, or reliable execution of structured tasks. The models available in Ubby represent those that perform best for building and running autonomous agents.

This curated approach means you can trust that any model you select has been validated for agentic use cases. You are choosing among models that all work well with agents; the question is which one best fits your specific task requirements and budget.

The available models span several major families:

  • Claude models (Anthropic): Known for strong reasoning capabilities, nuanced understanding, and excellent instruction-following. These models excel at complex analysis, writing, and tasks requiring careful consideration.

  • GPT models (OpenAI): Highly versatile models with broad knowledge and strong general capabilities. The GPT family ranges from powerful flagship models to efficient smaller variants.

  • Gemini models (Google): Advanced models with strong multimodal capabilities and competitive performance across many tasks.

  • DeepSeek models: Cost-effective models that offer solid performance for many use cases at lower credit consumption.

  • Grok models (xAI): Models designed for conversational intelligence and real-time information processing.

  • Specialized models: Including coding-focused models like Qwen3-Coder and efficient open-source models like GPT-OSS variants.

Each model family brings different strengths to the table. Your job is not to find the single "best" model but to match models to tasks based on your specific requirements.


Understanding model characteristics

When evaluating AI models, several key characteristics determine their suitability for different tasks.

Intelligence and reasoning capability

Some models demonstrate superior reasoning ability, handling complex multi-step problems, nuanced analysis, and sophisticated decision-making better than others. These models typically cost more credits per token but deliver higher-quality outputs for demanding tasks.

Claude Sonnet-4, for example, excels at tasks requiring deep understanding, careful reasoning, and nuanced responses. If your agent needs to analyze complex business situations, provide strategic advice, or handle ambiguous instructions, a high-intelligence model justifies its higher cost.

Lighter models may struggle with truly complex reasoning but handle straightforward tasks perfectly well. Using an expensive high-intelligence model for simple tasks wastes credits without adding value.

Speed and responsiveness

Different models process requests at different speeds. Some return responses in seconds, while others take longer. For agents that need to respond quickly—like customer service bots or real-time assistants—speed matters significantly.

Generally, smaller models process faster than larger ones. A model like GPT-5-nano returns responses much more quickly than GPT-5, though with reduced capability. Consider whether your use case prioritizes speed over maximum intelligence.

In automated workflows where agents run asynchronously without human waiting, speed matters less. A monthly report that takes 30 seconds versus 10 seconds to generate makes little practical difference. But an interactive agent where users expect immediate responses needs faster models.

Context window size

The context window determines how much information a model can consider at once. This includes your prompt, any documents or data provided, the agent's instructions, and conversation history.

Models with larger context windows can handle longer documents, maintain more conversation history, or work with extensive background information. If your agents regularly process large documents or need to reference substantial context, context window size becomes a critical selection factor.

However, larger context windows consume more tokens and therefore more credits when fully utilized. Don't default to maximum-context models if your tasks rarely need extensive context.

Specialized capabilities

Some models offer specialized capabilities that make them particularly suited to certain tasks:

  • Coding models like Qwen3-Coder excel at understanding and generating code. If your agents work extensively with programming tasks, these specialized models often outperform general-purpose models.

  • Multimodal models can process images alongside text. If your agents need to analyze documents with visual elements, charts, or images, multimodal capability becomes essential.

  • Conversation-optimized models handle multi-turn dialogue more naturally, maintaining context and adapting responses based on conversation flow.

Match specialized models to tasks that benefit from their particular strengths.


Cost considerations

Every model has different credit costs per million tokens, and these costs vary between input (processing) and output (generation) tokens. Understanding these costs helps you optimize your credit usage.

The cost spectrum

At the high end, flagship models like Claude Sonnet-4 cost 4,500 credits per 1M input tokens and 22,500 per 1M output tokens. These premium prices reflect exceptional capability and should be reserved for tasks that genuinely benefit from top-tier performance.

Mid-range models like GPT-4.1 (3,000 input / 12,000 output) or DeepSeek-Chat (750 input / 1,500 output) offer strong performance at more moderate costs. These models handle most business tasks well and represent good value for routine agent operations.

At the efficient end, models like GPT-5-nano (75 input / 600 output) or GPT-OSS-20b (60 input / 225 output) provide basic capability at minimal credit consumption. While not suitable for complex reasoning, they excel at straightforward tasks where intelligence requirements are modest.

Understanding the cost-performance tradeoff

More expensive models deliver better performance, but the relationship is not linear. A model costing 10x more does not necessarily perform 10x better—the improvement might be 2x or 3x for most tasks.

This creates optimization opportunities. For many routine tasks, a mid-range model delivers 80-90% of a flagship model's quality at 30-50% of the cost. The performance gap matters more for some tasks than others.

Consider what "good enough" means for each task. A monthly executive report might justify premium model costs for maximum quality. A routine data extraction task that produces structured output might work perfectly well with an efficient mid-range model.

Calculating cost per task

To understand true costs, calculate expected credit consumption per task execution:

Example 1 - Complex analysis task:

  • Input: 20,000 tokens (detailed business data)

  • Expected output: 3,000 tokens (comprehensive analysis)

  • Using Claude Sonnet-4:

    • Input cost: (20,000 ÷ 1,000,000) × 4,500 = 90 credits

    • Output cost: (3,000 ÷ 1,000,000) × 22,500 = 67.5 credits

    • Total: ~158 credits per execution

Example 2 - Same task with DeepSeek-Chat:

  • Input cost: (20,000 ÷ 1,000,000) × 750 = 15 credits

  • Output cost: (3,000 ÷ 1,000,000) × 1,500 = 4.5 credits

  • Total: ~20 credits per execution

The cheaper model saves ~138 credits per execution (87% savings). Whether this tradeoff makes sense depends on whether the quality difference matters for your use case.
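The arithmetic above is easy to wrap in a small helper so you can compare models before committing to one. A minimal sketch, using the credit rates quoted in this article (the rate table and function names are illustrative, not part of any Ubby API):

```python
# Credit rates per 1M tokens, taken from the figures quoted in this article.
RATES = {
    "claude-sonnet-4": {"input": 4500, "output": 22500},
    "deepseek-chat": {"input": 750, "output": 1500},
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the credit cost of one task execution."""
    rate = RATES[model]
    return (input_tokens / 1_000_000) * rate["input"] + \
           (output_tokens / 1_000_000) * rate["output"]

premium = task_cost("claude-sonnet-4", 20_000, 3_000)  # 157.5 credits
budget = task_cost("deepseek-chat", 20_000, 3_000)     # 19.5 credits
savings = (premium - budget) / premium                 # ~0.876, i.e. ~87%
```

Running your expected token volumes through a function like this for each candidate model makes the tradeoff concrete before you deploy anything.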


Model selection strategies

Effective model selection involves matching model characteristics to task requirements systematically.

Start with task classification

Classify your tasks into categories based on their complexity and requirements:

  • High-complexity tasks: Strategic analysis, complex decision-making, nuanced writing, ambiguous problem-solving. These tasks benefit from flagship models with maximum reasoning capability.

  • Medium-complexity tasks: Data analysis, report generation, standard document processing, routine advisory work. Mid-range models handle these tasks well at reasonable cost.

  • Low-complexity tasks: Data extraction, simple classification, template filling, basic formatting. Efficient models provide adequate performance at minimal cost.

  • Specialized tasks: Code generation, image analysis, specific domain work. These tasks benefit from models with relevant specialized capabilities.

This classification guides your default model choices, though you should validate with testing.

The performance-cost optimization matrix

Create a simple matrix to guide model selection:

  • High value + High complexity → Premium models. When the task output is highly valuable (strategic decisions, client-facing materials) and requires sophisticated reasoning, premium model costs are justified.

  • High value + Medium complexity → Mid-range models. Important tasks that do not require maximum intelligence work well with solid mid-range models that balance quality and cost.

  • Low value + Any complexity → Most efficient adequate model. For routine tasks where output has limited impact, minimize costs by using the most efficient model that produces acceptable results.

  • Specialized requirements → Specialized models. When tasks have specific requirements (coding, multimodal, etc.), specialized models outperform general-purpose alternatives regardless of cost.
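The matrix above reduces to a few lines of logic. A sketch, where the tier names and the function itself are illustrative conventions rather than Ubby settings:

```python
def pick_tier(value: str, complexity: str, specialized: bool = False) -> str:
    """Map a task's value and complexity to a model tier per the matrix above.

    value: "high" or "low"; complexity: "high", "medium", or "low".
    Tier names ("premium", "mid-range", "efficient") are illustrative.
    """
    if specialized:
        return "specialized"   # specific requirements trump the cost matrix
    if value == "low":
        return "efficient"     # limited-impact output: minimize cost
    if complexity == "high":
        return "premium"       # valuable and demanding: pay for reasoning
    return "mid-range"         # valuable but routine: balance quality and cost

pick_tier("high", "high")        # "premium"
pick_tier("high", "medium")      # "mid-range"
pick_tier("low", "high")         # "efficient"
pick_tier("high", "high", True)  # "specialized"
```

Encoding the matrix this way keeps default choices consistent across a team, while still allowing per-task overrides after testing.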

Testing and validation

Do not select models on assumptions alone. Test your choices with real tasks:

Run the same task through multiple models and compare results. Does the expensive model deliver meaningfully better output? Or does a mid-range model produce virtually identical results at lower cost?

Pay attention to failure modes. Sometimes cheaper models fail occasionally where expensive models succeed consistently. If these failures create significant problems, the reliability of expensive models justifies their cost. If failures are easily caught and corrected, occasional failures might be acceptable given the cost savings.

Validate across multiple examples, not just one. A model might handle one example well but struggle with others that have different characteristics.
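A side-by-side comparison can be as simple as running each candidate over the same task list and collecting outputs for human review. A minimal harness sketch; `run_agent` is a stand-in for however you invoke an Ubby agent with a chosen model, not a real API:

```python
def compare_models(tasks: list[str], models: list[str], run_agent) -> dict:
    """Run every task through every candidate model and collect outputs
    for human side-by-side review. run_agent(model, task) is assumed to
    return the agent's output as a string."""
    return {
        task: {model: run_agent(model, task) for model in models}
        for task in tasks
    }

# Stub runner so the sketch is self-contained; replace with a real call.
results = compare_models(
    tasks=["Summarize clause 4", "Extract payment terms"],
    models=["claude-sonnet-4", "deepseek-chat"],
    run_agent=lambda model, task: f"[{model}] answer for: {task}",
)
```

Reviewing several tasks at once, rather than a single example, is what surfaces the failure modes described above.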


Common model selection patterns

Certain patterns emerge across successful Ubby deployments for different types of agents.

Document processing agents

For complex document analysis (contracts, technical documents, nuanced content):

  • Primary: Claude Sonnet-4 or GPT-4.1

  • Reason: These tasks benefit from strong comprehension and reasoning

For straightforward extraction (pulling data fields, categorizing, summarizing):

  • Primary: DeepSeek-Chat or GPT-4o

  • Reason: Mid-range models handle structured extraction well at reasonable cost

For simple document classification:

  • Primary: GPT-5-mini or Gemini-2.5-flash-lite

  • Reason: Classification based on clear criteria works well with efficient models

Writing and content generation agents

For client-facing content (reports, proposals, communications):

  • Primary: Claude Sonnet-4

  • Reason: Quality and nuance matter for external-facing materials

For internal documentation:

  • Primary: Claude 3.7-Sonnet or GPT-4.1

  • Reason: Good quality at more moderate cost for internal use

For template-based content (filling forms, standard letters):

  • Primary: DeepSeek-Chat or GPT-4o

  • Reason: Template filling does not require premium models

Data analysis agents

For complex analytical reasoning:

  • Primary: Claude Sonnet-4 or GPT-5

  • Reason: Drawing insights from complex data benefits from strong reasoning

For routine metrics and reporting:

  • Primary: GPT-4.1 or DeepSeek-Chat

  • Reason: Calculating and presenting standard metrics works well with mid-range models

For data extraction and transformation:

  • Primary: GPT-4o or Gemini-2.5-pro

  • Reason: Structured data work does not require maximum intelligence

Conversational agents

For sophisticated advisory conversations:

  • Primary: Claude Sonnet-4

  • Reason: Nuanced dialogue benefits from strong comprehension and reasoning

For customer service and support:

  • Primary: GPT-4.1 or Claude 3.7-Sonnet

  • Reason: Balance between quality and speed/cost for frequent interactions

For simple FAQ and routing:

  • Primary: GPT-5-mini or DeepSeek-Chat

  • Reason: Straightforward Q&A works well with efficient models


Monitoring model performance and costs

Track how different models perform in your specific use cases to refine selection over time.

Cost tracking by model

Your Ubby usage logs show which models consumed how many credits over time. Review this data monthly to understand your model cost distribution:

  • Which models account for most of your credit consumption?

  • Are expensive models being used for tasks that could use cheaper alternatives?

  • Are efficient models failing frequently, suggesting you should upgrade to more capable models for those tasks?
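If you export your usage logs, the monthly review above is a short aggregation. A sketch assuming each log row carries a model name and a credit amount; real Ubby log fields may differ:

```python
from collections import defaultdict

# Hypothetical exported log rows; field names are an assumption.
usage_log = [
    {"model": "claude-sonnet-4", "credits": 158},
    {"model": "deepseek-chat", "credits": 20},
    {"model": "claude-sonnet-4", "credits": 140},
]

totals = defaultdict(float)
for row in usage_log:
    totals[row["model"]] += row["credits"]

# Share of total spend per model, highest first.
total = sum(totals.values())
breakdown = sorted(
    ((model, credits, credits / total) for model, credits in totals.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

A breakdown like this quickly shows whether one expensive model dominates your spend, which is where reassigning tasks to cheaper models pays off most.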

Quality metrics

For critical agents, track quality metrics alongside costs:

  • Error rates or failure rates by model

  • Human review/revision rates for model outputs

  • User satisfaction for conversational agents

  • Accuracy for data extraction agents

If a cheaper model requires frequent human correction, its nominal cost savings might be illusory once you account for the human time spent fixing errors.
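That tradeoff can be made explicit by pricing human corrections into the per-task cost. A sketch with illustrative numbers (the failure rates and correction cost are assumptions, not measurements):

```python
def effective_cost(model_cost: float, failure_rate: float,
                   correction_cost: float) -> float:
    """Expected cost per task once human corrections are priced in.
    correction_cost is the credit-equivalent value of the human time
    spent fixing one failed output."""
    return model_cost + failure_rate * correction_cost

# Illustrative: a 20-credit model failing 15% of the time, with each
# failure costing 1,000 credits' worth of human time, ends up costlier
# than a 158-credit model failing 1% of the time.
cheap = effective_cost(20, 0.15, 1000)     # 170.0
premium = effective_cost(158, 0.01, 1000)  # 168.0
```

Once failure rates and correction effort are measured for your own tasks, this expected-cost view often reverses a selection that looked obvious from token prices alone.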

Continuous optimization

Model selection should not be static. As new models become available, as your tasks evolve, and as you gain experience with different models, continuously refine your choices:

  • Periodically test whether new models outperform your current selections

  • Reassess whether tasks you classified as "complex" could be handled by mid-range models after all

  • Look for opportunities to shift work from expensive to efficient models without sacrificing quality


Future-proofing your model strategy

The AI model landscape evolves rapidly. Design your agent architecture to adapt easily as new models emerge.

Avoid hard-coding model choices

Rather than hard-coding specific models into your agents, use configuration that can be easily updated. When a new, better model becomes available, you can switch with minimal effort.

Some organizations maintain model "tiers" (premium, standard, efficient) and assign agents to tiers rather than specific models. When a better model joins a tier, all agents in that tier benefit automatically.
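One way to structure this is a tier map kept outside the agent definitions, so a single edit upgrades every agent in a tier. A sketch; the model names mirror those mentioned in this article, but the mapping itself is an illustrative convention, not an Ubby feature:

```python
# One place to update when a better model joins a tier.
MODEL_TIERS = {
    "premium": "claude-sonnet-4",
    "standard": "gpt-4.1",
    "efficient": "gpt-5-nano",
}

# Agents reference tiers, never concrete models.
AGENT_TIERS = {
    "contract-analyst": "premium",
    "report-writer": "standard",
    "faq-router": "efficient",
}

def model_for(agent: str) -> str:
    """Resolve an agent's tier to its current model."""
    return MODEL_TIERS[AGENT_TIERS[agent]]
```

Swapping `MODEL_TIERS["standard"]` to a newly released model then upgrades every standard-tier agent at once, with no per-agent edits.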

Monitor new model releases

AI providers regularly release improved models with better performance, lower costs, or new capabilities. Stay informed about these releases and evaluate whether they offer advantages for your use cases.

Ubby adds new models to the platform as they become available. When you see a new model appear in your model pricing page, investigate whether it might outperform your current selections.

Build institutional knowledge

Document which models work well for which tasks in your organization. This knowledge helps new team members make good model selections and prevents repeatedly testing the same model/task combinations.

Share learnings across your team about model performance. Someone might discover that a particular model excels at a specific task type, knowledge that benefits everyone.


What next?

You now understand the different AI models available in Ubby, their characteristics, and how to select the right model for each task. This knowledge enables you to optimize both the quality of your agents' work and the efficiency of your credit usage.

In the next article, we will explore Ubby's pricing plans and billing system in detail, helping you choose the right plan for your needs and understand how billing works.
