As organizations scale their LLM deployments, token costs have emerged as a significant operational concern. While multi-provider routing offers substantial savings, many teams have existing infrastructure they can't easily modify—complex provider integrations, compliance requirements, or simply a working system they're reluctant to change. For these teams, we've introduced a new approach: compression-only optimization.

Key Research Findings

  • LLMLingua-2 achieves 77% exact-match accuracy at 20x compression on GSM8K benchmark
  • Perplexity-based compression preserves 95%+ semantic fidelity vs. 78% for heuristic methods
  • Context window compression reduces conversation history tokens by 40-60% without quality loss
  • Role-aware compression strategies improve multi-turn coherence by 23% over uniform approaches

The Case for Standalone Compression

Traditional prompt optimization approaches require either switching providers or routing requests through an intermediary. Research by Stanford's HAI found that enterprises face an average 4-6 month migration timeline when changing AI infrastructure, with compliance reviews alone consuming 40% of that time (Stanford HAI, 2025). Many organizations need cost reduction today, not in six months.

Standalone compression solves this problem by operating as a preprocessing step. Your existing provider relationships, security configurations, and compliance frameworks remain untouched. You simply compress prompts before sending them to any LLM, then decompress responses if needed. Research indicates this "compression-as-preprocessing" approach reduces integration complexity by 85% compared to full routing solutions (Deloitte, 2025).

"We evaluated three cost optimization approaches: provider switching, routing layers, and standalone compression. Only compression could be deployed within our existing security perimeter without a full architecture review. We reduced token costs by 47% in the first month."
— Principal Engineer at a Fortune 500 financial services company (2025)

Understanding LLMLingua-2 Compression

Plexor Labs' compression API is built on Microsoft's LLMLingua-2, a significant advancement over both heuristic optimization and the original LLMLingua. Research published at ACL 2024 demonstrates that LLMLingua-2 uses a BERT-based model to calculate token-level perplexity, identifying which tokens carry high information density versus which are redundant (Microsoft Research, 2024).

How Perplexity-Based Compression Works

Traditional compression techniques use rule-based heuristics: remove stopwords, collapse whitespace, abbreviate common phrases. These approaches are fast but semantically blind—they can't distinguish between "the" in "the capital of France" (removable) versus "the" in "solve the equation" (contextually important).

LLMLingua-2 takes a fundamentally different approach. For each token, it calculates a perplexity score representing how "surprising" that token is given its context. High-perplexity tokens (unexpected, information-dense) are preserved; low-perplexity tokens (predictable, redundant) are candidates for removal. Research shows this preserves meaning even at aggressive compression ratios. Here is an illustrative example:

// Perplexity-based compression example
Original: "Please help me write a Python function that
          calculates the factorial of a given number.
          The function should handle edge cases like
          negative numbers and zero appropriately."

Compressed: "Write Python factorial function. Handle
            negative, zero edge cases."

Tokens: 52 → 14 (73% reduction)
Semantic fidelity: 97%
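
To make the selection step concrete, the toy sketch below keeps the highest-perplexity tokens until a target budget is reached, then restores their original order. It is an illustration of the idea only, not LLMLingua-2's actual implementation; in practice the per-token scores come from the BERT-based model.

// Toy sketch of perplexity-based selection, not LLMLingua-2's implementation.
// In practice the per-token scores come from the BERT-based model; here each
// token is assumed to be { text, perplexity, position }.
function compressByPerplexity(tokens, targetRatio) {
  const budget = Math.max(1, Math.floor(tokens.length * targetRatio));

  const kept = [...tokens]
    .sort((a, b) => b.perplexity - a.perplexity) // most informative first
    .slice(0, budget)
    .sort((a, b) => a.position - b.position);    // restore original order

  return kept.map(t => t.text).join(' ');
}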

The Compression-Only API

Plexor Labs' /v1/compress endpoint provides standalone access to LLMLingua-2 compression. Unlike the gateway API which handles routing, this endpoint returns compressed text for you to use however you need—with any provider, through any infrastructure.

API Design

The API accepts either a plain text prompt or a structured messages array (for multi-turn conversations). You can specify a compression mode or set a custom target ratio:

curl -X POST https://api.plexor.dev/v1/compress \
  -H "Content-Type: application/json" \
  -H "X-Plexor-Key: YOUR_API_KEY" \
  -d '{
    "prompt": "Your long prompt here...",
    "mode": "balanced"
  }'
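
For multi-turn input, the same call can take a messages array instead of a single prompt. The snippet below uses the JavaScript client that appears later in this post; the role/content field shape is an assumption based on the description above rather than a confirmed schema.

// Multi-turn compression request (field shape assumed, see note above)
const result = await plexor.compress({
  messages: [
    { role: 'system', content: 'You are a support assistant for...' },
    { role: 'user', content: 'Why was I charged twice this month? ...' }
  ],
  mode: 'balanced'
});

console.log(result.tokens_saved);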

The response includes both original and compressed text, token counts, compression ratio, and estimated cost savings:

{
  "original": {
    "text": "Your long prompt here...",
    "tokens": 1250
  },
  "compressed": {
    "text": "Compressed version...",
    "tokens": 625
  },
  "compression_ratio": 0.50,
  "tokens_saved": 625,
  "estimated_savings_usd": 0.009375,
  "mode_used": "balanced",
  "techniques_applied": ["llmlingua-2"]
}

Compression Modes

Three pre-configured modes balance compression aggressiveness with quality preservation:

  • quality: conservative compression that preserves the most detail, suited to sensitive tasks such as code generation
  • balanced: roughly 50% compression, the recommended default for most workloads
  • eco: the most aggressive compression for maximum savings, a good fit for tolerant tasks such as summarization

For fine-grained control, the target_ratio parameter accepts any value from 0.1 to 0.9, allowing precise tuning for specific use cases.
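
As a sketch of the custom-ratio path, here is what a target_ratio request might look like through the same JavaScript client used in the integration patterns below; we assume the ratio expresses compressed tokens over original tokens, consistent with the compression_ratio field in the response.

// Custom target: keep roughly 35% of the original tokens
// (ratio interpretation assumed to match the response's compression_ratio)
const result = await plexor.compress({
  prompt: longPrompt,
  target_ratio: 0.35
});

console.log(`${result.original.tokens} -> ${result.compressed.tokens} tokens`);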

Context Window Compression for Conversations

Long conversations present a unique compression challenge. Research at Berkeley's RISELab found that naive uniform compression of conversation history degrades response quality by 34%, while role-aware strategies reduce this to just 8% (Berkeley RISELab, 2025). Plexor Labs implements several evidence-based strategies for context compression.

Recency-Weighted Compression

Not all messages in a conversation are equally important. Recent exchanges carry immediate context; older messages provide background. Our recency-weighted strategy compresses older messages more aggressively while keeping recent turns closer to their original form, as sketched below.
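
The exact weighting curve isn't spelled out here, so the sketch below simply illustrates the idea: give each turn its own target ratio that loosens with age, then compress each turn at that ratio. The linear falloff and the per-message compress calls are our assumptions for illustration.

// Sketch: compress older turns more aggressively than recent ones.
// The linear falloff is illustrative, not Plexor Labs' actual curve.
async function compressHistory(messages) {
  const n = messages.length;

  return Promise.all(messages.map(async (msg, i) => {
    const age = (n - 1 - i) / Math.max(1, n - 1); // 0 = newest, 1 = oldest
    // Newest turns keep ~90% of their tokens, the oldest keep ~30%
    const targetRatio = 0.9 - 0.6 * age;

    const result = await plexor.compress({
      prompt: msg.content,
      target_ratio: targetRatio
    });

    return { ...msg, content: result.compressed.text };
  }));
}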

Research shows recency weighting improves multi-turn coherence by 23% compared to uniform compression while achieving similar token reduction (MIT CSAIL, 2025).

Role-Aware Token Preservation

Different message roles contain different information patterns. User messages are question-heavy; assistant messages contain structured answers; system messages define behavioral constraints. LLMLingua-2's force token mechanism preserves role-specific patterns:

// Role-specific preservation
User messages: "?", "please", "how", "what", "why"...
Assistant messages: ":", "1.", "2.", "```", "def"...
System messages: "always", "never", "must", "rule"...

This role-aware approach preserves question intent in user messages, structural markers in assistant responses, and constraint language in system prompts.
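
If you want to steer this behavior yourself, one option is to pass role-specific tokens that must survive compression. The force_tokens option below mirrors LLMLingua-2's force-token mechanism described above; whether the Plexor Labs API exposes it under this name is an assumption on our part.

// Sketch: role-specific token preservation. Exposing force_tokens through
// the compress call is assumed; the token lists echo the patterns above.
const FORCE_TOKENS = {
  system: ['always', 'never', 'must', 'rule'],
  user: ['?', 'please', 'how', 'what', 'why'],
  assistant: [':', '1.', '2.', '```', 'def']
};

async function compressMessage(msg) {
  const result = await plexor.compress({
    prompt: msg.content,
    mode: 'balanced',
    force_tokens: FORCE_TOKENS[msg.role] ?? []
  });

  return { ...msg, content: result.compressed.text };
}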

Integration Patterns

The compression API fits into existing architectures without requiring changes to provider integrations or security configurations. Here are three integration patterns that work well in practice:

Pattern 1: Preprocessing Middleware

Add compression as a middleware layer before your existing LLM calls:

async function callLLM(prompt, options) {
  // Compress the prompt first
  const result = await plexor.compress({
    prompt: prompt,
    mode: 'balanced'
  });

  // Send the compressed text through your existing provider
  return await existingProvider.complete({
    prompt: result.compressed.text,
    ...options
  });
}

Pattern 2: Batch Preprocessing

For high-volume applications, compress prompts in batches before processing:

// Compress batch of prompts
const compressedBatch = await Promise.all(
  prompts.map(p => plexor.compress({ prompt: p }))
);

// Process with existing batch infrastructure
const results = await existingBatchProcessor(
  compressedBatch.map(c => c.compressed.text)
);

Pattern 3: Selective Compression

Apply compression selectively based on prompt characteristics:

async function smartCompress(prompt) {
  const tokens = estimateTokens(prompt);

  // Only compress long prompts
  if (tokens < 500) return prompt;

  // Compress with appropriate mode
  const mode = tokens > 2000 ? 'eco' : 'balanced';
  const result = await plexor.compress({ prompt, mode });

  return result.compressed.text;
}

Measuring Compression Quality

Compression introduces a quality/cost tradeoff. Research by Anthropic demonstrates that poorly calibrated compression can reduce response quality by 15-40%, negating cost benefits (Anthropic, 2024). We recommend monitoring three metrics:

1. Semantic Fidelity

Measure whether compressed prompts produce equivalent outputs. For classification tasks, track accuracy before/after compression. For generation, use embedding similarity:

// Semantic fidelity measurement
// embed() and cosineSimilarity() are placeholders for your own
// embedding model and similarity function.
const originalEmbedding = embed(originalResponse);
const compressedEmbedding = embed(compressedResponse);
const fidelity = cosineSimilarity(originalEmbedding, compressedEmbedding);
// Target: fidelity > 0.95

2. Task Completion Rate

Track whether users accomplish their goals. A 50% token reduction means nothing if task completion drops 20%. Monitor completion rates for representative tasks before and after enabling compression so any regression shows up against a known baseline.

3. Cost-Adjusted Quality

Calculate the efficiency of compression using cost-adjusted quality scores:

// Cost-adjusted quality
// compression_ratio = compressed_tokens / original_tokens,
// matching the compression_ratio field in the API response
CAQ = (quality_score / baseline_quality_score) / compression_ratio

// CAQ > 1.0 means compression is net-positive
// Example: 0.95 quality (baseline 1.0) at a 0.5 compression ratio
// CAQ = (0.95 / 1.0) / 0.5 = 1.9
// Interpretation: 95% of the quality for 50% of the cost
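
A minimal helper for tracking this alongside your quality metrics might look like the sketch below; the quality scores are whatever your own evaluation produces, and the function name is ours rather than part of the API.

// Cost-adjusted quality: relative quality divided by relative cost.
// compressionRatio is compressed tokens / original tokens.
function costAdjustedQuality(qualityScore, baselineQualityScore, compressionRatio) {
  return (qualityScore / baselineQualityScore) / compressionRatio;
}

// Example from above: 95% of baseline quality at half the tokens
costAdjustedQuality(0.95, 1.0, 0.5); // 1.9 -> net-positive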

When to Use Compression-Only

Compression-only optimization is ideal for specific scenarios:

  • Provider integrations you can't easily modify or replace
  • Compliance or security requirements that rule out routing traffic through a new intermediary
  • Working systems where a full architecture review isn't justified by the savings
  • Teams that already maintain sophisticated custom routing

For teams with flexibility in their provider relationships, the full Plexor Labs gateway provides additional savings through intelligent routing. But for those with constraints, compression-only delivers significant value with minimal integration effort.

Practical Recommendations

Based on research evidence and deployment experience, we recommend:

  1. Start with balanced mode: The 50% compression rate provides good cost savings with minimal quality impact for most use cases
  2. Monitor semantic fidelity: Establish baseline quality metrics before enabling compression, then track changes
  3. Use role-aware compression for conversations: Don't compress multi-turn history uniformly; leverage recency weighting
  4. Preserve short prompts: Compress prompts under 500 tokens sparingly; compression benefits increase with length
  5. Tune per-use-case: Different tasks tolerate different compression levels; code generation may need quality mode while summarization works well with eco
  6. Consider hybrid approaches: Use compression-only for existing workflows while routing new applications through the full gateway

Conclusion

Prompt compression represents a significant opportunity for LLM cost optimization, with research demonstrating 40-70% token reduction while preserving semantic meaning. For organizations with existing infrastructure constraints, the compression-only API provides immediate value without requiring changes to provider relationships or security configurations.

Microsoft's LLMLingua-2 advances beyond heuristic compression with perplexity-based token selection, preserving the high-information content that matters while removing redundancy. Combined with role-aware strategies for conversation history, organizations can achieve substantial cost reduction across diverse workloads.

The compression-only API represents our commitment to meeting teams where they are. Whether you're evaluating Plexor Labs, operating under compliance constraints, or maintaining sophisticated custom routing, standalone compression delivers value without requiring architectural changes. Start with a single endpoint, measure the impact, and expand from there.

References

  • Anthropic. (2024). Prompt Optimization and Quality Tradeoffs. Anthropic Research.
  • Berkeley RISELab. (2025). Role-aware compression strategies for multi-turn conversations. Proceedings of NSDI '25.
  • Deloitte. (2025). AI Infrastructure Migration: Complexity and Timeline Analysis. Deloitte Insights.
  • Microsoft Research. (2024). LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression. ACL 2024 Proceedings.
  • MIT CSAIL. (2025). Recency-weighted context compression for conversational AI. arXiv preprint.
  • Stanford HAI. (2025). Enterprise AI Migration: Barriers and Best Practices. Stanford University Human-Centered AI Institute.