StackHawk

Understanding and Protecting Against LLM10: Unbounded Consumption

Matt Tanner   |   Dec 23, 2025


So, your company has created the next big AI-powered tool or platform. To get things rolling, you offer free trials to attract users. Usage ramps up and everything looks great until your first monthly AWS bill arrives. The $2,000 you’d budgeted for (based on your estimates of average user consumption) is now a $47,000 bill. Yikes. What happened?

Well, attackers discovered they could submit massive files and complex, compute-heavy requests that pushed the LLM to its limits. A single 200k-token request (input + output) can cost $4-$20 USD depending on the model (e.g., OpenAI GPT-4o with a 128k context window, Anthropic Claude 3 Opus at 200k, or Google Gemini 1.5 Pro at 1M); 500 such requests from automated scripts could drain thousands to tens of thousands of dollars in under an hour.
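
To put the "denial of wallet" math in concrete terms, here is a quick back-of-envelope estimate in Python. The per-million-token prices and the input/output split are illustrative assumptions, not any provider's actual rate card:

```python
# Back-of-envelope "denial of wallet" math. Prices are illustrative
# assumptions (USD per 1M tokens), not a provider's published rates.
INPUT_PRICE_PER_M = 15.00   # hypothetical input-token price
OUTPUT_PRICE_PER_M = 75.00  # output tokens are typically priced higher

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a single LLM request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# One request that fills a 200k context: 180k tokens in, 20k tokens out
single = request_cost(180_000, 20_000)                 # $4.20 at these prices
print(f"One maxed-out request: ${single:.2f}")
print(f"500 such requests:     ${500 * single:,.2f}")  # $2,100.00
```

With higher-priced models or larger outputs pushing each request toward the $20 end of the range, the same 500 requests climb into five figures, which is why the bill in the opening example is entirely plausible.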

These “denial of wallet” attacks exploit the lack of input limits and cost controls, potentially leading to bills that severely disrupt cash flow or even put a company out of business. This is unbounded consumption in action: when LLM applications allow excessive and uncontrolled resource usage, it can severely impact services and budgets.

The OWASP Top 10 for Large Language Model Applications (2025) identifies this as LLM10: Unbounded Consumption. Unlike traditional denial of service attacks that overwhelm servers with traffic, this vulnerability exploits the high compute volume required by LLMs. Every query forces the model to process complex operations across millions or billions of parameters. Without proper controls, attackers can exploit this computational demand to disrupt services, drain budgets, or steal intellectual property.

In this guide, we’ll explore how unbounded consumption creates security and financial risks, examine the various attack vectors, and provide strategies for implementing proper resource controls without disrupting legitimate usage by your actual users.

What is Unbounded Consumption in LLMs?

As the introduction mentions, unbounded consumption happens when LLM applications allow users to make excessive and uncontrolled inferences without proper limits or monitoring. LLM inference (the process of generating responses from a model) requires significant computing resources. Each query involves applying learned patterns across massive neural networks to produce relevant responses. Without controls, malicious users can exploit this demand.

The core problem? Many developers treat LLMs like traditional APIs without accounting for their resource intensity. A simple web API might handle thousands of requests per second with minimal resource impact. An LLM handling complex queries can easily consume gigabytes of memory and significant CPU cycles for each response. When you multiply that by thousands of malicious requests, systems quickly become overwhelmed.

What makes this particularly dangerous is the variety of attack vectors. Attackers don’t just flood systems with requests; they craft inputs specifically designed to maximize computational load. Variable-length inputs, complex language patterns, and requests that exceed context windows all force models to work harder and consume more resources.

The impact of unbounded consumption includes:

  • Service disruption: Resource exhaustion leading to timeouts, crashes, or unresponsive systems
  • Financial damage: Excessive cloud costs from pay-per-use models potentially bankrupting organizations
  • Tenant blast radius: Cross-tenant resource starvation when pools aren’t isolated (mitigated by per-tenant pools and quotas)
  • Performance degradation: Legitimate users experiencing slow response times, poor system performance, or service unavailability
  • Intellectual property theft: Typically via high-volume I/O harvesting and fine-tuning (functional replication), not direct weight download from a hosted API

Unfortunately, it’s actually not too tough to fall victim to these attacks. The computational demands of modern LLMs mean that what seems like “normal usage” can quickly spiral into resource exhaustion and service degradation without proper safeguards.

Types of Unbounded Consumption Attacks

OWASP identifies two main categories of unbounded consumption attacks that exploit resource usage in LLM applications.

The first of these is pure consumption attacks. These types of attacks focus on overwhelming systems through excessive resource usage, using tactics like:

Variable-Length Input Flood: Attackers submit numerous inputs of varying lengths to exploit processing inefficiencies in LLM systems. Models often have to allocate memory dynamically based on input size, and processing longer inputs requires more computational resources. By flooding the system with inputs that range from tiny fragments to maximum token limits, attackers can cause memory fragmentation, force inefficient resource allocation, and overwhelm processing queues.

Denial of Wallet (DoW): This targets the financial model behind cloud-based AI services. Attackers initiate high volumes of operations specifically to exploit pay-per-use pricing, causing unsustainable costs for service providers. A few motivated attackers can generate hundreds of thousands of expensive queries in hours, turning a manageable monthly bill into a budget-breaking business disaster.

Continuous Input Overflow: Attackers continuously send inputs that exceed the LLM’s context window (the maximum text length the model can handle in a single request), forcing the model to truncate, reprocess, or handle oversized inputs inefficiently. It is important to note that even requests that are ultimately truncated or rejected can still trigger embedding and chunking work and consume tokens first, so enforcing limits before anything is sent to the model is essential (likely at the AI gateway level; more on that later).
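
As a minimal sketch of that gateway-level check, the snippet below counts tokens with the tiktoken tokenizer and rejects the request before any model or embedding work happens. The model name and the MAX_INPUT_TOKENS ceiling are illustrative assumptions:

```python
# Reject oversized prompts at the gateway, before any model/embedding work.
# Assumes the `tiktoken` package; the limit and model name are illustrative.
import tiktoken

MAX_INPUT_TOKENS = 8_000  # pick a ceiling well below the model's context window

def validate_prompt(prompt: str, model: str = "gpt-4o") -> int:
    """Return the prompt's token count, or raise if it exceeds the limit."""
    enc = tiktoken.encoding_for_model(model)
    n_tokens = len(enc.encode(prompt))
    if n_tokens > MAX_INPUT_TOKENS:
        raise ValueError(f"Prompt is {n_tokens} tokens; limit is {MAX_INPUT_TOKENS}")
    return n_tokens
```

Rejecting here means the only cost an oversized request ever incurs is the tokenization itself.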

Resource-Intensive Query Patterns: These involve submitting unusually demanding queries with complex sequences, intricate language patterns, or computational requests that drain system resources. For example, requests for complex mathematical proofs, detailed code generation with multiple revisions, or analysis of extremely long documents all require significantly more processing power than simple question-and-answer interactions.

The second type of attack outlined is IP Theft via Consumption. These attacks use systematic resource consumption to steal intellectual property. They do this through:

Model Extraction via API: Attackers systematically query models using carefully crafted inputs to collect sufficient outputs for replicating model behavior. This involves thousands of targeted queries designed to understand model capabilities, training patterns, and decision-making processes. The goal is to create “shadow models” that replicate proprietary functionality.

Functional Model Replication: Rather than trying to extract the exact model, attackers use target models to generate synthetic training data for fine-tuning other models. They submit diverse prompts to collect large datasets of input-output pairs, then use this data to train competing models that achieve similar functionality. One important note, even though it’s more of an industry-level impact than an impact on your specific application or LLM: functional replication can also enable downstream data poisoning or bias propagation if synthetic outputs are fed back into fine-tuning datasets.

Side-Channel Techniques: Malicious users exploit timing analysis and error patterns to infer model characteristics or bypass input filtering; direct weight theft generally requires separate misconfigurations or insider access. These attacks often involve analyzing response patterns, timing differences, or error messages to infer information about model structure and implementation details.

Although these two types are distinctly different, their root causes are very much intertwined, and the same is true of the defenses that prevent and stop them.

The Root Causes of Unbounded Consumption

Unbounded consumption vulnerabilities stem from fundamental misunderstandings about LLM resource requirements and inadequate system design. For most of us, LLMs are new, but the APIs they are accessed through are not. Unfortunately, compared to the traditional APIs we are used to, which have predictable access patterns and resource consumption, LLM-powered APIs are a different ballgame altogether. Here are some of the root causes of unbounded consumption:

  • Treating LLMs Like Traditional APIs: Developers apply lightweight web service strategies to computationally intensive AI models, leading to insufficient rate limiting, inadequate monitoring, and missing cost controls.
  • Lack of Input Validation: Applications fail to validate input size, complexity, or format, allowing attackers to craft inputs that consume excessive resources or legitimate users to accidentally trigger exhaustion.
  • Insufficient Resource Monitoring: LLM applications lack proper monitoring of computational resources, memory consumption, or processing times, making it difficult to detect abuse or implement dynamic controls.
  • Missing Cost Controls: Applications lack cost monitoring, spending limits, or budget alerts, with organizations discovering abuse only after massive cloud bills arrive.
  • Inadequate Access Controls: Applications lack proper authentication, authorization, or user tracking, making it impossible to prevent abuse or identify attackers after incidents.
  • Poor Architecture Design: LLM applications lack user isolation, graceful degradation under load, or circuit breakers, with shared resource pools amplifying the impact of consumption attacks.
  • Underestimating Attack Motivation: Organizations assume LLM abuse is unlikely, but attackers have various motivations (vandalism to IP theft) and readily available tools for launching resource exhaustion attacks.

These root causes highlight that unbounded consumption results from treating LLMs as black boxes or traditional APIs, without understanding their resource characteristics or implementing appropriate safeguards. Protecting this new paradigm of API usage requires a different mindset. One of the best ways to illustrate this is by examining examples of how these attacks can occur.

Real-World Examples of Unbounded Consumption

Dig around the internet, and quite a few companies have stories about the impact of unbounded consumption, usually because no one thought through the reality of someone abusing their platform. We won’t go into any specific real-life scenarios, but here are realistic examples based on documented attack patterns and common deployment scenarios:

Scenario #1: E-Learning Platform Bankruptcy

An online education startup launches an AI tutoring platform that generates personalized study guides and practice questions. They offer a free tier to attract students before major exams. Within 48 hours of launch, automated scripts begin submitting requests for 50-page study guides on obscure topics, complex mathematical proofs, and detailed essay outlines. Adoption looks healthy until the team realizes that all these requests are coming simultaneously from thousands of bot accounts, not real users.

Each request consumes significant computational resources and costs $0.50-$2.00 in cloud fees. The attackers generate over 100,000 requests in two days, creating a $200,000 cloud bill that forces the early-stage startup to shut down. The attack is relatively simple and exploits several aspects of the system, including the lack of rate limiting, request complexity validation, and cost monitoring.

Scenario #2: Healthcare Chat Assistant Meltdown

A hospital deploys an AI assistant to help staff quickly look up drug interactions, treatment protocols, and clinical guidelines. The system processes natural language queries and generates detailed medical summaries. An employee aiming to disrupt the service discovers they can submit extremely long queries containing entire medical textbooks and ask for a comprehensive analysis of thousands of drug combinations simultaneously.

The system accepts these complex medical queries and consumes enormous memory and processing power attempting to provide a response, causing response times to increase from 2 seconds to over 5 minutes. Legitimate staff queries begin to time out during critical patient care situations. The system becomes unreliable enough that medical staff abandon it, effectively rendering the expensive deployment useless. This, of course, looks similar to a typical REST API denial of service attack, but it’s not about the volume of requests sent; it’s about the actual compute workload placed on the system, even from a single request.

Scenario #3: Model Theft via Systematic Extraction

A competitor targets a proprietary legal AI tool that provides contract analysis and legal research capabilities. Instead of directly attacking the service, they create thousands of accounts using different identities and begin systematically querying the model with carefully crafted legal scenarios. Each query is designed to extract specific knowledge about contract interpretation, legal precedents, and analytical reasoning.

Over six months, they collect millions of input-output pairs covering every aspect of the model’s capabilities. They use this data to fine-tune their own legal AI model, essentially stealing years of development work and proprietary training data. The resource consumption appears normal because it’s spread across time and multiple accounts, but the cumulative computational cost and intellectual property theft are massive. Attacks deployed this way often go completely unnoticed if they stay within typical usage patterns.

Now that you’ve seen the attacks in action, you’ve probably already started to stitch together some theories on how to prevent them. Next, let’s take a look at some of the best ways to protect against these types of attacks.

How to Protect Against Unbounded Consumption

At a high level, protecting LLM endpoints isn’t that different from securing any other type of API. The nuance (and probably the most essential differentiation) is that resource management and compute limits are just as critical as authentication or data validation. With this in mind, protecting against unbounded consumption requires implementing comprehensive resource controls. The best way to do this is to think of these layers of protection as a 4-stage deployment pipeline. Here is what that looks like:

Stage 1: Pre-Ingress Controls

First, before traffic can even attempt to reach an upstream API, you’ll want to deploy protective measures to ensure users are legitimate and not malicious. This includes implementing and checking things like:

  • WAF and IP reputation: Block known malicious IPs and implement basic bot detection
  • Account verification: Require credit card validation or phone verification for higher usage tiers
  • Bot checks: Implement CAPTCHA or challenge-response systems for suspicious traffic patterns

Stage 2: Gateway Controls

Once users have passed the first stage, they enter the realm of familiarity (in terms of typical API access). Here in the second stage, you’ll implement further controls to ensure that API access is authorized and adheres to the limits you’ve put in place (in terms of rate limits and quotas). The controls include:

  • Authentication and authorization: Verify user permissions and implement role-based access controls
  • Per-tenant RPM/TPM/concurrency caps: Mirror provider limits (for example, OpenAI may enforce a 5K RPM/200K TPM cap on a given tier) to prevent 429 cascades; see the sketch after this list
  • Request cost estimation: If possible, calculate approximate costs before processing and enforce spending limits
  • Queue back-pressure: Implement intelligent request queuing that prioritizes legitimate traffic and drops suspicious requests
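
As referenced above, here is a minimal in-memory sketch of a per-tenant RPM/TPM cap using a sliding one-minute window. The limits are illustrative, and a production gateway would back this with Redis or a similar shared store so the caps hold across instances:

```python
# Minimal per-tenant RPM/TPM limiter (sliding one-minute window, in memory).
# Illustrative limits; production gateways typically use Redis or built-in
# rate limiting so the caps apply across every instance.
import time
from collections import defaultdict, deque

RPM_LIMIT = 60        # requests per minute per tenant
TPM_LIMIT = 100_000   # tokens per minute per tenant

_events = defaultdict(deque)  # tenant_id -> deque of (timestamp, tokens)

def allow_request(tenant_id: str, estimated_tokens: int) -> bool:
    now = time.monotonic()
    window = _events[tenant_id]
    # Drop events older than 60 seconds.
    while window and now - window[0][0] > 60:
        window.popleft()
    tokens_used = sum(tokens for _, tokens in window)
    if len(window) >= RPM_LIMIT or tokens_used + estimated_tokens > TPM_LIMIT:
        return False  # caller should respond with 429 and a Retry-After header
    window.append((now, estimated_tokens))
    return True
```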

Stage 3: Inference Controls

Once requests reach the model, your goal shifts from preventing malicious requests (since the previous steps have already filtered many) to containment, ensuring that anything reaching the LLM is controlled in terms of input and output. A minimal containment sketch follows the list. These measures should include:

  • Input validation: Maximum token counts, character limits, and complexity scoring to prevent resource-intensive inputs
  • Per-request timeouts and memory limits: Set maximum processing times and memory allocation per request
  • Output restrictions: Cap response length and complexity to control resource usage
  • Safe streaming: Implement streaming responses with early termination for oversized outputs
  • Retrieval limits: For RAG applications, limit the number and size of documents retrieved
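
Here is that containment sketch: by the time a request gets here it has passed the gateway checks, so the goal is simply to bound what one inference can consume. The call_model callable and the limits are placeholders for your own client code:

```python
# Contain a single inference: cap output length and wall-clock time.
# `call_model` is a placeholder for whatever async client you use; the
# limits are illustrative.
import asyncio

MAX_OUTPUT_TOKENS = 1_024
REQUEST_TIMEOUT_SECONDS = 30

async def bounded_inference(call_model, prompt: str) -> str:
    try:
        return await asyncio.wait_for(
            call_model(prompt, max_tokens=MAX_OUTPUT_TOKENS),
            timeout=REQUEST_TIMEOUT_SECONDS,
        )
    except asyncio.TimeoutError:
        # Fail fast instead of letting one request hold resources indefinitely.
        raise RuntimeError("Inference exceeded the per-request time budget")
```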

Stage 4: Post-Inference Monitoring

Lastly, deploy comprehensive monitoring and automated responses to potential issues. This is in line with the general recommendation for securing API calls: make sure everything is logged and that you monitor resource usage. If anything does slip through, you want to know about it. A minimal budget kill-switch sketch follows the list. This stage should include:

  • Detailed logging: Track per-tenant tokens, latency, cost, and anomaly metrics
  • Anomaly detection: Monitor for unusual usage patterns, bursting activity, and long-tail token distributions
  • Automated budget actions: Set up alerts when spend thresholds are reached, and implement kill-switches that disable API keys or throttle users when hard limits are exceeded
  • Model extraction heuristics: Detect systematic parameter sweeps and near-duplicate prompts that might indicate IP theft attempts
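
As a minimal sketch of the automated budget action, the snippet below tracks per-tenant spend and fires a soft alert before tripping a hard kill-switch. The dollar thresholds and the alert/disable callbacks are illustrative placeholders for your own billing and key-management hooks:

```python
# Automated budget action: alert at a soft threshold, kill at a hard one.
# Spend figures and the alert/disable hooks are illustrative placeholders.
from collections import defaultdict

SOFT_LIMIT_USD = 100.0   # notify the on-call engineer
HARD_LIMIT_USD = 500.0   # disable the tenant's API key

_spend = defaultdict(float)  # tenant_id -> spend in the current billing window

def record_spend(tenant_id: str, cost_usd: float, alert, disable_api_key) -> None:
    """Call after each request with its estimated cost and two callbacks."""
    _spend[tenant_id] += cost_usd
    total = _spend[tenant_id]
    if total >= HARD_LIMIT_USD:
        disable_api_key(tenant_id)  # kill-switch
    elif total >= SOFT_LIMIT_USD:
        alert(f"Tenant {tenant_id} has spent ${total:.2f} this window")
```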

Following this pattern is one of the best ways to create a holistic defense against these types of attacks. If you’re like most, you may not even know how to test for this vulnerability within your application. Luckily, there is a way to automatically test for this and other vulnerabilities outlined in the OWASP LLM Top 10 (with everyone’s favorite modern DAST solution: StackHawk)!

How StackHawk Can Help Secure Your AI Applications

Determining whether your AI application is susceptible to the vulnerabilities described within the OWASP LLM Top 10 can be genuinely tough for most developers. Most testing tools are not equipped for the dynamic nature of AI and AI APIs that power modern applications. At StackHawk, we believe security is shifting beyond the standard set of tools and techniques, which is why we’ve augmented our platform to help developers address the most pressing security problems in AI.

As part of our LLM Security Testing, StackHawk detects relevant OWASP LLM Top 10 risks, including LLM10: Unbounded Consumption. With our built-in plugins (40052: API Lack of Rate Limiting for this risk), developers are flagged during their other tests when AI-specific risks are present.

StackHawk’s platform helps organizations build security into their AI applications from the ground up, ensuring users are protected against OWASP LLM Top 10 vulnerabilities.

Final Thoughts

Unbounded consumption represents a unique vulnerability where the computational power that makes LLMs useful becomes a vector for attack. The high computational demands of LLMs make them inherently vulnerable to both accidental and intentional resource exhaustion, leading to service disruption, unexpected costs, or intellectual property theft.

To stop this type of attack in its tracks, you need to implement comprehensive resource controls such as input validation, rate limiting, cost monitoring, and anomaly detection before unbounded consumption drains your compute budget or disrupts production systems. For the most current information on LLM security threats, check out the complete OWASP LLM Top 10 and start implementing resource controls that protect both your AI systems and your budget.


Ready to start securing your applications against unbounded consumption and other AI threats? Schedule a demo to learn how our security testing platform can help identify vulnerabilities in your AI-powered applications and traditional web services alike.
