{"id":16194,"date":"2026-05-08T09:07:08","date_gmt":"2026-05-08T09:07:08","guid":{"rendered":"https:\/\/newestek.com\/?p=16194"},"modified":"2026-05-08T09:07:08","modified_gmt":"2026-05-08T09:07:08","slug":"pen-tests-show-ai-security-flaws-far-more-severe-than-legacy-software-bugs","status":"publish","type":"post","link":"https:\/\/newestek.com\/?p=16194","title":{"rendered":"Pen tests show AI security flaws far more severe than legacy software bugs"},"content":{"rendered":"<div>\n<div id=\"remove_no_follow\">\n<div class=\"grid grid--cols-10@md grid--cols-8@lg article-column\">\n<div class=\"col-12 col-10@md col-6@lg col-start-3@lg\">\n<div class=\"article-column__content\">\n<section class=\"wp-block-bigbite-multi-title\">\n<div class=\"container\"><\/div>\n<\/section>\n<p>Penetration tests of AI-based systems are revealing a greater percentage of high-risk flaws than those discovered in legacy systems.<\/p>\n<p>Security consultancy <a href=\"https:\/\/www.cobalt.io\/blog\/5-key-takeaways-from-the-2026-state-of-pentesting-report\">Cobalt\u2019s annual State of Pentesting Report<\/a> reveals that 32% of all AI and large language model (LLM) findings are rated as high risk \u2014 nearly 2.5 times the rate (13%) of severe flaws found in enterprise security tests more generally.<\/p>\n<p>LLM vulnerabilities also have the lowest resolution rate of all app types pen-tested, with just 38% of high-risk issues fixed, according to data collected during pen tests conducted by Cobalt.<\/p>\n<p>Furthermore, one in five organizations surveyed by Cobalt reported experiencing an LLM security incident in the past year, with a further 18% \u201cunsure\u201d and 19% preferring not to answer.<\/p>\n<p>Third-party security experts quizzed by CSO say Cobalt\u2019s findings align with what they\u2019ve seen on the ground.<\/p>\n<p>\u201cAI systems are being rolled out quickly, but often without the same mature security controls, testing discipline, and governance applied to conventional enterprise software,\u201d says Benny Lakunishok, CEO and co-founder of Zero Networks. \u201cThat naturally increases the share of serious findings.\u201d<\/p>\n<p>William Wright, CEO of penetration testing firm Closed Door Security, argues that the main issue <a href=\"https:\/\/www.csoonline.com\/article\/4116923\/output-from-vibe-coding-tools-prone-to-critical-security-flaws-study-finds.html\">comes from vibe coders writing systems<\/a>.<\/p>\n<p>\u201cAI only does what it\u2019s told, for the most part, and systems that get deployed are usually cobbled together by people without the technical knowledge,\u201d Wright adds. \u201cThe same people then are expected to fix the issue, so it\u2019s a vicious circle.\u201d<\/p>\n<p>David Girvin, AI security researcher at Sumo Logic, agrees.<\/p>\n<p>\u201cLLM-driven systems are showing a higher percentage of high-risk findings because we\u2019ve essentially taken a probabilistic engine, plugged it directly into business workflows, and hoped it behaves,\u201d he says. \u201cThat\u2019s not a security strategy.\u201d<\/p>\n<h2 class=\"wp-block-heading\" id=\"emerging-attack-surfaces-larger-blast-radius\">Emerging attack surfaces, larger blast radius<\/h2>\n<p>The top concern is prompt injection, now ranked by OWASP as the <a href=\"https:\/\/www.csoonline.com\/article\/575497\/owasp-lists-10-most-critical-large-language-model-vulnerabilities.html\">No. 
Experts say there are several main reasons AI systems tend to generate a higher percentage of high-risk vulnerabilities:

- **AI systems introduce newer attack surfaces many organizations are still learning to defend.** [These risk vectors](https://www.csoonline.com/article/4110008/top-cyber-threats-to-your-ai-systems-and-infrastructure.html) include prompt injection, insecure plug-ins, data leakage, model supply-chain risk, unsafe agent behavior, excessive permissions, and over-trusted integrations with internal systems.
- **The blast radius for AI system flaws can be much larger when something goes wrong.** Many LLM deployments are connected to internal knowledge bases, workflows, code repositories, customer data, or privileged tools. That means a single weakness can expose multiple systems (see the sketch after this list).
- **AI system vulnerability remediation ownership is often fragmented.** "AI initiatives typically span engineering, security, legal, procurement, and business teams," according to Zero Networks' Lakunishok. "That slows fixes and helps explain why remediation rates are lower than for traditional applications."
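To make the blast-radius point concrete, here is a hedged sketch of deny-by-default tool allowlists for agents. The role names and tools are invented for illustration; the idea is simply that a manipulated model inherits only what its role was explicitly granted.

```python
# Sketch: deny-by-default tool allowlists per agent role. A prompt-injected
# model can only request tools its role was granted, which shrinks the
# blast radius of a successful compromise. All names are illustrative.

TOOL_REGISTRY = {
    "lookup_order_status": lambda order_id: f"order {order_id}: shipped",
    "search_public_docs": lambda query: f"top hits for '{query}'",
    "export_customer_records": lambda: "...sensitive data...",
}

ALLOWED_TOOLS = {
    # No role is granted export_customer_records without explicit review.
    "support_bot": {"lookup_order_status", "search_public_docs"},
}

class ToolNotPermitted(Exception):
    pass

def dispatch_tool(agent_role: str, tool_name: str, **kwargs):
    # Fail closed: a tool the role was never granted is refused, even if
    # an attacker steered the model into requesting it.
    if tool_name not in ALLOWED_TOOLS.get(agent_role, set()):
        raise ToolNotPermitted(f"{agent_role} may not call {tool_name}")
    return TOOL_REGISTRY[tool_name](**kwargs)

print(dispatch_tool("support_bot", "lookup_order_status", order_id="A17"))
try:
    dispatch_tool("support_bot", "export_customer_records")
except ToolNotPermitted as err:
    print(err)  # support_bot may not call export_customer_records
```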
## No remediation playbook

Adrian Furtuna, founder and CEO at Pentest-Tools.com, underscores that Cobalt's finding of low remediation rates for LLMs and AIs is more telling than the high-risk rate.

"A 38% fix rate for high-risk LLM findings is low even by the standards of application security, where remediation has always lagged discovery," Furtuna says. "What that gap reflects is that development teams don't yet have established patterns for fixing LLM vulnerabilities the way they do for, say, SQL injection or XXE [XML External Entity injection]."

When a developer sees a traditional injection issue, they know the remediation playbook; for flaws in AI-based systems, no such established procedure exists.

"When they see a prompt injection chain or an insecure tool call boundary, they often don't [have a playbook], and that uncertainty stalls action even when the severity rating is clear," Furtuna notes.

Architecture and maturity factors also play a role in AI systems producing a greater percentage of high-risk vulnerabilities. Moreover, LLM integrations concentrate trust in ways that traditional application components avoid. As a result, the attack surface broadens, and trust boundaries are often implicit rather than explicitly enforced, magnifying the impact of any flaws, Furtuna says.

"A model that has access to internal tools, retrieval pipelines, and external APIs represents a large-radius blast zone if its input handling is weak," he adds. "Prompt injection in that context isn't a nuisance — it's a path to data exfiltration, privilege escalation, or supply chain manipulation, depending on what the model can reach."

Secure development practices for LLM integrations are still forming, an immaturity or knowledge gap that shows up directly in pen test findings.

"The [OWASP LLM Top 10](https://owasp.org/www-project-top-10-for-large-language-model-applications/) is relatively recent," Furtuna explains. "Most developers building on top of foundation models are doing so without the equivalent of decades of institutional knowledge about input validation, output handling, and authorization boundary design that exists for web applications."

LLMs collapse trust boundaries — lacking the predictable input/output flows of regular legacy apps — a problem compounded by the wide-ranging permissions routinely granted to AI systems.

"Most organizations try to secure agents and LLM systems at the identity layer, give the model a role and hope guardrails hold," says Sumo Logic's Girvin. "But if an attacker can steer the model — prompt injection, social engineering, etc. — they inherit its permissions. That's why the impact spikes."

HackerOne's Sokhey adds: "AI applications are producing a disproportionate number of high-risk issues because they create an entirely new layer of attack surface, one that is non-deterministic, rapidly changing, and often connected to sensitive data, internal systems, and autonomous actions."

## Countermeasures

Experts advise CISOs to stop [skipping security hardening in a rush to implement AI](https://www.csoonline.com/article/3529615/companies-skip-security-hardening-in-rush-to-adopt-ai.html) and instead treat AI systems as production systems rather than experiments.

"That means threat modeling before deployment, red teaming and adversarial testing throughout the lifecycle, least-privilege access for models and agents, strong identity controls, segmentation around sensitive data, continuous monitoring, and rapid containment mechanisms when abnormal behaviour is detected," says Zero Networks' Lakunishok.

Pentest-Tools.com's Furtuna argues that established best practices can be applied to the new architecture of LLMs, provided they are deliberately designed into the systems from the get-go rather than bolted on as an afterthought.

"Strict tool call schemas, explicit output validation before downstream actions execute, human approval gates on high-consequence operations, and minimal privilege for model-accessible integrations all limit what a successfully exploited prompt injection can actually reach," Furtuna says.
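As an illustration of how two of those controls compose, here is a minimal, non-authoritative Python sketch of strict tool-call schema validation plus a human approval gate. The schema rules, tool names, and high-consequence list are assumptions for the example, not anything prescribed by Cobalt or Furtuna.

```python
import re

# Sketch of two controls from the quote above: a strict schema on
# model-proposed tool calls, and a human approval gate before any
# high-consequence operation runs. All names here are illustrative.

HIGH_CONSEQUENCE = {"delete_records", "send_external_email"}

def validate_tool_call(call: dict) -> dict:
    # Strict schema: exactly the expected keys, with expected shapes.
    if set(call) != {"tool", "args"}:
        raise ValueError("tool call has unexpected structure")
    if not isinstance(call["tool"], str) or not re.fullmatch(r"[a-z_]{1,40}", call["tool"]):
        raise ValueError("tool name fails schema")
    if not isinstance(call["args"], dict):
        raise ValueError("args must be a dict")
    return call

def execute(call: dict, approved_by_human: bool = False) -> str:
    call = validate_tool_call(call)
    # Approval gate: the model alone can never trigger a high-consequence
    # action, no matter what a prompt injection made it request.
    if call["tool"] in HIGH_CONSEQUENCE and not approved_by_human:
        return f"{call['tool']} queued for human review"
    return f"executing {call['tool']} with {call['args']}"

print(execute({"tool": "lookup_order_status", "args": {"order_id": "A17"}}))
print(execute({"tool": "send_external_email", "args": {"to": "x@example.com"}}))
```

Validation runs before anything executes, so a malformed or manipulated call fails closed rather than reaching downstream systems.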