Building a Comprehensive Threat Actor Dataset from OSINT

Introduction

Threat intelligence is a critical component of modern cybersecurity, helping organizations identify, analyze, and mitigate threats from advanced persistent threat (APT) groups and other malicious actors. Open-source intelligence (OSINT) provides a wealth of unstructured and semi-structured data that, when properly curated, can form a robust threat actor dataset. This article explores key resources and methodologies for compiling such datasets, along with practical commands and tools to automate the process.

Learning Objectives

  • Identify key OSINT sources for threat actor data.
  • Learn how to consolidate and cross-reference threat intelligence.
  • Automate data collection and enrichment using scripting and APIs.

You Should Know

1. Extracting Threat Actor Data from Malpedia

Malpedia provides structured data on 821 adversaries at the time of writing, including aliases and technical indicators.

Command (Python – `requests` library):

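import requests

# Fetch all actor records; the response layout is assumed, so inspect it before relying on field names.
response = requests.get("https://malpedia.caad.fkie.fraunhofer.de/api/get/actors", timeout=30)
response.raise_for_status()
actors = response.json()
# Accept either a dict keyed by actor ID or a plain list of records.
records = actors.values() if isinstance(actors, dict) else actors
for actor in records:
    print(actor.get("value") or actor.get("name"), actor.get("description"))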

Steps:

1. Install Python’s `requests` library (`pip install requests`).

2. Use the Malpedia API to fetch actor data.

3. Parse the JSON output to extract names and descriptions.
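
The parsing step can be kept separate from the network call so it is easy to test. The sketch below assumes each actor record is a dictionary with a `value` (or `name`) field and an optional `meta.synonyms` list; those field names are assumptions about the payload, and `flatten_actors` is a hypothetical helper, not part of any Malpedia client.

```python
def flatten_actors(payload):
    """Turn an actor payload (dict keyed by ID, or list) into flat rows."""
    if isinstance(payload, dict):
        items = payload.items()
    else:
        items = ((None, record) for record in payload)
    rows = []
    for actor_id, record in items:
        meta = record.get("meta") or {}
        rows.append({
            "id": actor_id,
            "name": record.get("value") or record.get("name") or actor_id,
            # De-duplicate and sort aliases so repeated runs give stable output.
            "aliases": sorted(set(meta.get("synonyms") or [])),
            "description": (record.get("description") or "").strip(),
        })
    return rows


# Hand-written sample in the assumed layout (not real API output).
sample = {
    "example_actor": {
        "value": "Example Actor",
        "description": " Illustrative record. ",
        "meta": {"synonyms": ["EA-1", "Sample Group", "EA-1"]},
    }
}
rows = flatten_actors(sample)
print(rows[0]["name"], rows[0]["aliases"])
```

Passing `response.json()` from the request above to `flatten_actors` would give one row per actor, provided the real payload matches the assumed layout.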

2. Querying MITRE ATT&CK for Threat Group Tactics

MITRE ATT&CK provides structured threat group profiles with Tactics, Techniques, and Procedures (TTPs).

Command (curl):

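curl -s "https://raw.githubusercontent.com/mitre/cti/master/enterprise-attack/enterprise-attack.json" | jq '.objects[] | select(.type == "intrusion-set") | {name, description}'

Steps:

1. Use `curl` to download the Enterprise ATT&CK STIX bundle that MITRE publishes on GitHub (ATT&CK data is distributed as STIX JSON rather than through a REST endpoint on attack.mitre.org).

2. Pipe the output to `jq` and keep only `intrusion-set` objects, which is how STIX represents threat groups.

3. Extract group names and descriptions for analysis.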
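
The same extraction can be done in Python once the STIX bundle has been loaded with `json.load`. This is a minimal sketch: `extract_groups` is a hypothetical helper, and the sample bundle is hand-written to mirror the fields ATT&CK uses for groups (`aliases`, `external_references`, `revoked`, `x_mitre_deprecated`).

```python
def extract_groups(bundle):
    """Return name, ATT&CK ID and aliases for each intrusion-set in a STIX bundle."""
    groups = []
    for obj in bundle.get("objects", []):
        if obj.get("type") != "intrusion-set":
            continue
        # Skip entries that have been revoked or deprecated.
        if obj.get("revoked") or obj.get("x_mitre_deprecated"):
            continue
        attack_id = next(
            (ref.get("external_id")
             for ref in obj.get("external_references", [])
             if ref.get("source_name") == "mitre-attack"),
            None,
        )
        groups.append({
            "attack_id": attack_id,
            "name": obj.get("name"),
            # The alias list usually repeats the primary name, so drop it.
            "aliases": [a for a in obj.get("aliases", []) if a != obj.get("name")],
        })
    return groups


# Hand-written sample bundle (illustrative values only).
bundle = {
    "type": "bundle",
    "objects": [
        {
            "type": "intrusion-set",
            "name": "Example Group",
            "aliases": ["Example Group", "EG"],
            "external_references": [
                {"source_name": "mitre-attack", "external_id": "G9999"}
            ],
        },
        {"type": "attack-pattern", "name": "Not a group"},
        {"type": "intrusion-set", "name": "Old Group", "revoked": True},
    ],
}
groups = extract_groups(bundle)
print(groups)
```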

3. Automating MISP Galaxy Data Ingestion

MISP Galaxy aggregates threat actor aliases and relationships.

Command (Python – `pymisp`):

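from pymisp import PyMISP

misp = PyMISP("https://your-misp-instance.com", "API_KEY")
# Locate the threat-actor galaxy, then fetch it together with its clusters.
# The response layout assumed below may differ between MISP versions; inspect it first.
galaxy_id = next(g["Galaxy"]["id"] for g in misp.galaxies() if g["Galaxy"]["type"] == "threat-actor")
galaxy = misp.get_galaxy(galaxy_id, withCluster=True)
for cluster in galaxy.get("GalaxyCluster", []):
    synonyms = [e["value"] for e in cluster.get("GalaxyElement", []) if e["key"] == "synonyms"]
    print(cluster["value"], synonyms)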

Steps:

1. Install `pymisp` (`pip install pymisp`).

2. Authenticate with your MISP instance.

3. Extract threat actor clusters and aliases.
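
Once clusters are available, an alias index makes lookups by any known name straightforward. The sketch below assumes the layout used by the cluster files in the public misp-galaxy repository (a top-level `values` list whose entries carry `value` and `meta.synonyms`); `build_alias_index` is a hypothetical helper and the sample data is invented.

```python
def build_alias_index(galaxy):
    """Map every lower-cased name or synonym to its canonical cluster value."""
    index = {}
    for cluster in galaxy.get("values", []):
        canonical = cluster["value"]
        synonyms = (cluster.get("meta") or {}).get("synonyms") or []
        for label in [canonical, *synonyms]:
            # setdefault keeps the first mapping when two clusters share a label.
            index.setdefault(label.strip().lower(), canonical)
    return index


# Invented sample in the assumed cluster-file layout.
galaxy = {
    "values": [
        {"value": "Example Actor", "meta": {"synonyms": ["EA-1", "Sample Group"]}},
        {"value": "Other Actor"},
    ]
}
alias_index = build_alias_index(galaxy)
print(alias_index["ea-1"])
```

Keeping the first mapping on a collision is a deliberate simplification; shared labels are common in practice and are better flagged for manual review than silently resolved.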

4. Enriching Data with APTMap

APTMap combines multiple datasets for cross-referencing.

Command (Python – `pandas`):

import pandas as pd

# Check the URL and column names against the current APTMap data before relying on them.
df = pd.read_csv("https://aptmap.net/data/apt_groups.csv")
print(df[["name", "aliases", "suspected_origin"]].head())

Steps:

1. Use `pandas` to load APTMap’s CSV data.

2. Filter relevant columns (name, aliases, origin).

3. Merge with other datasets for enrichment.
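
The merge step usually fails on small naming differences ("APT 28" versus "apt28") rather than on the join itself. Below is a dependency-free sketch of a left join on a normalized name; `normalize` and `merge_by_name` are hypothetical helpers, and the rows are invented. The same normalized key could be added as a column and passed to `pandas.merge`.

```python
import re


def normalize(name):
    """Lower-case and drop spaces and punctuation so 'APT 28' matches 'apt28'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def merge_by_name(primary, secondary):
    """Left-join two lists of dicts on the normalized 'name' field."""
    lookup = {normalize(row["name"]): row for row in secondary}
    merged = []
    for row in primary:
        match = lookup.get(normalize(row["name"]), {})
        # Fields from primary win; secondary only fills in what is missing.
        merged.append({**match, **row})
    return merged


# Invented rows standing in for two sources.
primary = [{"name": "Example-Group 1", "suspected_origin": "unknown"}]
secondary = [{"name": "examplegroup1", "aliases": "EG-1"}]
merged = merge_by_name(primary, secondary)
print(merged)
```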

5. Bulk Exporting from Mandiant (Google Threat Intelligence)

Mandiant’s reports provide deep insights into APT groups.

Command (wget for bulk download):

wget --recursive --accept pdf --no-parent https://www.mandiant.com/resources/reports 

Steps:

1. Use `wget` to download Mandiant reports.

2. Extract threat actor details using PDF parsers like `pdftotext`.

3. Combine with structured datasets.
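
After a report has been converted to text (for example with `pdftotext report.pdf report.txt`), actor labels can be pulled out with a regular expression. This sketch assumes the reports use the APT, FIN and UNC numbering prefixes; the pattern and the helper name `count_actor_labels` are illustrative, and the sample text is invented.

```python
import re
from collections import Counter

# Matches labels such as APT41, FIN7 or UNC2452; extend the prefixes as needed.
ACTOR_LABEL = re.compile(r"\b(APT|FIN|UNC)[ -]?(\d{1,4})\b")


def count_actor_labels(text):
    """Count actor labels in report text, normalizing 'APT 41' to 'APT41'."""
    return Counter(
        f"{prefix}{number}" for prefix, number in ACTOR_LABEL.findall(text)
    )


# Invented sample text standing in for pdftotext output.
sample_text = (
    "Sample text: APT41 was observed. Later pages mention APT 41 again, "
    "plus FIN7 and UNC2452."
)
counts = count_actor_labels(sample_text)
print(counts.most_common())
```

Counting mentions rather than just listing them gives a rough signal of which actor a report is mainly about, which helps when linking it to a structured record.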

6. Automating Threat Intel with SOCRadar API

SOCRadar offers an API for threat actor profiling.

Command (curl with API key):

curl -X GET "https://api.socradar.com/threat/actors" -H "Authorization: Bearer YOUR_API_KEY" 

Steps:

1. Obtain an API key from SOCRadar.

2. Query threat actor endpoints.

3. Store results in a structured format (JSON/CSV).
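
Storing results in both JSON and CSV needs only the standard library. The sketch below assumes the API response has already been reduced to a list of dictionaries; the field names and the helper `to_csv` are hypothetical and do not reflect SOCRadar's actual schema.

```python
import csv
import io
import json


def to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text; list values are joined with '|'."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        row = {
            key: "|".join(value) if isinstance(value, list) else value
            for key, value in record.items()
        }
        writer.writerow(row)
    return buffer.getvalue()


# Invented record with hypothetical field names.
records = [
    {
        "name": "Example Actor",
        "aliases": ["EA-1", "Sample Group"],
        "source": "demo",
        "extra": "dropped",
    }
]
csv_text = to_csv(records, ["name", "aliases", "source"])
json_text = json.dumps(records, indent=2)
print(csv_text)
```

JSON keeps nested fields such as alias lists intact, while CSV is easier to load into spreadsheets; writing both from the same records avoids the two drifting apart.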

7. Cloud-Focused Threat Actors from Wiz

Wiz tracks cloud-specific threats.

Command (AWS CLI for cross-checking):

aws guardduty list-threat-intel-sets --detector-id YOUR_DETECTOR_ID --region us-east-1

Steps:

1. Find your detector ID with `aws guardduty list-detectors`, then list the threat intel sets configured in GuardDuty.

2. Compare those sets with Wiz’s cloud threat list.

3. Identify overlaps in IOCs (Indicators of Compromise).
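
Finding overlaps comes down to a set intersection once both indicator lists are normalized. This sketch assumes two plain lists of indicators have already been exported from each side; `normalize_ioc` and `ioc_overlap` are hypothetical helpers, and the values use reserved documentation addresses and domains.

```python
def normalize_ioc(value):
    """Lower-case an indicator and undo common defanging such as hxxp and [.]."""
    value = value.strip().lower()
    value = value.replace("[.]", ".").replace("hxxp", "http")
    return value.rstrip("/")


def ioc_overlap(list_a, list_b):
    """Return the sorted indicators present in both lists after normalization."""
    set_a = {normalize_ioc(v) for v in list_a if v.strip()}
    set_b = {normalize_ioc(v) for v in list_b if v.strip()}
    return sorted(set_a & set_b)


# Invented indicators using documentation ranges and example domains.
guardduty_iocs = ["192.0.2.10", "Bad.Example.com", "198.51.100.7"]
vendor_iocs = ["192.0.2[.]10", "bad[.]example[.]com", "203.0.113.5"]
overlap = ioc_overlap(guardduty_iocs, vendor_iocs)
print(overlap)
```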

What Undercode Says

  • Key Takeaway 1: Consolidating OSINT sources reduces manual effort and improves threat visibility.
  • Key Takeaway 2: Automation (APIs, scripting) is essential for scalable threat intelligence.

Analysis:

The fragmentation of threat actor data across vendors necessitates automated aggregation. While no single source provides complete coverage, combining structured (MITRE, MISP) and unstructured (Mandiant reports) data yields a comprehensive dataset. Future improvements may involve AI-driven clustering to resolve aliases and track evolving TTPs.
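
Before reaching for machine learning, alias resolution can start from a deterministic baseline: treat every record as a set of labels and merge records that share any label, directly or transitively. The sketch below uses a union-find (disjoint-set) structure for this. It is a plain graph technique, not AI-driven clustering, and `resolve_aliases` and the sample labels are invented for illustration.

```python
def resolve_aliases(records):
    """Group labels that are linked, directly or transitively, by shared records."""
    parent = {}

    def find(label):
        parent.setdefault(label, label)
        while parent[label] != label:
            parent[label] = parent[parent[label]]  # path halving
            label = parent[label]
        return label

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for labels in records:
        cleaned = [label.strip().lower() for label in labels if label.strip()]
        for label in cleaned:
            find(label)  # register single-label records too
        for other in cleaned[1:]:
            union(cleaned[0], other)

    groups = {}
    for label in list(parent):
        groups.setdefault(find(label), set()).add(label)
    return sorted(sorted(group) for group in groups.values())


# Each inner list is one source's set of names for an actor (invented).
records = [
    ["Alpha Group", "AG-1"],
    ["ag-1", "Team A"],
    ["Beta Group"],
]
clusters = resolve_aliases(records)
print(clusters)
```

Transitive merging is aggressive: one wrongly shared alias joins two unrelated actors, so the output is best treated as candidate clusters for analyst review.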

Prediction

As APT groups increasingly leverage AI for attacks, threat intelligence platforms will adopt machine learning for real-time actor attribution and behavior prediction. Open-source datasets will remain vital but require stricter standardization for interoperability.

For further exploration, check the Awesome Threat Actor Resources repository.

Reported By: Ysergeev Threatintel – Hackers Feeds