Dataspot 🔥

Find data concentration patterns and dataspots in your datasets

PyPI · License: MIT · Maintained by Frauddi · Python 3.9+

Dataspot automatically discovers where your data concentrates, helping you identify patterns, anomalies, and insights in datasets. Originally developed for fraud detection at Frauddi, now available as open source.

✨ Why Dataspot?

  • 🎯 Purpose-built for finding data concentrations, not just clustering
  • 🔍 Fraud detection ready - spot suspicious behavior patterns
  • Simple API - get insights in 3 lines of code
  • 📊 Hierarchical analysis - understand data at multiple levels
  • 🔧 Flexible filtering - customize analysis with powerful options
  • 📈 Field-tested - validated in real fraud detection systems

🚀 Quick Start

```bash
pip install dataspot
```

```python
from dataspot import Dataspot
from dataspot.models.finder import FindInput, FindOptions

# Sample transaction data
data = [
    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},
    {"country": "US", "device": "mobile", "amount": "medium", "user_type": "premium"},
    {"country": "EU", "device": "desktop", "amount": "low", "user_type": "free"},
    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},
]

# Find concentration patterns
dataspot = Dataspot()
result = dataspot.find(
    FindInput(data=data, fields=["country", "device", "user_type"]),
    FindOptions(min_percentage=10.0, limit=5)
)

# Results show where data concentrates
for pattern in result.patterns:
    print(f"{pattern.path} → {pattern.percentage}% ({pattern.count} records)")

# Output:
# country=US > device=mobile > user_type=premium → 75.0% (3 records)
# country=US > device=mobile → 75.0% (3 records)
# device=mobile → 75.0% (3 records)
```
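Conceptually, the hierarchy in that output is prefix counting: for each prefix of the chosen fields, count the matching records and divide by the total. Here is a minimal plain-Python sketch of that idea — a simplification for intuition, not Dataspot's actual algorithm (the library also surfaces standalone sub-patterns such as `device=mobile` on its own):

```python
from collections import Counter

data = [
    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},
    {"country": "US", "device": "mobile", "amount": "medium", "user_type": "premium"},
    {"country": "EU", "device": "desktop", "amount": "low", "user_type": "free"},
    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},
]
fields = ["country", "device", "user_type"]

# Count every field-path prefix, e.g. ("country=US", "device=mobile")
counts = Counter()
for record in data:
    path = tuple(f"{f}={record[f]}" for f in fields)
    for depth in range(1, len(path) + 1):
        counts[path[:depth]] += 1

total = len(data)
for path, count in counts.most_common():
    pct = 100.0 * count / total
    if pct >= 10.0:  # analogous to min_percentage=10.0
        print(f"{' > '.join(path)} → {pct}% ({count} records)")
```

The 75.0% lines in the Quick Start output fall out of exactly this arithmetic: 3 of the 4 records share the `country=US > device=mobile` prefix.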

🎯 Real-World Use Cases

🚨 Fraud Detection

```python
from dataspot.models.finder import FindInput, FindOptions

# Find suspicious transaction patterns
result = dataspot.find(
    FindInput(
        data=transactions,
        fields=["country", "payment_method", "time_of_day"]
    ),
    FindOptions(min_percentage=15.0, contains="crypto")
)

# Spot unusual concentrations that might indicate fraud
for pattern in result.patterns:
    if pattern.percentage > 30:
        print(f"⚠️ High concentration: {pattern.path}")
```

📊 Business Intelligence

```python
from dataspot.models.analyzer import AnalyzeInput, AnalyzeOptions

# Discover customer behavior patterns
insights = dataspot.analyze(
    AnalyzeInput(
        data=customer_data,
        fields=["region", "device", "product_category", "tier"]
    ),
    AnalyzeOptions(min_percentage=10.0)
)

print(f"📈 Found {len(insights.patterns)} concentration patterns")
print(f"🎯 Top opportunity: {insights.patterns[0].path}")
```

🔍 Temporal Analysis

```python
from dataspot.models.compare import CompareInput, CompareOptions

# Compare patterns between time periods
comparison = dataspot.compare(
    CompareInput(
        current_data=this_month_data,
        baseline_data=last_month_data,
        fields=["country", "payment_method"]
    ),
    CompareOptions(
        change_threshold=0.20,
        statistical_significance=True
    )
)

print(f"📊 Changes detected: {len(comparison.changes)}")
print(f"🆕 New patterns: {len(comparison.new_patterns)}")
```

🌳 Hierarchical Visualization

```python
from dataspot.models.tree import TreeInput, TreeOptions

# Build hierarchical tree for data exploration
tree = dataspot.tree(
    TreeInput(
        data=sales_data,
        fields=["region", "product_category", "sales_channel"]
    ),
    TreeOptions(min_value=10, max_depth=3, sort_by="value")
)

print(f"🌳 Total records: {tree.value}")
print(f"📊 Main branches: {len(tree.children)}")

# Navigate the hierarchy
for region in tree.children:
    print(f"  📍 {region.name}: {region.value} records")
    for product in region.children:
        print(f"    📦 {product.name}: {product.value} records")
```
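The tree is essentially nested grouping. For intuition only, here is how the same two-level structure could be built in plain Python with stdlib tools (the sample records are hypothetical, and this is not the library's internal representation):

```python
from collections import defaultdict

# Hypothetical sales records, for illustration only
sales_data = [
    {"region": "NA", "product_category": "gadgets", "sales_channel": "web"},
    {"region": "NA", "product_category": "gadgets", "sales_channel": "retail"},
    {"region": "NA", "product_category": "books", "sales_channel": "web"},
    {"region": "EU", "product_category": "books", "sales_channel": "web"},
]

# Two-level tree: region -> product_category -> record count
tree = defaultdict(lambda: defaultdict(int))
for r in sales_data:
    tree[r["region"]][r["product_category"]] += 1

for region, products in tree.items():
    print(f"📍 {region}: {sum(products.values())} records")
    for product, count in products.items():
        print(f"  📦 {product}: {count} records")
```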

🤖 Auto Discovery

```python
from dataspot.models.discovery import DiscoverInput, DiscoverOptions

# Automatically discover important patterns
discovery = dataspot.discover(
    DiscoverInput(data=transaction_data),
    DiscoverOptions(max_fields=3, min_percentage=15.0)
)

print(f"🎯 Top patterns discovered: {len(discovery.top_patterns)}")
for field_ranking in discovery.field_ranking[:3]:
    print(f"📈 {field_ranking.field}: {field_ranking.score:.2f}")
```

🛠️ Core Methods

| Method | Purpose | Input Model | Options Model | Output Model |
|--------|---------|-------------|---------------|--------------|
| find() | Find concentration patterns | FindInput | FindOptions | FindOutput |
| analyze() | Statistical analysis | AnalyzeInput | AnalyzeOptions | AnalyzeOutput |
| compare() | Temporal comparison | CompareInput | CompareOptions | CompareOutput |
| discover() | Auto pattern discovery | DiscoverInput | DiscoverOptions | DiscoverOutput |
| tree() | Hierarchical visualization | TreeInput | TreeOptions | TreeOutput |

Advanced Filtering Options

```python
# Complex analysis with multiple criteria
result = dataspot.find(
    FindInput(
        data=data,
        fields=["country", "device", "payment"],
        query={"country": ["US", "EU"]}  # Pre-filter data
    ),
    FindOptions(
        min_percentage=10.0,    # Only patterns with >10% concentration
        max_depth=3,            # Limit hierarchy depth
        contains="mobile",      # Must contain "mobile" in pattern
        min_count=50,           # At least 50 records
        sort_by="percentage",   # Sort by concentration strength
        limit=20                # Top 20 patterns
    )
)
```

⚡ Performance

Dataspot delivers consistent, predictable performance with exceptionally efficient memory usage and linear scaling.

🚀 Real-World Performance

| Dataset Size | Processing Time | Memory Usage | Patterns Found |
|--------------|-----------------|--------------|----------------|
| 1,000 records | ~5ms | ~1.4MB | 12 patterns |
| 10,000 records | ~43ms | ~2.8MB | 12 patterns |
| 100,000 records | ~375ms | ~2.9MB | 20 patterns |
| 1,000,000 records | ~3.7s | ~3.0MB | 20 patterns |

Benchmark Methodology: Performance was measured over 5 iterations per dataset size on a MacBook Pro (M-series). Test data specifications:

  • JSON Size: ~164 bytes per JSON record (~0.16 KB each)
  • JSON Structure: 8 keys per JSON record (country, device, payment_method, amount, user_type, channel, status, id)
  • Analysis Scope: 4 fields analyzed simultaneously (country, device, payment_method, user_type)
  • Configuration: min_percentage=5.0, limit=50 patterns
  • Results: Pattern counts stay stable (12–20 patterns) as dataset size grows
  • Variance: Minimal timing variance (±1-6ms), demonstrating algorithmic stability
  • Memory Efficiency: Near-constant memory usage regardless of dataset size
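A timing harness in the same spirit as the methodology above (mean and spread over 5 iterations) can be sketched with the standard library. The workload below is a stand-in counting routine, not `dataspot.find()`, so the snippet runs without the package installed; the numbers it prints are illustrative only:

```python
import time
import statistics

def stand_in_workload(records):
    # Placeholder for dataspot.find(...): tallies field-value pairs
    counts = {}
    for r in records:
        for k, v in r.items():
            counts[(k, v)] = counts.get((k, v), 0) + 1
    return counts

def benchmark(workload, records, iterations=5):
    timings_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload(records)
        timings_ms.append((time.perf_counter() - start) * 1000)
    return statistics.mean(timings_ms), statistics.stdev(timings_ms)

records = [{"country": "US", "device": "mobile", "id": str(i)} for i in range(10_000)]
mean_ms, stdev_ms = benchmark(stand_in_workload, records)
print(f"{len(records):,} records: {mean_ms:.1f}ms ± {stdev_ms:.1f}ms")
```

`time.perf_counter()` is the appropriate clock here because it is monotonic and has the highest available resolution for short intervals.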

💡 Performance Tips

```python
# Optimize for speed
result = dataspot.find(
    FindInput(data=large_dataset, fields=fields),
    FindOptions(
        min_percentage=10.0,   # Skip low-concentration patterns
        max_depth=3,           # Limit hierarchy depth
        limit=100              # Cap results
    )
)

# Memory efficient processing
from dataspot.models.tree import TreeInput, TreeOptions

tree = dataspot.tree(
    TreeInput(data=data, fields=["country", "device"]),
    TreeOptions(min_value=10, top=5)  # Simplified tree
)
```

📈 What Makes Dataspot Different?

| Traditional Clustering | Dataspot Analysis |
|------------------------|-------------------|
| Groups similar data points | Finds concentration patterns |
| Equal-sized clusters | Identifies where data accumulates |
| Distance-based | Percentage and count based |
| Hard to interpret | Business-friendly hierarchy |
| Generic approach | Built for real-world analysis |

🎬 Dataspot in Action

[Demo: Dataspot in action, finding data concentration patterns]

See Dataspot discover concentration patterns and dataspots in real-time with hierarchical analysis and statistical insights.

📊 API Structure

Input Models

  • FindInput - Data and fields for pattern finding
  • AnalyzeInput - Statistical analysis configuration
  • CompareInput - Current vs baseline data comparison
  • DiscoverInput - Automatic pattern discovery
  • TreeInput - Hierarchical tree visualization

Options Models

  • FindOptions - Filtering and sorting for patterns
  • AnalyzeOptions - Statistical analysis parameters
  • CompareOptions - Change detection thresholds
  • DiscoverOptions - Auto-discovery constraints
  • TreeOptions - Tree structure customization

Output Models

  • FindOutput - Pattern discovery results with statistics
  • AnalyzeOutput - Enhanced analysis with insights and confidence scores
  • CompareOutput - Change detection results with significance tests
  • DiscoverOutput - Auto-discovery findings with field rankings
  • TreeOutput - Hierarchical tree structure with navigation

🔧 Installation & Requirements

```bash
# Install from PyPI
pip install dataspot

# Development installation
git clone https://github.com/frauddi/dataspot.git
cd dataspot
pip install -e ".[dev]"
```

Requirements:

  • Python 3.9+
  • No heavy dependencies (just standard library + optional speedups)

No findings