Cosdata

Cosdata: A cutting-edge AI data platform for next-gen search pipelines. Features semantic search, hybrid capabilities, real-time scalability, and ML integration. Designed for immutability and version control to enhance AI projects.


<p align="center"> <img src="org/logo.svg" alt="Cosdata" style="max-width: 100%; height: auto;"> </p> <p align="center"> <a href="https://cosdata.io"> <img src="https://img.shields.io/badge/www-cosdata.io-pink"> </a> <a href="https://github.com/cosdata/cosdata/actions"> <img src="https://img.shields.io/github/actions/workflow/status/cosdata/cosdata/ci.yml?label=build&color=green"> </a> <img src="https://img.shields.io/badge/language-Rust-yellow"> <img src="https://img.shields.io/badge/language-Python-black"> <br> <a href="https://discord.gg/QFsrBfFVVY"> <img src="https://img.shields.io/badge/Discord-Join%20Us-7289da?logo=discord&logoColor=white"> </a> <a href="https://www.linkedin.com/company/cosdata/"> <img src="https://img.shields.io/badge/our_journey-LinkedIn-blue"> </a> <a href="https://github.com/cosdata/cosdata/blob/master/LICENSE"> <img src="https://img.shields.io/badge/license-Apache--2.0-blue"> </a> <a href="https://github.com/cosdata/cosdata/pulls"> <img src="https://img.shields.io/github/issues-pr/cosdata/cosdata?color=pink"> </a> </p> </br> <p></p>

🚀 Overview

Cosdata is a next-generation retrieval infrastructure engineered for AI-native applications that demand relevance beyond simple vector similarity.

The Challenge

Traditional vector databases optimize for cosine similarity rather than what users actually find useful. Decades of search evolution prove that effective retrieval requires sophisticated ranking systems that understand context, incorporate multiple signals, and optimize for user satisfaction—not just mathematical proximity.

Our Solution

Built with immutability and version control at its core, Cosdata delivers a relevance-first architecture combining:

  • Multi-Modal Retrieval: Seamlessly integrate BM25 full-text search, HNSW dense vectors, SPLADE learned sparse embeddings, and metadata-rich sparse vectors in a unified platform
  • Context-Aware Capabilities: Leverage geofencing, hierarchical document organization, and explainable ranking that understands user intent and real-world complexity
  • Enterprise-Grade Architecture: Benefit from colocated storage, streaming ingestion, transactional versioning, and comprehensive security features

Proven Impact

Organizations using Cosdata achieve a 60-120% reduction in compute requirements while improving retrieval quality by 20-50% (NDCG@10). The unified architecture removes the need for external document stores and complex multi-database queries, which reduces infrastructure costs and latency.

<br>

💡 Why Cosdata?

The Cosine Similarity Problem

Most vector databases treat retrieval as a pure similarity problem—if two embeddings are mathematically close in vector space, they must be relevant to each other. This assumption is fundamentally flawed.

High cosine similarity ≠ High relevance to users.

Cosine similarity is the cosine of the angle between two embedding vectors, a geometric quantity determined by how the model was trained. This metric has no inherent connection to what users actually find useful or relevant. Two documents can be mathematically similar while being practically useless for a user's information need, or vice versa.
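To make the metric concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are hypothetical stand-ins for real embeddings, and nothing here reflects Cosdata's internal implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; real models use hundreds of dimensions.
query = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]  # same direction as the query, larger magnitude
doc_orthogonal = [0.0, 3.0, 0.0]      # shares no direction with the query

print(cosine_similarity(query, doc_same_direction))  # approximately 1.0
print(cosine_similarity(query, doc_orthogonal))      # 0.0
```

The score depends only on direction in the embedding space. Nothing in the computation looks at user behavior, recency, or metadata, which is the gap the sections below address.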

The Relevance-First Approach

True relevance requires understanding context, not just proximity.

Decades of search engine evolution—from Google's PageRank to modern recommendation systems—prove that effective retrieval demands:

  • Multiple signals: Lexical matching, semantic understanding, metadata, recency, authority, and user context
  • Ground truth from users: Real relevance comes from actual user behavior and expert judgments, not embedding distances
  • Explainable ranking: Systems must show why results matter, not just that they're "similar"
  • Business logic integration: Geographic constraints, temporal filters, hierarchical relationships, and domain-specific rules

How Cosdata Delivers Relevance

Cosdata is built from the ground up to optimize for user satisfaction, not mathematical convenience:

  1. Hybrid Multi-Modal Search: Combines BM25 lexical matching, dense vectors (HNSW), SPLADE learned sparse embeddings, and metadata-rich representations—letting each signal contribute what it does best

  2. Context-Aware Ranking: Native support for geofencing, hierarchical document structures, temporal filtering, and custom business logic without requiring everything to be embedded

  3. Explainable Results: Every result comes with transparent scoring showing semantic similarity contributions, metadata matches, geographic relevance, and hierarchical context

  4. Proven Quality Metrics: We measure success using NDCG (Normalized Discounted Cumulative Gain) and recall against human-judged relevance datasets like BEIR—not just precision against our own similarity rankings
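Item 1 above describes letting several retrieval signals contribute to one ranking. One widely used, generic way to merge ranked lists is reciprocal rank fusion (RRF). The sketch below illustrates that technique only; it is not confirmed to be the method Cosdata uses, and the document IDs, the three input rankings, and the constant `k=60` are assumptions chosen for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending total score,
    with ties broken alphabetically by ID.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical top-3 results from three independent retrievers.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]
sparse_ranking = ["doc_b", "doc_a", "doc_e"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking, sparse_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c', 'doc_e']
```

RRF uses only rank positions, so it does not need the raw scores of the different retrievers to be on a comparable scale. That makes it a convenient baseline when mixing lexical and vector signals.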

Real-World Impact

Organizations using Cosdata see:

  • 20-50% improvement in retrieval quality (NDCG@10) compared to pure vector similarity approaches
  • 60-120% reduction in compute requirements through efficient multi-modal indexing
  • Sub-100ms response times while maintaining relevance quality
  • Simplified architecture with colocated storage eliminating external document stores

Bottom line: Cosdata treats retrieval as a relevance problem, not a storage problem. We've learned from decades of search evolution to build infrastructure that understands what users actually need.

<br>

📊 Benchmarks

Cosdata delivers exceptional performance across all retrieval modalities. Our benchmarks use industry-standard datasets and compare against leading solutions to demonstrate real-world performance gains.

🔍 Full-Text Search (BM25)

Our custom BM25 implementation outperforms Elasticsearch with dramatically higher throughput and lower latency while maintaining comparable ranking quality.
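For readers unfamiliar with BM25, the following is a minimal sketch of the textbook Okapi BM25 scoring function. It is not Cosdata's implementation; the whitespace tokenization, the toy corpus, the Lucene-style IDF variant, and the parameters `k1=1.2` and `b=0.75` are assumptions for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`."""
    if not docs:
        raise ValueError("corpus must contain at least one document")
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Non-negative IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized in proportion to b.
            length_norm = 1.0 - b + b * len(d) / avg_len
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical three-document corpus, tokenized on whitespace.
corpus = [
    "vector search with dense embeddings".split(),
    "full text search with bm25 ranking".split(),
    "cooking pasta at home".split(),
]
scores = bm25_scores(["bm25", "search"], corpus)
for doc, score in zip(corpus, scores):
    print(f"{score:.3f}  {' '.join(doc)}")
```

The second document matches both query terms and scores highest, the first matches only "search", and the third matches nothing and scores zero.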

Performance Highlights

  • Up to 151× higher QPS than Elasticsearch (SciFact dataset)
  • Average 44× QPS improvement across multiple IR benchmark datasets
  • Up to 12× faster indexing on large-scale datasets
  • Lower p50 latency on every tested dataset, and lower p95 latency on all but the three largest corpora (climate-fever, fever, and msmarco)

Detailed Comparison: Cosdata vs. Elasticsearch

| Dataset | Corpus Size | System | Indexing (sec) | QPS | NDCG@10 | p50 (ms) | p95 (ms) |
|---------|-------------|--------|----------------|-----|---------|----------|----------|
| arguana | 8.7K | Cosdata | 0.1 | 2,167 | 0.40 | 9 | 15 |
| | | Elasticsearch | 1.4 | 263 | 0.48 | 44 | 74 |
| climate-fever | 5.4M | Cosdata | 40.6 | 135 | 0.13 | 106 | 379 |
| | | Elasticsearch | 522.8 | 84 | 0.14 | 162 | 263 |
| fever | 5.4M | Cosdata | 40.3 | 314 | 0.47 | 52 | 157 |
| | | Elasticsearch | 525.7 | 154 | 0.52 | 80 | 138 |
| fiqa | 57K | Cosdata | 0.5 | 4,942 | 0.25 | 7 | 12 |
| | | Elasticsearch | 6.7 | 251 | 0.25 | 39 | 60 |
| msmarco | 8.8M | Cosdata | 57.7 | 315 | 0.23 | 46 | 162 |
| | | Elasticsearch | 714.7 | 166 | 0.23 | 73 | 129 |
| nq | 2.6M | Cosdata | 19.3 | 483 | 0.29 | 30 | 81 |
| | | Elasticsearch | 243.2 | 197 | 0.29 | 59 | 100 |
| quora | 522K | Cosdata | 2.7 | 1,425 | 0.81 | 11 | 36 |
| | | Elasticsearch | 30.2 | 323 | 0.81 | 39 | 55 |
| scidocs | 25K | Cosdata | 0.3 | 13,338 | 0.16 | 7 | 12 |
| | | Elasticsearch | 3.6 | 319 | 0.15 | 33 | 48 |
| scifact | 5.2K | Cosdata | 0.1 | 40,909 | 0.69 | 7 | 13 |
| | | Elasticsearch | 1.0 | 271 | 0.68 | 34 | 51 |
| trec-covid | 171K | Cosdata | 1.7 | 2,219 | 0.61 | 10 | 18 |
| | | Elasticsearch | 22.1 | 110 | 0.62 | 57 | 88 |
| webis-touche2020 | 382K | Cosdata | 5.5 | 2,789 | 0.34 | 10 | 18 |
| | | Elasticsearch | 63.1 | 108 | 0.34 | 62 | 99 |
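The per-dataset throughput ratios behind the highlights can be reproduced directly from the QPS column of the table above. The short sketch below does only that arithmetic; the numbers are copied from the table and the variable names are illustrative.

```python
# (dataset, Cosdata QPS, Elasticsearch QPS), copied from the table above.
qps = [
    ("arguana", 2167, 263),
    ("climate-fever", 135, 84),
    ("fever", 314, 154),
    ("fiqa", 4942, 251),
    ("msmarco", 315, 166),
    ("nq", 483, 197),
    ("quora", 1425, 323),
    ("scidocs", 13338, 319),
    ("scifact", 40909, 271),
    ("trec-covid", 2219, 110),
    ("webis-touche2020", 2789, 108),
]

# Throughput ratio: how many times more queries per second Cosdata served.
ratios = {name: cosdata / elastic for name, cosdata, elastic in qps}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<18} {ratio:6.1f}x")

best = max(ratios, key=ratios.get)
print(f"largest gap: {best} at {ratios[best]:.0f}x")  # largest gap: scifact at 151x
```

The gap is widest on the small corpora (scifact, scidocs) and narrows to roughly 1.6x to 2.5x on the multi-million-document corpora.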

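Key Takeaway: Cosdata delivers dramatically higher throughput and lower median latency, with NDCG@10 within 0.01 of Elasticsearch on nine of the eleven datasets. The exceptions are arguana (0.40 vs. 0.48) and fever (0.47 vs. 0.52), where Elasticsearch ranks better.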
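The NDCG@10 column reports normalized discounted cumulative gain over the top ten results. As a minimal sketch of the standard definition with linear gain (the evaluation harness used for these benchmarks is not specified here, so treat the function names and example judgments as assumptions):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(retrieved_relevances, judged_relevances, k=10):
    """NDCG@k for one query.

    retrieved_relevances: relevance grade of each returned document, in rank order.
    judged_relevances: grades of all documents judged relevant for the query,
                       used to build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal == 0.0:
        return 0.0
    return dcg_at_k(retrieved_relevances, k) / ideal

# Hypothetical query with two relevant documents, returned at ranks 2 and 4.
score = ndcg_at_k([0, 1, 0, 1], judged_relevances=[1, 1])
print(f"{score:.3f}")  # 0.651
```

A system that returned both relevant documents at ranks 1 and 2 would score 1.0. Benchmark figures such as those in the table are this per-query value averaged over all test queries.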


🎯 Dense Vector Search (HNSW)

Our HNSW implementation achieves industry-leading performance on large-scale vector datasets with high-dimensional embeddings.

Performance Highlights

  • 1,758 QPS on 1 million records (1536 dimensions)
  • ~42% faster than Qdrant
  • ~54% faster than Weaviate
  • ~146% faster than E
