Upgini
Data search & enrichment library for Machine Learning → Easily find and add relevant features to your ML & AI pipeline from hundreds of public and premium external data sources, including open & commercial LLMs
❔ Overview
Upgini is an intelligent data search engine with a Python library that helps you find and add relevant features to your ML pipeline from hundreds of public, community, and premium external data sources. Under the hood, Upgini automatically optimizes all connected data sources by generating an optimal set of ML features using large language models (LLMs), graph neural networks (GNNs), and recurrent neural networks (RNNs).
Motivation: for most supervised ML models, external data and features boost accuracy significantly more than any hyperparameter tuning. But the lack of automated, time-efficient enrichment tools for external data blocks mass adoption of external features in ML pipelines. We want to radically simplify feature search and enrichment so that external data becomes a standard approach, just as hyperparameter tuning is in machine learning today.
Mission: Democratize access to data sources for the data science community.
🚀 Awesome features
⭐️ Automatically find only relevant features that improve your model’s accuracy, not just features correlated with the target variable, which in 9 out of 10 cases yield zero accuracy improvement
⭐️ Automated feature generation from the sources: LLM‑based data augmentation, RNNs, and GNNs; ensembling across multiple data sources
⭐️ Automatic search key augmentation from all connected sources. If you do not have all search keys in your search request, such as postal/ZIP code, Upgini will try to add those keys based on the provided set of search keys. This will broaden the search across all available data sources
⭐️ Calculate accuracy metrics and uplift after enriching an existing ML model with external features
⭐️ Check the stability of accuracy gain from external data on out-of-time intervals and verification datasets. Mitigate the risks of unstable external data dependencies in the ML pipeline
⭐️ Easy to use - a single request to enrich the training dataset with all of the keys at once:
⭐️ Scikit-learn-compatible interface for quick data integration with existing ML pipelines
⭐️ Support for most common supervised ML tasks on tabular data:
⭐️ Simple Drag & Drop Search UI:
<a href="https://upgini.com/upgini-widget">
<img width="710" alt="Drag & Drop Search UI" src="https://github.com/upgini/upgini/assets/95645411/36b6460c-51f3-400e-9f04-445b938bf45e">
</a>
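To illustrate the difference between a feature that is merely correlated with the target and one that actually improves accuracy, here is a minimal standard-library sketch. It is not Upgini's algorithm: the toy data, the group-mean "model", and the names `make_rows`, `holdout_mae`, and `report` are assumptions made up for this example.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def make_rows(repeats):
    """Toy data (made up): x2 is a rescaled copy of x1, x3 carries new information."""
    return [
        {"x1": x1, "x2": 10 * x1, "x3": x3, "y": 10 * x1 + 3 * x3}
        for _ in range(repeats)
        for x1 in (0, 1)
        for x3 in (0, 1)
    ]


train, holdout = make_rows(2), make_rows(1)


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def holdout_mae(features):
    """Fit a group-mean model on `train` and return its MAE on `holdout`."""
    groups = defaultdict(list)
    for row in train:
        groups[tuple(row[f] for f in features)].append(row["y"])
    predict = {key: mean(ys) for key, ys in groups.items()}
    return mean(
        abs(row["y"] - predict[tuple(row[f] for f in features)]) for row in holdout
    )


# The existing model already uses x1; each candidate is judged by its MAE uplift.
baseline = holdout_mae(["x1"])
report = {}
for candidate in ("x2", "x3"):
    corr = pearson([r[candidate] for r in train], [r["y"] for r in train])
    uplift = baseline - holdout_mae(["x1", candidate])
    report[candidate] = (round(corr, 2), uplift)
    print(candidate, "correlation:", report[candidate][0], "MAE uplift:", uplift)
```

In this toy setup `x2` is strongly correlated with the target yet adds no uplift, because the model already has the same information through `x1`, while the weakly correlated `x3` removes the remaining error. Selecting features by measured uplift on held-out data, rather than by correlation, is the idea the first item in the list above describes.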
🌎 Connected data sources and coverage
- Public data: public sector, academic institutions, other sources through open data portals. Curated and updated by the Upgini team
- Community‑shared data: royalty- or license-free datasets or features from the data science community (our users). This includes both public and scraped data
- Premium data providers: commercial data sources verified by the Upgini team in real-world use cases
👉 Details on datasets and features
📊 Total: 239 countries and up to 41 years of history
|Data sources|Countries|History (years)|# sources for ensembling|Update frequency|Search keys|API Key required
|--|--|--|--|--|--|--|
|Historical weather & Climate normals|68|22|-|Monthly|date, country, postal/ZIP code|No
|Location/Places/POI/Area/Proximity information from OpenStreetMap|221|2|-|Monthly|date, country, postal/ZIP code|No
|International holidays & events, Workweek calendar|232|22|-|Monthly|date, country|No
|Consumer Confidence index|44|22|-|Monthly|date, country|No
|World economic indicators|191|41|-|Monthly|date, country|No
|Markets data|-|17|-|Monthly|date, datetime|No
|World mobile & fixed-broadband network coverage and performance|167|-|3|Monthly|country, postal/ZIP code|No
|World demographic data|90|-|2|Annual|country, postal/ZIP code|No
|World house prices|44|-|3|Annual|country, postal/ZIP code|No
|Public social media profile data|104|-|-|Monthly|date, email/HEM, phone|Yes
|Car ownership data and Parking statistics|3|-|-|Annual|country, postal/ZIP code, email/HEM, phone|Yes
|Geolocation profile for phone & IPv4 & email|239|-|6|Monthly|date, email/HEM, phone, IPv4|Yes
|🔜 Email/WWW domain profile|-|-|-|-
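
Conceptually, the search keys in the table are the columns on which your rows are matched against an external source. The sketch below shows that idea as a plain left join on `date` and `country`, using only the standard library. It is an illustration under assumptions, not how Upgini performs the search: the `holidays` lookup, the `enrich` helper, and all values are hypothetical.

```python
from datetime import date

# Hypothetical external source keyed by (date, country); values are made up.
holidays = {
    (date(2024, 1, 1), "US"): {"is_holiday": 1},
    (date(2024, 1, 1), "GB"): {"is_holiday": 1},
    (date(2024, 1, 2), "US"): {"is_holiday": 0},
}


def enrich(rows, source, keys, feature_names):
    """Left-join `source` features onto `rows` using the given search keys.

    Rows without a match keep their original columns and get None features.
    """
    enriched = []
    for row in rows:
        match = source.get(tuple(row[k] for k in keys), {})
        enriched.append({**row, **{name: match.get(name) for name in feature_names}})
    return enriched


# Hypothetical training rows that contain the two search keys.
train = [
    {"date": date(2024, 1, 1), "country": "US", "sales": 120},
    {"date": date(2024, 1, 2), "country": "US", "sales": 340},
    {"date": date(2024, 1, 2), "country": "DE", "sales": 90},
]

enriched = enrich(train, holidays, keys=("date", "country"), feature_names=["is_holiday"])

# Share of rows for which the external source had a matching key combination.
hit_rate = sum(r["is_holiday"] is not None for r in enriched) / len(enriched)
print(f"hit rate: {hit_rate:.2f}")
```

The row for `DE` has no match and keeps a `None` feature, which is why the set of search keys you can provide determines how many sources, and how many of your rows, can be covered.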
❓Know other useful data sources for machine learning? Give us a hint and we'll add it for free.
💼 Tutorials
Search of relevant external features & Automated feature generation for Salary prediction task (use as a template)
- The goal is to predict salary for a data science job posting based on information about the employer and job description.
- Following this guide, you'll learn how to search and auto‑generate new relevant features with the Upgini library
- The evaluation metric is Mean Absolute Error (MAE).
Run Feature search & generation notebook inside your browser:
<!-- [