Datadoubleconfirm
Simple datasets and notebooks for data visualization, statistical analysis and modelling - with write-ups here: https://projectosyo.wixsite.com/datadoubleconfirm.
Data is nutrients for the soul
Provides simple datasets for data visualization, statistical analysis and modelling
Suitable for those starting out in data science, and for anyone else who finds the datasets useful
Data visualizations can be found here
Tutorials/ exercises can be found here
List of datasets along with descriptions
Dataset: akcdogs.csv
Description: Cleaned data on dog breeds scraped from akc.org (as at 17 Jan 2018)
Variables: Breed , Trait1, Trait2, Trait3, Energy level, Size, Rank, Good with Children, Good with other Dogs, Shedding, Grooming, Trainability, Height, Weight, Life expectancy, Barking level, Group
Mode of data collection: Webscraping
Source: American Kennel Club
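Entries like the one above can be explored with Python's standard csv module. The sketch below parses a hypothetical two-row sample using a subset of the columns listed (the real values vary by scrape date) and filters breeds by the Group column:

```python
import csv
import io

# Hypothetical sample mimicking akcdogs.csv; real values vary by scrape date
sample = """Breed,Energy level,Size,Rank,Group
Labrador Retriever,High,Large,1,Sporting
French Bulldog,Medium,Small,4,Non-Sporting
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Filter breeds by the Group column, as one might when exploring the dataset
sporting = [r["Breed"] for r in rows if r["Group"] == "Sporting"]
print(sporting)  # ['Labrador Retriever']
```

The same pattern applies to any of the CSVs in this list: `csv.DictReader` keys each row by the header names given under Variables.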
Dataset: arrivals2018.csv
Description: Top 20 cities based on 2017 arrivals and 2018 estimates
Variables: rank , city, country, arrivals_2017 (actual arrival count for 2017), arrivals_2018 (estimated arrival count for 2018)
Mode of data collection: Webscraping
Source: Most visited: World's top cities for tourism
Dataset: bookdepo.csv
Description: Raw data on bestsellers scraped from bookdepository.com (as at 11 Jan 2018)
Variables: (blank) (row index number) , name (book title), material (book material), author (author), rank (bestsellers rank), maincat (main category), subcat (sub category), rating (rating by readers), ratingcount (number of readers who gave ratings), saleprice (discounted price in S$), listprice (original price in S$), numofpages (number of pages), datepub (date published), isbn13 (ISBN13 number)
Mode of data collection: Webscraping
Source: Book Depository
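Since bookdepo.csv carries both saleprice and listprice (in S$), a natural derived quantity is the discount percentage. A minimal sketch, using hypothetical rows rather than real scraped prices:

```python
import csv
import io

# Hypothetical sample with the price columns described above (S$)
sample = """name,saleprice,listprice
Book A,10.00,20.00
Book B,15.00,25.00
"""

discounts = {}
for row in csv.DictReader(io.StringIO(sample)):
    sale, full = float(row["saleprice"]), float(row["listprice"])
    # Discount as a percentage of the original list price
    discounts[row["name"]] = round((full - sale) / full * 100, 1)

print(discounts)  # {'Book A': 50.0, 'Book B': 40.0}
```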
Dataset: bookdepobest.csv
Description: Cleaned data on bestsellers scraped from bookdepository.com (as at 11 Jan 2018)
Variables: SN, name, rank, maincat, subcat, rating, saleprice, listprice, datepub, isbn13, GoodreadsRateCount, BookMaterial, Author(s), PageCount
Mode of data collection: Webscraping
Source: Book Depository
Dataset: Class1.csv
Description: Hypothetical dataset consisting of the score results of 100 students on three tests
Variables: id, gender, test1, test2, test3
Mode of data collection: N.A.
Source: N.A.
Dataset: Class2.csv
Description: Hypothetical dataset consisting of the score results of 100 students on four tests
Variables: id, gender, test1, test2, test3, test4
Mode of data collection: N.A.
Source: N.A.
Dataset: covid19_sg.csv
Description: COVID-19 case time series data in Singapore. Data is contributed to https://github.com/neherlab/covid19_scenarios
Variables: Date, Daily_Confirmed_, False_Positives_Found, Cumulative_Confirmed, Daily_Discharged, Passed_but_not_due_to_COVID, Cumulative_Discharged, Discharged_to_Isolation, Still_Hospitalised, Daily_Deaths, Cumulative_Deaths, Tested_positive_demise, Daily_Imported, Daily_Local_transmission, Local_cases_residing_in_dorms_MOH_report, Local_cases_not_residing_in_doms_MOH_report, Intensive_Care_Unit_(ICU), General_Wards_MOH_report, In_Isolation_MOH_report, Total_Completed_Isolation_MOH_report, Total_Hospital_Discharged_MOH_report, Linked_community_cases, Unlinked_community_cases, Phase, Cumulative_Vaccine_Doses, Cumulative_Individuals_Vaccinated, Cumulative_Individuals_Vaccination_Completed
Mode of data collection: Manual/ Webscraping
Source: Ministry of Health
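In this file the cumulative columns should equal the running sums of their daily counterparts (e.g. Cumulative_Confirmed vs Daily_Confirmed), which makes a quick consistency check possible. A sketch with hypothetical daily counts:

```python
import itertools

# Hypothetical daily confirmed counts; in the real file, Cumulative_Confirmed
# should be the running sum of Daily_Confirmed
daily = [3, 5, 0, 7]
cumulative = list(itertools.accumulate(daily))
print(cumulative)  # [3, 8, 8, 15]

# A consistency check one might run against the real columns
assert cumulative[-1] == sum(daily)
```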
Dataset: datagovscraped_stacked.csv
Description: List of datasets available on data.gov.sg as at 2 Sep 2022
Variables: title, link, org, description, last_updated, created, format, coverage, licence
Mode of data collection: Webscraping
Source: Data.gov.sg
Dataset: DisneySongs25.csv
Description: Top 25 Disney Songs (as at 4 Apr 2016)
Variables: Rank , Song Title, Movie, Year, Lyrics
Mode of data collection: Manual
Source: Top 25 Disney Songs - IGN, Metrolyrics, Disneyclips.com
Dataset: emojis.csv
Description: Names and descriptions of emojis
Variables: id (number to identify unique emoji in dataset), index (running number in dataset), name (name of emoji), desc (description/ alternative names)
Mode of data collection: Webscraping
Source: Emoji Cheat Sheet
Dataset: FreqWordsObama.csv
Description: Frequently mentioned words in Barack Obama's tweets between 2007 and 2017 (as at 12 Dec 2017)
Variables: Year (year of tweet), Word (frequently mentioned word), Count (number of tweets containing word), Year Volume (volume of tweet in the year), Percentage (percentage of tweets containing word)
Mode of data collection: Twitter webscraping and text mining
Source: Barack Obama's Twitter account
Dataset: GovSG.csv
Description: Addresses with GIS location and contact information of Ministries and Statutory Boards in Singapore
Variables: Organisation, Type (Ministry/ Statutory Board), Zipcode, Latitude, Longitude, Website, Tel, Fax, Email, Enquiry/ Feedback Form (url), Parent Ministry (Statutory Boards under respective Ministries)
Mode of data collection: Webscraping, Manual, Tableau-generated latitude/longitude based on Zipcode
Source: Singapore Government Directory, The Public Service | Careers@Gov
Dataset: gov-sg-terms-translations.tsv
Description: Official English-Mandarin translations of Singapore Government terms
Variables: english, mandarin
Mode of data collection: Webscraping
Source: Government Terms Translated
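Because this file is tab-separated rather than comma-separated, it needs `delimiter="\t"` when read with Python's csv module. A minimal sketch with one hypothetical row in the english/mandarin layout:

```python
import csv
import io

# Hypothetical row in the english/mandarin layout of the TSV
sample = "english\tmandarin\nMinistry of Health\t卫生部\n"

# csv defaults to commas; tab-separated files need delimiter="\t"
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
translations = {r["english"]: r["mandarin"] for r in reader}

print(translations["Ministry of Health"])  # 卫生部
```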
Dataset: hawker_stacked.csv
Description: Nutritional contents of Singapore hawker foods
Variables: kcal, protein(g), fat(g), saturatedfat(g), dietaryfibre(g), carbs(g), cholesterol(mg), sodium(mg), food, comments, image_link, type
Mode of data collection: Webscraping
Source: Best and Worst Singapore Hawker Chinese Food: Dim Sum, Char Kway Teow and More; Best and Worst Singapore Hawker Malay Breakfast Foods: Nasi Lemak, Mee Siam, Soto and More; Best and Worst Singapore Hawker Indian Breads: Prata, Mutton Murtabak and More
Dataset: mrtfaretime.csv
Description: Travel time and fare information between train (MRT/LRT) stations in Singapore (as at Oct 2018)
Variables: Station_start (Boarding station), Station_end (Alighting station), Time (Travel time in mins), Adult (Adult fare), Senior (Fare for Seniors and Persons with Disabilities), Standard (Fare for Standard Ticket), Student (Student fare), WTCS (Fare under Workfare Transport Concession Scheme), REF_STNSTART, Latitude_Start, Longitude_Start, REF_STNEND, Latitude_End, Longitude_End
Mode of data collection: Webscraping
Source: TransitLink Electronic Guide
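Since each row pairs a boarding and alighting station with a travel time and fares, simple lookups fall out directly. The sketch below, using hypothetical fare rows with the Station_start/Station_end/Time/Adult columns, picks the fastest destination from a given boarding station:

```python
import csv
import io

# Hypothetical rows using the Station_start/Station_end/Time/Adult columns
sample = """Station_start,Station_end,Time,Adult
Ang Mo Kio,City Hall,24,1.66
Ang Mo Kio,Orchard,21,1.55
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Among trips boarding at a given station, pick the shortest travel time
fastest = min((r for r in rows if r["Station_start"] == "Ang Mo Kio"),
              key=lambda r: int(r["Time"]))
print(fastest["Station_end"])  # Orchard
```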
Dataset: mrtsg.csv
Description: Latitude and longitude of train (MRT/LRT) stations in Singapore (as at Dec 2020)
Variables: OBJECTID (id) , STN_NAME (station name), STN_NO (station number), X (X coord in SVY21 format), Y (Y coord in SVY21 format), Latitude, Longitude, COLOR (color of train line)
Mode of data collection: Public dataset; coordinate conversion via webscraping
Source: LTA DataMall, OneMap Singapore
Dataset: mydramalistactors_tocsv.csv
Description: Top rated actors from MyDramaList (as at Dec 2022)
Variables: person, ranking, likes_num, nationality, gender, dob, title, title_info, title_ratings
Mode of data collection: Webscraping (slightly over 2000 records scraped)
Source: MyDramaList
Dataset: mydramalisttop.csv
Description: Top rated movies from MyDramaList (as at Dec 2022)
Variables: title, ranking, movie_url, movie_country, ratings, num_watchers, synopsis, duration, tags, genres, cast_list
Mode of data collection: Webscraping (all 5000 records scraped) (Note: Data requires cleaning)
Source: MyDramaList
Dataset: passport.csv
Description: Top 10 Passports (in the 2017 Glob