Stats 337: Readings in Applied Data Science
Stats 337 is a small discussion class available to Stanford students in Spring 2018. Students in this class will read 3-4 papers (or equivalent) per week, write a brief response, and then discuss the papers (and related ideas) in class.
Readings
These readings reflect my personal thoughts about applied data science, and are skewed towards topics that I think are important but generally underappreciated. They are not a systematic attempt to survey the field. That said, if you think there's something major that I've missed, please feel free to submit an issue (or pull request!). These readings will evolve as the quarter goes by.
Many of the readings come from Practical Data Science for Stats, a joint PeerJ collection and special issue of the American Statistician. Jenny Bryan and I pulled this collection together in order to publish some of the important parts of data science that were previously unpublished. Other readings are blog posts because so much of applied data science is outside the comfort zone of traditional academic fields.
The development of much of this course has been driven by conversations on Twitter. Many thanks to everyone who has helped me out! Key threads: classroom discussion, ethics, Google Sheets, citation management.
What the *&!% is data science? (Apr 2)
- Data scientists mostly just do arithmetic and that’s a good thing; Noah Lorang (2016).
- Optional: Enterprise Data Analysis and Visualization: An Interview Study; Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer (2012).
- Optional: 50 years of data science (OA preprint); David Donoho (2017). This is a discussion paper, and a number of notable statisticians have contributed commentary. Make sure to read some of the commentaries as well.
Data collection and collaboration (Apr 9)
- Tidy data; Hadley Wickham (2013).
- Data organization in spreadsheets; Karl W Broman, Kara Woo (2017).
- Best practices for using Google Sheets in your data project; Matthew Lincoln (2018).
- Bonus: Modeling as a core component of structuring data; Clifford Konold, William Finzer, Kozoom Kreetong (2017).
Spend 3-5 minutes filling out class feedback.
Software engineering (Apr 16)
- Software development skills for data scientists; Trey Causey (2015).
- Excuse me, do you have a moment to talk about version control?; Jennifer Bryan (2017).
- Good enough practices in scientific computing; Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, Tracy K. Teal (2017).
DevOps (Apr 23)
- Opinionated analysis development; Hillary Parker (2017).
- An introduction to Docker for reproducible research, with examples from the R environment; Carl Boettiger (2014).
- Hidden Technical Debt in Machine Learning Systems; D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison (2015).
Teaching (Apr 30)
- The Introductory Statistics Course: A Ptolemaic Curriculum?; George W Cobb (2007).
- The democratization of data science education; Sean Kross, Roger D Peng, Brian S Caffo, Ira Gooding, Jeffrey T Leek (2017).
- Teaching stats for data science; Danny Kaplan (2017).
- Ten quick tips for teaching programming; Neil C. C. Brown, Greg Wilson (2018).
Reproducibility (May 7)
- Best practices for computational science; Victoria Stodden, Sheila Miguez (2014).
- How rOpenSci uses code review to promote reproducible science; Noam Ross, Scott Chamberlain, Karthik Ram, Maëlle Salmon (2017).
- A practical guide for transparency in psychological science; Olivier Klein, Tom Hardwicke, Frederik Aust, Johannes Breuer, Henrik Danielsson, Alicia Hofelich Mohr, Hans IJzerman, Gustav Nilsonne, Wolf Vanpaemel, Michael Frank (2018).
- Lessons Learned Reproducing a Deep Reinforcement Learning Paper; Matthew Rahtz (2018).
- Bonus: The Practice of Reproducible Research; Justin Kitzes, Daniel Turek, Fatma Deniz (2018).
Ethics (May 14)
- The Ethical Data Scientist; Cathy O'Neil (2016).
- Big data, machine learning, and the social sciences; Hannah Wallach (2014).
- A Code of Ethics for Data Science; DJ Patil (2018).
- An ethical code can’t be about ethics; Schaun Wheeler (2018).
- Ethical Guidelines for Statistical Practice; Committee on Professional Ethics of the American Statistical Association (2016).
- Journalism as a Professional Model for Data Science; Brian C. Keegan (2016).
Career (May 21)
- What it's like to be on the data science job market; Trey Causey (2015).
- Academic job search advice; Matt Might (????).
- Importance of sponsorship; Emily Robinson (2018).
- Imposter syndrome in data science; Caitlin Hudon (2018).
Industry
- Doing data science at Twitter; Robert Chang (2015).
- Engineers shouldn’t write ETL: A guide to building a high functioning data science department; Jeff Magnusson (2016).
- Using R packages and education to scale data science at Airbnb; Ricardo Bion (2016).
- Data science at Instacart; Jeremy Stanley (2017).
- .rprofile: Jenny Bryan; Kelly O'Briant (2017).
- Marketing for data science; Erik Oberg (2018).
Workflow
- The plain person's guide to plain text social science; Kieran Healy (2016).
- Open notebook history; Caleb McDaniel (2013).
- Optional: How to be a modern scientist; Jeff Leek (2016).
Annotated bibliographies
Many students in the Spring 2018 class elected to share their final annotated bibliographies:
- Communication and visualization by Kenneth Tay.
- Connections to cognitive science by Sara Altman.
- Data science in modern medicine by Sean R. Zion.
- Ethics in data science (pdf)
- Graphical advice by Nick Hershey.
- Sharing analyses across research groups
