SyllabusAndLectures
Hertie School of Governance Introduction to Collaborative Social Science Data Analysis
<img src="img/HertieCollaborativeDataLogo_v1.png" align="right" height="75" width ="75"/>
MPP-E1180: Introduction to Collaborative Social Science Data Analysis
Autumn 2016
Version: 1 December 2016
Instructor: Christopher Gandrud
- Office: 1.64
- Email: <a href="mailto:gandrud@hertie-school.org">gandrud@hertie-school.org</a>
- Website: <a href="http://christophergandrud.blogspot.com/">http://christophergandrud.blogspot.com/</a>
- Work: <a href="https://github.com/christophergandrud">https://github.com/christophergandrud</a>
The objective of this course is to learn how to collaboratively and reproducibly gather social data, analyse it, and effectively present results.
The course is intended to be immediately useful for your academic work, as well as for work in the public and private sectors. The tools you learn and the final project you complete in this course will be directly useful for your thesis research. Academia places increasing emphasis on the skills needed to gather, handle, and analyse data effectively and to present results to a range of audiences in highly reproducible ways, so this course will provide you with important tools for future academic study.
Governments and international institutions are increasingly adopting the technologies and methods of collaborative open data science. For example, see initiatives by the World Bank, Germany, New York City, the United Kingdom, and the United States. These and many more resources provide great new opportunities for open evidence-based policymaking. This course is designed to enable you to take full advantage of these opportunities and actively contribute to these initiatives.
Finally, the skills we will learn in this course are also widely used in business. R programming skills in particular are highly valued in fields such as finance and information technology. Being able to communicate results from statistical analyses effectively in dynamic, often web-based formats is highly valued by businesses and, increasingly, by governments and academia.
A large part of the practice of social science data analysis is computer programming. Learning how to approach the analysis of data from a computer science perspective will allow you to take full advantage of state-of-the-art statistical tools and best-practice research methods for understanding social phenomena and effectively communicating your findings in multiple media.
The course will involve learning the fundamentals of widely used computer languages. The statistical language R will allow us to gather and analyse our data. The Markdown/HTML and LaTeX markup languages will allow us to present our results to a variety of audiences. We will use Git/GitHub to version control and store all of our files. This will enable collaboration and full reproducibility.
The focus of this course is active in-class participation and collaboration on realistic projects using the concepts and tools introduced in lectures and scholarly articles. All assignments and projects will be completed in teams. I encourage you to use pair programming and even collaborate across teams.
Alongside learning the details of how to use specific tools of collaborative and reproducible social science data analysis, we will emphasise their general properties and how they fit together into a highly collaborative and reproducible research workflow. Languages and technologies come and go, so it is important to understand the fundamental principles underlying them so that you can adapt to new technologies and understand previous researchers' work.
Prerequisites
The course assumes that you have a good basic understanding of descriptive and introductory inferential statistics (e.g. data types, ways of describing distributions, significance testing, linear models, and so on). Knowledge of particular software or computer programming is not assumed.
Patience is a key skill for computer programming. Computer languages are extremely literal. This can lead to 'communication problems' between you and the computer. It does not share your assumptions, so you have to be very explicit. This quality makes using these tools great for recording your research steps so that they are highly reproducible. But it can also be maddening and requires patience to deal with effectively.
Materials
Readings
Gandrud, Christopher. 2015. Reproducible Research with R and RStudio. 2nd Edition. Chapman & Hall/CRC Press, Oxford. (RRRR)
A good reference text to have by your side when doing statistics with R is:
Crawley, Michael J. 2005. Statistics: An Introduction Using R. John Wiley and Sons Ltd., Chichester.
A great free resource for more advanced R programming is Hadley Wickham's aptly named Advanced R Programming.
If you ever get stuck, a good first place to turn for answers is StackExchange. If you are stuck on a coding problem, chances are someone else has had the same problem before, asked a question on StackExchange, and found answers.
Software and Computers
All of the software used in this course will be open source, i.e. free.
- Please bring your own laptop to class. What we do in the course requires you to have administrator privileges on your computer. It's preferable that you have a computer with Mac or (similarly) Linux OS. Windows is also fine, but there will be a few extra steps and it may take more time for me to help you resolve bugs.
- Sign up for a GitHub account and install Git.
- Install LaTeX. This is a large installation, so dedicate some time to doing it.
- You need to have a modern web browser installed on your computer. Chrome or Firefox are the best choices for web scraping.
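Once everything is installed, a quick check from a terminal can confirm that the tools are reachable. The following is a minimal sketch, not an official setup script: the helper name `check_tool` is hypothetical, and it assumes that Git, R, and LaTeX expose the commands `git`, `R`, and `pdflatex` on your PATH, which may not hold on every system (on Windows in particular, R is often not added to the PATH by default).

```shell
# Hypothetical helper: report whether a command-line tool is on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: MISSING"
  fi
}

# Assumed command names for Git, R, and LaTeX; adjust for your system.
for tool in git R pdflatex; do
  check_tool "$tool"
done
```

A tool reported as MISSING may still be installed; it may simply not be on the PATH of the shell you are using.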
Lectures
All lecture materials and their source files will be hosted in the course's GitHub repository.
You are highly encouraged to suggest changes to the lecture material with a pull request (we'll learn how to do this in the second lecture) if you think of ways to improve clarity or relevance, or if you spot typos.
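As a rough preview of the local half of that workflow, the sketch below creates a branch, commits a small fix, and lists the resulting history. It is an illustration under stated assumptions rather than the procedure taught in the lecture: the directory name `lectures`, the branch name `fix-typo`, and the user identity are all hypothetical, and a throwaway local repository stands in for a clone of your fork so that the commands are safe to run anywhere.

```shell
# Work in a throwaway directory so nothing on your machine is touched.
workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for cloning your fork: create a small local repository.
git init -q lectures
cd lectures
git config user.name "Example Student"        # hypothetical identity
git config user.email "student@example.org"   # hypothetical identity
echo "Lecture notes" > README.md
git add README.md
git commit -q -m "Initial commit"

# 1. Create a branch for the proposed change.
git checkout -q -b fix-typo

# 2. Edit a file and commit the change with a descriptive message.
echo "Lecture notes (typo fixed)" > README.md
git commit -q -am "Fix typo in README"

# 3. Inspect the history. With a real fork you would next push the branch
#    and open the pull request on GitHub.
git log --oneline
```

Keeping each suggestion on its own branch makes it easier to discuss and merge one change at a time.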
Assessment:
| Name | Percent of Final Mark | Due |
| ------------------------------- | --------------------- | ---------------- |
| Pair Assignment 1 | 10% | 7 October |
| Pair Assignment 2 | 10% | 28 October |
| Pair Assignment 3 | 10% | 11 November |
| Collaborative Research Project | 50% | Presentation: Final Class, Paper/Website: Final Exam Week |
| Attendance/active Participation | 20% | - |
- The first pair assignment is designed to develop your understanding of file structures, version control, and basic R data structures and descriptive statistics. Each pair will create a new public GitHub repository. The repository should be fully documented, including a descriptive README.md file. The repository will include R source code files that access at least two data sets from the R core data sets and/or fivethirtyeight, perform basic transformations on the data, and illustrate the data's distributions using a variety of relevant descriptive statistics. At least one file must dynamically link to another in a substantively meaningful way. Finally, another pair must make a pull request, which should be discussed and merged. Deadline 7 October, 10% of final grade.
- The second pair assignment is a proposal for your Collaborative Research Project. It is an opportunity for you to lay out your collaborative research paper question, justify why it is interesting, provide a basic literature review (properly cited using BibTeX), and identify data sources/methodologies that you can access to help answer your question. You will also demonstrate your understanding of literate programming technologies. Deadline 28 October, 2,000 words maximum, 10% of final grade.
- In the third pair assignment you will gather web-based data from at least two sources, merge the data sets, conduct basic descriptive and (some) inferential statistics on the data to address a relevant research question, and briefly describe the results, including dynamically generated tables and figures. Students are encouraged to access data and perform statistical analyses with an eye to answering questions relevant for their Collaborative Research Project. Deadline 11 November, the write-up should be 1,500 words maximum and use literate programming, 10% of final grade.
- For the Collaborative Research Project you will pose an interesting social science question and attempt
