Gitdm.archive
📜Fork for tracking CNCF projects
CNCF gitdm
This is the Cloud Native Computing Foundation's fork of Jon Corbet and Greg KH's gitdm tool for calculating contributions based on developers and their companies. Companies and developers can check if they are correctly attributed at the following links:
Company Developers list: [co1], [co2], [co3], [co4], [co5], [co6], [co7], [co8].
Developers affiliations list: [dev1], [dev2], [dev3], [dev4], [dev5].
New affiliations are imported into DevStats about 1-2 times/month.
DevStats
This repository is used as a source of affiliations for all DevStats projects. The final affiliations JSON is periodically imported by the DevStats project.
Adding/Updating affiliation
If you find any errors or missing affiliations in those lists, please submit a pull request with edits to the developer affiliation files: [dev1], [dev2], [dev3], [dev4], [dev5], ...
Please note that we need both current and historical emails here because we process data from GitHub Archives, where old emails still appear even if they are no longer current.
Only the Developers affiliations list [dev1], [dev2], [dev3], [dev4], [dev5], ... should be edited manually.
Company Developers lists [co1], [co2], [co3], [co4], [co5], [co6], [co7], [co8] are computed derivatives of the first list.
Other files used for affiliations are the email map file and github users file.
Please note that cncf/gitdm affiliations are imported into DevStats (cncf/devstats) once per 4 weeks.
Removing affiliations
If you do not want to have your email listed here please read how to remove your email.
Testing changes
You can test any changes locally by cloning this repository and regenerating all data by running ./rerun_data.sh.
Then generate config files by running: ./import_affs.sh.
If those two files are out of sync, the tool will notify you about this.
This tool will generate a new email-map file.
Check that your changes were processed properly, then move the file to cncf-config/email-map, replacing the existing one.
Sync workflow
Please follow the instructions from SYNC.md.
Running
Use the *.sh scripts to run analytics (all*.sh for full analysis and rels*.sh for per-release stats).
This program assumes that gitdm resides in: ~/dev/cncf/gitdm/ and that kubernetes is in ~/dev/go/src/k8s.io/kubernetes/
Output files are placed in the kubernetes directory.
To regenerate all statistics just run: ./rerun_data.sh
This is an iterative process:
Run any of the scripts. Review its output in the kubernetes directory. Iteratively adjust mappings to handle more authors.
You can also run via ./debug.sh to halt in debugger and review the hacker's structure and those who were not found. See cncfdm.py:DebugUnknowns
Final report:
Contributing
Pull requests are welcome.
Our mapping is never complete. Please see the configuration files under Config files.
The email-map file is a direct email-to-employer mapping.
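As a rough illustration of what a direct email-to-employer mapping can look like in code, the sketch below parses lines of the form `email employer name`, optionally followed by `< YYYY-MM-DD` to mark an affiliation that ended before that date. That line format is an assumption modelled on upstream gitdm conventions, and the `#` comment handling is assumed as well; check both against the actual cncf-config/email-map file. The function name `parse_email_map` and the sample addresses are hypothetical.

```python
from datetime import date


def parse_email_map(text):
    """Parse assumed 'email employer [< YYYY-MM-DD]' lines.

    Returns {email: [(employer, end_date_or_None), ...]} so that one
    email can carry both historical and current affiliations.
    """
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip lines without an employer
        email, rest = parts
        end = None
        if "<" in rest:
            rest, _, stamp = rest.rpartition("<")
            end = date.fromisoformat(stamp.strip())
        mapping.setdefault(email.lower(), []).append((rest.strip(), end))
    return mapping


# Hypothetical entries, for illustration only.
sample = """\
# hypothetical entries
jane@example.com Example Corp < 2016-01-01
jane@example.com Other Inc
jane.old@example.org Other Inc
"""
affiliations = parse_email_map(sample)
```

Keeping a list per email is one way to hold both historical and current affiliations for the same address.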
There is also a long list of unknown emails. For that, scroll to the section called "Developers with unknown affiliation:" in all.txt.
All of those were searched for in various sources but we were not able to find their affiliation.
Detailed Description
Regenerating all data with ./rerun_data.sh means:
- Data for the kubernetes/kubernetes repository (all time) with 3 mappings of Unknown developers: no mapping (list them with their email & name), map them to their email domain (user@gmail.com --> 'Gmail *'), map all of them to '(Unknown)'. This is done by running ./all.sh, ./all_no_map.sh, ./all_with_map.sh. Output goes to the kubernetes/all_time/ directory.
- Data for the kubernetes/kubernetes repository divided into releases v1.0.0, v1.1.0, ..., v1.7.0 (with the 3 types of mappings described above). This is done via ./rels.sh, ./rels_strict.sh, ./rels_no_map.sh. Output goes to the kubernetes/v1.X.0-v1.Y.0/ directories (X=0,1,2,3,4,5,6; Y=1,2,3,4,5,6,7).
After performing those two steps, the cncfdm.py output needs to be analysed. This is done by calling ./analysis_all.sh (analyses all-time results) and then ./analysis_rels.sh (for per-release data).
- Data for all 68 repos (currently) that make up the entire Kubernetes project, generated with the ./kubernetes_repos.sh script.
Final files generated by first 2 calls (for single repo kubernetes/kubernetes) are in kubernetes/all_time/*.txt and ./kubernetes/v1.X.0-v1.Y.0/*.txt
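The three ways of handling unknown developers described above can be sketched as a single function. This is only an illustration: the mode names and the rule for turning a domain into a label (first domain component, capitalized) are assumptions chosen to reproduce the user@gmail.com --> 'Gmail *' example, not a description of how cncfdm.py does it.

```python
def label_unknown(name, email, mode):
    """Return the employer label for a developer with no known affiliation.

    mode (hypothetical names):
      'none'    -> keep the developer visible by name and email
      'domain'  -> group by email domain, e.g. 'Gmail *'
      'unknown' -> fold everyone into '(Unknown)'
    """
    if mode == "none":
        return f"{name} <{email}>"
    if mode == "domain":
        domain = email.rsplit("@", 1)[-1]
        # Assumed rule: first component of the domain, capitalized, plus ' *'.
        return domain.split(".")[0].capitalize() + " *"
    if mode == "unknown":
        return "(Unknown)"
    raise ValueError(f"unsupported mode: {mode}")


labels = [label_unknown("Some Dev", "user@gmail.com", m)
          for m in ("none", "domain", "unknown")]
# labels == ['Some Dev <user@gmail.com>', 'Gmail *', '(Unknown)']
```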
All scripts are configured to ignore commits related to files from vendor and Godeps directories.
This is because external sources are placed there and many commits just add external libraries. Accounting for them would make the results less accurate.
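To illustrate the effect of that exclusion, here is a minimal sketch that sums added and removed lines from `git log --numstat` style records (tab-separated `added`, `removed`, `path`, with `-` for binary files) while skipping anything under a vendor or Godeps directory. The helper names are hypothetical, matching the directory at any depth is an assumption, and the real scripts may apply the filter differently.

```python
EXCLUDED_DIRS = ("vendor", "Godeps")


def is_excluded(path):
    """True if any directory component of the path is vendor or Godeps."""
    return any(part in EXCLUDED_DIRS for part in path.split("/")[:-1])


def count_changes(numstat_lines):
    """Sum (added, removed) over numstat records, ignoring excluded paths."""
    added = removed = 0
    for line in numstat_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # not a numstat record (commit header, blank line, ...)
        a, r, path = fields
        if a == "-" or r == "-":
            continue  # binary file: no line counts available
        if is_excluded(path):
            continue
        added += int(a)
        removed += int(r)
    return added, removed


records = [
    "10\t2\tpkg/api/types.go",
    "5000\t0\tvendor/github.com/x/y.go",
    "-\t-\tdocs/logo.png",
    "3\t1\tGodeps/Godeps.json",
]
totals = count_changes(records)  # (10, 2)
```

In this sample, the single vendored file would otherwise account for almost all added lines, which is the distortion the exclusion avoids.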
All of them pipe a git log call with specific arguments into a cncfdm.py call with specific parameters.
See ./run.sh for an example. All other calls use the same git log and cncfdm.py commands with other parameters.
For a list of cncfdm.py parameters, see the comments inside the cncfdm.py file, which describe all possible options.
For more details about how the cncfdm.py tool works, refer to its sources and the other *.py files.
Those files are analysed by ./analysis_all.sh and ./analysis_rels.sh.
The first one calls:
ruby analysis.rb all kubernetes/all_time/first_run_patch.txt kubernetes/all_time/run_no_map_patch.txt kubernetes/all_time/run_with_map_patch.txt
The second calls:
ruby analysis.rb v1.0_v1.1 kubernetes/*/output_strict_patch.txt kubernetes/*/output_patch.txt kubernetes/*/output_no_map_patch.txt
This ruby tool expects to get 3 files: one with no mapping of unknown developers, a 2nd with mapping to a domain name, and a 3rd with mapping to (Unknown).
The output of this analysis.rb tool goes to project/<prefix>_<key>_<type>.csv files.
<prefix>: can be all or v1.X.0-v1.Y.0 - it means that the file is for all time data or for a specific release of kubernetes/kubernetes
<key>: can be changeset, employers, lines, signoffs - it means that the file contains data sorted by this <key> desc.
<type>: can be sum, top, all:
- all means that the file contains all data for the given <prefix> sorted by <key> desc (header is: idx,company,n,percent which means n-th, company name, n developers, % all developers). All known is the sum of all detected developers.
- top means that there will be the top 10 data from all, but it must also contain data for: '(Unknown)', 'Gmail *', 'Qq *', 'Outlook *', 'Yahoo *', 'Hotmail *', '(Independent)', '(Not Found)'. The header is the same as in all.
- sum contains a summary value for all found developers. It has a different header: N companies,sum,percent which means the number of developers' companies found, the sum of <key> for all found developers, and the % of the sum of <key> as a part of the sum of <key> for all developers.
- Special names: All known (sum of all known developers), (Independent) (developers working on their own), (Not Found) (developers for whom an employer was not found even though the search was done in multiple sources), (Unknown) (developers not mapped (yet?)), Some name * (sum of developers having emails on the Some name domain). An asterisk * is added to indicate this.
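The relationship between the all and top files described above can be sketched as follows. This is not the analysis.rb implementation: the function names are hypothetical, the percent is assumed to be relative to the sum of the rows passed in, and the idx of each row is assumed to be its rank in the full list.

```python
import csv
import io

# Names that the top file must contain even when outside the top 10.
ALWAYS_LISTED = (
    "(Unknown)", "Gmail *", "Qq *", "Outlook *",
    "Yahoo *", "Hotmail *", "(Independent)", "(Not Found)",
)


def top_rows(all_rows, limit=10):
    """all_rows: list of (company, n) already sorted by n descending."""
    total = sum(n for _, n in all_rows)
    selected = []
    for idx, (company, n) in enumerate(all_rows, start=1):
        if idx <= limit or company in ALWAYS_LISTED:
            percent = round(100.0 * n / total, 2) if total else 0.0
            selected.append((idx, company, n, percent))
    return selected


def to_csv(rows):
    """Render rows with the idx,company,n,percent header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["idx", "company", "n", "percent"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical data: 12 companies, one special name, one small company.
rows = [(f"Company {i}", 100 - i) for i in range(12)]
rows += [("Gmail *", 5), ("Tiny LLC", 1)]
top = top_rows(rows)
report = to_csv(top)
```

With this sample, `top` holds the first ten companies plus 'Gmail *', which keeps its original rank of 13.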
This data is directly used for the "Who writes Kubernetes" report.
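For orientation, the naming scheme project/<prefix>_<key>_<type>.csv can be enumerated with a short sketch. It assumes that release prefixes pair consecutive releases (v1.0.0-v1.1.0 through v1.6.0-v1.7.0), which is an inference from the X and Y ranges given earlier, and that every key and type combination is produced; neither is confirmed here. The function names are hypothetical.

```python
from itertools import product

KEYS = ("changeset", "employers", "lines", "signoffs")
TYPES = ("sum", "top", "all")


def release_prefixes(first_minor=0, last_minor=7):
    """Assumed: one prefix per pair of consecutive v1.X.0 releases."""
    return [f"v1.{x}.0-v1.{x + 1}.0" for x in range(first_minor, last_minor)]


def expected_csv_files(prefixes):
    """Build project/<prefix>_<key>_<type>.csv for every combination."""
    return [
        f"project/{prefix}_{key}_{kind}.csv"
        for prefix, key, kind in product(prefixes, KEYS, TYPES)
    ]


files = expected_csv_files(["all"] + release_prefixes())
```

A list like this can be compared against the contents of the project directory to spot outputs that were not generated.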
./kubernetes_repos.sh script is used to generate all-time data for all the kubernetes repos.
To use it, you must have all of the Kubernetes repositories (68 from 3 different organizations) cloned in ~/dev/go/src/k8s.io/.
Orgs are: kubernetes, kubernetes-incubator, kubernetes-client.
It generates statistics for each single repo via:
./anyrepo.sh ~/dev/go/src/k8s.io/<repo-name> <repo-name>
See details in ./kubernetes_repos.sh.
<repo-name> is a directory where a given kubernetes repository is cloned.
To clone a repository, do:
cd ~/dev/go/src/k8s.io/
git clone https://github.com/<one-of-3-kubernetes-orgs>/<kubernetes-repo-name>.git
one-of-3-kubernetes-orgs: kubernetes, kubernetes-incubator and kubernetes-client
kubernetes-repo-name: please look up all repo names in all kubernetes orgs on GitHub.
./anyrepo.sh just calls cncfdm.py with appropriate args (like exclude vendor dir numstat etc).
There is also ./anyreporange.sh that allows querying a repo for a specific time range (cncfdm.py supports that as well).
Output of this goes to repos/<repo-name>.<ext>
<repo-name>: repository name ./anyrepo.sh was called with.
<ext>: txt, csv, html, out. txt is the main data file; csv dumps the list of employers in the given repo; html is the same as txt but in HTML format; out holds the cncfdm.py verbose output messages (for debugging).
Finally, ./kubernetes_repos.sh calls ./multirepo.sh with all 68 repository directories listed.
It gathers the git log of each of them, concatenates all those logs, and then runs cncfdm.py on the concatenated result (see ./multirepo.sh).
Results are saved to repos/combined.<ext>, where <ext> is the same as for anyrepo.sh.
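The gather-and-concatenate step can be sketched in a few lines. This is an illustration rather than a port of ./multirepo.sh: the function names are hypothetical, and the git log flags shown (`--numstat -M`) are an assumption, so consult the script for the arguments it actually passes. The `run` parameter exists only so the sketch can be exercised without real repositories.

```python
import subprocess


def git_log_command(repo_dir):
    """Command that prints the log of one repository (flags are assumed)."""
    return ["git", "-C", str(repo_dir), "log", "--numstat", "-M"]


def combined_log(repo_dirs, run=subprocess.run):
    """Concatenate the git logs of several repositories into one string."""
    chunks = []
    for repo in repo_dirs:
        result = run(
            git_log_command(repo),
            capture_output=True,
            text=True,
            check=True,  # fail loudly if a repository is missing
        )
        chunks.append(result.stdout)
    return "".join(chunks)


# Example use (requires the repositories to exist locally):
# log_text = combined_log(["/path/to/repo-a", "/path/to/repo-b"])
# The combined text would then be fed to cncfdm.py on standard input.
```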
The typical workflow is re-running ./kubernetes_repos.sh and examining repos/combined.txt for unknown developers.
Research them on Google, Clearbit, FullContact, GitHub, LinkedIn, Facebook, or any other source, then update cncf-config/<filename> and re-run ./kubernetes_repos.sh.
<filename>: usually in this order
