CNCF gitdm

This is the Cloud Native Computing Foundation's fork of Jon Corbet and Greg KH's gitdm tool for calculating contributions based on developers and their companies. Companies and developers can check if they are correctly attributed at the following links:

Company Developers list: [co1], [co2], [co3], [co4], [co5], [co6], [co7], [co8].

Developers affiliations list: [dev1], [dev2], [dev3], [dev4], [dev5].

New affiliations are imported into DevStats about 1-2 times/month.

DevStats

This repository is used as a source of affiliations for all DevStats projects. The final affiliations JSON is periodically imported by the DevStats project.

Adding/Updating affiliation

If you find any errors or missing affiliations in those lists, please submit a pull request with edits to developers affiliations files: [dev1], [dev2], [dev3], [dev4], [dev5], ...

Please note that we need both current and historical email addresses here because we process data from GitHub Archives, which still contain old addresses (even if they are no longer in use).

Only the Developers affiliations list [dev1], [dev2], [dev3], [dev4], [dev5], ... should be edited manually.

Company Developers lists [co1], [co2], [co3], [co4], [co5], [co6], [co7], [co8] are computed derivatives of the first list.

Other files used for affiliations are the email map file and github users file.

Please note that cncf/gitdm affiliations are imported into DevStats (cncf/devstats) once every 4 weeks.

Removing affiliations

If you do not want to have your email listed here please read how to remove your email.

Testing changes

You can test any changes locally by cloning this repository and regenerating all data by running ./rerun_data.sh.

Then generate config files by running: ./import_affs.sh.

If those two files are out of sync, the tool will notify you about this.

This tool will generate a new email-map file.

Check that your changes were processed properly, then move the file to cncf-config/email-map, replacing the existing one.

Sync workflow

Please follow the instructions from SYNC.md.

Running

Use the *.sh scripts to run analytics (all*.sh for full analysis and rels*.sh for per-release stats).

This program assumes that gitdm resides in ~/dev/cncf/gitdm/ and that kubernetes is in ~/dev/go/src/k8s.io/kubernetes/.

Output files are placed in the kubernetes directory.

To regenerate all statistics just run: ./rerun_data.sh

This is an iterative process: Run any of the scripts. Review its output in the kubernetes directory. Iteratively adjust mappings to handle more authors.

You can also run via ./debug.sh to halt in the debugger and review the hacker's structure and the developers who were not found. See cncfdm.py:DebugUnknowns.

Final report:

Data

Report

Contributing

Pull requests are welcome.

Our mapping is never complete; please see the configuration files in Config files.

The email-map file is a direct email-to-employer mapping.
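As an illustration of what a direct email-to-employer mapping looks like in practice, here is a minimal sketch of a parser. It assumes a line-oriented layout (email, whitespace, employer name, with `#` starting a comment); that layout, the function name `parse_email_map`, and the sample entries are assumptions made for this example, so check the real cncf-config/email-map file before relying on it.

```python
def parse_email_map(text):
    """Parse email-map style text into a dict of {email: employer}.

    Assumed layout (not confirmed by this README): one mapping per line,
    the email first, then whitespace, then the employer name.
    """
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        parts = line.split(None, 1)  # email, then the rest as employer
        if len(parts) != 2:
            continue  # skip malformed lines rather than guessing
        email, employer = parts
        mapping[email.lower()] = employer.strip()
    return mapping


# Hypothetical entries, not taken from the real file.
sample = """
# comment line
jane@example.com Example Corp
john@example.org (Independent)
"""
print(parse_email_map(sample))
```

Lower-casing the email is a design choice of this sketch so that lookups are case-insensitive; the real tool may behave differently.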

There is also a long list of unknown emails. To see it, scroll to the section called "Developers with unknown affiliation:" in all.txt.

All of those were searched for in various sources but we were not able to find their affiliation.

Detailed Description

Regenerating all data with ./rerun_data.sh means:

  • Data for the kubernetes/kubernetes repository (all time) with 3 mappings of unknown developers: no mapping (list them with their email and name), mapping to their email domain (user@gmail.com --> 'Gmail *'), and mapping all of them to '(Unknown)'. This is done by running ./all.sh, ./all_no_map.sh, and ./all_with_map.sh. Output goes to the kubernetes/all_time/ directory.
  • Data for the kubernetes/kubernetes repository divided into releases v1.0.0, v1.1.0, ..., v1.7.0 (with the 3 types of mappings described above). This is done by running ./rels.sh, ./rels_strict.sh, and ./rels_no_map.sh. Output goes to the kubernetes/v1.X.0-v1.Y.0/ directories (X=0,1,2,3,4,5,6; Y=1,2,3,4,5,6,7).
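The three treatments of unknown developers described above can be sketched as follows. This is only an illustration: the function names and the `mode` values are hypothetical, and the rule used to derive a label such as 'Gmail *' from a domain (capitalise the first domain label and append ' *') is an assumption inferred from the user@gmail.com example, not the logic of cncfdm.py.

```python
UNKNOWN = "(Unknown)"


def domain_label(email):
    """Turn user@gmail.com into 'Gmail *' (assumed rule, see lead-in)."""
    domain = email.rsplit("@", 1)[-1]
    first_label = domain.split(".")[0]
    return first_label.capitalize() + " *"


def map_unknown(name, email, mode):
    """Return the employer label for a developer with no known affiliation.

    mode is a hypothetical switch for this sketch:
      "none"    - no mapping, keep the developer's name and email
      "domain"  - map to the email domain, e.g. 'Gmail *'
      "unknown" - map everyone to '(Unknown)'
    """
    if mode == "none":
        return f"{name} <{email}>"
    if mode == "domain":
        return domain_label(email)
    if mode == "unknown":
        return UNKNOWN
    raise ValueError(f"unsupported mode: {mode}")


for mode in ("none", "domain", "unknown"):
    print(mode, "->", map_unknown("Jane Doe", "user@gmail.com", mode))
```

Note that this simple rule only looks at the first domain label, so a multi-level domain would need extra handling.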

After performing those two steps, the cncfdm.py output needs to be analysed. This is done by calling ./analysis_all.sh (analyses all-time results) and then ./analysis_rels.sh (for per-release data).

Data for all 68 repositories (currently) that make up the entire Kubernetes project is generated with the ./kubernetes_repos.sh script.

The final files generated by the first 2 steps (for the single repo kubernetes/kubernetes) are in kubernetes/all_time/*.txt and kubernetes/v1.X.0-v1.Y.0/*.txt.

All scripts are configured to ignore commits related to files from the vendor and Godeps directories. External sources are placed there, and many commits just add external libraries. Accounting for them would make the results less accurate.
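To make the exclusion concrete, here is a minimal sketch that drops per-file lines for paths under vendor or Godeps from `git log --numstat` style output (tab-separated added, deleted, path). The helper names are hypothetical, and the repository's scripts may implement the exclusion differently, for example through options passed to cncfdm.py.

```python
from pathlib import PurePosixPath

EXCLUDED_DIRS = {"vendor", "Godeps"}


def is_external(path):
    """True if any directory component of the path is vendor or Godeps."""
    parts = PurePosixPath(path).parts
    return any(part in EXCLUDED_DIRS for part in parts[:-1])


def filter_numstat(lines):
    """Keep every line except numstat entries that touch external code."""
    kept = []
    for line in lines:
        fields = line.split("\t")
        # numstat entries have exactly three tab-separated fields
        if len(fields) == 3 and is_external(fields[2]):
            continue
        kept.append(line)
    return kept


sample = [
    "120\t0\tvendor/github.com/example/lib/lib.go",
    "3\t1\tpkg/api/types.go",
]
print(filter_numstat(sample))
```

This sketch does not handle the rename notation that `git log -M` can emit for paths; treat it as a starting point only.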

All of the scripts pipe a git log call with specific arguments into a cncfdm.py call with specific parameters.

See ./run.sh for an example. All other calls use the same commands git log and cncfdm.py with other parameters.
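The shape of that pipeline can be sketched in Python as below. This is an assumption-laden illustration, not a copy of ./run.sh: the `git log --numstat -M` arguments are a common way to feed gitdm-style tools, the `./cncfdm.py` invocation path is assumed, and the cncfdm.py options are deliberately left to the caller because this README does not list them.

```python
import subprocess


def build_pipeline(repo_dir, cncfdm_args):
    """Return the two argv lists for: git log ... | cncfdm.py ..."""
    # Assumed git arguments; the real scripts may use different ones.
    git_cmd = ["git", "-C", repo_dir, "log", "--numstat", "-M"]
    # Assumed invocation; options come from the comments in cncfdm.py.
    cncfdm_cmd = ["./cncfdm.py", *cncfdm_args]
    return git_cmd, cncfdm_cmd


def run_pipeline(repo_dir, cncfdm_args):
    """Stream the git log into cncfdm.py and return its exit code."""
    git_cmd, cncfdm_cmd = build_pipeline(repo_dir, cncfdm_args)
    git = subprocess.Popen(git_cmd, stdout=subprocess.PIPE)
    try:
        result = subprocess.run(cncfdm_cmd, stdin=git.stdout, check=False)
    finally:
        git.stdout.close()
        git.wait()
    return result.returncode


# Only build and show the commands here; nothing is executed.
print(build_pipeline("/path/to/kubernetes", []))
```

Streaming through a pipe avoids holding the whole log in memory, which matters for a repository with a long history.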

To get a list of parameters for cncfdm.py, see comments inside of the cncfdm.py file describing all possible options.

For more details about how the cncfdm.py tool works, refer to its source and the other *.py files.

The generated *.txt files are analysed by ./analysis_all.sh and ./analysis_rels.sh.

The first one calls: ruby analysis.rb all kubernetes/all_time/first_run_patch.txt kubernetes/all_time/run_no_map_patch.txt kubernetes/all_time/run_with_map_patch.txt

The second calls: ruby analysis.rb v1.0_v1.1 kubernetes/*/output_strict_patch.txt kubernetes/*/output_patch.txt kubernetes/*/output_no_map_patch.txt

This Ruby tool expects to get 3 files: one with no mapping of unknown developers, a second with mapping to a domain name, and a third with mapping to (Unknown).

The output of the analysis.rb tool goes to project/<prefix>_<key>_<type>.csv files.

<prefix> can be all or v1.X.0-v1.Y.0: the file is for all-time data or for a specific release of kubernetes/kubernetes.

<key> can be changeset, employers, lines, or signoffs: the file contains data sorted by this <key> in descending order.

<type> can be sum, top, or all:

  • all means that the file contains all data for the given <prefix>, sorted by <key> in descending order. The header is idx,company,n,percent, which means: n-th, company name, n developers, % of all developers. All known is the sum of all detected developers.
  • top means that the file contains the top 10 entries from all, plus entries for: '(Unknown)', 'Gmail *', 'Qq *', 'Outlook *', 'Yahoo *', 'Hotmail *', '(Independent)', '(Not Found)'. The header is the same as in all.
  • sum contains a summary value for all found developers. It has a different header: N companies,sum,percent, which means: the number of developers' companies found, the sum of <key> for all found developers, and that sum as a percentage of the sum of <key> for all developers.
  • Special names: All known (sum of all known developers), (Independent) (developers working on their own), (Not Found) (developers for whom an employer was not found even though the search was done in multiple sources), (Unknown) (developers not mapped (yet?)), Some name * (sum of developers having emails on the Some name domain; the asterisk * indicates this).
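The relationship between the all and top files can be sketched as follows. This is a rough illustration rather than the logic of analysis.rb: it assumes the all data is already sorted in descending order, that "top 10" means the first 10 rows, and that the special names are appended only when they are not already among those rows. The function name and sample data are hypothetical.

```python
import csv
import io

# Special names that a top file must always contain, per the list above.
SPECIAL = [
    "(Unknown)", "Gmail *", "Qq *", "Outlook *",
    "Yahoo *", "Hotmail *", "(Independent)", "(Not Found)",
]


def top_rows(all_csv_text, limit=10):
    """Select the first `limit` rows plus any special-name rows below them."""
    rows = list(csv.DictReader(io.StringIO(all_csv_text)))
    top = rows[:limit]
    seen = {row["company"] for row in top}
    for row in rows[limit:]:
        if row["company"] in SPECIAL and row["company"] not in seen:
            top.append(row)
            seen.add(row["company"])
    return top


# Hypothetical data using the idx,company,n,percent header.
sample = (
    "idx,company,n,percent\n"
    "1,Company A,5,50\n"
    "2,Company B,3,30\n"
    "3,Company C,1,10\n"
    "4,Gmail *,1,10\n"
)
for row in top_rows(sample, limit=2):
    print(row["idx"], row["company"], row["n"], row["percent"])
```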

This data is directly used for the "Who writes Kubernetes" report.

The ./kubernetes_repos.sh script is used to generate all-time data for all the Kubernetes repos.

To use it, you must have all of the Kubernetes repositories (68 from 3 different organizations) cloned in ~/dev/go/src/k8s/.

Orgs are: kubernetes, kubernetes-incubator, kubernetes-client.

It generates statistics for each single repo via: ./anyrepo.sh ~/dev/go/src/k8s.io/<repo-name> <repo-name>

See details in ./kubernetes_repos.sh. <repo-name> is a directory where a given kubernetes repository is cloned.

To clone a repository, run: cd ~/dev/go/src/k8s/ && git clone https://github.com/<one-of-3-kubernetes-orgs>/<kubernetes-repo-name>.git

<one-of-3-kubernetes-orgs>: kubernetes, kubernetes-incubator, or kubernetes-client.

<kubernetes-repo-name>: please look up all repo names in all Kubernetes orgs on GitHub.

./anyrepo.sh just calls cncfdm.py with appropriate arguments (such as excluding the vendor directory, numstat, etc.).

There is also ./anyreporange.sh that allows querying a repo for a specific time range (cncfdm.py supports that as well).

The output goes to repos/<repo-name>.<ext>, where <repo-name> is the repository name that ./anyrepo.sh was called with and <ext> is one of txt, csv, html, out:

  • txt: main data file
  • csv: list of employers in the given repo
  • html: the same as txt but in HTML format
  • out: cncfdm.py verbose output messages (for debugging)

Finally, ./kubernetes_repos.sh calls: ./multirepo.sh with all 68 repository directories listed.

It gathers the git log of each repository, concatenates all those logs, and then runs cncfdm.py on the concatenated result (see ./multirepo.sh).

Results are saved to repos/combined.<ext>, where <ext> is the same as for ./anyrepo.sh.
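The gather-and-concatenate step can be sketched as below. This is a simplified illustration and not the contents of ./multirepo.sh: the function names are hypothetical, the git arguments are assumed, and feeding the result to cncfdm.py is left out.

```python
import subprocess


def git_log(repo_dir):
    """Return the git log of one repository as text (assumed arguments)."""
    return subprocess.run(
        ["git", "-C", repo_dir, "log", "--numstat", "-M"],
        check=True, capture_output=True, text=True, errors="replace",
    ).stdout


def combined_log(repo_dirs, get_log=git_log):
    """Concatenate the logs of all repositories into one string."""
    chunks = []
    for repo_dir in repo_dirs:
        log = get_log(repo_dir)
        # Make sure one repository's log cannot run into the next one.
        if log and not log.endswith("\n"):
            log += "\n"
        chunks.append(log)
    return "".join(chunks)
```

The `get_log` parameter exists so the concatenation can be tried without real repositories, for example `combined_log(["a", "b"], get_log=lambda d: "commit from " + d)`.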

The typical workflow is re-running ./kubernetes_repos.sh and examining repos/combined.txt for unknown developers.

Then research them on Google, Clearbit, FullContact, GitHub, LinkedIn, Facebook, or any other source, update cncf-config/<filename>, and re-run ./kubernetes_repos.sh. <filename>: usually in this order
