Credit: IBM
Using data science to manage a software project in a GitHub organization, Part 2
In Part 1 of this series, you created the basic structure of a data science project and downloaded the data programmatically from GitHub, transforming it so that it could be statistically analyzed with pandas. Here in Part 2, you use Jupyter Notebook to explore many aspects of a software project and learn how to deploy the project to the Python Package Index, both as a library and a command-line tool.
Explore a GitHub organization using Jupyter Notebook
In the following sections, I explain how to use Jupyter Notebook to analyze and
evaluate the development shop of a GitHub organization.
Pallets project analysis
As I pointed out in Part 1, one of the issues with looking at only a single
repository is that it is only part of the data. The code that you created
in Part 1 gives you the ability to clone an entire organization — with all
of its repositories — and analyze it.
An example of a GitHub organization is the well-known Pallets project,
which has multiple projects such as Click and Flask. The following steps
detail how to perform a Jupyter Notebook analysis on the Pallets project.
- To start Jupyter from the command line, type jupyter notebook. Then, import the libraries that you will use:

In [3]: import sys;sys.path.append("..")
   ...: import pandas as pd
   ...: from pandas import DataFrame
   ...: import seaborn as sns
   ...: import matplotlib.pyplot as plt
   ...: from sklearn.cluster import KMeans
   ...: %matplotlib inline
   ...: from IPython.core.display import display, HTML
   ...: display(HTML("<style>.container { width:100% !important; }</style>"))
- Next, run the code to download the organization:
In [4]: from devml import (mkdata, stats, state, fetch_repo, ts)

In [5]: dest, token, org = state.get_project_metadata("../project/config.json")

In [6]: fetch_repo.clone_org_repos(token, org,
   ...:     dest, branch="master")
Out[6]:
[<git.Repo "/tmp/checkout/flask/.git">,
 <git.Repo "/tmp/checkout/pallets-sphinx-themes/.git">,
 <git.Repo "/tmp/checkout/markupsafe/.git">,
 <git.Repo "/tmp/checkout/jinja/.git">,
 <git.Repo "/tmp/checkout/werkzeug/.git">,
 <git.Repo "/tmp/checkout/itsdangerous/.git">,
 <git.Repo "/tmp/checkout/flask-website/.git">,
 <git.Repo "/tmp/checkout/click/.git">,
 <git.Repo "/tmp/checkout/flask-snippets/.git">,
 <git.Repo "/tmp/checkout/flask-docs/.git">,
 <git.Repo "/tmp/checkout/flask-ext-migrate/.git">,
 <git.Repo "/tmp/checkout/pocoo-sphinx-themes/.git">,
 <git.Repo "/tmp/checkout/website/.git">,
 <git.Repo "/tmp/checkout/meta/.git">]
- With the code living on disk, convert it to a pandas DataFrame:
In [7]: df = mkdata.create_org_df(path="/tmp/checkout")

In [9]: df.describe()
Out[9]:
       commits
count   8315.0
mean       1.0
std        0.0
min        1.0
25%        1.0
50%        1.0
75%        1.0
max        1.0
- Calculate the active days:
In [10]: df_author_ud = stats.author_unique_active_days(df)
    ...:

In [11]: df_author_ud.head(10)
Out[11]:
               author_name  active_days active_duration  active_ratio
86          Armin Ronacher          941       3817 days          0.25
499   Markus Unterwaditzer          238       1767 days          0.13
216             David Lord           94        710 days          0.13
663            Ron DuPlain           56        854 days          0.07
297           Georg Brandl           41       1337 days          0.03
196       Daniel Neuhäuser           36        435 days          0.08
169      Christopher Grebs           27       1515 days          0.02
665     Ronny Pfannschmidt           23       2913 days          0.01
448       Keyan Pishdadian           21        882 days          0.02
712            Simon Sapin           21        793 days          0.03
- Create a seaborn plot by using sns.barplot to plot the top 10 contributors to the organization by the days that they are active in the project (that is, the days on which they actually checked in code); a sketch of the plotting call follows this step. It is no surprise that the main author of many of the projects is almost three times more active than any other contributor.

Figure 1. Seaborn active days plot
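The exact plotting call is not shown in the listing above. A minimal sketch that would produce a chart like Figure 1, assuming the df_author_ud DataFrame from the previous step, could look like this:

# Sketch only: plot the top 10 contributors by unique active days.
# Assumes df_author_ud has the author_name and active_days columns shown above.
plt.figure(figsize=(10, 6))
top10 = df_author_ud.head(10)
sns.barplot(y="author_name", x="active_days", data=top10)
plt.title("Top 10 Pallets contributors by active days")
plt.show()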
You could probably extrapolate similar observations for closed source
projects across all of the repositories in a company. “Active days” could
be a useful metric to show engagement, and it could be part of many
metrics used to measure the effectiveness of teams and projects.
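The stats.author_unique_active_days helper hides the aggregation, but the idea behind "active days" is straightforward to reproduce with a pandas groupby. Here is a rough sketch; the commit-date column name commit_date is an assumption for illustration, not devml's actual schema:

import pandas as pd

def unique_active_days(df, author_col="author_name", date_col="commit_date"):
    """Count the distinct calendar days on which each author committed."""
    days = pd.to_datetime(df[date_col]).dt.date   # normalize timestamps to calendar days
    return (df.assign(day=days)
              .groupby(author_col)["day"]
              .nunique()                          # distinct days with at least one commit
              .sort_values(ascending=False)
              .rename("active_days")
              .reset_index())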
CPython project analysis
Next, let’s look at a Jupyter notebook that explores the metadata of the CPython project, the repository used to develop the Python language.
Relative churn
One of the metrics that is generated is called “relative churn.” (See “Related topics” for an article from Microsoft Research about this metric.) Basically, the relative churn principle states that any increase in relative code churn results in an increase in system defect density. In other words, too many changes in a file result in defects.
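One simple way to compute it is to divide the number of changes to a file by its length in lines. A one-line pandas sketch of that calculation (the metadata_df name and its churn_count and line_count columns match the listings in the steps that follow):

# Sketch: relative churn = changes to a file divided by its length in lines.
# Assumes a DataFrame with churn_count and line_count columns.
metadata_df["relative_churn"] = (metadata_df["churn_count"] /
                                 metadata_df["line_count"]).round(3)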
- As before, import the modules needed for the rest of the exploration:
In [1]: import sys;sys.path.append("..")
   ...: import pandas as pd
   ...: from pandas import DataFrame
   ...: import seaborn as sns
   ...: import matplotlib.pyplot as plt
   ...: from sklearn.cluster import KMeans
   ...: %matplotlib inline
   ...: from IPython.core.display import display, HTML
   ...: display(HTML("<style>.container { width:100% !important; }</style>"))
- Generate churn metrics:
In [2]: from devml.post_processing import (git_churn_df, file_len, git_populate_file_metatdata)

In [3]: df = git_churn_df(path="/Users/noahgift/src/cpython")
2017-10-23 06:51:00,256 - devml.post_processing - INFO - Running churn cmd: [git log --name-only --pretty=format:] at path [/Users/noahgift/src/cpython]

In [4]: df.head()
Out[4]:
                                               files  churn_count
0                         b'Lib/test/test_struct.py'          178
1                      b'Lib/test/test_zipimport.py'           78
2                           b'Misc/NEWS.d/next/Core'          351
3                                             b'and'          351
4  b'Builtins/2017-10-13-20-01-47.bpo-31781.cXE9S...            1
- A few filters in pandas can then be used to figure out the top relative churn files with the Python extension. See the output in Figure 2.

In [14]: metadata_df = git_populate_file_metatdata(df)

In [15]: python_files_df = metadata_df[metadata_df.extension == ".py"]
    ...: line_python = python_files_df[python_files_df.line_count > 40]
    ...: line_python.sort_values(by="relative_churn", ascending=False).head(15)
    ...:
Figure 2. Top relative churn in CPython .py files

One observation from this query is that tests have a lot of churn, which might be worth exploring more. Does this mean that the tests themselves also contain bugs? That might be interesting to explore in more detail. Also, there are a couple of Python modules that have extremely high relative churn, such as the string.py module. In looking through the source code for that file, it does look very complex for its size, and it contains metaclasses. It is possible that the complexity has made it prone to bugs. This seems like a module worth further data science exploration.

- Next, you can run some descriptive statistics to look for the median values across the project. These statistics show that over the couple of decades and more than 100,000 commits in the project's history, the median file is about 146 lines long, has been changed five times, and has a relative churn of 10 percent. This leads to the conclusion that the ideal file to create is small and changes few times over the years.

In [16]: metadata_df.median()
Out[16]:
churn_count         5.0
line_count        146.0
relative_churn      0.1
dtype: float64
- Generating a seaborn plot of the relative churn makes the patterns even clearer:

In [18]: import matplotlib.pyplot as plt
    ...: plt.figure(figsize=(10,10))
    ...: python_files_df = metadata_df[metadata_df.extension == ".py"]
    ...: line_python = python_files_df[python_files_df.line_count > 40]
    ...: line_python_sorted = line_python.sort_values(by="relative_churn", ascending=False).head(15)
    ...: sns.barplot(y="files", x="relative_churn", data=line_python_sorted)
    ...: plt.title('Top 15 CPython Absolute and Relative Churn')
    ...: plt.show()
In Figure 3, the regrtest.py module sticks out quite a bit as the most modified file. Again, it makes sense why it has been changed so much: while it is a small file, a regression test can typically be very complicated. This also might be a hot spot in the code that needs to be looked at.

Figure 3. Top relative churn in CPython .py file
Deleted files
Another area of exploration is to look at files that have been deleted throughout the history of a project. Many directions of research could be derived from this exploration, such as predicting whether a file will later be deleted (for example, if its relative churn is too high).
- To look at the deleted files, create another function in the post_processing directory:

import re  # needed for the pattern matching below

FILES_DELETED_CMD = 'git log --diff-filter=D --summary | grep delete'

def files_deleted_match(output):
    """Retrieves files from subprocess output, i.e.:

    wcase/templates/hello.html\n delete mode 100644

    Throws away everything but the path to the file.
    """
    files = []
    integers_match_pattern = '^[-+]?[0-9]+$'
    for line in output.split():
        if line == b"delete":
            continue
        elif line == b"mode":
            continue
        elif re.match(integers_match_pattern, line.decode("utf-8")):
            continue
        else:
            files.append(line)
    return files

This function looks for delete messages in the git log, does some pattern matching, and extracts the files to a list so that a pandas DataFrame can be created.

- Next, use the function in a Jupyter notebook:
In [19]: from devml.post_processing import git_deleted_files
    ...: deletion_counts = git_deleted_files("/Users/noahgift/src/cpython")

To inspect some of the files that have been deleted, view the last few records:

In [21]: deletion_counts.tail()
Out[21]:
                           files     ext
8812  b'Mac/mwerks/mwerksglue.c'      .c
8813        b'Modules/version.c'      .c
8814      b'Modules/Setup.irix5'  .irix5
8815      b'Modules/Setup.guido'  .guido
8816      b'Modules/Setup.minix'  .minix
- See if there is a pattern that appears with deleted files versus files that are kept. To do that, join the deleted files DataFrame:

In [22]: all_files = metadata_df['files']
    ...: deleted_files = deletion_counts['files']
    ...: membership = all_files.isin(deleted_files)
    ...:

In [23]: metadata_df["deleted_files"] = membership

In [24]: metadata_df.loc[metadata_df["deleted_files"] == True].median()
Out[24]:
churn_count       4.000
line_count       91.500
relative_churn    0.145
deleted_files     1.000
dtype: float64

In [25]: metadata_df.loc[metadata_df["deleted_files"] == False].median()
Out[25]:
churn_count         9.0
line_count        149.0
relative_churn      0.1
deleted_files       0.0
dtype: float64

In looking at the median values of the deleted files compared with the files that are still in the repository, you see that there are some differences. Mainly, the relative churn number is higher for the deleted files. Perhaps the problematic files were deleted? It is unknown without more investigation.

- Next, create a correlation heatmap in seaborn on this DataFrame:
In [26]: sns.heatmap(metadata_df.corr(), annot=True)

Figure 4 shows that there is a correlation, a very small positive one, between relative churn and deleted files. This signal might be included in a machine learning model to predict the likelihood of a file being deleted; a sketch of such a model follows this list.

Figure 4. Files deleted correlation heatmap
- Next, a final scatterplot shows some differences between deleted files and files that have remained in the repository:

In [27]: sns.lmplot(x="churn_count", y="line_count", hue="deleted_files", data=metadata_df)

Figure 5 shows three dimensions: line counts, churn counts, and the category of True/False for a deleted file.

Figure 5. Scatterplot of line counts and churn counts
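To make the prediction idea concrete, here is a minimal sketch, not part of the devml library, of a model that uses the churn features built above to predict the deleted_files flag. The feature choice and the use of scikit-learn's LogisticRegression are assumptions for illustration only:

# Hypothetical sketch: predict whether a file is deleted from its churn profile.
# Assumes metadata_df has churn_count, line_count, relative_churn, and the
# boolean deleted_files column created in the steps above.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

features = ["churn_count", "line_count", "relative_churn"]
model_df = metadata_df.dropna(subset=features + ["deleted_files"])

X = model_df[features]
y = model_df["deleted_files"].astype(int)   # True/False -> 1/0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)
print("Holdout accuracy:", clf.score(X_test, y_test))

Given the weak correlation seen in Figure 4, a model like this would probably need additional features before it became genuinely useful, but it shows how the exploration could feed a predictive workflow.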
Deploying a project to the Python Package Index

With all of the hard work performed in creating a library and command-line tool, it makes sense to share the project with other people by submitting it to the Python Package Index. There are only a few steps to do this:
- Create an account on https://pypi.python.org/pypi.
- Install twine:
pip install twine
- Create a setup.py file. The two parts that are the most important are the packages section, which ensures that the library is installed, and the scripts section. The scripts section includes the dml script that we used throughout this article.

import sys
if sys.version_info < (3,6):
    sys.exit('Sorry, Python < 3.6 is not supported')

import os

from setuptools import setup
from devml import __version__

if os.path.exists('README.rst'):
    LONG = open('README.rst').read()

setup(
    name='devml',
    version=__version__,
    url='https://github.com/noahgift/devml',
    license='MIT',
    author='Noah Gift',
    author_email='consulting@noahgift.com',
    description="""Machine Learning, Statistics and Utilities around Developer Productivity, Company Productivity and Project Productivity""",
    long_description=LONG,
    packages=['devml'],
    include_package_data=True,
    zip_safe=False,
    platforms='any',
    install_requires=[
        'pandas',
        'click',
        'PyGithub',
        'gitpython',
        'sensible',
        'scipy',
        'numpy',
    ],
    classifiers=[
        'Development Status :: 4 - Beta',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python',
        'Programming Language :: Python :: 3.6',
        'Topic :: Software Development :: Libraries :: Python Modules'
    ],
    scripts=["dml"],
)
The scripts directive then installs the dml tool into the path of all users who pip install the module.

- Add a deploy step to the Makefile:

deploy-pypi:
	pandoc --from=markdown --to=rst README.md -o README.rst
	python setup.py check --restructuredtext --strict --metadata
	rm -rf dist
	python setup.py sdist
	twine upload dist/*
	rm -f README.rst
- Finally, deploy:
(.devml) ➜  devml git:(master) ✗ make deploy-pypi
pandoc --from=markdown --to=rst README.md -o README.rst
python setup.py check --restructuredtext --strict --metadata
running check
rm -rf dist
python setup.py sdist
running sdist
running egg_info
writing devml.egg-info/PKG-INFO
writing dependency_links to devml.egg-info/dependency_links.txt
....
running check
creating devml-0.5.1
creating devml-0.5.1/devml
creating devml-0.5.1/devml.egg-info
copying files to devml-0.5.1...
....
Writing devml-0.5.1/setup.cfg
creating dist
Creating tar archive
removing 'devml-0.5.1' (and everything under it)
twine upload dist/*
Uploading distributions to https://upload.pypi.org/legacy/
Enter your username:
Conclusion
Part 1 of this series showed you how to create a basic data science skeleton and explained its parts. Here in Part 2, you performed an in-depth data exploration with Jupyter Notebook, using the code built in Part 1. You also learned how to deploy the project to the Python Package Index.
This article should be a good building block for other data science
developers to study as they build solutions that can be delivered as a
Python library and a command line tool.
Related topics