
Explore your project with Jupyter Notebooks and deploy it to the Python Package Index

January 18, 2019

Credit: IBM

Using data science to manage a software project in a GitHub organization, Part 2




In Part 1 of this series, you created the basic structure of a data science project and downloaded the data programmatically from GitHub, transforming it so that it could be statistically analyzed with pandas. Here in Part 2, you use Jupyter Notebook to explore many aspects of a software project and learn how to deploy the project to the Python Package Index, both as a library and as a command line tool.

Explore a GitHub organization using Jupyter Notebook

In the following sections, I explain how to use Jupyter Notebook to analyze and
evaluate the development shop of a GitHub organization.

Pallets project analysis

As I pointed out in Part 1, one of the issues with looking at only a single
repository is that it is only part of the data. The code that you created
in Part 1 gives you the ability to clone an entire organization — with all
of its repositories — and analyze it.
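
Under the hood, devml depends on PyGithub and GitPython (both appear in the install_requires list later in this article). A rough sketch of the cloning mechanics, not devml's actual code, looks like this:

    import os

    from git import Repo        # GitPython
    from github import Github   # PyGithub

    def clone_org_repos_sketch(token, org_name, dest):
        """Sketch: clone every repository in a GitHub organization."""
        client = Github(token)
        org = client.get_organization(org_name)
        checkouts = []
        for repo in org.get_repos():
            path = os.path.join(dest, repo.name)
            checkouts.append(Repo.clone_from(repo.clone_url, path))
        return checkouts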

An example of a GitHub organization is the well-known Pallets project,
which has multiple projects such as Click and Flask. The following steps
detail how to perform a Jupyter Notebook analysis on the Pallets project.

  1. To start Jupyter from the command line, type jupyter notebook. Then, import the libraries that you will use:

    In [3]: import sys;sys.path.append("..")
       ...: import pandas as pd
       ...: from pandas import DataFrame
       ...: import seaborn as sns
       ...: import matplotlib.pyplot as plt
       ...: from sklearn.cluster import KMeans
       ...: %matplotlib inline
       ...: from IPython.core.display import display, HTML
       ...: display(HTML("<style>.container { width:100% !important; }</style>"))
  2. Next, run the code to download the organization:
    In [4]: from devml import (mkdata, stats, state, fetch_repo, ts)
    In [5]: dest, token, org = state.get_project_metadata("../project/config.json")
    In [6]: fetch_repo.clone_org_repos(token, org,
       ...:         dest, branch="master")
    Out[6]:
    [<git.Repo "/tmp/checkout/flask/.git">,
     <git.Repo "/tmp/checkout/pallets-sphinx-themes/.git">,
     <git.Repo "/tmp/checkout/markupsafe/.git">,
     <git.Repo "/tmp/checkout/jinja/.git">,
     <git.Repo "/tmp/checkout/werkzeug/.git">,
     <git.Repo "/tmp/checkout/itsdangerous/.git">,
     <git.Repo "/tmp/checkout/flask-website/.git">,
     <git.Repo "/tmp/checkout/click/.git">,
     <git.Repo "/tmp/checkout/flask-snippets/.git">,
     <git.Repo "/tmp/checkout/flask-docs/.git">,
     <git.Repo "/tmp/checkout/flask-ext-migrate/.git">,
     <git.Repo "/tmp/checkout/pocoo-sphinx-themes/.git">,
     <git.Repo "/tmp/checkout/website/.git">,
     <git.Repo "/tmp/checkout/meta/.git">]
  3. With the code living on disk, convert it to a pandas DataFrame:
    In [7]: df = mkdata.create_org_df(path="/tmp/checkout")
    In [9]: df.describe()
    Out[9]:
           commits
    count   8315.0
    mean       1.0
    std        0.0
    min        1.0
    25%        1.0
    50%        1.0
    75%        1.0
    max        1.0
  4. Calculate the active days:
    In [10]: df_author_ud = stats.author_unique_active_days(df)
        ...:
    In [11]: df_author_ud.head(10)
    Out[11]:
                  author_name  active_days active_duration  active_ratio
    86         Armin Ronacher          941       3817 days          0.25
    499  Markus Unterwaditzer          238       1767 days          0.13
    216            David Lord           94        710 days          0.13
    663           Ron DuPlain           56        854 days          0.07
    297          Georg Brandl           41       1337 days          0.03
    196     Daniel Neuhäuser           36        435 days          0.08
    169     Christopher Grebs           27       1515 days          0.02
    665    Ronny Pfannschmidt           23       2913 days          0.01
    448      Keyan Pishdadian           21        882 days          0.02
    712           Simon Sapin           21        793 days          0.03
  5. Create a seaborn plot by using sns.barplot to plot the top 10 contributors to the organization by the days that they were active in the project (that is, the days on which they actually checked in code); a sketch of the call appears after this list. It is no surprise that the main author of many of the projects is almost three times more active than any other contributor.

    Figure 1. Seaborn active days plot (bar chart of active days by developer name)
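
The plotting code is not reproduced at this step of the article. A minimal sketch of the kind of sns.barplot call that step 5 describes, assuming the df_author_ud DataFrame from step 4 and the imports from step 1, might look like this:

    # Sketch only: plot the top 10 contributors by active days, using
    # the df_author_ud DataFrame computed in step 4.
    import matplotlib.pyplot as plt
    import seaborn as sns

    top10 = df_author_ud.head(10)
    plt.figure(figsize=(10, 6))
    sns.barplot(y="author_name", x="active_days", data=top10)
    plt.title("Top 10 contributors by active days")
    plt.show()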

You could probably extrapolate similar observations for closed source projects across all of the repositories in a company. “Active days” could be a useful metric to show engagement, and it could be one of several metrics used to measure the effectiveness of teams and projects.
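
devml implements author_unique_active_days for you, so the computation is hidden in step 4 above. Purely as an illustration of the idea, here is a rough pandas sketch of how an active-days count could be derived from raw commit data; the commits DataFrame and its column names are assumptions for the sketch, not devml's internal representation:

    import pandas as pd

    # Hypothetical raw commit data; in practice this comes from the git log.
    commits = pd.DataFrame({
        "author_name": ["alice", "alice", "alice", "bob"],
        "commit_date": pd.to_datetime(
            ["2019-01-01", "2019-01-01", "2019-01-03", "2019-01-02"]),
    })

    # An "active day" is a calendar day with at least one commit,
    # so count the unique dates per author.
    active_days = (commits.assign(day=commits["commit_date"].dt.date)
                          .groupby("author_name")["day"]
                          .nunique()
                          .sort_values(ascending=False))
    print(active_days)  # alice: 2, bob: 1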


CPython project analysis

Next, let’s look at a Jupyter notebook that shows the exploration of the metadata around
the CPython project, the
repository used to develop the Python language.

Relative churn

One of the metrics that is generated is called “relative churn.” (See “Related topics” for an article from Microsoft Research about this metric.) Basically, the relative churn principle states that any increase in relative code churn results in an increase in system defect density. In other words, too many changes to a file result in defects.
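
devml wraps this computation up in the git_churn_df and git_populate_file_metatdata functions used below. As a rough sketch of the underlying idea (not devml's actual implementation), churn can be counted from git log --name-only output and then normalized by each file's current length:

    import os
    import subprocess
    from collections import Counter

    def relative_churn_sketch(repo_path):
        """Sketch: count how often each file appears in the git log,
        then divide by the file's current line count."""
        # With an empty --pretty format, the log output is just the file
        # names touched by each commit, one per line.
        out = subprocess.run(
            ["git", "log", "--name-only", "--pretty=format:"],
            cwd=repo_path, capture_output=True, text=True, check=True).stdout
        churn = Counter(line for line in out.splitlines() if line)
        scores = {}
        for path, count in churn.items():
            try:
                with open(os.path.join(repo_path, path), errors="ignore") as f:
                    line_count = sum(1 for _ in f)
            except OSError:
                continue  # skip files that no longer exist in the checkout
            if line_count:
                scores[path] = count / line_count  # relative churn
        return scores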

  1. As before, import the modules needed for the rest of the exploration:
    In [1]: import sys;sys.path.append("..")
       ...: import pandas as pd
       ...: from pandas import DataFrame
       ...: import seaborn as sns
       ...: import matplotlib.pyplot as plt
       ...: from sklearn.cluster import KMeans
       ...: %matplotlib inline
       ...: from IPython.core.display import display, HTML
       ...: display(HTML("<style>.container { width:100% !important; }</style>"))
  2. Generate churn metrics:
    In [2]: from devml.post_processing import (git_churn_df, file_len, git_populate_file_metatdata)
    In [3]: df = git_churn_df(path="/Users/noahgift/src/cpython")
    2017-10-23 06:51:00,256 - devml.post_processing - INFO - Running churn cmd: [git log --name-only --pretty=format:] at path [/Users/noahgift/src/cpython]
    In [4]: df.head()
    Out[4]:
                                                   files  churn_count
    0                         b'Lib/test/test_struct.py'          178
    1                      b'Lib/test/test_zipimport.py'           78
    2                           b'Misc/NEWS.d/next/Core'          351
    3                                             b'and'          351
    4  b'Builtins/2017-10-13-20-01-47.bpo-31781.cXE9S...            1
  3. A few pandas filters can then be used to find the .py files with the highest relative churn. See the output in Figure 2.

    In [14]: metadata_df = git_populate_file_metatdata(df)
    In [15]: python_files_df = metadata_df[metadata_df.extension == ".py"]
        ...: line_python = python_files_df[python_files_df.line_count> 40]
        ...: line_python.sort_values(by="relative_churn", ascending=False).head(15)
        ...:
    Figure 2. Top relative churn in CPython .py files (table of churn counts)

    One observation from this query is that tests have a lot
    of churn, which might be worth exploring more. Does this mean that
    the tests themselves also contain bugs? That might be interesting
    to explore in more detail. Also, there are a couple of Python
    modules that have extremely high relative churn, such as the
    string.py module. In looking through the source code for that file, it does look very complex for
    its size, and it contains metaclasses. It is possible that the
    complexity has made it prone to bugs. This seems like a module
    worth further data science exploration.

  4. Next, you can run some descriptive statistics to look for the median values across the project. These statistics show that over the couple of decades and more than 100,000 commits of the project's history, the median file is about 146 lines long, has been changed five times, and has a relative churn of 10 percent. This leads to the conclusion that the ideal kind of file to create is small and changes little over the years.

    In [16]: metadata_df.median()
    Out[16]:
    churn_count         5.0
    line_count        146.0
    relative_churn      0.1
    dtype: float64
  5. Generating a seaborn plot for the relative churn makes the patterns even more clear:

    In [18]: import matplotlib.pyplot as plt
        ...: plt.figure(figsize=(10,10))
        ...: python_files_df = metadata_df[metadata_df.extension == ".py"]
        ...: line_python = python_files_df[python_files_df.line_count> 40]
        ...: line_python_sorted = line_python.sort_values(by="relative_churn", ascending=False).head(15)
        ...: sns.barplot(y="files", x="relative_churn",data=line_python_sorted)
        ...: plt.title('Top 15 CPython Absolute and Relative Churn')
        ...: plt.show()

    In Figure 3, the regrtest.py module sticks out quite a bit as the most modified file. Again, it makes sense why it has been changed so much. While it is a small file, typically a regression test can be very complicated. This also might be a hot spot in the code that needs to be looked at.

    Figure 3. Top relative churn in CPython .py files (bar chart with regrtest.py the largest and test_winsound.py the least)

Deleted files

Another area of exploration is to look at files that have been deleted
throughout the history of a project. There are many directions of research
that could be derived from this exploration, such as predicting that a
file would later be deleted (for example, if the relative churn was too
high).

  1. To look at the deleted files, create another function in the post_processing directory (note that it needs the re module imported at the top of the file):
    import re

    FILES_DELETED_CMD = 'git log --diff-filter=D --summary | grep delete'

    def files_deleted_match(output):
        """Retrieves files from subprocess output, i.e.:

        wcase/templates/hello.html delete mode 100644

        Throws away everything but the path to the file.
        """
        files = []
        integers_match_pattern = '^[-+]?[0-9]+$'
        for line in output.split():
            if line == b"delete":
                continue
            elif line == b"mode":
                continue
            elif re.match(integers_match_pattern, line.decode("utf-8")):
                continue
            else:
                files.append(line)
        return files

    This function looks for delete messages in the git log, does some pattern matching, and extracts the files to a list so that a pandas DataFrame can be created.

  2. Next, use the function in a Jupyter notebook:
    In [19]: from devml.post_processing import git_deleted_files
        ...: deletion_counts = git_deleted_files("/Users/noahgift/src/cpython")

    To inspect some of the files that have been deleted, view the last few records:

    In [21]: deletion_counts.tail()
    Out[21]:
                               files     ext
    8812  b'Mac/mwerks/mwerksglue.c'      .c
    8813        b'Modules/version.c'      .c
    8814      b'Modules/Setup.irix5'  .irix5
    8815      b'Modules/Setup.guido'  .guido
    8816      b'Modules/Setup.minix'  .minix
  3. See if there is a pattern that appears with deleted files versus files that are kept. To do that, test each file in the metadata DataFrame for membership in the deleted files DataFrame:

    In [22]: all_files = metadata_df['files']
        ...: deleted_files = deletion_counts['files']
        ...: membership = all_files.isin(deleted_files)
        ...:
    In [23]: metadata_df["deleted_files"] = membership
    In [24]: metadata_df.loc[metadata_df["deleted_files"] == True].median()
    Out[24]:
    churn_count        4.000
    line_count        91.500
    relative_churn     0.145
    deleted_files      1.000
    dtype: float64
    
    In [25]: metadata_df.loc[metadata_df["deleted_files"] == False].median()
    Out[25]:
    churn_count         9.0
    line_count        149.0
    relative_churn      0.1
    deleted_files       0.0
    dtype: float64

    In looking at the median values of the deleted files compared with the files that are still in the repository, you see that there are some differences. Mainly, the relative churn number is higher for the deleted files. Perhaps the files that were problems were the ones that got deleted? It is unknown without more investigation.

  4. Next, create a correlation heatmap in seaborn on this DataFrame:
    In [26]: sns.heatmap(metadata_df.corr(), annot=True)

    Figure 4 shows that there is a correlation, albeit a very small positive one, between relative churn and deleted files. This signal might be included in a machine learning model to predict the likelihood of a file being deleted; a sketch of such a model appears after this list.

    Figure 4. Files deleted correlation heatmap
  5. Next, a final scatterplot shows some differences between deleted files
    and files that have remained in the repository:

    In [27]: sns.lmplot(x="churn_count", y="line_count", hue="deleted_files", data=metadata_df)

    Figure 5 shows three dimensions: line counts, churn counts, and the category of True/False for a deleted file.

    Figure 5. Scatterplot of line counts and churn counts
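
The notebook stops at this visual exploration, but as step 4 suggests, the deleted_files label could feed a machine learning model. Purely as an illustration (this is not part of the original notebook), a scikit-learn sketch over the columns built above might look like this:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Use the churn features built earlier; drop rows with missing values.
    cols = ["churn_count", "line_count", "relative_churn", "deleted_files"]
    data = metadata_df[cols].dropna()

    X = data[["churn_count", "line_count", "relative_churn"]]
    y = data["deleted_files"].astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Given the weak correlation in Figure 4, expect only a modest signal.
    print("test accuracy:", model.score(X_test, y_test))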

Deploying a project to the Python Package Index

With all of the hard work that went into creating a library and command line tool, it makes sense to share the project with other people by submitting it to the Python Package Index. There are only a few steps to do this:

  1. Create an account on https://pypi.python.org/pypi.
  2. Install twine:
    pip install twine
  3. Create a setup.py file.

    The two parts that are the most important
    are the packages section, which ensures that the
    library is installed, and the scripts section. The
    scripts section includes the dml script that we used
    throughout this article.

    import sys
    if sys.version_info < (3,6):
        sys.exit('Sorry, Python < 3.6 is not supported')
    import os
    from setuptools import setup
    from devml import __version__
    if os.path.exists('README.rst'):
        LONG = open('README.rst').read()
    else:
        LONG = ''  # fall back so setup() still works if README.rst is missing
    setup(
        name='devml',
        version=__version__,
        url='https://github.com/noahgift/devml',
        license='MIT',
        author='Noah Gift',
        author_email='consulting@noahgift.com',
        description="""Machine Learning, Statistics and Utilities around Developer Productivity,
            Company Productivity and Project Productivity""",
        long_description=LONG,
        packages=['devml'],
        include_package_data=True,
        zip_safe=False,
        platforms='any',
        install_requires=[
            'pandas',
            'click',
            'PyGithub',
            'gitpython',
            'sensible',
            'scipy',
            'numpy',
        ],
        classifiers=[
            'Development Status :: 4 - Beta',
            'Intended Audience :: Developers',
            'License :: OSI Approved :: MIT License',
            'Programming Language :: Python',
            'Programming Language :: Python :: 3.6',
            'Topic :: Software Development :: Libraries :: Python Modules'
        ],
        scripts=["dml"],
    )

    The scripts directive then installs the dml tool into the path of all users who pip install the module. A sketch of what such an entry script can look like appears after these steps.

  4. Add a deploy step to the Makefile:
    deploy-pypi:
        pandoc --from=markdown --to=rst README.md -o README.rst
        python setup.py check --restructuredtext --strict --metadata
        rm -rf dist
        python setup.py sdist
        twine upload dist/*
        rm -f README.rst
  5. Finally, deploy:
    (.devml) ➜  devml git:(master) ✗ make deploy-pypi
    pandoc --from=markdown --to=rst README.md -o README.rst
    python setup.py check --restructuredtext --strict --metadata
    running check
    rm -rf dist
    python setup.py sdist
    running sdist
    running egg_info
    writing devml.egg-info/PKG-INFO
    writing dependency_links to devml.egg-info/dependency_links.txt
    ....
    running check
    creating devml-0.5.1
    creating devml-0.5.1/devml
    creating devml-0.5.1/devml.egg-info
    copying files to devml-0.5.1...
    ....
    Writing devml-0.5.1/setup.cfg
    creating dist
    Creating tar archive
    removing 'devml-0.5.1' (and everything under it)
    twine upload dist/*
    Uploading distributions to https://upload.pypi.org/legacy/
    Enter your username:
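
The dml script itself is not listed in this article. Because devml depends on Click (see install_requires above), an entry script of this kind generally looks something like the sketch below; the subcommand and option are illustrative, not devml's actual interface:

    #!/usr/bin/env python
    # Sketch of a Click-based entry script like dml; the subcommand and
    # option below are illustrative, not devml's actual command set.
    import click

    @click.group()
    def cli():
        """Statistics and utilities around developer productivity."""

    @cli.command()
    @click.option("--path", default=".", help="Path to a git checkout.")
    def gstats(path):
        """Print author statistics for a repository (illustrative)."""
        click.echo("Would compute stats for: {}".format(path))

    if __name__ == "__main__":
        cli()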

Conclusion

Part 1 of this series shows you how to create a basic data science skeleton and explains its parts. Part 2 provides an in-depth data exploration using Jupyter Notebook, using the code built in Part 1. You also learned how to deploy the project to the Python Package Index.

This article should be a good building block for other data science
developers to study as they build solutions that can be delivered as a
Python library and a command line tool.



Related topics

  • Nagappan, N., and Ball, T. “Use of Relative Code Churn Measures to Predict System Defect Density” (Microsoft Research, ICSE 2005): the relative churn article referenced above.
