API

Getting useful data from your source control management tool is a two-step process: first, get the log entries (e.g. svn log or git log) as a pandas.DataFrame, then process that frame with the functions described below.

The pandas.DataFrame returned by each SCM-specific function contains columns corresponding to the fields of codemetrics.scm.LogEntry:

class codemetrics.scm.LogEntry(revision, author, date, path, message, kind, action=None, textmods=True, propmods=False, copyfromrev=None, copyfrompath=None, added=None, removed=None)[source]

Data structure to hold git or svn data entries.

codemetrics.scm

Common logic for source control management tools.

Factor things common to git and svn.

class codemetrics.scm.ChunkStats(path, chunk, first, last, added, removed)
added

Alias for field number 4

chunk

Alias for field number 1

first

Alias for field number 2

last

Alias for field number 3

path

Alias for field number 0

removed

Alias for field number 5
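
The "Alias for field number" entries indicate that ChunkStats behaves like a named tuple, so each field can be read by name or by position. A minimal sketch, using a stand-in tuple with the documented field order rather than importing codemetrics (the sample values are made up):

```python
from collections import namedtuple

# Stand-in with the documented field order; the real class lives in codemetrics.scm.
ChunkStats = namedtuple(
    "ChunkStats", ["path", "chunk", "first", "last", "added", "removed"]
)

stats = ChunkStats(path="src/main.py", chunk=0, first=10, last=14, added=3, removed=1)

# Access by name and access by position refer to the same value.
print(stats.added, stats[4])

# Named tuples unpack like ordinary tuples.
path, chunk, first, last, added, removed = stats
print(path, added + removed)
```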

class codemetrics.scm.DownloadResult(revision, path, content)
content

Alias for field number 2

path

Alias for field number 1

revision

Alias for field number 0

class codemetrics.scm.LogEntry(revision, author, date, path, message, kind, action=None, textmods=True, propmods=False, copyfromrev=None, copyfrompath=None, added=None, removed=None)[source]

Data structure to hold git or svn data entries.

astuple()[source]

Return the data as tuple.

changed

Sum of lines added and lines removed.
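
As an illustration of the two members above, the sketch below uses a hypothetical, trimmed-down stand-in (EntrySketch is not the library class and keeps only four of the documented fields). How the real LogEntry treats missing line counts is not stated here, so returning None in that case is an assumption.

```python
from dataclasses import dataclass, astuple
from typing import Optional


@dataclass
class EntrySketch:
    """Hypothetical, trimmed-down stand-in for LogEntry (not the library class)."""

    revision: str
    path: str
    added: Optional[int] = None
    removed: Optional[int] = None

    @property
    def changed(self) -> Optional[int]:
        # Assumption: without line counts there is nothing to sum.
        if self.added is None or self.removed is None:
            return None
        return self.added + self.removed


entry = EntrySketch(revision="r42", path="src/main.py", added=7, removed=2)
print(entry.changed)   # sum of added and removed
print(astuple(entry))  # plain tuple of the fields, in declaration order
```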

class codemetrics.scm.ScmDownloader(command, client)[source]

Abstract class that defines a common interface for SCM downloaders.

download(revision, path)[source]

Download content specific to a revision and path.

Runs checks and forwards the call to _download (template method).

Parameters:
  • revision (str) – identify the commit ID
  • path (Union[None, str]) – file path. Can be left as None if all files in the commit are to be retrieved.
Return type:

DownloadResult
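
The template-method arrangement described above can be sketched as follows. DownloaderSketch and InMemoryDownloader are hypothetical names, and the single check shown (a non-empty revision) is an assumption; the checks the library actually runs are not documented here.

```python
from abc import ABC, abstractmethod
from collections import namedtuple
from typing import Optional

# Stand-in with the documented field order.
DownloadResult = namedtuple("DownloadResult", ["revision", "path", "content"])


class DownloaderSketch(ABC):
    """Hypothetical stand-in showing the template-method shape."""

    def download(self, revision: str, path: Optional[str] = None) -> DownloadResult:
        # Public entry point: validate the input, then delegate to the hook.
        if not revision:
            raise ValueError("revision must be a non-empty string")
        content = self._download(str(revision), path)
        return DownloadResult(str(revision), path, content)

    @abstractmethod
    def _download(self, revision: str, path: Optional[str]) -> str:
        """Subclasses fetch the content for one revision and path."""


class InMemoryDownloader(DownloaderSketch):
    """Toy subclass that reads from a dict instead of calling git or svn."""

    def __init__(self, files):
        self.files = files

    def _download(self, revision, path):
        return self.files[(revision, path)]


downloader = InMemoryDownloader({("abc123", "README.md"): "hello\n"})
result = downloader.download("abc123", "README.md")
print(result.path, repr(result.content))
```

Keeping the checks in download means every subclass gets them for free and only has to implement the SCM-specific retrieval.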

codemetrics.scm.parse_diff_as_tuples(download)[source]

Parse download result looking for diff chunks.

Parameters:
  • download (DownloadResult) – Download result.
Yields:

statistics, one ChunkStats tuple for each chunk (path, chunk, first, last, added, removed).

Return type:

Generator[ChunkStats, None, None]

codemetrics.scm.parse_diff_chunks(download)[source]

Concatenate chunks data returned by parse_diff_as_tuples into a frame.

Return type:DataFrame
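
To give an idea of what per-chunk statistics look like, here is a minimal sketch that walks a unified diff for a single file and yields one tuple per hunk. It is not the library's parser: parse_hunks is a hypothetical helper, and the meaning given to first and last (first and last line of the hunk in the new file) is an assumption.

```python
import re
from collections import namedtuple

ChunkStats = namedtuple(
    "ChunkStats", ["path", "chunk", "first", "last", "added", "removed"]
)

# Matches hunk headers such as "@@ -10,2 +11,2 @@".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def parse_hunks(path, diff_text):
    """Yield one ChunkStats per hunk of a unified diff for a single file."""
    chunk = -1
    first = last = added = removed = 0
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            if chunk >= 0:
                yield ChunkStats(path, chunk, first, last, added, removed)
            chunk += 1
            first = int(match.group(1))
            length = int(match.group(2)) if match.group(2) is not None else 1
            last = first + max(length - 1, 0)
            added = removed = 0
        elif chunk >= 0:
            # File headers (---/+++) come before the first hunk, so they are skipped.
            if line.startswith("+"):
                added += 1
            elif line.startswith("-"):
                removed += 1
    if chunk >= 0:
        yield ChunkStats(path, chunk, first, last, added, removed)


DIFF = """\
--- a/hello.py
+++ b/hello.py
@@ -1,3 +1,4 @@
 import sys
+import os
 def main():
     pass
@@ -10,2 +11,2 @@
-    print("hi")
+    print("hello")
     return 0
"""

chunks = list(parse_hunks("hello.py", DIFF))
for stats in chunks:
    print(stats)
```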

codemetrics.svn

Getting your data from Subversion.

_SvnLogCollector related functions.

class codemetrics.svn.SvnDownloader(command, svn_client='svn')[source]

Download files from Subversion.

codemetrics.svn.download(data, svn_client='svn')[source]

Download results from Subversion.

Parameters:
  • data (Union[DataFrame, Series]) – pd.Series containing at least revision and path.
  • svn_client (str) – Subversion client executable. Defaults to svn.
Return type:

DownloadResult

Returns:

list of file contents.

codemetrics.svn.get_diff_stats(data, svn_client='svn', chunks=None)[source]

Download diff chunks statistics from Subversion.

Parameters:
  • data (DataFrame) – revision ID of the change set.
  • svn_client (str) – Subversion client executable. Defaults to svn.
  • chunks – if True, return statistics by chunk. Otherwise, return just the added and removed columns for each path. If chunks is None, defaults to true for a data frame and false for a series.
Return type:

Union[None, DataFrame]

Returns:

Dataframe containing the statistics for each chunk.

Example:

import pandas as pd
import codemetrics as cm
log = cm.get_svn_log().set_index(['revision', 'path'])
log.loc[:, ['added', 'removed']] = (
    log.reset_index()
    .groupby('revision')
    .apply(cm.svn.get_diff_stats, chunks=False)
)

codemetrics.svn.get_svn_log(path='.', after=None, before=None, progress_bar=None, svn_client='svn')[source]

Entry point to retrieve svn log.

Parameters:
  • path (str) – location of checked out subversion repository root.
  • after (Optional[datetime]) – only get the log after time stamp. Defaults to one year ago.
  • before (Optional[datetime]) – only get the log before time stamp. Defaults to now.
  • progress_bar (Optional[tqdm]) – tqdm.tqdm progress bar.
  • svn_client (str) – Subversion client executable. Defaults to svn.
Return type:

DataFrame

Returns:

pandas.DataFrame with columns matching the fields of codemetrics.scm.LogEntry.

Example:

import datetime
import codemetrics as cm
last_year = datetime.datetime.now() - datetime.timedelta(365)
log_df = cm.svn.get_svn_log(path='src', after=last_year)

codemetrics.svn.to_bool(bool_str)[source]

Convert str to bool.

codemetrics.svn.to_date(datestr)[source]

Convert str to datetime.datetime.

The date returned by _SvnLogCollector is UTC according to git-svn man page. Date tzinfo is set to UTC.

added and removed columns are set to np.nan for now.
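
The two converters above can be pictured with the sketch below. It is an approximation under stated assumptions, not the library code: the helper names are hypothetical, the boolean strings are assumed to be "true" and "false", and the timestamp format is assumed to be the ISO-like form with microseconds and a trailing Z.

```python
from datetime import datetime, timezone


def to_bool_sketch(bool_str: str) -> bool:
    """Map the strings 'true' / 'false' to a bool (case-insensitive)."""
    normalized = bool_str.strip().lower()
    if normalized not in ("true", "false"):
        raise ValueError(f"not a boolean string: {bool_str!r}")
    return normalized == "true"


def to_date_sketch(datestr: str) -> datetime:
    """Parse a timestamp such as 2018-02-24T11:14:11.000000Z as UTC."""
    parsed = datetime.strptime(datestr, "%Y-%m-%dT%H:%M:%S.%fZ")
    # The parsed value is naive; attach UTC explicitly.
    return parsed.replace(tzinfo=timezone.utc)


when = to_date_sketch("2018-02-24T11:14:11.000000Z")
print(when.isoformat())
print(to_bool_sketch("true"), to_bool_sketch("false"))
```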

codemetrics.git

Getting your data from git.

Git related functions.

codemetrics.git.download(data, git_client='git')[source]

Download files from git.

Parameters:
  • data (DataFrame) – dataframe containing at least the path and revision columns to identify the files to download.
  • git_client (str) – git client executable. Defaults to git.
Return type:

DownloadResult

Returns:

list of scm.DownloadResult.

codemetrics.git.get_git_log(path='.', after=None, before=None, progress_bar=None, git_client='git', _pdb=False)[source]

Entry point to retrieve git log.

Parameters:
  • path (str) – location of checked out git repository root. Defaults to .
  • after (Optional[datetime]) – only get the log after time stamp. Defaults to one year ago.
  • before (Optional[datetime]) – only get the log before time stamp. Defaults to now.
  • git_client (str) – git client executable (defaults to git).
  • progress_bar (Optional[tqdm]) – tqdm.tqdm progress bar.
  • _pdb (bool) – drop in debugger on parsing errors.
Return type:

DataFrame

Returns:

pandas.DataFrame with columns matching the fields of codemetrics.scm.LogEntry.

Example:

import datetime
import codemetrics as cm
last_year = datetime.datetime.now() - datetime.timedelta(365)
log_df = cm.git.get_git_log(path='src', after=last_year)

codemetrics.core

The main functions are located in core but can be accessed directly from the main module.

For instance:

>>> import codemetrics as cm
>>> import codemetrics.svn
>>> log_df = cm.svn.get_svn_log()
>>> ages_df = cm.get_ages(log_df)

codemetrics.core.get_mass_changes(log, min_path=None, max_changes_per_path=None)[source]

Extract mass changesets from the SCM log data frame.

Calculate the number of files changed by each revision and extract that list according to the threshold.

Parameters:
  • log (DataFrame) – SCM log data is expected to contain at least revision, added, removed, and path columns.
  • min_path (Optional[int]) – threshold for the number of files changed to consider the revision a mass change.
  • max_changes_per_path (Optional[float]) – threshold for the number of changed lines (added + removed) per file that changed.
Return type:

DataFrame

Returns:

revisions that had more files changed than the threshold.
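
The thresholding described above can be sketched in plain Python on a toy log. This is an illustration only: mass_changes_sketch is a hypothetical helper, and both the inclusive comparison on min_path and the way max_changes_per_path is applied (average changed lines per file must not exceed it) are assumptions.

```python
from collections import defaultdict

# Toy log rows: (revision, path, added, removed).
rows = [
    ("r1", "a.py", 1, 1), ("r1", "b.py", 1, 0), ("r1", "c.py", 0, 1), ("r1", "d.py", 1, 1),
    ("r2", "a.py", 40, 12),
    ("r3", "b.py", 30, 5), ("r3", "c.py", 25, 10),
]


def mass_changes_sketch(rows, min_path=None, max_changes_per_path=None):
    """Return {revision: (files changed, changed lines per file)} after filtering."""
    paths = defaultdict(set)
    changed = defaultdict(int)
    for revision, path, added, removed in rows:
        paths[revision].add(path)
        changed[revision] += added + removed
    result = {}
    for revision, touched in paths.items():
        per_path = changed[revision] / len(touched)
        # Assumption: the file-count threshold is inclusive.
        if min_path is not None and len(touched) < min_path:
            continue
        # Assumption: revisions with many lines changed per file are not mass changes.
        if max_changes_per_path is not None and per_path > max_changes_per_path:
            continue
        result[revision] = (len(touched), per_path)
    return result


print(mass_changes_sketch(rows, min_path=2))
print(mass_changes_sketch(rows, min_path=2, max_changes_per_path=5.0))
```

The second call keeps only revisions that touch several files with few lines changed in each, which is the typical signature of a bulk rename or reformatting commit.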

codemetrics.core.get_ages(data, by=None)[source]

Generate age of each file based on last change.

Takes the output of an SCM log, or just its date column, and returns the ages.

Parameters:
  • data (DataFrame) – log or date column of log.
  • by (Optional[Sequence[str]]) – keys used to group data before calculating the age. See pandas.DataFrame.groupby. Defaults to ['path'].
Return type:

DataFrame

Returns:

age of most recent modification as pandas.DataFrame.

Example:

ages = codemetrics.get_ages(log_df)

codemetrics.core.get_hot_spots(log, loc, by=None, count_one_change_per=None)[source]

Generate hot spots from SCM and loc data.

Cross SCM log and lines of code as an approximation of complexity to determine paths that are complex and change often.

Parameters:
  • log – output log from SCM.
  • loc – output from cloc.
  • by – aggregation level: path (default) or another column.
  • count_one_change_per – allows one to count one change by day or one change per JIRA instead of one change by revision.
Returns:

pandas.DataFrame
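
The idea of crossing change frequency with size can be sketched on toy data as below. hot_spots_sketch is a hypothetical helper working on plain tuples and a dict rather than data frames; the output keys lines and changes mirror the default column names used by vis_hot_spots, and counting one change per (revision, path) pair is an assumption.

```python
from collections import Counter

# Toy SCM log rows: (revision, path).
log_rows = [
    ("r1", "a.py"), ("r2", "a.py"), ("r3", "a.py"),
    ("r3", "b.py"),
    ("r4", "c.py"),
]
# Toy lines-of-code figures per path, as a tool like cloc would report.
loc = {"a.py": 500, "b.py": 40, "c.py": 1200}


def hot_spots_sketch(log_rows, loc):
    """Return one record per path with its size (lines) and change count."""
    # Deduplicate so that a path counts once per revision.
    changes = Counter(path for _, path in set(log_rows))
    return [
        {"path": path, "lines": lines, "changes": changes.get(path, 0)}
        for path, lines in sorted(loc.items())
    ]


for record in hot_spots_sketch(log_rows, loc):
    print(record)
```

Paths that are both large and frequently changed (a.py in this toy data) are the candidates worth a closer look.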

codemetrics.core.get_co_changes(log=None, by=None, on=None)[source]

Generate co-changes report.

Returns a DataFrame with the following columns:
  • primary: first path changed.
  • secondary: second path changed.
  • coupling: how often the paths change together.

Parameters:
  • log – output log from SCM.
  • by – aggregation level. Defaults to path.
  • on – Field name to join/merge on. Defaults to revision.
Returns:

pandas.DataFrame
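
A co-change report of this shape can be sketched in plain Python as follows. co_changes_sketch is a hypothetical helper, and the coupling formula used here (number of revisions touching both paths divided by the number of revisions touching the primary path) is an assumption about what "how often" means.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Toy SCM log rows: (revision, path).
log_rows = [
    ("r1", "a.py"), ("r1", "b.py"),
    ("r2", "a.py"), ("r2", "b.py"), ("r2", "c.py"),
    ("r3", "a.py"),
    ("r4", "c.py"),
]


def co_changes_sketch(log_rows):
    """Return records with primary, secondary, cochanges and coupling."""
    by_revision = defaultdict(set)
    for revision, path in log_rows:
        by_revision[revision].add(path)
    changes = Counter()   # revisions per path
    together = Counter()  # revisions per ordered pair of paths
    for paths in by_revision.values():
        changes.update(paths)
        together.update(permutations(sorted(paths), 2))
    return [
        {"primary": p, "secondary": s, "cochanges": n, "coupling": n / changes[p]}
        for (p, s), n in sorted(together.items())
    ]


report = co_changes_sketch(log_rows)
for record in report:
    print(record)
```

With this definition the measure is asymmetric: b.py always changes together with a.py, while a.py changes with b.py in only two of its three revisions.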

codemetrics.core.guess_components(paths, stop_words=None, n_clusters=8)[source]

Guess components from an iterable of paths.

Parameters:
  • paths – list of strings containing file paths in the project.
  • stop_words – stop words. Passed to TfidfVectorizer.
  • n_clusters – number of clusters. Passed to MiniBatchKMeans.
Returns:

pandas.DataFrame

See also

sklearn.feature_extraction.text.TfidfVectorizer
sklearn.cluster.MiniBatchKMeans

codemetrics.core.get_complexity(group, download_func)[source]

Generate complexity information for files and revisions in dataframe.

For each pair of (path, revision) in the input dataframe, analyze the code with lizard and return the output.

Parameters:
  • group (Union[DataFrame, Series]) – contains at least path and revision values.
  • download_func (Callable) – callable that downloads a path at a given revision into a temporary directory and returns that file in an object of type codemetrics.scm.DownloadResult.
Return type:

DataFrame

Returns:

Dataframe containing the function-level output of lizard.analyze.

Example:

import codemetrics as cm
log = cm.get_git_log()
log.groupby(['revision', 'path']).apply(cm.get_complexity, download_func=cm.git.download)

codemetrics.vega

Bridges visualization in Jupyter notebooks with Vega and Altair.

codemetrics.vega.build_hierarchy(data, get_parent=<function dirname>, root='', max_iter=100, col_name=None)[source]

Build a hierarchy from a data set and a get_parent relationship.

The output frame adds 2 columns in front: id and parent. Both are numerical where the parent id identifies the id of the parent as returned by the get_parent function.

The id of the root element is set to 0 and the parent is set to np.nan.

Parameters:
  • data (DataFrame) – data containing the leaves of the tree.
  • get_parent – function returning the parent of an element.
  • root (str) – expected root of the hierarchy.
  • max_iter (int) – maximum number of iterations.
  • col_name (Optional[str]) – name of the column to use as input (defaults to column 0).
Return type:

DataFrame

Returns:

pandas.DataFrame with the columns id, parent and col_name. The parent value identifies the id of the parent in the hierarchy where the id 0 is the root. The columns other than col_name are discarded.
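
The construction described above can be sketched in plain Python on a list of paths. build_hierarchy_sketch is a hypothetical helper that returns tuples instead of a data frame, uses None where the documentation mentions np.nan for the root's parent, and assigns ids in sorted order, which is an arbitrary choice made for this illustration.

```python
import posixpath


def build_hierarchy_sketch(paths, get_parent=posixpath.dirname, root="", max_iter=100):
    """Return rows of (id, parent, path); the root gets id 0 and parent None."""
    known = set(paths)
    frontier = set(paths)
    for _ in range(max_iter):
        # Add the parents that have not been seen yet, one level per iteration.
        parents = {get_parent(p) for p in frontier} - known
        if not parents:
            break
        known |= parents
        frontier = parents
    else:
        raise ValueError("max_iter reached before the hierarchy closed")
    known.add(root)
    ordered = [root] + sorted(known - {root})
    ids = {path: i for i, path in enumerate(ordered)}
    return [
        (ids[p], None if p == root else ids[get_parent(p)], p)
        for p in ordered
    ]


leaves = ["src/core/a.py", "src/core/b.py", "src/util.py", "README.md"]
hierarchy = build_hierarchy_sketch(leaves)
for row in hierarchy:
    print(row)
```

Each iteration climbs one level, so max_iter bounds the depth of the tree and guards against a get_parent function that never converges to the root.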

codemetrics.vega.vis_ages(df, height=300, width=400, colorscheme='greenblue')[source]

Convert get_ages output to a json vega dict.

Parameters:
  • df (DataFrame) – input data returned by codemetrics.get_ages()
  • height (int) – vertical size of the figure.
  • width (int) – horizontal size of the figure.
  • colorscheme (str) – color scheme. See https://vega.github.io/vega/docs/schemes/
Return type:

dict

Returns:

Vega description suitable for use with Altair.

Example:

import codemetrics as cm
from altair.vega.v4 import Vega
ages = cm.get_ages(log_df)
desc = cm.vega.vis_ages(ages)
Vega(desc)  # display the visualization inline in your notebook.

codemetrics.vega.vis_hot_spots(df, height=300, width=400, size_column='lines', color_column='changes', colorscheme='yelloworangered')[source]

Convert get_hot_spots output to a json vega dict.

Parameters:
  • df (DataFrame) – input data returned by codemetrics.get_hot_spots()
  • height (int) – vertical size of the figure.
  • width (int) – horizontal size of the figure.
  • size_column (str) – column that drives the size of the circles.
  • color_column (str) – column that drives the color intensity of the circles.
  • colorscheme (str) – color scheme. See https://vega.github.io/vega/docs/schemes/
Return type:

dict

Returns:

Vega description suitable for use with Altair.

Example:

import codemetrics as cm
from altair.vega.v4 import Vega
hspots = cm.get_hot_spots(log_df, loc_df)
desc = cm.vega.vis_hot_spots(hspots)
Vega(desc)  # display the visualization inline in your notebook.