codemetrics interface

Getting useful data from your source control management (SCM) tool is a two-step process: first retrieve the log entries (e.g. svn log or git log) as a pandas.DataFrame, then process this output with the functions described below.

The pandas.DataFrame returned by each SCM-specific function contains columns corresponding to the fields of codemetrics.scm.LogEntry:

class codemetrics.scm.LogEntry(revision, author, date, path=None, message=None, kind=None, action=None, textmods=True, propmods=False, copyfromrev=None, copyfrompath=None, added=None, removed=None)[source]

Data structure to hold git or svn data entries.

codemetrics.scm

Common logic for source control management tools.

Factor things common to git and svn.

class codemetrics.scm.ChunkStats(path, chunk, first, last, added, removed)
property added

Alias for field number 4

property chunk

Alias for field number 1

property first

Alias for field number 2

property last

Alias for field number 3

property path

Alias for field number 0

property removed

Alias for field number 5

class codemetrics.scm.DownloadResult(revision, path, content)
property content

Alias for field number 2

property path

Alias for field number 1

property revision

Alias for field number 0
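
The "Alias for field number N" descriptions are the docstrings that collections.namedtuple generates for its fields, which suggests that ChunkStats and DownloadResult behave like named tuples. The sketch below uses stand-in definitions with the same field order (they are not imported from codemetrics, and the sample values are made up) to show how named and positional access line up.

```python
from collections import namedtuple

# Stand-ins mirroring the documented field order; not the codemetrics classes.
ChunkStats = namedtuple("ChunkStats", "path chunk first last added removed")
DownloadResult = namedtuple("DownloadResult", "revision path content")

# Made-up values, for illustration only.
stats = ChunkStats(path="src/app.py", chunk=0, first=10, last=14, added=3, removed=1)
download = DownloadResult(revision="r42", path="src/app.py", content="print('hi')\n")

# Each property is an alias for a tuple position.
added_by_name, added_by_index = stats.added, stats[4]
content_by_name, content_by_index = download.content, download[2]

# Named tuples also unpack like plain tuples.
revision, path, content = download
```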

class codemetrics.scm.LogEntry(revision, author, date, path=None, message=None, kind=None, action=None, textmods=True, propmods=False, copyfromrev=None, copyfrompath=None, added=None, removed=None)[source]

Data structure to hold git or svn data entries.

astuple()[source]

Return the data as tuple.

property changed

Sum of lines added and lines removed.

class codemetrics.scm.Project(cwd=PosixPath('.'))[source]

Stores context information about the SCM tree.

The attributes are initialized to None until the first request to the SCM tool. The values used are cached for subsequent calls so they don’t have to be specified again.

cwd

working directory to run the download_func from. It would typically point to the root of the directory under SCM.

class codemetrics.scm.ScmDownloader(command, client, cwd=None)[source]

Abstract class that defines a common interface for SCM downloaders.

download(revision, path=None)[source]

Download content specific to a revision and path.

Runs checks and forwards the call to _download (template method).

Parameters
  • revision (str) – identify the commit ID

  • path (Optional[str]) – file path. Can be left as None if all files in the commit are to be retrieved.

Return type

DownloadResult
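
The template method arrangement mentioned above can be sketched as follows. This is a minimal illustration, not the codemetrics implementation: the class names DownloaderSketch and InMemoryDownloader are hypothetical, and the single check shown (a non-empty revision) is an assumption about what "runs checks" might involve.

```python
from abc import ABC, abstractmethod
from collections import namedtuple
from typing import Optional

# Stand-in with the documented field order; not imported from codemetrics.
DownloadResult = namedtuple("DownloadResult", "revision path content")


class DownloaderSketch(ABC):
    """Hypothetical stand-in for an SCM downloader base class."""

    def download(self, revision: str, path: Optional[str] = None) -> DownloadResult:
        # Public entry point: validate the input, then delegate to the hook.
        if not revision:
            raise ValueError("revision must be a non-empty string")
        content = self._download(revision, path)
        return DownloadResult(revision, path, content)

    @abstractmethod
    def _download(self, revision: str, path: Optional[str]) -> str:
        """Tool-specific retrieval, implemented by subclasses."""


class InMemoryDownloader(DownloaderSketch):
    """Toy subclass that serves content from a dict instead of calling svn or git."""

    def __init__(self, files):
        self.files = files

    def _download(self, revision, path):
        return self.files[(revision, path)]


downloader = InMemoryDownloader({("r1", "a.py"): "x = 1\n"})
result = downloader.download("r1", "a.py")
```

Subclasses only override the hook, so the shared checks in download run the same way for every SCM tool.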

class codemetrics.scm.ScmLogCollector(cwd=None)[source]

Base class for svn and git.

See get_log functions.

abstract get_log()[source]

Call git log and return the log entries as a DataFrame.

Returns

pandas.DataFrame.

abstract process_log_entries(cmd_output)[source]

Convert output of git log --xml -v to a csv.

Parameters

cmd_output (Sequence[str]) – iterable of string (one for each line).

Yields

tuple of codemetrics.scm.LogEntry.

process_log_output_to_df(cmd_output, after, progress_bar=None)[source]

Factor creation of dataframe from output of command.

Parameters
  • cmd_output (Sequence[str]) – generator returning lines of output from the cmd line.

  • after (datetime) – date for the oldest change to retrieve. Useful when progress_bar is specified. Ignored otherwise.

  • progress_bar (Optional[tqdm]) – progress bar if any. Defaults to self.progress_bar.

Returns

pandas.DataFrame

codemetrics.scm.get_log(project, *args, **kwargs)[source]

Convenience method to give a consistent functional interface.

Other functions (e.g. get_ages) take data frames as input and, when they need information about the project, the project as well. This gives the codemetrics interface a functional look and feel, and this wrapper keeps it that way.

Forwards the call to project.get_log().

Return type

DataFrame

codemetrics.scm.normalize_log(df)[source]

Set dtype and categorize columns in the log DataFrame.

Specifically:
  • Convert date to tz-aware UTC.

  • Replace NaN in author and message with an empty string.

  • Make added and removed numeric (float so we can handle averages).

  • Make textmods and propmods bool (no NA).

  • Make kind and action categories.

codemetrics.scm.parse_diff_as_tuples(download)[source]

Parse download result looking for diff chunks.

Parameters

download (DownloadResult) – Download result.

Yields

statistics, one ChunkStats tuple for each chunk (path, chunk, first, last, added, removed).

Return type

Generator[ChunkStats, None, None]
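
To illustrate the kind of per-chunk statistics involved, here is a minimal sketch that parses a unified diff for a single file using only the standard library. It is not the codemetrics parser: the function name parse_unified_diff is hypothetical, and taking first and last from the new-file range of each hunk header is an assumption.

```python
import re
from collections import namedtuple

# Stand-in with the documented field order; not imported from codemetrics.
ChunkStats = namedtuple("ChunkStats", "path chunk first last added removed")

# Matches hunk headers such as "@@ -10,4 +10,6 @@"; the lengths are optional.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def parse_unified_diff(path, diff_text):
    """Yield one ChunkStats per hunk of a unified diff for a single file."""
    current = None  # [chunk, first, last, added, removed] of the open hunk
    chunk = -1
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            if current is not None:
                yield ChunkStats(path, *current)
            chunk += 1
            first = int(match.group(1))
            length = int(match.group(2)) if match.group(2) is not None else 1
            current = [chunk, first, first + max(length - 1, 0), 0, 0]
        elif current is not None:
            # Inside a hunk: "+" marks an added line, "-" a removed one.
            if line.startswith("+"):
                current[3] += 1
            elif line.startswith("-"):
                current[4] += 1
    if current is not None:
        yield ChunkStats(path, *current)


SAMPLE_DIFF = """\
--- a/hello.py
+++ b/hello.py
@@ -1,3 +1,4 @@
 import sys
-print("hi")
+print("hello")
+print("world")
 sys.exit(0)
@@ -10,2 +11,1 @@
-unused = 1
 done = True
"""

chunks = list(parse_unified_diff("hello.py", SAMPLE_DIFF))
```

The file header lines (--- and +++) come before the first hunk, so they are never counted as removals or additions.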

codemetrics.scm.parse_diff_chunks(download)[source]

Concatenate chunks data returned by parse_diff_as_tuples into a DataFrame.

Return type

DataFrame

codemetrics.scm.to_frame(log_entries)[source]

Convert log entries to a pandas DataFrame.

Parameters

log_entries (Sequence[LogEntry]) – records generated by the SCM log command.

Return type

DataFrame

Returns

Data converted to a DataFrame with categories and type adjustments.

codemetrics.svn

Getting your data from Subversion.

_SvnLogCollector related functions.

class codemetrics.svn.SvnDownloader(command, svn_client=None, cwd=None)[source]

Download files from Subversion.

class codemetrics.svn.SvnProject(cwd=PosixPath('.'), client='svn')[source]

Project for Subversion SCM.

download(data)[source]

Download results from Subversion.

Parameters

data (DataFrame) – pd.DataFrame containing at least revision and path.

Return type

DownloadResult

Returns

list of file contents.

get_log(path='.', after=None, before=None, progress_bar=None, relative_url=None, _pdb=False)[source]

Entry point to retrieve svn log.

Parameters
  • path (str) – location to retrieve the log for.

  • after (Optional[datetime]) – only get the log after time stamp. Defaults to one year ago.

  • before (Optional[datetime]) – only get the log before time stamp. Defaults to now.

  • progress_bar (Optional[tqdm]) – tqdm.tqdm progress bar.

  • relative_url (Optional[str]) – Subversion relative url (e.g. /project/trunk/).

  • _pdb – drop in debugger on parsing errors.

Return type

DataFrame

Returns

pandas.DataFrame with columns matching the fields of codemetrics.scm.LogEntry.

Example:

import datetime
import codemetrics as cm
last_year = datetime.datetime.now() - datetime.timedelta(365)
log_df = cm.svn.get_svn_log(path='src', after=last_year)
codemetrics.svn.get_diff_stats(data, svn_client=None, chunks=None, cwd=None)[source]

Download diff chunks statistics from Subversion.

Parameters
  • data (DataFrame) – revision ID of the change set.

  • svn_client (Optional[str]) – Subversion client executable. Defaults to svn.

  • chunks – if True, return statistics by chunk. Otherwise, return just the added and removed columns for each path. If chunks is None, defaults to True for a data frame and False for a series.

  • cwd (Optional[Path]) – root of the directory under SCM.

Return type

Optional[DataFrame]

Returns

Dataframe containing the statistics for each chunk.

Example:

import pandas as pd
import codemetrics as cm
log = cm.get_svn_log().set_index(['revision', 'path'])
log.loc[:, ['added', 'removed']] = log.reset_index().\
                                    groupby('revision').\
                                    apply(cm.svn.get_diff_stats,
                                    chunks=False)
codemetrics.svn.to_bool(bool_str)[source]

Convert str to bool.

codemetrics.svn.to_date(datestr)[source]

Convert str to datetime.datetime.

The date returned by _SvnLogCollector is UTC according to git-svn man page. Date tzinfo is set to UTC.

added and removed columns are set to np.nan for now.
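
As an illustration of the date conversion, the sketch below parses a timestamp in the format that svn log --xml emits and attaches UTC as tzinfo. The function name svn_date_to_utc is hypothetical and the exact formats accepted by to_date are not stated here, so treat the format string as an assumption.

```python
import datetime as dt


def svn_date_to_utc(datestr):
    """Parse an svn-style timestamp such as '2018-02-24T11:14:11.000000Z'."""
    # The trailing Z is matched literally; the value is then marked as UTC.
    naive = dt.datetime.strptime(datestr, "%Y-%m-%dT%H:%M:%S.%fZ")
    return naive.replace(tzinfo=dt.timezone.utc)


parsed = svn_date_to_utc("2018-02-24T11:14:11.000000Z")
```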

codemetrics.git

Getting your data from git.

Git related functions.

class codemetrics.git.GitProject(cwd=PosixPath('.'), client='git')[source]

Project for git SCM.

download(data)[source]

Download results from Git.

Parameters

data (DataFrame) – pd.DataFrame containing at least revision and path.

Return type

DownloadResult

Returns

list of file contents.

get_log(path='.', after=None, before=None, progress_bar=None, relative_url=None, _pdb=False)[source]

Entry point to retrieve git log.

Parameters
  • path (str) – location of checked out file/directory to get the log for.

  • after (Optional[datetime]) – only get the log after time stamp. Defaults to one year ago.

  • before (Optional[datetime]) – only get the log before time stamp. Defaults to now.

  • progress_bar (Optional[tqdm]) – tqdm.tqdm progress bar.

  • _pdb (bool) – drop in debugger on parsing errors.

Return type

DataFrame

Returns

pandas.DataFrame with columns matching the fields of codemetrics.scm.LogEntry.

Example:

import datetime
import codemetrics as cm
last_year = datetime.datetime.now() - datetime.timedelta(365)
log_df = cm.git.get_git_log(path='src', after=last_year)
codemetrics.git.download(data, client=None, cwd=None)[source]

Downloads files from Git.

Parameters
  • data (DataFrame) – dataframe containing at least the path and revision columns to identify the files to download.

  • client (Optional[str]) – Git client executable. Defaults to git.

  • cwd (Optional[Path]) – working directory, typically the root of the directory under SCM.

Return type

DownloadResult

Returns

list of scm.DownloadResult.

codemetrics.core

The main functions are located in core but can be accessed directly from the main module.

For instance:

>>> import codemetrics as cm
>>> import codemetrics.svn
>>> log_df = cm.svn.get_svn_log()
>>> ages_df = cm.get_ages(log_df)
codemetrics.core.get_ages(data, by=None)[source]

Generate age of each file based on last change.

Takes the output of an SCM log or just the date column and returns the ages.

Parameters
  • data (DataFrame) – log or date column of log.

  • by (Optional[Sequence[str]]) – keys used to group data before calculating the age. See pandas.DataFrame.groupby. Defaults to [‘path’].

Return type

DataFrame

Returns

age of most recent modification as pandas.DataFrame.

Example:

get_ages = codemetrics.get_ages(log_df)
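
The idea behind the age calculation can be sketched without pandas. This is a simplified illustration, not the codemetrics code: the function name ages_by_path is hypothetical, the reference time is passed in explicitly, and expressing the age in days is an assumption.

```python
import datetime as dt


def ages_by_path(entries, now):
    """Return {path: age in days of the most recent change}.

    entries is an iterable of (path, date) pairs taken from an SCM log.
    """
    latest = {}
    for path, date in entries:
        # Keep only the most recent change seen for each path.
        if path not in latest or date > latest[path]:
            latest[path] = date
    return {
        path: (now - date).total_seconds() / 86400
        for path, date in latest.items()
    }


utc = dt.timezone.utc
now = dt.datetime(2024, 1, 31, tzinfo=utc)
entries = [
    ("a.py", dt.datetime(2024, 1, 1, tzinfo=utc)),
    ("a.py", dt.datetime(2024, 1, 21, tzinfo=utc)),
    ("b.py", dt.datetime(2023, 12, 31, tzinfo=utc)),
]
ages = ages_by_path(entries, now)
```

Grouping by something other than the path (the by parameter) would amount to changing the dictionary key.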
codemetrics.core.get_co_changes(log=None, by=None, on=None)[source]

Generate co-changes report.

Returns a DataFrame with the following columns:
  • primary: first path changed.

  • secondary: second path changed.

  • coupling: how often the paths change together.

Parameters
  • log – output log from SCM.

  • by – aggregation level. Defaults to path.

  • on – Field name to join/merge on. Defaults to revision.

Returns

pandas.DataFrame
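
A co-change report can be sketched with the standard library as follows. The function name co_changes is hypothetical, and defining coupling as the share of the primary path's revisions that also touch the secondary path is an assumption; the library may compute it differently.

```python
from collections import Counter, defaultdict
from itertools import permutations


def co_changes(entries):
    """Return sorted (primary, secondary, coupling) tuples.

    entries is an iterable of (revision, path) pairs taken from an SCM log.
    """
    paths_by_revision = defaultdict(set)
    for revision, path in entries:
        paths_by_revision[revision].add(path)

    changes = Counter()   # number of revisions touching each path
    together = Counter()  # number of revisions touching both paths of a pair
    for paths in paths_by_revision.values():
        changes.update(paths)
        together.update(permutations(sorted(paths), 2))

    return sorted(
        (primary, secondary, count / changes[primary])
        for (primary, secondary), count in together.items()
    )


log_entries = [
    ("r1", "a.py"), ("r1", "b.py"),
    ("r2", "a.py"), ("r2", "b.py"),
    ("r3", "a.py"),
    ("r4", "a.py"), ("r4", "c.py"),
]
report = co_changes(log_entries)
```

With this definition the coupling is asymmetric: b.py always changes with a.py, but a.py changes with b.py only half of the time.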

codemetrics.core.get_complexity(group, project)[source]

Generate complexity information for files and revisions in dataframe.

For each pair of (path, revision) in the input dataframe, analyze the code with lizard and return the output.

Parameters
  • group (Union[DataFrame, Series]) – contains at least path and revision values.

  • project (Project) – scm.Project derived class used to retrieve files for specific revision in codemetrics.scm.DownloadResult objects.

Return type

DataFrame

Returns

Dataframe containing output of function-level lizard.analyze

Example:

import codemetrics as cm
log = cm.get_git_log()
log.groupby(['revision', 'path']).\
    apply(get_complexity, download_func=cm.git.download)
codemetrics.core.get_hot_spots(log, loc, by=None, count_one_change_per=None)[source]

Generate hot spots from SCM and loc data.

Cross SCM log and lines of code as an approximation of complexity to determine paths that are complex and change often.

Parameters
  • log – output log from SCM.

  • loc – output from cloc.

  • by – aggregation level: path (default) or another column.

  • count_one_change_per – allows one to count one change by day or one change per JIRA instead of one change by revision.

Returns

pandas.DataFrame
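
Crossing change frequency with size can be sketched as below. The function name hot_spots is hypothetical and the inputs are plain Python structures rather than data frames. The lines and changes keys follow the default column names used by vis_hot_spots; the sort order is an assumption made for readability.

```python
from collections import Counter


def hot_spots(log_entries, loc):
    """Combine change counts and lines of code per path.

    log_entries is an iterable of (revision, path) pairs and loc maps each
    path to its number of lines of code.
    """
    # Deduplicate so that each revision counts as one change per path.
    changes = Counter(path for _, path in set(log_entries))
    rows = [
        {"path": path, "lines": lines, "changes": changes.get(path, 0)}
        for path, lines in loc.items()
    ]
    # Most frequently changed first; ties are broken by size.
    return sorted(rows, key=lambda row: (row["changes"], row["lines"]), reverse=True)


log_entries = [
    ("r1", "core.py"),
    ("r2", "core.py"),
    ("r3", "core.py"),
    ("r3", "util.py"),
]
loc = {"core.py": 900, "util.py": 120, "setup.py": 30}
spots = hot_spots(log_entries, loc)
```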

codemetrics.core.get_mass_changes(log, min_path=None, max_changes_per_path=None)[source]

Extract mass changesets from the SCM log data frame.

Calculate the number of files changed by each revision and extract that list according to the threshold.

Parameters
  • log (DataFrame) – SCM log data is expected to contain at least revision, added, removed, and path columns.

  • min_path (Optional[int]) – threshold for the number of files changed to consider the revision a mass change.

  • max_changes_per_path (Optional[float]) – threshold for the number of changed lines (added + removed) per file that changed.

Return type

DataFrame

Returns

revisions that had more files changed than the threshold as a pd.DataFrame with columns revision, path, changes and changes_per_path.
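
The two thresholds can be illustrated with a small standard-library sketch. The function name mass_changes is hypothetical, and whether the real comparisons are inclusive or strict is not stated here, so the >= and <= used below are assumptions.

```python
from collections import defaultdict


def mass_changes(entries, min_path=None, max_changes_per_path=None):
    """Return one row per revision that passes both thresholds.

    entries is an iterable of (revision, path, added, removed) tuples.
    """
    paths = defaultdict(set)
    changes = defaultdict(int)
    for revision, path, added, removed in entries:
        paths[revision].add(path)
        changes[revision] += added + removed

    rows = []
    for revision, changed_paths in paths.items():
        n_paths = len(changed_paths)
        per_path = changes[revision] / n_paths
        # Keep revisions touching many files...
        if min_path is not None and n_paths < min_path:
            continue
        # ...with few changed lines per file.
        if max_changes_per_path is not None and per_path > max_changes_per_path:
            continue
        rows.append({
            "revision": revision,
            "path": n_paths,
            "changes": changes[revision],
            "changes_per_path": per_path,
        })
    return rows


entries = [
    ("r1", "a.py", 1, 0), ("r1", "b.py", 1, 0), ("r1", "c.py", 1, 0),
    ("r2", "a.py", 50, 10),
    ("r3", "a.py", 100, 0), ("r3", "b.py", 100, 0), ("r3", "c.py", 100, 0),
]
mass = mass_changes(entries, min_path=3, max_changes_per_path=5)
```

Here r1 looks like a mechanical change (three files, one line each), whereas r3 touches as many files but with substantial edits and is filtered out.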

codemetrics.core.guess_components(paths, stop_words=None, n_clusters=8)[source]

Guess components from an iterable of paths.

Parameters
  • paths – list of string containing file paths in the project.

  • stop_words – stop words. Passed to TfidfVectorizer.

  • n_clusters – number of clusters. Passed to MiniBatchKMeans.

Returns

pandas.DataFrame

See also

sklearn.feature_extraction.text.TfidfVectorizer sklearn.cluster.MiniBatchKMeans

codemetrics.vega

Bridges visualization in Jupyter notebooks with Vega and Altair.

codemetrics.vega.build_hierarchy(data, get_parent=<function dirname>, root='', max_iter=100, col_name=None)[source]

Build a hierarchy from a data set and a get_parent relationship.

The output frame adds 2 columns in front: id and parent. Both are numerical where the parent id identifies the id of the parent as returned by the get_parent function.

The id of the root element is set to 0 and the parent is set to np.nan.

Parameters
  • data (DataFrame) – data containing the leaves of the tree.

  • get_parent – function returning the parent of an element.

  • root (str) – expected root of the hierarchy.

  • max_iter (int) – maximum number of iterations.

  • col_name (Optional[str]) – name of the column to use as input (default to column 0).

Return type

DataFrame

Returns

pandas.DataFrame with the columns id, parent and col_name. The parent value identifies the id of the parent in the hierarchy where the id 0 is the root. The columns other than col_name are discarded.
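
The id and parent numbering can be sketched in plain Python. The function name hierarchy_rows is hypothetical, rows are returned as tuples rather than a data frame, None stands in for np.nan, and numbering the non-root nodes in sorted order is an assumption.

```python
import os.path


def hierarchy_rows(paths, get_parent=os.path.dirname, root="", max_iter=100):
    """Return (id, parent_id, path) rows for the leaves and all their ancestors."""
    nodes = set(paths)
    frontier = set(paths)
    for _ in range(max_iter):
        # Walk one level up from the nodes discovered in the previous pass.
        parents = {get_parent(node) for node in frontier if node != root} - nodes
        if not parents:
            break
        nodes |= parents
        frontier = parents
    else:
        raise ValueError("max_iter reached before the hierarchy was complete")
    if root not in nodes:
        raise ValueError("the expected root was never reached")

    # The root gets id 0; the other nodes are numbered in sorted order.
    ordered = [root] + sorted(nodes - {root})
    ids = {node: index for index, node in enumerate(ordered)}
    return [
        (ids[node], None if node == root else ids[get_parent(node)], node)
        for node in ordered
    ]


rows = hierarchy_rows(["src/core/a.py", "src/b.py"])
```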

codemetrics.vega.vis_ages(df, height=300, width=400, colorscheme='greenblue')[source]

Convert get_ages output to a json vega dict.

Parameters
  • df (DataFrame) – input data returned by codemetrics.get_ages()

  • height (int) – vertical size of the figure.

  • width (int) – horizontal size of the figure.

  • colorscheme (str) – color scheme. See https://vega.github.io/vega/docs/schemes/

Return type

dict

Returns

Vega description suitable for use with Altair.

Example:

import codemetrics as cm
from altair.vega.v4 import Vega
ages = cm.get_ages(log_df)
desc = cm.vega.vis_ages(ages)
Vega(desc)  # display the visualization inline in your notebook.
codemetrics.vega.vis_hot_spots(df, height=300, width=400, size_column='lines', color_column='changes', colorscheme='yelloworangered')[source]

Convert get_hot_spots output to a json vega dict.

Parameters
  • df (DataFrame) – input data returned by codemetrics.get_hot_spots()

  • height (int) – vertical size of the figure.

  • width (int) – horizontal size of the figure.

  • size_column (str) – column that drives the size of the circles.

  • color_column (str) – column that drives the color intensity of the circles.

  • colorscheme (str) – color scheme. See https://vega.github.io/vega/docs/schemes/

Return type

dict

Returns

Vega description suitable for use with Altair.

Example:

import codemetrics as cm
from altair.vega.v4 import Vega
hspots = cm.get_hot_spots(log_df, loc_df)
desc = cm.vega.vis_hot_spots(hspots)
Vega(desc)  # display the visualization inline in your notebook.

Command line scripts

cm_func_stats

The codemetrics interface offers a command line tool cm_func_stats to compute statistics on functions.

For now the statistics are limited to the number of lines of code (LOC), the complexity of the function (CCN), and the most frequent tokens together with their span (see https://www.fluentcpp.com/2018/10/23/word-counting-span/ for more information):

>cm_func_stats --help
Usage: cm_func_stats [OPTIONS] FILE_PATH LINE_NO

  Generate statistics on the function specified by FILE_PATH LINE_NO.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

And for an example:

>cm_func_stats codemetrics\cmdline.py 42
codemetrics\cmdline.py(39): cm_func_stats@39-55@codemetrics\cmdline.py, NLOC: 16, CCN: 4
codemetrics\cmdline.py(43): f occurs 7 time(s), spans 13 lines (76.47%)
codemetrics\cmdline.py(41): func_info occurs 4 time(s), spans 9 lines (52.94%)
codemetrics\cmdline.py(49): token occurs 3 time(s), spans 4 lines (23.53%)
codemetrics\cmdline.py(45): write occurs 2 time(s), spans 9 lines (52.94%)
codemetrics\cmdline.py(45): sys occurs 2 time(s), spans 9 lines (52.94%)
codemetrics\cmdline.py(45): stdout occurs 2 time(s), spans 9 lines (52.94%)
codemetrics\cmdline.py(48): span occurs 2 time(s), spans 5 lines (29.41%)
codemetrics\cmdline.py(48): func_span occurs 2 time(s), spans 5 lines (29.41%)
codemetrics\cmdline.py(43): msg occurs 2 time(s), spans 2 lines (11.76%)
codemetrics\cmdline.py(39): line_no occurs 2 time(s), spans 3 lines (17.65%)
codemetrics\cmdline.py(39): file_path occurs 2 time(s), spans 3 lines (17.65%)
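
The occurrence and span figures in the output above can be approximated with a short sketch. In the sample output, a token spanning 13 lines of a 17-line function is reported as 76.47%, which is consistent with span = last line - first line + 1 divided by the function length. The function name token_spans is hypothetical, and the naive regular-expression tokenizer (which also counts keywords) is an assumption that differs from whatever tokenizer cm_func_stats uses.

```python
import re
from collections import defaultdict


def token_spans(source):
    """Return {token: (occurrences, span in lines, span as % of all lines)}."""
    lines = source.splitlines()
    occurrences = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        # Naive tokenizer: identifiers and keywords alike.
        for token in re.findall(r"[A-Za-z_]\w*", line):
            occurrences[token].append(line_no)

    stats = {}
    for token, line_nos in occurrences.items():
        # Span counts the lines from the first to the last occurrence, inclusive.
        span = line_nos[-1] - line_nos[0] + 1
        stats[token] = (len(line_nos), span, 100 * span / len(lines))
    return stats


SNIPPET = """\
def total(values):
    result = 0
    for value in values:
        result += value
    return result
"""

spans = token_spans(SNIPPET)
```

A token with a wide span, such as result here, has to be kept in mind over most of the function, which is what the statistic is meant to surface.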