codemetrics interface¶
Getting useful data from your source control management (SCM) tool is a two-step process: first retrieve the log entries (e.g. svn log or git log) as a pandas.DataFrame, then process this output with the functions described below.
The pandas.DataFrame returned by each SCM-specific function contains columns corresponding to the fields of codemetrics.scm.LogEntry:
- class codemetrics.scm.LogEntry(revision, author, date, path=None, message=None, kind=None, action=None, textmods=True, propmods=False, copyfromrev=None, copyfrompath=None, added=None, removed=None)[source]¶
Data structure to hold git or svn data entries.
codemetrics.scm¶
Common logic for source control management tools.
Factor things common to git and svn.
- class codemetrics.scm.ChunkStats(path, chunk, first, last, added, removed)¶
- property added¶
Alias for field number 4
- property chunk¶
Alias for field number 1
- property first¶
Alias for field number 2
- property last¶
Alias for field number 3
- property path¶
Alias for field number 0
- property removed¶
Alias for field number 5
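The "Alias for field number N" entries suggest that ChunkStats behaves like a named tuple, so each field can be read by name or by position. The snippet below is a minimal sketch using a locally defined stand-in with the documented field order; it does not import codemetrics.

```python
from collections import namedtuple

# Local stand-in mirroring the documented field order (not the codemetrics class).
ChunkStats = namedtuple(
    "ChunkStats", ["path", "chunk", "first", "last", "added", "removed"]
)

stats = ChunkStats(path="src/app.py", chunk=0, first=10, last=18, added=5, removed=2)

# Access by name and access by position return the same value.
same_value = stats.added == stats[4]

# Tuple unpacking follows the field order as well.
path, chunk, first, last, added, removed = stats
changed = added + removed
```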
- class codemetrics.scm.DownloadResult(revision, path, content)¶
- property content¶
Alias for field number 2
- property path¶
Alias for field number 1
- property revision¶
Alias for field number 0
- class codemetrics.scm.LogEntry(revision, author, date, path=None, message=None, kind=None, action=None, textmods=True, propmods=False, copyfromrev=None, copyfrompath=None, added=None, removed=None)[source]¶
Data structure to hold git or svn data entries.
- property changed¶
Sum of lines added and lines removed.
- class codemetrics.scm.Project(cwd=PosixPath('.'))[source]¶
Stores context information about the SCM tree.
The attributes are initialized to None until the first request to the SCM tool. The values used are then cached for subsequent calls so they don’t have to be specified again.
- cwd¶
working directory to run the download_func from. It would typically point to the root of the directory under SCM.
- class codemetrics.scm.ScmDownloader(command, client, cwd=None)[source]¶
Abstract class that defines a common interface for SCM downloaders.
- download(revision, path=None)[source]¶
Download content specific to a revision and path.
Runs checks and forwards the call to _download (template method).
- Parameters
revision (str) – identifies the commit ID.
path (Optional[str]) – file path. Can be left as None if all files in the commit are to be retrieved.
- Return type
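The template method mentioned above can be sketched as follows. The class names, the non-empty revision check and the canned content are all assumptions made for illustration; the checks actually performed by ScmDownloader are not described in this page.

```python
from abc import ABC, abstractmethod
from typing import NamedTuple, Optional


class DownloadResult(NamedTuple):
    """Local stand-in with the documented fields (revision, path, content)."""
    revision: str
    path: Optional[str]
    content: str


class BaseDownloader(ABC):
    """Hypothetical base class: validates input, then delegates to _download."""

    def download(self, revision: str, path: Optional[str] = None) -> DownloadResult:
        # Assumed check, for illustration only.
        if not revision:
            raise ValueError("revision must be a non-empty string")
        return self._download(revision, path)

    @abstractmethod
    def _download(self, revision: str, path: Optional[str]) -> DownloadResult:
        """SCM-specific part, implemented by each subclass."""


class FakeDownloader(BaseDownloader):
    """Illustrative subclass returning canned content instead of calling an SCM tool."""

    def _download(self, revision, path):
        return DownloadResult(revision, path, f"content of {path} at {revision}")


result = FakeDownloader().download("abc123", "README.md")
```

Callers only ever use download, so the shared checks run regardless of which SCM-specific subclass is in use.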
- class codemetrics.scm.ScmLogCollector(cwd=None)[source]¶
Base class for svn and git.
See get_log functions.
- abstract get_log()[source]¶
Call git log and return the log entries as a DataFrame.
- Returns
pandas.DataFrame.
- abstract process_log_entries(cmd_output)[source]¶
Convert output of git log --xml -v to a csv.
- Parameters
cmd_output (Sequence[str]) – iterable of strings (one for each line).
- Yields
tuple of codemetrics.scm.LogEntry.
- process_log_output_to_df(cmd_output, after, progress_bar=None)[source]¶
Factor creation of dataframe from output of command.
- Parameters
cmd_output (Sequence[str]) – generator returning lines of output from the command line.
after (datetime) – date of the oldest change to retrieve. Useful when progress_bar is specified. Ignored otherwise.
progress_bar (Optional[tqdm]) – progress bar, if any. Defaults to self.progress_bar.
- Returns
pandas.DataFrame
- codemetrics.scm.get_log(project, *args, **kwargs)[source]¶
Convenience method to give a consistent functional interface.
Other functions (e.g. get_ages) take data frames as input and, when they need information about the project, the project as well. This gives the codemetrics interface a functional look and feel, which this wrapper preserves.
Forwards the call to project.get_log().
- Return type
DataFrame
- codemetrics.scm.normalize_log(df)[source]¶
Set dtype and categorize columns in the log DataFrame.
- Specifically:
Converts date to tz-aware UTC.
Replaces NaN in author and message with an empty string.
Makes added and removed numeric (float, so averages can be handled).
Makes textmods and propmods bool (no NA).
Makes kind and action categories.
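To make the normalization steps concrete, here is a pandas-free sketch applied to a single record held in a dict. The function name normalize_record and the handling of naive dates (assumed to already be in UTC) are assumptions of this sketch; the categorical conversion of kind and action is specific to pandas and is left out.

```python
import math
from datetime import datetime, timezone


def normalize_record(record: dict) -> dict:
    """Illustrative, pandas-free version of the steps listed above."""
    out = dict(record)
    date = out["date"]
    # Assumption: naive datetimes are already expressed in UTC.
    if date.tzinfo is None:
        out["date"] = date.replace(tzinfo=timezone.utc)
    else:
        out["date"] = date.astimezone(timezone.utc)
    # Missing author or message becomes an empty string.
    for column in ("author", "message"):
        value = out.get(column)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            out[column] = ""
    # Line counts become floats; missing values stay NaN.
    for column in ("added", "removed"):
        value = out.get(column)
        out[column] = float("nan") if value is None else float(value)
    # Flags become plain booleans.
    for column in ("textmods", "propmods"):
        out[column] = bool(out.get(column, False))
    return out


row = normalize_record({
    "revision": "1018",
    "author": float("nan"),
    "message": None,
    "date": datetime(2018, 2, 24, 11, 14, 11),
    "added": "3",
    "removed": None,
    "textmods": 1,
    "propmods": 0,
})
```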
- codemetrics.scm.parse_diff_as_tuples(download)[source]¶
Parse download result looking for diff chunks.
- Parameters
download (DownloadResult) – download result.
- Yields
statistics, one tuple for each chunk (begin, end, added, removed).
- Return type
Generator[ChunkStats, None, None]
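As an illustration of per-chunk statistics, the sketch below walks a unified diff and yields one tuple per hunk. It is not the codemetrics parser: the helper name parse_hunks, the regular expression, and the choice to report first and last as line numbers in the new file are assumptions.

```python
import re
from collections import namedtuple

# Local stand-in with the documented ChunkStats fields.
ChunkStats = namedtuple("ChunkStats", "path chunk first last added removed")

# Unified diff hunk header, e.g. "@@ -10,2 +11,2 @@".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def parse_hunks(path, diff_text):
    """Yield one ChunkStats per hunk of a unified diff (illustrative only)."""
    current = None  # [chunk index, first line, last line, added, removed]
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            if current is not None:
                yield ChunkStats(path, *current)
            start = int(match.group(1))
            length = int(match.group(2)) if match.group(2) is not None else 1
            index = 0 if current is None else current[0] + 1
            current = [index, start, start + max(length - 1, 0), 0, 0]
        elif current is not None:
            # Skip file headers (+++ / ---) that may appear between hunks.
            if line.startswith("+") and not line.startswith("+++"):
                current[3] += 1
            elif line.startswith("-") and not line.startswith("---"):
                current[4] += 1
    if current is not None:
        yield ChunkStats(path, *current)


diff = "\n".join([
    "--- a/app.py",
    "+++ b/app.py",
    "@@ -1,2 +1,3 @@",
    " import os",
    "+import sys",
    " def main():",
    "@@ -10,2 +11,2 @@",
    "-    print('old')",
    "+    print('new')",
    "     return 0",
])
chunks = list(parse_hunks("app.py", diff))
```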
codemetrics.svn¶
Getting your data from Subversion.
_SvnLogCollector related functions.
- class codemetrics.svn.SvnDownloader(command, svn_client=None, cwd=None)[source]¶
Download files from Subversion.
- class codemetrics.svn.SvnProject(cwd=PosixPath('.'), client='svn')[source]¶
Project for Subversion SCM.
- download(data)[source]¶
Download results from Subversion.
- Parameters
data (DataFrame) – pd.DataFrame containing at least revision and path.
- Return type
- Returns
list of file contents.
- get_log(path='.', after=None, before=None, progress_bar=None, relative_url=None, _pdb=False)[source]¶
Entry point to retrieve svn log.
- Parameters
path (str) – location to retrieve the log for.
after (Optional[datetime]) – only get the log after this time stamp. Defaults to one year ago.
before (Optional[datetime]) – only get the log before this time stamp. Defaults to now.
progress_bar (Optional[tqdm]) – tqdm.tqdm progress bar.
relative_url (Optional[str]) – Subversion relative URL (e.g. /project/trunk/).
_pdb – drop into the debugger on parsing errors.
- Return type
DataFrame
- Returns
pandas.DataFrame with columns matching the fields of codemetrics.scm.LogEntry.
Example:
import datetime
import codemetrics as cm
last_year = datetime.datetime.now() - datetime.timedelta(365)
log_df = cm.svn.get_svn_log(path='src', after=last_year)
- codemetrics.svn.get_diff_stats(data, svn_client=None, chunks=None, cwd=None)[source]¶
Download diff chunks statistics from Subversion.
- Parameters
data (DataFrame) – revision ID of the change set.
svn_client (Optional[str]) – Subversion client executable. Defaults to svn.
chunks – if True, return statistics by chunk. Otherwise, return just the added and removed columns for each path. If chunks is None, defaults to True for a data frame and False for a series.
cwd (Optional[Path]) – root of the directory under SCM.
- Return type
Optional[DataFrame]
- Returns
Dataframe containing the statistics for each chunk.
Example:
import pandas as pd
import codemetrics as cm
log = cm.get_svn_log().set_index(['revision', 'path'])
log.loc[:, ['added', 'removed']] = log.reset_index().\
    groupby('revision').\
    apply(cm.svn.get_diff_stats, chunks=False)
codemetrics.git¶
Getting your data from git.
Git related functions.
- class codemetrics.git.GitProject(cwd=PosixPath('.'), client='git')[source]¶
Project for git SCM.
- download(data)[source]¶
Download results from Git.
- Parameters
data (DataFrame) – pd.DataFrame containing at least revision and path.
- Return type
- Returns
list of file contents.
- get_log(path='.', after=None, before=None, progress_bar=None, relative_url=None, _pdb=False)[source]¶
Entry point to retrieve git log.
- Parameters
path (str) – location of the checked-out file or directory to get the log for.
after (Optional[datetime]) – only get the log after this time stamp. Defaults to one year ago.
before (Optional[datetime]) – only get the log before this time stamp. Defaults to now.
progress_bar (Optional[tqdm]) – tqdm.tqdm progress bar.
_pdb (bool) – drop into the debugger on parsing errors.
- Return type
DataFrame
- Returns
pandas.DataFrame with columns matching the fields of codemetrics.scm.LogEntry.
Example:
import datetime
import codemetrics as cm
last_year = datetime.datetime.now() - datetime.timedelta(365)
log_df = cm.git.get_git_log(path='src', after=last_year)
- codemetrics.git.download(data, client=None, cwd=None)[source]¶
Downloads files from git.
- Parameters
data (DataFrame) – dataframe containing at least the path and revision columns to identify the files to download.
client (Optional[str]) – Git client executable. Defaults to git.
cwd (Optional[Path]) – working directory, typically the root of the directory under SCM.
- Return type
- Returns
list of scm.DownloadResult.
codemetrics.core¶
The main functions are located in core but can be accessed directly from the main module.
For instance:
>>> import codemetrics as cm
>>> import codemetrics.svn
>>> log_df = cm.svn.get_svn_log()
>>> ages_df = cm.get_ages(log_df)
- codemetrics.core.get_ages(data, by=None)[source]¶
Generate age of each file based on last change.
Takes the output of an SCM log, or just its date column, and returns the ages.
- Parameters
data (DataFrame) – log or date column of log.
by (Optional[Sequence[str]]) – keys used to group data before calculating the age. See pandas.DataFrame.groupby. Defaults to ['path'].
- Return type
DataFrame
- Returns
age of most recent modification as pandas.DataFrame.
Example:
get_ages = codemetrics.get_ages(log_df)
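The age calculation described above (time elapsed since the most recent change, grouped by path) can be sketched in plain Python. This is not the codemetrics implementation: the helper name ages_by_path, the explicit now argument and the choice of days as the unit are assumptions.

```python
from datetime import datetime, timezone

# (path, date) pairs standing in for the path and date columns of the log.
log = [
    ("src/a.py", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    ("src/a.py", datetime(2024, 1, 21, tzinfo=timezone.utc)),
    ("src/b.py", datetime(2024, 1, 11, tzinfo=timezone.utc)),
]


def ages_by_path(entries, now):
    """Return {path: days since the most recent change} (illustrative only)."""
    last_change = {}
    for path, date in entries:
        if path not in last_change or date > last_change[path]:
            last_change[path] = date
    return {
        path: (now - date).total_seconds() / 86400
        for path, date in last_change.items()
    }


ages = ages_by_path(log, now=datetime(2024, 1, 31, tzinfo=timezone.utc))
```

Only the latest change per path matters here, so the earlier change to src/a.py has no effect on its age.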
- codemetrics.core.get_co_changes(log=None, by=None, on=None)[source]¶
Generate co-changes report.
Returns a DataFrame with the following columns:
- primary: first path changed.
- secondary: second path changed.
- coupling: how often the paths change together.
- Parameters
log – output log from SCM.
by – aggregation level. Defaults to path.
on – Field name to join/merge on. Defaults to revision.
- Returns
pandas.DataFrame
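A co-change report of this shape can be sketched without pandas. The coupling formula used here (revisions shared by both paths divided by the revisions that touched the primary path) is an assumption; this page does not state how codemetrics computes it.

```python
from collections import defaultdict
from itertools import permutations

# (revision, path) pairs standing in for the revision and path columns of the log.
log = [
    ("r1", "a.py"), ("r1", "b.py"),
    ("r2", "a.py"), ("r2", "b.py"),
    ("r3", "a.py"), ("r3", "c.py"),
    ("r4", "a.py"),
]


def co_changes(entries):
    """Return {(primary, secondary): coupling} under the assumed formula."""
    paths_by_revision = defaultdict(set)
    for revision, path in entries:
        paths_by_revision[revision].add(path)
    changes = defaultdict(int)  # revisions touching each path
    shared = defaultdict(int)   # revisions touching both paths of a pair
    for paths in paths_by_revision.values():
        for path in paths:
            changes[path] += 1
        for primary, secondary in permutations(sorted(paths), 2):
            shared[(primary, secondary)] += 1
    return {pair: count / changes[pair[0]] for pair, count in shared.items()}


coupling = co_changes(log)
```

With this definition the measure is asymmetric: b.py always changes with a.py, but a.py changes with b.py only half of the time.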
- codemetrics.core.get_complexity(group, project)[source]¶
Generate complexity information for files and revisions in dataframe.
For each pair of (path, revision) in the input dataframe, analyze the code with lizard and return the output.
- Parameters
group (Union[DataFrame, Series]) – contains at least path and revision values.
project (Project) – scm.Project derived class used to retrieve files for a specific revision as codemetrics.scm.DownloadResult objects.
- Return type
DataFrame
- Returns
Dataframe containing output of function-level lizard.analyze
Example:
import codemetrics as cm
import codemetrics.git
project = cm.git.GitProject()
log = project.get_log()
log.groupby(['revision', 'path']).apply(cm.get_complexity, project=project)
- codemetrics.core.get_hot_spots(log, loc, by=None, count_one_change_per=None)[source]¶
Generate hot spots from SCM and loc data.
Cross SCM log and lines of code as an approximation of complexity to determine paths that are complex and change often.
- Parameters
log – output log from SCM.
loc – output from cloc.
by – aggregation level: path (default) or another column.
count_one_change_per – allows one to count one change by day or one change per JIRA instead of one change by revision.
- Returns
pandas.DataFrame
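The idea of crossing change frequency with size can be sketched as follows. This is a plain-Python illustration, not the codemetrics implementation: the helper name hot_spots and the sort order are assumptions, and the lines and changes keys are borrowed from the default column names of vis_hot_spots.

```python
from collections import Counter

# (revision, path) pairs from the log, and lines of code per path (e.g. from cloc).
log = [("r1", "a.py"), ("r2", "a.py"), ("r3", "a.py"), ("r3", "b.py")]
loc = {"a.py": 120, "b.py": 800, "c.py": 40}


def hot_spots(entries, lines_of_code):
    """Combine change counts and size for each path (illustrative only)."""
    # Count one change per (revision, path) pair.
    changes = Counter(path for _, path in set(entries))
    rows = [
        {"path": path, "lines": lines, "changes": changes.get(path, 0)}
        for path, lines in lines_of_code.items()
    ]
    # Most frequently changed first, larger files first among ties.
    return sorted(rows, key=lambda row: (row["changes"], row["lines"]), reverse=True)


spots = hot_spots(log, loc)
```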
- codemetrics.core.get_mass_changes(log, min_path=None, max_changes_per_path=None)[source]¶
Extract mass changesets from the SCM log data frame.
Calculate the number of files changed by each revision and extract that list according to the threshold.
- Parameters
log (DataFrame) – SCM log data, expected to contain at least the revision, added, removed, and path columns.
min_path (Optional[int]) – threshold for the number of files changed to consider the revision a mass change.
max_changes_per_path (Optional[float]) – threshold for the number of changed lines (added + removed) per file that changed.
- Return type
DataFrame
- Returns
revisions that had more files changed than the threshold as a pd.DataFrame with columns revision, path, changes and changes_per_path.
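The grouping behind this report can be sketched in plain Python. The helper below is not the codemetrics implementation, and treating min_path as an inclusive threshold is an assumption; only the output keys (revision, path, changes, changes_per_path) come from the description above.

```python
from collections import defaultdict

# (revision, path, added, removed) tuples standing in for the log columns.
log = [
    ("r1", "a.py", 10, 2),
    ("r1", "b.py", 1, 1),
    ("r1", "c.py", 0, 5),
    ("r2", "a.py", 3, 0),
]


def mass_changes(entries, min_path):
    """Return one row per revision touching at least min_path files."""
    paths = defaultdict(set)
    changes = defaultdict(int)
    for revision, path, added, removed in entries:
        paths[revision].add(path)
        changes[revision] += added + removed
    rows = []
    for revision, changed_paths in paths.items():
        count = len(changed_paths)
        if count >= min_path:  # assumed inclusive threshold
            rows.append({
                "revision": revision,
                "path": count,
                "changes": changes[revision],
                "changes_per_path": changes[revision] / count,
            })
    return rows


report = mass_changes(log, min_path=3)
```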
- codemetrics.core.guess_components(paths, stop_words=None, n_clusters=8)[source]¶
Guess components from an iterable of paths.
- Parameters
paths – list of string containing file paths in the project.
stop_words – stop words. Passed to TfidfVectorizer.
n_clusters – number of clusters. Passed to MiniBatchKMeans.
- Returns
pandas.DataFrame
See also
sklearn.feature_extraction.text.TfidfVectorizer, sklearn.cluster.MiniBatchKMeans
codemetrics.vega¶
Bridges visualization in Jupyter notebooks with Vega and Altair.
- codemetrics.vega.build_hierarchy(data, get_parent=<function dirname>, root='', max_iter=100, col_name=None)[source]¶
Build a hierarchy from a data set and a get_parent relationship.
The output frame adds 2 columns in front: id and parent. Both are numerical where the parent id identifies the id of the parent as returned by the get_parent function.
The id of the root element is set to 0 and the parent is set to np.nan.
- Parameters
data (DataFrame) – data containing the leaves of the tree.
get_parent – function returning the parent of an element.
root (str) – expected root of the hierarchy.
max_iter (int) – maximum number of iterations.
col_name (Optional[str]) – name of the column to use as input (defaults to column 0).
- Return type
DataFrame
- Returns
pandas.DataFrame with the columns id, parent and col_name. The parent value identifies the id of the parent in the hierarchy where the id 0 is the root. The columns other than col_name are discarded.
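The hierarchy construction can be sketched in plain Python on a list of relative paths. This is an illustration rather than the codemetrics implementation: it returns a list of dicts instead of a DataFrame, assigns ids in sorted order after the root, and raises ValueError when max_iter is exhausted, all of which are assumptions.

```python
import math
import os.path


def build_hierarchy(leaves, get_parent=os.path.dirname, root="", max_iter=100):
    """Return rows with id, parent and path; the root has id 0 and parent NaN."""
    nodes = set(leaves)
    frontier = set(leaves)
    for _ in range(max_iter):
        # Parents not seen yet become the next level to walk up from.
        parents = {get_parent(node) for node in frontier if node != root} - nodes
        if not parents:
            break
        nodes |= parents
        frontier = parents
    else:
        raise ValueError("hierarchy not resolved within max_iter iterations")
    # Root first, then the remaining nodes in sorted order.
    ordered = sorted(nodes, key=lambda node: (node != root, node))
    ids = {node: index for index, node in enumerate(ordered)}
    return [
        {
            "id": ids[node],
            "parent": math.nan if node == root else ids[get_parent(node)],
            "path": node,
        }
        for node in ordered
    ]


rows = build_hierarchy(["src/core/a.py", "src/b.py"])
```

Intermediate directories such as src and src/core are added automatically because they are reached by applying get_parent repeatedly to the leaves.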
- codemetrics.vega.vis_ages(df, height=300, width=400, colorscheme='greenblue')[source]¶
Convert get_ages output to a json vega dict.
- Parameters
df (DataFrame) – input data returned by codemetrics.get_ages().
height (int) – vertical size of the figure.
width (int) – horizontal size of the figure.
colorscheme (str) – color scheme. See https://vega.github.io/vega/docs/schemes/
- Return type
dict
- Returns
Vega description suitable for use with Altair.
Example:
import codemetrics as cm
from altair.vega.v4 import Vega
ages = cm.get_ages(log_df)
desc = cm.vega.vis_ages(ages)
Vega(desc)  # display the visualization inline in your notebook.
- codemetrics.vega.vis_hot_spots(df, height=300, width=400, size_column='lines', color_column='changes', colorscheme='yelloworangered')[source]¶
Convert get_hot_spots output to a json vega dict.
- Parameters
df (DataFrame) – input data returned by codemetrics.get_hot_spots().
height (int) – vertical size of the figure.
width (int) – horizontal size of the figure.
size_column (str) – column that drives the size of the circles.
color_column (str) – column that drives the color intensity of the circles.
colorscheme (str) – color scheme. See https://vega.github.io/vega/docs/schemes/
- Return type
dict
- Returns
Vega description suitable for use with Altair.
Example:
import codemetrics as cm
from altair.vega.v4 import Vega
hspots = cm.get_hot_spots(log_df, loc_df)
desc = cm.vega.vis_hot_spots(hspots)
Vega(desc)  # display the visualization inline in your notebook.
Command line scripts¶
cm_func_stats¶
The codemetrics interface offers a command line tool, cm_func_stats, to compute statistics on functions.
For now the statistics are limited to the number of lines of code (LOC), the cyclomatic complexity of the function (CCN), and the most frequent tokens together with their span (see https://www.fluentcpp.com/2018/10/23/word-counting-span/ for more information):
>cm_func_stats --help
Usage: cm_func_stats [OPTIONS] FILE_PATH LINE_NO
Generate statistics on the function specified by FILE_PATH LINE_NO.
Options:
--version Show the version and exit.
--help Show this message and exit.
And for an example:
>cm_func_stats codemetrics\cmdline.py 42
codemetrics\cmdline.py(39): cm_func_stats@39-55@codemetrics\cmdline.py, NLOC: 16, CCN: 4
codemetrics\cmdline.py(43): f occurs 7 time(s), spans 13 lines (76.47%)
codemetrics\cmdline.py(41): func_info occurs 4 time(s), spans 9 lines (52.94%)
codemetrics\cmdline.py(49): token occurs 3 time(s), spans 4 lines (23.53%)
codemetrics\cmdline.py(45): write occurs 2 time(s), spans 9 lines (52.94%)
codemetrics\cmdline.py(45): sys occurs 2 time(s), spans 9 lines (52.94%)
codemetrics\cmdline.py(45): stdout occurs 2 time(s), spans 9 lines (52.94%)
codemetrics\cmdline.py(48): span occurs 2 time(s), spans 5 lines (29.41%)
codemetrics\cmdline.py(48): func_span occurs 2 time(s), spans 5 lines (29.41%)
codemetrics\cmdline.py(43): msg occurs 2 time(s), spans 2 lines (11.76%)
codemetrics\cmdline.py(39): line_no occurs 2 time(s), spans 3 lines (17.65%)
codemetrics\cmdline.py(39): file_path occurs 2 time(s), spans 3 lines (17.65%)
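The occurrence and span figures can be reproduced with a short sketch. It assumes that a token's span is the number of lines from its first to its last occurrence, inclusive, and that the percentage is relative to the total number of lines of the function, which is consistent with the sample output above (13 of the 17 lines from 39 to 55 gives 76.47%). The tokenizer is a simple identifier regex, not the one used by cm_func_stats.

```python
import re
from collections import defaultdict


def token_spans(source_lines):
    """Return (token, occurrences, span in lines, span %) tuples, most frequent first."""
    total = len(source_lines)
    positions = defaultdict(list)
    for line_no, line in enumerate(source_lines, start=1):
        # Simplistic tokenizer: identifiers and keywords only.
        for token in re.findall(r"[A-Za-z_]\w*", line):
            positions[token].append(line_no)
    stats = []
    for token, lines in positions.items():
        span = lines[-1] - lines[0] + 1
        stats.append((token, len(lines), span, round(100 * span / total, 2)))
    # Sort by decreasing count, then alphabetically for a stable order.
    return sorted(stats, key=lambda item: (-item[1], item[0]))


source = [
    "def total(values):",
    "    result = 0",
    "    for value in values:",
    "        result += value",
    "    return result",
]
stats = token_spans(source)
```

A token with a large span relative to its number of occurrences is used far from where it is introduced, which is the signal the linked article suggests looking for.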