Getting added and removed lines from Subversion log

The Subversion log (unlike git's) does not provide the number of lines added and removed for each commit, so codemetrics resorts to a two-pass process. The first pass retrieves the log:

log = cm.get_svn_log()

At this point, the added and removed columns will be NaN. To populate them, run the second pass. It is slow because it calls svn diff -c once for each revision:

log.loc[:, ['added', 'removed']] = log.groupby('revision').apply(cm.get_diff_stats)

Note that chunks=True returns diff stats with one row for each diff chunk.
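To make the second pass more concrete, here is a minimal sketch of how added and removed lines can be counted from unified diff text, with one result per chunk (as with chunks=True) or summed per diff. It is not the codemetrics implementation: the names count_diff_lines, total_diff_lines and SAMPLE are hypothetical, and the sample diff is made up for illustration.

```python
import re

# Unified-diff chunk header, e.g. "@@ -1,3 +1,4 @@". A missing count means 1.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def count_diff_lines(diff_text):
    """Return a list of (added, removed) tuples, one per diff chunk."""
    hunks = []
    old_left = new_left = 0  # lines still expected in the current chunk
    for line in diff_text.splitlines():
        if old_left <= 0 and new_left <= 0:
            # Outside a chunk: skip file headers until the next "@@" line.
            match = HUNK_HEADER.match(line)
            if match:
                old_left = int(match.group(1)) if match.group(1) is not None else 1
                new_left = int(match.group(2)) if match.group(2) is not None else 1
                hunks.append([0, 0])
            continue
        if line.startswith("+"):
            hunks[-1][0] += 1
            new_left -= 1
        elif line.startswith("-"):
            hunks[-1][1] += 1
            old_left -= 1
        elif line.startswith("\\"):
            pass  # "\ No newline at end of file" marker
        else:
            # Context line: present in both the old and the new file.
            old_left -= 1
            new_left -= 1
    return [tuple(hunk) for hunk in hunks]


def total_diff_lines(diff_text):
    """Return (added, removed) summed over all chunks."""
    hunks = count_diff_lines(diff_text)
    return sum(a for a, _ in hunks), sum(r for _, r in hunks)


# Made-up diff used only to exercise the functions above.
SAMPLE = """\
Index: foo.py
--- foo.py (revision 1)
+++ foo.py (revision 2)
@@ -1,3 +1,4 @@
 import os
-x = 1
+x = 2
+y = 3
 print(x)
@@ -10,2 +11,1 @@
 done
-# old comment
"""

per_chunk = count_diff_lines(SAMPLE)
totals = total_diff_lines(SAMPLE)
```

Tracking the line counts announced in each chunk header, rather than only looking at the leading character, keeps the "---" and "+++" file headers of a multi-file diff from being counted as changes.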

See also

Function codemetrics.svn.get_diff_stats

Leverage dask to speed up retrieval of added and removed lines with Subversion

Retrieving added and removed lines with Subversion can be slow because codemetrics makes repeated calls to svn diff --git -c XXX to count the number of pluses and minuses. To speed up the process somewhat, one can try to leverage dask like so:

import multiprocessing as mp

import dask.dataframe as dd
import dask.diagnostics as ddiags

import codemetrics as cm

n_cpus = mp.cpu_count()
log = cm.get_svn_log().reset_index()
# Run on the last row only to get an empty frame with the expected columns (dask meta).
meta = cm.get_diff_stats(log[-1:]).iloc[0:0]
partitioned_log = dd.from_pandas(log, npartitions=n_cpus)
wf = partitioned_log.groupby('revision').apply(cm.get_diff_stats, chunks=False, meta=meta)
with ddiags.ProgressBar():  # optional
    addrem_df = wf.compute()   # returns a pandas.DataFrame

Note that there is a significant overhead to start the parallel process.
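Because of that overhead, parallelism may only pay off for logs with many revisions. The sketch below illustrates one way to fall back to a plain loop for small inputs, using the standard library's thread pool rather than dask. The function map_revisions, its min_parallel threshold and the stand-in fake_diff_stats are all hypothetical and not part of codemetrics; a sensible threshold would have to be measured on the repository at hand.

```python
from concurrent.futures import ThreadPoolExecutor


def map_revisions(func, revisions, max_workers=4, min_parallel=8):
    """Apply func to each revision, using threads only for larger inputs."""
    revisions = list(revisions)
    if len(revisions) < min_parallel:
        # Too few revisions: a plain loop avoids the pool start-up cost.
        return [func(rev) for rev in revisions]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(func, revisions))


def fake_diff_stats(revision):
    """Stand-in for a per-revision call such as running svn diff -c."""
    return {"revision": revision, "added": revision % 3, "removed": revision % 2}


small = map_revisions(fake_diff_stats, [101, 102])        # serial path
large = map_revisions(fake_diff_stats, range(100, 120))   # threaded path
```

Threads are a reasonable fit here under the assumption that the per-revision work mostly waits on an external svn process.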