Getting added and removed lines from Subversion log¶
Subversion log (unlike git) does not provide the number of lines added and removed for each commit so codemetrics resort to a 2 passes process retrieving the log first:
log = cm.get_svn_log()
At this point, added and removed column will be NaN. To populate then run the second pass. It is slow
because it relies on repeatedly calling
svn diff -c for each revision:
log.loc[:, ['added', 'removed']] = log.groupby('revision').apply(cm.get_diff_stats)
chunks=True returns diff stats with a row for each diff chunks.
Leverage dask to speed up retrieval of added and removed line with Subversion¶
Retrieving added and removed line with Subversion can be slow because codemetrics makes repeated calls to
svn diff --git -c XXX to count the number of pluses and minuses. To speed up the process somewhat, one can try to
leverage dask like so:
import dask.dataframe as dd import dask.diagnostics as ddiags import multiprocessing as mp n_cpus = mp.cpu_count() log = cm.get_svn_log().reset_index() meta = get_diff_stats(log[-1:]).iloc[0:0] partitioned_log = dd.from_pandas(log, npartitions=n_cpus) wf = partioned_log.groupby('revision').apply(get_diff_stats, chunks=False, meta=meta) with ddiags.ProgressBar(): # optional addrem_df = wf.conpute() # returns a pandas.DataFrame
Note that there is a significant overhead to start the parallel process.