Self documenting data manipulation with R-Markdown

12 December 2015

The company I worked for until recently provides a lot of data cleaning and data manipulation services, mostly with proprietary tools that another developer and I built over the last few years. One of the things I introduced before I left was a bridge between the proprietary datasets used inside that company and the R project. My main motivation was to enable self-documenting workflows via R Markdown, and in this blog post I want to talk about the advantages of this approach.

R Markdown is a syntax definition and a set of R packages that make it very straightforward to write normal text documents with embedded R code. When the text file is compiled, the R code is executed and the results are embedded into the output document. These results can be the textual output of R functions (like summary(), which prints a few key descriptive statistics for a data set) or even graphics.

As the name suggests, R Markdown uses the Markdown syntax for formatting text, so you wrap a phrase in double asterisks to make it bold, and so on. Markdown is pretty neat in that it is easy to read as plain text but also easily compiled to HTML so it can be viewed with proper formatting in a browser.

It’s probably easier to understand with an example, so here is a simplified version of what this looks like:

This is a sample r-markdown script that plots Age vs Income as a 
Hexbin plot. This text here is the natural language part that can 
use markdown to format the text, e.g. to make things **bold**.

```{r Income vs Age - Hexbin}
# The backticks in the line above started an R code block. 
# This is a comment inside the R block. We now load the hexbin 
# library and plot the data2008 dataset (the code for loading 
# the dataset was omitted here)

library(hexbin)
bin <- hexbin(data2008[, 1], data2008[, 2], xbins = 50, xlab = "Age", ylab = "Income")
plot(bin)
```
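
One way to compile such a document is to feed it to the rmarkdown package. A minimal sketch, assuming that package is installed and the file above is saved as report.Rmd (the file name is just an example):

    # knit the R code chunks and compile the result to an HTML document
    Rscript -e 'rmarkdown::render("report.Rmd")'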

And this is what the compiled HTML looks like (embedded here as a screenshot):

The great thing about inlining R code in a markdown document in this way is that you can create a new workflow that is much more maintainable because the focus shifts to documenting the intention. Instead of focusing on writing R code to get a job done and then documenting it a little with some comments or as text in a separate document, the analyst starts the work by describing, in plain text, what it is she wants to do. She then embeds the code to do the transformation, and can even generate graphs that show the data before and after.

This idea of documenting changes by embedding graphs was the original trigger for writing the bridge code. I had implemented the weighting code in our proprietary tool, but the textual output describing the changes in the weights was a bit terse. It was clear that a graphical representation would be easier to understand quickly, yet introducing a rich graphing library into our proprietary DSL would have been a major undertaking. By making it fast and easy to get our data sets into R and back out again, we quickly gained a way to create graphs, and it also enabled the self-documenting workflow described above.

Another big plus is that since all transformations are described in natural language as well as in code, auditing data manipulations becomes a lot easier and quicker. I can thus wholeheartedly recommend this workflow to everyone who works with data for a living.

Laws - the source code of society

25 May 2014

Today the citizens of the EU elect a new parliament, and this seemed a good opportunity to write down some of my thoughts on lawmaking. As the title suggests, I think that law texts are very similar to source code. Of course, source code is a lot stricter insofar as it defines exactly what the computer will do, whereas laws describe general rules for behaviour as well as the punishments for violating them - ultimately, though, both are expressed as text. Yet where programmers have developed sophisticated tools to work with source code, laws are still developed in bizarre workflows that necessitate a huge support staff. In this post I want to describe one set of tools programmers use to work on texts, how I think they could be useful for lawmaking, and what our society would gain if our lawmakers adopted them.

When non-programmers write, they often realize that it would be beneficial to save old versions of their texts. This leads to document file names with numbers attached (“Important text 23.doc”) and then the infamous “final”, “final final”, “really final” etcetera progression. Programmers instead rely on a set of tools known as Distributed Version Control Systems (DVCS). The most famous of these is probably Git, which is used in many open source projects. These tools manage the history of the text documents registered with them and allow easy sharing and merging of changes.

In practice, after changing a couple of lines in one or more documents, these changes are recorded as one “changeset”. These changesets can be displayed as a timeline, and one can go back to the state of the documents at any point in their history.
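
In Git, for example, browsing that history and going back to an earlier state boils down to a couple of commands (the file name and changeset id below are made up for illustration):

    # list the recorded changesets for a document, newest first
    git log --oneline civil-code.txt
    # restore the document to its state at a particular changeset
    git checkout a1b2c3d -- civil-code.txt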

A sample of the timeline of several changes in a DVCS

This in itself is already clearly useful, but what really makes a DVCS a magnificent tool is the ability to manage not just a simple linear progression of changes but different “branches”. This allows several people to make changes to the text, share their changesets and let the system automatically combine their changes into a new version.
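
With Git, such a branch-and-merge workflow could look like this (the branch name and commit message are made up for illustration):

    # start a branch for an alternative wording and record the change
    git checkout -b alternative-wording
    git commit -a -m "Reword the liability clause"
    # switch back to the main draft and fold the branch into it
    git checkout master
    git merge alternative-wording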

Because changes by many people creating many branches can become confusing, there are some great tools to visualize these changes, as well as complex workflows that allow others to review and authorize changes via their cryptographic signature.

So how would these tools be useful in creating law texts? The main benefit would be to clearly document which politician introduced which line into a law, and which edits they made. Others in the working group could create branches with their favoured wording, and these could then be combined into the final version that is voted on by parliament.

One very useful tool is the blame tool (sometimes also called praise or annotate), which displays a document in such a way that each line is tagged with the person who last changed it. I think it could be quite revealing to see who changed what in our laws, something that at present would be very time-consuming to find out.
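
In Git this is a single command; run against a hypothetical law text it would look like this:

    # show, for every line of the text, who last changed it and when
    git blame civil-code.txt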

The website Abgeordnetenwatch already tracks the voting behaviour of the members of the German parliament, and it would be a great extension of this effort if the genesis of every law text were plain for everyone to see as well. Bundesgit already puts the final German law texts into a Git repository - but because these are only the final texts and the repository is maintained by a single person instead of the real authors, the real power of a DVCS can’t be used. For things like the blame tool to work, all the small changes made during the draft stage in the parliament’s working group would have to be recorded in a DVCS by their respective authors.

I am sure that there are many more potential improvements to the process of lawmaking that would be possible if lawmakers used DVCS tools. But the main and immediate advantage would be an increase in transparency, which in the end is what democracy is all about. Laws are the source code of our societies. Let’s make sure they are made with the best tools available.

Automatic edit detection with FFmpeg and import into Premiere via EDL

23 February 2013

I like to study films in an NLE like Premiere. You can see the rhythm of the scenes a lot more clearly when you look at the clips in a timeline. Like this:


Unfortunately, it used to be very tedious to go through a scene (let alone a whole film) and set all the cuts again by hand. Until now. Today I created a workflow that automates edit detection for use in an NLE. All you have to do is run two tools and you get an EDL that you can import into the NLE of your choice, link the media and off you go. The whole process takes maybe 15 minutes for a feature-length movie.

You need to touch the command line for two commands, but stay with me, it's really simple. The hard part, the actual scene detection in the movie file, is done by the wonderful FFmpeg project, more specifically its ffprobe tool. It takes a video file and creates a CSV file with the times of the edit points it detected.

The second part is creating an EDL from this CSV file. I wrote a little tool for this today that you can download below. It is written in C# as a console application, is released as GPL code and is hosted on Bitbucket if you want to compile or modify it yourself. It should work as is on Windows, and on OS X and Linux if you install Mono.

If you want to use it, here is a step by step guide.

  1. Download ffmpeg from their website. Either put the bin directory on your PATH or, if you don't know how to do that, put ffprobe.exe into the directory where your movie file is.
  2. Open a command line and go to the folder where the movie file is (Start -> cmd.exe on Windows).
  3. Run this command, replacing MOVIEFILENAME with the name of your movie file. The file name shouldn't contain any spaces. Be sure to copy the command exactly as stated here:
    ffprobe -show_frames -of compact=p=0 -f lavfi "movie=MOVIEFILENAME,select=gt(scene\,.4)" > MOVIEFILENAME.csv

  4. This will take a while and output a bit of status information. The ".4" is the scene detection threshold between 0.0 and 1.0; lower numbers create more edits, higher numbers create fewer edits. 0.4 should be a good default.

  5. Download EDLGenerator.exe and run it, again from the command line, like so:
    EDLGenerator.exe MOVIENAME.csv FRAMERATE MOVIENAME MOVIENAME.edl
    The first argument is the csv file you generated earlier, FRAMERATE is the framerate of the movie (needed for drop-frame timecode corrections where appropriate), the third argument is the source filename that should be written into the EDL file (which might help some NLEs link the media) and the last is the name of the EDL file to generate. A complete worked example follows after this list.
  6. Import the edl file into your NLE (In premiere in File->Import)
  7. Link the media (in Premiere CS6 you can select all clips in the bin and choose link to media and just have to select the source file once even though premiere creates one source item for each edit)
  8. Voilà, you are done!
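
Putting both commands together for a hypothetical movie called film.mp4 at 25 fps (the file name and framerate are just placeholders), the whole run looks like this:

    # step 3: detect the cuts and write them into a csv file
    ffprobe -show_frames -of compact=p=0 -f lavfi "movie=film.mp4,select=gt(scene\,.4)" > film.mp4.csv
    # step 5: turn the csv file into an EDL that the NLE can import
    EDLGenerator.exe film.mp4.csv 25 film.mp4 film.mp4.edl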

I have only used it on a couple of movies so far so there may be some rough edges - if you run into a problem, drop me a line.

Incremental backups made easy

11 December 2011

In my last post I wrote at length about backups, but I omitted one thing: how to make incremental backups that use so-called hard links and that barely take more space than a single 1:1 backup (on both Windows and OS X). First though, let me explain what is so nice about this concept.

Backups with a history

If space were no concern, it would be nice never to throw backups away. We would simply include the date and time each backup was taken in the name of its target folder and keep all those backups. Then, if one day we discover that we need a file that was deleted two weeks ago, we would simply open the backup from 16 days ago and restore it from there. If, like me, you have several TB of important data and can barely afford two additional sets of hard drives (one to keep as a daily backup, one that is stored at another location and swapped regularly), this seems impossible.

Incremental backups

If you look at your whole hard drive(s), you will notice that only a fraction of the data actually changes between two backups. This is what incremental backups use to their advantage: they only store the new and changed files and thus save a lot of space. However, you now have a full backup at one point in time, and every time you run the backup again you get a new folder structure (or, if you choose a bad backup software, a single proprietary file) containing only the new and changed files. This is a bit cumbersome. Wouldn't it be great to have a full snapshot each time?

This is where a feature called hard links comes in handy. Hard links are a way for file systems to reference the same file several times while only storing it once. Both NTFS (the main Windows file system) and HFS+ (the main OS X file system) support hard links, but both operating systems hide this feature from the user interface.
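
You can still create them from the command line, though. A quick illustration (the file names are just examples):

    # OS X / Linux: create a second directory entry that points at the same data
    ln original.txt link-to-original.txt
    # Windows (cmd.exe): the same idea using mklink
    mklink /H link-to-original.txt original.txt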

What we gain from this approach

So taken together, these features enable incremental backups that look like full snapshots but only store the new and changed data. This way you only need a backup drive that is a bit bigger than your source (since you will want to have some additional space for the newly created and modified files) and you can keep a full history on it.

rsync and two GUIs for it

rsync is an open source application that is used to copy data. Since version 3 or so it supports creating snapshot copies using hardlinks. On OSX the tool backuplist+ allows you to easily create incremental backups by checking the "Incremental backups" check box and entering how many past snapshots to keep. On windows QtdSync allows you to do the same thing if you change the backup type from "synchronisation" to "incremental".