Ubuntu Package Repository Visualisation: Take 2

Over on my old blog I posted about some visualisations I made of the Ubuntu packages installed on my machine. At the time, I mistakenly believed that I had graphed the entire package repository when I had actually only graphed the packages that my laptop had seen. There were several questions about my method, and people requested desktop-sized images. This is a follow-up post to the original. I will describe my method in more detail, and provide links to the images, so you can download them and do whatever you like to them.

Creating the Graph

The first thing I need to do is to get a list of all the packages in the Ubuntu repository, and their 'Depends', 'Recommends', and 'Suggests' relationships to other packages. In the past, I used this rather nasty piece of python code:

def get_package_list():
    """Return a list of packages, both installed and uninstalled."""
    output = check_output(['dpkg','-l','*'])
    packages = []
    for line in output.split('\n'):
        parts = line.split()
        if not parts or not match('[uirph][nicurhWt]', parts[0]):
            continue
        packages.append(parts[1])

This is the code that introduced the problem. I had assumed, that dpkg -l * would show me all the packages. Instead, it shows me a small subset of the packages.

The solution is to install the python-apt package and use the apt module. The code now looks like this:

from apt import Cache


def get_package_list():
    """Return a list of packages, both installed and uninstalled."""
        cache = Cache()
    return cache.keys()

That looks much better! What's more, it returns 66,000 items, whereas dpkg -l * returns around 8000 items.

The next step is to build a graph around these packages. I use the excellent graph_tool module. The entire script is somewhat hacky, but it works. The full source listing is here:

#!/usr/bin/env python


from graph_tool.all import *
from apt import Cache

graph = Graph()
package_nodes = {}
package_info = {}
cache = Cache()


def get_package_list():
    """Return a list of packages, both installed and uninstalled."""
    global cache
    return cache.keys()


def package_is_virtual(pkg_name):
    global cache
    return pkg_name not in cache


def get_list(pkg_name, attr_name):
    global cache

    package = cache[pkg_name]
    version = package.candidate or package.versions[0]
    deps = []
    for dependency in getattr(version, attr_name, []):
        for alt_dep in dependency:
            deps.append(alt_dep.name)
    return deps


def get_depends_list(pkg_name):
    """Get a list of package names this package depends upon."""
    return get_list(pkg_name, 'dependencies')

def get_recommends_list(pkg_name):
    """Get a list of package names this package recommends."""
    return get_list(pkg_name, 'recommends')

def get_suggests_list(pkg_name):
    """Get a list of package names this package suggests."""
    return get_list(pkg_name, 'suggests')

def get_installed_size(pkg_name):
    global cache

    package = cache[pkg_name]
    return package.installedSize

def get_homepage(pkg_name):
    global cache

    package = cache[pkg_name]
    return package.homepage or "None"

def get_priority(pkg_name):
    global cache

    package = cache[pkg_name]
    return package.priority or "None"

def get_section(pkg_name):
    global cache

    package = cache[pkg_name]
    return package.section

def get_graph_node_for_package(pkg_name):
    """Given a package name, return the graph node for that package. If the graph
    node does not exist, it is created, and it's meta-data filled.

    """
    global graph
    global package_nodes

    if pkg_name not in package_nodes:
        n = graph.add_vertex()
        package_nodes[pkg_name] = n
        # add node properties:
        graph.vertex_properties["label"][n] = pkg_name

        if not package_is_virtual(pkg_name):
            graph.vertex_properties["homepage"][n] = get_homepage(pkg_name)
            graph.vertex_properties["priority"][n] = get_priority(pkg_name)
            graph.vertex_properties["section"][n] = get_section(pkg_name)
            graph.vertex_properties["installed-size"][n] = get_installed_size(pkg_name)
        return n
    else:
        return package_nodes.get(pkg_name)


def create_graph_properties():
    # Create property lists we need:
    global graph
    graph.vertex_properties["label"] = graph.new_vertex_property("string")
    graph.vertex_properties["homepage"] = graph.new_vertex_property("string")
    graph.vertex_properties["priority"] = graph.new_vertex_property("string")
    graph.vertex_properties["section"] = graph.new_vertex_property("string")
    graph.vertex_properties["installed-size"] = graph.new_vertex_property("int")
    graph.edge_properties["weight"] = graph.new_edge_property("int")


def create_graph_from_packages():
    global graph
    create_graph_properties()
    # get list of packages:
    packages = get_package_list()

    n = 0
    for pkg_name in packages:
        if package_is_virtual(pkg_name):
            continue
        node = get_graph_node_for_package(pkg_name)
        for dependency in get_depends_list(pkg_name):
            edge = graph.add_edge(node, get_graph_node_for_package(dependency))
            graph.edge_properties["weight"][edge] = 4
        for dependency in get_recommends_list(pkg_name):
            edge = graph.add_edge(node, get_graph_node_for_package(dependency))
            graph.edge_properties["weight"][edge] = 2
        for dependency in get_suggests_list(pkg_name):
            edge = graph.add_edge(node, get_graph_node_for_package(dependency))
            graph.edge_properties["weight"][edge] = 1
        n += 1
        if n % 1000 == 0:
            print "%d / /%d" % (n, len(packages))


if __name__ == '__main__':
    create_graph_from_packages()
    print "Saving graph file..."
    graph.save('graph.gml')

The interesting bits here are that I'm storing some additional information for each node:

  • label - this string will be picked up by Gephi, in a later stage, and used as the node label. This gets set to the package name.
  • homepage - I'm not sure why I thought this data would be useful, but it was easy to extract, so why not store it? Space is cheap!
  • priority - this is a string, usually one of 'required', 'important', 'standard', 'optional', or 'extra'.
  • section - this is a string that is supposed to categorize packages into different sections.
  • installed-size - what it says on the tin.

Additionally, we set a weight to each relationship between a package and it's dependency.

On my laptop, this script takes just under two minutes to run, and spits out a 41 MiB gml file.

Import into Gephi

The next step is to import the graph.gml file into Gephi. This is pretty straightforward: Gephi supports the gml file format, so this is as simple as opening Gephi and finding the gml file and opening it. Once that has completed (it can take a while), you should see something like this:

static/images/initial_gephi_layout.png

Gephi has loaded the graph, but there's no layout information yet, so we get each node in a random location (bounded within certain coordinates, which is why it looks like a solid rectangle of gray). You may notice at this point that Gephi is very slow to render the graphs. I have found that turning off the rendering of edges improves performance hugely. There's a button to do this just below the graph render window.

Laying out the Graph

Next, we need to lay out the graph. This is computationally expensive, and you need to take care to select an appropriate layout algorithm for the graph. Gephi ships with many different layout algorithms, but very few of them are appropriate for a graph of this size (many of them have an order of complexity of O(n**2), which translates to "several days to run a single iteration", so you don't want that!). The best algorithm I found for this graph is the "Force Atlas 2", with some tweaked settings. The settings I used are shown below:

static/images/gephi_layout_settings.png

Note: I found that setting the thread count to higher values resulted in a much faster graph layout... until my laptop overheated and rebooted. Since I want to set this process going, walk away, and come back to it several hours later, I decided to use a single thread to do the layout work. Your mileage may vary!

After a while, the graph should settle at a reasonably stable configuration. My graph looked like this:

static/images/gephi_layout_finished.png

You may need to tweak the layout settings if you find the results disagreeable. I spent a considerable amount of time tweaking layout settings in order to get a layout that was agreeable to me.

Drawing the Graph

Now that we have a layout that looks OK, we need to tweak the drawing settings. For my graph, I decided to color each node according to it's modularity class (which groups nodes by their neighborhoods), and size each node by it's total degree. Finally, switch to the preview tab and play around with the visualization settings. The settings I used are:

static/images/gephi_draw_settings.png

Note that I decided not to draw the edges in the graph. There are many many options here which produce very different graph images. I ended up spending a lot of time at this stage as well, simply tweaking colors, sizes, and opacities.

The Pretties

I've produced a number of images at various resolutions. Click the links below for the full-size. Some of these are reasonably large - you have been warned!

static/images/gephi_output/thumb.png
1024x768 1280x1024 1280x800
1366x768 1440x900 1920x1080
1920x1440 2560x1920 5120x3840

Moving On

This has been a reasonably interesting diversion, but I'm probably not going to return to this data set. If you do something interesting with it, please let me know!


comments powered by Disqus