The code is organised into packages, in the standard Django way.

digraph d {
   node [shape=folder];
   disco_service [label="<project>\ndisco_service"];
   crawler [label="<app>\ncrawler"];
   metadata [label="<app>\nmetadata"];
   govservices [label="<app>\ngovservices"];

   disco_service -> crawler;
   disco_service -> metadata;
   disco_service -> govservices;
}


The following documentation is incomplete (work in progress); for the time being it is better to refer to the actual sources.

Package: disco_service

This is a Django project, containing the usual project-level files.

It also contains the configuration for async worker nodes.

Package: crawler

This Django app is a simple wrapper.

The crawler app does not have an admin interface.


An ORM interface to the DB which is shared with the disco_crawler node.js app.

class crawler.models.WebDocument(*args, **kwargs)

Resource downloaded by the disco_crawler node.js app.

The document attribute is a copy of the resource which was downloaded.

url uniquely identifies the resource (there is no numeric primary key). host, path, port and protocol describe the HTTP request used to retrieve the resource. lastfetchdatetime and nextfetchdatetime are heuristically determined and drive the behavior of the crawler. _hash is indexed and has a corresponding attribute in the metadata.Resource class (the two are compared to determine whether the metadata is dirty).

The rest of the attributes are derived from the content of the document.
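
The dirty check described above can be sketched in plain Python. This is an illustration only, not the project's code: the two dataclasses are hypothetical stand-ins for the ORM classes, most attributes are omitted, and the use of SHA-1 to compute _hash is an assumption (the source does not say how the hash is derived).

```python
import hashlib
from dataclasses import dataclass


def content_hash(document: str) -> str:
    # Assumption: any stable digest would do here; SHA-1 is an arbitrary choice.
    return hashlib.sha1(document.encode("utf-8")).hexdigest()


@dataclass
class WebDocumentSketch:
    # Hypothetical stand-in for crawler.models.WebDocument; url acts as the key.
    url: str
    document: str
    _hash: str


@dataclass
class ResourceSketch:
    # Hypothetical stand-in for metadata.models.Resource.
    url: str
    _hash: str


def metadata_is_dirty(doc: WebDocumentSketch, resource: ResourceSketch) -> bool:
    # The metadata is stale when the crawler has stored different content.
    return doc._hash != resource._hash


page = "<html>hello</html>"
doc = WebDocumentSketch("http://example.gov.au/", page, content_hash(page))
fresh = ResourceSketch(doc.url, content_hash(page))
stale = ResourceSketch(doc.url, content_hash("<html>old</html>"))
```

Comparing stored hashes avoids re-reading the document body every time the two databases are reconciled.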


This module contains integration tasks for synchronising this DB with the metadata used in the rest of the discovery layer.


Dispatch metadata.Resource inserts for new crawler.WebDocuments.


Dispatch metadata.Resource updates for changed crawler.WebDocuments.
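
The two synchronisation tasks above amount to partitioning crawled documents into "new" and "changed". A rough sketch of that decision follows, assuming each side can be reduced to a mapping from url to hash; the function name plan_sync and both mappings are hypothetical and do not come from the sources.

```python
from typing import Dict, List, Tuple


def plan_sync(
    crawled: Dict[str, str], known: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (urls_to_insert, urls_to_update) given url -> hash mappings."""
    # New: the crawler has the URL but the metadata side does not.
    to_insert = sorted(url for url in crawled if url not in known)
    # Changed: both sides have the URL but the hashes differ.
    to_update = sorted(
        url
        for url, digest in crawled.items()
        if url in known and known[url] != digest
    )
    return to_insert, to_update
```

In the real tasks each URL in the two lists would presumably be dispatched to a worker rather than returned; the sketch covers only the comparison.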

Package: metadata

This django app manages the content metadata.


class metadata.models.Resource(*args, **kwargs)

ORM class wrapping persistent data of the web resource

Contains hooks into the code for resource processing


Analyse the resource content and return a Goose interface.


Look up the content of the corresponding WebDocument.document.


Attempt to produce a plain text version of resource content


Search result summary.

This is a crude hack; it doesn’t even break on word boundaries. There should be much smarter ways of doing this.
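
The summary is described above as not breaking on word boundaries. One possible improvement, sketched here with the standard library, is textwrap.shorten, which drops whole words that do not fit. The function names and the default width are hypothetical, and naive_summary is only a guess at the kind of slicing currently used.

```python
import textwrap


def naive_summary(text: str, width: int = 40) -> str:
    # Plain slicing can cut a word in half.
    return text[:width]


def word_aware_summary(text: str, width: int = 40) -> str:
    # textwrap.shorten collapses whitespace and removes whole trailing words,
    # appending the placeholder only when something was removed.
    return textwrap.shorten(text, width=width, placeholder="...")
```

The result never exceeds width characters, placeholder included, so it can replace a fixed-length slice without changing layout assumptions.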


Attempt to produce a single line description of the resource



Wrap metadata.Resource constructor

Does not do any input validation.


ORM lookup then update

Performs no input validation and assumes the lookup will not miss.
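
The two gaps noted above (no input validation, and an update that assumes the lookup succeeds) could be closed along the following lines. This is a sketch under stated assumptions, not the project's code: a plain dict stands in for the ORM table, and the names ResourceNotFound, insert_resource and update_resource are hypothetical.

```python
from typing import Dict


class ResourceNotFound(LookupError):
    """Raised when an update targets a URL that has no stored resource."""


def _validate(url: str, digest: str) -> None:
    # Reject obviously bad input before touching the store.
    if not isinstance(url, str) or not url.strip():
        raise ValueError("url must be a non-empty string")
    if not isinstance(digest, str) or not digest:
        raise ValueError("hash must be a non-empty string")


def insert_resource(store: Dict[str, str], url: str, digest: str) -> None:
    _validate(url, digest)
    store[url] = digest


def update_resource(store: Dict[str, str], url: str, digest: str) -> None:
    _validate(url, digest)
    if url not in store:
        # Report the miss explicitly instead of assuming the lookup succeeds.
        raise ResourceNotFound(url)
    store[url] = digest
```

Raising a dedicated exception lets the calling task decide whether a miss should be retried, logged, or turned into an insert.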

Package: govservices

This app wraps public data about government services.


class govservices.models.Agency(id, acronym)
   exception Agency.DoesNotExist
   exception Agency.MultipleObjectsReturned
   Agency.objects = <django.db.models.manager.Manager object>

class govservices.models.SubService(id, cat_id, desc, name, info_url, primary_audience, agency)
   exception SubService.DoesNotExist
   exception SubService.MultipleObjectsReturned
   SubService.objects = <django.db.models.manager.Manager object>

class govservices.models.ServiceTag(id, label)
   exception ServiceTag.DoesNotExist
   exception ServiceTag.MultipleObjectsReturned
   ServiceTag.objects = <django.db.models.manager.Manager object>

class govservices.models.LifeEvent(id, label)
   exception LifeEvent.DoesNotExist
   exception LifeEvent.MultipleObjectsReturned
   LifeEvent.objects = <django.db.models.manager.Manager object>

class govservices.models.ServiceType(id, label)
   exception ServiceType.DoesNotExist
   exception ServiceType.MultipleObjectsReturned
   ServiceType.objects = <django.db.models.manager.Manager object>

class govservices.models.Service(id, src_id, agency, old_src_id, json_filename, info_url, name, acronym, tagline, primary_audience, analytics_available, incidental, secondary, src_type, description, comment, current, org_acronym)
   exception Service.DoesNotExist
   exception Service.MultipleObjectsReturned
   Service.objects = <django.db.models.manager.Manager object>

class govservices.models.Dimension(id, dim_id, agency, name, dist, desc, info_url)
   exception Dimension.DoesNotExist
   exception Dimension.MultipleObjectsReturned
   Dimension.objects = <django.db.models.manager.Manager object>


Suite of tests assuring that the code which manipulates govservices is working correctly.

It would be highly preferable to refactor this to use a REST API to interrogate the service catalogue, rather than messing about with the ServiceJsonRepository.

class …, stderr=None, no_color=False)

… extension. Call with:

python update_servicecatalogue


python update_servicecatalogue <entity>

where <entity> is the name of one of the classes in metadata.models