Code

The code is organised into packages in the standard Django way: one project package and three app packages.

digraph d {
   node [shape=folder];
   disco_service [label="<project>\ndisco_service"];
   crawler [label="<app>\ncrawler"];
   metadata [label="<app>\nmetadata"];
   govservices [label="<app>\ngovservices"];

   disco_service -> crawler;
   disco_service -> metadata;
   disco_service -> govservices;

}

The following documentation is incomplete (work in progress). For the time being, it is better to refer to the actual sources.

Package: disco_service

This is a Django project, containing the usual settings.py, urls.py and wsgi.py.

Note

It also contains celery.py, which holds the configuration for the async worker nodes.

Package: crawler

This Django app is a simple wrapper around the DB which is shared with the disco_crawler node.js app.

The crawler app does not have an admin interface.

crawler.models

An ORM interface to the DB which is shared with the disco_crawler node.js app.

class crawler.models.WebDocument(*args, **kwargs)[source]

Resource downloaded by the disco_crawler node.js app.

The document attribute is a copy of the resource which was downloaded.

url uniquely identifies the resource (there is no numeric primary key). host, path, port and protocol are attributes of the HTTP request used to retrieve the resource. lastfetchdatetime and nextfetchdatetime are heuristically determined and drive the behavior of the crawler. _hash is indexed and has a corresponding attribute in the metadata.Resource class (the two are compared to determine whether the metadata is dirty).

The rest of the attributes are derived from the content of the document.
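The dirty check described above can be sketched in plain Python. This is only an illustration, not the project's code: the two dataclasses are hypothetical stand-ins for crawler.models.WebDocument and metadata.models.Resource, and the name of the hash attribute on the Resource side is assumed to be _hash as well.

```python
from dataclasses import dataclass


@dataclass
class WebDocumentRow:
    """Hypothetical stand-in for crawler.models.WebDocument."""
    url: str
    _hash: str


@dataclass
class ResourceRow:
    """Hypothetical stand-in for metadata.models.Resource.

    Assumption: the corresponding hash attribute is also called _hash.
    """
    url: str
    _hash: str


def metadata_is_dirty(doc: WebDocumentRow, resource: ResourceRow) -> bool:
    """Return True when the crawled content no longer matches the metadata."""
    if doc.url != resource.url:
        # The url is the only key, so comparing different urls is a bug.
        raise ValueError("document and resource refer to different urls")
    return doc._hash != resource._hash
```

Because _hash is indexed, the real comparison could presumably be pushed down into a query rather than done row by row in Python.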

crawler.tasks

This module contains integration tasks for synchronising this DB with the metadata used in the rest of the discovery layer.

crawler.tasks.sync_from_crawler()[source]

Dispatch metadata.Resource inserts for new crawler.WebDocument rows.

crawler.tasks.sync_updates_from_crawler()[source]

Dispatch metadata.Resource updates for changed crawler.WebDocument rows.
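As a rough sketch of what these two tasks decide, the function below walks the crawler side and dispatches either an insert or an update. It is an assumption-laden illustration: the real tasks are separate functions working through the ORM, whereas here both are folded into one loop, the two dicts (url to hash) stand in for the two tables, and the insert and update callables are hypothetical placeholders for whatever the tasks actually dispatch.

```python
def dispatch_sync(crawler_hashes, metadata_hashes, insert, update):
    """Dispatch inserts for new urls and updates for changed ones.

    crawler_hashes and metadata_hashes map url -> hash (hypothetical
    in-memory stand-ins for the WebDocument and Resource tables).
    insert and update are callables taking a url.
    """
    for url, doc_hash in sorted(crawler_hashes.items()):
        if url not in metadata_hashes:
            # No metadata yet: this is a new document.
            insert(url)
        elif metadata_hashes[url] != doc_hash:
            # Hashes differ: the metadata is out of date.
            update(url)
        # Otherwise the metadata is current and nothing is dispatched.
```

For example, passing list.append methods as the callables collects the urls that would be dispatched, which is convenient for testing the decision logic without a task queue.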

Package: metadata

This Django app manages the content metadata.

metadata.models

class metadata.models.Resource(*args, **kwargs)[source]

ORM class wrapping persistent data of the web resource

Contains hooks into the code for resource processing

_article()[source]

Analyse resource content, return Goose interface

_decode()[source]

Look up the content of the corresponding WebDocument.document

excerpt()[source]

Attempt to produce a plain text version of resource content

sr_summary()[source]

Search result summary.

This is a crude hack; it doesn’t even break on word boundaries. There should be much smarter ways of doing this.
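One possible improvement, sketched here under assumptions rather than taken from the sources, is to truncate on word boundaries using the standard library's textwrap.shorten. The function name, the default width of 160 and the "..." placeholder are all hypothetical choices.

```python
import textwrap


def word_boundary_summary(text: str, width: int = 160, placeholder: str = "...") -> str:
    """Collapse whitespace and truncate text to at most `width` characters.

    textwrap.shorten drops whole words from the end until the result,
    including the placeholder, fits. Text that already fits is returned
    without a placeholder.
    """
    return textwrap.shorten(text, width=width, placeholder=placeholder)
```

Note that textwrap.shorten also normalises runs of whitespace to single spaces, which is usually what a search result summary wants but does change the text slightly.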

title()[source]

Attempt to produce a single line description of the resource

metadata.tasks

metadata.tasks.insert_resource_from_row()[source]

Wrap metadata.Resource constructor

Known weakness: it does not do any input validation.

metadata.tasks.update_resource_from_row()[source]

ORM lookup then update

Known weaknesses: there is no input validation, and it assumes the lookup won’t miss.
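The two gaps noted for these tasks (no input validation, and no handling of a missed lookup) could be closed along the following lines. This is a plain-Python sketch, not the project's code: a dict keyed by url stands in for the Resource table, and the required field names, the InvalidRow exception and the function names are hypothetical.

```python
REQUIRED_FIELDS = ("url", "_hash")  # hypothetical minimal set of fields


class InvalidRow(ValueError):
    """Raised when a row is missing required data."""


def validate_row(row):
    """Return a cleaned copy of row, or raise InvalidRow."""
    if not isinstance(row, dict):
        raise InvalidRow("row must be a dict")
    missing = [name for name in REQUIRED_FIELDS if not row.get(name)]
    if missing:
        raise InvalidRow("missing fields: " + ", ".join(missing))
    return {name: row[name] for name in REQUIRED_FIELDS}


def insert_resource(store, row):
    """Validate, then insert into the in-memory stand-in for the table."""
    clean = validate_row(row)
    store[clean["url"]] = clean


def update_resource(store, row):
    """Validate, then update. Return False instead of failing on a miss."""
    clean = validate_row(row)
    if clean["url"] not in store:
        return False
    store[clean["url"]].update(clean)
    return True
```

With the Django ORM, the missed lookup would surface as the model's DoesNotExist exception from objects.get(), so the equivalent guard would catch that exception rather than test dict membership.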

Package: govservices

This app wraps public data about government services.

govservices.models

class govservices.models.Agency(id, acronym)[source]
exception DoesNotExist
exception Agency.MultipleObjectsReturned
Agency.dimension_set
Agency.objects = <django.db.models.manager.Manager object>
Agency.service_set
Agency.subservice_set
class govservices.models.SubService(id, cat_id, desc, name, info_url, primary_audience, agency)[source]
exception DoesNotExist
exception SubService.MultipleObjectsReturned
SubService.agency
SubService.objects = <django.db.models.manager.Manager object>
class govservices.models.ServiceTag(id, label)[source]
exception DoesNotExist
exception ServiceTag.MultipleObjectsReturned
ServiceTag.objects = <django.db.models.manager.Manager object>
ServiceTag.service_set
class govservices.models.LifeEvent(id, label)[source]
exception DoesNotExist
exception LifeEvent.MultipleObjectsReturned
LifeEvent.objects = <django.db.models.manager.Manager object>
LifeEvent.service_set
class govservices.models.ServiceType(id, label)[source]
exception DoesNotExist
exception ServiceType.MultipleObjectsReturned
ServiceType.objects = <django.db.models.manager.Manager object>
ServiceType.service_set
class govservices.models.Service(id, src_id, agency, old_src_id, json_filename, info_url, name, acronym, tagline, primary_audience, analytics_available, incidental, secondary, src_type, description, comment, current, org_acronym)[source]
exception DoesNotExist
exception Service.MultipleObjectsReturned
Service.agency
Service.life_events
Service.objects = <django.db.models.manager.Manager object>
Service.service_tags
Service.service_types
class govservices.models.Dimension(id, dim_id, agency, name, dist, desc, info_url)[source]
exception DoesNotExist
exception Dimension.MultipleObjectsReturned
Dimension.agency
Dimension.objects = <django.db.models.manager.Manager object>
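
The attribute names listed above suggest how the models relate to each other: Service.agency looks like a foreign key (with Agency.service_set as its reverse), and Service.service_tags looks like a many-to-many relation (with ServiceTag.service_set as its reverse). The sketch below mirrors that reading with plain dataclasses. It is an inference from the listing, not a copy of the model definitions, only a few fields are shown, and the tag_service helper is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Agency:
    acronym: str
    # Reverse side of Service.agency (assumed to be a foreign key).
    service_set: List["Service"] = field(default_factory=list)


@dataclass(eq=False)
class ServiceTag:
    label: str
    # Reverse side of Service.service_tags (assumed many-to-many).
    service_set: List["Service"] = field(default_factory=list)


@dataclass(eq=False)
class Service:
    name: str
    agency: Agency
    service_tags: List[ServiceTag] = field(default_factory=list)

    def __post_init__(self):
        # Keep the reverse relation in step, as an ORM would.
        self.agency.service_set.append(self)


def tag_service(service: Service, tag: ServiceTag) -> None:
    """Link a service and a tag in both directions (hypothetical helper)."""
    service.service_tags.append(tag)
    tag.service_set.append(service)
```

LifeEvent and ServiceType appear to follow the same pattern as ServiceTag, while SubService and Dimension appear to follow the same pattern as Service.agency.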

govservices.tests

A suite of tests that checks the code which manipulates govservices is working correctly.

govservices.management.commands.update_servicecatalogue

It would be highly preferable to refactor this to use a REST API to interrogate the service catalogue, rather than messing about with the ServiceJsonRepository.

class govservices.management.commands.update_servicecatalogue.Command(stdout=None, stderr=None, no_color=False)[source]

manage.py extension. Call with:

python manage.py update_servicecatalogue

or:

python manage.py update_servicecatalogue <entity>

where <entity> is the name of one of the classes in metadata.models
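
The two call forms can be modelled with an optional positional argument. The sketch below uses argparse directly instead of a Django Command class so that it runs standalone; the entity names passed in and the rule that omitting the argument means "update everything" are assumptions, not confirmed behaviour of the command.

```python
import argparse


def build_parser():
    """Parser with one optional positional argument, <entity>."""
    parser = argparse.ArgumentParser(prog="manage.py update_servicecatalogue")
    parser.add_argument(
        "entity",
        nargs="?",      # zero or one value
        default=None,
        help="name of a single entity to update (assumed: all if omitted)",
    )
    return parser


def entities_to_update(argv, all_entities):
    """Return the list of entity names selected by the command line."""
    args = build_parser().parse_args(argv)
    if args.entity is None:
        return list(all_entities)
    if args.entity not in all_entities:
        raise ValueError("unknown entity: " + args.entity)
    return [args.entity]
```

Rejecting unknown names up front gives the caller a clear error instead of a failure part-way through an update.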