Code¶

The code is organised into packages, in the standard Django way.

[Diagram: the disco_service project contains three apps: crawler, metadata and govservices.]

The following documentation is incomplete (work in progress). For the time being, it is better to refer to the actual sources.

Package: disco_service¶

This is a Django project containing the usual settings.py, urls.py and wsgi.py.

Note

It also contains celery.py, which holds the configuration for the asynchronous worker nodes.

Package: crawler¶

This Django app is a thin wrapper around the database shared with the disco_crawler node.js app.

The crawler app does not have an admin interface.

crawler.models¶

An ORM interface to the DB which is shared with the disco_crawler node.js app.

class crawler.models.WebDocument(*args, **kwargs)[source]¶

Resource downloaded by the disco_crawler node.js app.

The document attribute is a copy of the resource which was downloaded.

url uniquely identifies the resource (there is no numeric primary key). host, path, port and protocol are attributes of the HTTP request used to retrieve the resource. lastfetchdatetime and nextfetchdatetime are heuristically determined and drive the behaviour of the crawler. _hash is indexed and has a corresponding attribute in the metadata.Resource class (the two are compared to determine whether the metadata is dirty).

The rest of the attributes are derived from the content of the document.
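The hash comparison described above can be sketched in plain Python. This is a minimal illustration rather than the project's code: the dataclasses stand in for the Django models and carry only a few of the documented attributes, while the `metadata_is_dirty` helper and the `url` field on `Resource` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WebDocument:
    """Stand-in for crawler.models.WebDocument (subset of attributes)."""
    url: str        # uniquely identifies the resource
    document: str   # copy of the downloaded content
    _hash: str      # content hash stored alongside the document


@dataclass
class Resource:
    """Stand-in for metadata.models.Resource (subset of attributes)."""
    url: str        # assumed here to mirror WebDocument.url
    _hash: str      # hash of the content the metadata was derived from


def metadata_is_dirty(doc: WebDocument, resource: Resource) -> bool:
    """Hypothetical helper: the metadata is stale when the hashes differ."""
    return doc._hash != resource._hash


doc = WebDocument(url="http://example.gov.au/", document="<html>new</html>", _hash="abc123")
fresh = Resource(url=doc.url, _hash="abc123")
stale = Resource(url=doc.url, _hash="000000")

print(metadata_is_dirty(doc, fresh))
print(metadata_is_dirty(doc, stale))
```

Comparing a short indexed hash avoids reading and comparing the full document body for every resource.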

crawler.tasks¶

This module contains integration tasks for synchronising this DB with the metadata used in the rest of the discovery layer.

crawler.tasks.sync_from_crawler()[source]¶: dispatch metadata.Resource inserts for new crawler.WebDocuments

crawler.tasks.sync_updates_from_crawler()[source]¶: dispatch metadata.Resource updates for changed crawler.WebDocuments
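The split between the two tasks (inserts for new documents, updates for changed ones) can be sketched as follows. This is a hedged illustration only: `plan_sync` is a hypothetical helper, plain dictionaries mapping a URL to a content hash stand in for the two databases, and the function returns the URLs instead of dispatching anything.

```python
from typing import Dict, List, Tuple


def plan_sync(crawled: Dict[str, str], known: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split crawled URLs into inserts and updates.

    Both arguments map a URL to a content hash: `crawled` for the crawler
    DB, `known` for the metadata already held by the discovery layer.
    """
    # New documents: present in the crawler DB, absent from the metadata.
    inserts = sorted(url for url in crawled if url not in known)
    # Changed documents: present in both, but with a different hash.
    updates = sorted(
        url for url, digest in crawled.items()
        if url in known and known[url] != digest
    )
    return inserts, updates


crawled = {"http://a.example/": "h1", "http://b.example/": "h2", "http://c.example/": "h3"}
known = {"http://a.example/": "h1", "http://b.example/": "old"}

inserts, updates = plan_sync(crawled, known)
print(inserts)  # URLs that would need a new metadata.Resource
print(updates)  # URLs whose metadata.Resource would need refreshing
```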

Package: metadata¶

This Django app manages the content metadata.

metadata.models¶

class metadata.models.Resource(*args, **kwargs)[source]¶

ORM class wrapping the persistent data of a web resource.

Contains hooks into the code for resource processing.

_article()[source]¶: Analyse resource content, return Goose interface

_decode()[source]¶: Look up the content of the corresponding WebDocument.document

excerpt()[source]¶: Attempt to produce a plain text version of resource content

sr_summary()[source]¶

Search result summary.

This is a crude hack: it does not even break on word boundaries. There should be much smarter ways of doing this.
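One possible way to truncate on word boundaries is the standard library's `textwrap.shorten`. The sketch below is an assumption about how a replacement could look, not the existing method: the `summarise` name, the default width and the `...` placeholder are all chosen for illustration.

```python
import textwrap


def summarise(text: str, width: int = 80) -> str:
    """Hypothetical alternative: shorten text without splitting a word."""
    # textwrap.shorten collapses runs of whitespace, then drops whole words
    # from the end until the result (placeholder included) fits in `width`.
    return textwrap.shorten(text, width=width, placeholder="...")


excerpt = "Apply for a passport online or renew an existing passport by post"
summary = summarise(excerpt, width=30)
print(summary)
```

Text that already fits within `width` is returned unchanged, apart from whitespace normalisation.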

title()[source]¶: Attempt to produce a single line description of the resource

metadata.tasks¶

metadata.tasks.insert_resource_from_row()[source]¶

Wrap metadata.Resource constructor

Note that it does not perform any input validation.

metadata.tasks.update_resource_from_row()[source]¶

ORM lookup then update

It performs no input validation and assumes that the lookup will not miss.
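The two gaps noted for these tasks (no input validation, and an unhandled lookup miss) could be closed along the following lines. This is a sketch under stated assumptions: a dictionary keyed by URL stands in for the ORM, and `REQUIRED_FIELDS`, `validate_row` and `update_resource` are hypothetical names, as is the row layout.

```python
from typing import Dict, Optional

REQUIRED_FIELDS = ("url", "_hash")  # assumed row layout, for illustration only


def validate_row(row: dict) -> dict:
    """Reject rows with missing or empty required fields."""
    missing = [name for name in REQUIRED_FIELDS if not row.get(name)]
    if missing:
        raise ValueError("row is missing required fields: " + ", ".join(missing))
    return row


def update_resource(store: Dict[str, dict], row: dict) -> Optional[dict]:
    """Update the stored record for row['url']; return None on a lookup miss."""
    row = validate_row(row)
    record = store.get(row["url"])
    if record is None:
        return None  # the caller can decide whether to insert instead
    record.update(row)
    return record


store = {"http://a.example/": {"url": "http://a.example/", "_hash": "old"}}
updated = update_resource(store, {"url": "http://a.example/", "_hash": "new"})
missed = update_resource(store, {"url": "http://zzz.example/", "_hash": "x"})
```

Returning `None` on a miss is only one option; raising an exception, or falling back to an insert, would be equally reasonable depending on how the tasks are scheduled.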

Package: govservices¶

This app wraps public data about government services.

govservices.models¶

class govservices.models.Agency(id, acronym)[source]¶

exception DoesNotExist¶

exception Agency.MultipleObjectsReturned¶

Agency.dimension_set¶

Agency.objects = <django.db.models.manager.Manager object>¶

Agency.service_set¶

Agency.subservice_set¶

class govservices.models.SubService(id, cat_id, desc, name, info_url, primary_audience, agency)[source]¶

exception DoesNotExist¶

exception SubService.MultipleObjectsReturned¶

SubService.agency¶

SubService.objects = <django.db.models.manager.Manager object>¶

class govservices.models.ServiceTag(id, label)[source]¶

exception DoesNotExist¶

exception ServiceTag.MultipleObjectsReturned¶

ServiceTag.objects = <django.db.models.manager.Manager object>¶

ServiceTag.service_set¶

class govservices.models.LifeEvent(id, label)[source]¶

exception DoesNotExist¶

exception LifeEvent.MultipleObjectsReturned¶

LifeEvent.objects = <django.db.models.manager.Manager object>¶

LifeEvent.service_set¶

class govservices.models.ServiceType(id, label)[source]¶

exception DoesNotExist¶

exception ServiceType.MultipleObjectsReturned¶

ServiceType.objects = <django.db.models.manager.Manager object>¶

ServiceType.service_set¶

class govservices.models.Service(id, src_id, agency, old_src_id, json_filename, info_url, name, acronym, tagline, primary_audience, analytics_available, incidental, secondary, src_type, description, comment, current, org_acronym)[source]¶

exception DoesNotExist¶

exception Service.MultipleObjectsReturned¶

Service.agency¶

Service.life_events¶

Service.objects = <django.db.models.manager.Manager object>¶

Service.service_tags¶

Service.service_types¶

class govservices.models.Dimension(id, dim_id, agency, name, dist, desc, info_url)[source]¶

exception DoesNotExist¶

exception Dimension.MultipleObjectsReturned¶

Dimension.agency¶

Dimension.objects = <django.db.models.manager.Manager object>¶

govservices.tests¶

A suite of tests to ensure that the code which manipulates govservices works correctly.

govservices.management.commands.update_servicecatalogue¶

It would be highly preferable to refactor this to use a REST API to interrogate the service catalogue, rather than messing about with the ServiceJsonRepository.

class govservices.management.commands.update_servicecatalogue.Command(stdout=None, stderr=None, no_color=False)[source]¶

manage.py extension. Call with:

python manage.py update_servicecatalogue

or:

python manage.py update_servicecatalogue <entity>

where <entity> is the name of one of the classes in metadata.models
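The optional `<entity>` argument can be illustrated with a standalone `argparse` parser (recent Django versions declare management command arguments through `argparse` as well). This is a sketch, not the command's actual implementation: `build_parser` is a hypothetical helper, and treating a missing entity as "update everything" is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two documented invocations."""
    parser = argparse.ArgumentParser(prog="manage.py update_servicecatalogue")
    # nargs="?" makes the positional optional; it defaults to None, which
    # this sketch treats as "update every entity" (an assumption).
    parser.add_argument("entity", nargs="?", default=None,
                        help="name of a single entity to update")
    return parser


parser = build_parser()
everything = parser.parse_args([])
single = parser.parse_args(["Resource"])

print(everything.entity)
print(single.entity)
```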