Downloading the Open Data Colombia database using Python

Open Data Colombia is a website that publishes public information about government entities in Colombia. The information can be queried for commercial or non-commercial purposes. This post shows how to download the whole dataset to a local hard drive using Python.

Getting the container list

A container is a government division responsible for one or more catalogs. We can get the list of containers with a curl command from the console:

    curl http://servicedatosabiertoscolombia.cloudapp.net/v1/

The result is an AtomPub service document, so it is easy to fetch and parse the response with Python:

    import httplib2
    import xml.etree.ElementTree as ET


    def get_container_list():
        """
        Get the container list from Open Data Colombia
        """
        h = httplib2.Http()
        (resp_headers, content) = h.request(
            "http://servicedatosabiertoscolombia.cloudapp.net/v1/", "GET")
        # The response is an AtomPub service document: root[0] is the
        # workspace and every collection inside it has an href attribute
        root = ET.fromstring(content)
        return [child.get('href') for child in root[0]]
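
For a quick sanity check, the function can be called directly (a minimal sketch; the output is simply whatever containers the service lists at the time):

    containers = get_container_list()
    print 'Found %d containers' % len(containers)
    for name in containers:
        print name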

Getting the list of catalogs

We can get the catalog list of each container with a GET request to http://servicedatosabiertoscolombia.cloudapp.net/v1/ followed by the container name:

    curl http://servicedatosabiertoscolombia.cloudapp.net/v1/Ministerio_de_tecnologias_de_informacion_y_las_comunicaciones

We can use Python to traverse the container list and collect the catalogs of every entity:

    def get_catalog_list(container_list):
        """
        Get the catalog list of every container
        """
        h = httplib2.Http()
        url = "http://servicedatosabiertoscolombia.cloudapp.net/v1/"
        catalogs_by_entity = {}

        for container in container_list:
            catalog_list = []
            if container is not None:
                (resp_headers, content) = h.request(url + container, "GET")
                try:
                    # Each catalog is listed as a collection with an href attribute
                    root = ET.fromstring(content)
                    for child in root[0]:
                        if 'href' in child.attrib:
                            catalog_list.append(child.attrib['href'])
                except Exception as e:
                    # Some containers return malformed XML; log them and continue
                    print e
                    print container
                catalogs_by_entity[container] = catalog_list
        return catalogs_by_entity
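
With these two functions we can already see how many catalogs each entity publishes (a minimal sketch, assuming the service is reachable):

    catalogs_by_entity = get_catalog_list(get_container_list())
    for entity, catalogs in catalogs_by_entity.items():
        print '%s: %d catalogs' % (entity, len(catalogs))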

Getting the data

We can make a GET request to http://servicedatosabiertoscolombia.cloudapp.net/v1/container/set?$format=json (where container is the container name and set is the catalog name) in order to get the data:

    curl http://servicedatosabiertoscolombia.cloudapp.net/v1/Ministerio_de_tecnologias_de_informacion_y_las_comunicaciones/dataset3

We can use a Python script to fetch the data of every entity and store it on our hard drive:

    import os
    import shutil


    def get_catalogs(catalogs_by_entity, directory='./'):
        """
        Get catalog data and save it to disk
        """
        h = httplib2.Http()
        root_path = os.path.join(directory, 'opendata')
        # Remove the output directory if it already exists
        if os.path.exists(root_path):
            shutil.rmtree(root_path)
        os.mkdir(root_path)
        url = "http://servicedatosabiertoscolombia.cloudapp.net/v1/"
        for entity, catalogs in catalogs_by_entity.items():
            # One sub-directory per entity
            dir_path = os.path.join(root_path, entity)
            os.makedirs(dir_path)
            for cat in catalogs:
                # Request the catalog as JSON and save it as <catalog>.json
                (resp_headers, content) = h.request(
                    url + entity + '/' + cat + '?$format=json', "GET")
                fname = cat + '.json'
                f = open(os.path.join(dir_path, fname), 'w')
                f.write(content)
                f.close()
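
Putting the three functions together, the whole dataset can be downloaded with a short driver (a minimal sketch; './' as the target directory is just an illustrative choice):

    if __name__ == '__main__':
        containers = get_container_list()
        catalogs_by_entity = get_catalog_list(containers)
        # Saves every catalog as <entity>/<catalog>.json under ./opendata
        get_catalogs(catalogs_by_entity, directory='./')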

Source code can be found at this link
