Some considerations for the ATGeo Gazetteer WG

Here are some notes intended as a starting point for discussion within the ATGeo Gazetteer Working Group. Mostly these are my opinions and I do not hold them dogmatically.

My framing is: How do we mine the 40+ years of accumulated best practices in Geographic Information Systems, and the 20+ years of hard-won experience working with user-generated location data on the Web, for the benefit of ATProtocol developers and their users?

On one hand, I think we should find the minimal subset of these best practices which are applicable to the ATmosphere, and publish protocols and guidelines that enable the ATmosphere community to benefit from past experience.

On the other hand, I'm sure that we are all familiar with overly complex technical architectures that, however well conceived, fail to be adopted because developers don't need or want their complexity.

So the following notes are meant to stake out some areas that clearly want review and discussion by the Working Group, towards a general consensus that helps the ATmosphere community build location apps quickly and safely. Where I use words like "does" or "should", please read them as expressions of semi-informed opinion only.

Terminology

The OED defines a gazetteer as "a geographical index or dictionary". In its simplest form, a gazetteer is just a list of toponyms, or place names. Obviously, to be useful for practical purposes, a digital gazetteer needs to include some reference information about where the named place exists on the Earth's surface.

In the practice of Geographic Information Systems (GIS), locations on the Earth's surface are indicated with a spatial geometry. The simplest geometry is a point, but lines, polygons, and collections of geometries are possible.

An abstract object with geographic information can be represented as a feature. Features have both a spatial geometry, and a set of non-geographic properties known as attributes. By definition, a feature is the association of a geometry with attribute data.

The most important attributes of any feature are its name and its type. Many places have names in multiple languages. It is conventional to speak of a "default" name for a place, but this is always an assertion with (sometimes contentious) sociocultural implications that deserve consideration.

The taxonomy of possible feature types is probably one of the most complex topics in location data. This, too, is a topic that carries sociocultural implications. Feature typing also has deep implications for applications that work with those features. I do not recommend that the Working Group seek to define a global ontology of feature types.

The English terms place and location have a wide variety of usages and connotations in location technology. I believe that the working group should adopt precise technical definitions of these terms to avoid confusion, or explicitly state that these terms should be non-technical in usage. One simple answer might be to say that place is a synonym for feature and location is a synonym for geometry.

The term geocoder is sometimes used as a synonym for "gazetteer". Properly speaking, geocoding is the practice of annotating non-geographic data with geographic features or geometries. Reverse geocoding refers to any process of retrieving non-geographic information associated with a particular geometry. A WG gazetteer reference service could be correctly termed a "geocoder" or "reverse geocoder", depending on the type of query used, but the database(s) backing the service should be referred to as a "gazetteer" proper.
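To make the distinction concrete, here is a minimal sketch of both query directions against a tiny in-memory gazetteer. Everything in it is an assumption made for illustration: the sample entries (with approximate coordinates), the function names, the exact-match name lookup, and the use of a spherical Earth for distance. A real service would sit on a proper spatial index.

```python
import math

# Illustrative sample entries with approximate coordinates (not authoritative data).
SAMPLE_GAZETTEER = [
    {"name": "Royal Observatory Greenwich", "latitude": 51.4769, "longitude": -0.0005},
    {"name": "Sydney Opera House", "latitude": -33.8568, "longitude": 151.2153},
    {"name": "Mount Everest", "latitude": 27.9881, "longitude": 86.9250},
]

EARTH_RADIUS_KM = 6371.0088  # mean radius; assumes a spherical Earth for simplicity


def distance_km(latitude_a, longitude_a, latitude_b, longitude_b):
    """Great-circle distance between two points, using the haversine formula."""
    phi_a = math.radians(latitude_a)
    phi_b = math.radians(latitude_b)
    delta_phi = phi_b - phi_a
    delta_lambda = math.radians(longitude_b - longitude_a)
    h = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(delta_lambda / 2) ** 2)
    # Clamp to 1 so floating-point error cannot push asin out of its domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def geocode(name):
    """Geocoding: non-geographic input (a name) in, geometry out."""
    wanted = name.casefold()
    for entry in SAMPLE_GAZETTEER:
        if entry["name"].casefold() == wanted:
            return {"latitude": entry["latitude"], "longitude": entry["longitude"]}
    return None


def reverse_geocode(latitude, longitude):
    """Reverse geocoding: geometry in, non-geographic information (a name) out."""
    nearest = min(
        SAMPLE_GAZETTEER,
        key=lambda e: distance_km(latitude, longitude, e["latitude"], e["longitude"]),
    )
    return nearest["name"]
```

Both functions read the same list, which is the point: the list is the gazetteer, and "geocoder" or "reverse geocoder" only describes the kind of query being asked of it.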

Gazetteer types

All gazetteers should basically be designed to host ATProtocol content records under stable AT-URIs that permit the places described to be referenced and composed with other data elsewhere.

In our design process, we should bear in mind that the ATmosphere will contain a range of types of gazetteer.

On one hand, we should plan to provide reference gazetteer services that supply the ATmosphere with consistent and stable sources of information about public places.

On the other hand, we should be prepared to support the development of third-party application gazetteers that aggregate and publish domain-specific information about places.

Wherever possible, application gazetteers should maintain strong references to related entries in public reference gazetteers, so that other applications can relate and reason about places as described by different sources.

As we design schemas and services, we should do our best to ensure that both reference and application gazetteers will be able to publish place data in compatible (or even identical) ways.

Safety

Safety is a paramount consideration for applications on the DWeb. To paraphrase Erin Kissane, "If we protect the most vulnerable among us, then we protect everybody." Unintended access to the location of a person or a community resource can constitute a grave physical threat to that person or community.

The WG should consider a proactive practice of modeling threats against possible users of Lexicon community geo specifications and implementations before releasing them.

The WG should also publish a set of safety guidelines for application developers seeking to use location data in their apps.

Content moderation

User-generated geographic data winds up needing moderation in two different ways:

The usual kind of per se content moderation that the ATmosphere is already working on
Quality control over the location and attribute data generated by users

Location data that is purported to be accurate and precise, but is not, can create serious safety issues. We can look to the OpenStreetMap community, now 20 years old, as one potential source of inspiration for how these issues can be identified and mitigated.

Language and social considerations

Being able to name a place is an important form of self-determination. Users should have the right to refer to any place by the names that are most culturally and linguistically relevant to them.

Toponym language identification should be baked into ATGeo specs from the ground up. Wherever the language of a place name is known, it should be indicated. This helps application developers present location information to users with proper localization.

Coordinate conventions

The most common way to identify a point on the Earth's surface is to use spherical coordinates of latitude and longitude. While working with spherical coordinates comes with its own challenges, any other coordinate system brings tradeoffs that should be delegated to application developers.

One source of confusion around geographic coordinates is axis ordering. Conventionally, we say "latitude and longitude" in English, and give latitude north or south first when writing geographic coordinates. This perpetually runs afoul of computational coordinate systems that put the spatial X axis (i.e. longitude) first.

Another source of confusion is the naming of the coordinate axes. Latitude and longitude? Lat and lon? Lon or long?

Therefore my strong recommendation is that the Working Group adopt the position that Earth-based spherical coordinates are always referred to as "latitude" and "longitude" in full, without abbreviation. Except where we incorporate external standards, we should explicitly not specify axis ordering in technical usage, and always reference the coordinate axes by name.

Coordinate values should always be given as signed decimal degrees, with negative values indicating latitude south or longitude west. No minutes and seconds, no "E/W" suffixes -- either signed decimal values or nothing. Application developers should be prepared to reject values that do not match this specification.
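As a rough illustration of what "prepared to reject" could look like, here is a minimal syntax check. The function name and the exact grammar are my assumptions, not anything the WG has agreed on. In particular, this sketch also rejects a leading plus sign, exponent notation, and a bare leading decimal point, which is stricter than the paragraph above strictly requires.

```python
import re

# Optional minus sign, ASCII digits, optional fractional part.
# Degree symbols, minutes/seconds, hemisphere suffixes, exponents and
# surrounding whitespace all fail to match.
_SIGNED_DECIMAL = re.compile(r"-?[0-9]+(\.[0-9]+)?")


def is_signed_decimal(text):
    """Return True if text is a plain signed decimal such as '-33.8568'."""
    return _SIGNED_DECIMAL.fullmatch(text) is not None
```

For example, `is_signed_decimal("-33.8568")` is true, while `"151.2153E"` and `"33°51'S"` are both rejected. This only checks the written form; range checking is a separate step.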

This coordinate system convention is known unambiguously as EPSG:4326 in the GIS industry.

Geoid model and altitude

The Earth is not a perfect sphere. It is sometimes referred to as an "oblate spheroid", which is more correct, but the reality is that the Earth is slightly lumpy. Any model of the Earth's actual surface is known as a geoid. The point of coordinate origin on the geoid is known as the datum. The modern standard for the Earth's geoid and datum is given by the WGS84 specification. The Working Group should simply adopt WGS84 as the standard reference geoid and datum for all Lexicon community work.

The third dimension in a geographic coordinate is the altitude or elevation. Properly, elevation refers to a point on the Earth's surface, so the Working Group should concern itself exclusively with the more general term "altitude", which can also cover objects in flight. Altitude should always be given in meters above the WGS84 definition of "mean sea level" (or MSL).

Other location types

The existing community.lexicon.location spec currently includes location types for h3 and address.

I confess to having mixed feelings about this. On one hand, both are legitimate and (generally) useful ways to identify geographic locations. On the other, both have application-specific considerations that give me pause, and lead me to wonder if they should be included here.

H3 is a discrete tessellation system that has certain nice properties. However, an H3 cell is almost impossible to calculate or use without relying on the H3 SDK. Also, H3 is one of a number of similar location representations, such as geohash and S2. The selection of a geographic grid system winds up being an application-specific choice -- which seems to be why Uber designed H3 even though S2 already existed. It's unclear to me why the ATGeo WG should prefer one of these schemes over others.

Physical street addressing, on the other hand, is a deep and complex topic, with endless variations not merely between countries but even within a single country. A street address may be human-readable but almost never provides location information that can be unambiguously interpreted in a computational context. This tends to push street address information towards being treated as auxiliary feature attribute data, rather than as a location per se.

Given the attention and care devoted elsewhere in the ATmosphere to drafting minimalist schemas that limit the burdens placed on downstream developers, it might be wise to evaluate these choices further, before implicitly encouraging devs to distribute location data that relies on these kinds of representation.

Lexicon schemas

With no judgment whatsoever intended, I want to propose a slightly different direction for Lexicon schemas than the one taken in the existing community.lexicon.location collection. This proposal is intended entirely as a basis for discussion, not as a fully baked technical specification.

NSID

I recommend that we consider shifting to the community.lexicon.geo namespace. This is for two reasons: First, I think we may want to use the term location to mean something more specific. Second, the geo identifier comes from the English prefix geo-, which is cognate with a variety of Indo-European words for Earth.

Location

The location type could identify the geometry of a point on the Earth's surface. The existing community.lexicon.location.geo is already very close to this, but I would take the name property and hoist it up to a higher level.

{
    "location": {
        "type": "object",
        "description": "A physical location in the form of a WGS84 coordinate, potentially with altitude in meters above mean sea level.",
        "required": [
            "latitude",
            "longitude"
        ],
        "properties": {
            "latitude": {
                "type": "string"
            },
            "longitude": {
                "type": "string"
            },
            "altitude": {
                "type": "string"
            }
        }
    }
}

One weird thing about this specification is that we are giving spatial coordinate values as strings, rather than numbers. This threw me for a loop, and I needed the choice explained to me, until I found this reference in the ATProtocol Data Model specification:

In short, de-serializing in to machine-native format, then later re-encoding, is not always consistent. This is definitely true for special values and corner-cases, but can even be true with "normal" float values on less-common architectures....

If you have a use-case where integers can not be substituted for floats, we recommend encoding the floats as strings or even bytes. This provides a safe default round-trip representation.

This is tough to argue with, but it means that the WG should clearly advise developers of this fact, so that they are aware that they need to (a) cast values and do bounds checking in their applications, as well as (b) perform appropriate precision truncation when outputting coordinates. Most applications will not need or want to specify a location to more than 6 or 7 decimal places, and should not emit more decimal places than are needed.
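A minimal sketch of both halves, assuming Python's `decimal` module and hypothetical helper names. Two choices here are mine rather than anything specified: the output is rounded (half-even) rather than strictly truncated, and it is always padded to a fixed six decimal places.

```python
from decimal import Decimal, InvalidOperation, ROUND_HALF_EVEN

# Valid ranges for each named axis, in decimal degrees.
LIMITS = {"latitude": Decimal(90), "longitude": Decimal(180)}


def read_coordinate(axis, text):
    """Cast a coordinate string to Decimal and check it against the axis range."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        raise ValueError(f"{axis} is not a decimal number: {text!r}") from None
    # is_finite() rules out NaN and infinities before the range comparison.
    if not value.is_finite() or abs(value) > LIMITS[axis]:
        raise ValueError(f"{axis} out of range: {text!r}")
    return value


def write_coordinate(value, places=6):
    """Serialize a Decimal coordinate with a fixed number of decimal places."""
    quantum = Decimal(1).scaleb(-places)  # for places=6 this is 0.000001
    rounded = value.quantize(quantum, rounding=ROUND_HALF_EVEN)
    if rounded.is_zero():
        rounded = rounded.copy_abs()  # avoid emitting "-0.000000"
    return format(rounded, "f")
```

With these, `write_coordinate(read_coordinate("latitude", "51.47687912345"))` yields `"51.476879"`. Staying in `Decimal` rather than `float` sidesteps the round-trip concern quoted above, at the cost of converting explicitly wherever floating-point math is actually needed.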

Shapes

Inevitably, app developers will want access to more complex features, such as the polygon shape describing the perimeter of a public park.

The absence of floating point values in the ATProto data model makes it a little challenging to come up with a sui generis way of representing more complex geometries. Fortunately, we don't have to -- there is good prior art that we should simply adopt: GeoJSON and OGC Well-Known Text. We can store these geometries in an ATProtocol record using a MIME-typed blob.

{
    "shape": {
        "type": "object",
        "description": "A physical geographic shape in WGS84 coordinates.",
        "required": [
            "geometry"
        ],
        "properties": {
            "geometry": {
                "type": "blob",
                "accept": [
                    "application/geo+json",
                    "application/vnd.geo+wkt"
                ]
            }
        }
    }
}

I don't know if we should specifically discourage the use of other formats, but both GeoJSON and WKT are well supported in the F/OSS ecosystem. Other formats might include GML or OGC Well-Known Binary.

One note: I couldn't find a standard MIME type for WKT. The original MIME type for GeoJSON was application/vnd.geo+json before it was adopted as an IETF standard, so I think it's reasonable to apply application/vnd.geo+wkt as an unambiguous MIME type identifier for now.
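To show what the blob payloads might contain, here is a small sketch that serializes one polygon both ways from points whose axes are referenced by name. The function names and the input shape are assumptions for illustration. Both GeoJSON and WKT are external standards that put longitude (X) before latitude (Y), so this is one of the places where axis order has to be applied explicitly.

```python
import json


def ring_from_points(points):
    """Build a closed ring of (longitude, latitude) pairs from named-axis points."""
    ring = [(point["longitude"], point["latitude"]) for point in points]
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # both formats expect the first position repeated last
    return ring


def polygon_to_geojson(points):
    """Serialize a single-ring polygon as GeoJSON bytes (application/geo+json)."""
    ring = ring_from_points(points)
    geometry = {"type": "Polygon", "coordinates": [[list(pair) for pair in ring]]}
    return json.dumps(geometry).encode("utf-8")


def polygon_to_wkt(points):
    """Serialize the same polygon as Well-Known Text bytes."""
    ring = ring_from_points(points)
    body = ", ".join(f"{longitude} {latitude}" for longitude, latitude in ring)
    return f"POLYGON (({body}))".encode("utf-8")
```

Given three corner points, both functions emit a four-position closed ring. The sketch does not check ring orientation or self-intersection, which a production writer probably should.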

Names

A name and its language, if known, should always go together.

{
    "name": {
        "type": "object",
        "description": "A geographic name, possibly in a given language.",
        "required": [
            "text"
        ],
        "properties": {
            "text": { "type": "string" },
            "language": { "type": "string", "format": "language" } 
        }
    }
}

Name text should always be given in UTF-8.

Places

Perhaps we can assert that the distinguishing thing about a place is that it is a feature that potentially has a name.

{
    "place": {
        "type": "object",
        "description": "A geographic place with optional location, names, and URI.",
        "properties": {
            "uri": {
                "type": "string",
                "format": "uri"
            },
            "location": { 
                "type": "union",
                "refs": [
                    "#location",
                    "#shape"
                ]
            },
            "names": {
                "type": "array",
                "items": {
                    "type": "ref",
                    "ref": "#name"
                }
            }
        }
    }
}

By including a URI, we provide the ability for a place record to reference some external, canonical representation of that place.

When emitting a names array with multiple entries, whatever the application regards as the "primary" name of the place should be the first one listed. Applications should be free to interpret the names array in whatever way is most appropriate to the app.
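One possible interpretation, sketched under assumptions: an app picks the first name whose language matches the user's preferences, and otherwise falls back to the first entry as the publisher's "primary" name. The function name is hypothetical, and comparing only the primary language subtag is a deliberate simplification of proper language-tag matching.

```python
def pick_name(names, preferred_languages):
    """Choose display text from a names array for a user's language preferences."""
    for wanted in preferred_languages:
        wanted_primary = wanted.lower().split("-")[0]
        for name in names:
            language = name.get("language")
            # Names without a language tag never match a preference.
            if language and language.lower().split("-")[0] == wanted_primary:
                return name["text"]
    # No language match: fall back to the first-listed ("primary") name.
    return names[0]["text"] if names else None
```

So for a names array listing "München" (de), "Munich" (en) and "Monaco di Baviera" (it), a user preferring `en-GB` would see "Munich", and a user preferring `ja` would see "München".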

Note that all three of these fields should be optional by design. It is possible to conceive of applications that might want to reference a private place solely by name, an anonymous place solely by its location, or to simply indicate that the place is unique but potentially described elsewhere.

The thing that is nice about describing a place as the triplet of (uri, location, name) is that it is resilient to changes over many years and even decades. The name of the place might change, or the reference gazetteer might go away, but you still have the original location as a reference. Or the location could be inaccurate, or even deliberately omitted, but you still have a name with which to locate the place. Or the name and the location could be missing or inaccurate, but you have an external reference to a record which might be kept up-to-date by someone else.

Devs should be discouraged from emitting places without at least one of these properties, but even that is fine, if you think about it.
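Here is a rough sketch of how a consumer might lean on that resilience. The function name, the preference order (resolvable URI, then location, then name), and the `fetch` callback are all assumptions for illustration; `fetch` stands in for whatever resolver an app uses and is expected to return None when the referenced record is gone.

```python
def best_reference(place, fetch=None):
    """Return (kind, value) for the most useful surviving part of a place."""
    uri = place.get("uri")
    if uri and fetch is not None:
        record = fetch(uri)  # hypothetical resolver; None means "not available"
        if record is not None:
            return ("uri", record)
    if place.get("location"):
        return ("location", place["location"])
    if place.get("names"):
        return ("name", place["names"][0]["text"])
    if uri:
        # We hold a reference but could not (or did not try to) resolve it.
        return ("unresolved_uri", uri)
    # A place with none of the three properties: discouraged, but still valid.
    return ("opaque", None)
```

If the reference gazetteer has disappeared, the call degrades to the stored location; if the location was deliberately omitted, it degrades to the name; and an empty place comes back as `("opaque", None)` rather than raising an error.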

(NB. The ATProtocol Lexicon spec is a little hazy on whether array items can be typed, but it seems appropriate here.)

Features

Virtually all applications will have attribute data to associate with a place. Sometimes this data will have a fixed schema - as with most traditional gazetteer databases - but sometimes the data schema will be arbitrary, if sourced from OpenStreetMap or from an application that collects user-generated data.

It seems like the standard way to do this in the ATmosphere is with a layer of indirection?

{
    "feature": {
        "type": "object",
        "description": "A place with additional attribute metadata",
        "properties": {
            "place": {
                "type": "ref",
                "ref": "#place"
            },
            "attributes": { 
                "type": "union",
                "refs": [
                    "com.atproto.repo.strongRef"
                ]
            }
        }
    }
}

This allows both gazetteer and application maintainers to tailor the data model of place attributes to their datasets and use cases.

Web services

Nick Gerakines included in his original WG proposal a conception of "gazetteer profiles" that I think is spot on. I think the catalog definitions could include an additional property that describes the feature attribute lexicons supported by the catalog.

I think we may need to workshop the community.lexicon.gazetteer.query parameters a bit. In particular, it may be helpful (or even necessary) to differentiate between gazetteer queries that search for places by name, by location, by feature type, or by some combination.

A lot of our web service design should rest on research into active developer use cases. We should publish and host reference gazetteers containing, e.g., Foursquare Open Source Places, Overture Maps, and OpenStreetMap-derived data.

Some data sets (OSM, Overture) require the redistribution of licensing terms. We should be prepared to enable gazetteer services to advertise any upstream licenses for data they provide.

Application development considerations

Null Island
Inverted coordinate sign
Coordinate formatting
Type casting
Output precision truncation
Geometry blob type handling
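
As a starting point for the first two items, here is a minimal sketch of a sanity check an app could run on incoming points. The function name and warning strings are assumptions. Note that a flipped sign generally cannot be detected from the numbers alone, so this sketch only flags Null Island, out-of-range values, and the one case where swapped axes are evident from the ranges.

```python
def coordinate_warnings(latitude, longitude):
    """Return a list of warnings for a (latitude, longitude) pair in decimal degrees."""
    warnings = []
    if latitude == 0 and longitude == 0:
        # (0, 0) is valid but is far more often a missing value than a real place.
        warnings.append("null island")
    if abs(latitude) > 90:
        if abs(latitude) <= 180 and abs(longitude) <= 90:
            # Latitude is impossible, but the pair is valid when exchanged.
            warnings.append("axes may be swapped")
        else:
            warnings.append("out of range")
    elif abs(longitude) > 180:
        warnings.append("out of range")
    return warnings
```

For example, `coordinate_warnings(151.2153, -33.8568)` reports that the axes may be swapped, while the same values in the right order produce no warnings. A swapped pair such as (10, 20) passes silently, which is one more argument for always referencing axes by name.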
