Prefixes for extensions #36
Yeah, I think the essence of the idea to me is more to stop worrying about prefixes - like letting Planet do 'mcid' instead of 'planet:mcid' or 'planet_mcid'. This may just be a bad idea, but I'm just sorta wondering what we've really gained by having all the prefixes. If you have a data model and want to validate it with a few different extensions then you'll be making choices about what to validate it with. The chances of overlap seem small, and if there's a set of 'known' extensions then people introducing new extensions that might need to be compatible can tweak their names. It would essentially just be a 'looser' approach to the ecosystem - here's a set of attributes that mean these things, but they're not trying to define things for all time.

So like Planet would have 'mcid' defined at https://planet.github.io/fiboa/planet-fb-extension/v0.1.0/schema.yaml, and some other org can have 'mcid' (and maybe it means something different...) defined at https://company.com/fiboa/company-extension/v0.1.0/schema.yaml. But if they wanted to make the ecosystem more compatible then they could just establish a new community extension at https://fiboa.org/mcid-extension/v1.0.0/schema.yaml, and the field name would stay 'mcid' - it'd just use the community-built JSON Schema to validate.
And yeah, I think this is the other extreme of the approach, attempting to have the prefix have 'real' meaning. The original idea of the prefix in STAC was inspired by JSON-LD, with the intent to try to do just what you're saying @andyjenkinson - tie the prefixes to full URIs with well-known meaning in JSON-LD. I think one thing that threw it off is that the 'geo' representation in JSON-LD wasn't great, if I remember right, and very few tools had support for it. But it's probably worth taking another run at figuring out if we could fully support it - I agree that hitting 'FIBOA, JSON-LD and OGC Features API compatibility all at the same time' would be really great.
But for extensions, wouldn't you need one 'context' per extension? Unless you put all the extensions into a single 'fiboa' context? Like, if you don't put them all in a single context then there'd still be prefixes for all the ones that aren't the primary/default context? It also seems like when you map from JSON-LD to GeoParquet you'd need to bring the prefixes back in consistently, or else try to fully represent the URIs in Parquet.
The context is a property included at the root of each payload, the value of which is either a context object or a URL of one, rather like a hypermedia link. So it can be unique to each implementation/dataset, and can include terms from any number of extensions. The extensions would give example contexts that correspond to example payloads, but when you implement FIBOA you can either:
Bear in mind JSON-LD contexts can remap any property to a URI, not only expand a prefix. This would allow Planet to have whatever terms it wanted in its payloads; they wouldn't even have to match the name of the property in the FIBOA spec and don't have to contain any prefixes - the mapping to FIBOA would be done entirely by the context file. All the machine-readable stuff like validation, conversion etc. can use the processed JSON-LD representation, but the files look completely normal to users and 'FIBOA-unaware' software.

Regarding parquet: yes, either you define them as full URIs, or you carry forward the JSON-LD context mappings into the headers (and vice versa of course) so that they look 'normal' in any existing software that processes parquet data. Basically the headers would have to say "these are the property names, and these are their equivalent URIs".

The one thing I'm unsure about is the geo stuff. You may know there is a GeoJSON-LD context, but I have not looked at it in detail. I would not want to abandon GeoJSON for some other random representation of geometry in JSON-LD; it's about making a standard GeoJSON payload processable as JSON-LD. In fact that context is a good example of what I mean above about using the context object to essentially make JSON-LD look like completely normal JSON, unchanged from its original format. All it is is a context, which maps all the original GeoJSON schema items to URIs exactly as they are.

Personally I am not a fan of going to the other end of the scale and just allowing a free-for-all on names. I get that namespaces are annoying, but I can see clashes happening, especially for terms like "crop", and in particular it's useful to distinguish 'uncontrolled' terms. Personally I'm not sure it's necessary to make a Planet extension, as by definition there won't be any terms in common with anyone else. So long as FIBOA allows additional properties, just document your schema and anything proprietary doesn't need a prefix.
Then focus on trying to standardise things that seem common in a topic-specific, not vendor-specific, extension. Unless you adopt JSON-LD or something like it, someone's going to have to change their schema anyway.
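The context-remapping idea described above can be sketched in a few lines. This is a hypothetical illustration, not real fiboa tooling: the property names, context entries and URIs are made up, and a real implementation would use a JSON-LD processor rather than this simplified find-and-replace.

```python
# Illustrative sketch: a JSON-LD-style context maps vendor-native property
# names to shared term URIs, so the payload itself stays "normal" JSON.
# All names and URIs here are assumptions for the example.

def expand_properties(feature: dict, context: dict) -> dict:
    """Replace each property key with the URI its context entry points to.

    Keys without a context entry are kept unchanged, mirroring how
    'FIBOA-unaware' software would simply see plain JSON.
    """
    props = feature.get("properties", {})
    expanded = {context.get(k, k): v for k, v in props.items()}
    return {**feature, "properties": expanded}

# A Planet-style payload using its own native property names:
feature = {
    "type": "Feature",
    "properties": {"mcid": "abc-123", "area_ha": 1.5},
}

# The context does the mapping to term URIs; no prefixes in the payload.
context = {
    "mcid": "https://planet.github.io/fiboa/planet-fb-extension/v0.1.0/mcid",
    "area_ha": "https://fiboa.org/core/v1.0.0/area",
}

expanded = expand_properties(feature, context)
# expanded["properties"] now keys on full URIs instead of short names
```

The point of the sketch is that the mapping lives entirely outside the payload: delete the context and the file is still valid, ordinary GeoJSON.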
@m-mohr - why didn't you use a prefix on flik-extension?
Good question, it was my first one and more of an example. The fields in the original had no prefix; I guess I either forgot it or thought it's simpler, can't remember 😅
Cool. Yeah, both reasons to me point a bit to how it could be nicer to not have to think about them. Curious what you think about JSON-LD and contexts, and if you'd be up to dig into it a bit - like if there is a way to enable us to pass the 'schema' information through without including the prefix, all the way through geoparquet. I'm a bit less sure how much GeoParquet metadata should really handle - I wonder if there are any other examples of JSON-LD -> Parquet. And if it'll work when pulling a few different 'contexts' into one.
I'm travelling next week, but I can dig into it afterwards; it will likely take three weeks or so. It doesn't seem to solve the colon/quote issue though. I'm not sure whether we can solve that if the allowed set of characters for SQL names is A-Z, 0-9 and _. I worry a bit that without prefixes we'll end up with various extensions that use crop_id, and maybe even datasets that use no extension and have crop_id. If you want to merge them, what do you do with the field names? You can't do it, because the fields are defined differently. In STAC we see that many clients actually don't check the stac_extensions array and just use the fields, because they can be sure there are usually no conflicts. Clients would need to be developed with more care; they may even need to read all schemas. So I tend towards a prefix at least for "common"/debatable field names. If you have names like "flik" that are unlikely to conflict, I could see that we allow them without prefixes. We also sometimes do that in STAC. JSON-LD is an open question for now.
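The SQL naming constraint mentioned above (A-Z, 0-9 and _) can be checked mechanically. A minimal sketch, with an illustrative regex rather than any particular database's actual rules:

```python
# Sketch of the colon concern: prefixed names like 'planet:mcid' are not
# valid unquoted SQL identifiers under the A-Z / 0-9 / _ rule, while
# 'planet_mcid' is. The regex is a simplification of real SQL dialects.
import re

SQL_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_plain_sql_name(name: str) -> bool:
    """True if the name can be used unquoted under the simplified rule."""
    return SQL_IDENTIFIER.fullmatch(name) is not None

assert not is_plain_sql_name("planet:mcid")   # colon needs quoting in SQL
assert is_plain_sql_name("planet_mcid")       # fine unquoted
```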
It could solve the colon issue, because there won't be any colons in the payload any more, only in the context file, which only ever needs to be read when doing things like validating or converting. And they'll be in values, not keys. You can put whatever property names you like in your implementation; they don't have to be named the same as the 'standard' ones, because the context file provides a mapping. So the GeoJSON file literally looks like any normal JSON with one extra property '@context' that links to the context. Everything else can be your native implementation, and if you want that to be a translation of a SQL schema, have at it. Think of the context as a set of instructions for how to convert the GeoJSON Feature/FeatureCollection to a FIBOA JSON-LD schema. It's pretty much a glorified 'find and replace'. So for example in your JSON you could have:
And after mapping it would look something like:
Here I use an example where the FIBOA core specification would define the 'id' property as a mapping to the existing Dublin Core vocabulary term 'identifier', as well as its own unique properties. Meanwhile the FIBOA examples (and the format you'd use if you were creating a file from scratch) could look like:
And that would map to an identical RDF graph as the Planet example:
I've simplified the structure of all these, of course - I'm typing on my phone. The standard context file would contain all the mappings from the core and extensions. Here the only reason for the colon is the same reason we have it today: to allow independent development of extensions which might simultaneously use the same property name. But if this isn't important (e.g. each extension is allowed to basically claim a property name by using it first) it could be removed. Either way, each extension would have to provide a standard JSON-LD context that would translate the "nice plain JSON" to full messy URIs. The key to this is to hide the mechanics of JSON-LD as much as possible, to make creating and maintaining extensions and making compatible features as easy as possible.
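As a rough illustration of the "identical RDF graph" point above: two payloads with different local property names, each with its own context, can expand to the same URI-keyed representation. The Dublin Core 'identifier' URI is the real DCMI term; the property names and the simplified expand function are illustrative assumptions.

```python
# Two differently-named payloads, each with its own context, expand to
# the same URI-keyed form. The expand function is a toy stand-in for a
# real JSON-LD processor; property names here are hypothetical.

DCTERMS_ID = "http://purl.org/dc/terms/identifier"  # real Dublin Core term

def expand(props: dict, context: dict) -> dict:
    """Map each local property name to its URI via the context."""
    return {context.get(k, k): v for k, v in props.items()}

# A Planet-style payload with its own native name for the identifier:
planet_props = {"field_id": "f-42"}
planet_context = {"field_id": DCTERMS_ID}

# The FIBOA-example payload using the spec's short name:
fiboa_props = {"id": "f-42"}
fiboa_context = {"id": DCTERMS_ID}

# Different local names, same expanded (graph-equivalent) representation:
assert expand(planet_props, planet_context) == expand(fiboa_props, fiboa_context)
```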
Okay, I'm currently trying to wrap my head around this. I might have misunderstood parts of JSON-LD, please let me know if that's the case.

Note: Our extension mechanism doesn't require a colon by any means; it's a (pretty undocumented) best practice. We are used to it through STAC, and it's primarily used to easily distinguish the fields without looking at the list of implemented extensions. All fields can be unambiguously identified through their name + extension URI, but many STAC readers (except for the validators) actually do not use the stac_extensions array.

JSON-LD vs fiboa extensions

My current understanding of JSON-LD and our current extension mechanism in fiboa is that they are very similar conceptually. Both rely on clients actually resolving the declared context/extension URIs, which seems to be rarely done (in STAC and fiboa) because it makes implementations more complex. While we can implement reference implementations that do that, userland implementations have proven through STAC that people take the simplest route and don't actually check against the provided extensions whether their assumptions about a field are actually true. Let's say we have three extensions that all define a field with the same name…

For JSON there is LD tooling that could probably mitigate this, but for GeoParquet there's not. The question is whether people would use such tooling or whether they just use their normal JSON or Parquet reader. In that case they won't read the URIs (neither context nor fiboa extensions), and then you run into potential issues. For Parquet, you could prefix all fields by URI, e.g. …

Prefixes

I think JSON-LD (or fiboa extensions) without prefixes looks nice, but has quite a number of hurdles.

Merging files

The "merge" issue that I spoke about in another comment before can be resolved by implementing the merge tool in a way that, if conflicts arise, the full URI will be used as a prefix. So for example the following two files would be merged as follows:
Merged file has columns: …

Renaming (for compatibility)

The "rename" mechanism in JSON-LD looks good at first glance, as it allows using fiboa without changing the actual field names in an existing implementation, e.g. OGC API - Features. Users need good clients or a good understanding of LD to make the connection though. And what about the values? If you need to change the values, e.g. from area in meters to area in hectares, how do you express in JSON-LD that you need to divide by a certain value? I couldn't find anything about it, so I assume that's not a thing. So the rename covers just a small part of being compatible with an existing response, and as such I feel it's not worth the hassle.

Compatibility

The compatibility between fiboa, JSON-LD and OGC API - Features would indeed be nice. The field renaming could solve that partially, but I think in many cases it would just cover parts of it (see the area example above). So you may still not be compatible with specific implementations and content schemas, although you renamed everything nicely.

So it leaves us with JSON-LD vs fiboa. I'm not quite decided on this yet. I generally find it hard to navigate the JSON-LD vocabularies. Like, how can I verify whether there is a field-boundary-related vocabulary? There's GeoJSON-LD, but is there more? I'm not sure whether compatibility with JSON-LD would help us a lot. Is anyone asking for it? The fiboa extension mechanism is very similar. We'd need to define JSON-LD vocabularies in addition to the relatively simple fiboa Schemas; I'm not sure whether they could be generated from the fiboa Schemas. We could probably go for JSON-LD if we think it's worth it, but currently I'm not sure whether it's worth the effort. It might be worth it if we do more than field boundaries in the future, but I feel like field boundaries wouldn't benefit a lot from it unless we find existing LD vocabularies for it.
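The URI-as-prefix fallback for merge conflicts described above could look roughly like this. The column names, URIs and the '#' joining convention are all assumptions for illustration, not actual fiboa merge-tool behavior:

```python
# Hypothetical sketch: merge column sets from several datasets; when two
# datasets use the same column name under different extension URIs, fall
# back to URI-prefixed names so both definitions can coexist.

def merge_columns(datasets: list) -> dict:
    """datasets: list of {column name -> defining extension URI} dicts.
    Returns merged {(possibly URI-prefixed) column name -> URI}."""
    merged = {}
    for columns in datasets:
        for name, uri in columns.items():
            if name in merged and merged[name] != uri:
                # Conflict: same short name, different definition.
                # Disambiguate both occurrences with their full URIs.
                old_uri = merged.pop(name)
                merged[f"{old_uri}#{name}"] = old_uri
                merged[f"{uri}#{name}"] = uri
            else:
                merged[name] = uri
    return merged

a = {"crop_id": "https://fiboa.org/crop-ext/v1", "area": "https://fiboa.org/core"}
b = {"crop_id": "https://vendor.com/crop-ext/v1"}
merged = merge_columns([a, b])
# 'area' keeps its short name; the two conflicting 'crop_id' columns
# become URI-prefixed variants.
```

This keeps short names for the common case and only pays the ugly-URI cost where a genuine clash exists.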
I think it's advisable to some extent to try to "unlearn" some STAC idioms; just because it's something you worked on before doesn't mean that anyone expects it to work in the same way. People will intuitively understand that there is a need to define a group of unique terms for the properties of a GeoJSON feature/parquet object, and they will just be reading the docs explaining the terms constrained by the schema and what they mean, exactly like it works in all OGC specs.

Having said all that, it's for sure a useful observation from your experience of STAC that developers will tend to follow the path of least resistance and make assumptions wherever possible, and the examples you give of that are good ones. I think it would be a mistake not to make clear that uniqueness is a necessity between extensions, so I would not be in favour of a solution that is neither JSON-LD nor uses a prefix; but the beauty of JSON-LD is that you can make terms unique whilst not looking strange to humans who are reading plain JSON. For that reason of "the simple thing should just work", if you did it as JSON-LD (which I don't see as "vs" FIBOA btw, just the way FIBOA would be implemented) then I think the key is to get across that all of the terms defined in a FIBOA schema - whether core or extension - are unequivocally URIs first and foremost, and the JSON must be valid GeoJSON-LD. Those URIs are strings just like any other.

I should probably point out also that I think geoparquet is certainly a different story; there is no concept of converting between short property names and URIs in parquet like there is for JSON (i.e. JSON-LD). In that case what I would probably do is just always use the native URIs for geoparquet files - they're just unique strings after all, and as I mentioned above these would be the 'normative' definitions that the schema is constraining.
I think this is workable, especially as the use cases for geoparquet are typically more analytical than visual - nobody is reading geoparquet files like humans do with JSON, and it's also a more specialised technical community than JSON is touching upon, so the requirement of mapping from URIs to friendly quasi-readable property names just isn't as important, one may argue? I'm also not aware of anyone actively publishing geoparquet field boundaries today either, so the backwards compatibility requirement does not seem to exist there like it does for JSON.
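The "URIs are just unique strings" idea for geoparquet column names can be illustrated like this. The URI and the quoting helper are hypothetical, and real SQL engines differ in their identifier-quoting rules:

```python
# Sketch: use a full term URI directly as a column name. Column names in
# Parquet are arbitrary strings, so this works at the storage level, but
# any SQL access then needs quoted identifiers. URI is illustrative.

AREA = "https://fiboa.org/core/v1.0.0/area"

# Stand-in for a parquet column keyed by its term URI:
table = {AREA: [1.5, 2.0]}

def quoted(identifier: str) -> str:
    """Double-quote (and escape) a name so it is usable as an SQL identifier."""
    return '"' + identifier.replace('"', '""') + '"'

sql = f"SELECT {quoted(AREA)} FROM fields"
# Every query now carries the full URI, which is unambiguous but verbose.
```

This makes the trade-off concrete: globally unique names for free, at the cost of unwieldy identifiers in any hand-written SQL.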
Fair points, thank you. I have some additional questions and comments.
Not sure I follow: you mean we should use column names such as …?
This brings up the question of which encoding is the priority. Do we optimize for JSON or for tabular formats such as GeoParquet or flatgeobuf? The way the work is going right now, it seems tabular is the priority, and there we could end up with very weird behavior.
This looks really annoying and is something I'd want to avoid for sure.
I think I disagree with regards to the more specialised technical community. The reading part is true, but then you are writing these column names in SQL, for example - see the example above.
True, but on the other hand, for some tooling it's irrelevant which file format it reads. If the structure of the file changes, though, it makes a difference. Then again, we don't really consider that right now either; the structure also often changes with the current extension approach.
Me neither by the way, that was poorly phrased.
That's also the case with the current extension approach!
But if they don't resolve the context, then the names are different and the interoperability gets lost?! Isn't it annoying if people see different names everywhere and first need to verify via context what it is?
Doesn't that need a very specialised technical community? Why doesn't the developer need to know how it works?
Yes, I see how the property name thing could be solved, but that seems to be only a small part of the game.
Yeah, I'm aware from OGC work, and URNs are from hell...
Agreed, but that's independent of whether we adopt JSON-LD or not. In any case we should reuse existing vocabularies and not invent our own. Unfortunately, I haven't found a lot of related vocabularies or standards yet. I still have to look into ADAPT... For crop classification, we already start by adopting HCAT through an extension, but there are so many out there and none is commonly used, it seems.
How does it do that? I feel like it's not very obvious. You could also always just define your own. I don't really see yet how JSON-LD encourages or requires reuse more than anything else we have discussed or are using so far.
Results from the discussion yesterday:
With regards to extension prefixes, we keep it open. Although our extensions currently use the colon, we don't require it.
Sorry I lost track of this issue and didn't address your previous questions @m-mohr. Regarding geoparquet, what I was suggesting is that you do what JSON-LD does for JSON to convert strings into URIs: you simply define a mapping table (that's what the context object is, effectively). This means that in the data they are just normal strings, but if you really needed to convert to URIs (which only makes sense when you're doing something programmatic that depends on the data being Linked Data/RDF) then the converter pulls the 'context' property from the header and maps them. Essentially replicating what JSON-LD adds to JSON by doing the same thing for geoparquet. However to be honest this whole topic is to me not a very useful one - by definition JSON-LD only makes sense for JSON and I see no reason to replicate it inside a parquet file anyway. It's not a format you'd ever expect to combine directly with non-spatial semantic data so I don't see much value, and even if you did you can still use the JSON-LD context from the spec.
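A minimal sketch of the mapping table suggested above, stored in a parquet-style file header. The metadata key, property names and URIs are illustrative assumptions, not fields from any actual GeoParquet specification:

```python
# Sketch: columns keep short, SQL-friendly names; the file metadata
# carries a JSON-LD-style name -> URI mapping ("context") for the few
# tools that need Linked Data semantics. Key and URIs are hypothetical.
import json

# Stand-in for parquet column data:
columns = {"id": ["f-1", "f-2"], "area": [1.5, 2.0]}

# Stand-in for parquet key/value file metadata holding the context:
file_metadata = {
    "fiboa:context": json.dumps({
        "id": "http://purl.org/dc/terms/identifier",
        "area": "https://fiboa.org/core/v1.0.0/area",
    })
}

def column_uri(name: str, metadata: dict) -> str:
    """Resolve a short column name to its URI via the stored context.

    Unmapped names pass through unchanged, so ordinary readers that
    ignore the metadata still see normal column names.
    """
    context = json.loads(metadata["fiboa:context"])
    return context.get(name, name)
```

A converter would only consult this mapping when producing RDF or merging datasets; everyday parquet readers never need to look at it.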
The FIBOA core schema doesn't contain URIs; for example, 'id' is not a URI, it's just a string.
Because JSON-LD schemas are already expressed in RDF, like ontologies such as DCAT are. Instead of copy-pasting values from an external vocabulary under a different property name (effectively 'forking' it) and mentioning in the human-readable documentation that it's borrowed from another place, you simply link to it directly; then when you merge into a graph containing other data already mapped to that ontology, or use software that understands those ontology terms, they are natively integrated using exactly the same URI.

Perhaps to try to put a lid on this for now whilst providing a better summary for the future... basically, if you were to implement JSON-LD as the schema engine for FIBOA JSON, what I'm suggesting is you'd do it like this, preserving the benefits of the current implementation:
However this choice for FIBOA is entirely orthogonal - using JSON-LD guarantees globally unique terms anyway. Basically what you'd be doing is defining the FIBOA specification in such a way that FIBOA-compliant payloads are compatible with all of these at the same time:
Bear in mind that the purpose of suggesting JSON-LD in the first place is to enable point 9, e.g. Chris wanting to use a Planet format that is Planet-first, instead of having to rename all their properties "planet:prop1", "planet:prop2" etc. This is the main issue with namespacing - FIBOA right now is taking a "FIBOA is the centre of everything" approach, which works fine for its core academic community creating greenfield FIBOA-compliant files from scratch, but is problematic for existing data providers to whom FIBOA is more like a 'bridge': they need to fit their existing data into it.

What JSON-LD provides is a standardised, well-adopted mechanism (i.e. not re-inventing the wheel) for adding an explicit semantic schema on top of 'normal' JSON that happens to also be compatible with RDF and existing RDF-described ontologies and vocabularies. JSON-LD and RDF already exist as W3C specs with a huge arsenal of supporting code for semantic validation, inferencing, format conversion, data import etc., whereas FIBOA is basically just adding proprietary rules about what certain strings mean on top of GeoJSON. Functionally for FIBOA it provides a way for e.g. Planet to make their datasets FIBOA-compliant by adding one 'context' property to the JSON, whilst appearing to a human as identical as it always was and as it is described in Planet's documentation. It also happens to mean that, at a stroke, all FIBOA data will also be valid RDF with zero effort from developers, meaning you can import it into a graph database with zero code, linking the geospatial world with all the other properties associated with the field that are not part of the boundary itself.

I myself don't particularly care if you do or don't do it; we probably can't support FIBOA natively anyway, as its schema is too simplistic to accommodate the deduplication and merging of datasets we do, so it's only ever going to be an interchange format for us.
If we do it, it will only ever be an 'additional optional format' or maybe some sort of lossy converter. I just find myself in the position of posting about it because I have used it before, understand its relevance outside of the (sometimes insular) geospatial community, and can see an opportunity to re-use a lot of existing work.
From the fiboa Slack:
@cholmes wrote:
@m-mohr wrote:
@andyjenkinson wrote: