ISO standardisation

Written on 12/08/2009

I was in Singapore at the ACL Linguistic Annotation workshop last week and talked with Nancy Ide and Nicoletta Calzolari about integrating our discussions with the ISO standardisation effort. Both were very keen to see this go ahead and Nicoletta said that some funding for meetings (from FlareNet) might be possible if we were to make a proposal.

There is apparently a meeting of the ISO working group in November in the US where the Linguistic Annotation Framework (LAF) proposal will be discussed. Nancy Ide said that we would be welcome to contribute to that in some way. I could perhaps make this meeting but would need to find some funding somewhere.

As was pointed out at the meeting, there are also a number of meetings going on in Europe in coming months. We should see whether some representation at these is useful.

In the meantime, I would welcome a meeting via teleconference (access grid) to try to sort out some of the details of tiers etc.

The ACL SIGANN (Annotation Special Interest Group) has a Wiki at

http://nlp.cs.nyu.edu/wiki/corpuswg

There is significant overlap between our discussions and the activities there, although they are largely concerned with the annotation of textual materials.

Steve

Written on 15/08/2009
Thanks for this, Steve.

It seems to make sense to have a look at LAF and the proposed GrAF format and see what is "missing" from our point of view. E.g. GrAF defines a "nodeSet" element that can contain "nodes" (containing annotations, containing features etc). To what extent does "nodeSet" differ from what we think a tier/track/layer/"typed container" should be?

(As an alternative we could start from scratch and try to design a standard for exchange ourselves and then see how it relates to existing proposals. I prefer the former approach.)

November might seem far away, but it might still prove difficult to contribute anything substantial by that time.
A teleconference meeting is fine with me, but it might be better to prepare some sketches beforehand via mail or this forum.

Should we try to extend this group and invite other people who we know might be interested in this topic?

That's all for now,
Han

Written on 07/09/2009
Nancy just forwarded me a link to the latest spec:

http://www.tc37sc4.org/new_doc/iso_tc37_sc4_N463_rev00_wg1_wd_LAF.pdf

I'm not sure there is anything in there that would do the job of Tiers unless clever use of a type system could do it. Tier membership could just be another property of an annotation object but we'd need some way to write down which tiers were present without resorting to arbitrary meta-data. The spec does refer to 'layers' so there's clearly a need for something already, but layers don't have the same relations that tiers can have.
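To make the tier question concrete, here is a minimal sketch of the idea above: tier membership is an ordinary property of an annotation, but the document also carries an explicit declaration of which tiers exist and how they relate. The class names (`Tier`, `Annotation`, `Document`) and the single `parent` relation are assumptions made for illustration; nothing here comes from the LAF spec.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Tier:
    """A declared tier; `parent` records a dependency on another tier."""
    name: str
    parent: Optional[str] = None


@dataclass
class Annotation:
    """An annotation whose tier membership is just another property."""
    ident: str
    label: str
    tier: str


@dataclass
class Document:
    tiers: Dict[str, Tier] = field(default_factory=dict)
    annotations: List[Annotation] = field(default_factory=list)

    def declare_tier(self, name: str, parent: Optional[str] = None) -> None:
        # A tier may only depend on a tier that has already been declared.
        if parent is not None and parent not in self.tiers:
            raise ValueError(f"unknown parent tier: {parent}")
        self.tiers[name] = Tier(name, parent)

    def add(self, ann: Annotation) -> None:
        # Reject annotations that name a tier that was never declared.
        if ann.tier not in self.tiers:
            raise ValueError(f"undeclared tier: {ann.tier}")
        self.annotations.append(ann)

    def ancestors(self, name: str) -> List[str]:
        """Return the chain of parent tiers, nearest first."""
        chain: List[str] = []
        parent = self.tiers[name].parent
        while parent is not None:
            chain.append(parent)
            parent = self.tiers[parent].parent
        return chain
```

The point of the explicit declaration is that a reader of the file can list the tiers and their relations without scanning every annotation or interpreting free-form metadata.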

The other main failing of the spec is that it doesn't give a way to refer to a temporal signal, instead using this 'edge graph' pointing to spaces between words. I think that is easy to fix (just generalise the location attribute of spans) but we need to come up with something that everyone can agree with.
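One way to picture "generalising the location attribute of spans" is sketched below. It assumes a location carries a medium and a unit alongside its value, so the same span type can point between characters in a text or at offsets in a recording. The names `Location`, `Span`, `medium` and `unit`, and the example vocabularies, are hypothetical and not taken from the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """A position in some medium; `unit` says how to read `value`."""
    medium: str   # e.g. "text" or "audio" (hypothetical vocabulary)
    unit: str     # e.g. "char" or "seconds" (hypothetical vocabulary)
    value: float


@dataclass(frozen=True)
class Span:
    start: Location
    end: Location

    def __post_init__(self) -> None:
        # Both ends must be expressed in the same medium and unit.
        if (self.start.medium, self.start.unit) != (self.end.medium, self.end.unit):
            raise ValueError("start and end must share medium and unit")
        if self.end.value < self.start.value:
            raise ValueError("span ends before it starts")

    def extent(self) -> float:
        """Length of the span in its own unit."""
        return self.end.value - self.start.value


# The same Span type covers a stretch of text and a stretch of audio.
word = Span(Location("text", "char", 0), Location("text", "char", 5))
utterance = Span(Location("audio", "seconds", 0.5), Location("audio", "seconds", 1.25))
```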

Nancy also tells me that the ISO meeting (of the TC37 SC4 group) is in Boston on 3-6 November. I will try to go to this, but can anyone else make it? It would be good to have some ideas of what changes we might need before then.

Cheers,

Steve

Written on 15/09/2009
For the Europeans:

There are several calls for proposals for research networks. Would anyone be interested in applying?

A) Call for Proposals for 2011 ESF Research Networking Programmes

deadline: 22.10.2009

http://www.esf.org/activities/calls.html

B) "European Cooperation in the field of Scientific and Technical Research" (COST)

deadline: 25. September 2009 and 26. March 2010

http://www.cost.esf.org/opencall

Written on 15/09/2009
I've just uploaded some comments on the current LAF standard document along with a proposal for an object based model that's largely compatible with it.

http://corpho.mixxt.org/networks/files/download.7339

I've included a Tier object and an Anchor object for multi-modal annotation; they probably need to be refined before they will do what we need.

Comments encouraged.

Steve

Written on 23/09/2009
Hi, here are some preliminary comments on your model proposal:
- Graph
"annotates" attribute: I suppose this means that a single document can be referenced. It might be good to be able to reference more than one document.

- Node
I'm not quite sure what the "label" attribute is meant for. "The primary content" is what I would consider the actual annotation value, but this goes into the FeatureStructure, doesn't it? Is it such that if e.g. the "type" attribute is "Word", the "label" attribute could be something like "Noun" or "Verb"?

- Span, Region
As far as I can see, the LAF document also has a "span" object that extends "node" (7.1.3), although that isn't visible in the XML schema. In a "graf-1.0.6b.xsd" that I found somewhere (probably very unofficial) there is also a "region" element with an "anchors" attribute, so they might already be thinking in the direction you propose.
Span and Region being Nodes means that they also inherit "label", doesn't it? Would this attribute have any meaning for those elements? And do these elements need feature structures? Otherwise it might be better to have an Annotation object as subclass of Node as well and move "label" (and maybe also FeatureStructure) to Annotation.

- Edge
In this model an Edge has to be used to connect an (annotation) Node to a Span or Region, doesn't it?

- Type Registries
It is good to have type registries.
The advantage of having registries of agreed type names is that it helps to make types explicit. The downside is of course that it limits the use of type attributes. The old problem: if it is mandatory to take a value from the registry, it will be too constraining for many applications...
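The restructuring suggested under "Span, Region" and "Edge" above can be sketched as a small class hierarchy. This is only an illustration of the suggestion: `label` and the feature structure move from Node down to a separate Annotation subclass, so Span and Region no longer inherit them. The field names and the plain dictionary standing in for a FeatureStructure are assumptions, not part of the proposal document.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Common base: only what every node needs."""
    ident: str
    type: str


@dataclass
class Span(Node):
    """A stretch of the primary data; carries no label or features."""
    start: int = 0
    end: int = 0


@dataclass
class Region(Node):
    """An area defined by a list of anchor identifiers."""
    anchors: List[str] = field(default_factory=list)


@dataclass
class Annotation(Node):
    """The only kind of node that carries a label and features."""
    label: str = ""
    features: Dict[str, str] = field(default_factory=dict)  # stand-in for a FeatureStructure


@dataclass
class Edge:
    """Connects an annotation node to the Span or Region it describes."""
    source: Node
    target: Node


span = Span("s1", "Span", start=0, end=5)
noun = Annotation("n1", "Word", label="Noun", features={"lemma": "corpus"})
link = Edge(source=noun, target=span)
```

With this split, the question of what "label" means on a Span or Region disappears, at the cost of always needing an Edge between an annotation and the data it covers.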

Written on 31/01/2010
Folks, I'd like to revive this discussion. We just had an ISO meeting in Hong Kong where a bunch of things around the LAF standard and multimodal annotations were raised. I think the standard is now quite close to something that will work for us but would welcome more input from you all just to make sure something isn't missed.

The ISO standard is very close to being final and we need to get any changes sorted out in the next couple of months.

I've invited Nancy Ide and Keith Suderman (Vassar) to join the discussion here. I'll follow this message with a repost of a couple of emails we've exchanged. Meanwhile, to summarise the current position:

- The most recent version of the schema is at http://www.xces.org/ns/GrAF/0.99
- The main features of the GrAF format are stable; most discussion is about how nodes are linked to data
- GrAF currently uses a 'region' to represent a span or area in the source text; a region has a number of 'anchors', which are locations in the text but could be typed as pointers into a video signal.
- There is some discussion as to whether this model is appropriate, e.g. I've used 'anchors' attached directly to nodes in the graph
- Tiers can be modelled as node groups within the graph with some attached meta-data; we need to make sure that this works for interchange of multimodal annotations
- There's a big set of questions (in my view) about types in annotations. GrAF has a number of attributes called 'label' or 'type' but no guidance as to what these mean or how they should be used.

There are probably other things that I'm forgetting or that will come up when you look at the discussion. Please join in and help make this work.

Steve

Written on 31/01/2010
Forwarding a message sent earlier. The files mentioned are stored on this site: http://corpho.mixxt.org/networks/files/folder.2412 for reference.


Nancy, Keith, Laurent,
attached is an XSLT stylesheet to turn an ELAN annotation file into something like GrAF. I wanted to contribute this as a concrete example to help the discussion about the requirements for extending LAF to multimodal annotation.

About the only change I've made is to add an instant element, analogous to the region but only having a single location (I've avoided the word anchor). Some nodes are then linked to two instants as start and end rather than to regions. As we've been discussing, and as you'll see in the example, there is a requirement for some segments to ensure that the end point of one segment is identified with the start point of the next - that is, they're not just accidentally the same value, they are required to be the same. In ELAN this seems only to occur in time subdivision tiers, but other tools allow a more liberal use.
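The difference between "accidentally the same value" and "required to be the same" can be shown with a short sketch. It assumes instants are objects with their own identity that segments refer to, rather than numbers copied into each segment. The names `Instant`, `Segment` and `contiguous` are illustrative only and do not come from the stylesheet or the GrAF schema.

```python
from dataclasses import dataclass


@dataclass
class Instant:
    """A single point on the timeline with its own identity."""
    ident: str
    time: float  # seconds


@dataclass
class Segment:
    label: str
    start: Instant
    end: Instant


def contiguous(first: Segment, second: Segment) -> bool:
    """True only when both segments refer to the very same Instant."""
    return first.end is second.start


t0, t1, t2 = Instant("t0", 0.0), Instant("t1", 1.0), Instant("t2", 2.0)
a = Segment("a", t0, t1)
b = Segment("b", t1, t2)               # shares t1 with a
c = Segment("c", Instant("t1b", 1.0), t2)  # same value, different instant

# Moving the shared instant moves the boundary of both segments at once.
t1.time = 1.3
```

After the last line, `a` and `b` still meet at 1.3 seconds, while `c` still starts at 1.0: the shared reference is what carries the constraint.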

I've modelled tiers as nodeSets and added a feature structure in the node set to store tier metadata.

I've put everything in a single file since the multi-file model you use for ANC doesn't seem to be included in the GrAF schema or the standard document. In any case, I think it's clearer to have a single file with everything. However, there's no place for a header in the schema so I've added a container around the header and the graph element. There's more to go in the header but I've been concentrating on the data so far.

It would be really useful if we could have a public repository of the current schema, examples and standard document. I've suggested a Google Code project to Nancy, but any public Subversion space and wiki would do (Google gives us ticketing, which might be useful).

Hope this is useful.

Steve
The author has edited this post (on 01/02/2010)

Written on 02/02/2010
Hi Steve, thanks for all this. I hadn't seen these particular documents at xces.org before; it's good to have them. Here are some comments (and questions) on both your conversion example and the GrAF schema and documentation.

- you introduce an "instant" element (to avoid "anchor"), which I support. Do you suggest this alongside "region" or instead of it? You suggest letting the link element have optional "start", "end" and "to" attributes? Using the link element with a "to" attribute to link one node to another (instead of using edges) seems quite a significant change as well.
- I did a trivial conversion from EAF to GrAF once (then based on graf-1.0.6b.xsd) and used the nodeSet element as well to model a tier. Now, both in the schemas and in the documentation at xces.org the nodeSet and edgeSet seem to be gone (correct me if I'm wrong).
- introducing a feature structure element at the nodeSet level is probably also non-trivial. But even so, since feature structures can contain anything, successful exchange of tier metadata will entirely depend on (informal) agreement between developers on keys to use.
- (detail) the "as" element seems to have to contain "a" elements once again, that in turn contain the "fs" element(s)?

The main question is, as you already indicate, whether this is sufficient for successful interchange between tools. Especially if no "fs" is allowed on the nodeSet level and/or if there's no agreement on "allowable" keys (feature names).
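To illustrate why agreement on keys matters, here is a minimal sketch of how a receiving tool might check incoming tier metadata against an agreed list of feature names. The key names in `AGREED_TIER_KEYS` and the function `unknown_keys` are invented for the example; no such registry exists in GrAF as far as this thread establishes.

```python
from typing import Dict, List, Set

# Hypothetical set of feature names that tools have agreed to use for tiers.
AGREED_TIER_KEYS: Set[str] = {"tier_id", "parent_ref", "participant", "linguistic_type"}


def unknown_keys(metadata: Dict[str, str], agreed: Set[str] = AGREED_TIER_KEYS) -> List[str]:
    """Return the metadata keys that are not in the agreed set, sorted."""
    return sorted(key for key in metadata if key not in agreed)


# A tool that writes "speaker" where others expect "participant" is flagged.
incoming = {"tier_id": "words", "speaker": "A"}
problems = unknown_keys(incoming)
```

A check like this can only warn; whether unknown keys are rejected, ignored or passed through is exactly the policy question raised about type registries earlier in the thread.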

Han

Written on 03/02/2010
Hello everyone,

I have placed the latest working copy of GrAF schemas online at http://www.xces.org/ns/GrAF/0.99. This schema has been updated to reflect many of the changes discussed since Hong Kong. This schema is also guaranteed to change as it is still undergoing testing and there are likely still inconsistencies. However, I wanted to put something up ASAP for comments.

The biggest change, as Han noted, is that nodeSet and edgeSet are gone; they did not seem to be the most appropriate way to represent layers/tiers of annotations, as they grouped "things that can be annotated" rather than the annotations themselves. The new specification allows for two methods of grouping/layering annotations:

1. annotation sets : a set of annotations with possibly some meta-data (a feature structure) attached.
2. layer : a collection of annotation sets.

How the above are used by an application to implement what they call layers or tiers will be up to the application. For example, after looking at the ELAN documentation I would likely implement a "tier" as an annotation set.

There are some other minor changes to the XML format as well:
* 'as' and 'a' elements are now children of the 'graph' element rather than being nested under node/edge elements.
* 'a' elements are not required to be in an annotation set.
* the 'a' element now contains a @ref attribute to link it to the node/edge being annotated
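The layout described in the list above can be sketched with Python's standard `xml.etree.ElementTree`. The element names `graph`, `node`, `as`, `a` and `fs` and the `ref` attribute follow the description in this post; everything else (the `id`, `type` and `label` attributes, the `f` element with `name`/`value`) is an assumption made for illustration and has not been checked against the 0.99 schema.

```python
import xml.etree.ElementTree as ET
from typing import List


def build_graph() -> ET.Element:
    """Build a tiny graph with one annotation set acting as a 'tier'."""
    graph = ET.Element("graph")
    ET.SubElement(graph, "node", {"id": "n1"})
    ET.SubElement(graph, "node", {"id": "n2"})

    # Annotation set as a direct child of 'graph', with metadata in an 'fs'.
    aset = ET.SubElement(graph, "as", {"type": "words"})
    meta = ET.SubElement(aset, "fs")
    ET.SubElement(meta, "f", {"name": "participant", "value": "A"})
    for ref, label in [("n1", "hello"), ("n2", "world")]:
        # Each 'a' points at the node it annotates through 'ref'.
        ET.SubElement(aset, "a", {"label": label, "ref": ref})

    # An 'a' element is not required to be inside an annotation set.
    ET.SubElement(graph, "a", {"label": "comment", "ref": "n1"})
    return graph


def annotations_for(graph: ET.Element, node_id: str) -> List[str]:
    """Collect the labels of every 'a' element that refers to node_id."""
    return [a.get("label", "") for a in graph.iter("a") if a.get("ref") == node_id]


graph = build_graph()
xml_text = ET.tostring(graph, encoding="unicode")
```

Because annotations now point at nodes instead of being nested inside them, finding everything said about a node means scanning for matching `ref` values, as `annotations_for` does.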

Is the above sufficient for successful data interchange? Is there a better way to attach application specific meta-data?

Questions, comments, and suggestions are welcome.

Keith

Edit: I had included several examples with inline XML; however, the forum does not seem to like inline XML...
The author has edited this post (on 03/02/2010)

Network details

  • Network name

    Corpus Phonology
    Creating, searching, archiving and sharing spoken language corpora for phonological research

  • Your host is

    Ulrike Gut

  • Created on

    02/08/2009

  • Members

    178

  • Language

    English