Data grids: the eMinerals experience and further aspirations
 
This is the basis of a presentation to be given at a meeting on data management and data grids later this month. As a contribution to the discussion on data grids, this note gives a personal perspective on what eMinerals has been doing with data grids, and on what could follow into future developments.

I want to distinguish between the specific implementation of the SRB and what I will call the "SRB way of working", SRBwow. The SRBwow refers specifically to how the scientists use the concept of the SRB, and not to anything specific about the implementation of the SRB technology. Broadly it is clear that there are now grounds to be critical of the SRB implementation, but these should not necessarily follow through to criticisms of how scientists use the SRB, nor of how the availability of the SRB or SRB concept impacts on the working patterns of scientists.

My version of the history of eMinerals and data grids
From its outset, the eMinerals project aimed to experiment with grid networks, from both the technical and user perspectives. Our CCLRC partners offered us a portal-based foundation for both compute and data grid interactions. Our first attempts to build on this foundation were to experiment with the Condor and Globus middleware tools as we started to put together the embryo of the eMinerals minigrid. CCLRC had a major emphasis on its data portal, which was designed to federate distributed data archives. It would enable users to browse through various data collections via a web interface, and selected data sets would be packaged for download. At an early stage in the eMinerals project our CCLRC partners brought the SRB to the UK and enabled us to become amongst the first set of users. I believe that at one stage the intention had been to include something like the SRB's data management tools within the data portal, but with the porting of the SRB this became unnecessary.

With our first set of user experiments with the SRB, there was a set of informal discussions as to how it might be used. We had set up some trial vaults and experimented with the client tools. I remark here that it was a striking experience to find yourself working within two file structures at the same time; with ls you could list one set of files, and with Sls you could list another. Early on we had the vision that a virtual organisation would need to share data, and the SRB provided exactly the tool for that. At the time, the alternatives everyone used were ftp and http servers, which suffered from the serious disadvantage that scientists had to post their data as a separate task. I remark in passing that I did know about webdav at that time through my experience with Apple's iDisk tools; unfortunately Martin Keegan at NIEeS was very firm that NIEeS should not implement webdav due to some unspecified security issues. In the early days of putting together the eMinerals minigrid we faced the problem of how to transfer data to and from the minigrid. At the time, the Globus approach to file transfer was not ideal. For example, with GridFTP you had to know the names of the files (i.e. no wildcards were allowed), and you couldn't transfer directories. Although with hindsight we could have programmed around these issues, it seemed to us that the SRB provided a rather neat solution to the integration of data and compute grids. We had been posing the question of whether the SRB should be reserved only for final data sets or whether it should be used for all files during the progress of a study. By incorporating the SRB into the minigrid, we chose to use the SRB for all files, including files that might end up being completely useless.
In other words, the SRB provided a complete archive of the data generated within any study, and of course, the SRB created the instant possibility of sharing data with colleagues, and early on some of us started to use it for that purpose.

When we created the Lakes, we immediately decided to include an SRB vault on each cluster, together with vaults on Pond and at Reading. It should be remarked at this stage that this approach was an experiment. We could have simply created one very large vault at one site. In practice we didn't put a lot of thought into this decision. In effect, we felt that a distribution of vaults would be democratic and within the spirit of a distributed virtual organisation. Moreover, by creating a "distributed data grid" (noting the tautology of this term) we were following the experiment of working with grids.

It is worth noting at this point that the eMinerals SRB is possibly the longest-running production-level instance of the SRB in the UK (possibly in Europe), and possibly also the most active in production use. It is also worth noting that much of the progress achieved by the eMinerals project was due to our early success in implementing the SRB.

With the SRB, we created the possibility of sharing data with colleagues within the virtual organisation. What we then needed was a way of understanding the content of the data archive(s). The usual way then (and this is still how most people work) was simply to tell each other by email about files or data collections within the SRB. Admittedly we didn't actually do a lot of data sharing as a matter of course; typically the SRB was more commonly used for sharing codes, scripts, binaries and manuscripts. The idea had always been for the data portal to enable project members to locate partners' data. Some effort was put into developing a metadata model, and a metadata insertion tool was developed. This had a close interaction with the SRB. It was possible to make a data set available to colleagues through the metadata editor, and for colleagues to locate a data set by browsing through a set of menus. However, the lack of any uptake in usage of this tool suggested that it didn't match users' ways of working. In particular, I felt that the metadata tools were not appropriate for the way our scientists work (nothing was captured automatically, and the metadata was attached at a high level in a study rather than at a file level), and I also felt that users needed a search tool rather than a browsing tool. At this stage we were learning about tools such as Apple's Spotlight, and the ubiquitous Google. Thus Rik gave us the RCommands, which enabled us to capture metadata and to search on the metadata.
Not only do we now archive and share data through the SRB, but we use the SRB as the basis for post-processing of data. The development of tools such as TobysSRB and ccViz has meant that the SRB is ideal for management of data associated with post-simulation analysis. This is illustrated by my favourite anecdote, namely that when our third year project student Lucy was running simulations on our minigrid, she could point me to the place on the SRB where her analysis was stored and I could quickly look at graphs generated on the fly using ccViz.

What the SRB and the SRBwow have given us
In my view, our work using the SRB has changed (even revolutionised) a number of ways in which we are now working. These include:
Using the SRB has revolutionised how we share data with collaborators. It goes way beyond what is possible with other data sharing technologies (such as ftp, http etc). I have never seen data sharing made so easy.
Using the SRB has revolutionised how we manage data within a scientific study. In particular, we now automatically generate complete archives of the files associated with a single study (or part of a study, down to the data set and data object levels). I have never seen this done before, yet it is automatic and easy with the SRB. We now expect to archive complete sets of data associated with a single run or study as a matter of routine. The way we use my_condor_submit and the SRB means that the complete archive is automatically created.
The SRB has made transparent access to distributed file systems very easy, and made extensibility easy too. The addition of new vaults to extend the storage capacity is very easy.
The SRB has given a platform for development of what I call information delivery tools, such as TobysSRB and ccViz, and Rik's metadata tools.
The SRB has made data delivery and data management within a grid computing environment easy - and clearly without the SRB the eMinerals minigrid would not have been possible.
Although there are now criticisms of the SRB, when eMinerals started it was the only data grid tool on the market. It gave us a platform for experimenting. The five points above have emerged from the opportunities the SRB has given us. In whatever follows, I would like to see these significant positive features maintained and developed further.

What do I want from a data grid
Definitions
At the outset, let me deal with what I mean by a data grid and why eMinerals and MaterialsGrid might want one.
A generally accepted idea of a data grid is that it is a federation of multiple data sources with a single point of access. The point of access will provide capabilities for data discovery and data delivery. 
A particularly poor description is given within Wikipedia. Closer to home is the NERC data grid, which has features that accord with the working definition given above.
Within eMinerals, we use the SRB as our data grid tool. It provides a federation of several data sources, namely our SRB vaults, and a single point of access (via Scommands, MySRB, TobysSRB or InQ). What the SRB doesn't provide, at least in the way we use it, is the means for data discovery. To some extent the RCommands provide something of data discovery.
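As an illustration of the working definition above (several data sources behind a single point of access), here is a minimal sketch in Python. It is not how the SRB is implemented: the class name Federation, its methods and the use of plain directories as stand-ins for vaults are all assumptions made for the example.

```python
from pathlib import Path
from typing import Dict, List, Optional


class Federation:
    """Hypothetical single point of access over several directory-based sources."""

    def __init__(self) -> None:
        # Maps a source label (e.g. a vault name) to its root directory.
        self.sources: Dict[str, Path] = {}

    def add_source(self, label: str, root) -> None:
        """Register a new data source; existing logical names are unaffected."""
        self.sources[label] = Path(root)

    def remove_source(self, label: str) -> None:
        """Withdraw a data source; its files simply stop being visible."""
        self.sources.pop(label, None)

    def locate(self, logical_name: str) -> Optional[Path]:
        """Resolve a logical name without the caller naming the source."""
        for root in self.sources.values():
            candidate = root / logical_name
            if candidate.is_file():
                return candidate
        return None

    def listing(self, logical_dir: str = "") -> List[str]:
        """Merged, sorted listing of one logical directory across all sources."""
        names = set()
        for root in self.sources.values():
            directory = root / logical_dir
            if directory.is_dir():
                names.update(entry.name for entry in directory.iterdir())
        return sorted(names)
```

The point of the sketch is only that the user works with logical names; which source holds a file is resolved behind the interface.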
Some comments on the current eMinerals data grid
It is worth noting that in practice eMinerals doesn't need a federation of data sources. The current use of distributed SRB vaults is something we experimented with as a default option (see comments above). A single source of data would be just as good for a well-defined virtual organisation. However, see the fifth comment below. On the other hand it is quite likely that MaterialsGrid will want the capability of using multiple data grids.
We also are not federating pre-existing data sources. Our data grid consists entirely of data generated, or placed, within the lifetime of the project. Again, MaterialsGrid might need the ability to federate pre-existing data sources.
Data sharing between collaborators, in our case enabled via the SRB, is an absolute necessity. The SRB enables automatic data sharing and makes data sharing extremely easy.
Data discovery is something that will be extremely useful, but which is still rudimentary within our project. Rik has done some good work on the tools, but we need to learn how to use them better in my view.
With regard to my first note on the federation of multiple data sources, what a data grid could give (and the SRB does this) is the ability to add new data sources easily. In the SRB context, one simply adds a new vault.
In terms of hardware and software costs, academic projects never have the luxury of expensive data management solutions. The SRB has the key advantage of being free for academic users, and we would not have been able to implement it otherwise. Moreover, commodity hardware products are generally considered reliable enough for academic usage and are affordable. Commodity hardware products, however, may impose constraints on what is achievable in an ideal data grid (for example, the lack of ability to hot-swap commodity drives). There are two issues with cost. One is of course that academic projects don't have a lot of money (and note that research councils prefer to invest in people rather than equipment). The other is that in terms of data longevity, projects are usually time limited and thus long term data storage has to be effectively free.
What I want from a data grid
Based on the comments above, what I want from a data grid is the following:
A means to share data with no overhead on either those who provide or access the data. This needs to be automatic, in that data are uploaded to the data grid automatically from the grid job rather than through a subsequent and separate user operation.
A means to add additional data sources with minimal effort, and for data view across the multiple data sources to be completely transparent (ie one should not need to specify the source name in the file name).
Ideally, if one can add data sources, it should be possible for data sources to be removed as well, in a seamless manner. This would be required for cases when partners are making a data archive temporarily available.
A means for data discovery other than via a physical or logical file hierarchy. This has to be a long term goal, but searching tools are becoming ever more sophisticated (both for web and desktops).
The ability to see a comprehensible underlying file structure, and for users to be able to work with this underlying file structure.
A decoupling of the file structure and the data grid file management tools.
The ability to download files from the file discovery / data grid management tools.
Access control should be lightweight, but nevertheless there should be some access control. I would like to see data access constraints or permissions applying to both individuals and groups. Typically eMinerals doesn't worry too much about data access, but it will be an issue for MaterialsGrid. Changing access rights should be easy. I would envisage that access control could be enforced within the interface layer, not at the file level.
Lightweight! It should only do what we want it to do (namely store files with easy access and associated tools for metadata management and data discovery).
The ability to use more than one data management tool, with none of the data management tools forcing constraints on the running of the underlying file system.
With the SRB we have to download files in order to use them for data analysis. At the beginning of the eScience programme there was always talk about moving the analysis tool to the data. It would be useful to see whether this could be made possible. This may not be easy to implement, and if not it would be nice to at least make it appear as if this were so. I think that webdav does this already, if I understand webdav properly.
Should a data grid be symmetric with respect to reading, writing and editing? There are several times when it would be nice to be able to edit files in place, yet with too much freedom there is scope to cause havoc. The SRB is effectively write-once, no-edit, read-often, but with the ability to delete and overwrite. What we currently lack is the ability to tie post-writing activities into the RCommands metadata database. A datagrid should certainly have the means to couple file deletion and file overwriting to changes in the metadata. What I don't see how to easily achieve is to tie editing to the metadata, but that would be nice to have.
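To make the last point more concrete, here is a minimal sketch in Python of a store in which overwriting and deletion go through one interface that also updates the metadata. This is not the RCommands or the SRB: the class TrackedStore, its in-memory catalogue and the field names are all hypothetical. The checksum test is only one assumed way of noticing that a file has been edited outside the interface; it detects the edit after the fact rather than tying the edit to the metadata.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable


class TrackedStore:
    """Hypothetical file store whose write and delete operations update a catalogue."""

    def __init__(self, root) -> None:
        self.root = Path(root)
        # In-memory stand-in for a metadata database, keyed by logical name.
        self.catalogue: Dict[str, dict] = {}

    def write(self, name: str, data: bytes, keywords: Iterable[str]) -> None:
        """Write or overwrite a file and record the change in the catalogue."""
        path = self.root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        previous = self.catalogue.get(name)
        self.catalogue[name] = {
            "keywords": list(keywords),
            "sha256": hashlib.sha256(data).hexdigest(),
            # An overwrite bumps the version instead of silently replacing it.
            "version": previous["version"] + 1 if previous else 1,
        }

    def delete(self, name: str) -> None:
        """Delete a file and remove its catalogue entry in the same operation."""
        (self.root / name).unlink()
        del self.catalogue[name]

    def is_consistent(self, name: str) -> bool:
        """True if the file on disk still matches what the catalogue recorded."""
        entry = self.catalogue.get(name)
        path = self.root / name
        if entry is None or not path.exists():
            # Consistent only if the file and its entry are both absent.
            return entry is None and not path.exists()
        return hashlib.sha256(path.read_bytes()).hexdigest() == entry["sha256"]
```

A periodic pass over is_consistent would at least flag files that were edited in place, which is weaker than what is asked for above but cheap to do.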

Some ideas for a lightweight eMinerals/MaterialsGrid data grid
Lightweight is important. My guess is that this is best achieved by decoupling the data store from the data management tools. Thus we have two problems, one of setting up file systems with access from multiple external sources, and one of providing an interface that incorporates the data management tools.
It is becoming generally accepted that webdav is a good candidate technology for data sharing. My own personal experience with webdav is very positive.
To my mind, Rik's RCommands provide a good basis for data management tools. With the RCommands you can associate a file URL with all sorts of metadata, including metadata that could be used to replicate a file system if you wanted to work via that route. The challenge then is to maintain the metadata properly, and thus data insertion should be carried out using tools (such as MCS) that also add metadata. One can envisage an equivalent of Sput that simply writes a file and writes the location as metadata together with some category words.
I would like tools to interface to the metadata for browsing for files. Clearly it would be good to be able to identify files based on more than one keyword, but I think that how we develop this should be open to discussion.
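The Sput-like tool and the keyword lookup described above might look roughly like the following Python sketch. It does not use the RCommands or any SRB client: the function names, the table layout and the choice of SQLite as the metadata store are assumptions made purely for illustration.

```python
import shutil
import sqlite3
from pathlib import Path
from typing import Iterable, List


def _ensure_table(catalogue: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per (file URL, keyword) pair.
    catalogue.execute(
        "CREATE TABLE IF NOT EXISTS file_keywords (url TEXT, keyword TEXT)"
    )


def put_with_metadata(source, store_root, relative_dest: str,
                      keywords: Iterable[str],
                      catalogue: sqlite3.Connection) -> str:
    """Copy a file into the store and record its URL with some category words."""
    dest = Path(store_root) / relative_dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, dest)
    url = dest.resolve().as_uri()
    _ensure_table(catalogue)
    catalogue.executemany(
        "INSERT INTO file_keywords VALUES (?, ?)",
        [(url, keyword) for keyword in sorted(set(keywords))],
    )
    catalogue.commit()
    return url


def find_by_keywords(catalogue: sqlite3.Connection,
                     keywords: Iterable[str]) -> List[str]:
    """Return the URLs of files tagged with every one of the given keywords."""
    wanted = sorted(set(keywords))
    if not wanted:
        return []
    _ensure_table(catalogue)
    placeholders = ",".join("?" for _ in wanted)
    rows = catalogue.execute(
        f"SELECT url FROM file_keywords WHERE keyword IN ({placeholders}) "
        "GROUP BY url HAVING COUNT(DISTINCT keyword) = ? ORDER BY url",
        (*wanted, len(wanted)),
    )
    return [row[0] for row in rows]
```

Because the file is written to an ordinary file system and the metadata only holds its URL, the store and the management tool stay decoupled: losing the database loses the search, not the data.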
One thing that I think we could do with the metadata is to build a dynamic file system via the metadata. For example, suppose we have a file called OUTPUT that is part of a study characterised by keywords such as MARTIN, SILICA, 4096_TETRAHEDRA, TTAM_POTENTIAL, PRESSURE=0GPA, TEMPERATURE=500K. One could order these keywords into a hierarchy in various ways, e.g. MARTIN/SILICA/4096_TETRAHEDRA/TTAM_POTENTIAL/PRESSURE=0GPA/TEMPERATURE=500K or MARTIN/SILICA/TTAM_POTENTIAL/4096_TETRAHEDRA/TEMPERATURE=500K/PRESSURE=0GPA. In ordering the metadata keywords we have created a de facto logical file system. Although this may not be something everyone wants (not least perhaps because there may be a latency overhead due to searching through a database), it is something that I think could be extremely useful to many. Moreover, I think that it could be implemented reasonably easily through the RCommand framework.
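The idea of a logical file system built from ordered metadata can be sketched in a few lines of Python. This is only an illustration under assumptions: the metadata field names (owner, material and so on) are invented here, and a real version would query the metadata database rather than a list held in memory.

```python
from typing import Dict, List, Sequence, Tuple

Record = Tuple[Dict[str, str], str]  # (metadata, file name)


def logical_path(metadata: Dict[str, str], order: Sequence[str],
                 filename: str) -> str:
    """Lay out the metadata values in the requested order to form a path."""
    return "/".join([metadata[key] for key in order] + [filename])


def list_level(records: Sequence[Record], order: Sequence[str],
               prefix: Sequence[str]) -> List[str]:
    """List what would appear inside the logical directory given by prefix."""
    depth = len(prefix)
    entries = set()
    for metadata, filename in records:
        values = [metadata[key] for key in order]
        if values[:depth] == list(prefix):
            # Below the last keyword level we reach the files themselves.
            entries.add(values[depth] if depth < len(values) else filename)
    return sorted(entries)


# Hypothetical metadata for the OUTPUT file used as the example in the text.
record = {
    "owner": "MARTIN",
    "material": "SILICA",
    "size": "4096_TETRAHEDRA",
    "potential": "TTAM_POTENTIAL",
    "pressure": "PRESSURE=0GPA",
    "temperature": "TEMPERATURE=500K",
}

view_a = logical_path(
    record,
    ["owner", "material", "size", "potential", "pressure", "temperature"],
    "OUTPUT",
)
view_b = logical_path(
    record,
    ["owner", "material", "potential", "size", "temperature", "pressure"],
    "OUTPUT",
)
```

Here view_a and view_b are the two orderings given in the text, each ending in OUTPUT. The same record supports both because the hierarchy is computed from the metadata on demand and nothing on disk is moved.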

Some ways in which the SRB is not satisfactory for use in a data grid
Although the SRB has given the eMinerals project a tremendous boost, there are various intrinsic features associated with SRB-like data grid products that I would ideally not like to see in a production-level data grid. These include:
By distributing data across several vaults, we increase the number of points of failure. This is not a criticism of the SRB per se; we chose to implement the SRB in a widely distributed manner. However, in so doing we suffer problems when one vault goes down, because it is likely to contain data we need. Better use of replication would have helped, but the SRB by itself doesn't direct us in this direction.
Our usage of the SRB requires that the MCAT server is working, and too often this needs to be taken off-line for various database upgrades and for other reasons. Since we operate the SRB as part of an integrated compute/data grid, when the SRB is off the whole operation of the eMinerals minigrid is threatened. Jobs completing whilst the MCAT is off will simply die with a loss of data. Any data grid that is integrated into a compute grid must have near 100% availability.
The SRB stores files on the vaults using a set of naming conventions that are not at all transparent to people who wish to access the raw data. The reasons for this are clear. Files stored within the SRB are given logical names, and the SRB will support replicas within the same vault having the same name. However, I feel that access to the underlying file system with sensible file and directory names is essential for a number of reasons.
The SRB has a high overhead with the database. It requires dedicated staff time. Although tools on top of the data grid may require the use of a database, it is not desirable for the database to be the only means through which the data can be accessed if the requirement for a complex database is so costly.
Through its reliance on the MCAT server, access to the data is reliant on the continued provision of the MCAT server. Since the provision of the MCAT server carries a financial cost, long term access to the MCAT server is not guaranteed. Thus there is a problem with regard to long term access to the data. The fact that one cannot extract data directly from the vaults means that prior to the owner of the MCAT server removing access, we have to use SRB tools to extract all data. A data grid needs to ensure that longevity is built into it from the outset by removing reliance on different components.
The SRB doesn't easily enable federation of existing data sources; data sources need to be created as SRB vaults.
There are also a number of implementation issues, which are probably better left for others to fill in. These include issues with filenames, issues with database implementation and accordingly database access speeds, some inconsistencies with regard to interfaces and behaviour, and the difficulties in fixing some errors (our old friend 1107 springs to mind).
Monday, 30 October 2006