Re: Fast shared or local storage? (Cedar McKay)
Beware of using NFS: it may not be POSIX compliant in ways that seem minor but have caused problems for HDF5 files. I don't know how the BLAST db files are structured or how their writes are organized, but this can be a problem in some circumstances.
I really like the suggestion of using the ephemeral storage. I suggest you create a plugin that copies the data from S3 to the drive on startup when you add a node. That should be simpler than the on-demand caching, which, although elegant, may take you some time to implement.
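For what it's worth, the copy step such a plugin would perform could look roughly like the sketch below. It assumes the AWS CLI (`aws s3 sync`) is installed on the node; the bucket name, prefix, destination path, and the function names are all made up for illustration, and the hook that calls this when a node is added depends on whatever cluster tool you use.

```python
import subprocess
from pathlib import Path


def build_sync_command(bucket, prefix, dest):
    """Return the argv that copies s3://bucket/prefix into dest via the AWS CLI."""
    source = "s3://{}/{}".format(bucket, prefix.strip("/"))
    return ["aws", "s3", "sync", source, str(dest)]


def stage_blast_db(bucket, prefix, dest, runner=subprocess.check_call):
    """Create dest on the ephemeral drive and sync the database files into it.

    `runner` is injectable so the function can be exercised without
    touching S3; by default it shells out and raises on a non-zero exit.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)  # ephemeral drives start empty
    cmd = build_sync_command(bucket, prefix, dest)
    runner(cmd)
    return cmd
```

A node-startup hook would then call something like `stage_blast_db("my-bucket", "blastdb/nt", "/mnt/blastdb")` before any jobs are scheduled on that node (names hypothetical).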
Thanks for the very useful reply. I think I'm going to go with the s3fs option and cache to local ephemeral drives. A big BLAST database is split into many parts, and I'm pretty sure that not every file in a BLAST db is read every time, so this way blasting can proceed immediately. The parts of the BLAST database are downloaded from S3 on demand and cached locally. If there were much writing, I'd probably be reluctant to use this approach, because the S3 eventual consistency model seems to require tolerance of write failures at the application level. I'll write my results to a shared NFS volume.
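To make the on-demand idea concrete, here is a minimal read-through cache sketch. It is not how s3fs implements its cache; it only illustrates the pattern of fetching a database part the first time it is requested and serving it from the local drive afterwards. The class name and the `fetch` callable are assumptions; a real `fetch` would wrap an S3 download.

```python
import os
import tempfile


class ReadThroughCache:
    """Serve files from a local directory, fetching each one on first use."""

    def __init__(self, cache_dir, fetch):
        # fetch(key, dest_path) is a hypothetical callable that downloads
        # the remote object `key` into the local file `dest_path`.
        self.cache_dir = cache_dir
        self.fetch = fetch
        os.makedirs(cache_dir, exist_ok=True)

    def path_for(self, key):
        """Return a local path for key, downloading it only if it is missing."""
        local = os.path.join(self.cache_dir, key.lstrip("/"))
        if not os.path.exists(local):
            os.makedirs(os.path.dirname(local), exist_ok=True)
            # Download to a temp file first so a failed or partial
            # transfer never looks like a valid cached file.
            fd, tmp = tempfile.mkstemp(dir=self.cache_dir)
            os.close(fd)
            try:
                self.fetch(key, tmp)
                os.replace(tmp, local)  # atomic rename within the same directory tree
            except BaseException:
                if os.path.exists(tmp):
                    os.remove(tmp)
                raise
        return local
```

Since the database parts are read-only, the cache never needs invalidating, which is what keeps this approach simple; results still go to a separate writable volume.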
I thought about mpiBLAST and will probably explore it, but I read some reports that its XML output isn't exactly the same as the official NCBI BLAST output and may break Biopython parsing. I haven't confirmed this, and will probably compare the two approaches.
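A first-pass comparison could be as simple as the sketch below, which uses only the standard library and checks whether two XML reports list the same hits per query. It assumes both outputs use the element names from the NCBI BLAST XML format (`Iteration`, `Iteration_query-def`, `Hit`, `Hit_id`); the function names are my own, and it does not test Biopython parsing itself, only whether the hit lists agree.

```python
import xml.etree.ElementTree as ET


def hits_by_query(xml_text):
    """Map each query definition to the ordered list of its hit IDs."""
    root = ET.fromstring(xml_text)
    result = {}
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def", default="")
        result[query] = [
            hit.findtext("Hit_id", default="") for hit in iteration.iter("Hit")
        ]
    return result


def diff_hits(xml_a, xml_b):
    """Return {query: (hits_a, hits_b)} for every query whose hit lists differ."""
    a, b = hits_by_query(xml_a), hits_by_query(xml_b)
    return {
        q: (a.get(q, []), b.get(q, []))
        for q in sorted(set(a) | set(b))
        if a.get(q, []) != b.get(q, [])
    }
```

An empty result from `diff_hits` means the two reports agree on hits and their order; any structural difference that trips a parser would still need to be checked separately.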
Received on Fri May 09 2014 - 13:49:20 EDT