(Apologies for the delay in getting back to you - we've had Linux.conf.au here in Perth (http://www.youtube.com/user/Linuxconfau2014) and I've been on the road for work.)
So all 18 TB, as of 4 January, are in AWS S3. New files since then are kept on live storage (S3) for 90 days after ingestion and are then sent to archive (AWS Glacier). Pulling files back from archive involves a 3-5 hour delay. Files have been archived under the same UUID filenames as on Sibelius.d.o.
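The 90-day live-then-archive policy above maps directly onto an S3 lifecycle rule. A minimal sketch follows; the rule ID is hypothetical and the dict is shaped for boto3's put_bucket_lifecycle_configuration (the actual call is commented out since it needs AWS credentials):

```python
# Sketch of the 90-day S3 -> Glacier transition described above.
# The rule ID is a made-up name; "Days": 90 and the GLACIER storage
# class are the two pieces that encode the policy.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-after-90-days",   # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},        # apply to every object in the bucket
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# To apply it (requires AWS credentials, so not run here):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="snapshot-archive",             # hypothetical bucket name
#     LifecycleConfiguration=lifecycle_config,
# )
```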
Thanks to Delan (on CC) - a student here in Perth (W. Aus) who saw my presentation at Linux.conf.au - we have a simple web front-end to manage the restoration of files from archive. It works like this:
Clients hit a URL for a file on a management micro server. The server checks whether the requested file is on live storage, and
-- a) if it is live, redirects the browser to S3 to fetch it directly
-- b) if it is already being restored, lets clients know to come back later
-- c) if fewer than the daily maximum number of restores (100?) have been scheduled today, schedules a restore (the restored copy stays live for a month), and lets clients know to come back later
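The a/b/c decision above can be sketched as a small pure function. The state names, return values, and the cap of 100 are illustrative placeholders (the real cap is still an open question, per the "100?"):

```python
def restore_decision(file_state, restores_today, daily_cap=100):
    """Decide what the management server does for a requested file.

    file_state: "live", "restoring", or "archived" (illustrative labels)
    restores_today: restores already scheduled today
    daily_cap: assumed daily restore cap (the real limit is undecided)
    """
    if file_state == "live":
        return "redirect"            # (a) send the browser straight to S3
    if file_state == "restoring":
        return "come_back_later"     # (b) a restore is already in flight
    if restores_today < daily_cap:
        return "schedule_restore"    # (c) kick off a Glacier restore
    return "come_back_later"         # cap reached: try again tomorrow
```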
This is currently on http://22.214.171.124/ - but restores and fetches are not yet available from the Internet (watch this space).
Currently the files download under the same UUID names they were ingested with - I am contemplating going back and adding a Content-Disposition header to each, carrying the corresponding original filename. I'll have to query the Postgres database for this for each file.
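For the header itself, a sketch of building the Content-Disposition value once the original filename has been looked up in Postgres (the function name is mine; the filename* form follows RFC 6266 so non-ASCII names survive, with a plain ASCII fallback). S3 can carry this either as object metadata set at upload, or per-request via a response-content-disposition override on a presigned URL:

```python
from urllib.parse import quote

def content_disposition(original_name):
    """Map a UUID-named object back to its original filename for download.

    Emits both the legacy ASCII filename= parameter and the RFC 6266
    filename* parameter (UTF-8, percent-encoded) for non-ASCII names.
    """
    # ASCII-only fallback for old clients; non-ASCII chars become '?'
    ascii_fallback = original_name.encode("ascii", "replace").decode()
    return (
        'attachment; filename="%s"; ' % ascii_fallback.replace('"', "'")
        + "filename*=UTF-8''" + quote(original_name)
    )
```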
Peter, does this sound sensible - the restore logic, and the setting of a Content-Disposition header so clients can download directly from the AWS S3 bucket?
Peter, is S3 receiving a copy of all newly ingested files into snapshot? If not, who should set this up? Also, may I get read-only access to the live Postgres DB, so I can monitor files being ingested and check the real filenames that should be set (access from localhost on sibelius is fine)?
/Mobile:/ +61 422 166 708, /Email:/ james_AT_rcpt.to