
Re: Debian snapshot.debian.org replica on the Amazon cloud?



On 5/12/2013 1:54 AM, Peter Palfrader wrote:
> I wonder if we can somehow, somewhere tag files that we got from non-US
> or archive.d.o (which also covers non-US) and no other tree.  Somebody
> would have to write code for that.

Do you have a record somewhere of where the files originated? We can add arbitrary metadata as headers to objects if we want to; that metadata gets served back as headers when you GET/HEAD the object (file).
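
For example, a rough boto sketch of setting such a tag and reading it back - the object key and header name here are made up purely for illustration:

  from boto.s3.connection import S3Connection
  from boto.s3.key import Key

  conn = S3Connection('<aws access key>', '<aws secret key>')
  bucket = conn.get_bucket('aws.snapshot.debian.org')

  # Metadata has to be set before the upload; it is stored as an
  # x-amz-meta-* header on the object.
  key = Key(bucket, 'pool/example_1.0.orig.tar.gz')    # made-up object key
  key.set_metadata('origin-tree', 'non-US')            # made-up header name
  key.set_contents_from_filename('example_1.0.orig.tar.gz')

  # ...and it comes back on every GET/HEAD of that object:
  key = bucket.get_key('pool/example_1.0.orig.tar.gz')
  print key.get_metadata('origin-tree')                # -> 'non-US'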

>>> I wonder if we can replicate to a running postgres instance.  If not, we
>>> might have to feed it individually, importing the dumps that the master
>>> produces.  Thoughts?
>> A dump from the current master would be a good start. What size are they
>> (is it the 2.1 GB file I saw in there)? Peter, would you like the
>> credentials for this DB (also in US-East right now)? If so, can you give
>> me an IPv4 you'll be accessing it from?
> I'm not sure I can make use of DB access right now, thanks.  When we
> still had a mirror at UBC, we used postgresql's DB replication feature
> to keep that mirror in sync.  Is that an option with this instance?

Not right now - the only replication supported at the moment is wholly within the AWS environment. Multi-AZ is the feature: synchronous block-level replication from a host in one cluster of data centers (an Availability Zone, or AZ) to a standby host in a second AZ.
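
For reference, a rough boto sketch of what turning that on looks like, assuming the DB ends up on RDS (the instance identifier below is made up):

  import boto.rds

  rds = boto.rds.connect_to_region('us-east-1')
  # Multi-AZ is just a flag on the instance; setting it provisions the
  # synchronous standby in a second AZ behind the scenes.
  rds.modify_dbinstance('snapshot-db', multi_az=True, apply_immediately=True)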

> OTOH, we may not necessarily need a DB at amazon.  It should certainly
> be possible to separate backend hosts from frontend from database hosts.
Absolutely - completely your choice.


See attached for an untested example to 'restore' a given file that has been archived; this is not a working example, just an initial sketch. The user credentials I sent you (offlist, naturally) for data ingest to S3 do not have access to call this restore at this time; we'll set up a separate user with access only to restore (and not ingest) that can be used in this script - but the concept is easy:
* We'll default to restoring a file to live storage for 14 days (after which the copy in live storage is automatically removed)
* We'll limit restores to 100 files per day from archive


Feel free to edit parameters as above - but hopefully it shows you how this plugs together.
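
Assuming you save the attached sketch as, say, restore.py (any name will do), restoring an archived object would then just be:

  python restore.py <object key of the archived file>

and it will report back whether the restore was scheduled.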

  James
(Going to bed now - it's 12:30am here at AWST+0800)
--
Mobile: +61 422 166 708, Email: james_AT_rcpt.to
#!/usr/bin/python
# vi: ft=python
# Sketch: restore a Glacier-archived object in the S3 bucket back to live
# storage, tracking a per-day restore count in SimpleDB.

import sys
import datetime
import boto.sdb
from boto.s3.connection import S3Connection

restore_time_days = 14    # days the restored copy stays in live storage
restores_per_day = 100    # daily limit on restore requests
bucket_name = "aws.snapshot.debian.org"
simpledb_table = "snapshot.debian.org"

s3_conn = S3Connection('<aws access key>', '<aws secret key>')
bucket = s3_conn.get_bucket(bucket_name)

sdb_conn = boto.sdb.connect_to_region('us-east-1',
                                      aws_access_key_id='<aws access key>',
                                      aws_secret_access_key='<aws secret key>')
sdb_conn.create_domain(simpledb_table)


def check_file_is_archived(file_name):
  # Keys in a bucket listing carry their storage class, so look the object
  # up in a listing and see whether it has been archived to Glacier.
  for key in bucket.list(prefix=file_name):
    if key.name == file_name:
      return key.storage_class == "GLACIER"
  return False

def check_file_already_being_restored(file_name):
  # A HEAD on the object exposes any restore that is already in progress.
  return bool(bucket.get_key(file_name).ongoing_restore)

def check_daily_restores():
  # True if we have not yet hit restores_per_day for today's date.
  today = datetime.date.today()
  sdb_domain = sdb_conn.get_domain(simpledb_table)
  item = sdb_domain.get_item(str(today.day))
  if item is None or item.get('date') != today.isoformat():
    return True
  return int(item['count']) < restores_per_day

def restore_file(file_name, days):
  # Ask S3 to restore the archived object to live storage for 'days' days.
  key = bucket.get_key(file_name)
  key.restore(days=days)

def update_daily_restores():
  # Bump today's restore counter in SimpleDB, resetting it on a new date.
  today = datetime.date.today()
  sdb_domain = sdb_conn.get_domain(simpledb_table)
  item = sdb_domain.get_item(str(today.day))
  if item is None or item.get('date') != today.isoformat():
    count = 1
  else:
    count = int(item['count']) + 1
  sdb_domain.put_attributes(str(today.day),
                            {'date': today.isoformat(), 'count': str(count)})

def process(file_name):
  if not check_file_is_archived(file_name):
    return "File is not archived."
  elif check_file_already_being_restored(file_name):
    return "File is already being restored - please wait 3 - 5 hours from the initial restore"
  elif not check_daily_restores():
    return "Too many restores have been done today - come back tomorrow"
  else:
    restore_file(file_name, restore_time_days)
    update_daily_restores()
    return "Your file has been scheduled for restore - please try and access it in 3 - 5 hours"

def main():
  # If this is web accessible, probably drop some HTML in here...
  # Expects the object key of the file to restore as the only argument.
  print process(sys.argv[1])

if __name__ == '__main__':
  main()
