<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[null]]></title><description><![CDATA[null]]></description><link>https://lambda.mu/</link><image><url>https://lambda.mu/favicon.png</url><title>null</title><link>https://lambda.mu/</link></image><generator>Ghost 3.5</generator><lastBuildDate>Thu, 04 Sep 2025 06:10:29 GMT</lastBuildDate><atom:link href="https://lambda.mu/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[TCP over Anycast: Your Options]]></title><description><![CDATA[<p>Previously I gave <a href="https://lambda.mu/anycast-tcp/">some background</a> on TCP over anycast, discussing the motivations and some possible challenges; now I'd like to talk about implementations. As a quick reminder, the situation we have looks like the diagram below and we are looking to gain redundancy/availability in the load balancing layer such</p>]]></description><link>https://lambda.mu/anycast-tcp-implementations/</link><guid isPermaLink="false">61f61493d7dc8307ec6d13d3</guid><dc:creator><![CDATA[max]]></dc:creator><pubDate>Sat, 19 Feb 2022 09:52:41 GMT</pubDate><content:encoded><![CDATA[<p>Previously I gave <a href="https://lambda.mu/anycast-tcp/">some background</a> on TCP over anycast, discussing the motivations and some possible challenges; now I'd like to talk about implementations. 
As a quick reminder, the situation we have looks like the diagram below and we are looking to gain redundancy/availability in the load balancing layer such that traffic for a single connection can arrive at any endpoint/load-balancer and still be forwarded to the backend that initially handled the connection.</p><figure class="kg-card kg-image-card"><img src="https://lambda.mu/content/images/2022/02/test.drawio.png" class="kg-image"></figure><h2 id="single-node-failover">Single-node Failover</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lambda.mu/content/images/2022/02/Single-node-Failover.drawio.png" class="kg-image"><figcaption>Only one load-balancer in this cluster can service a connection at a time</figcaption></figure><p>This is probably the most common approach taken and may seem like the simplest. We discussed before how allowing multiple endpoints to advertise an address leads to issues if traffic for a single connection arrives at different endpoints. A seemingly easy way to address this is to only ever advertise from a single endpoint and fail over to a passive standby server as needed. There are some additional things that we should consider when choosing this option though.</p><p>One very obvious drawback is that we are always sitting on substantial (50%!) idle capacity with our failover nodes. Another is that we are only able to scale up and not scale out; this may or may not be a problem depending on your requirements. Somewhat related is that we will need to figure out how to distribute the services we are exposing across our pairs of load balancers.</p><p>Finally, if you've ever worked with distributed systems you know that reliably and accurately detecting failures is non-trivial, and thus so is automatically failing over. 
A well-known VIP failover mechanism is <a href="https://en.wikipedia.org/wiki/Virtual_Router_Redundancy_Protocol">VRRP</a> and there are many similar proprietary options available. All of them require the clustered load-balancers to be specially configured on a single LAN and – at least to me – seem somewhat complex. From where I sit, if I can avoid the complexities of a setup that relies on failover I would prefer to do so.</p><h2 id="ha-with-fully-shared-connection-state">HA with fully-shared connection state</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lambda.mu/content/images/2022/02/Active-Active.drawio.png" class="kg-image"><figcaption>Any of the load-balancers in this cluster can service any connection and they share state between themselves</figcaption></figure><p>This high-availability setup mitigates many of the drawbacks of the single-node implementation. The basic idea here is that all the load balancers are sharing a single global table of connections. When a load-balancer selects a backend for a connection, it tells all the other load-balancers the connection details and the backend that was selected so they can store the information. Losing a load-balancer here is no big deal because all are equally capable of handling traffic from any of the connections they are collectively sharing. </p><p>This kind of setup introduces some operational complexity though because now nodes in a cluster either need to know about each other or be on the same LAN (utilizing broadcast for sharing and discovery). The latter is popular among open source solutions. This is not necessarily an impossible barrier to overcome, but it likely will make your setup more static.</p><p>There's also additional resource overhead as a result of state sharing. For a load-balancer cluster handling 100k connections per second the compute and network overhead of sharing connection state is non-negligible. 
If we very conservatively assume 8 (address information) + 4 (port information) + 8 (IP packet carrying the information from one host to another) bytes per TCP connection will need to be sent over our connection-sharing protocol, that's 20 bytes × 100,000 connections = 2 megabytes – and 100,000 discrete updates – per second that every peer must both send and process, for control-plane traffic alone.</p><p>A good question is whether we can make our cluster shared-stateless to avoid these operational challenges and still get the benefits of HA. </p><h2 id="a-brief-digression-on-selecting-a-backend-and-connection-state">A brief digression on selecting a backend and connection state</h2><p>Initially we need to choose one of N possible backends to service a new connection and then make sure that all traffic in that connection also gets sent to that same backend. Ideally we are also choosing backends in a balanced manner such that they all handle roughly the same number of connections.</p><p>We could choose a backend at random without too much fuss, but this introduces some complexity because we then need to remember – for additional packets in that connection – what backend was initially chosen. Basically we just need a table to store all of our connections with their chosen backend in. But as discussed above, this teeny bit of complexity begins to percolate through the rest of our system. Not only will we need to remember locally which backend was selected, but we'll also have to tell all the other load balancers which backend we chose, because if packets for that connection arrive <em>there</em>, then that load balancer is also going to need to choose the same one. </p><p>To avoid this complexity, we look for a way to deterministically route packets in a connection so that we don't have to do this sharing and all of our load balancers can <strong>independently determine</strong> where a packet should go without talking to each other. 
For example, we can assign a backend to a "bucket" and do a hash of something that is common to every packet in a connection – like the tuple of <code>(source IP, source port, IP protocol number, destination IP, destination port)</code> aka the 5-tuple – and map the range of hash outputs into discrete buckets. And here it seems we have escaped the need to keep track of – and thus share – connections if all endpoints are using the same hashing algorithm with the same set of backends. However with naïve hashing this strategy falls apart when backends are added or removed. If we add or remove buckets, the designated bucket for any given connection is very likely to change.</p><p>A solution is to use <a href="https://en.wikipedia.org/wiki/Rendezvous_hashing">rendezvous hashing</a> or <a href="https://en.wikipedia.org/wiki/Consistent_hashing">consistent hashing</a>, both of which are algorithms that minimize the reshuffling of hashed objects when the number of buckets changes. </p><p>Which leads us to, drumroll please...</p><h2 id="ha-without-shared-state">HA without shared state</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lambda.mu/content/images/2022/02/Shared-stateless.drawio.png" class="kg-image"><figcaption>Any load-balancer can service any connection and each operates completely independently from the others.</figcaption></figure><p>In 2016, Google<a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44824.pdf"> published a paper</a> detailing the architecture of their internal load-balancer called Maglev. In it they describe a system much like the one shown above, based on load-balancers that don't share state. Github has published the source of their similar system, <a href="https://github.com/github/glb-director">glb-director</a>. 
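</p><p>To make this concrete, here is a minimal rendezvous-hashing sketch (illustrative Python only, not code from Maglev or glb-director): every load-balancer scores each backend against a connection's 5-tuple and picks the top scorer, so independent load-balancers agree without exchanging any state, and removing a backend only remaps the connections that were assigned to it.</p>

```python
import hashlib

# Minimal rendezvous (highest-random-weight) hashing sketch.
# Illustrative only -- not the actual algorithm from any system named here.
def pick_backend(five_tuple, backends):
    """Deterministically map a connection's 5-tuple to one backend."""
    def score(backend):
        # Hash the 5-tuple together with the backend's identity.
        return hashlib.sha256((repr(five_tuple) + backend).encode()).digest()
    return max(backends, key=score)

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
# (source IP, source port, IP protocol number, destination IP, destination port)
conn = ("198.51.100.7", 41234, 6, "203.0.113.10", 443)

chosen = pick_backend(conn, backends)
# Any load-balancer with the same backend list picks the same backend...
assert pick_backend(conn, backends) == chosen
# ...and removing a *different* backend never remaps this connection.
survivors = [b for b in backends if b != chosen]
assert pick_backend(conn, [b for b in backends if b != survivors[0]]) == chosen
```

<p>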
And there are <a href="https://github.com/facebookincubator/katran">others!</a></p><p>By utilizing variations on rendezvous hashing, these systems avoid the complexity of keeping global state while still allowing for a scale-out model. </p><p>One issue I've personally found investigating these solutions is that the source is unavailable (Maglev) or they are built on tools like DPDK, which bypasses the kernel and may require specific hardware (glb-director). But it turns out you can build a system like this just using well-understood Linux components like <a href="https://en.wikipedia.org/wiki/Linux_Virtual_Server">LVS</a> and netfilter (iptables/nftables) that are already implemented in-kernel. And very fortuitously, Maglev hashing was added to LVS in ~2018!</p><p>Talk is cheap though, so I made a <a href="https://github.com/maxstr/anycast_tcp_lab">docker-based lab</a> for investigating these properties. There's still a lot to be done here – I'd like to set up a test harness and do some additional tuning for faster convergence – but the basic setup works! Many thanks to Vincent Bernat for his <a href="https://vincent.bernat.ch/en/blog/2018-multi-tier-loadbalancer">excellent blog post</a> detailing this architecture. </p><p>The next step for me is to make a test harness for this system to prove resiliency and then write everything up, stay tuned!</p>]]></content:encoded></item><item><title><![CDATA[Background on TCP over anycast]]></title><description><![CDATA[<p>People will often ask me at parties, "how can you possibly make stateful connections work with anycast addressing?"<br><br>I'm so glad you asked! If the issue here isn't immediately apparent to you, no worries, we'll dig in. 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lambda.mu/content/images/2022/01/Anycast-BM.png" class="kg-image"><figcaption>By Easyas12c~commonswiki - Wikimedia Commons, Public Domain, https://en.wikipedia.org/w/</figcaption></figure>]]></description><link>https://lambda.mu/anycast-tcp/</link><guid isPermaLink="false">61efda4ed7dc8307ec6d11fc</guid><dc:creator><![CDATA[max]]></dc:creator><pubDate>Sat, 29 Jan 2022 09:07:14 GMT</pubDate><content:encoded><![CDATA[<p>People will often ask me at parties, "how can you possibly make stateful connections work with anycast addressing?"<br><br>I'm so glad you asked! If the issue here isn't immediately apparent to you, no worries, we'll dig in. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lambda.mu/content/images/2022/01/Anycast-BM.png" class="kg-image"><figcaption>By Easyas12c~commonswiki - Wikimedia Commons, Public Domain, https://en.wikipedia.org/w/index.php?curid=53850281</figcaption></figure><p>The above diagram is a basic representation of anycast addressing, where the node in red wants to talk to a single address which can be routed to any of the green nodes. A really common use case for this is DNS; most folks are aware that the 8.8.8.8 or 8.8.4.4 Google DNS servers are not a single endpoint, but in fact many.<br><br>And here's proof:</p><!--kg-card-begin: markdown--><p>Talking to 8.8.8.8 from my laptop in Hong Kong takes 4.66ms</p>
<pre><code>~ ping 8.8.8.8
64 bytes from 8.8.8.8: icmp_seq=4 ttl=116 time=4.660 ms
</code></pre>
<p>and takes just under 2ms from a VM I have in New York.</p>
<pre><code>root@ubuntu-vpn-nyc:~# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=118 time=1.95 ms
</code></pre>
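<p>Back-of-the-envelope: light covers at most ~300km per millisecond, and a ping's round-trip time covers the distance twice, so the RTT alone puts a hard ceiling on how far away the replying server can be (a rough sketch, ignoring that real fiber paths are slower and less direct):</p>

```python
# Upper bound on server distance implied by a ping round-trip time.
# 300 km/ms is the speed of light in vacuum, so this bound is generous.
def max_distance_km(rtt_ms, km_per_ms=300):
    return (rtt_ms / 2) * km_per_ms  # one-way travel time * speed

print(max_distance_km(1.95))  # New York VM: at most ~292 km
print(max_distance_km(4.66))  # Hong Kong laptop: at most ~700 km
```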
<!--kg-card-end: markdown--><p>This means the server replying to my VM can be at most ~300km away (half of a ~2ms round trip, even at the speed of light), while the server replying to me in Hong Kong can be at most ~700km away. I'd challenge you to find me the datacenter that fulfills both of these requirements!</p><p>Okay, so 8.8.8.8 is at least two separate servers, but admittedly this is not that interesting. Because DNS is just a single UDP request and reply, it makes total sense that our request can be served by any endpoint and then return to our public-facing IP.<br><br>If we are speaking TCP though, things are not quite so simple. Imagine a TCP handshake that went like this:</p><!--kg-card-begin: markdown--><p>me to 8.8.8.8, landing at <strong>endpoint 1</strong>: <code>SYN! I'd like to make a TCP connection!</code><br>
<strong>server at endpoint 1</strong> to me: <code>SYNACK!</code><br>
me to 8.8.8.8, landing at <strong>endpoint 2</strong>: <code>ACK</code><br>
<strong>server at endpoint 2</strong>: <code>????</code></p>
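<p>We can sketch the failure in a few lines of Python (a toy simulation, not real TCP): each endpoint keeps its own connection table, so the ACK that lands at the second endpoint matches nothing there.</p>

```python
# Toy model: two independent anycast endpoints, each with its own
# connection table keyed by the client's (IP, port).
endpoint1_conns = {}
endpoint2_conns = {}

client = ("203.0.113.5", 54321)

# The SYN lands at endpoint 1, which creates half-open connection state.
endpoint1_conns[client] = "SYN_RECEIVED"

# The ACK lands at endpoint 2, which has never heard of this client.
print(endpoint2_conns.get(client))  # None -- endpoint 2 would answer with a RST
```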
<!--kg-card-end: markdown--><p>How awkward, right? Unfortunately, no magic in the internet guarantees that our packets always wind up at the same endpoint if there are multiple endpoints for a single address. And the Internet Protocol never promised us anything like this; all it cares about are addresses. If we decide to say we have one address available at multiple endpoints, then <em>we</em> are going to be responsible for handling packets for a given connection arriving at any of those endpoints.</p><blockquote>In reality, for short-lived TCP connections, it <strong>is</strong> likely to be the case that all packets in your connection take the same path simply due to the fact that packets are generally routed on the shortest path to their destination. And even though routes <a href="https://bgpstream.com/">change all the time</a> on the internet, they are unlikely to change on the order of seconds. </blockquote><h3 id="why-do-we-want-this">Why do we want this?</h3><p>It's worth asking what benefit we get from an IP being able to route to multiple endpoints. In general, I think we understand that there can only ever be a single backend handling a connection. If a piano, cartoon-style, falls through the sky onto the rack holding the server that's processing a request there's very little we can do. But let's look at a typical setup for serving web requests.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lambda.mu/content/images/2022/01/1280px-Reverse_proxy_h2g2bob.svg.png" class="kg-image"><figcaption><a href="https://commons.wikimedia.org/wiki/File:Reverse_proxy_h2g2bob.svg">H2g2bob, CC0, via Wikimedia Commons</a></figcaption></figure><p>Generally, the server handling the request is not exposed directly to the internet and a reverse proxy layer sits in front, providing load-balancing, caching, TLS termination, etc. This proxy layer is not just sitting in front of a single service, but several. 
It's generally specialized: it likely lives in a DMZ network zone and may have fancy hardware specifically for performing load-balancing. <strong>The redundancy we are looking to gain is in this proxy layer. </strong></p><p>The number of proxy servers is &lt;&lt;&lt; the number of total application instances, meaning in a naïve implementation losing a single proxy server results in large disruption to connections across many different applications. It also means that losing a proxy server is more likely than losing any given application server.  If we assign the IP(s) we're designating to an application to just a single proxy server (because remember, we don't have anycast), we also have to handle failing over that IP to a new instance. If we don't, in addition to losing existing connections, we are also blocking new connections from being able to be made.</p><p>So let's say we've got something like the following requirements:</p><ul><li>Losing an endpoint has minimal, if any, impact to existing connections</li><li>Automatic IP failover, or IP failover not required</li><li>Losing a backend only impacts connections to that backend (the reason for this one will become more obvious when we talk about possible solutions)</li></ul><p>Stay tuned for part 2 where we talk about solutions to these problems!</p><p></p>]]></content:encoded></item><item><title><![CDATA[host ports and hostnetwork: the NATty gritty]]></title><description><![CDATA[<p>if you're familiar with kubernetes you know that pods (the basic workload unit in kubernetes) are all assigned their own IPs and exist in their own separate network (also pid, mount, etc.) namespaces. 
thus it's possible for two pods living on the same host to bind the same port without</p>]]></description><link>https://lambda.mu/hostports_and_hostnetwork/</link><guid isPermaLink="false">5e408debd58c487fd54235db</guid><dc:creator><![CDATA[max]]></dc:creator><pubDate>Mon, 10 Feb 2020 02:42:50 GMT</pubDate><content:encoded><![CDATA[<p>if you're familiar with kubernetes you know that pods (the basic workload unit in kubernetes) are all assigned their own IPs and exist in their own separate network (also pid, mount, etc.) namespaces. thus it's possible for two pods living on the same host to bind the same port without any conflicts. similarly, something running in the host's root network namespace would be able to bind the same port with no issues.</p><p>sometimes we need a way to statically expose a pod's bound ports on the host's IP. if we do this we (as k8s operators) need to ensure that there will not be port conflicts wherever the pods are scheduled so this is usually done through the use of daemonsets where scheduling (and thus port conflicts) will be obvious. for these reasons though<strong> using either hostports or hostnetwork is generally considered an antipattern outside of a few specialized scenarios!</strong></p><p>(there are also nodeport services, however these are not directly exposing a pod's ports so I'm not going to talk about them here)</p><h1 id="hostnetwork">hostNetwork</h1><p>host network, as you might be able to guess, is a setting that allows the pod to run in the host's root network namespace. 
it's a <a href="https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.11/#podspec-v1-core">field that can be set on the pod spec</a> and when a pod with <code>hostNetwork: true</code> is launched, two things that usually happen when starting a pod <em>will not</em> happen:</p><ol><li>setting up a separate network namespace for the pod</li><li>running of CNI plugins</li></ol><p>these two things are essentially all that are required to implement <code>hostNetwork</code>. now let's talk about the implications of this.</p><p>if your pod is going to bind ports, the spec says that you must report them in the <code>ports</code> section of the spec, but there is nothing that actually enforces this. it also gives these workloads the ability to do things like tcpdump all traffic on a host's real interface. thus, this is a dangerous mode of operation as these workloads could wreak havoc in the root net namespace.</p><p>another downside of <code>hostNetwork</code> is that traffic from these pods is indistinguishable from traffic from the host the pod is running on. thus, regular k8s network policy cannot be applied against them.</p><p>the upside of <code>hostNetwork</code> is that 1 + 2 mean that your networking provider doesn't need to be up in order for these pods to function and any time the host has connectivity the pods will have connectivity also. if you have pods that need to come up before pod networking is available (assuming your pod networking is driven by other pods, as is the case for many network providers) then <code>hostNetwork</code> is mandatory. </p><p>my recommendation though is to avoid this feature when you can.</p><h1 id="host-port">host port</h1><p>host port is a lighter-weight way of binding a port on a host and allows for enforced collision detection at schedule time. 
it's implemented in the <a href="https://github.com/containernetworking/plugins/tree/master/plugins/meta/portmap">portmap CNI plugin</a> and is a field on the container spec in the <a href="https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.11/#containerport-v1-core">ports section</a>.</p><p>when a pod with hostports specified is launched the portmap plugin creates the following iptables rules in the prerouting and output chains of the nat table:</p><ol><li>a rule that looks for any traffic on the host port and retargets it to the destination port and the pod IP. eg if my host port is 8080, my pod port is 8081, and my pod's IP is <code>172.16.15.15</code> then the rule looks for traffic coming into the host on 8080 and redirects it to 172.16.15.15:8081</li><li>a rule that looks for hairpin traffic (eg pod -&gt; hostport -&gt; back to pod) that marks traffic for MASQUERADE. without this rule the source address of the traffic going back to the pod would be the pod's own IP.</li><li>a rule that looks for local traffic (eg localhost -&gt; local address) that marks traffic for masquerading. without this rule the source address of the traffic going to the pod would be 127.0.0.1 which would route to the pod's own net namespace instead of the host's</li></ol><p>host ports are a better option than <code>hostNetwork</code> if you can use them, but with one caveat that I think is worth mentioning. because the NAT is implemented in iptables, you won't see a socket listening on the host port being used which may be unexpected behavior if you don't understand what's happening. ie <code>nc localhost &lt;myport&gt;</code> will work fine, but if you look at <code>ss -l</code> nothing will show up for <code>&lt;myport&gt;</code> and you have to go to iptables to see what's actually "listening" on the host. </p>]]></content:encoded></item><item><title><![CDATA[code is just a byproduct]]></title><description><![CDATA[<!--kg-card-begin: markdown--><blockquote>
<p>We program with constructs. We have programming languages. We use particular libraries, and those things, in and of themselves, when we look at them, like when we look at the code we write, have certain characteristics in and of themselves.</p>
<p>But we're in a business of artifacts. Right? We don't</p></blockquote>]]></description><link>https://lambda.mu/code-is-a-byproduct/</link><guid isPermaLink="false">5afe232f44778309718d4788</guid><dc:creator><![CDATA[max]]></dc:creator><pubDate>Sun, 27 May 2018 00:10:41 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><blockquote>
<p>We program with constructs. We have programming languages. We use particular libraries, and those things, in and of themselves, when we look at them, like when we look at the code we write, have certain characteristics in and of themselves.</p>
<p>But we're in a business of artifacts. Right? We don't ship source code, and the user doesn't look at our source code and say, &quot;Ah, that's so pleasant.&quot;</p>
</blockquote>
<p>-- Rich Hickey, <a href="https://www.infoq.com/presentations/Simple-Made-Easy">simple made easy</a></p>
<p>i would like to talk about this tweet:</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">If developers are software &quot;engineers&quot;, then being told to develop an app with Scala instead of Haskell is like being mandated to build a bridge out of plywood instead of steel.</p>&mdash; Mark Hopkins (@antiselfdual) <a href="https://twitter.com/antiselfdual/status/991919532326318080?ref_src=twsrc%5Etfw">May 3, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>from what i can tell, Mark is a pretty smart guy and probably a good programmer. also his tweet &quot;did well&quot;, so i feel a little less bad about what i'm about to do with the tweet, which is critique it.</p>
<p>the first thing i think we need to do is crystallize what the &quot;bridges&quot; we build are. they're binaries, right? that's why we call them &quot;builds.&quot; this should be obvious to us, but i think we oftentimes forget it. i also think our bridges include other things we construct, for example processes (here meaning a series of actions or steps taken in order to achieve a particular end) or execution environments. but they're <strong>not</strong> source code! outside the context of this tweet, i've seen people get tripped up on this a bunch. we need to think of the outputs of the development process as being binaries, not as being code.</p>
<p>my issue with the tweet is that Mark is conflating properties of the code (or construct, as Hickey says) with properties of the artifact. i'd guess that Mark believes that qualities like purity, a more syntactically-nice type system, and much-less-accessible mutability are things that can add strength to our &quot;bridges&quot;. this is a sentiment that i think is widely shared by many in our profession. but, again, source code isn't executed. these qualities are qualities of our <em>code</em> and, by definition, cannot be inherited by the artifact. the only value these qualities provide us is in how they affect the <em>production</em> of the artifact.</p>
<p>so let's talk about them like that. i truly believe that capturing side effects in the type system, like haskell does, improves a programmer's ability to reason about what the outputs of her programs will be. what a program's outputs are going to be is definitely a property of the artifact, big win for haskell here. but haskell's runtime execution model is inhibitory to the same programmer being able to reason about how her artifact executes compared to a language like java or scala. changing or even understanding the performance profile of a haskell artifact is going to be much more difficult than other languages.</p>
<p>my point is not that there is a tradeoff (although there definitely is), it's that in order for us to compare our tools, the thing we should be looking at is what impacts those tools have on the production of artifacts. our experience as programmers naturally centers around code and the experience of writing it because that's what we spend our time doing. however our job is to produce binaries; code is just a byproduct</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Oncogenesis and Protein Folding]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>So this post is mostly going to be about/related to my project last summer at the University of Utah. I realize that my &quot;audience&quot; is probably not going to be biologists or biochemists, so I'm going to try and make this post as accessible as I can</p>]]></description><link>https://lambda.mu/oncogenesis-and-protein-folding/</link><guid isPermaLink="false">5b10b2a3d2c1a714e8ce9ef7</guid><dc:creator><![CDATA[max]]></dc:creator><pubDate>Thu, 04 Feb 2016 03:42:00 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>So this post is mostly going to be about/related to my project last summer at the University of Utah. I realize that my &quot;audience&quot; is probably not going to be biologists or biochemists, so I'm going to try and make this post as accessible as I can to those without that background. We're going to learn about cancer!</p>
<p><em>Really quickly, you will just need to know that:</em></p>
<ol>
<li>All living things are composed of cells</li>
<li>Humans are multicellular organisms, but unicellular organisms exist</li>
</ol>
<h2 id="overviewandthecellcycle">Overview and the Cell Cycle</h2>
<p><img src="https://lambda.mu/static/images/cellcycle.png" alt="The Cell Cycle"></p>
<p>In order to talk about cancer, we first have to talk about the cell cycle. For a multicellular organism, the above is a representation of the states each of its cells could be in. Every time a cell passes through M, a new, duplicate cell is produced.</p>
<p>Now, we can imagine in a healthy organism that this process would need to be heavily regulated, and indeed it is. We don't want cells dividing willy-nilly in our body, we need them to divide at a regulated rate or in response to some signals from other cells. Think about what happens when you get a cut -- we need lots of new cells to replace the ones that are no longer there! However, if there's no injury, it's unnecessary and even detrimental for us to be producing all these excess cells.</p>
<p>The loss of this regulation is essentially the first step of oncogenesis AKA the formation of a tumor.</p>
<h2 id="centraldogma">Central Dogma</h2>
<p>In order to talk about loss of regulation, we need to talk about the central dogma of molecular biology.</p>
<p><img src="https://lambda.mu/static/images/centraldogma.jpg" alt="Central Dogma"></p>
<p>In simplest terms, the central dogma of biology states that &quot;DNA makes RNA and RNA makes protein.&quot; This represents the basic flow of information in each of the cells in your body.</p>
<p>Each of your cells has long strands of a relatively stable molecule called DNA. In fact, each of your cells (with some exception) has the exact same strands as every other cell you have. DNA by itself does not do anything though, it simply encodes information. In order for cells to make protein, which we can think of as the machinery of the cell, it must first be <em>transcribed</em> into RNA. RNA is a molecule that is similar in structure to DNA, but is single stranded and much less stable. As such, it's fairly transient in a cell. This RNA molecule is then <em>translated</em> into protein. Protein serves many purposes, but for our discussion we will think of it primarily as a signaling molecule and as cellular &quot;machinery&quot;.</p>
<p>The best analogy I'm able to think of is this: DNA is the library of the master copies of all the blueprints of all the possible machinery a cell can produce. It's risky to produce the machinery from the master copy (for reasons you're about to see!), so we make a temporary copy of it and use that as the basis for our machinery. It should be noted that, like our &quot;blueprint copies&quot;, the machinery is also relatively short-lived in a cell and will eventually be degraded.</p>
<p>I'm going to try and stick to this &quot;machinery&quot; analogy when possible for the rest of this post.</p>
<p>Cancer is a disease of mutation, and you might be able to guess that it's DNA that's mutated. If we were to have &quot;erroneous&quot; RNA, we might produce some defective machines, but they won't last too long. Similarly, if we have an &quot;erroneous&quot; protein, we again just have a defective machine that isn't going to last very long. If we acquire an error in our DNA, then we've potentially modified the master copy of the blueprint of some machine and <em>every</em> one of these specific machines we make thereafter is going to be &quot;erroneous.&quot; Even worse, every single one of this cell's descendants will have this same error.</p>
<p>So, back to a loss of regulation. Certain proteins signal or push the cell to go through the aforementioned cell cycle. These proteins are referred to as oncoproteins or oncogenes. We can think of them as the gas pedal for the cell cycle. Some other kinds of proteins do the opposite and pause/slow the cells' transition through the cell cycle. These proteins are called tumor suppressors and we think of them as the brakes on the cell cycle. So, looking at all of this, you might think that a defect-causing mutation in a tumor suppressor -- the &quot;brakes&quot; -- might result in cancer. And indeed, that's one way it can happen! But it's more likely for a mutation in an oncogene -- the gas pedal -- to be a driving mutation in cancer. This might seem odd, as we said that mutations cause defects, so a defect in an oncogene should mean the cell would move more slowly through the cell cycle, right? The explanation for this will bring us (finally) to protein 3D structure.</p>
<h2 id="regulationattheproteinlevel">Regulation at the protein level</h2>
<p>When we said that a protein pushes a cell through the cell cycle, I didn't give you the full picture. It's usually not enough for these proteins to exist in the cellular goop in order for them to function; they need to be switched on. A protein's function is determined by the chemical properties of its active site(s). For example, if you're a DNA-slicing protein, then your active site has chemical properties that allow it to slice DNA. These chemical properties aren't easily changed, so how can a protein rapidly and easily turn on or off? One answer is three-dimensional structure &quot;conformations&quot;.</p>
<p>We can think of many proteins as being chemical &quot;globs.&quot; Any given protein has a certain three-dimensional structure that is determined by its chemical properties. For example, it's energetically favorable for a positive and a negative part of the protein to be close to one another, while it's energetically unfavorable for two positive or two negative portions of the protein to be close together. It may also be favorable (for <em>reasons</em>) for certain parts of a protein to be on the outside of the protein glob -- facing the cellular goop -- while others are favored to be internal, facing other parts of the protein. An example of a globular protein is shown.</p>
<div id="glmol01" style="width: 500px; height: 200px; background-color: black;"></div> <div id="glmol01_src" style="display: none;"></div><script type="text/javascript" src="https://lambda.mu/static/js/jquery-1.7.min.js"></script><script src="https://lambda.mu/static/js/Three49custom.js"></script><script type="text/javascript" src="https://lambda.mu/static/js/GLmol.js"></script><script type="text/javascript">var glmol01 = new GLmol('glmol01', true);glmol01.defineRepresentation = function() {   var all = this.getAllAtoms();   var hetatm = this.removeSolvents(this.getHetatms(all));   this.colorByAtom(all, {});   this.colorByChain(all);   var asu = new THREE.Object3D();   this.drawBondsAsStick(asu, hetatm, this.cylinderRadius, this.cylinderRadius);   this.drawCartoon(asu, all, this.curveWidth, this.thickness);   this.drawSymmetryMates2(this.modelGroup, asu, this.protein.biomtMatrices);   this.modelGroup.add(asu);};$.get("https://lambda.mu/static/pdb/1tup.pdb", function(ret) {$("#glmol01_src").val(ret);glmol01.loadMolecule();});</script>
<p>Let's imagine we have a protein whose active site binds DNA. In its initial state, it's very neatly folded with lots of favorable interactions between the positive and negative parts. The active site is hidden at the center of the glob, though, so it's unable to perform its function at the present time. If, all of a sudden, one of these positive parts becomes negative or neutral, our favorable interaction becomes an unfavorable or neutral one, causing our glob to take on a new shape. In this new shape, our active site is exposed, and now we can bind DNA. An example picture is shown.</p>
<p><img src="https://lambda.mu/static/images/conformationalChange.gif" alt="Conformational Changes"></p>
<p>This is one example of how a cell can regulate the activity of its proteins. By chemically modifying certain regions of a protein we can turn on or turn off certain activities.</p>
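To make the &quot;energetically favorable&quot; idea a bit more concrete, here's a toy Python sketch. This is not real biophysics: the residues, charges, and contact map are entirely made up for illustration. It just scores two conformations by counting favorable (+/-) versus unfavorable (+/+ or -/-) contacts, and shows how flipping a single charge can change which shape the glob prefers:

```python
def contact_energy(charge_a, charge_b):
    """Opposite charges attract (favorable, negative energy);
    like charges repel (unfavorable, positive energy)."""
    if charge_a == 0 or charge_b == 0:
        return 0
    return 1 if charge_a == charge_b else -1

def conformation_energy(charges, contacts):
    """Total energy of one shape, given which residue pairs touch."""
    return sum(contact_energy(charges[i], charges[j]) for i, j in contacts)

# Hypothetical protein: in the "closed" shape residues 0 and 1 touch
# and the active site is hidden; in the "open" shape nothing touches
# and the active site is exposed.
closed_contacts = [(0, 1)]
open_contacts = []

wild_type = {0: +1, 1: -1}   # favorable +/- contact holds it closed
mutant = {0: +1, 1: +1}      # mutation flips residue 1 to positive

for name, charges in [("wild-type", wild_type), ("mutant", mutant)]:
    closed = conformation_energy(charges, closed_contacts)
    opened = conformation_energy(charges, open_contacts)
    shape = "closed (site hidden)" if closed < opened else "open (site exposed)"
    print(f"{name}: closed={closed}, open={opened} -> prefers {shape}")
```

In this toy model the wild-type protein prefers the closed shape, while the charge-flipped mutant finds the closed shape unfavorable and pops open, exposing its active site.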
<p>Back to our imagined DNA-binding protein: let's say it's also an oncoprotein, and when it binds DNA it increases transcription of other proteins that push the cell through the cell cycle. If the DNA encoding it is mutated such that one of our normally negative regions is now neutral or positive, our active site is permanently exposed and the protein is <em>always on</em>. We now have a cell whose cell-cycle gas pedal has been taped down! This single mutation in DNA has given us a loss of regulation of the cell cycle and thus tumorigenesis.</p>
<h2 id="so">So....</h2>
<p>So now we have some understanding of how a mutation in DNA can cause proteins to fold incorrectly, resulting in incorrect function. Where can we go from here? Knowing what mutations are likely to cause cancer is far from being a solved problem. We may have some genetic information about a mutation that's been acquired by an individual and want to ask &quot;Is this individual now at an increased risk for cancer?&quot;.</p>
<p>My project this summer was to use a 3D structural prediction model to simulate what proteins with known-oncogenic mutations might look like in vivo. Using this data, our plan is to use machine learning methods, in combination with known non-oncogenic mutations, to classify mutations of unknown oncogenicity. Our preliminary results suggested that oncogenic mutations in oncogenes produced less stable proteins, indicating that these mutated proteins were more likely to go through changes in three-dimensional structure. Oncogenic mutations in tumor suppressor genes tended to result in proteins that were more stable and thus less likely to undergo conformational shifts, indicating that they may have become &quot;locked&quot; in an inactive state.</p>
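As a rough illustration of the classification idea, here is a tiny Python sketch. The ddG (stability change) numbers, threshold, and function are invented for the example -- they are not our actual data or model -- but they capture the intuition: destabilizing mutations suggest oncogene-like behavior, stabilizing ones suggest a &quot;locked&quot; tumor suppressor:

```python
def classify_gene_role(ddg_values, threshold=0.0):
    """Guess a gene's role from the mean predicted stability change
    (ddG) across its known-oncogenic mutations. A negative mean means
    the mutant proteins are less stable, i.e. more prone to
    conformational change."""
    mean_ddg = sum(ddg_values) / len(ddg_values)
    return "oncogene-like" if mean_ddg < threshold else "tumor-suppressor-like"

# Hypothetical per-mutation ddG predictions for two genes:
print(classify_gene_role([-2.1, -0.8, -1.5]))  # destabilizing mutations
print(classify_gene_role([1.9, 0.7, 2.3]))     # stabilizing mutations
```

A real pipeline would of course feed many such structural features into a trained classifier rather than thresholding a single mean.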
<h2 id="furtherthoughts">Further thoughts</h2>
<p>There are roughly 20,000 protein-coding genes in the human genome, many of which have poorly characterized functions. We may want to ask &quot;Could this gene be an oncogene or a tumor suppressor?&quot; Or, given that we know a mutation in a gene is oncogenic, we may want to ask which of the two the gene is. If we can classify known oncogenes and known tumor suppressors by how they look when they are mutated, we might be able to answer open questions about the roles of other proteins in the cell.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Elm]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h2 id="background">Background</h2>
<p>Last quarter I took a course called Principles of Safe Software which had a lot to do with program correctness, compile time guarantees, etc. We specifically explored this through <a href="http://www.haskell.org/haskellwiki/Haskell" title="Haskell">Haskell</a> and its type system. Haskell is a <a href="http://en.wikipedia.org/wiki/Purely_functional" title="Functionally Pure">pure</a>, <a href="http://en.wikipedia.org/wiki/Static_typing#STATIC" title="Static Typing">statically typed</a> functional language. Almost all of my programming experience has</p>]]></description><link>https://lambda.mu/elm/</link><guid isPermaLink="false">5b10b3bfd2c1a714e8ce9ef8</guid><dc:creator><![CDATA[max]]></dc:creator><pubDate>Fri, 02 May 2014 02:47:00 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="background">Background</h2>
<p>Last quarter I took a course called Principles of Safe Software which had a lot to do with program correctness, compile time guarantees, etc. We specifically explored this through <a href="http://www.haskell.org/haskellwiki/Haskell" title="Haskell">Haskell</a> and its type system. Haskell is a <a href="http://en.wikipedia.org/wiki/Purely_functional" title="Functionally Pure">pure</a>, <a href="http://en.wikipedia.org/wiki/Static_typing#STATIC" title="Static Typing">statically typed</a> functional language. Almost all of my programming experience has been with 'impure', dynamically typed, imperative languages so this was quite the trip for me! While I haven't had much time to master Haskell as much as I'd like, an opportunity in class has come up for me to play with Haskell-like things, namely Elm!</p>
<h2 id="elm">Elm</h2>
<p>Recently I've been playing a lot with <a href="http://elm-lang.org/" title="Elm">Elm</a> -- and by playing I mean actually playing! Elm is a language with syntax similar to Haskell's that compiles to HTML and JS. Elm is tons of fun and really easy to write. I don't (and don't necessarily want to) know a whole lot of JavaScript, so Elm has been a good way for me to build some fun UI stuff with syntax and idioms very similar to those of Haskell. It's really neat what you can do with a couple lines of code sometimes (hint: click and drag)!</p>
<div id="exampler" style="width:100%; height:400px;"></div>
<script type="text/javascript" src="https://lambda.mu/static/js/elm-runtime.js"></script>
<script type="text/javascript" src="https://lambda.mu/static/js/drag.js"></script>
<script type="text/javascript" src="https://lambda.mu/static/js/dragexample1.js"></script>
<script type="text/javascript">
	var exampleDiv = document.getElementById('exampler');
	Elm.embed(Elm.DragExample1, exampleDiv);
</script>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[GSoC 2014: Dockerizing Flumotion, Building a Build System]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h2 id="proposermaxstritzinger">Proposer: Max Stritzinger</h2>
<p><a href="http://www.ucsb.edu/">University of California, Santa Barbara</a><br>
<a href="mailto:mstritzing@gmail.com">mstritzing@gmail.com</a></p>
<h2 id="synopsis">Synopsis</h2>
<p>This is a proposal for TimVideos for GSoC 2014. This proposal is centered around the 'Dockerization' of Flumotion allowing for quick and portable deployments of the TimVideos streaming system. Following completion of Dockerizing the streaming system, work will</p>]]></description><link>https://lambda.mu/gsoc-2014-dockerizing-flumotion-building-a-build-system/</link><guid isPermaLink="false">5b10b7c1d2c1a714e8ce9ef9</guid><dc:creator><![CDATA[max]]></dc:creator><pubDate>Fri, 14 Mar 2014 03:06:00 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="proposermaxstritzinger">Proposer: Max Stritzinger</h2>
<p><a href="http://www.ucsb.edu/">University of California, Santa Barbara</a><br>
<a href="mailto:mstritzing@gmail.com">mstritzing@gmail.com</a></p>
<h2 id="synopsis">Synopsis</h2>
<p>This is a proposal for TimVideos for GSoC 2014. This proposal is centered around the 'Dockerization' of Flumotion allowing for quick and portable deployments of the TimVideos streaming system. Following completion of Dockerizing the streaming system, work will begin on an automated build tool based on the newly Docker'd components.</p>
<h2 id="dockerization">Dockerization</h2>
<p><a href="https://www.docker.com/">Docker</a> is an open-source framework allowing for the creation of portable application containers. Docker allows an application and all of its dependencies to be wrapped in one of these containers and deployed with ease to any host with Docker installed. My proposal is to wrap the streaming system up in a few Docker packages to make it extremely easy and quick to deploy. For example, following 'Dockerization', setting up the website portion of the streaming system might be as simple as:<br>
docker pull streaming-system/website	<br>
docker run -name website -p 80:80 -d streaming-system/website python streamingsystem/tools/server_start.py some_conf_file</p>
<h2 id="automatedbuildsystem">Automated Build System</h2>
<p>Dockerization would be the meat of the project; however, my belief is that time will allow for development of a script/tool to generate most of the streaming system automatically, assisted by Docker. Following delivery of the Docker portion of the proposal, I would plan to continue working on an automated build/deployment tool. Docker fits in <em>really</em> well with an automated deployment tool: Docker supports running in daemon mode, so you can imagine PXE/AWS booting an image running a Docker daemon. Then, deployment and configuration for our sample website component becomes as simple as:	<br>
docker -H remote_host:4243 pull streaming-system/website	<br>
docker -H remote_host:4243 run -name website -p 80:80 -d streaming-system/website python streamingsystem/tools/server_start.py some_conf_file<br>
You can imagine the ease with which this can be scripted out!</p>
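To sketch what that scripting might look like in Python (the image names, remote host, and port come from the commands above; the helper functions themselves are placeholders, and the single-dash docker flags mirror the 2014-era syntax used in this proposal):

```python
import subprocess

def build_commands(host, component, conf_file, port=80):
    """Return the docker invocations needed to pull and start one
    streaming-system component on a remote docker daemon."""
    image = f"streaming-system/{component}"
    remote = f"{host}:4243"
    pull = ["docker", "-H", remote, "pull", image]
    run = ["docker", "-H", remote, "run", "-name", component,
           "-p", f"{port}:{port}", "-d", image,
           "python", "streamingsystem/tools/server_start.py", conf_file]
    return [pull, run]

def deploy(host, component, conf_file, port=80):
    """Actually run the commands (requires a reachable docker daemon)."""
    for cmd in build_commands(host, component, conf_file, port):
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Just print what would be run for the website component:
    for cmd in build_commands("remote_host", "website", "some_conf_file"):
        print(" ".join(cmd))
```

Keeping the command construction separate from execution makes the tool easy to dry-run and test before pointing it at real hosts.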
<h2 id="myproposedwork">My Proposed Work</h2>
<ol>
<li>Starting with a base Ubuntu Docker image, I will build a streaming-system/base Docker image with all dependencies (including watchdog, register) installed. This image will be capable of functioning as any of the three components, given some additional configuration.</li>
<li>Building off of the streaming-system/base package, I will build separate images for the Encoder, Collector, and Website components of streaming system. These images will come preconfigured for their specific component function and will be able to accept configuration arguments on startup for easy setup. This will involve generation of custom scripts/config files for each of the different components.</li>
<li>streaming-system/(base, website, encoder, collector) will be published to the docker.io index under the streaming-system or similar username.</li>
<li>Following the completion of the Docker portion of the project, I will then begin work on the automated build and deployment system.</li>
<li>The automated build system will be integrated around a <a href="http://www.cobblerd.org/">Cobbler</a> daemon and the AWS Python API.</li>
<li>The build tool will, based on the configuration file/utility, boot any PXE images it needs via Cobbler and start EC2 instances via AWS, push the correct component(s) to each host (through a docker remote pull), and then start each docker instance based on the configuration options specified.</li>
</ol>
<h2 id="timeline">Timeline</h2>
<ul>
<li><em>Now - May 18th</em> : Prior to the official start of the program I'll be active on IRC and poking around the streaming system components to gain familiarity with the different parts of the streaming system.</li>
<li><em>May 19th - May 25th</em>: Development environment setup, setup of the streaming-system/base container.</li>
<li><em>May 26th - June 1st</em>: Setup / Configuration of the streaming-system/website container. By the end of this week we will be able to 'docker pull' a website container and start the web service with a simple configuration file and start script.</li>
<li><em>June 2nd - June 11th</em>: Setup / Configuration of the streaming-system/collector container. By the end of this week we will be able to 'docker pull' a collector container and start the collecting service with a simple configuration file and start script.</li>
<li><em>June 12th - June 22nd</em>: Setup / Configuration of the streaming-system/encoder container. By the end of this week we will be able to 'docker pull' an encoder container and start the encoding service with a simple configuration file and start script.</li>
<li><em>June 23rd - June 29th</em>: Dockerization should be fully complete! Work begins on the automated build system. By the end of this week we will have a tool that will take a component type and some parameters regarding its associated components, and will output a configuration file.</li>
<li><em>June 30th - July 13th</em>: Extend the build tool (or possibly create a separate 'boot' tool) to incorporate host booting. We will be able to take a list of hosts and their types (EC2, PXE) and deploy a docker-enabled image to them.</li>
<li><em>July 14th - August 10th</em>: Add support to our tool for specifying entire systems of components with the hosts they will run on. The tool will boot hosts, deploy docker containers to them, and start the components. This should correctly start the streaming system.</li>
<li><em>August 11th - August 18th</em>: Clean up documentation, final testing/debugging, and debrief to the TimVideos team.</li>
</ul>
<h2 id="accountability">Accountability</h2>
<p>I am very much interested in completing my project successfully, as I am sure the TimVideos team is. Some accountability is a great way to make sure that happens! I am (and will be) almost always available on IRC for interrogation and would propose bi- or tri-weekly updates/progress meetings with my mentor(s) to ensure that I'm on the right track. In addition, I would plan on keeping a blog tracking my work throughout the project on my website.</p>
<h2 id="aboutme">About Me</h2>
<ul>
<li>Second-year student at UC Santa Barbara, studying Biochemistry and Computer Science. I currently hold a 3.89 GPA.</li>
<li>Student Web/App developer for Associated Students at UCSB.</li>
<li><em>Very</em> available this summer (the second week of June I have one or two finals, but I'm free otherwise)</li>
<li>Familiar with many languages and web technologies, including, but not limited to:
<ul>
<li>AWS</li>
<li>Python
<ul>
<li>Flask</li>
</ul>
</li>
<li>Haskell</li>
<li>SQL (experienced with MySQL specifically)</li>
<li>PHP</li>
<li>Wordpress</li>
<li>General Bash/Shell scripting</li>
<li>(Somewhat rusty) DevOps/SysAdmin experience with:
<ul>
<li>CentOS/RHEL</li>
<li>Ubuntu Server</li>
<li>MySQL Server</li>
</ul>
</li>
</ul>
</li>
<li>I am, of course, available for contact almost anytime on IRC or through e-mail as listed above.</li>
</ul>
<h2 id="commentsconcerns">Comments, Concerns?</h2>
<p>If there's anything you'd like to share, feel free to post it below or give me a ping on IRC.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>