Sunday, December 27, 2015

New blog

I'm moving over to github pages: https://github.com/turtlemonvh/turtlemonvh.github.io

I'll probably write a longer post about why, but here are a few reasons:

* excuse to play with new technology
* better code formatting
* can do stuff like embed ipython notebooks and run javascript pretty painlessly
* better seo opportunities (this blog seems to be very hard to find, even if you search for exact terms)

Adios blogger.  You were useful, and I would still recommend you as a good starter platform.

Wednesday, November 11, 2015

Cool Systems: ITA Flight Search

I've read a lot lately about a few neat systems.  This article by Paul Graham about ITA Software (famous for revolutionizing the flight search market) gives a little insight into their approach:
http://www.paulgraham.com/carl.html

Here is what I got out of it.

1. They keep large amounts of data in memory mapped files

This puts the onus of memory management on the OS file system cache, which makes sense.  Kafka, Lucene, Elasticsearch, Varnish, MongoDB, and many other systems take a similar approach.

They also chose to hold the data in C/C++ structs and just access pointers to those objects from Lisp.  I think this is similar to what numpy does in Python: keeping the large data objects out of the Python runtime and away from the GC.
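
As a rough illustration of that pattern (with numpy standing in for the Lisp side, and a made-up data file name), a memory-mapped array in Python looks like this:

import mmap
import numpy as np

# Map a binary file of float64 prices; the OS page cache, not the
# process heap, decides what stays resident in memory.
with open("fares.bin", "rb") as f:  # hypothetical data file
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Zero-copy view: the array wraps a pointer into the mapped region,
# so the data never becomes a GC-managed Python object.
fares = np.frombuffer(buf, dtype=np.float64)
print(fares.mean())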

2. They find options for flights to and flights from using simple graph search

That makes sense too.  The amount of data here is relatively limited.

3. To find optimal flights they use "very clever algorithms" on a "pricing-graph".

I'm guessing they use something like A* as a path finding algorithm for this (for a great intro to A*, see this series from Red Blob Games).  That would allow them to claim to search through the full cartesian product (every possible combination) of pricing combinations while really only evaluating the items likely to have the lowest prices.

The article mentions that results are

ordered according to the function f, assuming of course certain restrictions on f

This makes A* (or another Dijkstra derivative) even more likely, since this sounds like a requirement for an admissible distance heuristic.
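
Purely as a sketch of the idea (this is textbook A*, not anything from ITA; the neighbors, cost, and h callables are placeholders):

import heapq
import itertools

def a_star(start, goal, neighbors, cost, h):
    # h must never overestimate the true remaining cost (i.e. it is
    # admissible), the kind of "restriction on f" hinted at above.
    counter = itertools.count()  # tie-breaker so the heap never compares nodes
    frontier = [(h(start), next(counter), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g  # first goal pop is optimal when h is admissible
        for nxt in neighbors(node):
            g2 = g + cost(node, nxt)
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), next(counter), g2, nxt, path + [nxt]))
    return None, float("inf")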

4. They preallocate data structures and just fail queries when per query memory runs out

[...] we pre-allocate all data structures we need and die on queries that exceed them

This reminds me a bit of how elasticsearch works.  It uses threadpools to keep a set of threads ready to do a specific kind of work, and each thread pool has a queue of a certain size.  When the queue is full, the pool will either block or reject the attempt to add to the queue, depending on the type of operation.  This allows elasticsearch to use a consistent, bounded amount of memory for its core performance critical code.
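
A toy version of that bounding policy in Python (the size and names are made up):

import queue

# Preallocate a fixed-size work queue; submissions beyond the bound are
# rejected immediately instead of growing memory without limit.
work = queue.Queue(maxsize=100)

def submit(task):
    try:
        work.put_nowait(task)  # reject rather than block
        return True
    except queue.Full:
        return False  # the caller decides how to fail the query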

Manual memory management is also very common in other performance critical systems like game engines, scientific computing, and virtual machines / language runtimes.


High level takeaways:

* Lean on the OS: memory-mapped files let the file system cache do your memory management.
* Keep big data out of the garbage-collected runtime.
* Search with admissible heuristics so you only evaluate a small slice of a huge combinatorial space.
* Bound memory up front and fail fast on queries that exceed the bound.
Tuesday, September 9, 2014

Angular constants vs services

A few months ago I started using a pattern that wraps library code in angular constants to make it injectable.  I'd handle something like d3 with:

app.constant('d3', window.d3);

This was easy, but I always felt a little weird doing it, because these library objects are not really constants.  Depending on the library, they can be factory functions or large, complex objects with a huge amount of internal state.  Definitely not constant.

A co-worker was wrapping some simple code to inject into a route handler configuration block.  The code was just a simple function for generating a template string based on a string passed to it.  This sounds like a great place to use a factory, but the docs say that
Only providers and constants can be injected into configuration blocks.
So this made me think again about constants.  Why could they be injected early?

The docs have a little to say about this.  See "the guide" documentation on modules:

Configuration blocks - get executed during the provider registrations and configuration phase. Only providers and constants can be injected into configuration blocks. This is to prevent accidental instantiation of services before they have been fully configured. 
Run blocks - get executed after the injector is created and are used to kickstart the application. Only instances and constants can be injected into run blocks. This is to prevent further system configuration during application run time.

The module type documentation also hints at the difference:

constant(name, object);

Because the constant are fixed, they get applied before other provide methods. See $provide.constant().
But this talk of "fixed" and "system configuration" is a little misleading.  Looking at the code for the injector, it becomes clear that what they really mean is that constants don't involve any async configuration.

When a constant is set up, the constant function simply checks the name for validity and shoves the object into a couple cache arrays.

However, when a service is instantiated, it starts with a call to the service function, which calls factory, which calls provider.  Because services call provider (by way of factory) with an array of arguments, they are set up with a call to providerInjector.instantiate.  That method calls the getService method of createInternalInjector, and this function is where the async handling magic happens.  Because the call to a service's factory constructor can be async, getService sets a marker at that service's assigned position in the cache, preventing the service from being instantiated multiple times when the thread of control passes back to the main process injecting the other modules.
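
The shape of that trick, reduced to a few lines of Python (the names here are mine, not angular's):

INSTANTIATING = object()  # sentinel marking a service that is mid-construction
cache = {}

def get_service(name, factories):
    if name in cache:
        if cache[name] is INSTANTIATING:
            raise Exception("circular dependency on " + name)
        return cache[name]
    cache[name] = INSTANTIATING  # claim the cache slot before construction starts
    try:
        cache[name] = factories[name]()  # may recursively request other services
    except Exception:
        del cache[name]  # don't leave the marker behind on failure
        raise
    return cache[name]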

Check it out.  It's pretty neat.

After seeing all the complexity hoops that angular jumps through to make services work (and how simple constants are by comparison), the difference becomes clear.  If you use services, angular handles async for you.  If you're using constants, you're on your own.

Monday, June 30, 2014

Opening and closing ports on iptables

It turns out this is pretty easy, thanks to the insert (-I) and append (-A) commands.

iptables works by setting up chains of filters for certain types of requests.  To see all your chains, type:

# This gives you a verbose list of the rules, with numeric output for ports and addresses
iptables -L -v -n
# Add --line-numbers to see each rule's position in its chain (useful for -I and -D below)
iptables -L -v -n --line-numbers

You may find that some chains feed into other chains.  Understanding how iptables decides whether to let a packet through is the first step toward getting your rules in the right place.  There are 3 predefined chains (INPUT, FORWARD, and OUTPUT).  These are the starting points for processing any network traffic handled by iptables.

iptables works by running rules against packets in order until it finds one that matches.  When it finds a rule that matches, it applies the relevant action to the packet, which could be accepting, dropping, rejecting, forwarding, or any of a number of other actions.

Rules are processed in order within their chains, so order matters.  Often a chain will end with a line that looks like this:

REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited

This rule rejects all traffic on all ports.  This is a common way to handle whitelisting only approved activities and rejecting everything else.  For your rule to take effect, it has to come before this rule in the chain.

Once you understand this, the value of the insert command makes more sense.  You need to get your rule into the appropriate place in the chain.  An example of opening port 8080 is below.  In this example I'm adding the rule to a specific chain (RH-Firewall-1-INPUT) that is handling all packets routed through the default INPUT and FORWARD chains.

# The rule closing all ports was previously in the 16th spot in the chain
# This new rule opens port 8080 by putting a rule right before that "catch all" exclusion rule
iptables -I RH-Firewall-1-INPUT 16 -m state --state NEW -p tcp --dport 8080 -j ACCEPT

If you happen to make a mistake, you can easily delete a rule at a specific point in your chain with the following.

# Deletes rule 12 in chain RH-Firewall-1-INPUT
iptables -D RH-Firewall-1-INPUT 12

It's usually good to add a line blocking all unspecified traffic at the end of your config file.

# Reject all traffic not explicitly allowed in previous rules
iptables -A RH-Firewall-1-INPUT -p all -j REJECT

There is a lot more you can do with iptables, but hopefully this was a helpful starting point.

BONUS

If you're working with a VM in VirtualBox, you can edit the port forwarding rules and they will take effect without having to reboot the VM.

DOUBLE BONUS

When trying to make sure a network service is working, here are a few steps I found that minimize frustration:
  • Turn off selinux (sudo setenforce 0)
  • Turn off iptables (service iptables stop)
  • Use nmap to scan open ports (nmap -sS -O 127.0.0.1)
  • Use curl to make sure you can access the service locally (applies to HTTP services only)
Once you can get to the service from the inside, gradually turn the services back on until something breaks.  Then fix it.



Wednesday, June 25, 2014

Hosting multiple versions of a django site on subdomains with apache/mod_wsgi

I've run into a couple scenarios recently where customers want to have access to multiple versions of a site at the same time.

Why multiple versions

The first scenario involved an analysis application where a bit of simulation code changed.  In that case we were fairly sure that the customer would want to use the updated model, but we wanted to provide access to both versions so they could do some comparisons.

The second scenario involved an application that pulled data from a remote database, cleaned it up, and provided an interface for browsing the data.  The format of the data in the remote database changed, but the customer wanted to be able to still connect and update from tables containing data in the old format as well as the new format.

For both of these situations, it would have been possible to edit the user interface and the backend code to allow access to both versions of the application at the same time, but this would have made for more confusing interfaces and a much more complex codebase.

Why subdomains

There are 2 main ways to serve 2 versions at the same time: 1) using different ports, or 2) using subdomains.  Each method has its upsides and downsides.

If you serve off multiple ports, you first have to open another port in your firewall.  For many applications (esp. those sitting in a customer testbed) this isn't a big deal.  In my group's situation, we deploy into some pretty tightly monitored environments, and minimizing the number of open ports makes certification and approval a simpler process.  Also, serving off of multiple ports just makes for less pretty urls.  "simv2test.myapplication.com" is cleaner and more self-documenting than "myapplication.com:8080".

If you choose to work with subdomains, you'll need a wildcard DNS record to be able to grab all the traffic to your site.  Some hosts and DNS services provide this automatically, and some make you pay more for that service.  Also, if you're serving over SSL, you'll need a wildcard SSL certificate.  This may also cost a bit more than a normal single domain certificate.

After considering both options, we decided subdomains made more sense in each of the scenarios described above.

How

First consider your data.  In both of the situations described above we were serving up 2 different versions of the code and the data.  We handled this by copying our database into a new database, and pointing the new fork of the application at this new database.

In mysql the command was as simple as this (after creating the "appname_simv2test" database):

sudo mysqldump appname | sudo mysql appname_simv2test

Mongo has a command specifically for this (db.copyDatabase()), which can be invoked from the mongo shell.

Next, set up the application code.  The code for one of the projects was being served from "/var/www/deploy/appname/", so I copied the new version of the code to "/var/www/deploy/appname_simv2test".  Make sure to make the necessary permission changes to the files and directories.  I found that writing a fabric task to deploy each version of the application made this much easier.

Finally, set up your apache configuration to serve each version of the application at the appropriate subdomain.  Something like the following should work ok.
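
Here's a minimal sketch of what that can look like with mod_wsgi (server names, paths, and daemon process names are illustrative):

<VirtualHost *:80>
    ServerName myapplication.com
    WSGIDaemonProcess appname python-path=/var/www/deploy/appname
    WSGIProcessGroup appname
    WSGIScriptAlias / /var/www/deploy/appname/appname/wsgi.py
    Alias /static_assets/ /var/www/deploy/appname/static_assets/
</VirtualHost>

<VirtualHost *:80>
    ServerName simv2test.myapplication.com
    WSGIDaemonProcess appname_simv2test python-path=/var/www/deploy/appname_simv2test
    WSGIProcessGroup appname_simv2test
    WSGIScriptAlias / /var/www/deploy/appname_simv2test/appname/wsgi.py
    Alias /static_assets/ /var/www/deploy/appname_simv2test/static_assets/
</VirtualHost>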



You probably don't want to do this with too many subdomains on one server, because each additional subdomain basically doubles the resources in use (2x application threads, 2x database tables).

But for a simple temporary solution, that should do it.

BONUS

Here's a version that deploys over ports instead of subdomains.  One of the ports is served over https (port 80 redirected to 443), and the other is just over http on port 8080.
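
A sketch of that setup (certificate paths and names are again illustrative, and port 8080 needs a matching Listen directive in the main config):

# Redirect plain http on 80 to https on 443
<VirtualHost *:80>
    ServerName myapplication.com
    Redirect permanent / https://myapplication.com/
</VirtualHost>

<VirtualHost *:443>
    ServerName myapplication.com
    SSLEngine on
    SSLCertificateFile /etc/pki/tls/certs/myapplication.crt
    SSLCertificateKeyFile /etc/pki/tls/private/myapplication.key
    WSGIDaemonProcess appname python-path=/var/www/deploy/appname
    WSGIProcessGroup appname
    WSGIScriptAlias / /var/www/deploy/appname/appname/wsgi.py
</VirtualHost>

# The old version, on plain http
<VirtualHost *:8080>
    ServerName myapplication.com
    WSGIDaemonProcess appname_simv2test python-path=/var/www/deploy/appname_simv2test
    WSGIProcessGroup appname_simv2test
    WSGIScriptAlias / /var/www/deploy/appname_simv2test/appname/wsgi.py
</VirtualHost>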


Monday, June 23, 2014

Getting django LiveServerTestCase working with selenium's remote webdriver on VirtualBox

Our group does our development on linux VMs, usually running on a Windows host.  We want our developers to be able to write selenium system tests to wrap some of our existing functionality before we start diving into some deep refactoring.

Most of the LiveServerTestCase documentation I have seen covers django running locally and talking to selenium directly.  Getting an instance of django running in a VM to work with selenium running on the VM host required a few adjustments.

Start with a modern Django

LiveServerTestCase was introduced in Django 1.4.  We were on Django 1.3.  I tried using django-selenium, but had significant problems with its built-in test server implementation not starting, not stopping, or crashing in strange ways.

I ended up upgrading our project from django 1.3 to 1.6.5.  For our large project this just took ~2 hours of fiddling.

Open/forward required ports

It's probably best to just turn off the iptables service when setting things up.  If you have selinux running, set it to permissive mode.

Add a line to your settings.py file to configure the port used by the test server that LiveServerTestCase starts.

os.environ['DJANGO_LIVE_TEST_SERVER_ADDRESS'] = '0.0.0.0:8008'

It's important that you use '0.0.0.0' and not 'localhost' so that the port forwarding on VirtualBox works.  I'm using 8008 because it is one of the auxiliary http ports recognized by the selinux default configuration.

Then edit the settings of the VM in VirtualBox to forward port 8008 to some unused port on your local machine.  We're forwarding port 80 on the VM to 8888, so I forwarded this test port to 8889.
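
If you'd rather script this than click through the GUI, VBoxManage can add the NAT rule while the VM is running (the VM and rule names here are made up):

VBoxManage controlvm "devvm" natpf1 "livetest,tcp,,8889,,8008"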

Serve up static files

We have apache serving static files at /static_assets/.

The test server is a python server, so we had to configure it to find and serve these static files.  In a test-specific settings file, I added:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        # With sqlite, Django's test runner creates an in-memory database automatically
        'NAME': 'fact_rdb',
    }
}

if DATABASES['default']['ENGINE'] == 'django.db.backends.sqlite3':
    import os
    from django.conf.urls import patterns, url
    STATIC_ASSETS_URL = "/static_assets/"
    STATIC_ASSETS_ROOT = os.path.join(PROJECT_PATH, 'static_assets')
    LOGIN_REQUIRED_URLS_EXCEPTIONS = tuple(list(LOGIN_REQUIRED_URLS_EXCEPTIONS) + [r'^/static_assets/.*$'])
    # urlpatterns must be in scope here; this block can also live at the
    # bottom of the root urls.py
    urlpatterns += patterns('',
        url(r'^static_assets/.*$', 'myapp.views.serve_static_assets', name="static_assets_for_testing"),
    )

The last line (the url() entry) is the important one.  It routes static asset requests to the view described below.

The lines where I am editing "LOGIN_REQUIRED_URLS_EXCEPTIONS" exist because we are using a middleware to restrict access to certain urls.  You can see what that middleware looks like here.  Remove that line if you're not using that middleware.

I'm also configuring the tests to use an in-memory sqlite database.  I recommend you use this if possible.  If you are using custom sql in your code this may not be possible, but if you're just using the ORM it should work just fine.  In my experience, running with an in-memory database takes a couple of seconds off the execution time of every test in your test suite.

The view that ties into the url in the code above and serves up the static assets is as follows:

import os
from mimetypes import guess_type
from django.http import HttpResponse
from django.core.servers.basehttp import FileWrapper

def serve_static_assets(request):
    # Take a url like /static_assets/path/to/file.js and turn it into a file path
    filename = "/path/to/static/dir" + request.path_info
    static_file = FileWrapper(open(filename, 'rb'))
    mimetype = guess_type(request.path_info, False)[0] or 'application/octet-stream'
    response = HttpResponse(static_file, content_type=mimetype)
    response['Content-Length'] = os.path.getsize(filename)
    return response

WARNING: This is not a secure way to serve static files.  Please do not use this for anything but testing.


Add setUpClass and tearDownClass methods to your test classes

The following sets up a web driver (available as self.driver in your test methods) connected to the selenium server on the host machine.

from django.test import LiveServerTestCase
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

SELENIUM_HOST = '10.0.2.2'
SELENIUM_PORT = 4444

class MyTest(LiveServerTestCase):

    @classmethod
    def setUpClass(cls):
        # Connect to the selenium server running on the host machine
        cls.driver = webdriver.Remote(
            command_executor='http://%s:%s/wd/hub' % (SELENIUM_HOST, SELENIUM_PORT),
            desired_capabilities=DesiredCapabilities.CHROME)
        super(MyTest, cls).setUpClass()

    @classmethod
    def tearDownClass(cls):
        cls.driver.quit()
        super(MyTest, cls).tearDownClass()


On virtualbox, the IP of the host machine is usually 10.0.2.2.

Running your tests

You'll need to first start a selenium server on the host machine.  If the selenium server isn't running, there will be nothing for your selenium test runner to talk to.

To do this you'll need java and selenium installed.  Also, add the executables for any desired driver plugins (e.g. the chrome driver plugin) to your path.  Run the server with something like:

"C:\Program Files (x86)\Java\jre7\bin\java.exe" -jar selenium-server-standalone-2.33.0.jar

On the VM, run the command to test your application.  Something like:

python manage.py test --settings=custom_settings_file module.class.test_function

I hope that helped!


Wednesday, June 18, 2014

Decompiling python bytecode with pycdc

We somehow lost the correct working version of a python file for a project, but one of our servers still had the pyc file (which was working fine in production).  To fix this, I went hunting for a good way to get our source code back.

From what I found, the pycdc library seems to be the best option currently, though there are a couple of alternatives:

* unpyc
* uncompyle2

When I tried unpyc it threw an error, and uncompyle2 only works with Python 2.7.

Here are the steps to set up pycdc.  These instructions are for CentOS 5.3, so they may need to be tweaked for your system.

Install CMake


wget http://www.cmake.org/files/v2.8/cmake-2.8.12.2.tar.gz
tar xzvf cmake-2.8.12.2.tar.gz
cd cmake-2.8.12.2
./bootstrap
make
make install

Download and compile pycdc

git clone git@github.com:zrax/pycdc.git
cd pycdc
# Note: from inside the checkout, ../pycdc/ points back at this same
# directory, so this is an in-source build
/usr/local/bin/cmake ../pycdc/
make

Using pycdc to decompile

The program outputs to stdout, so redirect to a file.

./pycdc/pycdc filename.pyc > filename.py

That's it.