Gijs Koot

Partial batch failure with SQS driven Lambda functions

2022-05-09T00:00:00+00:00

When you have a Lambda function that is driven by an SQS queue, like this, your function receives messages in batches of up to ten by default. Your handler can look like this, handling all of the messages in a single event in a for loop.

def lambda_handler(event: dict, context: dict) -> None:
    for msg in event["Records"]:
        handle_msg(msg)

If you do nothing and all messages are handled without exceptions, the messages will be deleted from the SQS queue automatically for you. Logically, if there is an error, the messages will not be deleted. They will be put back on the queue or, depending on the configuration, sent to the dead letter queue. But if your handler function raises an exception, the whole batch is marked as failed, including the messages that were already processed. This is typically not the behaviour you want, and to solve this you have to delete messages from the queue or put them back yourself, keeping track of the failed and successfully handled messages. This is not very obvious, and you can find some questions on how to handle this properly here and here.

AWS introduced a new possibility for handling this pretty recently, in December 2021. If you include the failed messages in the Lambda response under the key batchItemFailures, only those will be put back on the queue (or sent to the dead letter queue). In Python, it looks like this.

def lambda_handler(event: dict, context: dict) -> dict:
    batch_item_failures = []  # identifiers of the messages that failed

    for msg in event["Records"]:
        try:
            handle_msg(msg)
        except Exception:  # more specific is better
            # each failure is reported as {"itemIdentifier": <messageId>}
            batch_item_failures.append({"itemIdentifier": msg["messageId"]})

    return {
        "batchItemFailures": batch_item_failures
    }
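To try the response shape locally, here is a minimal sketch. The `handle_msg` function and the fake event are hypothetical stand-ins that I made up for illustration. Each failure is reported as a dict with an `itemIdentifier` key holding the message id, and the event source mapping needs `ReportBatchItemFailures` enabled before Lambda interprets the response this way.

```python
def handle_msg(msg: dict) -> None:
    """Hypothetical message handler: rejects bodies that are not numeric."""
    int(msg["body"])  # raises ValueError for non-numeric bodies


def lambda_handler(event: dict, context: object) -> dict:
    batch_item_failures = []
    for msg in event["Records"]:
        try:
            handle_msg(msg)
        except ValueError:
            # Report only this message as failed.
            batch_item_failures.append({"itemIdentifier": msg["messageId"]})
    return {"batchItemFailures": batch_item_failures}


# Fake SQS event with one bad message, for local testing only.
fake_event = {
    "Records": [
        {"messageId": "id-1", "body": "1"},
        {"messageId": "id-2", "body": "not a number"},
        {"messageId": "id-3", "body": "3"},
    ]
}
response = lambda_handler(fake_event, None)
print(response)
```

Only the second message ends up in the response, so only that one would be retried.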

Reading large tiles from S3 directly with `rasterio`

2022-05-06T00:00:00+00:00

At SpotR, we make heavy use of raster data containing gridded height measurements.

When working in Python, the rasterio package is useful. This package is essentially a more pythonic binding to the GDAL library, as explained in their introduction. The file below was obtained from data.gov.uk and shows a 1x1 km patch of height measurements in the UK. The resolution is 1000x1000 pixels, and every pixel represents the (maximum) height of a 1x1 m square. The lighter dots along the top of the image are houses, and there are some ragged parts where there is no data. In the middle there is a depression, which could be a riverbed, and at the bottom the terrain rises a little.

import rasterio
import matplotlib.pyplot as plt
import numpy as np

with rasterio.open("/tmp/sd9863_DSM_1M.tiff") as dataset:
    heights = dataset.read(1)
    nodata = dataset.nodata  # read while the dataset is still open

fn = "images/heights.png"

# replace nodata values with nan
heights[heights == nodata] = np.nan

plt.imshow(heights)
plt.title("Example of a 1km x 1km raster")
plt.tight_layout()
plt.savefig(fn)
fn

Raster tiles, typically GeoTIFF files, can become quite large. The grid above takes up ~4 MB as an uncompressed GeoTIFF file, down from 6.5 MB as an .asc file, which is a simple text-based format. Compression techniques like DEFLATE and LZW can bring the size of the data down further. It is possible to convert rasters with rasterio, but the gdal_translate utility is the tool for the job.

gdal_translate /tmp/sd9863_DSM_1M.asc /tmp/sd9863_DSM_1M.tiff > /dev/null
gdal_translate /tmp/sd9863_DSM_1M.asc /tmp/sd9863_DSM_1M_lzw.tiff -co COMPRESS=LZW > /dev/null
gdal_translate /tmp/sd9863_DSM_1M.asc /tmp/sd9863_DSM_1M_def.tiff -co COMPRESS=DEFLATE > /dev/null
gdal_translate /tmp/sd9863_DSM_1M.asc /tmp/sd9863_DSM_1M_def_pred.tiff -co COMPRESS=DEFLATE -co PREDICTOR=2 > /dev/null
ls -lha /tmp/sd*

-rw-rw-r-- 1 gijs gijs 6,5M jun 13  2018 /tmp/sd9863_DSM_1M.asc
-rw-rw-r-- 1 gijs gijs 1,1M mei  9 09:11 /tmp/sd9863_DSM_1M_def_pred.tiff
-rw-rw-r-- 1 gijs gijs 1,5M mei  9 09:11 /tmp/sd9863_DSM_1M_def.tiff
-rw-rw-r-- 1 gijs gijs 1,8M mei  9 09:11 /tmp/sd9863_DSM_1M_lzw.tiff
-rw-rw-r-- 1 gijs gijs 3,9M mei  9 09:11 /tmp/sd9863_DSM_1M.tiff

Interestingly, all compression techniques I used here are lossless. GDAL also has JPEG-based compression, but it is meant for 8-bit unsigned data, in other words images, so these height measurements, which are stored as floating point numbers, cannot be stored using JPEG compression. I can definitely think of some use cases where some distortion of these measurements is fine, as long as it's bounded somehow, but I haven't come across examples of lossy compression for rasters of floating point numbers.

Partial reads

Compression can save us almost an order of magnitude, but to store this data at our scale, things still add up. I live in the Netherlands, which has an area of 41,543 km². That's 40k+ tiles at 1 MB+ each, roughly 50 GB in total. Perfect to store on cloud storage such as S3.

aws s3 ls s3://heights-tiles/tiles/sd980

2022-04-29  23:08:55  2903641  sd9800_DSM_1M.tiff 
2022-04-29  23:08:54  2871755  sd9801_DSM_1M.tiff 
2022-04-29  23:08:54  2938302  sd9802_DSM_1M.tiff 
2022-04-29  23:08:55  2719476  sd9803_DSM_1M.tiff 
2022-04-29  23:08:55  2643684  sd9804_DSM_1M.tiff 
2022-04-29  23:08:55  2533681  sd9805_DSM_1M.tiff 
2022-04-29  23:08:55  2715498  sd9806_DSM_1M.tiff 
2022-04-29  23:08:55  2818095  sd9807_DSM_1M.tiff 
2022-04-29  23:08:55  2755601  sd9808_DSM_1M.tiff 
2022-04-29  23:08:56   468739  sd9809_DSM_1M.tiff

When doing a calculation, we're typically not interested in the whole tile. For example, we may only want to know the height of a single pixel in the raster file. In that case we can avoid downloading the whole file by doing a partial read. This is possible because S3 supports reading byte ranges of an object, and GDAL supports reading over a network through its virtual file systems.

Depending on how large your tiles are, this can make a big difference. Let's benchmark this.

import rasterio
from rasterio.windows import Window

with rasterio.open("s3://heights-tiles/tiles/sd9800_DSM_1M.tiff") as raster:
    # Window(col_off, row_off, width, height): read only the pixel at (500, 500)
    dt = raster.read(1, window=Window(500, 500, 1, 1))

time python src/read_raster_window.py

real    0m17,300s
user    0m3,026s
sys     0m1,038s

Wait a minute... 17 seconds is still a long time. It turns out that GDAL will scan the whole folder for other files before opening a file. This behaviour makes sense because geodata files are often accompanied by other files that contain information about the transformation, possibly some indexes and more. We can disable it by setting an environment variable.

time GDAL_DISABLE_READDIR_ON_OPEN=YES python src/read_raster_window.py

real    0m1,230s
user    0m0,400s
sys     0m0,948s

Adjusting age group for local vaccination rate (2)

2021-10-09T00:00:00+00:00

Adjusting percentages for local vaccination rate

This is an answer to a question on Stats Overflow.

I want to estimate the probability of a person aged 40-49 in Delaware to be vaccinated, but I only have nationwide statistics on vaccination levels by age, and a level of vaccination in Delaware, but no age breakdown for that state. So the question is, how can we combine those percentages, for the agegroup and Delaware, into a specific percentage for that agegroup in Delaware?

I tried doing that in my previous post, but I have since come up with a more straightforward method.

To begin, from the official statistics, we get the percentage of people vaccinated in Delaware, which is 56.6%. Let \(D\) be the total population of Delaware. Then there are \(0.566 \cdot D\) vaccinated persons in Delaware.

People in the age group 40-49 make up 12.2% of the US population, but 14.2% of the people vaccinated. Let's assume these percentages hold in Delaware as well.

Then the total number of people aged 40-49 living in Delaware is \(0.122\cdot D\). And the number of people vaccinated aged between 40-49 is 14.2% of vaccinated subjects. So the final percentage is

\[ \frac{0.142 \cdot 0.566 \cdot D}{0.122 \cdot D} = \frac{.142 \cdot .566}{0.122} \approx 65.9\% \]
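As a quick check of the arithmetic, the same calculation in Python. The variable names are my own, and the population \(D\) does not appear because it cancels out.

```python
del_vacc_p = 0.566      # share of Delaware's population that is vaccinated
age_pop_share = 0.122   # share of the population aged 40-49
age_vacc_share = 0.142  # share of vaccinated people aged 40-49

# The Delaware population D cancels out, so it never needs to be known.
age_del_vacc_p = age_vacc_share * del_vacc_p / age_pop_share
print(f"{age_del_vacc_p:.1%}")
```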

Adding log odds to combine statistics

2021-10-07T00:00:00+00:00

Adding log odds to combine statistics

This is an answer to a question on Stats Overflow.

I will need to make some independence assumptions, notably, that the age distribution of vaccinations is the same in Delaware. See below for another assumption I have to make to work with the provided data.

The method I use is to manually calculate the coefficients in a logistic regression model. As you will see, what happens is that we cannot add and subtract percentages directly, but we can add and subtract logodds.

Logistic regression model

To begin, we need two percentages from the official statistics: the nationwide (full) vaccination rate, and the rate in Delaware.

us_vacc_p = .561
del_vacc_p = .566

0.566

I'm going to be using the following functions. The programming language I'm using is Julia, but I'm using only two basic functions and assignments so the code is going to be pretty much the same as in Python or R.

function logit(p)
     log(p / (1 - p))
end

function logistic(x)
    exp(x) / (exp(x) + 1)
end

The logit function calculates the so called log odds of a probability.

us_vacc_logodds = logit(us_vacc_p)

0.24522149244752528

The logistic function inverts this operation. A logistic regression model for this looks like

\[ \text{logit}(p) = \text{base} \]

for the general population, and

\[ \text{logit}(p) = \text{base} + \text{coefficient for Delaware} \]

for persons living in Delaware, where \(p\) is the probability of that person being vaccinated. Because the logistic function is the inverse of the logit function, we can calculate \(p\), the probability we are after, with the formula

\[ p = \text{logistic}\left(\text{base} + \text{coefficient for Delaware}\right) \]

Now the trick is that we can manually calculate the coefficient for Delaware using the following formula.

del_vacc_coef = logit(del_vacc_p) - logit(us_vacc_p)

0.020328051655252644

To check this, let's use this model to calculate the vaccination probability of the general us population,

logistic(us_vacc_logodds)

0.561

and for Delaware we use the coefficient as well, and we get

logistic(us_vacc_logodds + del_vacc_coef)

0.566

Now the next step is to calculate the coefficient for the age group, and add that to our model as well.

Calculating the age coefficient for 40-49

The official statistics aren't yet in the form we need them. On the graphs, you can find that 14.2% of those vaccinated are in the age group 40-49. What we want to know is how many in this age group are vaccinated. A complication here is that only 91% of those vaccinated have reported their age. We need another assumption here, namely that this nonresponse is independent of age group. If we assume that, we know that 14.2% of all vaccinated are in the age group 40-49.

vac_n = 186387228           # total number of vaccinated
age_vacc_n = .142 * vac_n   # in the age group 40-49

26466986.376

Also, we need the total number of people in the US in this age group, which isn't listed directly either. From the graph, it's 12.2% of the total population. The total population isn't listed either, but 56.1% of the population is vaccinated, so we can calculate the total population from that.

us_n = vac_n / .561

332241048.1283422

So the percentage vaccinated in the age group 40-49 is

age_n = .122 * us_n
age_p = age_vacc_n / age_n

0.6529672131147541

Converting this to log odds, the calculation of the coefficient for the age group 40-49 is the same as earlier for Delaware

age_vacc_coef = logit(age_p) - logit(us_vacc_p)

0.38688616375890655

In the final calculation I combine the age-based coefficient with the coefficient for Delaware. This is the step where I need the assumption that the age distribution is the same in Delaware.

logistic(us_vacc_logodds + del_vacc_coef + age_vacc_coef)

0.6575591337731559

So with the listed assumptions I estimate the probability of a person aged 40-49 living in Delaware to be vaccinated at 65.7%.

It is interesting to compare this to the original probabilities, with 65.3% of this age group being vaccinated in general, which is then corrected by comparing the 56.6% Delaware population average to the 56.1% general us population average.
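For readers who prefer Python, here is a sketch of the whole calculation in one place. It is a port of the Julia snippets above, with one shortcut of my own: `age_p` is written as `0.142 * us_vacc_p / 0.122`, which is the same ratio as `age_vacc_n / age_n` because the total number of vaccinated people cancels out.

```python
import math


def logit(p: float) -> float:
    """Log odds of a probability."""
    return math.log(p / (1 - p))


def logistic(x: float) -> float:
    """Inverse of logit: turns log odds back into a probability."""
    return math.exp(x) / (math.exp(x) + 1)


us_vacc_p = 0.561
del_vacc_p = 0.566
age_p = 0.142 * us_vacc_p / 0.122  # share of 40-49 year olds that is vaccinated

# Coefficients are differences in log odds relative to the US baseline.
del_vacc_coef = logit(del_vacc_p) - logit(us_vacc_p)
age_vacc_coef = logit(age_p) - logit(us_vacc_p)

estimate = logistic(logit(us_vacc_p) + del_vacc_coef + age_vacc_coef)
print(round(estimate, 3))
```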

Thanks for reading! If you want to reach out, post an issue to the Github repository of this website or contact me on Twitter!

Using Futures and the ProcessPoolExecutor in python

2021-09-16T00:00:00+00:00

When to use `ProcessPoolExecutor`

Using the ProcessPoolExecutor in concurrent.futures is a quick way to divide your workload over multiple processes. This is useful if you have a couple of tasks that you want to run in parallel to save time. Compared to the ThreadPoolExecutor, the process pool is a bit more primitive. Basically, the whole process is forked into multiple copies that each do their own business, with the concurrent.futures.ProcessPoolExecutor taking care of cleaning up and basic communication between the tasks.

	`ProcessPoolExecutor`	`ThreadPoolExecutor`
strength	Parallel CPU bound tasks	Parallel IO bound tasks
weakness	Memory usage	Limited to single CPU due to GIL

You should use the ProcessPoolExecutor over the ThreadPoolExecutor if your tasks are CPU bound. The weakness of copying the process is that you also copy its memory, which may add up, and there is a bit more overhead compared to splitting into threads. The big advantage is that each process can use up to 100% of a single CPU core, while, due to the limitations of the Global Interpreter Lock, multiple threads will not saturate multiple CPUs. There are a couple of holes in this simplified model. For example, code can sometimes release the GIL, notably numpy code, in which case threads can also effectively use multiple CPUs. But for now, let's not worry about those details too much and learn how to use multiprocessing easily in Python.

Basic multiprocessing with `os.fork`

First, to get started, have a look at this demonstration of os.fork. In practical terms, this duplicates the running process. Two copies of the same program will run, basically identical, except for their pid, their process id.

cat ./src/fork.py

import os

os.fork()

print(os.getpid())

python3 ./src/fork.py

18359
18360

It is definitely possible to write some multiprocessing code directly on this primitive system.

cat ./src/fork_mp.py

import os
import time
import sys

tasks = list(range(10))

part_a = tasks[:5]
part_b = tasks[5:]

res = os.fork()

if res == 0:
    # child process: os.fork returns 0 here
    for task in part_a:
        print(f"I am process {os.getpid()} working on task {task}")
        time.sleep(.2)
        sys.stdout.flush()
else:
    # main process: os.fork returns the pid of the child here
    for task in part_b:
        print(f"I am process {os.getpid()} working on task {task}")
        time.sleep(.2)
        sys.stdout.flush()

This program divides the tasks between the two processes. I added a time.sleep(.2) to every task, so the tasks take two seconds in total. The script however takes approximately one second to run, saving about one second over running the tasks in a single process.

We use the return value of os.fork to determine which of the processes we are in: if the result is 0, we are in the child process, and if the result is different (it is the pid of the child), we know we are in the main process.

python3 ./src/fork_mp.py

I am process 18377 working on task 0
I am process 18376 working on task 5
I am process 18377 working on task 1
I am process 18376 working on task 6
I am process 18377 working on task 2
I am process 18376 working on task 7
I am process 18376 working on task 8
I am process 18377 working on task 3
I am process 18376 working on task 9
I am process 18377 working on task 4
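To make the return value of os.fork concrete, here is a small POSIX-only sketch. The `fork_roles` helper and the pipe are my own additions for illustration: the child, where os.fork returns 0, writes a message into the pipe and exits, while the parent, where os.fork returns the child's pid, reads the message and waits for the child.

```python
import os


def fork_roles() -> tuple[str, str]:
    """Fork once; return the parent's role and the message sent by the child."""
    read_fd, write_fd = os.pipe()
    res = os.fork()
    if res == 0:
        # os.fork returned 0: this is the child process.
        os.close(read_fd)
        os.write(write_fd, b"child")
        os.close(write_fd)
        os._exit(0)  # leave immediately, without running the rest of the script
    # os.fork returned the child's pid: this is the parent (main) process.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        message = pipe.read().decode()
    waited_pid, _status = os.waitpid(res, 0)  # reap the child
    assert waited_pid == res
    return "parent", message


roles = fork_roles()
print(roles)
```

Calling os.waitpid in the parent is the kind of bookkeeping that the ProcessPoolExecutor takes off your hands.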

These examples show at a lower level than the ProcessPoolExecutor how multiprocessing works. However, if you want to extend the latter approach into working code that deals with failures, passes the tasks to the processes consistently and also collects the results, you can see it'll get quite a bit more complicated. Enter the ProcessPoolExecutor, which does all those hard things for you!

cat ./src/ppool_demo.py

from concurrent.futures import ProcessPoolExecutor
import time
import os

tasks = range(10)
start = time.time()


def do_work(task):
    print(f"I am process {os.getpid()} working on task {task}")
    time.sleep(.2)


with ProcessPoolExecutor(max_workers=4) as pool:
    for task in tasks:
        pool.submit(do_work, task)

print(f"Main process done after {time.time() - start:.2f}s")

In only three lines you can start 4 workers that divide the work as evenly as possible.

pool.submit sends a task to one of the workers
The context manager (with block) waits until all the workers are done before proceeding

This example takes 0.6 seconds. We have four workers, 10 tasks, so some workers get 2 tasks and some get 3 tasks, and the workers with 3 tasks take 3 * 0.2 seconds to finish.

python ./src/ppool_demo.py

Note that in the output, the processes seem ordered. However, this is a subtle effect of the buffering of stdout. Each process has its own buffer and flushes all of its output in one go, making it look like the processes run in succession. You can suppress this behaviour by manually triggering the flushes, as I showed earlier with sys.stdout.flush().

Here, it is crucial that pool.submit is non-blocking: when you call it, the main process doesn't wait until the worker is done. This allows us to schedule all the work to the workers quickly.

There are two things I want to explain in this post

How you can collect return values of the tasks (using concurrent.futures.Future)
What happens if workers run into an exception and how you can deal with it

Collecting return values

In the examples above, I simply fired off the tasks and showed that they were doing something by printing statements to stdout. But in a practical situation, you typically want to collect the results of the work. To do that with a ProcessPoolExecutor, you will need to deal with concurrent.futures.Future. If you have experience with JavaScript, for example, dealing with futures is very common, but you can write quite a bit of Python code without encountering them.

from concurrent.futures import ProcessPoolExecutor

def work(word):
    print(word)
    return len(word)

with ProcessPoolExecutor(max_workers=1) as pool:
    result = pool.submit(work, "hello")
    print(result)

The result of a pool.submit is an instance of Future. A Future is a reference to work in progress. Its most fundamental method is Future.result. From the official documentation

Return the value returned by the call. If the call hasn’t yet completed then this method will wait up to timeout seconds. If the call hasn’t completed in timeout seconds, then a concurrent.futures.TimeoutError will be raised. timeout can be an int or float. If timeout is not specified or None, there is no limit to the wait time.

If the future is cancelled before completing then CancelledError will be raised.

If the call raised an exception, this method will raise the same exception.

It is important to understand that this method is blocking, as opposed to the pool.submit method I used earlier. A mistake I have seen often is the following:

cat ./src/block_mistake.py

from concurrent.futures import ProcessPoolExecutor
import time

tasks = range(10)
results = list()
start = time.time()


def do_work(task):
    time.sleep(0.1)
    return task ** 2


with ProcessPoolExecutor(max_workers=10) as pool:
    for task in tasks:
        future = pool.submit(do_work, task)
        results.append(future.result())  # collect results

print(f"Done after {time.time() - start}")

Can you spot the mistake? The problem is that before scheduling the next task, the main process waits for the result of the task just scheduled. This script takes 1 second to run, because in effect, all tasks are run in succession.

python ./src/block_mistake.py

Instead, the results should be collected after the process pool context is done scheduling the tasks.

cat ./src/block_fixed.py

from concurrent.futures import ProcessPoolExecutor
import time

start = time.time()

tasks = range(10)
futures = list()


def do_work(task):
    time.sleep(0.1)
    return task ** 2


with ProcessPoolExecutor(max_workers=10) as pool:
    for task in tasks:
        futures.append(pool.submit(do_work, task))

results = [future.result() for future in futures]

print(results)
print(f"Done after {time.time() - start}")

python ./src/block_fixed.py
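A variation on the fixed script is to handle results in the order in which they finish, using concurrent.futures.as_completed. The sketch below is not part of the original scripts. The helper name `squares_as_completed` is hypothetical, the builtin `pow` stands in for `do_work`, and the helper takes any `Executor` so that it works with both pool types.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, as_completed


def squares_as_completed(pool: Executor, tasks) -> dict:
    """Submit pow(task, 2) for every task and gather results as they finish."""
    # Schedule everything first; pool.submit does not block.
    future_to_task = {pool.submit(pow, task, 2): task for task in tasks}
    results = {}
    # as_completed yields each future once it is done, in completion order.
    for future in as_completed(future_to_task):
        results[future_to_task[future]] = future.result()
    return results


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(squares_as_completed(pool, range(10)))
```

Because every future yielded by as_completed is already done, the call to future.result() inside the loop returns immediately.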

Dealing with exceptions in child processes

If a child process raises an unhandled exception, this exception is passed to the main process when calling Future.result. In the example below, you can see how to catch those errors when unpacking the results of the process workers.

cat ./src/error_example.py

from concurrent.futures import ProcessPoolExecutor
from random import random

tasks = range(10)
futures = list()


def do_work(task):
    if random() > .5:
        return task ** 2
    else:
        raise Exception("OW!")


with ProcessPoolExecutor(max_workers=10) as pool:
    for task in tasks:
        futures.append(pool.submit(do_work, task))

results = list()

for future in futures:
    try:
        results.append(future.result())
    except Exception as e:
        results.append(f"Failed with {e}!")

print(results)

python ./src/error_example.py

[0, 1, 4, 9, 16, 'Failed with OW!!', 'Failed with OW!!', 'Failed with OW!!', 64, 81]
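An alternative to try/except is Future.exception, which returns the exception raised by the call, or None if it succeeded. The sketch below is my own illustration, not one of the original scripts. The helper name `run_all` is hypothetical, and `math.sqrt` with a negative input stands in for a failing task.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
import math


def run_all(pool: Executor, tasks) -> list:
    """Return each task's result, or a message describing why it failed."""
    futures = [pool.submit(math.sqrt, task) for task in tasks]
    outcomes = []
    for future in futures:
        error = future.exception()  # blocks until done; None if the call succeeded
        if error is None:
            outcomes.append(future.result())
        else:
            outcomes.append(f"Failed with {error}!")
    return outcomes


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        # math.sqrt raises ValueError for negative input.
        print(run_all(pool, [4, -1, 9]))
```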


Pomodoro timer in Elisp

2021-09-09T00:00:00+00:00

I wrote a pomodoro timer in Elisp. Elisp is the language of Emacs, the 40 year old editor that is still going! Elisp is to Emacs what JavaScript is to Visual Studio Code, if that means more to you ;).

The pomodoro technique is new to me. It is a time management technique that consists of five simple steps.

Get a to-do list and a timer.
Set your timer for 25 minutes, and focus on a single task until the timer rings.
When your session ends, mark off one pomodoro and record what you completed.
Then enjoy a five-minute break.
After four pomodoros, take a longer, more restorative 15-30 minute break.

The functionality consists of two global variables and two functions. I use a variable for the timer itself, and a variable that can be used to customize the time. I use defvar, which allows me to add documentation to a variable as well. I added all the code to init.el.

(defvar pommodoro-timeout "25m" "Duration of a pommodoro timer")
(defvar pommodoro-current-timer nil "The current pommodoro timer")

Then I have two functions that I can call when in Emacs. I have bound these to F2 and F3, so that they are really easy to reach. With pommodoro-start-timer I am prompted for a title and then a timer is set with that title. With pommodoro-show-timer a message appears in the minibuffer that reminds me of the current timer and when it will be finished.

(defun pommodoro-start-timer ()
  "Start a pommodoro timer which shows a notification after `pommodoro-timeout'."
  (interactive)
  (catch 'cancel
    ;; If a timer is still pending, ask before replacing it.
    (when (and pommodoro-current-timer
               (time-less-p (current-time)
                            (timer--time pommodoro-current-timer)))
      (if (yes-or-no-p "There is a current pommodoro running, do you want to cancel it? ")
          (cancel-timer pommodoro-current-timer)
        (throw 'cancel t)))
    (setq pommodoro-current-timer
          (run-at-time pommodoro-timeout nil #'shell-command
                       (format "notify-send -i messagebox_info -u critical 'Pommodoro done' %s"
                               ;; Quote the description so that spaces and shell
                               ;; metacharacters are passed as a single argument.
                               (shell-quote-argument
                                (read-string "Task description: ")))))))

This function sets a timer with run-at-time, based on a description that you have to enter (read-string). There is one additional piece of functionality: if there is a currently running timer, I am asked to confirm that I want to cancel it. The notification is sent with shell-command, using the Linux utility notify-send. There is a package notifications in Emacs which works well, except for one thing: I couldn't get the notifications to show when in full-screen mode.

This is the function for seeing how much time there is left in the current pomodoro, which is very simple at the moment.

(defun pommodoro-show-timer ()
  (interactive)
  (message "Pommodoro done at %s"
     (format-time-string "%T" (timer--time pommodoro-current-timer))))

And finally the keybindings are set with

(global-set-key (kbd "<f2>") #'pommodoro-start-timer)
(global-set-key (kbd "<f3>") #'pommodoro-show-timer)

I'll be trying to stick with the technique and this home-brewed functionality for a while! Let's see how it works. Three improvements I want to add right away are

The pommodoro-show-timer function should show the name of the current task as well
The icon should be a tomato
The pommodoro-show-timer function should show the relative time until running out instead of the actual time ("5m to go!" instead of "done at 22:04:23")


Popping balloons (2)

2021-09-08T00:00:00+00:00

Expected value of a single balloon

This is a follow up to the previous post on the same game

Let us assume, as before, that the balloon pops at a certain level \(M\), and we believe \(M \sim \text{uniform}(1, 21)\). \(M = 1\) means the balloon will pop immediately after we pump it once.

I figured out in the last post that the optimal strategy in this case is to pump 10 or 11 times, then stop. But what is then the expected value of this game? There are two options: either we reach 10 pumps without popping the balloon, or it pops before we get there. Let's call our cashout \(B\), and the expected value is

\begin{align} \mathbb{E}\left(B\right) &= \mathbb{P}\left(\text{reach 10 pumps}\right) \cdot 10 \\ &= \mathbb{P}(M > 10) \cdot 10 = 5 \end{align}

This is a simple formula: the expected value is the probability of reaching the goal times the goal set. Using this, we can easily evaluate other strategies. For example, what if our strategy is to pump 11 times?

11 * (9 / 20)

4.95

Plotting the expected value for all strategies, it follows a parabola, topping out at 10 pumps as calculated before.

using Plots

bar(x -> (x * (20 - x) / 20), 0:20, labels="expected value", color="purple")
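The values behind this bar chart can also be tabulated directly. Here is a small Python sketch of the same formula, \(x \cdot (20 - x) / 20\). The function and variable names are my own.

```python
def expected_value(target: int, max_pumps: int = 20) -> float:
    """Expected cashout when stopping at `target` pumps: x * (20 - x) / 20."""
    return target * (max_pumps - target) / max_pumps


values = {target: expected_value(target) for target in range(21)}
best = max(values, key=values.get)  # strategy with the highest expected value
print(best, values[best])
```

The maximum is at 10 pumps with an expected value of 5, and stopping at 9 or 11 pumps both give 4.95.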

A different distribution for \(M\)

Now what if I believe \(M\) to follow a Poisson distribution, and let's take 11 as the parameter as an example. First a plot of the distribution of \(M\).

using Distributions

belief = Poisson(11)

bar(x -> pdf(belief, x), -10:30, labels="probability", color="brown")

And these are the expected values of playing on until a certain payoff

belief = Poisson(11)

expected = x -> (1 - cdf(belief, x)) * x

bar(expected, 0:20, labels="expected value", color="yellow")

The optimal strategy is to play 8 rounds. The interesting thing is that the two underlying distributions I analyzed have the same average, but with a Poisson distribution it is optimal to stop before reaching the average round at which the balloon pops.

mean(Uniform(1, 21)), mean(Poisson(11))

11.0
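The optimum under the Poisson belief can be double-checked without any plotting. This is a sketch in plain Python that mirrors the formula `(1 - cdf(belief, x)) * x` used above. The helper `poisson_cdf` is my own and simply sums the probability mass function.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """P(M <= k) for M ~ Poisson(lam), summed directly from the pmf."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))


def expected_value(target: int, lam: float = 11) -> float:
    """Expected cashout when stopping at `target`: (1 - cdf(target)) * target."""
    return (1 - poisson_cdf(target, lam)) * target


values = {target: expected_value(target) for target in range(21)}
best = max(values, key=values.get)  # strategy with the highest expected value
print(best, round(values[best], 2))
```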

Summary

In this post I used a much more straightforward formula to find the optimal strategy, and was able to calculate expected payoffs for different strategies for two different distributions. How the shape of the distribution influences the optimal strategy is interesting, can this be generalized to other distributions? Thanks for reading this post, if you want to reach out, post an issue to the Github repository of this website or contact me on Twitter!

Popping balloons

2021-08-27T00:00:00+00:00

Popping balloons risk assessment game

Today I played an interesting game as part of a test at work. This game was for testing my risk aversion and risk-assessment skills.

In the game, you are pumping balloons one at a time. At any moment there are two options.

Cash in the balloon
Pump the balloon

If you cash in the balloon, the amount you gain is the number of times the balloon was pumped. Then you get a new balloon to try. If you pump the balloon, it may pop, in which case you get nothing and a new balloon starts. If it doesn't pop, you can choose again with the same balloon. You have a starting total of 30 balloons and the target is to maximize your gains.

The risk comes from the fact that you don't know how many pumps the balloon can take.

Optimal strategy

Can we come up with an optimal strategy for this game? The problem is a bit like the Multi-armed bandit problem, perhaps even equivalent to some form of it, but I am not an expert on the topic. In this post I want to analyze a couple of strategies and assumptions of the pumping balloon game.

What makes this problem really hard is the tradeoff between exploration and getting as much out of the current balloon as possible. In the game, it is probably worth sacrificing a couple of balloons to learn about how many pumps they can take. But how many you want to sacrifice will depend on the total number of balloons you can spend. So first, let's calculate some numbers on the best strategy if you have only one balloon. I will leave the analysis of the exploration tradeoffs for some other time.

Uniform prior

If you know the exact number of pumps a balloon can take, the optimal strategy is easy, you just pump until one below its maximum. Let's call the maximum pressure the balloon can take M. Now let's assume you have some kind of idea of M, you don't know what it is, but you have a "prior" belief about M. For example, you believe the maximum pressure is not above 20, but any number below is equally likely.

using Plots
using Distributions

belief = Uniform(1, 21)
bar(x -> pdf(belief, x), -10:30, labels="probability")

There is one situation in which you are absolutely sure you want to cash in: when you have pumped 20 times. If you are at 19, what is your updated belief about \(M\)? You've discarded all the possibilities \(M \le 19\), and the two remaining cases are equally likely, so

\[ \mathbb{P}(M = 21) = \mathbb{P}(M = 20) = \frac{1}{2} \]

Now the expected value of cashing in is exactly 19, and the expected value of pumping is 10,

\begin{align} \mathbb{E}(\text{pump}) = \mathbb{P}(M = 20) \cdot 0 + \mathbb{P}(M = 21) \cdot 20 = \frac{1}{2}\cdot 20 = 10 \end{align}

Let's do this calculation for 18. Your updated belief is that the probability that it will break on the next pump is \(\frac{1}{3}\). The expected value of a pump is

2 / 3 * 19

12.666666666666666

So still not worth it. Let's generalize a bit, if you have pumped \(i\) times, there is a 1 in \(21 - i\) probability it will break on the next try. Let's assume that after that try, it is not worth playing on. Then the expected value of trying is \(\frac{20 - i}{21 - i}\cdot (i + 1)\), which you have to weigh against the immediate payoff of \(i\).

scatter(i -> (i + 1) * (20 - i) / (21 - i), 0:20, labels="Pump")
scatter!(i -> i, 0:20, labels="Cash")

From the graph, you can see that the break-even occurs at 10 pumps, which makes intuitive sense. At that point, taking the risk and cashing in have an equal payoff.
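The break-even point can also be confirmed with exact arithmetic. This Python sketch compares the two quantities plotted above, \(\frac{20 - i}{21 - i}\cdot (i + 1)\) for pumping and \(i\) for cashing in. The function name `pump_value` is my own.

```python
from fractions import Fraction


def pump_value(i: int) -> Fraction:
    """Expected value of pumping once more after i pumps, then cashing in."""
    return Fraction((i + 1) * (20 - i), 21 - i)


# Compare against the immediate payoff i for every number of pumps.
break_even = [i for i in range(21) if pump_value(i) == i]
pump_better = [i for i in range(21) if pump_value(i) > i]
print(break_even, pump_better)
```

Pumping has the higher expected value for 0 up to 9 pumps, the two are equal at exactly 10, and cashing in is better from 11 onwards.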

Summary

An interesting game, and there is a lot more to explore, even for playing only a single balloon! For example, what is the expected value of playing the game? And can we analyze other distributions, such as the Poisson distribution? Thanks for reading this post, if you want to reach out, post an issue to the Github repository of this website or contact me on Twitter!

Completing the Advent of code 2020 with Julia

2021-01-15T08:30:00+00:00

I just finished all 25 puzzles of Advent of Code 2020 (https://adventofcode.com/2020/). It was fun, I learned things and I took the chance to use the Julia programming language. Prior to this project, my experience with the language was working through a couple of chapters in Quantecon. I had also not participated in the Advent of Code competition before.

Advent of Code

This is a competition consisting of 25 puzzles, released at midnight EST/UTC-5 every day starting December 1st, until December 25th. It has been running for a couple of years now. This year, 130k participants sent in the answer to the first puzzle, and around 10% of them persevered and finished all of the puzzles. There is a competition where every morning, the first 100 solutions get points awarded. My personal goal was to just finish the puzzles; following the strict time schedule would be an additional challenge I am not up for at this time.

On December 2nd, it took “goffrie” 1 minute and 47 seconds to solve the second puzzle, this one https://adventofcode.com/2020/day/2. That’s pretty incredible, just understanding it takes me 5 minutes. However, most puzzles are tricky but not hard, and can be finished well within one hour.

Julia

In my job, I am primarily a Python programmer, and I sometimes use R for visualization or model fitting specifically. I have been following and experimenting with Julia over the last two years, but I haven’t started using it professionally. Python and pytorch have been very effective for us. After this experience, I have become even more convinced that Julia is a candidate for replacing Python as the most popular language for data science. In general, its advanced compilation system is an advantage. And as I’m learning and working with the language, it just feels very well designed to me.

The combination of Advent puzzles with Julia worked out very well. I think it’s a great language for solving these problems, which is impressive, because Julia is designed for numerical and scientific computation. My solutions can be found on a GitHub repository.

Below are some additional thoughts.

Learning path

As an example of my learning: the problems all start with reading and parsing some input. Day one required reading an input file with some numbers.

; head -5 ./input.txt

On the first day, I used this

x = open(io -> read(io, String), "input.txt")
y = map(s -> parse(Int, x[s.offset+1:s.offset+s.ncodeunits]), split(strip(x), "\n"))
y[1:5]

5-element Array{Int64,1}:
 2000
   50
 1984
 1600
 1736

Then I figured out a nice system with the do keyword

lines = open("./input.txt") do io
    return split(strip(read(io, String)), "\n")
end

lines[1:5]

5-element Array{SubString{String},1}:
 "2000"
 "50"
 "1984"
 "1600"
 "1736"

After some more iterations, I figured out I could just do this

numbers = parse.(Int, readlines("input.txt"))
numbers[1:5]

5-element Array{Int64,1}:
 2000
   50
 1984
 1600
 1736

Nice!

Workflow

I used emacs with the julia-repl-mode plugin. This worked well, except that I used a lot of Ctrl+Enters for sending functions; I should just take the time to figure out a way to do that more efficiently. Compilation time was never a problem. I occasionally had to restart the kernel because I wanted to redefine types.

# the testing macro worked great in this setup, I put in a lot of tests throughout the code

using Test

function s(a, b)
    a + b
end

@test s(2, 3) == 5

Test Passed

It’s all about arrays

The big attraction of Julia over Python for me is that it allows me to program performant algorithms in the language itself. In Python, I always cringe a bit when combining numpy with loops, or having a pd.Series.apply when working with pandas. In Julia, I am allowed to write for loops without feeling guilty, and they are fast!

In Python, there are many packages for fast linear algebra, such as numpy or torch. However, these libraries don’t work together well with other parts of the language or other packages, unless they are specifically designed for working together. Things such as automatic differentiation have to be built on top of an ecosystem. I think of torch and numpy as sublanguages of Python, with their own datatypes, statically typed arrays. That is not a problem in itself maybe, and very effective for the coming years, but I think in the long run, this can never match the possibilities in Julia. In the Python ecosystem, interoperability is organized with brilliant but cumbersome constructs like the __array__ interface. That one works well, but who is familiar with the __geo_interface__ interface? To me, these ideas are all efforts to get performance by using Python’s great flexibility to work around the incompatibility between dynamic typing and a real array type. In short, data comes in the form of big collections, and to work with those fast, I believe you need fixed-length data types. Julia has them.
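For readers who have not met __geo_interface__: it is a convention where an object exposes a GeoJSON-like mapping under that attribute name, so that other libraries can read its geometry without knowing its class. The sketch below only illustrates the idea in plain Python; the `Tile` class and the `bounds` helper are hypothetical and not taken from any library.

```python
class Tile:
    """Hypothetical raster tile footprint, described by its bounding box."""

    def __init__(self, xmin, ymin, xmax, ymax):
        self.xmin, self.ymin, self.xmax, self.ymax = xmin, ymin, xmax, ymax

    @property
    def __geo_interface__(self):
        # GeoJSON-like mapping: a polygon with one closed ring of corners.
        ring = [
            (self.xmin, self.ymin),
            (self.xmax, self.ymin),
            (self.xmax, self.ymax),
            (self.xmin, self.ymax),
            (self.xmin, self.ymin),
        ]
        return {"type": "Polygon", "coordinates": [ring]}


def bounds(obj):
    """Bounding box of any object exposing a Polygon via __geo_interface__."""
    geometry = obj.__geo_interface__
    if geometry["type"] != "Polygon":
        raise ValueError("this sketch only handles polygons")
    xs, ys = zip(*geometry["coordinates"][0])
    return (min(xs), min(ys), max(xs), max(ys))


tile_bounds = bounds(Tile(0, 0, 1000, 1000))
```

The consumer relies purely on the attribute name, which is flexible, but nothing in the language checks that the mapping has the expected shape.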

Short is good

I really liked the conciseness of the . operator, letting the function hello below operate on an array without additional work. And I’m also getting used to omitting the return statement actually. These are just small things that make me love a language. The first 10 days I stubbornly included a return statement, following the Python zen; “Explicit is better than implicit”. The eleventh day I converted and left those prehistoric ideas behind me.

function hello(str)
    "Hello, $(str)!"
end

names = ["Gijs", "Simon", "Geert"]

hello.(names)

3-element Array{String,1}:
 "Hello, Gijs!"
 "Hello, Simon!"
 "Hello, Geert!"

Losing explicit imports

One thing that I like better in Python is its very explicit imports. Whenever you encounter a name in a Python source file, you can know where it came from just by looking at the file. For example, when you want to use a defaultdict in your code, you have two standard ways of getting access to the class. One way would be

from collections import defaultdict

x = defaultdict(int)

or, alternatively

import collections

x = collections.defaultdict(int)

Assuming you avoid from collections import *, which is recommended practice, you will be able to tell from which package the defaultdict was imported. But in Julia, using is more common than import, and there is no direct link between the imports and the names.

using DataStructures
using StatsBase

x = DefaultDict

DefaultDict

In the case above, you don’t know if you are using StatsBase.DefaultDict or DataStructures.DefaultDict. In particular, the code will break if StatsBase starts exporting its own DefaultDict. Unlikely, but I like explicitness, and it just makes it easier to look up definitions, both for me and for my editor.

In Julia, the standard practice is thus the other way around. The explicit imports actually exist and work.

import DataStructures

x = DataStructures.DefaultDict

DefaultDict

There was some interesting discussion on Reddit on this topic just this morning.

Splatted

I use the asterisk in python all the time, even doing something like this.

x = [*range(5)]

I like it better than the ellipsis in Julia.

x = collect(1:45)

+(x...)

Multiple dispatch

At first this may be mistaken for a small nicety, but I think it’s really powerful, and exactly what you need to create flexible and useful systems for data munging. For example, below is the initializer for a DataFrame in the pandas library (source). It’s all type checking, and multiple dispatch would clean up so much of it.

    def __init__(
        self,
        data=None,
        index: Optional[Axes] = None,
        columns: Optional[Axes] = None,
        dtype: Optional[Dtype] = None,
        copy: bool = False,
    ):
        if data is None:
            data = {}
        if dtype is not None:
            dtype = self._validate_dtype(dtype)

        if isinstance(data, DataFrame):
            data = data._mgr

        if isinstance(data, (BlockManager, ArrayManager)):
            if index is None and columns is None and dtype is None and copy is False:
                # GH#33357 fastpath
                NDFrame.__init__(self, data)
                return

            mgr = self._init_mgr(
                data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
            )

        elif isinstance(data, dict):
            mgr = init_dict(data, index, columns, dtype=dtype)
        elif isinstance(data, ma.MaskedArray):
            import numpy.ma.mrecords as mrecords

            # masked recarray
            if isinstance(data, mrecords.MaskedRecords):
                mgr = masked_rec_array_to_mgr(data, index, columns, dtype, copy)

            # a masked array
            else:
                data = sanitize_masked_array(data)
                mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)

        elif isinstance(data, (np.ndarray, Series, Index)):
            if data.dtype.names:
                data_columns = list(data.dtype.names)
                data = {k: data[k] for k in data_columns}
                if columns is None:
                    columns = data_columns
                mgr = init_dict(data, index, columns, dtype=dtype)
            elif getattr(data, "name", None) is not None:
                mgr = init_dict({data.name: data}, index, columns, dtype=dtype)
            else:
                mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)

        # For data is list-like, or Iterable (will consume into list)
        elif is_list_like(data):
            if not isinstance(data, (abc.Sequence, ExtensionArray)):
                data = list(data)
            if len(data) > 0:
                if is_dataclass(data[0]):
                    data = dataclasses_to_dicts(data)
                if treat_as_nested(data):
                    arrays, columns, index = nested_data_to_arrays(
                        data, columns, index, dtype
                    )
                    mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
                else:
                    mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
            else:
                mgr = init_dict({}, index, columns, dtype=dtype)
        # For data is scalar
        else:
            if index is None or columns is None:
                raise ValueError("DataFrame constructor not properly called!")

            if not dtype:
                dtype, _ = infer_dtype_from_scalar(data, pandas_dtype=True)

            # For data is a scalar extension dtype
            if is_extension_array_dtype(dtype):
                # TODO(EA2D): special case not needed with 2D EAs

                values = [
                    construct_1d_arraylike_from_scalar(data, len(index), dtype)
                    for _ in range(len(columns))
                ]
                mgr = arrays_to_mgr(values, columns, index, columns, dtype=None)
            else:
                values = construct_2d_arraylike_from_scalar(
                    data, len(index), len(columns), dtype, copy
                )

                mgr = init_ndarray(
                    values, index, columns, dtype=values.dtype, copy=False
                )

        # ensure correct Manager type according to settings
        manager = get_option("mode.data_manager")
        mgr = mgr_to_mgr(mgr, typ=manager)

        NDFrame.__init__(self, mgr)
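To give a feeling for how dispatch replaces such an isinstance chain, here is a small Python sketch using functools.singledispatch from the standard library. It only dispatches on the type of the first argument, so it is a limited cousin of Julia's multiple dispatch, and the `to_rows` function is a made-up toy, not part of pandas.

```python
from functools import singledispatch


@singledispatch
def to_rows(data):
    """Fallback for input types without a registered implementation."""
    raise TypeError(f"unsupported input type: {type(data).__name__}")


@to_rows.register(dict)
def _(data):
    # A dict maps column names to equally long lists of values.
    return [list(row) for row in zip(*data.values())]


@to_rows.register(list)
def _(data):
    # A list is assumed to already hold one sequence per row.
    return [list(row) for row in data]


from_columns = to_rows({"a": [1, 2], "b": [3, 4]})
from_records = to_rows([(1, 3), (2, 4)])
```

Each input type gets its own small function, and supporting a new type means registering one more implementation instead of adding another elif branch.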



Ubuntu tracker ignore directories

2020-02-20T00:00:00+00:00

Today I noticed tracker-store was eating a lot of CPU on my machine, so I dug a little into this program to figure out what it’s doing. I knew nothing about this program; here’s how I figured out some things.

I figured the problem was with some datasets containing millions of files, which the indexer was at least looking at, although hopefully not reading. Then again, there were jpg images in there, so perhaps the tracker would actually start indexing some metadata of those.

So I wanted tracker to ignore all directories named data, and for good measure, I wanted it to exclude venv directories as well, because I have a lot of those for different projects, and they contain a lot of python source files. After some googling, I found out that you can add a .trackerignore file to a directory, and that works, but I didn’t want to start adding this file to all the data or venv directories I will be creating in the future.

1. The tracker tool

There is a tracker tool which can influence some things, for example, you can reset the index and pause the mining process.

usage: tracker [--version] [--help]
               <command> [<args>]

Available tracker commands are:
   daemon    Start, stop, pause and list processes responsible for indexing content
   extract   Extract information from a file
   info      Show information known about local files or items indexed
   index     Backup, restore, import and (re)index by MIME type or file name
   reset     Reset or remove index and revert configurations to defaults
   search    Search for content indexed or show content by type
   sparql    Query and update the index using SPARQL or search, list and tree the ontology
   sql       Query the database at the lowest level using SQL
   status    Show the indexing progress, content statistics and index state
   tag       Create, list or delete tags for indexed content

See “tracker help <command>” to read about a specific subcommand.

But I didn’t really find what I was looking for here.

2. Finding documentation

I found all files this package was using with the following command.

$ dpkg -L tracker

This shows many files and directories, among others:

/usr/lib/tracker
...
/usr/share/doc/tracker/AUTHORS
/usr/share/doc/tracker/NEWS.gz
/usr/share/doc/tracker/README.md.gz
/usr/share/doc/tracker/copyright
...
/usr/share/glib-2.0/schemas/org.freedesktop.Tracker.gschema.xml
...

Some documentation is available in the README.md file, which also points to https://wiki.gnome.org/Projects/Tracker/Documentation/GettingStarted. On that page I found that you can view the settings with the one-liner below.

3. Accessing tracker settings

$ gsettings list-recursively | grep -i org.freedesktop.Tracker | sort | uniq

This shows approximately 40 records, among others:

org.freedesktop.Tracker.DB journal-chunk-size 50
...
org.freedesktop.Tracker.Miner.Files ignored-directories-with-content ['.trackerignore', '.git', '.hg', '.nomedia']
org.freedesktop.Tracker.Miner.Files ignored-files ['*~', '*.o', '*.la', '*.lo', '*.loT', '*.in', '*.csproj', '*.m4', '*.rej', '*.gmo', '*.orig', '*.pc', '*.omf', '*.aux', '*.tmp', '*.vmdk', '*.vm*', '*.nvram', '*.part', '*.rcore', '*.lzo', 'autom4te', 'conftest', 'confstat', 'Makefile', 'SCCS', 'ltmain.sh', 'libtool', 'config.status', 'confdefs.h', 'configure', '#*#', '~$*.doc?', '~$*.dot?', '~$*.xls?', '~$*.xlt?', '~$*.xlam', '~$*.ppt?', '~$*.pot?', '~$*.ppam', '~$*.ppsm', '~$*.ppsx', '~$*.vsd?', '~$*.vss?', '~$*.vst?', 'mimeapps.list', 'mimeinfo.cache', 'gnome-mimeapps.list', 'kde-mimeapps.list', '*.directory']
org.freedesktop.Tracker.Miner.Files index-on-battery-first-time true
...

I wanted tracker to ignore all directories called data and venv, since these have many files, and they shouldn’t be indexed.

$ gsettings get org.freedesktop.Tracker.Miner.Files ignored-directories

['po', 'CVS', 'core-dumps', 'lost+found']

So finally I added the new entries, and then reset the whole index with the following commands.

4. TL;DR

$ gsettings set org.freedesktop.Tracker.Miner.Files ignored-directories "['po', 'CVS', 'core-dumps', 'lost+found', 'data', 'venv']"
$ tracker reset -r
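If you would rather not retype the existing list by hand, the same change could be scripted. The following is a rough Python sketch under the assumption that gsettings prints the list in the form shown above; `merge_ignored` and `apply_ignored` are hypothetical helper names, and only the two gsettings commands from this post are used.

```python
import ast
import subprocess

SCHEMA = "org.freedesktop.Tracker.Miner.Files"
KEY = "ignored-directories"


def merge_ignored(current, extra):
    """Return the list text for gsettings with `extra` appended, without duplicates."""
    text = current.strip()
    # gsettings prints an empty string array as "@as []"; drop the type prefix.
    if text.startswith("@as"):
        text = text[3:].strip()
    names = list(ast.literal_eval(text))
    for name in extra:
        if name not in names:
            names.append(name)
    return str(names)


def apply_ignored(extra):
    """Read the current setting and write it back with the extra names added."""
    current = subprocess.run(
        ["gsettings", "get", SCHEMA, KEY],
        check=True, capture_output=True, text=True,
    ).stdout
    subprocess.run(
        ["gsettings", "set", SCHEMA, KEY, merge_ignored(current, extra)],
        check=True,
    )


# Not called here on purpose; on a machine with tracker installed you would run
# apply_ignored(["data", "venv"]) and then `tracker reset -r` as above.
merged = merge_ignored("['po', 'CVS', 'core-dumps', 'lost+found']", ["data", "venv"])
```

The merging step is a pure function, so it can be tried out without touching any settings.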

And I learned more about the wonderful programs that are running on my pc. Hope it helps you!
