Parallel processing pandas dataframes


Since Python 3.2, the standard library has included an excellent module called concurrent.futures. It makes multi-threading and multi-processing very easy. In the words of the documentation:

The concurrent.futures module provides a high-level interface for asynchronously executing callables. The asynchronous execution can be performed with threads, using ThreadPoolExecutor, or separate processes, using ProcessPoolExecutor.

Parallel execution of pandas dataframe

In a concrete problem I recently faced, I had a large dataframe and a heavy function to execute on each row, using a subset of the dataframe's columns. Usually I would have used the apply method to work through the rows, but apply uses only one of the available cores.

In the made-up example below, I use concurrent.futures to process the dataframe on all available cores. Of course this doesn’t make sense for simple operations such as the summation below, but for heavy calculations, using all the available computing power can make a large difference. Also note that I send the rows to the executor in chunks of 10; this reduces the overhead of returning the results.
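A minimal sketch of this approach (not the original example from the post) could look like the following. The function names process_chunk and parallel_apply, the column names a and b, and the executor_cls parameter are assumptions made for illustration; the summation stands in for the heavy per-row calculation.

```python
import concurrent.futures

import pandas as pd


def process_chunk(chunk):
    """Stand-in for the heavy work: sum columns 'a' and 'b' for each row."""
    return [row.a + row.b for row in chunk.itertuples(index=False)]


def parallel_apply(df, func, chunk_size=10,
                   executor_cls=concurrent.futures.ProcessPoolExecutor):
    """Run func on chunks of chunk_size rows and return a flat result list.

    The results come back in the same order as the rows of df.
    """
    # Slice the dataframe into consecutive blocks of chunk_size rows
    chunks = [df.iloc[start:start + chunk_size]
              for start in range(0, len(df), chunk_size)]
    with executor_cls() as executor:
        # executor.map yields results in the order the chunks were submitted
        chunk_results = executor.map(func, chunks)
        return [value for result in chunk_results for value in result]
```

With a process pool, the call should be made from under an if __name__ == '__main__': guard, for example df['total'] = parallel_apply(df, process_chunk). Passing executor_cls=concurrent.futures.ThreadPoolExecutor runs the same code with threads, which is mostly useful when the per-row work is I/O-bound.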


My Vim setup for Python programming

I have been using Vim for many years, but usually with a fairly vanilla setup, as I did not want to depend on fancy features if I had to work remotely on older machines. Those days are now gone, and I can focus my hacking on somewhat recent Linux installations and the latest Vim versions.

Screenshot: the output of the vim-plug command :PlugUpdate, shown in the 24-bit color scheme tender.

Vim-plug plugin manager

On a day-to-day basis I spend 98% of my editing time on Python, with some bash and LaTeX in the last 2%, so the plugins listed below are biased towards Python. A major development in my Vim usage was the vim-plug plugin manager and the corresponding plugins. My full .vimrc is listed below, and can also be found on my GitHub page. The most important change is the plugin manager, which needs a little more explanation.

The short snippet below is pure magic. On a completely blank Linux account without a .vimrc or .vim folder, it will download the vim-plug plugin manager and afterwards install all my plugins. All I have to do to set up my Vim environment is to download my .vimrc below and start vim. That’s it.

" Install vim-plug automatically if it is not already present
if empty(glob('~/.vim/autoload/plug.vim'))
  silent !curl -fLo ~/.vim/autoload/plug.vim --create-dirs
  autocmd VimEnter * PlugInstall | source $MYVIMRC
endif

" Plugins managed by vim-plug
call plug#begin('~/.vim/plugged')
Plug 'tpope/vim-sensible'
Plug 'scrooloose/syntastic'
Plug 'nvie/vim-flake8'
Plug 'scrooloose/nerdtree', { 'on': 'NERDTreeToggle' }
Plug 'ctrlpvim/ctrlp.vim'
Plug 'godlygeek/tabular'
Plug 'jacoborus/tender'
Plug 'ervandew/supertab'
Plug 'sirver/ultisnips'
Plug 'honza/vim-snippets'
call plug#end()


Growing a mdadm RAID by replacing disks


As described in my related earlier post, Replacing a failed disk in a mdadm RAID, I have a 4-disk RAID 5 setup which I initially populated with 1TB WD Green disks (cheap, but not really suited for NAS operation). After a few years I started to fill up the file system, so I wanted to grow my RAID by upgrading to 3TB WD Red disks, which are tailored to the NAS workload. Growing the mdadm RAID is done through the following steps:

  • Fail, remove and replace each 1TB disk with a 3TB disk. After each disk I have to wait for the RAID to resync to the new disk.
  • I then have to grow the RAID to use all the space on each of the 3TB disks.
  • Finally, I have to grow the filesystem to use the available space on the RAID device.
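The three steps above can be sketched as a small Python helper that only builds the list of commands and never runs them. This is a rough outline under some assumptions: the function name grow_raid_commands is made up, each new disk is assumed to show up under the same device name as the one it replaces, and the file system is assumed to be ext3/ext4 so that resize2fs applies. Lines starting with # are manual steps.

```python
def grow_raid_commands(array, partitions):
    """Return the ordered steps for growing a RAID by replacing its disks.

    Nothing is executed; the result is a checklist of shell commands,
    with manual steps included as '#' comment lines.
    """
    steps = []
    for partition in partitions:
        # Step 1: swap one disk at a time and let the array resync
        steps.append('mdadm --manage {} --fail {}'.format(array, partition))
        steps.append('mdadm --manage {} --remove {}'.format(array, partition))
        steps.append('# power off, swap in the larger disk, partition it, boot')
        steps.append('mdadm --manage {} --add {}'.format(array, partition))
        steps.append('# wait until /proc/mdstat shows the resync has finished')
    # Step 2: let the array use all the space on the new disks
    steps.append('mdadm --grow {} --size=max'.format(array))
    # Step 3: grow the file system (assumes ext3/ext4)
    steps.append('resize2fs {}'.format(array))
    return steps
```

Printing the result of grow_raid_commands('/dev/md0', ['/dev/sde1']) gives a checklist for a single disk; the other partition names have to be filled in from the actual array.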

The steps below are similar to my previous article, Replacing a failed disk in a mdadm RAID, but I have included them here for completeness.

Removing the old drive

The enclosure I have does not support hot-swap and has no separate light for each disk, so I need a way to find out which of the disks to replace. Finding the serial number of a disk is fairly easy:

# hdparm -i /dev/sde | grep SerialNo
 Model=WDC WD10EARS-003BB1, FwRev=80.00A80, SerialNo=WD-WCAV5K430328

and luckily the Western Digital disks I have came with a small sticker which shows the serial number on the disk. So now I know the serial number of the disk I want to replace. Before shutting down and replacing the disk, I marked it as failed in mdadm and removed it from the RAID:

mdadm --manage /dev/md0 --fail /dev/sde1
mdadm --manage /dev/md0 --remove /dev/sde1
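With several disks to replace, the serial number lookup shown earlier can be scripted. The sketch below assumes that hdparm -i prints a SerialNo= field as in the output above; the function names parse_serial and serial_for_device are made up for this example, and serial_for_device needs root privileges to run hdparm.

```python
import re
import subprocess

# Matches the serial number field in the output of "hdparm -i"
SERIAL_PATTERN = re.compile(r'SerialNo=(\S+)')


def parse_serial(hdparm_output):
    """Extract the serial number from hdparm -i output, or None if absent."""
    match = SERIAL_PATTERN.search(hdparm_output)
    return match.group(1) if match else None


def serial_for_device(device):
    """Run hdparm -i on a device such as /dev/sde and return its serial."""
    output = subprocess.check_output(['hdparm', '-i', device])
    return parse_serial(output.decode('utf-8', 'replace'))
```

Looping serial_for_device over the member disks of the array gives a table of device names and serial numbers to compare with the stickers on the disks.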


Rotating website backup using rsync over ssh


Recently the hosting company for my website started supporting SSH access. This meant I could ditch the insecure FTP transfers and do everything through SFTP and rsync over SSH. Besides making editing of files much easier, this also allowed me to implement a rolling/rotating backup of the website. While it can be argued that such a backup should never be needed, as the hosting company surely has a safe storage solution, I have personally experienced the loss of data from a server breakdown at a hosting company.

The python script

Below I have written a Python script to automate the backup and keep the last 12 weeks of changes in separate folders with hard links between them. For a website like mine, with few changes, this means that the backup does not take up much more than the size of the website plus the size of the changes (which are small). The script defaults to 12 backup copies, and I run it through cron every week on my home Linux server. The script can also be run on the command line if needed with the syntax: user@host:/www/ /home/tjansson/backup/websites/host/

A cron line to run the script monthly, on the first day of the month at 4:05 in the morning:

5 4 1 * * /home/tjansson/bin/ user@host:/www/ /home/tjansson/backup/websites/host/

On a final note, for this script to work through cron it is assumed that SSH access is set up using keys, and perhaps ssh-agent, for passwordless access to the server.

#!/usr/bin/env python
import os
import argparse
import shutil

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='This script does rotating backup using rsync')
    parser.add_argument('source',         type=str,             help='The source. Example: user@host:/www/')
    parser.add_argument('backup_path',    type=str,             help='The backup path template. Example: /home/tjansson/backup/websites/host/')
    parser.add_argument('-c', '--copies', type=int, default=12, help='The maximum number of copies to save in the rotation. Default=12')
    parser.add_argument('-d', '--debug',  dest='debug', action='store_true', help='Turn on verbose debugging')
    args = parser.parse_args()

    # Folder template: backup0 is the newest copy, backup<copies> the oldest
    folder = '{}backup{}'.format(args.backup_path, '{}')

    # Delete the oldest folder
    folder_old = folder.format(args.copies)
    if os.path.isdir(folder_old):
        if args.debug:
            print('Removing the oldest folder: {}'.format(folder_old))
        shutil.rmtree(folder_old)

    # Rotating backups: backup0 -> backup1, backup1 -> backup2, ...
    if args.debug:
        print('Rotating backups')
    for i in range(args.copies - 1, -1, -1):
        folder_0 = folder.format(i)
        folder_1 = folder.format(i + 1)
        if os.path.isdir(folder_0):
            if args.debug:
                print('mv {} {}'.format(folder_0, folder_1))
            os.system('mv {} {}'.format(folder_0, folder_1))

    # Execute the rsync
    target = folder.format(0)
    link   = folder.format(1)
    if not os.path.isdir(target):
        os.makedirs(target)
    if not os.path.isdir(link):
        # First run: there is no previous backup to hard link against
        cmd = 'rsync -ah --delete -e ssh {source} {target}'.format(source=args.source, target=target)
    else:
        # Unchanged files become hard links to the previous backup
        cmd = 'rsync -ah --delete -e ssh --link-dest="{link}" {source} {target}'.format(link=link, source=args.source, target=target)
    if args.debug:
        print('Rsyncing the latest changes')
        print(cmd)
    os.system(cmd)

    # Update the timestamp of the newest backup folder
    os.system('touch {}'.format(target))
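To illustrate why the hard-linked copies take up so little space, here is a small stand-alone sketch (not part of the backup script). It creates a file and a hard link to it in a temporary directory and checks that both names point to the same inode, so the data is stored only once. The function name hard_link_demo and the file names are made up for the example.

```python
import os
import tempfile


def hard_link_demo():
    """Create a file plus a hard link to it; return (same_inode, link_count)."""
    with tempfile.TemporaryDirectory() as tmp:
        previous = os.path.join(tmp, 'backup1_index.html')
        current = os.path.join(tmp, 'backup0_index.html')
        with open(previous, 'w') as handle:
            handle.write('unchanged page\n')
        # A hard link is a second name for the same data on disk
        os.link(previous, current)
        stat_previous = os.stat(previous)
        stat_current = os.stat(current)
        same_inode = stat_previous.st_ino == stat_current.st_ino
        return same_inode, stat_current.st_nlink
```

When a file is unchanged between two runs, rsync with --link-dest creates this kind of link instead of a second copy, and the data is only freed once the last backup folder that refers to it has been deleted.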


Migrating a wordpress blog – mysql charset problems and backup script


The, now previous, hosting company of my wife’s blog had a major data corruption and completely lost a year's worth of database entries and files. There was no communication before we found the problem ourselves, so we were very unhappy and decided to reconstruct the blog at another host.

Luckily I had set up WordPress to send a complete database dump weekly as a tar.gz archive, so no database entries were lost. All uploaded images and the like were permanently lost, but reconstructing these is much easier than reconstructing posts and comments.

Charset problems moving the site to another web host

After creating a backup of the files left on the old host, I made a local copy on my computer and another copy on the new web host. After the DNS changes had gone through and I had imported the database dump on the new host, the only thing left was to edit wp-config.php with the new database settings… or so I thought. It turned out that all the tables of the database were declared as latin1 (collation latin1_swedish_ci), but some of the posts contained utf8 characters as well. The result was that all Danish letters and many special characters in English looked garbled on the blog.

After searching the web for hours and wading through variations on simple search and replace, which I did not find feasible, I finally found the holy grail: the ‘replace’ command that ships with MySQL (now MariaDB). The following command corrected all entries in the sql file from the mix of different charsets to a consistent utf8 output that rendered beautifully on the website:

replace "CHARSET=latin1" "CHARSET=utf8" "SET NAMES latin1" "SET NAMES utf8" < database.sql > database_uft8.sql
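On a machine where the replace utility is not installed, the same two substitutions can be sketched in a few lines of Python. This is only an approximation of the command above, not a drop-in replacement: the function names are made up, and the dump is handled as raw bytes so that the mixed encodings in the file are never decoded.

```python
# The same pairs as in the replace command: (old text, new text)
REPLACEMENTS = [
    (b'CHARSET=latin1', b'CHARSET=utf8'),
    (b'SET NAMES latin1', b'SET NAMES utf8'),
]


def fix_charset_declarations(dump):
    """Rewrite the latin1 charset declarations in an SQL dump given as bytes."""
    for old, new in REPLACEMENTS:
        dump = dump.replace(old, new)
    return dump


def convert_file(source_path, target_path):
    """Read an SQL dump, fix its charset declarations and write the result."""
    with open(source_path, 'rb') as source:
        dump = source.read()
    with open(target_path, 'wb') as target:
        target.write(fix_charset_declarations(dump))
```

Calling convert_file('database.sql', 'database_utf8.sql') then produces the converted dump. Only the charset declarations are changed; the row data itself is passed through untouched.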


Computers and physics