Computer programming

This page collects info and links (to this site or others) useful in computer programming and software development, including resources for various languages, editors, or testing tools, and notes/tips for using them effectively. These resources lean towards data science and data analysis applications.

General rules and philosophies

I like Rob Pike's 5 Rules of Programming:

Rule 1. You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is.

Rule 2. Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.

Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy. (Even if n does get big, use Rule 2 first.)

Rule 4. Fancy algorithms are buggier than simple ones, and they're much harder to implement. Use simple algorithms as well as simple data structures.

Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

Pike's rules 1 and 2 restate the famous maxim "Premature optimization is the root of all evil," popularized by Donald Knuth and often attributed to Tony Hoare.

Ken Thompson rephrased Pike's rules 3 and 4 as "When in doubt, use brute force."

Rules 3 and 4 are instances of the KISS design philosophy ("keep it simple, stupid").

Rule 5 was previously stated by Fred Brooks in The Mythical Man-Month. Rule 5 is often shortened to "write stupid code that uses smart objects".

These were found here but are restated in various places around the internet.

Some techniques and conventions

Notes about data analysis techniques/conventions, independent of language/interface.

  • Sensor data notes on working with continuous sensor timeseries (from dataloggers, SNOTEL sites, etc.)
  • Data analysis workflow - Notes on collecting, storing, and moving data through the analysis process.

Text editing and data file handling

Vim is a great text editor. Below are a few resources on using it effectively.

An excellent general overview of text/data file handling in a Unix environment is provided by Unix for Poets, by Kenneth Ward Church. PDFs of this are all over the internet.
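The classic Unix for Poets exercises (word counts, frequency lists) translate directly to Python when a shell pipeline isn't convenient. As a minimal sketch, here is a word-frequency counter in the spirit of Church's first example; the function name and punctuation handling are my own choices, not from the text:

```python
from collections import Counter

def word_frequencies(text):
    """Count word occurrences, ignoring case and punctuation at word edges."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return Counter(w for w in words if w)

freqs = word_frequencies("the cat sat on the mat. The cat slept.")
print(freqs.most_common(2))  # [('the', 3), ('cat', 2)]
```

This is roughly equivalent to the shell idiom `tr -sc 'A-Za-z' '\n' < file | sort | uniq -c` that Unix for Poets builds its examples around.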

Other useful resources (including some on this wiki)

Python

Python is a high-level, open-source programming language that, when combined with some numerical, scientific, and plotting packages, makes a very powerful tool for scientific computing and data analysis (on par with Matlab). Useful Python extensions for scientific computing are:

  • NumPy - provides n-dimensional array objects and other useful numeric extensions to Python
  • SciPy - provides a number of high-level mathematical tools for use in scientific computing (integration, optimization, Fourier transforms, etc.)
  • Matplotlib - a plotting library that provides publication quality plots and plotting routines that are similar to Matlab's.
  • IPython - an interactive shell that is designed to work well with NumPy, SciPy, and Matplotlib.
  • SciKits - add-on toolkits that complement SciPy (statistical models, timeseries analysis, machine learning, image processing, etc.)
  • The pandas library - provides high-performance, easy-to-use data structures (like data frames) and data analysis tools that sit on top of NumPy.
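These packages are designed to compose. As a minimal sketch (the column names and values here are invented for illustration), a NumPy array of simulated measurements can feed a pandas data frame, which then provides group-wise summaries in one line:

```python
import numpy as np
import pandas as pd

# Simulated temperature readings at two sites (values are illustrative only)
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "site": ["A", "A", "B", "B"],
    "temp_c": rng.normal(loc=10.0, scale=2.0, size=4),
})

# Group-wise summary: mean temperature per site
means = df.groupby("site")["temp_c"].mean()
print(means)
```

The same pattern (array in, data frame out, grouped summary) scales from this toy example to large sensor datasets.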

Official Python resources

My Python notes

Collected notes, tips, and tricks for using any of the Python tools above.

Other

MATLAB (and clones)

MATLAB is a proprietary programming language and IDE that is widely used in scientific and engineering computing.

Resources

Clones of Matlab

There are a bunch of free/open-source clones of Matlab that have various levels of syntax compatibility.

R

R is a free, open-source software environment for statistical computing and graphics.

Math and Stats tools

Many toolboxes are available, either standalone or in Python, R, and Matlab, for math and statistical applications. See the math toolbox page.

Testing data analysis functions

Code used in data analysis can perform fairly complex operations on a dataset and generate output that differs substantially from the original data. The code itself can also be complex enough that its actual behavior is difficult to discern just by reading it or inspecting the output. It is therefore important to verify that the code does what is expected and that its output is accurate. Writing test functions that run the analysis code on inputs with known answers and check the output against those answers is a useful way to do this.
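As a minimal sketch of this idea, the test below feeds a hypothetical analysis step (a daily-mean resampler; both the function and the data are invented for illustration) a synthetic timeseries whose correct answer is known in advance, then asserts that the code recovers it:

```python
import numpy as np
import pandas as pd

def daily_mean(series):
    """Hypothetical analysis step: average a sensor timeseries by calendar day."""
    return series.resample("D").mean()

def test_daily_mean_recovers_known_values():
    # Two days of hourly readings: constant 1.0 on day one, constant 3.0 on day two,
    # so the daily means are known exactly before the code runs.
    idx = pd.date_range("2024-01-01", periods=48, freq="h")
    s = pd.Series(np.concatenate([np.full(24, 1.0), np.full(24, 3.0)]), index=idx)
    result = daily_mean(s)
    assert len(result) == 2
    assert result.iloc[0] == 1.0
    assert result.iloc[1] == 3.0
```

A test runner such as pytest will discover and run any function whose name starts with `test_`; the same synthetic-input-with-known-answer pattern works for more realistic analysis functions.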