Sebastian Raschka
last updated: 09/04/2014
This is not meant to be a complete list of all Python libraries out there that are related to scientific computing and data analysis -- printed on paper and stacked one on top of the other, the stack could easily reach a height of 238,857 miles, the distance from Earth to Moon.
However, I still look forward to additions and suggestions.
Please feel free to drop me a note via Twitter, email, or Google+.
- Fundamental Libraries for Scientific Computing
- Math and Statistics
- Machine Learning
- Plotting and Visualization
- Data formatting and storage
Website: http://ipython.org/notebook.html
IPython is an alternative Python command line shell for interactive computing with lots of useful enhancements over the "default" Python interpreter.
The browser-based IPython Notebook documents are a great environment for scientific computing: they let you not only execute code, but also add informative documentation via Markdown, HTML, LaTeX, embedded images, and inline data plots via, e.g., matplotlib.
Website: http://www.numpy.org
NumPy is probably the most fundamental package for efficient scientific computing in Python, providing array objects together with linear algebra routines. One of NumPy's major strengths is that most operations are implemented as C/C++ and FORTRAN code for efficiency. At its core, NumPy works with multi-dimensional array objects that support broadcasting and lead to efficient, vectorized code.
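To make "broadcasting" and "vectorized code" a little more concrete, here is a minimal sketch. The array values and variable names are made up for illustration; only the NumPy calls themselves are real.

```python
import numpy as np

# Hypothetical data: 3 samples (rows) with 2 features (columns).
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# The column means have shape (2,). Broadcasting stretches them
# across all 3 rows, so no explicit loop is needed to center the data.
col_means = data.mean(axis=0)
centered = data - col_means

# Vectorized arithmetic: squared distance of each sample from the mean.
squared_norms = (centered ** 2).sum(axis=1)
```

The subtraction works because NumPy aligns the trailing dimensions of the `(3, 2)` array and the `(2,)` vector.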
Website: http://pandas.pydata.org
Pandas is a library for working with table-like structures. It comes with a powerful DataFrame object, a two-dimensional, labeled data structure that supports efficient numerical operations similar to NumPy's ndarray and adds functionality such as labeled rows and columns.
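As a rough illustration of what the labels add over a plain array, the following sketch builds a small DataFrame and aggregates it by group. The column names and values are invented for this example.

```python
import pandas as pd

# Hypothetical table: one text column and one numeric column.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.3, 6.0, 5.1],
})

# Columns are selected by label, and rows can be grouped by value
# before an aggregation is applied to each group.
mean_by_species = df.groupby("species")["petal_length"].mean()
```

The result is a Series indexed by species name, so `mean_by_species["setosa"]` looks up a value by label rather than by position.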
Website: http://scipy.org/scipylib/index.html
SciPy is considered to be one of the core packages for scientific computing routines. As a useful expansion of the NumPy core functionality, it contains a broad range of functions for linear algebra, interpolation, integration, clustering, and much more.
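A small sketch of two of the areas mentioned above, numerical integration and linear algebra. The integrand and the linear system are arbitrary choices for illustration.

```python
import numpy as np
from scipy import integrate, linalg

# Numerically integrate sin(x) from 0 to pi; quad returns the
# estimate together with an estimate of the absolute error.
value, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Solve the linear system A @ x = b for x.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = linalg.solve(A, b)
```

The exact value of the integral is 2, which gives an easy check on the numerical estimate.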
Website: http://www.sympygamma.com
SymPy is a Python library for symbolic mathematical computations. It has a broad range of features, ranging from calculus, algebra, geometry, and discrete mathematics to quantum physics. It also includes basic plotting functionality and print functions with LaTeX support.
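The following sketch shows what "symbolic" means in practice: expressions are manipulated exactly rather than evaluated numerically. The particular expressions are arbitrary examples.

```python
import sympy as sp

x = sp.symbols("x")

# Symbolic differentiation of sin(x) * exp(x) with respect to x.
derivative = sp.diff(sp.sin(x) * sp.exp(x), x)

# Symbolic (indefinite) integration of x**2.
integral = sp.integrate(x**2, x)

# LaTeX source for the result, e.g. for use in a notebook or paper.
latex_str = sp.latex(integral)
```

Here `integral` is the exact expression `x**3/3`, not a floating-point approximation.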
Website: http://statsmodels.sourceforge.net
Statsmodels is a Python library that is centered around statistical data analysis, mainly through linear models, and includes a variety of statistical tests.
Website: http://scikit-learn.org/stable/
Scikit-learn is probably the most popular general machine learning library for Python. It includes a broad range of different classifiers, cross-validation and other model selection methods, dimensionality reduction techniques, modules for regression and clustering analysis, and a useful data-preprocessing module.
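A minimal sketch of how preprocessing, a classifier, and cross-validation fit together. It assumes a reasonably recent scikit-learn release, where the cross-validation helpers live in `sklearn.model_selection`; the choice of logistic regression and 5 folds is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The Iris dataset ships with scikit-learn: 150 samples, 4 features.
X, y = load_iris(return_X_y=True)

# Chain feature standardization and a classifier into one estimator,
# so the scaler is re-fit on the training part of every fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation; returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
```

Wrapping the scaler in the pipeline keeps information from the held-out fold from leaking into the preprocessing step.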
Website: http://www.shogun-toolbox.org
Shogun is a machine learning library that is focussed on large-scale kernel methods. Its particular strength is Support Vector Machines (SVMs), and it comes with a range of different SVM implementations.
Website: http://pybrain.org
PyBrain (Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library) is a machine learning library built around neural networks, with a focus on supervised learning, reinforcement learning, and evolutionary methods.
Website: http://deeplearning.net/software/pylearn2/
PyLearn2 is a machine learning research library (a library for studying machine learning) focussed on deep and convolutional neural networks, restricted Boltzmann machines, and auto-encoders.
Website: http://pymc-devs.github.io/pymc/index.html
PyMC focuses on Bayesian statistics and comes with a broad range of algorithms (including Markov chain Monte Carlo, MCMC) for model fitting.
Website: http://bokeh.pydata.org
Bokeh is a plotting library that is focussed on aesthetic layouts and interactivity to produce high-quality plots for web browsers.
Website: https://github.com/mikedewar/d3py
d3py is a plotting library to create interactive data visualizations based on d3.
Website: https://github.com/yhat/ggplot
ggplot is a port of R's popular ggplot2 library, which brings the alternative syntax and unique visualization style to Python.
Website: http://matplotlib.org
Matplotlib is Python's most popular and comprehensive plotting library that is especially useful in combination with NumPy/SciPy.
Website: https://plot.ly
Plotly is a plotting library that is focussed on adding interactivity to data visualizations and sharing them via the web for collaborative data analysis.
Website: http://olgabot.github.io/prettyplotlib/
Prettyplotlib is a nice enhancement-library that turns matplotlib's default styles into beautiful, presentation-ready plots based on information design and color perception studies.
Website: http://web.stanford.edu/~mwaskom/software/seaborn/
Seaborn is based on matplotlib's core functionality and adds additional features (e.g., violin plots) and visual enhancements to create even more beautiful plots.
Website: https://csvkit.readthedocs.org
csvkit, also known as the "Swiss Army knife for comma-delimited data files", offers additional functionality and features over Python's built-in csv module. It comes with several shell command-line tools, e.g., csvgrep, csvsort, etc., but of course it can also be imported as a library in Python.
Website: http://www.pytables.org
PyTables is a library that combines HDF5 and NumPy for working with very large datasets efficiently. PyTables also makes use of C-extensions (via Cython) for fast data access and pulling data into NumPy or pandas arrays.
Website: https://docs.python.org/3.4/library/sqlite3.html
Although sqlite3 is part of Python's Standard Library, this classic is still worth mentioning: it provides a Python interface to SQLite databases. SQLite is an open-source SQL database engine that is ideal for smaller workgroups, because the whole database is a single locally stored file (up to 140 TB in size) that, in contrast to client-server database systems, does not require any server infrastructure.
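Because no server is involved, a complete round trip fits in a few lines. The sketch below uses an in-memory database so it leaves no file behind; the table name, columns, and rows are invented for illustration.

```python
import sqlite3

# ":memory:" keeps the database in RAM; pass a file path instead
# to store it as a single file on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table, made up for this example.
conn.execute("CREATE TABLE libraries (name TEXT PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO libraries VALUES (?, ?)",
    [("NumPy", "fundamental"), ("SymPy", "math"), ("Bokeh", "plotting")],
)
conn.commit()

# "?" placeholders let sqlite3 bind values safely instead of
# formatting them into the SQL string by hand.
rows = conn.execute(
    "SELECT name FROM libraries WHERE category = ? ORDER BY name",
    ("math",),
).fetchall()

conn.close()
```

Each row comes back as a tuple, so `rows` here is a list containing the single tuple `("SymPy",)`.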