The intent of this page is to list some of the most commonly used Python modules, in the hope that it will provide useful recommendations for other programmers (especially beginners). Remember that in addition to the listings below, there are other directories of Python modules - see PublishingPythonModules for details. Another collection of library details can be found on the Libraries page.
Be warned that this list is subjective by its very nature - it is only intended as a helpful guide. It is not definitive in any way, nor should it discourage developers from developing their own modules.
StandardLibraryBackports - modules that make later standard library functionality available in earlier versions
SQLAlchemy or SQLObject - Object-oriented access to several different database systems
DatabaseInterfaces - Direct Python interfaces to relational and non-relational database backends
See also DatabaseProgramming for guidance on choosing a database backend system
CTypes - A package for calling the functions of dlls/shared libraries. Now included with Python 2.5 and up.
Cython is an extension language for the CPython runtime. It translates Python code to fast C code and supports calling external C and C++ code natively. As opposed to ctypes, it requires a C compiler to translate the generated code.
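To make the ctypes entry concrete, here is a minimal sketch (not from the original page) that calls cos() from the C math library; the library lookup is platform-dependent and may fail on Windows:

```python
from ctypes import CDLL, c_double
from ctypes.util import find_library

# Locate and load the C math library (platform-dependent; find_library may return None on Windows).
libm = CDLL(find_library("m"))

# Declare argument and return types so ctypes converts values correctly.
libm.cos.argtypes = [c_double]
libm.cos.restype = c_double

print(libm.cos(0.0))  # 1.0
```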
PyGame - Principal wrapper of the SDL library.
See also GameProgramming. A more comprehensive list of packages can be found on the PythonGameLibraries page.
GIS Web services - Packages for accessing Google Maps, Yahoo! Maps and other mapping services, plus more information
PyGtk - Bindings for the cross-platform Gtk toolkit.
PyQt - Bindings for the cross-platform Qt framework.
TkInter - The traditional Python user interface toolkit.
WxPython - wxWidgets bindings for Python supporting PythonCard, Wax and other frameworks.
PyjamasDesktop - Bindings and a framework built on the cross-platform WebKit engine.
GUI Programming is, in many cases, a matter of taste. See a more extensive list on the GuiProgramming page.
Ascii Table packages
Mutagen - Mutagen is a Python module to handle audio metadata. It supports FLAC, M4A, Musepack, MP3, Ogg FLAC, Ogg Speex, Ogg Theora, Ogg Vorbis, True Audio, and WavPack audio files. All versions of ID3v2 are supported, and all standard ID3v2.4 frames are parsed. It can read Xing headers to accurately calculate the bitrate and length of MP3s. ID3 and APEv2 tags can be edited regardless of audio format. It can also manipulate Ogg streams on an individual packet/page level. It only writes ID3v2.4 (introduced in 2000, but as of Windows 8 still not supported by MediaPlayer). A short usage sketch follows this list of tagging libraries.
ID3Reader - "Id3reader.py is a Python module that reads ID3 metadata tags in MP3 files. It can read ID3v1, ID3v2.2, ID3v2.3, or ID3v2.4 tags. It does not write tags at all" (from site). Used in ID3Writer. Does not work with Python 3000. Not maintained since 2006.
PyID3 - ""(Appears to be inactive)""Module for manipulating ID3 informational tags in MP3 audio files. Not as good as ID3Writer, but no issues w/ genre, unlike ID3Writer. Not maintained since 2007.
pytagger - tag reader and writer implemented purely in Python. Supports ID3v1, ID3v1.1, ID3v2.2, ID3v2.3 and ID3v2.4
eyeD3 - is a Python module and program for processing ID3 tags. It can extract information such as bit rate, sample frequency, play time, etc. It supports ID3 v1.0/v1.1 and v2.3/v2.4.
hsaudiotag - Py3k - hsaudiotag is a pure Python library that lets you read metadata (bitrate, sample rate, duration and tags) from mp3, mp4, wma, ogg, flac and aiff files. It can only read tags, not write to them, but unlike more complete libraries (like Mutagen), it is BSD licensed.
pytaglib - Python 3.x and 2.x support - bindings to the C++ taglib library, reads and writes mp3, ogg, flac, mpc, speex, opus, WavPack, TrueAudio, wav, aiff, mp4 and asf files.
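As promised above, a minimal Mutagen sketch (file name and tag value are illustrative; EasyID3 assumes the file already carries an ID3 tag):

```python
from mutagen.mp3 import MP3
from mutagen.easyid3 import EasyID3

audio = MP3("song.mp3")                       # placeholder path
print(audio.info.length, audio.info.bitrate)  # duration in seconds and bitrate

tags = EasyID3("song.mp3")                    # dict-like access to common ID3 frames
tags["title"] = "An example title"
tags.save()                                   # Mutagen writes ID3v2.4
```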
Python Imaging Library (PIL) - Supports many file formats, and provides powerful image processing and graphics capabilities.
pyqtgraph - Pure-python graphics library for scientific applications with image/video display, multidimensional image slicing, and interactive manipulation tools.
asyncoro - Asynchronous, concurrent programming framework with coroutines with thread-like interface
Gevent - Coroutine-based network library
TwistedMatrix - Event-driven networking framework
RPyC - Transparent RPC/distributed-computing framework
PyRO - powerful OO RPC
HTTPLib2 - A comprehensive HTTP client library that supports many features left out of other HTTP libraries, like httplib in the standard library.
Celery - Distributed task queue for out of band processing/RPC and more.
Psyco - Psyco can speed up the execution of any Python code (x86 only).
PyInstaller - Packages Python programs into stand-alone executables, under Windows, Linux and Irix.
py2app - Creates stand-alone apps (like py2exe for Mac)
PyObjC - Bridge between Python and Objective-C. Its most important use is writing Cocoa GUI applications on Mac OS X in pure Python.
PyWin32 - Python extensions for Windows.
Py2exe - Converts Python scripts into executable Windows programs, able to run without requiring a Python installation.
Chaco - Creates interactive plots
gnuplot.py - Based on gnuplot
Matplotlib - Production quality output in a wide variety of formats
Plotly - Interactive, publication-quality, web based charts
PyX - Postscript and PDF output, (La)TeX integration
ReportLab includes a charting package
pyqtgraph - Pure-python plotting and graphics library based on PyQt and numpy.
The SciPy topical software page has a longer list.
http://docutils.sourceforge.net/docs/user/tools.html#rst2s5-py - Create HTML slides from .rst files
http://seld.be/notes/introducing-slippy-html-presentations - For your Python presentations in browser
See RdfLibraries for a list of available RDF processing solutions.
Visual Python - Offers real-time 3D output, is easily usable by novice programmers, excellent for physics.
SciPy - Includes modules for graphics and plotting, optimization, integration, special functions, signal and image processing, genetic algorithms, ODE solvers, and others.
Python Bindings for R - R is a well known, open source (GPL 2) statistical package. RPy has two versions. Version 2 is still in development but is already usable.
PyIMSL is a collection of Python wrappers to the mathematical and statistical algorithms in the IMSL C Numerical Library. Developers can use Python, PyIMSL and the IMSL C Numerical Library for rapid prototyping. PyIMSL Studio is a complete packaged, supported and documented development environment designed for deploying mathematics and statistics prototype models into production applications. PyIMSL Studio includes the PyIMSL wrappers, the IMSL C Numerical Library, a Python distribution and a selection of open source python modules useful for prototype analytical development. PyIMSL Studio is available for download at no charge for non-commercial use or for commercial evaluation.
Python Path - Wraps the functionality of the os.path module and provides something more convenient.
Requests - An improvement upon urllib etc., for sending HTTP requests.
Dateutil - Provides powerful extensions to the datetime module.
sh - Can call any external program as if it were a function.
DocOpt - Command line arguments parser, with declarative approach (docstring).
PyLibrary - Collection of Libraries useful for Python developers.
ThreadPool - Intuitive approach to threads, well-explained.
See the ParallelProcessing page for other multiprocessing or parallel processing approaches.
psutil - cross-platform library for retrieving information on running processes and system utilization (CPU, memory, disks, network) in Python.
Django - High-level web framework.
Pyramid - TurboGears, Pylons, and repoze.bfg merged as Pyramid.
TurboGears - Rapid web development megaframework.
Pylons - A lightweight web framework emphasizing flexibility and rapid development.
web2py - High-level framework for agile development.
Flask - microframework for Python based on Werkzeug, Jinja 2. (It's BSD licensed)
See a more complete list of topics on the WebProgramming page and frameworks on the WebFrameworks page.
ClientForm - " ClientForm is a Python module for handling HTML forms on the client side, useful for parsing HTML forms, filling them in and returning the completed forms to the server. It developed from a port of Gisle Aas' Perl module HTML::Form, from the libwww-perl library, but the interface is not the same." - from the website.
lxml.html has support for dealing with forms in HTML documents
See also the WebProgramming and WebFrameworks pages.
Beautiful Soup - HTML/XML parser designed for quick turnaround projects like screen-scraping, will accept bad markup.
PyQuery - Implements a jQuery-like API for Python; reportedly faster than BeautifulSoup.
mxTidy - HTML cleanup tool. This is a library version of the popular HTML Tidy command-line application which will convert HTML (even badly formatted) into e.g. XHTML.
lxml.html is a very fast, easy-to-use and versatile library for handling (and fixing up) HTML
See also PythonXml for related tools.
openflow - A workflow engine for Zope 2.
Goflow - A workflow engine for Django, with the same design as openflow.
ElementTree - The Element type is a simple but flexible container object, designed to store hierarchical data structures, such as simplified XML infosets, in memory. --Note: Python 2.5 and up has ElementTree in the Standard Library--
lxml is a very fast, easy-to-use and versatile library for XML handling that is mostly compatible with but much more feature-rich than ElementTree
Amara - Amara provides tools you can trust to conform with XML standards without losing the familiar Python feel. (see also the 1.x version)
PythonXml provides a list of available XML processing solutions.
1. Requests. The most famous HTTP library, written by Kenneth Reitz. It's a must-have for every Python developer.
2. Scrapy. If you are involved in web scraping then this is a must-have library for you. After using this library you won't want to use any other.
3. wxPython. A GUI toolkit for Python. I have primarily used it in place of tkinter. You will really love it.
4. Pillow. A friendly fork of PIL (Python Imaging Library). It is more user-friendly than PIL and is a must-have for anyone who works with images.
5. SQLAlchemy. A database library. Many love it and many hate it. The choice is yours.
6. BeautifulSoup. I know it's slow, but this XML and HTML parsing library is very useful for beginners.
7. Twisted. The most important tool for any network application developer. It has a very beautiful API and is used by a lot of famous Python developers.
8. NumPy. How could we leave out this very important library? It provides advanced math functionality for Python.
9. SciPy. When we talk about NumPy we have to talk about SciPy. It is a library of algorithms and mathematical tools for Python and has caused many scientists to switch from Ruby to Python.
10. matplotlib. A numerical plotting library. It is very useful for any data scientist or data analyst.
11. Pygame. Which developer does not like to play games and develop them? This library will help you achieve your goal of 2D game development.
12. Pyglet. A 3D animation and game creation engine. This is the engine in which the famous Python port of Minecraft was made.
13. PyQt. A GUI toolkit for Python. It is my second choice after wxPython for developing GUIs for my Python scripts.
14. PyGtk. Another Python GUI library. It is the library in which the famous BitTorrent client was created.
15. Scapy. A packet sniffer and analyzer for Python, made in Python.
16. pywin32. A Python library which provides some useful methods and classes for interacting with Windows.
17. nltk. Natural Language Toolkit. I realize most people won't be using this one, but it's generic enough. It is a very useful library if you want to manipulate strings, but its capabilities go well beyond that. Do check it out.
18. nose. A testing framework for Python. It is used by millions of Python developers. It is a must-have if you do test-driven development.
19. SymPy. SymPy can do algebraic evaluation, differentiation, expansion, complex numbers, etc. It is distributed as pure Python.
20. IPython. I just can't stress enough how useful this tool is. It is a Python prompt on steroids. It has completion, history, shell capabilities, and a lot more. Make sure that you take a look at it.
These are the basic libraries that transform Python from a general purpose programming language into a powerful and robust tool for data analysis and visualization. Sometimes called the SciPy Stack, they’re the foundation that the more specialized tools are built on.
What if your business doesn’t have the luxury of accessing massive datasets? For many businesses, the data they need isn’t something that can be passively gathered—it has to be extracted either from documents or webpages. The following tools are designed for a variety of related tasks, from mining valuable information from websites to turning natural language into data you can use.
The best and most sophisticated analysis is meaningless if you can’t communicate it to other people. These libraries build on matplotlib to enable you to easily create more visually compelling and sophisticated graphs, charts, and maps, no matter what kind of analysis you’re trying to do.
These libraries are just a small sample of the tools available to Python developers. If you're ready to get your data science initiative up and running, you're going to need the right team. Find a developer who knows the tools and techniques of statistical analysis, or a data scientist with the development skills to work in a production environment.
Since the release of AWS Lambda (and others that have followed), all the rage has been about serverless architectures. These allow microservices to be deployed in the cloud, in a fully managed environment where one doesn’t have to care about managing any server, but is assigned stateless, ephemeral computing containers that are fully managed by a provider. With this paradigm, events (such as a traffic spike) can trigger the execution of more of these containers and therefore give the possibility to handle “infinite” horizontal scaling.
Zappa is the serverless framework for Python, although (at least for the moment) it only has support for AWS Lambda and AWS API Gateway. It makes building apps architected this way very simple, freeing you from most of the tedious setup you would have to do through the AWS Console or API, and has all sorts of commands to ease deployment and managing different environments.
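A rough sketch of the workflow (the Flask app and stage name are illustrative, not from the article):

```python
# app.py - a minimal Flask application that Zappa can deploy to AWS Lambda + API Gateway.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello from Lambda!"

# Typical Zappa workflow from the command line:
#   $ zappa init          # interactively generates zappa_settings.json
#   $ zappa deploy dev    # packages the app, uploads it and wires up API Gateway
#   $ zappa update dev    # pushes a new version of the code
```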
Who said Python couldn’t be fast? Apart from competing for the best name of a software library ever, Sanic also competes for the fastest Python web framework ever, and appears to be the winner by a clear margin. It is a Flask-like Python 3.5+ web server that is designed for speed. Another library, uvloop, is an ultra fast drop-in replacement for asyncio’s event loop that uses libuv under the hood. Together, these two things make a great combination!
According to the Sanic author's benchmark, uvloop could power this beast to handle more than 33k requests/s, which is just insane (and faster than node.js). Your code can benefit from the new async/await syntax so it will look neat too; besides, we love the Flask-style API. Make sure to give Sanic a try, and if you are using asyncio, you can surely benefit from uvloop with very little change in your code!
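A minimal Sanic handler, in the Flask-like style the article mentions (a sketch using the 2016-era API):

```python
from sanic import Sanic
from sanic.response import json

app = Sanic(__name__)

@app.route("/")
async def index(request):
    # Handlers are coroutines, so they can await other async calls.
    return json({"hello": "world"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```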
In line with recent developments in the asyncio framework, the folks from MagicStack bring us this efficient asynchronous (currently CPython 3.5 only) database interface library designed specifically for PostgreSQL. It has zero dependencies, meaning there is no need to have libpq installed. In contrast with psycopg2 (the most popular PostgreSQL adapter for Python), which exchanges data with the database server in text format, asyncpg implements the PostgreSQL binary I/O protocol, which not only allows support for generic types but also comes with numerous performance benefits.
The benchmarks are clear: asyncpg is, on average, at least 3x faster than psycopg2 (or aiopg), and faster than the node.js and Go implementations.
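A minimal usage sketch (connection parameters are placeholders):

```python
import asyncio
import asyncpg

async def main():
    conn = await asyncpg.connect(user="postgres", password="secret",
                                 database="test", host="127.0.0.1")
    rows = await conn.fetch("SELECT generate_series(1, 3) AS n")
    for row in rows:
        print(row["n"])
    await conn.close()

asyncio.get_event_loop().run_until_complete(main())
```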
If you have your infrastructure on AWS or otherwise make use of their services (such as S3), you should be very happy that boto, the Python interface to the AWS API, got a complete rewrite from the ground up. The great thing is that you don't need to migrate your app all at once: you can use boto3 and boto (2) at the same time; for example, using boto3 only for new parts of your application.
The new implementation is much more consistent between different services, and since it uses a data-driven approach to generate classes at runtime from JSON description files, it will always get fast updates. No more lagging behind new Amazon API features, move to boto3!
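For instance, a small boto3 sketch for S3 (bucket name and file paths are placeholders; credentials come from the usual AWS configuration):

```python
import boto3

s3 = boto3.client("s3")

# List the buckets in the account.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# Upload a local file to a bucket.
s3.upload_file("report.csv", "my-example-bucket", "reports/report.csv")
```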
Do we even need an introduction here? Since it was released by Google in November 2015, this library has gained huge momentum and has become the #1 trending GitHub Python repository. In case you have been living under a rock for the past year, TensorFlow is a library for numerical computation using data flow graphs, which can run on GPU or CPU.
We have quickly witnessed it become a trend in the Machine Learning community (especially Deep Learning, see our post on 10 main takeaways from MLconf), not only growing in research use but also being widely used in production applications. If you are doing Deep Learning and want to use it through a higher-level interface, you can try using it as a backend for Keras (which made it into last year's post) or the newer TensorFlow-Slim.
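A tiny data flow graph, as a sketch using the 1.x-era graph-and-session API that matches this article's timeframe:

```python
import tensorflow as tf

# Build the graph: two constants and an op that adds them.
a = tf.constant(2.0, name="a")
b = tf.constant(3.0, name="b")
total = a + b

# Execute the graph in a session (TensorFlow 1.x style).
with tf.Session() as sess:
    print(sess.run(total))  # 5.0
```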
If you are into AI, you surely have heard about OpenAI, the non-profit artificial intelligence research company (backed by Elon Musk et al.). The researchers have open sourced some Python code this year! Gym is a toolkit for developing and comparing reinforcement learning algorithms. It consists of an open-source library with a collection of test problems (environments) that can be used to test reinforcement learning algorithms, and a site and API that allow comparing the performance of trained algorithms (agents). Since it doesn't care about the implementation of the agent, you can build agents with the computation library of your choice: bare numpy, TensorFlow, Theano, etc.
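A minimal random agent, as a sketch against the classic Gym API:

```python
import gym

env = gym.make("CartPole-v0")          # a classic control environment
observation = env.reset()

for _ in range(100):
    action = env.action_space.sample()                  # random policy
    observation, reward, done, info = env.step(action)
    if done:                                            # episode over, start again
        observation = env.reset()

env.close()
```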
We also have the recently released universe, a software platform for researching general intelligence across games, websites and other applications. This fits perfectly with gym, since it allows any real-world application to be turned into a gym environment. Researchers hope that this limitless possibility will accelerate research into smarter agents that can solve general-purpose tasks.
You may be familiar with some of the libraries Python has to offer for data visualization; the most popular of which are matplotlib and seaborn. Bokeh, however, is created for interactive visualization, and targets modern web browsers for the presentation. This means Bokeh can create a plot which lets you explore the data from a web browser. The great thing is that it integrates tightly with Jupyter Notebooks, so you can use it with your probably go-to tool for your research. There is also an optional server component, bokeh-server, with many powerful capabilities like server-side downsampling of large datasets (no more slow network transfers or overloaded browsers!), streaming data, transformations, etc.
Make sure to check the gallery for examples of what you can create. They look awesome!
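A minimal sketch of what a standalone Bokeh plot looks like (data and file name are illustrative):

```python
from bokeh.plotting import figure, output_file, show

output_file("lines.html")   # render to a standalone HTML page

p = figure(title="A first Bokeh plot", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)

show(p)                      # opens the interactive plot in a browser
```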
Sometimes, you want to run analytics over a dataset too big to fit in your computer's RAM. If you cannot rely on numpy or Pandas, you usually turn to other tools like PostgreSQL, MongoDB, Hadoop, Spark, or many others. Depending on the use case, one or more of these tools can make sense, each with their own strengths and weaknesses. The problem? There is a big overhead here, because you need to learn how each of these systems works and how to insert data in the proper form.
Blaze provides a uniform interface that abstracts you away from several database technologies. At its core, the library provides a way to express computations. Blaze itself doesn't actually do any computation: it just knows how to instruct a specific backend, which will be in charge of performing it. There is much more to Blaze (thus the ecosystem), such as the libraries that have come out of its development. For example, Dask implements a drop-in replacement for the NumPy array that can handle content larger than memory and leverage multiple cores, and also comes with dynamic task scheduling. Interesting stuff.
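A small Dask sketch of the larger-than-memory idea (array sizes are illustrative):

```python
import dask.array as da

# A 10000x10000 array backed by 100 lazily evaluated 1000x1000 NumPy chunks.
x = da.random.random((10000, 10000), chunks=(1000, 1000))

result = x.mean(axis=0).sum()   # only builds a task graph; nothing is computed yet
print(result.compute())          # executes the graph, potentially across cores
```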
There is a famous saying that there are only two hard problems in Computer Science: cache invalidation and naming things. I think the saying is clearly missing one thing: managing datetimes. If you have ever tried to do that in Python, you will know that the standard library has a gazillion modules and types: datetime, date, calendar, tzinfo, timedelta, relativedelta, pytz, etc. Worse, it is timezone naive by default.
Arrow is "datetime for humans", offering a sensible approach to creating, manipulating, formatting and converting dates, times, and timestamps. It is a replacement for the datetime type that supports Python 2 or 3, and provides a much nicer interface as well as filling the gaps with new functionality (such as humanize). Even if you don't really need arrow, using it can greatly reduce the boilerplate in your code.
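A few illustrative calls (timezone and timestamp values are placeholders):

```python
import arrow

utc = arrow.utcnow()                          # timezone-aware "now" in UTC
local = utc.to("US/Pacific")                  # timezone conversion
print(local.format("YYYY-MM-DD HH:mm:ss ZZ"))
print(utc.humanize())                         # e.g. "just now"

past = arrow.get("2013-05-11T21:23:58.970460+00:00")   # parse an ISO 8601 string
print(past.humanize())                        # e.g. "X years ago"
```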
Expose your internal API externally, drastically simplifying Python API development. Hug is a next-generation, Python 3 only library that provides the cleanest way to create HTTP REST APIs in Python. It is not a web framework per se (although that is a function it performs exceptionally well); it focuses on exposing idiomatically correct and standard internal Python APIs externally. The idea is simple: you define logic and structure once, and you can expose your API through multiple means. Currently, it supports exposing a REST API or a command line interface.
You can use type annotations that let hug not only generate documentation for your API but also provide validation and clean error messages that will make your life (and your API users') a lot easier. Hug is built on Falcon's high-performance HTTP library, which means you can deploy it to production using any WSGI-compatible server such as Gunicorn.
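A minimal endpoint sketch (route and parameter are illustrative):

```python
import hug

@hug.get("/hello")                      # expose this function as GET /hello
def hello(name: hug.types.text):        # the annotation doubles as validation and docs
    """Greets the caller."""
    return {"message": "Hello, {}!".format(name)}

# For development you can serve the file with:  hug -f thisfile.py
```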
When starting to tackle scientific tasks in Python, one inevitably turns for help to Python's SciPy Stack, a collection of software specifically designed for scientific computing in Python (not to be confused with the SciPy library, which is part of this stack, or with the community around the stack). This is where we want to start. The stack is pretty vast, however; there are more than a dozen libraries in it, so we will focus on the core packages (particularly the most essential ones).
The most fundamental package, around which the scientific computation stack is built, is NumPy (which stands for Numerical Python). It provides an abundance of useful features for operations on n-dimensional arrays and matrices in Python. The library provides vectorization of mathematical operations on the NumPy array type, which greatly improves performance and accordingly speeds up execution.
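For example, a vectorized expression replaces an explicit Python loop (a small sketch, not from the article):

```python
import numpy as np

a = np.arange(1000000, dtype=np.float64)

# One vectorized expression: the loop runs in optimized C code.
b = np.sqrt(a) * 2.5 + 1.0

# The pure-Python equivalent would be far slower:
# b = [math.sqrt(x) * 2.5 + 1.0 for x in a]

print(b[:5])
```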
SciPy is a library of software for engineering and science. Again, you need to understand the difference between the SciPy Stack and the SciPy library. SciPy contains modules for linear algebra, optimization, integration, and statistics. The main functionality of the SciPy library is built upon NumPy, and its arrays make substantial use of NumPy. It provides efficient numerical routines such as numerical integration and optimization, and many others, via its specific submodules. The functions in all submodules of SciPy are well documented — another point in its favour.
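For instance, a couple of its submodules in action (a minimal sketch):

```python
from scipy import integrate, optimize

# Numerically integrate x**2 from 0 to 1 (exact answer: 1/3).
value, abs_error = integrate.quad(lambda x: x ** 2, 0, 1)
print(value)

# Minimize (x - 3)**2 starting from x = 0.
result = optimize.minimize(lambda x: (x - 3) ** 2, x0=0)
print(result.x)
```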
Pandas is a Python package designed to make work with "labeled" and "relational" data simple and intuitive. Pandas is a perfect tool for data wrangling, designed for quick and easy data manipulation, aggregation, and visualization.
There are two main data structures in the library:
“Series” — one-dimensional
"DataFrames" — two-dimensional
For example, you can combine the two: appending a Series as a single row to a DataFrame produces a new DataFrame containing both.
Here is just a small sample of the things you can do with Pandas:
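A minimal sketch with illustrative data (note that DataFrame.append has been deprecated in recent pandas releases in favour of pd.concat):

```python
import pandas as pd

# A DataFrame is two-dimensional, a Series is one-dimensional.
prices = pd.DataFrame({"item": ["apple", "banana"], "price": [1.2, 0.5]})
extra = pd.Series({"item": "cherry", "price": 2.0})

# Appending a Series as a single row yields a new DataFrame
# (modern pandas: pd.concat([prices, extra.to_frame().T], ignore_index=True)).
prices = prices.append(extra, ignore_index=True)

# Typical wrangling: filtering, grouping, aggregating.
print(prices[prices["price"] > 1.0])
print(prices.groupby("item")["price"].mean())
```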
Another SciPy Stack core package, and another Python library tailored for the easy generation of simple and powerful visualizations, is Matplotlib. It is a top-notch piece of software which is making Python (with some help from NumPy, SciPy, and Pandas) a serious competitor to scientific tools such as MatLab or Mathematica.
However, the library is pretty low-level, meaning that you will need to write more code and put in more effort to reach advanced levels of visualization than with higher-level tools, but the overall effort is worth it.
With a bit of effort you can make just about any visualization.
There are also facilities for creating labels, grids, legends, and many other formatting entities with Matplotlib. Basically, everything is customizable.
The library is supported on different platforms and makes use of different GUI toolkits to display the resulting visualizations. Various environments (like IPython) support Matplotlib functionality.
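A basic figure, as a sketch (data and labels are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)

plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, np.cos(x), "--", label="cos(x)")
plt.xlabel("x")
plt.ylabel("value")
plt.title("A basic Matplotlib figure")
plt.legend()
plt.grid(True)
plt.show()
```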
There are also some additional libraries that can make visualization even easier.
Seaborn is mostly focused on the visualization of statistical models; such visualizations include heat maps and plots that summarize the data while still depicting the overall distributions. Seaborn is based on Matplotlib and is highly dependent on it.
Another great visualization library is Bokeh, which is aimed at interactive visualizations. In contrast to the previous library, this one is independent of Matplotlib. The main focus of Bokeh, as we already mentioned, is interactivity, and it presents its output via modern browsers in the style of Data-Driven Documents (d3.js).
Finally, a word about Plotly. It is rather a web-based toolbox for building visualizations, exposing APIs to a number of programming languages (Python among them). There are a number of robust, out-of-the-box graphics on the plot.ly website. To use Plotly, you will need to set up an API key. The graphics are processed server-side and posted on the internet, though there is a way to avoid that.
Scikits are additional packages for the SciPy Stack designed for specific functionality like image processing and machine learning facilitation. With regard to the latter, one of the most prominent of these packages is scikit-learn. The package is built on top of SciPy and makes heavy use of its math operations.
scikit-learn exposes a concise and consistent interface to the common machine learning algorithms, making it simple to bring ML into production systems. The library combines quality code and good documentation, ease of use and high performance, and is the de facto industry standard for machine learning with Python.
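The fit/predict pattern in a nutshell (a sketch on the bundled iris dataset; the choice of classifier is ours):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```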
With regard to Deep Learning, one of the most prominent and convenient libraries for Python in this field is Keras, which can function either on top of TensorFlow or Theano. Let's reveal some details about all of them.
Firstly, let’s talk about Theano.
Theano is a Python package that defines multi-dimensional arrays similar to NumPy, along with math operations and expressions. Expressions are compiled, which lets them run efficiently on all architectures. Originally developed by the Machine Learning group of Université de Montréal, it is primarily used for the needs of Machine Learning.
The important thing to note is that Theano tightly integrates with NumPy at the low level of its operations. The library also optimizes the use of GPU and CPU, making the performance of data-intensive computation even faster.
Efficiency and stability tweaks allow much more precise results even with very small values; for example, the computation of log(1+x) gives accurate results even for the smallest values of x.
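The define-then-compile style looks like this (a minimal sketch):

```python
import theano
import theano.tensor as T

# Declare symbolic variables and build an expression...
x = T.dscalar("x")
y = T.dscalar("y")
z = x ** 2 + y

# ...then compile the expression into a callable function.
f = theano.function([x, y], z)
print(f(2, 3))  # 7.0
```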
Coming from developers at Google, TensorFlow is an open-source library for data flow graph computations, sharpened for Machine Learning. It was designed to meet the high-demand requirements of the Google environment for training Neural Networks, and is a successor of DistBelief, a Machine Learning system based on Neural Networks. However, TensorFlow isn't strictly for scientific use within the borders of Google; it is general enough to be used in a variety of real-world applications.
The key feature of TensorFlow is its multi-layered node system, which enables quick training of artificial neural networks on large datasets. This powers Google's voice recognition and object identification from pictures.
And finally, let's look at Keras. It is an open-source library for building Neural Networks with a high-level interface, and it is written in Python. It is minimalistic and straightforward, with a high degree of extensibility. It uses Theano or TensorFlow as its backend, and Microsoft is now making efforts to integrate CNTK (Microsoft's Cognitive Toolkit) as a new backend.
The minimalistic design approach is aimed at fast and easy experimentation through the building of compact systems.
Keras is really easy to get started with and well suited to quick prototyping. It is written in pure Python and high-level in its nature. It is highly modular and extensible. Notwithstanding its ease, simplicity, and high-level orientation, Keras is still deep and powerful enough for serious modeling.
The general idea of Keras is based on layers, and everything else is built around them. Data is prepared in tensors, the first layer is responsible for input of tensors, the last layer is responsible for output, and the model is built in between.
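A tiny Sequential model, as a sketch (layer sizes are illustrative, and training data is left out):

```python
from keras.models import Sequential
from keras.layers import Dense

# Stack layers one after another.
model = Sequential()
model.add(Dense(32, activation="relu", input_dim=100))
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=10, batch_size=32)  # with your own data
```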
The name of this suite of libraries stands for Natural Language Toolkit and, as the name implies, it is used for common tasks of symbolic and statistical Natural Language Processing. NLTK was intended to facilitate teaching and research in NLP and related fields (Linguistics, Cognitive Science, Artificial Intelligence, etc.), and that remains its focus today.
The functionality of NLTK covers a lot of operations such as text tagging, classification, tokenizing, named entity recognition, building a corpus tree that reveals inter- and intra-sentence dependencies, stemming, and semantic reasoning. All of these building blocks allow building complex research systems for different tasks, for example sentiment analysis or automatic summarization.
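A quick tokenizing and tagging sketch (the downloads are one-time fetches of the required models):

```python
import nltk

nltk.download("punkt")                       # tokenizer models
nltk.download("averaged_perceptron_tagger")  # part-of-speech tagger

text = "NLTK was intended to facilitate teaching and research of NLP."
tokens = nltk.word_tokenize(text)   # tokenizing
tagged = nltk.pos_tag(tokens)       # part-of-speech tagging
print(tagged[:5])
```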
Gensim is an open-source library for Python that implements tools for vector space modeling and topic modeling. The library is designed to be efficient with large texts; processing is not limited to what fits in memory. The efficiency is achieved by extensive use of NumPy data structures and SciPy operations. It is both efficient and easy to use.
Gensim is intended for use with raw and unstructured digital texts. It implements algorithms such as hierarchical Dirichlet processes (HDP), latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), as well as tf-idf, random projections, word2vec and doc2vec, which facilitate examining texts for recurring patterns of words in a set of documents (often referred to as a corpus). All of the algorithms are unsupervised — no labels are needed; the only input is the corpus.
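A toy LDA run (the three tiny "documents" are illustrative):

```python
from gensim import corpora, models

documents = [["human", "machine", "interface"],
             ["graph", "trees", "minors"],
             ["graph", "minors", "survey"]]

dictionary = corpora.Dictionary(documents)                # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in documents]   # bag-of-words vectors

# Unsupervised topic model: the corpus is the only required input.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())
```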
Scrapy is a library for making crawling programs, also known as spider bots, for retrieving structured data, such as contact info or URLs, from the web.
It is open-source and written in Python. It was originally designed strictly for scraping, as its name indicates, but it has evolved into a full-fledged framework with the ability to gather data from APIs and act as a general-purpose crawler.
The library follows the famous Don't Repeat Yourself principle in its interface design: it prompts users to write general, reusable code, which makes building and scaling large crawlers easier.
The architecture of Scrapy is built around the Spider class, which encapsulates the set of instructions followed by the crawler.
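A minimal spider, as a sketch (the site and CSS selectors are illustrative; run it with "scrapy runspider quotes_spider.py -o quotes.json"):

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }
```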
As you have probably guessed from the name, statsmodels is a library for Python that enables its users to conduct data exploration via the use of various methods of estimation of statistical models and performing statistical assertions and analysis.
Among its many useful features are descriptive and result statistics obtained via linear regression models, generalized linear models, discrete choice models, robust linear models, time series analysis models, and various estimators.
The library also provides extensive plotting functions designed specifically for use in statistical analysis and tuned for good performance with large statistical data sets.
These are the libraries that many data scientists and engineers consider to be at the top of the list, and they are worth looking at, or at least familiarizing yourself with.
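An ordinary least squares fit on synthetic data, as a sketch:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: y = 2x + 1 plus noise.
x = np.linspace(0, 10, 100)
y = 2 * x + 1 + np.random.normal(scale=0.5, size=x.size)

X = sm.add_constant(x)        # add the intercept term
model = sm.OLS(y, X).fit()    # ordinary least squares
print(model.summary())        # estimates, confidence intervals, diagnostics
```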
Of course, this is not a fully exhaustive list, and there are many other libraries and frameworks that are also worthy and deserve attention for particular tasks. A great example is the various SciKit packages that focus on specific domains, like scikit-image for working with images.
So, if you have another useful library in mind, please let our readers know in the comments section.
While The Python Language Reference describes the exact syntax and semantics of the Python language, this library reference manual describes the standard library that is distributed with Python. It also describes some of the optional components that are commonly included in Python distributions.
Python’s standard library is very extensive, offering a wide range of facilities as indicated by the long table of contents listed below. The library contains built-in modules (written in C) that provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers, as well as modules written in Python that provide standardized solutions for many problems that occur in everyday programming. Some of these modules are explicitly designed to encourage and enhance the portability of Python programs by abstracting away platform-specifics into platform-neutral APIs.
The Python installers for the Windows platform usually include the entire standard library and often also include many additional components. For Unix-like operating systems Python is normally provided as a collection of packages, so it may be necessary to use the packaging tools provided with the operating system to obtain some or all of the optional components.
In addition to the standard library, there is a growing collection of several thousand components (from individual programs and modules to packages and entire application development frameworks), available from the Python Package Index.
Boolean Operations — and, or, not
Numeric Types — int, float, complex
Sequence Types — list, tuple, range
Text Sequence Type — str
Binary Sequence Types — bytes, bytearray, memoryview
Set Types — set, frozenset
Mapping Types — dict
string — Common string operations
re — Regular expression operations
difflib — Helpers for computing deltas
textwrap — Text wrapping and filling
unicodedata — Unicode Database
stringprep — Internet String Preparation
readline — GNU readline interface
rlcompleter — Completion function for GNU readline
datetime — Basic date and time types
calendar — General calendar-related functions
collections — Container datatypes
collections.abc — Abstract Base Classes for Containers
heapq — Heap queue algorithm
bisect — Array bisection algorithm
array — Efficient arrays of numeric values
weakref — Weak references
types — Dynamic type creation and names for built-in types
copy — Shallow and deep copy operations
pprint — Data pretty printer
reprlib — Alternate repr() implementation
enum — Support for enumerations
numbers — Numeric abstract base classes
math — Mathematical functions
cmath — Mathematical functions for complex numbers
decimal — Decimal fixed point and floating point arithmetic
fractions — Rational numbers
random — Generate pseudo-random numbers
statistics — Mathematical statistics functions
pathlib — Object-oriented filesystem paths
os.path — Common pathname manipulations
fileinput — Iterate over lines from multiple input streams
stat — Interpreting stat() results
filecmp — File and Directory Comparisons
tempfile — Generate temporary files and directories
glob — Unix style pathname pattern expansion
fnmatch — Unix filename pattern matching
linecache — Random access to text lines
shutil — High-level file operations
macpath — Mac OS 9 path manipulation functions
os — Miscellaneous operating system interfaces
io — Core tools for working with streams
time — Time access and conversions
argparse — Parser for command-line options, arguments and sub-commands
getopt — C-style parser for command line options
logging — Logging facility for Python
logging.config — Logging configuration
logging.handlers — Logging handlers
getpass — Portable password input
curses — Terminal handling for character-cell displays
curses.textpad — Text input widget for curses programs
curses.ascii — Utilities for ASCII characters
curses.panel — A panel stack extension for curses
platform — Access to underlying platform's identifying data
errno — Standard errno system symbols
ctypes — A foreign function library for Python
threading — Thread-based parallelism
multiprocessing — Process-based parallelism
The concurrent package
concurrent.futures — Launching parallel tasks
subprocess — Subprocess management
sched — Event scheduler
queue — A synchronized queue class
dummy_threading — Drop-in replacement for the threading module
_thread — Low-level threading API
_dummy_thread — Drop-in replacement for the _thread module
socket — Low-level networking interface
ssl — TLS/SSL wrapper for socket objects
select — Waiting for I/O completion
selectors — High-level I/O multiplexing
asyncio — Asynchronous I/O, event loop, coroutines and tasks
asyncore — Asynchronous socket handler
asynchat — Asynchronous socket command/response handler
signal — Set handlers for asynchronous events
mmap — Memory-mapped file support
email — An email and MIME handling package
json — JSON encoder and decoder
mailcap — Mailcap file handling
mailbox — Manipulate mailboxes in various formats
mimetypes — Map filenames to MIME types
base64 — Base16, Base32, Base64, Base85 Data Encodings
binhex — Encode and decode binhex4 files
binascii — Convert between binary and ASCII
quopri — Encode and decode MIME quoted-printable data
uu — Encode and decode uuencode files
html — HyperText Markup Language support
html.parser — Simple HTML and XHTML parser
html.entities — Definitions of HTML general entities
xml.etree.ElementTree — The ElementTree XML API
xml.dom — The Document Object Model API
xml.dom.minidom — Minimal DOM implementation
xml.dom.pulldom — Support for building partial DOM trees
xml.sax — Support for SAX2 parsers
xml.sax.handler — Base classes for SAX handlers
xml.sax.saxutils — SAX Utilities
xml.sax.xmlreader — Interface for XML parsers
xml.parsers.expat — Fast XML parsing using Expat
webbrowser — Convenient Web-browser controller
cgi — Common Gateway Interface support
cgitb — Traceback manager for CGI scripts
wsgiref — WSGI Utilities and Reference Implementation
urllib — URL handling modules
urllib.request — Extensible library for opening URLs
urllib.response — Response classes used by urllib
urllib.parse — Parse URLs into components
urllib.error — Exception classes raised by urllib.request
urllib.robotparser — Parser for robots.txt
http — HTTP modules
http.client — HTTP protocol client
ftplib — FTP protocol client
poplib — POP3 protocol client
imaplib — IMAP4 protocol client
nntplib — NNTP protocol client
smtplib — SMTP protocol client
smtpd — SMTP Server
telnetlib — Telnet client
uuid — UUID objects according to RFC 4122
socketserver — A framework for network servers
http.server — HTTP servers
http.cookies — HTTP state management
http.cookiejar — Cookie handling for HTTP clients
xmlrpc — XMLRPC server and client modules
xmlrpc.client — XML-RPC client access
xmlrpc.server — Basic XML-RPC servers
ipaddress — IPv4/IPv6 manipulation library
audioop — Manipulate raw audio data
aifc — Read and write AIFF and AIFC files
sunau — Read and write Sun AU files
wave — Read and write WAV files
chunk — Read IFF chunked data
colorsys — Conversions between color systems
imghdr — Determine the type of an image
sndhdr — Determine type of sound file
ossaudiodev — Access to OSS-compatible audio devices
typing — Support for type hints
pydoc — Documentation generator and online help system
doctest — Test interactive Python examples
unittest — Unit testing framework
unittest.mock — mock object library
unittest.mock — getting started
test — Regression tests package for Python
test.support — Utilities for the Python test suite
sys — System-specific parameters and functions
sysconfig — Provide access to Python's configuration information
builtins — Built-in objects
__main__ — Top-level script environment
warnings — Warning control
contextlib — Utilities for with-statement contexts
abc — Abstract Base Classes
atexit — Exit handlers
traceback — Print or retrieve a stack traceback
__future__ — Future statement definitions
gc — Garbage Collector interface
inspect — Inspect live objects
site — Site-specific configuration hook
fpectl — Floating point exception control
parser — Access Python parse trees
ast — Abstract Syntax Trees
symtable — Access to the compiler's symbol tables
symbol — Constants used with Python parse trees
token — Constants used with Python parse trees
keyword — Testing for Python keywords
tokenize — Tokenizer for Python source
tabnanny — Detection of ambiguous indentation
pyclbr — Python class browser support
py_compile — Compile Python source files
compileall — Byte-compile Python libraries
dis — Disassembler for Python bytecode
pickletools — Tools for pickle developers
posix — The most common POSIX system calls
pwd — The password database
spwd — The shadow password database
grp — The group database
crypt — Function to check Unix passwords
termios — POSIX style tty control
tty — Terminal control functions
pty — Pseudo-terminal utilities
fcntl — The fcntl and ioctl system calls
pipes — Interface to shell pipelines
resource — Resource usage information
nis — Interface to Sun's NIS (Yellow Pages)
syslog — Unix syslog library routines

I created these modules when I could not find existing ones with the same functionality (but I can be wrong) or when it didn't match my needs.
They might be useful to other Pythoners.
Warning (March 2006): these modules are now pretty much outdated. glock.py and getargs.py might still be of some interest, though.
| Package | Description |
|---|---|
| rgutils | Misc utilities modules (also detailed below) |
| scf | Simple Corba Framework. |
The following modules are bundled in the package rgutils, but also available separately:

| Module | Description |
|---|---|
| async.py | Asynchronous function calls utility (so far, only a timed-out function call). |
| getargs.py | (Yet Another getopt) Parse command line arguments (updated March 2006). |
| process.py | Simple process management. Allows launching, killing and checking the status of a process independently of the platform (at least on Win32 and Unix). |
| dataxfer.py | Functions for transferring arbitrarily sized data in a distributed (client/server) environment via an FTP server. |
| glock.py | Global (inter-process) mutex on Windows and Unix. |
| pool.py | Resource pool management. |
| message.py | Representation of an e-mail. Used by imap.py and pop3.py. |
| imap.py | Utilities for reading IMAP mail. |
| pop3.py | Utilities for reading POP3 mail. |
| fwdmail.py | Script for forwarding IMAP or POP mail. |
| platform.py | Platform information. This one was written by Marc-Andre Lemburg (mal@lemburg.com); I have included it here because it is used by some of my modules. NB: this module is now part of the Python 2.3+ standard distribution. |